[PR #1205] [MERGED] Fix hyphen placement in util.URL_REGEX #4365

Closed
opened 2026-03-15 01:40:40 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1205
Author: @overhacked
Created: 8/8/2023
Status: Merged
Merged: 8/9/2023
Merged by: @pirate

Base: dev ← Head: fix_url_regex_hyphen


📝 Commits (1)

  • c039ef0 Fix hyphen placement in util.URL_REGEX

📊 Changes

2 files changed (+5 additions, -1 deletions)

📝 archivebox/parsers/__init__.py (+4 -0)
📝 archivebox/util.py (+1 -1)

📄 Description

Summary

Incorrect hyphen placement in URL_REGEX was allowing it to match more characters than intended. In a regex character class, an unescaped hyphen is literal only when it is the first or last character in the class; anywhere else it is interpreted as the delimiter of a range of characters.

The issue fixed here caused [$-_] to be treated as the range of characters from $ to _, instead of the intended set of three characters [-_$]. The incorrect range interpretation inadvertently included most ASCII punctuation (along with the digits and uppercase letters), most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match.
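
To make the difference concrete, here is a minimal sketch that compares the two character classes in isolation. It is not the full URL_REGEX from archivebox/util.py; the variable names are chosen for illustration only.

```python
import re
import string

# "$-_" in the middle of a class is parsed as a range: '$' (0x24) through '_' (0x5F).
range_class = re.compile(r'[$-_]')
# With the hyphen first, it is a literal: only '-', '_' and '$' are in the class.
literal_class = re.compile(r'[-_$]')

# Check every ASCII punctuation character against both classes.
range_matches = [c for c in string.punctuation if range_class.fullmatch(c)]
literal_matches = [c for c in string.punctuation if literal_class.fullmatch(c)]

print(''.join(range_matches))    # many punctuation characters, including < > [ ] '
print(''.join(literal_matches))  # $-_
```

The range form accepts the angle brackets, square brackets, and single quote, while the literal form accepts only the three intended characters.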

This causes the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example:

https://<b>www</b>.example.com/  # MATCHES but should not
https://[for example]            # MATCHES but should not
scheme='https://'                # MATCHES, including final quote, but should not
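
The behaviour in these examples can be reproduced with a simplified stand-in pattern. The sketch below is an assumption for illustration, not the actual URL_REGEX: the helper build_url_regex and the pattern layout (scheme, one allowed character, then a run of non-stop characters) are hypothetical.

```python
import re

def build_url_regex(symbol_class):
    """Build a simplified, hypothetical URL pattern around the given symbol class."""
    return re.compile(
        r'https?://'                                  # allowed schemes
        + r'(?:[a-zA-Z0-9]|' + symbol_class + r')'    # first character after the scheme
        + r'''[^\]\[<>"'\s]*'''                       # continue until a "stop parsing" character
    )

buggy = build_url_regex(r'[$-_]')   # hyphen read as a range delimiter
fixed = build_url_regex(r'[-_$]')   # hyphen first, read as a literal

samples = [
    'https://<b>www</b>.example.com/',
    'https://[for example]',
    "scheme='https://'",
    'https://www.example.com/',
]

for text in samples:
    buggy_match = buggy.search(text)
    fixed_match = fixed.search(text)
    print(
        repr(text),
        'buggy:', buggy_match.group(0) if buggy_match else None,
        'fixed:', fixed_match.group(0) if fixed_match else None,
    )
```

Under these assumptions the range form matches the first three samples because <, [ and ' all fall between $ and _, while the literal form rejects them and still matches the ordinary URL in the last sample.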

Some test cases have been added to the URL_REGEX assert in archivebox.parsers to cover this possibility.

Related issues

There are other URL_REGEX issues (#235, #287, #864, #874), but none that this change directly impacts.

Changes these areas

  • [X] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.