mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[PR #1205] Fix hyphen placement in util.URL_REGEX #2862
Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1205
State: closed
Merged: Yes
Summary
Incorrect hyphen placement in URL_REGEX was allowing it to match more characters than intended. In a regex character class, a literal hyphen must appear as the first or last character in the class (or be escaped); otherwise it is interpreted as the delimiter of a range of characters.
The issue fixed here caused the range of characters [$-_] to be treated as valid URL characters, instead of the intended set of three characters [-_$]. The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match.
This caused the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example:
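A minimal sketch of the behavior in Python, not the example from the pull request. The helpers chars_matched_by and build_url_regex, the symbol classes passed to them, and the test string are all hypothetical, and the pattern is a simplified stand-in for the real URL_REGEX rather than a copy of it.

```python
import re

STOP_CHARS = "<>[]'"  # the "stop parsing" characters named above

def chars_matched_by(char_class):
    """Return the printable ASCII characters that a regex character class matches."""
    pattern = re.compile(char_class)
    return {chr(c) for c in range(0x20, 0x7F) if pattern.fullmatch(chr(c))}

def build_url_regex(symbol_class):
    """Hypothetical, simplified URL pattern; NOT the real ArchiveBox URL_REGEX."""
    return re.compile(
        r'https?://'                               # scheme
        r'(?:[a-zA-Z0-9]|' + symbol_class + r')'   # first character after the scheme
        r'[^\]\[()<>"\'\s]+'                       # continue until a stop character
    )

# '$-_' in the middle of a class is a range from '$' (0x24) to '_' (0x5F);
# a leading hyphen is a literal '-'.
buggy_leaks = sorted(chars_matched_by(r'[$-_]') & set(STOP_CHARS))
fixed_leaks = sorted(chars_matched_by(r'[-_$]') & set(STOP_CHARS))

text = 'broken link: http://<placeholder>/page'
buggy_match = build_url_regex(r'[$-_@.&+]').search(text)
fixed_match = build_url_regex(r'[-_$@.&+]').search(text)

print(buggy_leaks)           # ["'", '<', '>', '[', ']']
print(fixed_leaks)           # []
print(buggy_match.group(0))  # http://<placeholder
print(fixed_match)           # None
```

With the range interpretation, the "<" right after the scheme is accepted as the first hostname character, so the match runs until the next stop character. With the hyphen moved to the front of the class, the same input produces no match.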
Some test cases have been added to the URL_REGEX assert in archivebox.parsers to cover this possibility.
Related issues
There are other URL_REGEX issues (#235, #287, #864, #874), but none that this change directly impacts.
Changes these areas