[GH-ISSUE #235] Parsing: Sources that aren't URL-encoded can contain URLs with brackets, parens, and other separator symbols #1672

Closed
opened 2026-03-01 17:52:43 +03:00 by kerem · 3 comments

Originally created by @anarcat on GitHub (May 6, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/235

#### Describe the bug

I was trying to archive this Wikipedia page:

https://en.wikipedia.org/wiki/Insert_(effects_processing)

ArchiveBox sees it as:

https://en.wikipedia.org/wiki/Insert_

Bad box! No cookie! :)

#### Steps to reproduce

echo 'https://en.wikipedia.org/wiki/Insert_(effects_processing)' | archivebox add

#### Screenshots or log output

```
[+] [2019-05-06 22:21:48] "en.wikipedia.org/wiki/Insert_"
    https://en.wikipedia.org/wiki/Insert_
    > ./archive/1557178364.70
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
      > archive_org
```

#### Software versions

  • OS: Debian buster 10
  • ArchiveBox version: 0.4.1 installed from pypi
  • Python version: 3.7.3 something
  • Chrome version: irrelevant?

#### Workaround

Run `sed 's/(/%28/;s/)/%29/'` over the list of URLs, although I suspect more special characters might be affected.
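The same workaround can be done with the Python standard library's `urllib.parse.quote`; a minimal sketch (the `safe` set below is an illustrative guess at which delimiters to leave intact, not anything ArchiveBox defines):

```python
from urllib.parse import quote

def encode_url(url: str) -> str:
    """Percent-encode the characters that trip up naive URL parsers.

    `safe` keeps the usual URL structure characters intact while letting
    quote() escape parens, brackets, and similar separator symbols.
    """
    return quote(url, safe=":/?#@!$&'*+,;=%~")

print(encode_url("https://en.wikipedia.org/wiki/Insert_(effects_processing)"))
# https://en.wikipedia.org/wiki/Insert_%28effects_processing%29
```

This covers brackets as well as parens, which the `sed` one-liner above does not.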

kerem 2026-03-01 17:52:43 +03:00

@pirate commented on GitHub (May 6, 2019):

Ah damn, I explicitly stop parsing at those characters because of cases like this:

```python
>>> parse('This is some text in a markdown file with [a link](https://example.com)/[some other link](https://example.com/b/)...')
['https://example.com)/[some', 'https://example.com/b/)...']
```

The current regex looks like this:

```python
URL_REGEX = re.compile(
    r'http[s]?://'                    # start matching from allowed schemes
    r'(?:[a-zA-Z]|[0-9]'              # followed by allowed alphanum characters
    r'|[$-_@.&+]|[!*\(\),]'           #    or allowed symbols
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  #    or allowed unicode bytes
    r'[^\]\[\(\)<>\""\'\s]+',         # stop parsing at these symbols
    re.IGNORECASE,
)
```

I don't see any easy solutions, but there are a few crappy options:

  • require all URLs be URL-encoded before parsing
  • add a config option to enable parsing of `()` and `[]` in URLs
  • attempt to autodetect the import format and parse accordingly (this is extremely hard)

I don't think the problem is limited to markdown files, unfortunately, so I'm hesitant to add a separate parser just for markdown to paper over the issue. There are lots of formats that have meaningful symbols to break up URLs that are also valid URL characters.
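A fourth option, used by many linkifiers, is to match greedily and then trim unbalanced trailing delimiters. A rough sketch of that heuristic (not ArchiveBox code, and not a complete fix; `PERMISSIVE_URL` and `trim_trailing` are hypothetical names):

```python
import re

# Permissive match: stop only at whitespace, quotes, angle brackets, and
# square brackets, but let parens through.
PERMISSIVE_URL = re.compile(r'https?://[^\s<>"\'\[\]]+', re.IGNORECASE)

def trim_trailing(url: str) -> str:
    """Strip trailing punctuation; keep a ')' only if it is balanced."""
    while url and url[-1] in '.,;:!?)':
        if url[-1] == ')' and url.count('(') >= url.count(')'):
            break  # balanced parens, e.g. Wikipedia-style paths
        url = url[:-1]
    return url

print(trim_trailing('https://en.wikipedia.org/wiki/Insert_(effects_processing)'))
# https://en.wikipedia.org/wiki/Insert_(effects_processing)
print(trim_trailing('https://example.com/b/)...'))
# https://example.com/b/
```

This keeps the Wikipedia URL intact and cleans up the trailing `)...` from the markdown example above, but it is still a heuristic: a URL legitimately ending in an unbalanced `)` would be mangled.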


@anarcat commented on GitHub (May 6, 2019):

I think it might be worth making it possible for `archivebox add` to be told what kind of input to expect.

If it's a "one URL per line" file, it shouldn't do *any* parsing except for splitting on newlines. Same with known JSON or bookmarks formats...
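This suggestion amounts to a trivial parser; a sketch of what such a newline-only mode could look like (`parse_url_list` is a hypothetical helper, not an ArchiveBox API):

```python
def parse_url_list(text: str) -> list[str]:
    """One URL per line: strip whitespace, skip blank lines, no other parsing."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(parse_url_list("https://en.wikipedia.org/wiki/Insert_(effects_processing)\n"))
# ['https://en.wikipedia.org/wiki/Insert_(effects_processing)']
```

No regex is involved, so brackets and parens survive untouched.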


@pirate commented on GitHub (Oct 18, 2019):

> if it's a "one URL per line" file, it shouldn't do any parsing except for splitting on newlines. same with known JSON or bookmarks formats...

I think this is reasonable: if the first line starts with `http...` and ends with `\n`, we can probably consider the whole import to be raw URLs separated by newlines.
It gets harder when parsing other types of sources though (e.g. txt, markdown, arbitrary html, etc.):

```md
http://what.if.it's.a.markdown/document/that/just/happens/to/start/with/a/url.html?difficult=true&thought=required
# This is some *markdown text [containing](http://example.com/several/example_urls(abc)).*
There may be raw http://urls.example.com embedded in the text with whitespace too.
```
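One way to harden the first-line heuristic against documents that merely start with a URL is to require *every* non-blank line to be a bare URL before switching to newline-only parsing. A sketch, not ArchiveBox code (`looks_like_url_list` is a hypothetical name):

```python
def looks_like_url_list(text: str) -> bool:
    """True only if every non-blank line is a bare http(s) URL."""
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    return bool(lines) and all(
        line.startswith(('http://', 'https://')) and ' ' not in line
        for line in lines
    )

markdown_doc = (
    "http://what.if.its.a.markdown/document.html\n"
    "# Some *markdown* follows the first line.\n"
)
print(looks_like_url_list(markdown_doc))                                # False
print(looks_like_url_list("https://a.example/\nhttps://b.example/\n"))  # True
```

The tricky markdown document above is correctly rejected, because only its first line looks like a URL; the fallback would then be the regex-based text parser.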