[GH-ISSUE #235] Parsing: Sources that aren't URL-encoded can contain URLs with brackets, parens, and other separator symbols #1672

Closed
opened 2026-03-01 17:52:43 +03:00 by kerem · 3 comments

Originally created by @anarcat on GitHub (May 6, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/235

#### Describe the bug

I was trying to archive this Wikipedia page:

https://en.wikipedia.org/wiki/Insert_(effects_processing)

ArchiveBox sees it as:

https://en.wikipedia.org/wiki/Insert_

Bad box! No cookie! :)

#### Steps to reproduce

echo 'https://en.wikipedia.org/wiki/Insert_(effects_processing)' | archivebox add

#### Screenshots or log output

```
[+] [2019-05-06 22:21:48] "en.wikipedia.org/wiki/Insert_"
    https://en.wikipedia.org/wiki/Insert_
    > ./archive/1557178364.70
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
      > archive_org
```

#### Software versions

  • OS: Debian buster 10
  • ArchiveBox version: 0.4.1 installed from pypi
  • Python version: 3.7.3 something
  • Chrome version: irrelevant?

#### Workaround

Run `sed 's/(/%28/;s/)/%29/'` over the list of URLs, although I suspect more special characters might be affected.
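The same workaround can be done with the Python standard library's `urllib.parse.quote`; a minimal sketch (the `safe` set below is an illustrative guess at which delimiters to leave intact, not anything ArchiveBox defines):

```python
from urllib.parse import quote

def encode_url(url: str) -> str:
    """Percent-encode the characters that trip up naive URL parsers.

    `safe` keeps the usual URL structure characters intact while letting
    quote() escape parens, brackets, and similar separator symbols.
    """
    return quote(url, safe=":/?#@!$&'*+,;=%~")

print(encode_url("https://en.wikipedia.org/wiki/Insert_(effects_processing)"))
# https://en.wikipedia.org/wiki/Insert_%28effects_processing%29
```

This covers brackets as well as parens, which the `sed` one-liner above does not.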

kerem 2026-03-01 17:52:43 +03:00

@pirate commented on GitHub (May 6, 2019):

Ah damn, I explicitly stop parsing at those characters because of cases like this:

```python
>>> parse('This is some text in a markdown file with [a link](https://example.com)/[some other link](https://example.com/b/)...')
['https://example.com)/[some', 'https://example.com/b/)...']
```

The current regex looks like this:

```python
URL_REGEX = re.compile(
    r'http[s]?://'                    # start matching from allowed schemes
    r'(?:[a-zA-Z]|[0-9]'              # followed by allowed alphanum characters
    r'|[$-_@.&+]|[!*\(\),]'           #    or allowed symbols
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  #    or allowed unicode bytes
    r'[^\]\[\(\)<>\""\'\s]+',         # stop parsing at these symbols
    re.IGNORECASE,
)
```

I don't see any easy solutions, but there are a few crappy options:

  • require all URLs be URL-encoded before parsing
  • add a config option to enable parsing of `()` and `[]` in URLs
  • attempt to autodetect the import format and parse accordingly (this is extremely hard)

I don't think the problem is limited to markdown files, unfortunately, so I'm hesitant to add a separate parser just for markdown to paper over the issue. There are lots of formats that have meaningful symbols to break up URLs that are also valid URL characters.
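A fourth option, used by many linkifiers, is to match greedily and then trim unbalanced trailing delimiters. A rough sketch of that heuristic (not ArchiveBox code, and not a complete fix; `PERMISSIVE_URL` and `trim_trailing` are hypothetical names):

```python
import re

# Permissive match: stop only at whitespace, quotes, angle brackets, and
# square brackets, but let parens through.
PERMISSIVE_URL = re.compile(r'https?://[^\s<>"\'\[\]]+', re.IGNORECASE)

def trim_trailing(url: str) -> str:
    """Strip trailing punctuation; keep a ')' only if it is balanced."""
    while url and url[-1] in '.,;:!?)':
        if url[-1] == ')' and url.count('(') >= url.count(')'):
            break  # balanced parens, e.g. Wikipedia-style paths
        url = url[:-1]
    return url

print(trim_trailing('https://en.wikipedia.org/wiki/Insert_(effects_processing)'))
# https://en.wikipedia.org/wiki/Insert_(effects_processing)
print(trim_trailing('https://example.com/b/)...'))
# https://example.com/b/
```

This keeps the Wikipedia URL intact and cleans up the trailing `)...` from the markdown example above, but it is still a heuristic: a URL legitimately ending in an unbalanced `)` would be mangled.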


@anarcat commented on GitHub (May 6, 2019):

I think it might be worth making it possible for `archivebox add` to be told what kind of input to expect.

If it's a "one URL per line" file, it shouldn't do *any* parsing except for splitting on newlines. Same with known JSON or bookmarks formats...
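This suggestion amounts to a trivial parser; a sketch of what such a newline-only mode could look like (`parse_url_list` is a hypothetical helper, not an ArchiveBox API):

```python
def parse_url_list(text: str) -> list[str]:
    """One URL per line: strip whitespace, skip blank lines, no other parsing."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(parse_url_list("https://en.wikipedia.org/wiki/Insert_(effects_processing)\n"))
# ['https://en.wikipedia.org/wiki/Insert_(effects_processing)']
```

No regex is involved, so brackets and parens survive untouched.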


@pirate commented on GitHub (Oct 18, 2019):

> if it's a "one URL per line" file, it shouldn't do any parsing except for splitting on newlines. same with known JSON or bookmarks formats...

I think this is reasonable: if the first line starts with `http...` and ends with `\n`, we can probably consider the whole import to be raw URLs separated by newlines.
It gets harder when parsing other types of sources though (e.g. txt, markdown, arbitrary html, etc.):

```md
http://what.if.it's.a.markdown/document/that/just/happens/to/start/with/a/url.html?difficult=true&thought=required
# This is some *markdown text [containing](http://example.com/several/example_urls(abc)).*
There may be raw http://urls.example.com embedded in the text with whitespace too.
```
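One way to harden the first-line heuristic against documents that merely start with a URL is to require *every* non-blank line to be a bare URL before switching to newline-only parsing. A sketch, not ArchiveBox code (`looks_like_url_list` is a hypothetical name):

```python
def looks_like_url_list(text: str) -> bool:
    """True only if every non-blank line is a bare http(s) URL."""
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    return bool(lines) and all(
        line.startswith(('http://', 'https://')) and ' ' not in line
        for line in lines
    )

markdown_doc = (
    "http://what.if.its.a.markdown/document.html\n"
    "# Some *markdown* follows the first line.\n"
)
print(looks_like_url_list(markdown_doc))                                # False
print(looks_like_url_list("https://a.example/\nhttps://b.example/\n"))  # True
```

The tricky markdown document above is correctly rejected, because only its first line looks like a URL; the fallback would then be the regex-based text parser.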