[GH-ISSUE #432] Long URLs break when attempting to read/write them as filesystem paths #290

Closed
opened 2026-03-01 14:42:08 +03:00 by kerem · 8 comments
Owner

Originally created by @mpeteuil on GitHub (Aug 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/432

Describe the bug

In multiple places in the docs (Quickstart for example) it mentions the use of `./archive some-file.txt` as a means of ingesting a file with a list of URLs in it. There are also examples of using `./bin/archivebox-export-browser-history --firefox`, which generates a JSON file that users should be able to feed into `./archive` as well.

With the latest release it seems that `archivebox add` has taken over this behavior, but either that functionality was intentionally removed or there's a bug somewhere. The existence of the parsers in the `archivebox add` code path leads me to believe this is a bug and that `archivebox add` should handle these cases.

When running `archivebox init`, there is also a message at the end that states:

To add new links, you can run:
archivebox add ~/some/path/or/url/to/list_of_links.txt

Steps to reproduce

  1. Create a file `test_urls.txt` containing only `https://example.org`
  2. Run `archivebox add test_urls.txt`
  3. Get back an error instead of archiving `https://example.org`

The same issue happens when passing in the output JSON file from running `./bin/export-browser-history.sh --firefox`.
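The steps above can be scripted as a quick check (a sketch; it assumes ArchiveBox v0.4.11 is on your PATH, with the failing `archivebox` call left commented out):

```shell
# Create a one-line URL list and feed it to archivebox:
printf 'https://example.org\n' > test_urls.txt
# archivebox add test_urls.txt
#   -> on v0.4.11 this ends with: Parsed 0 URLs from input (Failed to parse)
```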

Screenshots or log output

```sh
# archivebox add test_urls.txt
[i] [2020-08-09 19:50:30] ArchiveBox v0.4.11: archivebox add test_urls.txt
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-09 23:50:39] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597017039-import.txt
    0.0% (0/240sec)[X] Error while loading link! [1597017039.360603] test_urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2020-08-09 23:50:40] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html
```

It also happens when trying the input-redirection route, except there is no `[X] Error while loading link!`:

```sh
archivebox add < firefox_history_urls.txt
[i] [2020-08-09 23:00:29] ArchiveBox v0.4.11: archivebox add < /dev/stdin
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-10 03:00:30] Adding 43291 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597028430-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2020-08-10 03:00:34] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html
```

Software versions

  • OS: macOS 10.15.6
  • ArchiveBox version: 87ba82a
  • Python version: 3.7.8

@mpeteuil commented on GitHub (Aug 10, 2020):

I finally made my way to the usage docs in the wiki, which have examples of doing this successfully. For example:

> `archivebox add < urls_to_archive.txt`

I'm going to leave this open for the moment, because one to-do coming out of this is that some places in the docs still reference old methods that are no longer valid.

I've tried to update the references to the old `./archive` on the Quickstart, Install, Scheduling, and Security wiki pages, but there may be others I missed.

There is also a note when running `archivebox` in the terminal without a subcommand that states:

> **Example Use:**
> ...
> `archivebox add --depth=1 ~/Downloads/bookmarks_export.html`

This may need to be updated to the input redirection syntax as well, but I haven't tested if there is an issue with that.


@pirate commented on GitHub (Aug 10, 2020):

The docs are indeed out-of-date, thanks for fixing them.

```bash
archivebox add --depth=1 ~/Downloads/bookmarks_export.html
```

is slightly different from:

```bash
archivebox add < ~/Downloads/bookmarks_export.html
archivebox add --depth=0 < ~/Downloads/bookmarks_export.html  # these are equivalent
```

The depth=1 + argument version treats the file as a page that itself needs archiving, then parses it for links and archives those as well.
The depth=0 + stdin version just parses stdin as a list of links, and doesn't archive the `bookmarks_export.html` file itself as if it were a page (which is probably what you want).


@mpeteuil commented on GitHub (Aug 10, 2020):

Thanks for the explanation. I didn't realize the `--depth` option changed how the input file itself was treated depending on its value (0 or 1).

I updated the description, but it looks like I still get `Parsed 0 URLs from input (Failed to parse)` even when using input redirection.


@pirate commented on GitHub (Aug 10, 2020):

`--depth` alone doesn't change how the input file is treated; note the `<` redirection is only present in the depth=0 example. That example imports all the URLs passed in, but archives each one at depth=0. In the `--depth=1` example the file is not piped in as a list of links, but archived as if it were the URL of a page itself. That behavior changes if you add a `<` and pipe the file in instead: it would then archive each link *within* the file at depth=1.

Can you post a snippet of the file you're trying to import? (note the URLs must have a scheme, `https://example.com` √, `example.com` X)


@mpeteuil commented on GitHub (Aug 11, 2020):

> Can you post a snippet of the file you're trying to import? (note the URLs must have a scheme, `https://example.com` √, `example.com` X)

I would, but I'd like to avoid doing so out of privacy concerns if possible. However, I was able to find the URL causing the issue. The problem seems to be that it's just really long: in this instance the URL is 1092 characters, which causes `OSError: [Errno 63] File name too long` to be thrown when executing `if Path(line).exists()` in the generic_txt parser. The error is eventually swallowed by `_parse`'s exception handling, so it's not seen elsewhere.

The good news is that this is reproducible with any sufficiently long and valid URL in a txt file.
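A small guard around the path check avoids the crash; here's a sketch (the helper name is hypothetical, not ArchiveBox code), assuming the parser only needs a yes/no answer to "is this line an existing local file?":

```python
from pathlib import Path

def is_local_file(line: str) -> bool:
    """Like Path(line).exists(), but treat unreadable paths as non-files.

    On v0.4.11 (Python 3.7), Path.exists() propagates OSError for
    over-long strings (ENAMETOOLONG: errno 63 on macOS, 36 on Linux),
    which is what aborted the whole parse batch here.
    """
    try:
        return Path(line).exists()
    except OSError:
        return False

# A valid but very long URL, like the 1092-character one described above:
long_url = "https://example.org/?q=" + "a" * 1200
print(is_local_file(long_url))  # False, instead of raising
```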


@pirate commented on GitHub (Aug 11, 2020):

Your diagnosis looks right.

Unfortunately, I think this is a much deeper problem than just the `generic_txt` parser, and I can't promise that the deeper problem will get fixed anytime soon.

The "URL == filesystem path" decision was made early on in the design process, and it's a painful choice that has come back to bite me many times, but it's here to stay for the foreseeable future. When the wget output is saved to the filesystem, it translates the URL directly to a path.

URLs don't cleanly map to filesystem paths, and never will (some filesystems are case insensitive!!). On the other hand, there are many benefits to being able to see website paths easily in a file explorer, it's the most durable format of all. There are also many redundant methods to cover you in the rare case of URL => filename conflicts created by wget.

When it does break due to a filesystem:URL mapping failure, I believe it should only be the wget archive method that's broken; the other methods all have hardcoded output paths with no URL fragments in them. If you absolutely need to replay a broken wget archive, you can always use the wget WARC to get the same output independently.
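As a rough illustration of why the mapping is lossy, here's a hypothetical simplification of the URL-to-path translation (not wget's or ArchiveBox's actual logic):

```python
from urllib.parse import urlparse

def wget_style_path(url: str) -> str:
    # Hypothetical sketch: host directory + URL path, query string and all.
    # Real wget naming is more involved (escaping, --adjust-extension, etc.).
    parts = urlparse(url)
    path = parts.path.lstrip("/") or "index.html"
    if parts.query:
        path += "?" + parts.query  # the query string lands in the filename
    return f"{parts.netloc}/{path}"

print(wget_style_path("https://example.org/"))  # example.org/index.html
# A 1092-char URL yields a final path component far past NAME_MAX (255),
# which is exactly the OSError seen in the parser above.
```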

What we can do for now is catch the exception and skip all attempts to read URL fragment paths, and save long URLs to the index normally up until some ridiculous limit like 65,000 characters. This will still result in broken wget clones, but all the other outputs and the index should work with the long URLs.
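That mitigation could be as simple as a length gate before any filesystem lookup (the constant's value comes from the suggestion above; the names are hypothetical):

```python
MAX_URL_LENGTH = 65_000  # the "ridiculous limit" suggested above

def indexable(url: str) -> bool:
    # Long URLs still go into the index; only the wget clone may break,
    # since the other extractors use hardcoded output paths.
    return url.startswith(("http://", "https://")) and len(url) <= MAX_URL_LENGTH

print(indexable("https://example.org/?q=" + "a" * 1200))  # True
```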


@mpeteuil commented on GitHub (Aug 11, 2020):

That background definitely helps me understand this better and helps me see that it's not just this one isolated problem.

> What we can do for now is catch the exception and skip all attempts to read URL fragment paths, and save long URLs to the index normally up until some ridiculous limit like 65,000 characters. This will still result in broken wget clones, but all the other outputs and the index should work with the long URLs.

That sounds reasonable. One long URL spoiling the whole batch that's being parsed is the main issue at hand here, so I think as long as that's resolved then it's case closed on this one.

Thanks for working through this with me, I appreciate all the help. It's not easy maintaining OSS, but my interactions with this project have been nothing but pleasant 😄


@pirate commented on GitHub (Aug 18, 2020):

This should be fixed in 2e2b4f8. (going out with the next release)

```bash
git remote update
git checkout dev
pip install -e .

cd your/data/dir
archivebox add < ~/path/to/your/links.txt
```

If you give that a try and still have the issue then comment back here and I can reopen the ticket.
