Mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #432] Long URLs break when attempting to read/write them as filesystem paths #1799
Originally created by @mpeteuil on GitHub (Aug 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/432
Describe the bug

In multiple places in the docs (Quickstart for example) it mentions the use of `./archive some-file.txt` as a means of ingesting a file with a list of URLs in it. There are also examples of using `./bin/archivebox-export-browser-history --firefox`, which generates a JSON file that users should be able to feed into `./archive` as well.

With the latest release it seems that `archivebox add` would have taken over this behavior, but either it isn't supposed to and this functionality was removed, or there's a bug somewhere. The existence of the parsers in the `archivebox add` code path leads me to believe this is a bug and `archivebox add` should handle these cases.

When running `archivebox init` there is also a message at the end that states:

Steps to reproduce

1. Create `test_urls.txt` which only has the contents `https://example.org`
2. Run `archivebox add test_urls.txt`
3. `https://example.org` is not archived

The same issue happens when passing the output JSON file from running `./bin/export-browser-history.sh --firefox` or

Screenshots or log output

It also happens when trying the input redirection route, except there is no `[X] Error while loading link!`:

Software versions

`87ba82a`

@mpeteuil commented on GitHub (Aug 10, 2020):
I finally made my way to the usage docs in the wiki, which have examples of doing this successfully. For example:
I'm going to leave this open for the moment, because one to-do to come out of this is that some places in the docs still reference old methods which are no longer valid.
I've tried to update the references to the old `./archive` on the Quickstart, Install, Scheduling, and Security wiki pages, but there may be others I missed.

There is also a note when running `archivebox` in the terminal without a subcommand that states:

This may need to be updated to the input redirection syntax as well, but I haven't tested if there is an issue with that.
@pirate commented on GitHub (Aug 10, 2020):
The docs are indeed out-of-date, thanks for fixing them,
is slightly different from:
The depth=1 + arg treats the file as a page itself that needs to be archived, then parses it for links and archives those as well.
The depth=0 + stdin version just parses stdin as an input file for links, and doesn't archive the
`bookmarks_export.html` file itself as if it's a page (which is probably what you want).

@mpeteuil commented on GitHub (Aug 10, 2020):
Thanks for the explanation. I didn't realize the
`--depth` option changed how the input file itself was treated depending on the value (0, 1).

I updated the description, but it looks like I still get `Parsed 0 URLs from input (Failed to parse)` even using input redirection.

@pirate commented on GitHub (Aug 10, 2020):
`--depth` alone doesn't change how the input file is treated; note the `<` pipe is only present in the depth=0 example. This means that it imports all the URLs passed in, but only archives each one at depth=0. In the `--depth=1` example the file is not piped in as a list of links, but rather archived as if it were a URL to a page itself; that behavior changes if you were to add a `<` and pipe the file in instead (it would archive each link within the file at depth=1).

Can you post a snippet of the file you're trying to import? (Note the URLs must have a scheme: `https://example.com` ✓, `example.com` ✗.)

@mpeteuil commented on GitHub (Aug 11, 2020):
I would, but I'd like to avoid doing so out of privacy concerns if possible. However, I was able to find the URL which is causing the issue. The problem seems to be that it's just really long. In this instance the URL is 1092 characters, which is causing
`OSError: [Errno 63] File name too long` to be thrown when executing `if Path(line).exists()` in the generic_txt parser. The error is eventually swallowed by `_parse`'s exception handling, so it's not seen elsewhere.

The good news is that this is reproducible with any sufficiently long and valid URL in a txt file.
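The failure mode described above can be sketched with a small standalone guard (hypothetical helper name; this is not ArchiveBox's actual code, just an illustration of wrapping the `Path(line).exists()` check so one over-long line can't raise out of the parser):

```python
from pathlib import Path

# Most filesystems cap a single path component at 255 bytes, so feeding a
# 1000+ character URL to Path(...).exists() can raise OSError
# (ENAMETOOLONG) instead of simply returning False.
def is_local_file(line: str) -> bool:
    """Hypothetical guard around the Path(line).exists() check."""
    if len(line) > 250:            # too long to be a real filename anyway
        return False
    try:
        return Path(line).exists()
    except OSError:                # e.g. [Errno 63/36] File name too long
        return False

long_url = 'https://example.org/?q=' + 'a' * 1100   # ~1100 chars, as in the report
print(is_local_file(long_url))   # False, instead of an unhandled OSError
```

With a guard like this, the long line simply falls through to the URL-parsing branch instead of aborting the whole batch.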
@pirate commented on GitHub (Aug 11, 2020):
Your diagnosis looks right.
Unfortunately, I think this is a much deeper problem than just the
`generic_txt` parser, and I can't promise that the deeper problem is going to get fixed anytime soon.

The "URL == filesystem path" decision was made early on in the design process, and it's a painful choice that has come back to bite me many times, but it's here to stay for the foreseeable future. When the wget output is saved to the filesystem, it translates the URL directly to a path.
URLs don't cleanly map to filesystem paths, and never will (some filesystems are case insensitive!!). On the other hand, there are many benefits to being able to see website paths easily in a file explorer, it's the most durable format of all. There are also many redundant methods to cover you in the rare case of URL => filename conflicts created by wget.
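The case-insensitivity pitfall mentioned above is easy to demonstrate. This standalone sketch (a naive wget-style mapping for illustration, not ArchiveBox's real code) shows two distinct URLs whose path translations collide on a case-insensitive filesystem:

```python
from urllib.parse import urlsplit

def naive_url_to_path(url: str) -> str:
    """Hypothetical wget-style mapping: host + path, index.html fallback."""
    parts = urlsplit(url)
    return parts.netloc + (parts.path or '/index.html')

a = naive_url_to_path('https://Example.com/Page')
b = naive_url_to_path('https://example.com/page')
print(a != b)                  # True: distinct byte strings...
print(a.lower() == b.lower())  # True: ...the same file on a case-insensitive FS
```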
When it does break due to a filesystem↔URL mapping failure, I believe it should only be the wget archive method that's broken; the other methods all have hardcoded output paths with no URL fragments in them. If you absolutely need to replay a broken wget archive, you can always use the wget WARC to get the same output independently.
What we can do for now is catch the exception and skip all attempts to read URL fragment paths, and save long URLs to the index normally up until some ridiculous limit like 65,000 characters. This will still result in broken wget clones, but all the other outputs and the index should work with the long URLs.
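The mitigation described here could look roughly like the following sketch. The function name and structure are illustrative; only the idea (skip filesystem checks entirely for over-long lines, accept the URL into the index up to a cap) comes from the comment above:

```python
MAX_URL_LENGTH = 65_000  # the "ridiculous limit" suggested above

def parse_line(line: str):
    """Hypothetical parser step: accept long URLs into the index
    without ever treating them as filesystem paths."""
    line = line.strip()
    if not line or len(line) > MAX_URL_LENGTH:
        return None                      # skip instead of crashing the batch
    if line.startswith(('http://', 'https://')):
        return line                      # index it; the wget clone may still fail later
    return None                          # not a URL (a guarded path check could go here)

print(parse_line('https://example.org/?q=' + 'a' * 1000))  # indexed despite its length
print(parse_line('x' * 70_000))                            # None: over the cap, skipped
```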
@mpeteuil commented on GitHub (Aug 11, 2020):
That background definitely helps me understand this better and helps me see that it's not just this one isolated problem.
That sounds reasonable. One long URL spoiling the whole batch that's being parsed is the main issue at hand here, so I think as long as that's resolved then it's case closed on this one.
Thanks for working through this with me, I appreciate all the help. It's not easy maintaining OSS, but my interactions with this project have been nothing but pleasant 😄
@pirate commented on GitHub (Aug 18, 2020):
This should be fixed in
`2e2b4f8` (going out with the next release).

If you give that a try and still have the issue then comment back here and I can reopen the ticket.