mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #549] Bug: ArchiveBox crashes when encountering wget-generated filenames that are too long #348
Originally created by @danst0 on GitHub (Nov 25, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/549
Hi,
I get a filename too long error after an import. Could someone please help?
Daniel

@pirate commented on GitHub (Nov 25, 2020):
It is what it says on the tin :)
That filename is indeed too long, you'll have to remove that URL from your archive and delete the file or live with the error. It's a hard limitation of using the filesystem for archiving with the WGET extractor.
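The limit described here is easy to reproduce: on ext4 (and most Linux filesystems) any single path component longer than 255 bytes fails with `ENAMETOOLONG`. A minimal standalone sketch, not ArchiveBox code:

```python
import errno
import os
import tempfile

def create_errno(name: str):
    """Try to create a file with the given name in a temp dir.
    Returns the errno raised, or None if creation succeeded."""
    with tempfile.TemporaryDirectory() as d:
        try:
            open(os.path.join(d, name), "w").close()
            return None
        except OSError as e:
            return e.errno

print(create_errno("a" * 300))     # ENAMETOOLONG on ext4/XFS
print(create_errno("short.html"))  # None: well under the limit
```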
@danst0 commented on GitHub (Nov 25, 2020):
ok.
Shouldn't there be something preventing the generation of these long names in the first place? If I re-archive this link, won't I get the same result again?
Given this limitation on filename length (I think I am on ext4, so 255 bytes per name), does it really make sense to simply use the URL components to name the files? Should ArchiveBox use some kind of hash instead?
Daniel
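The hashing idea is straightforward to sketch: keep a readable prefix of each over-long path component and append a short digest of the full original, so distinct URLs still map to distinct names. This is a hypothetical helper, not what ArchiveBox actually does:

```python
import hashlib

NAME_MAX = 255  # ext4 limits each path component to 255 bytes

def safe_component(name: str, limit: int = NAME_MAX) -> str:
    """Truncate an over-long path component, appending a short hash
    of the original so different long names stay distinct."""
    raw = name.encode("utf-8")
    if len(raw) <= limit:
        return name  # already fits, leave it readable and unchanged
    digest = hashlib.sha256(raw).hexdigest()[:16]
    # keep as much of the readable prefix as fits alongside "-<digest>"
    keep = limit - len(digest) - 1
    prefix = raw[:keep].decode("utf-8", errors="ignore")
    return f"{prefix}-{digest}"
```

The trade-off is that the mapping is no longer reversible: you can't reconstruct the URL from the filename alone, which is one reason a pure URL→path scheme is attractive despite its limits.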
@pirate commented on GitHub (Nov 25, 2020):
I know it's not what you want to hear, but you're the first person to hit this edge case in 5 years so I'm not super keen on adding code surface area to catch it, especially when it varies from filesystem to filesystem and the existing error message is also fairly clear about why it's happening.
This limitation also only exists for the wget extractor, which uses the url->filesystem mapping (which is fundamentally difficult and flawed for many reasons beyond just filename length); none of the other extractors have this problem. If you want to re-archive this link (or others like it with super long filename components) you can turn off the wget extractor with `archivebox config --set SAVE_WGET=False`. It should be fine because we're not relying as heavily on the wget extractor these days anyway. Singlefile, PDF, and Screenshot can make up for the lack of wget.

@danst0 commented on GitHub (Nov 25, 2020):
Ok.
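For anyone following along, the suggested workaround in full. The `--get` check is an assumption based on the current CLI; verify with `archivebox config --help`:

```shell
# turn off the wget extractor so the URL->filesystem mapping is never used
archivebox config --set SAVE_WGET=False

# confirm the new value took effect
archivebox config --get SAVE_WGET
```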
@pirate commented on GitHub (Nov 25, 2020):
Actually, I just double-checked the code path and this error will be gone in the next v0.5 release.
Update to the next release and you should be good to go. We removed the filesystem checks for ArchiveResults entirely in #525, and as a side effect it'll never need to call this failing `.exists()` check to render the UI.

@mAAdhaTTah commented on GitHub (Nov 28, 2020):
FWIW, this may be a Windows-only issue. I vaguely recall early versions of npm having this issue with its naive dependency resolution algorithm.
@danst0 commented on GitHub (Nov 28, 2020):
Probably not. I have ArchiveBox running in Docker on Ubuntu.
@clb92 commented on GitHub (May 6, 2024):
For what it's worth, I'm having the same problem right now on a Docker install of latest ArchiveBox image. Host system runs Unraid and uses XFS, if that matters.
From a normal user's perspective, ArchiveBox just seems to stop working randomly, which is terrible UX. I'm a firm believer that ArchiveBox should (at a minimum) catch the error gracefully and just skip the URL, instead.
This is all the log I could get easily:

EDIT: I can't even start ArchiveBox again, it just instantly crashes. How do I recover from this? How can I remove the offending URL(s)?
EDIT 2: Found this documented in the Roadmap (for some reason). In my case here, I just removed everything containing the string "clicks.getpocket.com" from my archive.
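For anyone else recovering from this: the offending snapshots can be purged from the CLI. A hedged sketch, assuming your ArchiveBox version supports substring filtering on `archivebox remove` (check `archivebox remove --help`):

```shell
# delete every snapshot whose URL contains the offending string,
# including its data folder under ./archive
archivebox remove --filter-type=substring --delete 'clicks.getpocket.com'
```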
@pirate commented on GitHub (May 7, 2024):
We can definitely catch the error and skip the failing logic at the archivebox layer. It's not super common but at this point I've seen this issue ~4 times in the wild so it seems fair to add a workaround now.
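Catching the error at the ArchiveBox layer amounts to guarding the path check so a name the filesystem can't represent is treated as "not archived" instead of crashing. A minimal sketch (hypothetical helper, not the actual ArchiveBox patch):

```python
import errno
from pathlib import Path

def safe_exists(path: Path) -> bool:
    """Like Path.exists(), but treats a name the filesystem cannot even
    represent (ENAMETOOLONG) as absent instead of raising."""
    try:
        return path.exists()
    except OSError as e:
        if e.errno == errno.ENAMETOOLONG:
            return False  # skip this snapshot rather than crash the UI
        raise
```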
@clb92 commented on GitHub (May 7, 2024):
Thank you very much for the context and explanation (though, it looks like that comment disappeared again just now).
My use case is that I have the browser addon continuously add every page I visit to ArchiveBox. The biggest annoyance for me, currently, is that I have to babysit ArchiveBox and manually go in and remove a URL from ArchiveBox if/when it stops working, so that I can keep it working. It's not a stable setup. I would rather have a few URLs that can't be archived properly than having my setup break every few days, needing manual intervention every time.
But I recognize that others may be using Archivebox in a different way, perhaps only manually adding URLs to be crawled and archived a lot more thoroughly.
Perhaps what's needed is a way to mark URLs being added as "not important", i.e. "ignore errors and just skip whatever fails"?
Just throwing ideas out there.
@pirate commented on GitHub (May 7, 2024):
I deleted my comment because it was wrong, it turns out wget is not creating bad paths, it's my workaround logic checking for a path that cannot exist that's causing the error.
I just fixed it on `:dev` 🚀 f770bba3

Yeah, obviously that's not ideal, haha. It should definitely be stable and reliable and not crash for stuff like this.
Nah, it should just work, or in the worst-case scenario skip a URL it can't archive. I don't want to add extra work for the user.
@clb92 commented on GitHub (May 7, 2024):
Oh, haha.
Thanks, I'll test it later today.
@clb92 commented on GitHub (May 7, 2024):
Sorry, but I can't test out `:dev` it seems, since now I'm getting the error `Could not find profile "Default" in CHROME_USER_DATA_DIR` right after updating. I haven't changed anything in my config. CHROME_USER_DATA_DIR is set to None. Downgrading to `:latest` does not solve it, even though it worked fine before.

EDIT: Ah, after downgrading, I had to go into the conf and remove CHROME_USER_DATA_DIR completely, which I had fiddled with.
@pirate commented on GitHub (May 7, 2024):
Oh thanks for reminding me of that, some other users reported it too. I'll change it to be a warning by default instead of an error so it's less annoying. https://github.com/ArchiveBox/ArchiveBox/issues/1425