[GH-ISSUE #549] Bug: ArchiveBox crashes when encountering wget-generated filenames that are too long #348

Closed
opened 2026-03-01 14:42:43 +03:00 by kerem · 14 comments
Owner

Originally created by @danst0 on GitHub (Nov 25, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/549

Hi,

I get a filename too long error after an import. Could someone please help?

Daniel
![Bildschirmfoto 2020-11-25 um 08 20 53](https://user-images.githubusercontent.com/1544506/100195125-29eb0500-2ef7-11eb-9e70-761ba13673ce.png)


@pirate commented on GitHub (Nov 25, 2020):

It is what it says on the tin :)

That filename is indeed too long, you'll have to remove that URL from your archive and delete the file or live with the error. It's a hard limitation of using the filesystem for archiving with the WGET extractor.


@danst0 commented on GitHub (Nov 25, 2020):

Ok.
Shouldn't there be something preventing the generation of these long names in the first place? If I re-archive this link, won't I get the same result again?
If there is this limitation on filename length (I think I am on ext4, so 255 characters), does it really make sense to simply use the URL components to name the files? Should ArchiveBox use some kind of hash instead?

Daniel
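The 255-character per-component limit Daniel mentions is real on ext4 (`NAME_MAX` is 255 bytes). Purely as an illustration of the hashing idea he floats, and not anything ArchiveBox actually does, a tool could keep short URL-derived names readable and only hash-truncate overlong ones (`safe_component` is a hypothetical helper invented for this sketch):

```python
import hashlib

NAME_MAX = 255  # typical per-component filename limit on ext4, in bytes

def safe_component(component: str, limit: int = NAME_MAX) -> str:
    """Hypothetical sketch: return the name unchanged if it fits,
    otherwise keep a readable prefix plus a short hash suffix so the
    result stays unique and under the filesystem limit."""
    raw = component.encode('utf-8')
    if len(raw) <= limit:
        return component
    digest = hashlib.sha256(raw).hexdigest()[:16]
    keep = limit - len(digest) - 1  # leave room for '-' + digest
    prefix = raw[:keep].decode('utf-8', errors='ignore')
    return prefix + '-' + digest
```

Short names pass through untouched, so existing archive paths would be unaffected; only the pathological cases get rewritten.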


@pirate commented on GitHub (Nov 25, 2020):

I know it's not what you want to hear, but you're the first person to hit this edge case in 5 years so I'm not super keen on adding code surface area to catch it, especially when it varies from filesystem to filesystem and the existing error message is also fairly clear about why it's happening.

This limitation also only exists for the wget extractor, which uses the url->filesystem mapping (fundamentally difficult and flawed for many reasons beyond just filename length); none of the other extractors have this problem. If you want to re-archive this link (or others like it with super long filename components) you can turn off the wget extractor with `archivebox config --set SAVE_WGET=False`. It should be fine because we're not relying as heavily on the wget extractor these days anyway: Singlefile, PDF, and Screenshot can make up for the lack of wget.


@danst0 commented on GitHub (Nov 25, 2020):

Ok.



@pirate commented on GitHub (Nov 25, 2020):

Actually I just double-checked the code path and this error will be gone in the next v0.5 release.

Update to the next release and you should be good to go. We removed the filesystem checks for ArchiveResults entirely in #525, and as a side effect it'll never need to call this failing `.exists()` check to render the UI.
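The general shape of such a fix, sketched only as an illustration (this is not the actual #525 change, and `output_exists` is a made-up name), is a guard that treats "File name too long" as the output simply not existing, instead of letting the `OSError` propagate into the UI render:

```python
import errno
import os

def output_exists(path: str) -> bool:
    """Hypothetical guard: report a wget output as missing rather than
    crashing when the path is too long for the filesystem to ever hold."""
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False
    except OSError as e:
        if e.errno == errno.ENAMETOOLONG:
            # this path can never exist on this filesystem -> treat as missing
            return False
        raise
```

Any other unexpected `OSError` is still re-raised, so only the known-unarchivable case is silenced.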


@mAAdhaTTah commented on GitHub (Nov 28, 2020):

FWIW, this *may* be a Windows-only issue. I vaguely recall early versions of npm having this issue with its naive dependency resolution algorithm.


@danst0 commented on GitHub (Nov 28, 2020):

Probably not. I have ArchiveBox running in Docker on Ubuntu.


@clb92 commented on GitHub (May 6, 2024):

For what it's worth, I'm having the same problem right now on a Docker install of the latest ArchiveBox image. The host system runs Unraid and uses XFS, if that matters.

From a normal user's perspective, ArchiveBox just seems to stop working randomly, which is terrible UX. I'm a firm believer that ArchiveBox should (at a minimum) catch the error gracefully and just skip the URL instead.

This is all the log I could get easily:
![Screenshot](https://github.com/ArchiveBox/ArchiveBox/assets/4537419/f12dcf32-9404-4350-869c-6a110a2728a8)

**EDIT:** I can't even start ArchiveBox again, it just instantly crashes. How do I recover from this? How can I remove the offending URL(s)?

**EDIT 2:** Found this documented in the Roadmap (for some reason). In my case here, I just removed everything containing the string "clicks.getpocket.com" from my archive:

```shell
docker run -it -v $PWD:/data archivebox/archivebox /usr/local/bin/archivebox remove --delete --filter-type substring 'clicks.getpocket.com'
```

@pirate commented on GitHub (May 7, 2024):

We can definitely catch the error and skip the failing logic at the archivebox layer. It's not super common but at this point I've seen this issue ~4 times in the wild so it seems fair to add a workaround now.
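Catching per-URL errors at the archivebox layer, as proposed here, could look roughly like this (illustrative only; `archive_all`, `archive_one`, and the result shape are invented for the sketch, not real ArchiveBox internals):

```python
def archive_all(urls, archive_one):
    """Hypothetical wrapper: one URL whose output path can't exist should
    be skipped and recorded, not crash the whole archiving run."""
    results = {}
    for url in urls:
        try:
            results[url] = archive_one(url)
        except OSError as exc:
            # e.g. ENAMETOOLONG from the wget url->filesystem mapping
            results[url] = f'skipped: {exc}'
    return results
```

The failed URL stays visible in the results instead of silently disappearing, which matches the "skip it, don't crash" behavior users ask for above.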


@clb92 commented on GitHub (May 7, 2024):

Thank you very much for the context and explanation (though, it looks like that comment disappeared again just now).

My use case is that I have the browser addon continuously add every page I visit to ArchiveBox. The biggest annoyance for me, currently, is that I have to babysit ArchiveBox and manually go in and remove a URL from ArchiveBox if/when it stops working, so that I can keep it working. It's not a stable setup. I would rather have a few URLs that can't be archived properly than having my setup break every few days, needing manual intervention every time.

But I recognize that others may be using ArchiveBox in a different way, perhaps only manually adding URLs to be crawled and archived a lot more thoroughly.

Perhaps what's needed is a way to mark URLs being added as "not important", i.e. "ignore errors and just skip whatever fails"?

Just throwing ideas out there.


@pirate commented on GitHub (May 7, 2024):

I deleted my comment because it was wrong; it turns out wget is not creating bad paths, it's my workaround logic checking for a path that cannot exist that's causing the error.

I just fixed it on `:dev` 🚀 f770bba3

> The biggest annoyance for me, currently, is that I have to babysit ArchiveBox and manually go in and remove a URL from ArchiveBox if/when it stops working, so that I can keep it working.

Yeah, obviously that's not ideal haha, it should definitely be stable and reliable and not crash for stuff like this.

> Perhaps what's needed is a way to mark URLs being added as "not important", i.e. "ignore errors and just skip whatever fails"?

Nah, it should just work, or in the worst-case scenario skip a URL it can't archive; I don't want to add extra work for the user.


@clb92 commented on GitHub (May 7, 2024):

> I deleted my comment because it was wrong, it turns out wget is not creating bad paths, it's my workaround logic checking for a path that cannot exist that's causing the error.

Oh, haha.

> I just fixed it on `:dev` f770bba

Thanks, I'll test it later today.


@clb92 commented on GitHub (May 7, 2024):

Sorry, but I can't test out `:dev` it seems, since right after updating I'm getting the error `Could not find profile "Default" in CHROME_USER_DATA_DIR`. I haven't changed anything in my config. `CHROME_USER_DATA_DIR` is set to `None`. Downgrading to `:latest` does not solve it, even though it worked fine before.

**EDIT:** Ah, after downgrading, I had to go into the conf and remove `CHROME_USER_DATA_DIR` completely, which I had fiddled with.


@pirate commented on GitHub (May 7, 2024):

Oh thanks for reminding me of that, some other users reported it too. I'll change it to be a warning by default instead of an error so it's less annoying. https://github.com/ArchiveBox/ArchiveBox/issues/1425
