[GH-ISSUE #210] Can't write wget snapshots of URL with query-string #3164

Closed
opened 2026-03-14 21:22:35 +03:00 by kerem · 5 comments
Owner

Originally created by @bltavares on GitHub (Apr 5, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/210

Describe the bug

Most Linux filesystems allow ? and other special characters in folder and file names. But if you use ExFAT as the storage, wget fails with the following error:

Got wget response code 3:
Cannot write to ‘www.youtube.com/watch?v=BkW1xQgrSPQ.html’ (No such file or directory).

ExFAT is a common filesystem for portable USB drives, especially when the drive is meant to be used with Linux, Windows, and macOS. It is not the best filesystem for archiving (ZFS, for example, is better suited), but it is the most portable one.

I'm not sure how to do that yet, but it would be nice to convert special characters into safer ones when writing files to disk.
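
One possible shape for such a conversion, as a minimal sketch only: the helper names `safe_component` and `safe_path` are hypothetical and not part of ArchiveBox, and the set of rejected characters is an assumption based on common Windows file-name rules. Each unsafe character is percent-encoded so the original name stays recoverable.

```python
import re

# Characters that Windows-style filesystems (ExFAT, NTFS) typically reject in
# file names, plus ASCII control characters. This set is an assumption.
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_component(name: str) -> str:
    """Percent-encode every unsafe character in a single path component."""
    return _UNSAFE.sub(lambda m: "%{:02X}".format(ord(m.group())), name)


def safe_path(url_path: str) -> str:
    """Sanitize each '/'-separated component, keeping the directory layout."""
    return "/".join(safe_component(part) for part in url_path.split("/"))


print(safe_path("www.youtube.com/watch?v=BkW1xQgrSPQ.html"))
```

With this sketch the path from the error above becomes www.youtube.com/watch%3Fv=BkW1xQgrSPQ.html, which contains no characters that ExFAT rejects.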

Steps to reproduce

  1. Create an ExFAT partition, cd into it
  2. Execute archivebox using a YouTube URL: echo 'www.youtube.com/watch?v=BkW1xQgrSPQ' | ./archive

Software versions

  • OS: Debian
  • ArchiveBox version: docker image: 296aa767078f
kerem closed this issue 2026-03-14 21:22:40 +03:00
Author
Owner

@pirate commented on GitHub (Apr 11, 2019):

Hmm interesting, looks like we'll have to switch from --restrict-file-names=unix to --restrict-file-names=windows now that we support Windows and Windows file systems. When set to windows, wget's filename escaping is a superset of the escaping that unix does, so it'll continue to work on existing systems but make all newly archived files ExFAT/NTFS compatible.

https://www.gnu.org/software/wget/manual/wget.html#targetText=--restrict-file-names=modes

Author
Owner

@pirate commented on GitHub (Apr 11, 2019):

Done: https://github.com/pirate/ArchiveBox/commit/4f599c0b0b07c842b1a2d0ec31f229d8fa0d6294

Author
Owner

@bltavares commented on GitHub (Apr 11, 2019):

Thank you! 🙇

Author
Owner

@bltavares commented on GitHub (Apr 12, 2019):

@pirate I've tried the latest published Docker image, which should include the fix, but the error log still shows unix.

Looking at the code, adding the flag alone might not be enough; wget_output_path might need to change as well.


[*] [2019-04-12 01:26:38] "Om Next - David Nolen - YouTube"
    https://www.youtube.com/watch?v=ByNs9TG30E8
    √ /data/archive/1553864511.3
      > wget
        Failed: Got an error from the server
            Got wget response code: 3.
            www.youtube.com/watch?v=ByNs9TG30E8.html: No such file or directory
            Cannot write to ‘www.youtube.com/watch?v=ByNs9TG30E8.html’ (No such file or directory).
        Run to see full output:
            cd /data/archive/1553864511.3;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=unix --timeout=60 --warc-file=warc/1555032398 --page-requisites "--user-agent=ArchiveBox/1191cf1df (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.youtube.com/watch?v=ByNs9TG30E8
--
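
To illustrate why the computed output path may need to follow the flag, here is a rough sketch of how the expected file name differs between the two modes. It relies on the wget manual's statement that windows mode uses @ instead of ? before the query and + instead of : between host and port. The function `predicted_wget_path` is hypothetical (it is not ArchiveBox's actual `wget_output_path`), it assumes an HTML page saved with --adjust-extension, and it ignores the additional percent-escaping wget applies to other restricted characters.

```python
from urllib.parse import urlsplit


def predicted_wget_path(url: str, restrict_file_names: str = "windows") -> str:
    """Guess the local path wget writes for an HTML page (simplified sketch)."""
    parts = urlsplit(url)
    windows = restrict_file_names == "windows"

    # In windows mode wget separates host and port with '+' instead of ':'.
    host = parts.netloc.replace(":", "+") if windows else parts.netloc
    path = host + parts.path

    if parts.query:
        # In windows mode wget separates the query with '@' instead of '?'.
        path += ("@" if windows else "?") + parts.query

    # --adjust-extension appends .html to HTML pages that lack the suffix.
    if not path.endswith((".html", ".htm")):
        path += ".html"
    return path


url = "https://www.youtube.com/watch?v=ByNs9TG30E8"
print(predicted_wget_path(url, "unix"))
print(predicted_wget_path(url, "windows"))
```

Under these assumptions the unix result matches the file name in the log above, while the windows result uses watch@v=ByNs9TG30E8.html, so any code that looks for the saved file would need to expect the second form.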


Author
Owner

@bltavares commented on GitHub (Apr 12, 2019):

I've noticed that the Dockerfile clones the latest commit from GitHub instead of copying the local project, with its changes, into the image. This can make it confusing why a change is not being built: Docker's automated build caches layers, and since the text of the clone instruction doesn't change, the cached layer is reused and the image is not updated.

I'll send a PR soon (unless it's intentional) :)
