[GH-ISSUE #276] Bugfix: Allow wget to span hosts and download external resources on other domains #198

Closed
opened 2026-03-01 14:41:26 +03:00 by kerem · 16 comments
Owner

Originally created by @mfioretti on GitHub (Sep 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/276

Greetings, and thanks for ArchiveBox! It looks really interesting.

So far, my only issue with it is that it does not make local copies of **all** the images embedded in a page, only those hosted on the same server.

As an example, when I told the archivebox docker container to archive this article https://www.theguardian.com/society/2019/sep/29/legal-weed-cannabis-vaping-deaths :

![Selection_003](https://user-images.githubusercontent.com/6323914/65869219-4da9ce00-e37a-11e9-9edd-786e2291b6e4.png)

with this command:

`echo 'https://www.theguardian.com/society/2019/sep/29/legal-weed-cannabis-vaping-deaths' | docker run -i -v ~/archivebox-testing:/data nikisweeting/archivebox`

I got an output.html file that DOES include/point to a LOCAL copy on my own server of the big image on the left, but not of the round thumbnails in the right sidebar. If I right-click on any of them and select "copy image location", I do not get, as I would like, a link like "./some-local-jpg-file.jpg". I get the location of the original image on another server, i.e. "https://i.guim.co.uk/img/media/61755ffc1366344f6b820fa918 ..." This means that if I used the archive on a computer disconnected from the internet, I would not see those thumbnails.
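A quick way to check how "offline" an archived page really is (a rough sketch; the sample file below is invented for illustration) is to list the absolute URLs that remain in the saved HTML, since each one is a resource that will still be fetched from the network:

```shell
# demo input: a page with one localized and one still-external image
printf '<img src="./local.jpg"><img src="https://i.example.com/a.jpg">\n' > output.html

# list the absolute URLs still present in the archived page;
# any hit will break when the archive is viewed offline
grep -Eo 'https?://[^" ]+' output.html | sort -u
# prints: https://i.example.com/a.jpg
```

Running the same grep on a real output.html before and after a fix makes it easy to see whether external resources were actually localized.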

Can this behaviour be changed? I remember wget had some option for this, but IIRC it does not always work well in these cases.

Thanks in advance for any feedback.


@arnauldb commented on GitHub (Sep 30, 2019):

I have the same behavior, it doesn't save images hosted on another server (imgur for example).


@mfioretti commented on GitHub (Sep 30, 2019):

> I have the same behavior, it doesn't save images hosted on another server (imgur for example).

Nice to see I'm not alone...

The point is: is this an intrinsic limitation, or bug, of wget? Can it be overcome, and how?


@pirate commented on GitHub (Oct 1, 2019):

Ah wow, I'm surprised it took this long for this issue to be found! I hadn't even noticed this was happening, but we definitely want to fix it. I think adding `--span-hosts` to the `wget` command might do the trick!


@mfioretti commented on GitHub (Oct 2, 2019):

> Ah wow, I'm surprised it took this long for this issue to be found! I hadn't even noticed this was happening, but we definitely want to fix it. I think adding `--span-hosts` to the `wget` command might do the trick!

Possibly, yes. As this example may show, I found the same problem on this page:

https://blogs.lse.ac.uk/brexit/2017/05/17/the-brexit-referendum-question-was-flawed-in-its-design/

I told the archivebox container to archive it, and got this local copy

![wget-archivebox-bug](https://user-images.githubusercontent.com/6323914/66019515-15260380-e4e4-11e9-838e-cc7efab28d55.png)

where the "EU referendum ballot paper" image, that is https://blogsmedia.lse.ac.uk/blogs.dir/107/files/2017/05/2016_EU_Referendum_Ballot_Paper.jpg, is NOT a local copy of that image, but is still fetched from its original location.

When I saw that, and your comment, I manually ran wget (version 1.14, which is what I have on the CentOS server I use) with these options, including `--span-hosts`:

`wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --page-requisites "--user-agent=ArchiveBox/4d25980e3 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://blogs.lse.ac.uk/brexit/2017/05/17/the-brexit-referendum-question-was-flawed-in-its-design/`

then I was confused to discover that:

- I DID have a file called "2016_EU_Referendum_Ballot_Paper.jpg" on my server
- but Firefox would still show "https://blogsmedia.lse.ac.uk/blogs.dir/107/files/2017/05/2016_EU_Referendum_Ballot_Paper.jpg" as the location of that image when I right-clicked on it

so I looked at the source code of the HTML file generated by wget, and saw that it seems to be a browser problem. The source code DOES point to the local copy, that is, it says `<img src="../../... 2016_EU_Referendum_Ballot_Paper.jpg"`. But that HTML tag also contains a srcset attribute with the original URL, and Firefox reports as "image location" the value in srcset, NOT the value in src. If I manually remove the whole srcset attribute from that `<img>` tag, the local page displays as well as before, pointing to the local image, and Firefox does show that as "image location".

Summing up, yes, `--span-hosts` should solve the problem, but Firefox would not let users know...
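As a stopgap for the Firefox behaviour described above, the srcset attributes can be stripped from the archived HTML in a post-processing pass (a sketch only; it assumes attribute values contain no embedded double quotes, and the file names are invented examples):

```shell
# demo: an <img> whose src was rewritten by wget but whose srcset
# still carries the original remote URL
printf '<img src="../local.jpg" srcset="https://i.example.com/a.jpg 640w">\n' > page.html

# remove every srcset="..." attribute so the browser falls back to src
sed -E 's/ srcset="[^"]*"//g' page.html > page.offline.html
cat page.offline.html
# prints: <img src="../local.jpg">
```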

In practice: how do I tell the archivebox **container** to use a) the `--span-hosts` option for wget, and b) a standard user agent? The latter question is because I noticed that Medium.com does NOT like wget, and just returns 404 when it asks for pages. I am sure that this is documented somewhere, but for some reason I cannot find it myself this morning. Thanks in advance for your patience...


@mfioretti commented on GitHub (Oct 2, 2019):

Two other things. First, you said:

> I think adding `--span-hosts` to the `wget` command might do the trick!

But from what I read in https://github.com/pirate/ArchiveBox/blob/master/archivebox/archive_methods.py, `--span-hosts` is **already** passed as an option to wget in the WGET_BINARY command, no? If yes, why doesn't it work? Or is the container version of archivebox using different options?

Second, I found the docs that say to use `--env-file=/path/to/archivebox.conf` to pass non-default config options. But if I do that (changing only the user agent string, and disabling WARC), the docker container doesn't start at all. Nothing happens, no error messages, nothing. Why?

@mfioretti commented on GitHub (Oct 2, 2019):

Another example of wget not working properly (whether due to misconfiguration in archivebox, or its own bugs, I have no idea):

Try archiving this page with the docker container: https://onezero.medium.com/being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4

among other things, that page uses a script called main.32528bd7.chunk.js which is fetched from https://cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js

when archivebox archives it, it makes a LOCAL copy of that file inside cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js

but the source code of output.html still loads that file from the external website:

`<script src="https://cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js"></script>`

instead of

`<script src="../../ .... /cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js"></script>`

So what I see from all these examples is that, even if the archivebox code does pass the `--span-hosts` option to wget, either that "passing" does not happen, or wget itself is buggy. In which case, it is basically useless for making "really offline" archives, isn't it?

@rockdaboot commented on GitHub (Oct 2, 2019):

Wget 1.18 first supported the `srcset` attribute. But if updating, please use the latest version, 1.20.3 (or build from git master).

The last example (main.32528bd7.chunk.js) should be translated to a relative path, as you say. If not, you could help us (at wget) with a (relatively) small reproducer.

Another thing, that possibly doesn't come into play here: wget doesn't understand JavaScript, so if any script loads URLs for displaying the page, you are lost with wget.


@mfioretti commented on GitHub (Oct 2, 2019):

> Wget 1.18 first supported the `srcset` attribute. But if updating, please use the latest version, 1.20.3 (or build from git master).

Hello @rockdaboot ,

Maybe I misunderstand you, or what I found in the archivebox log, but what the archivebox container uses **is** wget 1.18. If so, why doesn't it deal properly with the srcset attribute? Might that be due to not using the `--span-hosts` option?

About this:

> The last example (main.32528bd7.chunk.js) should be translated to a relative path, as you say. If not, you could help us (at wget) with a (relatively) small reproducer.

I am not sure what you mean by a "small reproducer". If you mean "how wget was called to get that result", these should be the options that archivebox used:

`wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --page-requisites "--user-agent=ArchiveBox/4d25980e3 (+https://github.com/pirate/ArchiveBox/) wget/1.18"`

> Another thing, that possibly doesn't come into play here: wget doesn't understand JavaScript, so if any script loads URLs for displaying the page, you are lost with wget.

In the case of that Medium article, the opposite is true. wget downloads correctly, and it makes the saved page load the LOCAL copy of main.32528bd7.chunk.js. But one of the things that script does is to COVER the page with a 404 message. And if I RENAME that .js file, so it doesn't load, the saved page displays perfectly. Don't ask me why, try for yourself...
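The rename workaround above can be scripted. This sketch recreates the wget directory layout mentioned earlier (the paths are taken from the example, everything else is assumed) and moves the script aside so the archived page no longer executes it:

```shell
# stand-in for the directory tree wget produced for the Medium page
mkdir -p cdn-client.medium.com/lite/static/js
touch cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js

# move the script aside; the <script> tag then fails to load locally
# and the saved page renders without the client-side 404 overlay
mv cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js \
   cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js.disabled
```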


@rockdaboot commented on GitHub (Oct 2, 2019):

> Maybe I misunderstand you, or what I found in the archivebox log, but what the archivebox container uses **is** wget 1.18. If so, why doesn't it deal properly with the srcset attribute? Might that be due to not using the `--span-hosts` option?

Possibly, yes. Make sure `--span-hosts` is given.

I used

`wget --adjust-extension --convert-links --force-directories --span-hosts --no-parent -e robots=off --restrict-file-names=windows --page-requisites https://onezero.medium.com/being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4`

but do not see any absolute URL to `main.32528bd7.chunk.js`. It's a relative path here in `being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4.html`.

I have wget 1.20.3, which comes with Debian.


@mfioretti commented on GitHub (Oct 2, 2019):

> Possibly, yes. Make sure `--span-hosts` is given.

That's my whole problem right now, if I understand what is happening and where. At least in the foreseeable future, for several reasons not really relevant here, I need to run the ready-to-use archivebox docker container, without actually installing the software, or its dependencies, or rebuilding the container myself. This means that if the container does not run wget with that option, and I have no way to pass it as an environment variable, I am out of luck, I guess?


@rockdaboot commented on GitHub (Oct 2, 2019):

All you can do is set SYSTEM_WGETRC or WGETRC to point to a wgetrc (config) file. Have this line in that file:

spanhosts = on
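As a concrete sketch of that tip (the file path is just an example), write the option into a file and point WGETRC at it; wget merges options from that file with whatever flags are on the command line:

```shell
# write a minimal wgetrc that enables host spanning
cat > /tmp/archivebox-wgetrc <<'EOF'
spanhosts = on
EOF

# any wget invocation run with this environment picks the option up, e.g.:
#   WGETRC=/tmp/archivebox-wgetrc wget --page-requisites <url>
cat /tmp/archivebox-wgetrc
# prints: spanhosts = on
```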

@pirate commented on GitHub (Oct 2, 2019):

`--span-hosts` is indeed already being passed, sorry for the confusion! I forgot I had added it earlier.

@mfioretti I think all that needs to happen now is for the official container image to be rebuilt with the latest wget version so that it works out-of-the-box. As for the env-file issue, I'm not sure what's causing that but I can take a look.

Also, holy cow, @rockdaboot, a real-life `wget` maintainer! You're practically a celebrity around these parts! Thanks for making all this possible and for making `wget` such an awesome piece of software.


@rockdaboot commented on GitHub (Oct 2, 2019):

Thanks for these nice words @pirate !


@mfioretti commented on GitHub (Oct 2, 2019):

> All you can do is set SYSTEM_WGETRC or WGETRC to point to a wgetrc (config) file. Have this line in that file:
>
> spanhosts = on

thanks for this tip. I should be able to apply, test and report it in a few hours. In the meantime, just to be sure I get container/docker syntax right....

right now, in my tests I am calling the docker container in this way:

`cat url_list.txt | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env WGET_USER_AGENT="some user agent string here" env FETCH_WARC=False /bin/archive &> archivebox.log`

What you are saying is that I should add, right after the FETCH_WARC=False setting, another env statement like "env SYSTEM_WGETRC=mywgetrc", where mywgetrc is a file that contains ALL the options listed in my previous comments? Or just the spanhosts one? Of course, if as you say in the other comment, "--span-hosts is indeed already being passed", then it should not make any difference, right? Still, it can't hurt to test...

Oh, and of course sorry, but... at this point I can't help asking "when do you think the official container image will be rebuilt?" :-)


@pirate commented on GitHub (Oct 2, 2019):

I don't think you need that WGETRC config since we're already passing `--span-hosts`; it's an issue with the content rewriting in the older wget version in the Docker image (not because that option isn't being passed).

However, in the future, if you wanted to do something like this, you would just add it to the existing `env ...vars... /bin/archive` section like so:

`cat url_list.txt | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env WGET_USER_AGENT="some user agent string here" env FETCH_WARC=False WGETRC=/data/.wgetrc /bin/archive &> archivebox.log`

> when do you think the official container image will be rebuilt? :-)

I'm giving a talk on internet archiving this weekend and I'd like to have it working by then, so I'll aim to push an update in the next few days.


@pirate commented on GitHub (Jul 24, 2020):

This should be fixed on the latest `django` branch. If you still encounter any problems, comment back here and I'll reopen the issue.

```bash
git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
```