mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #276] Bugfix: Allow wget to span hosts and download external resources on other domains #198
Originally created by @mfioretti on GitHub (Sep 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/276
Greetings, and thanks for ArchiveBox! It looks really interesting.
So far, my only issue with it is that it does not make local copies of all the images embedded in a page, only of those hosted on the same server.
As an example, I told the archivebox docker container to archive this article, https://www.theguardian.com/society/2019/sep/29/legal-weed-cannabis-vaping-deaths, with this command:
echo 'https://www.theguardian.com/society/2019/sep/29/legal-weed-cannabis-vaping-deaths' | docker run -i -v ~/archivebox-testing:/data nikisweeting/archivebox
I got an output.html file that DOES include/point to a LOCAL copy, on my own server, of the big image on the left, but not of the round thumbnails in the right sidebar. If I right-click on any of them and select "copy image location", I do not get, as I would like, a link like "./some-local-jpg-file.jpg". Instead I get the location of the original image on another server, i.e. "https://i.guim.co.uk/img/media/61755ffc1366344f6b820fa918 ..." This means that if I used the archive on a computer disconnected from the internet, I would not see those thumbnails.
Can this behaviour be changed? I remember wget had some option for this, but IIRC it does not always work well in these cases.
Thanks in advance for any feedback.
@arnauldb commented on GitHub (Sep 30, 2019):
I have the same behavior: it doesn't save images hosted on another server (imgur, for example).
@mfioretti commented on GitHub (Sep 30, 2019):
Nice to see I'm not alone...
The point is: is this an intrinsic limitation, or a bug, of wget? Can it be overcome, and how?
@pirate commented on GitHub (Oct 1, 2019):
Ah wow, I'm surprised it took this long to find this issue. I hadn't even noticed this was happening, but we definitely want to fix it. I think adding --span-hosts to the wget command might do the trick!
@mfioretti commented on GitHub (Oct 2, 2019):
Possibly, yes, as this example may show. I found the same problem in this page:
https://blogs.lse.ac.uk/brexit/2017/05/17/the-brexit-referendum-question-was-flawed-in-its-design/
I told the archivebox container to archive it, and got this local copy
where the "EU referendum ballot paper" image, i.e. https://blogsmedia.lse.ac.uk/blogs.dir/107/files/2017/05/2016_EU_Referendum_Ballot_Paper.jpg,
is NOT served from a local copy, but is still fetched from its original location.
When I saw that, and your comment, I manually ran wget (version 1.14, which is what I have on the CentOS server I use) with these options, including --span-hosts:
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --page-requisites "--user-agent=ArchiveBox/4d25980e3 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://blogs.lse.ac.uk/brexit/2017/05/17/the-brexit-referendum-question-was-flawed-in-its-design/
Then I was confused to discover that:
so I looked at the source code of the html file generated by wget, and saw that it seems to be a browser problem. The source code DOES point to the local copy, that is, it says <img src="../../... 2016_EU_Referendum_Ballot_Paper.jpg". But that HTML tag also contains a srcset attribute with the original URL, and Firefox reports as "image location" the value in srcset, NOT the value in src. If I manually remove the whole srcset attribute from that tag, the local page displays as well as before, pointing to the local image, and Firefox does show that as the "image location".
Summing up, yes, --span-hosts should solve the problem, but Firefox would not let users know...
In practice: how do I tell the archivebox container to use (a) the --span-hosts option for wget, and (b) a standard user agent? The latter question is because I noticed that Medium.com does NOT like wget, and just returns 404 when it asks for pages. I am sure this is documented somewhere, but for some reason I cannot find it myself this morning. Thanks in advance for your patience...
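The manual workaround described above (deleting the srcset attribute so the browser falls back to the local src) can be sketched as a tiny script. This is a regex-based illustration only, not a robust approach (a real tool should use an HTML parser), and the img markup below is a made-up stand-in for the LSE page:

```python
import re

def strip_srcset(html: str) -> str:
    # Drop every srcset="..." attribute so the browser falls back
    # to the (locally rewritten) src attribute instead.
    return re.sub(r'\ssrcset="[^"]*"', '', html)

# Illustrative markup, loosely modeled on the example in this thread.
page = ('<img src="../../2016_EU_Referendum_Ballot_Paper.jpg" '
        'srcset="https://blogsmedia.lse.ac.uk/.../Ballot_Paper.jpg 640w">')
print(strip_srcset(page))
# -> <img src="../../2016_EU_Referendum_Ballot_Paper.jpg">
```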
@mfioretti commented on GitHub (Oct 2, 2019):
Two other things. First, you said:
But from what I read in https://github.com/pirate/ArchiveBox/blob/master/archivebox/archive_methods.py, --span-hosts is already passed as an option to wget in the WGET_BINARY command, no? If yes, why doesn't it work? Or is the container version of archivebox using different options?
Second, I found the docs that say to use --env-file=/path/to/archivebox.conf to pass non-default config options. But if I do that (changing only the user agent string, and disabling warc), the docker container doesn't start at all. Nothing happens: no error messages, nothing. Why?
@mfioretti commented on GitHub (Oct 2, 2019):
Another example of wget not working properly (whether due to misconfiguration in archivebox, or its own bugs, I have no idea):
Try to archive with the docker container this page: https://onezero.medium.com/being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4
among other things, that page uses a script called main.32528bd7.chunk.js which is fetched from https://cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js
when archivebox archives it, it makes a LOCAL copy of that file inside cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js
but the source code of output.html still loads that file from the external website:
instead of
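What the comment above expects from --convert-links is a page-relative reference to the mirrored asset instead of the absolute CDN URL. The intended mapping can be illustrated with a short Python sketch (directory names taken from the example in this thread; the sibling layout under the output directory is an assumption about how wget mirrors each host):

```python
import os.path

# wget mirrors each host into its own directory, so the Medium page and
# the CDN asset end up as sibling trees under the output directory.
page_dir = "onezero.medium.com"
asset = "cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js"

# The path --convert-links should write into the saved HTML:
rel = os.path.relpath(asset, start=page_dir)
print(rel)  # ../cdn-client.medium.com/lite/static/js/main.32528bd7.chunk.js
```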
@rockdaboot commented on GitHub (Oct 2, 2019):
Wget 1.18 first supported the srcset tag. But if updating, please use the latest version 1.20.3 (or build from git master). The last example (main.32528bd7.chunk.js) should be translated to a relative path, as you say. If not, you could help us (at wget) with a (relatively) small reproducer.
Another thing, though it possibly doesn't come into play here: wget doesn't understand javascript, so if any script loads URLs for displaying the page, you are lost with wget.
@mfioretti commented on GitHub (Oct 2, 2019):
Hello @rockdaboot ,
maybe I misunderstand you, or what I found in the archivebox log, but what the archivebox container uses is wget 1.18. If so, why doesn't it deal properly with the srcset tag? Could it be due to not using the --span-hosts option?
About this:
I am not sure what you mean with a "small reproducer". If you mean "how wget was called to get that result", these should be the options that archivebox used:
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --page-requisites "--user-agent=ArchiveBox/4d25980e3 (+https://github.com/pirate/ArchiveBox/) wget/1.18"
In the case of that medium article, the opposite is true. wget downloads correctly, and it makes the saved page load the LOCAL copy of main.32528bd7.chunk.js. But one of the things that script does is to COVER the page with a 404 message. And if I RENAME that .js file, so that it does not load, the saved page displays perfectly. Don't ask me why, try for yourself...
@rockdaboot commented on GitHub (Oct 2, 2019):
Possibly, yes. Make sure --span-hosts is given.
I used
wget --adjust-extension --convert-links --force-directories --span-hosts --no-parent -e robots=off --restrict-file-names=windows --page-requisites https://onezero.medium.com/being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4
but do not see any absolute URL to main.32528bd7.chunk.js. It's a relative path here in being-indistractable-will-be-the-skill-of-the-future-a07780cf36f4.html. I have wget 1.20.3, coming with Debian.
@mfioretti commented on GitHub (Oct 2, 2019):
that's my whole problem right now, if I understand what is happening and where. At least for the foreseeable future, for several reasons not really relevant here, I need to run the ready-to-use archivebox docker container, without actually installing the software or its dependencies, or rebuilding the container myself. This means that if the container does not run wget with that option, and I have no way to pass it as an environment variable, I am out of luck, I guess?
@rockdaboot commented on GitHub (Oct 2, 2019):
All you can do is set SYSTEM_WGETRC or WGETRC pointing to a wgetrc (config) file. Have a line in that file
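The exact line was elided in the comment above and is left as-is. As a hedged illustration, assuming the standard mapping of wget long options to wgetrc commands (--span-hosts becomes span_hosts = on), the setup might look like this; the file path is hypothetical:

```shell
# Hypothetical path -- put the file wherever wget can read it.
cat > /tmp/archivebox-wgetrc <<'EOF'
# wgetrc equivalent of the --span-hosts command-line option
span_hosts = on
EOF

# WGETRC takes precedence over SYSTEM_WGETRC if both are set.
export WGETRC=/tmp/archivebox-wgetrc
```

With WGETRC exported, any subsequent wget invocation in that environment picks up the setting without needing the flag on the command line.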
@pirate commented on GitHub (Oct 2, 2019):
--span-hosts is indeed already being passed, sorry for the confusion! I forgot I had added it earlier.
@mfioretti I think all that needs to happen now is for the official container image to be rebuilt with the latest wget version so that it works out-of-the-box. As for the env-file issue, I'm not sure what's causing that, but I can take a look.
Also, holy cow, @rockdaboot, a real-life wget maintainer! You're practically a celebrity around these parts! Thanks for making all this possible and making wget such an awesome piece of software.
@rockdaboot commented on GitHub (Oct 2, 2019):
Thanks for these nice words @pirate !
@mfioretti commented on GitHub (Oct 2, 2019):
thanks for this tip. I should be able to apply it, test it, and report back in a few hours. In the meantime, just to be sure I get the container/docker syntax right...
right now, in my tests I am calling the docker container in this way:
what you are saying is that I should add, right after the FETCH_WARC=False setting, another env statement like "env SYSTEM_WGETRC=mywgetrc", where mywgetrc is a file that contains ALL the options listed in my previous comments? Or just the span_hosts one? Of course, if, as you say in the other comment, "--span-hosts is indeed already being passed", then it should not make any difference, right? Still, it can't hurt to test...
Oh, and of course sorry but... at this point I can't help to ask "when do you think the official container image will be rebuilt?" :-)
@pirate commented on GitHub (Oct 2, 2019):
I don't think you need that WGETRC config since we're already passing --span-hosts; it's an issue with the older wget version in Docker and its content rewriting (not because that option isn't being passed).
However, in the future, if you wanted to do something like this, you would just add it to the existing env ...vars... /bin/archive section like so:
I'm giving a talk on internet archiving this weekend and I'd like to have it working by then, so I'll aim to push an update in the next few days.
@pirate commented on GitHub (Jul 24, 2020):
This should be fixed on the latest django branch. If you still encounter any problems, comment back here and I'll reopen the issue.