mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #291] Issues with external resource URLs not being rewritten in wget archives #210
Originally created by @mfioretti on GitHub (Nov 1, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/291
Greetings,
and thanks in advance for any comment on what follows. I would really need a real solution to this problem that makes ArchiveBox less useful, I can't be the only one to have it, and don't know where else to ask...
An "archive" of a web page should be a version of that page that looks and feels as much as possible like the original, even when that page is loaded offline, without access to the original server or to any third-party resource.
If the archived page does not pass those tests, it has little or no value, because it is still dependent on third-party resources that may disappear at any moment.
A month ago, I filed an issue here because I found out that wget, as used in ArchiveBox, was not passing those tests (for details, see https://github.com/pirate/ArchiveBox/issues/276). Immediately after that discussion, I also raised the problem on the wget mailing list.
Summarizing, what I have taken home so far is that wget cannot make real, useful archives, or at least nobody knows how to use it for that purpose, and that this issue raised more interest here than on the wget list. So this seems the best place to ask: am I missing something? Or is it time to replace wget with something else, both inside and outside of ArchiveBox? If so, with what?
For example: Firefox/Chrome extensions like SingleFile save a whole page as ONE HTML file, with JavaScript, images, and even video if you want, embedded inside it. Technically, it should be possible to add that extension to the ArchiveBox container, plus code (Selenium?) that, instead of calling wget, tells the browser "save this page with SingleFile". Besides being above my own skills, however, this approach has the big problem that it would make deduplication of identical images, scripts, etc. across different pages impossible, thus creating much, much bigger archives. Thoughts?
@pirate commented on GitHub (Nov 1, 2019):
Can you post some specific examples of failing pages on the latest version of wget?
@mfioretti commented on GitHub (Nov 2, 2019):
This morning I ran the archivebox container on about ten of my newest bookmarks. These:
https://news.wisc.edu/sharing-control-with-robots-may-make-aircraft-manufacturing-safer-more-efficient/
https://www.theguardian.com/commentisfree/2019/nov/01/in-a-world-made-small-by-smartphones-we-crave-escape-into-otherness

are just two examples of the problem, which I see in most of those ten copies:
first, the archived copies contain absolute (external) links to JavaScript files and other resources, instead of links to their local copies, i.e. lots of links like

```html
<script type="text/javascript" src="https://news.wisc.edu/wp/wp-includes/js/wp-embed.js">
<iframe src="https://c.sharethis.mgr.consensu.org/portal.html">
```

instead of

```html
src="../../news.wisc.edu/wp/wp-includes/js/wp-embed.js"
```

or similar. Besides, when I load the "local copy" in my browser, I see the bottom bar of the browser itself full of notifications like "reading X", "reading Y"... where X and Y are external websites. That's sure, definitive proof that the archived copy is NOT complete, isn't it? If it were complete, I'd only see the browser say "read example.com", that is, my own local server.
FTR, I am not worrying much about the version of wget because it really seems not to be relevant. If it were, I would have gotten at least a "try with the newer version" reply on the wget list...
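(The "absolute external links" complaint above is easy to check mechanically: scan an archived index.html for src/href attributes that still carry an absolute http(s) URL; any hit means the page still phones home. A small standard-library sketch — the class and function names are illustrative:)

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ExternalRefFinder(HTMLParser):
    """Collect src/href attribute values that point at an external host."""
    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Absolute http(s) URLs were not rewritten to local paths
            if name in ("src", "href") and value and \
                    urlparse(value).scheme in ("http", "https"):
                self.external.append(value)

def find_external_refs(html_text: str) -> list:
    finder = ExternalRefFinder()
    finder.feed(html_text)
    return finder.external
```

Running it over the wp-embed.js example from this thread flags the absolute URL while leaving the relative, properly rewritten link alone.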
@pirate commented on GitHub (Nov 4, 2019):
I just tried a standard wget archive of both those URLs and it worked fine, I was able to browse 100% local copies of the sites with no external resources requested.
URLs for images, JavaScript, CSS, iframes, etc. are all rewritten to relative local archive paths:

[screenshot omitted in this mirror]
This is the standard behavior expected with both wget alone and ArchiveBox + wget. If you're not seeing this behavior then there's either a bug in the container / code setup or a problem with the process you're using to archive sites.
Can you post exactly the commands you've used to archive sites with archivebox? Ideally with a .zip of your archive output folder as well.

@mfioretti commented on GitHub (Dec 12, 2019):
Hello @pirate , and first of all sorry for not answering earlier. For some reason, I missed the email notification for your reply, and so only saw it today, when I came back to github for other reasons.
Now, wrt your question: first of all, if I run wget "natively", NOT in the container, on my server, with exactly the same options as you used, it just aborts. Looking into it, I realized that this is because I have version 1.14 of wget, which does not have the --no-hsts and --compression=auto options, only --no-warc-compression.
If I run your command without those options, that is like this:
```shell
wget -nv -E -k -x -K -H -np -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=archive.warc -p --no-check-certificate "https://news.wisc.edu/sharing-control-with-robots-may-make-aircraft-manufacturing-safer-more-efficient/"
```

I get the folder attached. As far as I can tell: images are local, and all scripts are local... except these two in the index.html file:
so, much better, but not 100%? What next?
Thanks,
wget_test.tar.gz
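(The version mismatch described above — wget 1.14 choking on --no-hsts and --compression=auto — could be handled by filtering the flag list against the installed version before building the command. A hedged sketch; the per-flag minimum versions are illustrative assumptions consistent with the discussion, not quoted from wget's changelog:)

```python
# Assumed minimum wget version for each newer flag (illustrative values;
# all the thread establishes is that 1.14 supports neither of them).
VERSIONED_FLAGS = {
    "--no-hsts": (1, 17),
    "--compression=auto": (1, 19),
}

def supported_flags(version: str, flags: list) -> list:
    """Drop flags the given wget version (e.g. '1.14') cannot parse."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return [f for f in flags if parts >= VERSIONED_FLAGS.get(f, (0, 0))]
```

With this, building the command for wget 1.14 would silently omit the two unsupported options instead of aborting the whole run.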
@pirate commented on GitHub (Dec 18, 2019):
Are you able to try on the latest version of wget? The URL rewriting logic is constantly being improved, so I'd recommend getting the latest version before debugging this further.
@mfioretti commented on GitHub (Dec 18, 2019):
This means that you do not have those absolute URLs to oss.maxcdn.com that I mentioned, but relative URLs pointing to local copies of those files, I gather? Or not?
Anyway, I am not sure if I can get the latest version of wget on the server. I may try it on my desktop, but not right away. Related question: which version of wget is used in the current container image of archivebox? If that were the latest, my problem would be solved. I went for the container exactly to avoid version issues...
@pirate commented on GitHub (Jan 5, 2020):
Correct, when I tried archiving it with the given command, wget successfully pulled those files and rewrote the URLs to relative locations of the archived versions.

The container is somewhat dated at this point, so I wouldn't be surprised if the wget version within it is out-of-date as well. wget is constantly evolving and I haven't updated the ArchiveBox container since the last release 6+ months ago. Obviously that's not ideal, the container should be the single-source-of-truth version that always works, but sometimes life gets in the way. Keep an eye on the v0.4 PR to be notified on when the next AB container version gets released. In the meantime, I recommend downloading the latest wget version and pointing AB to it with this config option: https://github.com/pirate/ArchiveBox/wiki/Configuration#wget_binary

@pirate commented on GitHub (Feb 4, 2020):
Try the latest container version, I updated it recently so the wget binary should've been updated as well.

@mfioretti commented on GitHub (Feb 4, 2020):
Thanks! Will certainly do it over the weekend. And the way to download/start/run it is always this, right?
```shell
cat url_list.csv | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env [OPTIONS HERE]
```

@pirate commented on GitHub (Feb 4, 2020):
```shell
cat url_list.csv | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env [OPTIONS HERE] /bin/archive
```

or

```shell
cat url_list.csv | docker run -i -v $ARCHIVEBOXHOME:/data --env-file=ArchiveBox.env nikisweeting/archivebox
```

https://github.com/pirate/ArchiveBox/wiki/Docker#usage-1
@neetij commented on GitHub (Mar 18, 2020):
Hi @pirate I'm facing a similar issue with the latest version through docker-compose. None of the archived pages seem to download external assets like CSS, images, and JS.
The command I'm using is
```shell
docker-compose exec archivebox /bin/archive https://feeds.pinboard.in/rss/secret:SECRET/u:USERNAME/t:TAGNAME
```

I've set the following environment settings in docker-compose.yml:
BTW, if there's a more appropriate forum for this, please let me know.
@pirate commented on GitHub (Jul 16, 2020):
@neetij can you try with the latest `django` branch version? I think this has been fixed. If you're still encountering issues, comment back here and I'll reopen the ticket.