mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #154] Archive Method: wget fails some acid tests as a standalone archive method #108
Originally created by @machawk1 on GitHub (Feb 28, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/154
Describe the bug
ArchiveBox produces an inaccurate representation when using an archival crawler basis.
As a very quick background, I created a "tool" a few years back to evaluate archival crawlers. See https://ieeexplore.ieee.org/document/6970146 and the free PDF for more information. This "Archival Acid Test" consisted of an exhibition of some basic and advanced features of the Web that were problematic for archival crawlers. This was sort-of in the spirit of the Web Standards Acid Tests for early browsers.
I tested ArchiveBox 552734241b by running `echo "http://acid.matkelly.com" | ./archive` once installed and received a sub-par result when accessing the files on my local file system. To give it the benefit of the doubt, I ran a local server via `php -S localhost:8090` and accessed the capture. The result looked something like:
[screenshot]
Examining the console, I can see that there are some requests for live Web resource representations outside of the archive. Checking for such requests is one of the tests used to evaluate archival tools in the 2014 study.
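The "requests for live Web resources" check described above can be approximated offline. The following is a minimal sketch, not part of ArchiveBox or the Archival Acid Test: the names `ExternalRefFinder`, `find_external_refs`, and `URL_ATTRS` are hypothetical, and the scan is static, so it only sees URLs written into HTML attributes and will miss requests issued by JavaScript or CSS at render time.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Attributes that commonly trigger a fetch or navigation.
# Illustrative only; this set is an assumption, not an exhaustive list.
URL_ATTRS = {"src", "href", "poster", "data"}


class ExternalRefFinder(HTMLParser):
    """Collect attribute URLs that point at hosts outside the allowed set."""

    def __init__(self, allowed_hosts):
        super().__init__()
        self.allowed_hosts = set(allowed_hosts)
        self.external = []  # list of (tag, attribute, url) tuples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in URL_ATTRS or not value:
                continue
            parsed = urlparse(value)
            # Relative URLs have no hostname and stay inside the capture.
            # Protocol-relative URLs ("//host/path") have an empty scheme.
            if (
                parsed.hostname
                and parsed.scheme in ("", "http", "https")
                and parsed.hostname not in self.allowed_hosts
            ):
                self.external.append((tag, name, value))


def find_external_refs(html, allowed_hosts=("localhost", "127.0.0.1")):
    """Return every (tag, attribute, url) that would leave the local replay host."""
    finder = ExternalRefFinder(allowed_hosts)
    finder.feed(html)
    finder.close()
    return finder.external
```

Feeding the archived page's HTML to `find_external_refs` gives a rough list of references that would leak to the live Web during replay. The browser console remains the authoritative check, since it also shows script-initiated requests.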
Steps to reproduce
Steps to reproduce the behavior:
1. git clone https://github.com/pirate/ArchiveBox.git
2. cd ArchiveBox
3. ./setup
4. echo "http://acid.matkelly.com" | ./archive
5. cd output
6. php -S localhost:8090
7. http://localhost:8090/
Screenshots or log output

Viewing capture w/ disconnected Internet
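For the local-server step in the reproduction steps above, PHP is not strictly required. Here is a minimal sketch of an equivalent static server using only Python's standard library. The helper name `make_server` is hypothetical, and the default port 8090 simply mirrors the `php -S localhost:8090` command used in this report.

```python
import functools
import http.server


def make_server(directory, port=8090):
    """Build a static file server rooted at `directory`, bound to 127.0.0.1 only."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+),
    # so the process does not need to chdir into the capture folder.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling `make_server("output").serve_forever()` from the ArchiveBox checkout should serve the capture at http://localhost:8090/ in the same way, assuming the archive was written to the `output` directory as in the steps above.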
Software versions
552734241b
Other
Note that I am not using this ticket to promote my tool but I hope it can be used to improve the capture quality. The source for the AAT is available at https://github.com/machawk1/archivalAcidTest .
@pirate commented on GitHub (Mar 1, 2019):
ArchiveBox is composed of 9 archive methods, and wget is only one of them. This test is indeed an effective way to test an archive method, but you should check out the other outputs, as the screenshot and PDF versions will likely have all blue squares given they're rendered with a headless browser.
You're effectively just running the test on the effectiveness of the `wget` command, and not the entire ArchiveBox output.
Future versions will save all archive method requests & responses (including wget & headless browser) together in a single WARC using pywb. Once that's released, running the acid test against the WARC proxy should return 100% success, as it'll be capturing a perfect replica of Chromium's experience when rendering the original page.
@machawk1 commented on GitHub (Mar 1, 2019):
Thanks for the info, @pirate. I did not realize that the default capture method used wget. The PDF and screenshots are not nearly as useful to me as having an interactive web page.
I am looking forward to the future integration with pywb for capture. With the current version, how would I go about using 1 of the other 8 archive methods to generate a capture with a little more fidelity than wget?
I attempted to work through the wiki instructions but things start to get convoluted with the Docker and ArchiveBox configuration example.
@pirate commented on GitHub (Mar 1, 2019):
It runs all methods by default. Just click the favicon next to the title on the left to see the link index page: https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage