mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1567] Bug: archivebox server - snapshot index Wget > HTML panel pulls from live page if no "singlefile.html" #935
Originally created by @AFlowOfCode on GitHub (Oct 24, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1567
Originally assigned to: @benmuth on GitHub.
ArchiveBox v0.7.2
I'm archiving articles on a website which limits the amount of articles you can view unless you are a subscriber. I am a subscriber, so I have no problem archiving these articles. That's only important in that it made this bug obvious.
When I run the archivebox server and go to the snapshot index for any given article, the preview in the panel for "Wget > HTML" shows the live page, with a message like "you've reached your article limit for the month". This is identical to the "Original" panel preview. This page is not saved anywhere in the snapshot folder (or any other) - it's clearly pulling from the live site.
The cause of this behavior is the lack of `singlefile.html`. I'm not using this with node and didn't turn off `SAVE_SINGLEFILE`. Then I set that to `False` to test if it still happens, and it does.
When I click on the Wget panel, the same live paywall page is loaded into an `iframe` in the main display panel, with the source set to the non-existent `singlefile.html`. Devtools network panel shows a 404 for `http://127.0.0.1:8000/archive/<snapshot_id>/singlefile.html`.
When I go into the actual file system folder and open the `index.html` file in my browser (without using the server), I get the correct preview. I can also open the actual HTML file in `archivebox/archive/<snapshot_id>/<website_domain>/<article_path>/index.html` and the full article is there as expected. Of course, on the snapshot index the display pane and many other preview panes still try to load `singlefile.html` and instead get the browser's "file not found" page. But the wget preview & display actually work here, unlike when using the server.
Considering that singlefile's presence is not obligatory, the server behavior should be adjusted to adapt to its absence. I would expect that both the wget preview and the article should load from the correct source and not singlefile when viewing snapshots with the server. I found this very confusing and my first impression was that archivebox was simply not working until I investigated further.
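The expected behavior described above amounts to choosing the first archive output that actually exists on disk instead of assuming `singlefile.html`. The following is only a minimal sketch of that idea, not ArchiveBox's actual code: the function name `pick_preview_source` and the candidate list are hypothetical, and the wget path in the usage note is a placeholder.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_preview_source(snapshot_dir: Path, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate (relative to snapshot_dir) that exists as a file.

    Hypothetical helper: `candidates` is an ordered list of relative paths,
    most preferred first. Returns None when none of them exist, so the
    caller can show a "no archived copy" message rather than a broken iframe.
    """
    for rel_path in candidates:
        if (snapshot_dir / rel_path).is_file():
            return rel_path
    return None
```

With candidates such as `["singlefile.html", "example.com/article/index.html"]`, a snapshot without singlefile output would resolve to the wget copy, and the iframe would never point at a file that returns a 404.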
In summary, if there's no `singlefile.html`, the snapshot index defaults to the live page for both the Wget panel preview and the actual display. This is erroneous since the correct copy is present.
You could reproduce this by archiving a page without using singlefile & then looking at the iframe source. To make it obvious that you're getting the live site instead of your archived copy, try it with paywalled articles or modify the source of your archive in a way that makes it obvious.
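For the "looking at the iframe source" step, one rough way to check a saved copy of the snapshot index page is to list every `iframe` `src` it contains. This sketch uses only Python's standard `html.parser`; the names `IframeSrcCollector` and `iframe_sources` are made up for illustration, and it only sees iframes present in the static HTML, not ones whose source is set later by JavaScript.

```python
from html.parser import HTMLParser
from typing import List


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str) -> List[str]:
    """Return the src values of all iframes in the given HTML string."""
    collector = IframeSrcCollector()
    collector.feed(html)
    return collector.sources
```

Any returned value ending in `singlefile.html` for a snapshot that has no such file would match the behavior reported here.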
Thank you!
@pirate commented on GitHub (Oct 24, 2024):
The snapshot detail view page has been entirely rewritten as of the v0.8.2 BETA (see it running here), so I suspect this is fixed already.
If you're willing to try a BETA release, back up your archive first and give `archivebox/archivebox:dev` a go. Otherwise wait for the v0.9.0 stable release to land to get the new UI.
@benmuth commented on GitHub (Oct 27, 2024):
Yeah it seems like this issue has been fixed in the latest versions. I was able to reproduce on `v0.7.2` but not on `0.8.5rc53`. When `singlefile` isn't used as an archive method, `singlefile.html` doesn't show up as the source for any iframes, and each loads from the correct source.
@AFlowOfCode commented on GitHub (Oct 27, 2024):
Glad to hear it. I'll wait for the stable version but sounds like this can be closed, so I'll do that.