[GH-ISSUE #1567] Bug: archivebox server - snapshot index Wget > HTML panel pulls from live page if no "singlefile.html" #935

Closed
opened 2026-03-01 14:47:24 +03:00 by kerem · 3 comments

Originally created by @AFlowOfCode on GitHub (Oct 24, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1567

Originally assigned to: @benmuth on GitHub.

ArchiveBox v0.7.2

I'm archiving articles on a website which limits the number of articles you can view unless you are a subscriber. I am a subscriber, so I have no problem archiving these articles. That's only important in that it made this bug obvious.

When I run the archivebox server and go to the snapshot index for any given article, the preview in the panel for "Wget > HTML" shows the live page, with a message like "you've reached your article limit for the month". This is identical to the "Original" panel preview. This page is not saved anywhere in the snapshot folder (or any other) - it's clearly pulling from the live site.

The cause of this behavior is the missing singlefile.html. I'm not running ArchiveBox with node, and at first I hadn't turned off SAVE_SINGLEFILE. I then set it to False to test whether the problem still happens, and it does.

When I click on the Wget panel, the same live paywall page is loaded into an iframe in the main display panel, with the source set to the non-existent singlefile.html. Devtools network panel shows a 404 for http://127.0.0.1:8000/archive/<snapshot_id>/singlefile.html.

When I go into the actual file system folder and open the index.html file in my browser (without using the server), I get the correct preview. I can also open the actual HTML file in archivebox/archive/<snapshot_id>/<website_domain>/<article_path>/index.html and the full article is there as expected. Of course, on the snapshot index the display pane and many other preview panes still try to load singlefile.html and instead get the browser's "file not found" page. But the wget preview & display actually work here, unlike when using the server.

Considering that singlefile's presence is not obligatory, the server behavior should be adjusted to adapt to its absence. I would expect that both the wget preview and the article should load from the correct source and not singlefile when viewing snapshots with the server. I found this very confusing and my first impression was that archivebox was simply not working until I investigated further.
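To illustrate the kind of fallback I mean, here is a minimal sketch. It is not ArchiveBox's actual code: the function name `pick_preview_source` and the idea of passing an ordered list of candidate paths are my own assumptions, used only to show the behavior I'd expect.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_preview_source(snapshot_dir: Path, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists as a file inside snapshot_dir.

    `candidates` are paths relative to the snapshot folder, listed in order
    of preference (hypothetical ordering, chosen by the caller).
    """
    for relative_path in candidates:
        if (snapshot_dir / relative_path).is_file():
            return relative_path
    # Nothing usable on disk: the caller should show a "not archived"
    # placeholder rather than point the iframe at a missing file.
    return None
```

With candidates such as `["singlefile.html", "<website_domain>/<article_path>/index.html"]`, a snapshot without singlefile.html would resolve to the wget copy, and the iframe would never be pointed at a file that isn't there.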

In summary, if there's no singlefile.html, the snapshot index defaults to the live page for both the Wget panel preview and the actual display. This is erroneous since the correct copy is present.

# this seems to be a problem for the server
<iframe sandbox="allow-same-origin allow-scripts allow-forms allow-top-navigation-by-user-activation" class="full-page-iframe" src="singlefile.html" name="preview"></iframe>

You could reproduce this by archiving a page without using singlefile & then looking at the iframe source. To make it obvious that you're getting the live site instead of your archived copy, try it with paywalled articles or modify the source of your archive in a way that makes it obvious.
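If it helps with reproducing, the check can also be scripted instead of done by eye in devtools. The sketch below uses only the Python standard library; `IframeSrcCollector` and `missing_iframe_sources` are hypothetical helper names of mine, and it assumes the iframe sources are plain relative paths without query strings.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def missing_iframe_sources(snapshot_dir: Path, html: str) -> List[str]:
    """Return relative iframe sources that do not exist in snapshot_dir."""
    collector = IframeSrcCollector()
    collector.feed(html)
    return [
        src
        for src in collector.sources
        # Skip absolute URLs; only paths relative to the snapshot are checked.
        if "://" not in src and not (snapshot_dir / src).is_file()
    ]
```

Feeding it the HTML of the snapshot page together with the snapshot folder should list singlefile.html as missing on an affected snapshot, and return an empty list once every iframe points at a file that exists.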

Thank you!


@pirate commented on GitHub (Oct 24, 2024):

The snapshot detail view page has been entirely rewritten as of the v0.8.2 BETA (see it running here: https://demo.archivebox.io/archive/1725629932.912604/index.html), so I suspect this is fixed already.

If you're willing to try a BETA release, back up your archive first and give archivebox/archivebox:dev a go. Otherwise, wait for the v0.9.0 stable release to land to get the new UI.

@benmuth commented on GitHub (Oct 27, 2024):

Yeah, it seems like this issue has been fixed in the latest versions. I was able to reproduce it on v0.7.2 but not on 0.8.5rc53. When singlefile isn't used as an archive method, singlefile.html doesn't show up as the source for any iframes, and each one loads from the correct source.


@AFlowOfCode commented on GitHub (Oct 27, 2024):

Glad to hear it. I'll wait for the stable version but sounds like this can be closed, so I'll do that.
