mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #879] Question: why does the webserver display a wget HTML page differently than the raw file. #3563
Originally created by @cvsickle on GitHub (Oct 17, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/879
I'm trying to use wget to store offline pages that frequently get updated so that I can access the old page after it's been updated. It seems to work as expected, grabbing all of the images and storing them locally, but the webserver isn't displaying the images (and other page elements) properly. When I open the index.html file directly, the browser displays the page correctly though.
For example, I'm saving: https://lexyslotd.com/guide/mod-installation-part-1/
And here is a comparison of the two methods (webserver on the left, raw file on the right):
Comparison 1
Comparison 2
Note especially the images and the navigation tools on the right side of the site. You can see from the URL that this is the same file from the same snapshot, and this behavior is consistent between browsers, so I'm left to assume something is up with how the webserver is serving the pages. Are there any config options I should be looking into to fix this? Having to locate the specific file to access is much more of a pain than using the ArchiveBox UI to serve the page.
For reference, I have ArchiveBox running in a Docker container on my Synology NAS. Using the current "dev" image tag. Any help or explanation would be greatly appreciated. This seems like a very useful project if I can just get these small things worked out.
Edit:
@pirate commented on GitHub (Oct 18, 2021):
It looks like you're only comparing the wget output. Try checking the singlefile output, as it'll have much better fidelity. In general, small differences are to be expected; no method is perfect, which is why archivebox uses 10+ different methods in parallel.
Especially with wget's output, archivebox's server is not able to serve the original headers or correct paths for everything (which may be needed for those images in your comparisons). If you want help debugging further I need the full version output as specified in the new issue template, and your server logs or more specific info like the failing requests open in the dev console.
@cvsickle commented on GitHub (Oct 18, 2021):
I've updated my original post with the version information. My apologies for missing that.
For wget: Here are some screen grabs of the dev console when the page is open in the webserver and the dev console when the page is open from the raw file.
I'm guessing that this has something to do with the content panel that the images are served in. It expands/collapses using # tags in the URL. In the webserver screenshot, you can see that the # is not present, because clicking the content panel doesn't do anything. In the raw file screenshot, the # is present after clicking on the expand button.
I've also used the singlefile output, but it leaves the content panel collapsed and unexpandable, regardless of whether it is open in the webserver or as a raw file. Here's the singlefile output with the dev console open to one of the collapsed panels: here
@pirate commented on GitHub (Oct 19, 2021):
Looks like it's refusing to run JS due to MIME type mismatches, which makes sense. Probably not something that's going to change anytime soon, sorry. There is work in progress to add better header replaying in the far future, but for now if you need higher fidelity I recommend https://ArchiveWeb.page + https://ReplayWeb.page instead.
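For anyone who wants to confirm which files are affected, here is a rough sketch of how the MIME type mismatches could be listed. It is not part of ArchiveBox: the function names and the example URLs are made up for illustration, and it assumes you have already copied the request URLs and their `Content-Type` response headers out of the browser dev console's network tab. It compares each served type against what Python's standard `mimetypes` module guesses from the file extension.

```python
import mimetypes
from urllib.parse import urlparse


def expected_mime(url):
    """Guess the MIME type from the file extension in the URL path (None if unknown)."""
    path = urlparse(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed


def find_mime_mismatches(served):
    """Return (url, expected, actual) for every entry whose served type differs.

    `served` is a list of (url, content_type_header) pairs, e.g. copied by hand
    from the dev console. Entries with an unknown extension are skipped.
    """
    mismatches = []
    for url, content_type in served:
        expected = expected_mime(url)
        # Drop parameters such as "; charset=utf-8" before comparing.
        actual = content_type.split(";")[0].strip().lower()
        if expected is not None and actual != expected:
            mismatches.append((url, expected, actual))
    return mismatches


if __name__ == "__main__":
    # Hypothetical values, not taken from the snapshot discussed above.
    served = [
        ("http://localhost:8000/archive/123/example.com/theme.js", "text/html; charset=utf-8"),
        ("http://localhost:8000/archive/123/example.com/style.css", "text/css"),
    ]
    for url, expected, actual in find_mime_mismatches(served):
        print(f"{url}: expected {expected}, served as {actual}")
```

A script served as `text/html` would show up in this list, which is the kind of mismatch that can make a browser refuse to execute it. The guess from the extension is only a heuristic, so treat the result as a starting point for debugging rather than a definitive diagnosis.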