[GH-ISSUE #879] Question: why does the webserver display a wget HTML page differently than the raw file. #2051

Closed
opened 2026-03-01 17:56:05 +03:00 by kerem · 4 comments

Originally created by @cvsickle on GitHub (Oct 17, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/879

I'm trying to use wget to store offline pages that frequently get updated so that I can access the old page after it's been updated. It seems to work as expected, grabbing all of the images and storing them locally, but the webserver isn't displaying the images (and other page elements) properly. When I open the index.html file directly, the browser displays the page correctly though.

For example, I'm saving: https://lexyslotd.com/guide/mod-installation-part-1/

And here is a comparison of the two methods (webserver on the left, raw file on the right):

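[Comparison 1](https://www.dropbox.com/s/br284c6wk82zxri/ArchiveBoxComparison1.png?dl=0)

[Comparison 2](https://www.dropbox.com/s/d9aj3okv9v2k31u/ArchiveBoxComparison2.png?dl=0)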

Note especially the images and the navigation tools on the right side of the site. You can see from the URL that this is the same file from the same snapshot, and this behavior is consistent between browsers, so I'm left to assume something is up with how the webserver is serving the pages. Are there any config options I should be looking into to fix this? Having to locate the specific file to access is much more of a pain than using the ArchiveBox UI to serve the page.

For reference, I have ArchiveBox running in a Docker container on my Synology NAS. Using the current "dev" image tag. Any help or explanation would be greatly appreciated. This seems like a very useful project if I can just get these small things worked out.

Edit:

ArchiveBox v0.6.3
Cpython Linux Linux-4.4.59+-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                         
 √  PYTHON_BINARY         v3.9.6          valid     /usr/local/bin/python3.9                                                          
 √  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                                     
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                                     
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                                     
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                      
 √  YOUTUBEDL_BINARY      v2021.06.06     valid     /usr/local/bin/youtube-dl                                                         
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium                                                                 
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                       

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                                   
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                                    

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                                    
 -  COOKIES_FILE          -               disabled                                                                                    

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /data                                                                             
 √  SOURCES_DIR           25 files        valid     ./sources                                                                         
 √  LOGS_DIR              1 files         valid     ./logs                                                                            
 √  ARCHIVE_DIR           17 files        valid     ./archive                                                                         
 √  CONFIG_FILE           445.0 Bytes     valid     ./ArchiveBox.conf                                                                 
 √  SQL_INDEX             1.9 MB          valid     ./index.sqlite3  
kerem closed this issue 2026-03-01 17:56:05 +03:00

@pirate commented on GitHub (Oct 18, 2021):

> #### ArchiveBox version
>
> Run the `archivebox version` command locally then copy paste the result here:
>
> replace this line with the *full*, unshortened output of running `archivebox version`
>
> Tickets without full version info will be closed until it is provided, we need the full output here to help you solve your issue


@pirate commented on GitHub (Oct 18, 2021):

It looks like you're only comparing the wget output, try checking the singlefile output as it'll have much better fidelity. In general small differences are to be expected, no method is perfect, that's why archivebox uses 10+ different methods in parallel.

Especially with wget's output, archivebox's server is not able to serve the original headers or correct paths for everything (which may be needed for those images in your comparisons). If you want help debugging further I need the full version output as specified in the new issue template, and your server logs or more specific info like the failing requests open in the dev console.
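To make the header limitation concrete, here is a minimal Python sketch of what replaying a recorded header could look like. It is not ArchiveBox's implementation: `choose_content_type` and the `recorded_headers` mapping are hypothetical names, and the sketch assumes the archiver saved each response's original `Content-Type` alongside the file. Without such a record, a static server can only guess from the file extension.

```python
import mimetypes
from typing import Mapping, Optional

DEFAULT_TYPE = "application/octet-stream"


def choose_content_type(
    file_path: str,
    recorded_headers: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the Content-Type to send for an archived file.

    Hypothetical helper: prefers the header captured at archive time and
    falls back to guessing from the file extension.
    """
    if recorded_headers:
        # Header names are case-insensitive, so normalise before lookup.
        lowered = {name.lower(): value for name, value in recorded_headers.items()}
        recorded = lowered.get("content-type")
        if recorded:
            return recorded
    # No recorded header: guess from the extension, as a plain static server would.
    guessed, _encoding = mimetypes.guess_type(file_path)
    return guessed or DEFAULT_TYPE
```

With a recorded header, a file whose name says nothing about its content still gets the type the original site sent; without one, the result depends entirely on the extension of the saved file.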


@cvsickle commented on GitHub (Oct 18, 2021):

I've updated my original post with the version information. My apologies for missing that.

> It looks like you're only comparing the wget output, try checking the singlefile output as it'll have much better fidelity. In general small differences are to be expected, no method is perfect, that's why archivebox uses 10+ different methods in parallel.
>
> Especially with wget's output, archivebox's server is not able to serve the original headers or correct paths for everything (which may be needed for those images in your comparisons). If you want help debugging further I need the full version output as specified in the new issue template, and your server logs or more specific info like the failing requests open in the dev console.

For wget: Here are some screen grabs of the [dev console when the page is open in the webserver](https://www.dropbox.com/s/zcdlnh91a794ppt/ArchiveBoxWebserver_DevConsole.png?dl=0) and the [dev console when the page is open from the raw file](https://www.dropbox.com/s/tehj31gkwt9yzsm/ArchiveBoxRawFile_DevConsole.png?dl=0).

I'm guessing that this has something to do with the content panel that the images are served in. It expands/collapses using # tags in the URL. In the webserver screenshot, you can see that the # is not present, because clicking the content panel doesn't do anything. In the raw file screenshot, the # is present after clicking on the expand button.

I've also used the singlefile output, but it leaves the content panel collapsed and unexpandable, regardless of whether it is open in the webserver or as a raw file. Here's the singlefile output with the dev console open to one of the collapsed panels: [here](https://www.dropbox.com/s/rgkebmozz0nb89v/ArchiveBoxSinglefile1_DevConsole.png?dl=0)


@pirate commented on GitHub (Oct 19, 2021):

Looks like it's refusing to run JS due to MIME type mismatches, which makes sense. That's probably not something that's going to change anytime soon, sorry. There is work in progress to add better header replaying in the far future, but for now if you need higher fidelity I recommend https://ArchiveWeb.page + https://ReplayWeb.page instead.
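For context on why a MIME type mismatch can stop scripts from running: browsers apply stricter checks to scripts and stylesheets when a response carries `X-Content-Type-Options: nosniff`. The sketch below is a simplified Python model of that rule, not browser code and not anything from ArchiveBox. The function names and the shortened list of JavaScript MIME types are assumptions for illustration, and whether this exact rule is what triggered in the screenshots is also an assumption.

```python
from typing import Optional

# Shortened, illustrative list; browsers recognise a few more legacy types.
JAVASCRIPT_TYPES = {
    "text/javascript",
    "application/javascript",
    "application/x-javascript",
    "application/ecmascript",
    "text/ecmascript",
}


def essence(content_type: str) -> str:
    """Return the lowercase type/subtype, dropping parameters like charset."""
    return content_type.split(";", 1)[0].strip().lower()


def blocked_by_nosniff(
    destination: str,
    content_type: Optional[str],
    nosniff: bool,
) -> bool:
    """Model whether a response is refused under nosniff.

    destination is what the page wanted the resource for, simplified here
    to "script", "style", or anything else (for example "image").
    """
    if not nosniff:
        return False
    mime = essence(content_type) if content_type else ""
    if destination == "script":
        # Scripts must be served with a JavaScript MIME type.
        return mime not in JAVASCRIPT_TYPES
    if destination == "style":
        # Stylesheets must be served as text/css.
        return mime != "text/css"
    # Other destinations, such as images, are not affected by this check.
    return False
```

Under this model, `blocked_by_nosniff("script", "text/html; charset=utf-8", True)` is `True`, while the same file opened from disk never goes through the check at all, since a `file://` load has no response headers.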
