[GH-ISSUE #335] Archived sites fail to load resources with subresource integrity checks because of wget URL rewriting #3261

Open
opened 2026-03-14 21:49:04 +03:00 by kerem · 1 comment
Owner

Originally created by @v3rmine on GitHub (Mar 28, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/335

Describe the bug

While archiving the GitHub page https://github.com/phsym/prettytable-rs (or any gist or GitHub page), the JS and CSS don't load. Only the HTML loads, so the webpage appears broken.

Steps to reproduce

I'm running ArchiveBox version c79ce2b1f with this config:

CHROME_HEADLESS=True
RESOLUTION=1440,900
ONLY_NEW=True

FETCH_PDF=True
FETCH_SCREENSHOT=True
FETCH_DOM=True
FETCH_WARC=True
FETCH_MEDIA=True
FETCH_FAVICON=True
FETCH_TITLE=True
FETCH_GIT=False

FETCH_WGET_REQUISITES=True
WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
CHROME_USER_AGENT=$WGET_USER_AGENT

USE_COLOR=True
SHOW_PROGRESS=True

I cannot load the local archive properly. Only the HTML loads correctly, and the browser console shows this error:

Failed to find a valid digest in the 'integrity' attribute for resource '{ressource_url}' with computed SHA-256 integrity 'q/IwChPti+2XOSaqntWZu/Nb7KH5XEFM2opjgiky20o='. The resource has been blocked.

Screenshots or log output

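  • Webpage
    Screen Shot 2020-03-28 at 20 24 56: https://user-images.githubusercontent.com/11839373/77831914-5a915980-7132-11ea-9059-79bc00c227f5.png
  • Console output (I've hidden the base URL)
    Screen Shot 2020-03-28 at 20 15 42: https://user-images.githubusercontent.com/11839373/77832220-b1982e00-7134-11ea-8d24-424c52d99515.png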

Software versions

  • OS: macOS 10.13
  • ArchiveBox version: 83197ef
  • Python version: 3.7.6
  • Chrome version: 83.0.4097.0

@pirate commented on GitHub (Mar 31, 2020):

Yeah, subresource-integrity checks break archiving because wget rewrites URLs in source files to be relative, so the archived copies no longer match the hashes pinned in the page's integrity attributes. This is a known issue that's difficult to fix, so I recommend relying on the PDF and screenshot output more than the wget output.
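To illustrate the mismatch, here is a minimal Python sketch of how an SRI-style value (algorithm name plus base64 digest) is derived from a file's bytes. The helper name `sri_digest` and the sample stylesheet contents are hypothetical, chosen only for illustration; the rewritten bytes stand in for whatever a link-conversion step (for example wget's `--convert-links`) might produce.

```python
import base64
import hashlib


def sri_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return an SRI-style value such as 'sha256-<base64 digest>'."""
    raw = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(raw).decode('ascii')}"


# Hypothetical stylesheet as served by the origin, with an absolute URL.
original = b'body { background: url("https://example.com/img/bg.png"); }'
# The same file after a link-conversion step made the URL relative.
rewritten = b'body { background: url("../img/bg.png"); }'

expected = sri_digest(original)  # what the page's integrity attribute pins
actual = sri_digest(rewritten)   # what the browser computes for the local copy

print(expected == actual)  # False: any byte change yields a different digest
```

Because the digest covers every byte of the resource, even a one-character URL rewrite is enough for the browser to block the file.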

Alternatively, you can try running the following command in your archive folder to remove the subresource integrity checks from the archived HTML files:

find ./ -name "*.html" -type f -exec sed -E -i '' 's/integrity="sha[^"]*"|crossorigin="anonymous"//g' {} \;
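For anyone who would rather not depend on a particular sed flavor, the same cleanup can be sketched in Python. This is an illustrative equivalent, not part of ArchiveBox: the names `strip_sri` and `strip_sri_in_tree` are hypothetical, and the pattern assumes double-quoted attribute values as in the command above. It uses `[^"]*` so each match stops at the attribute's own closing quote, and it works on bytes so file encodings and line endings are left untouched. Try it on a copy of the archive folder first.

```python
import re
from pathlib import Path

# Matches integrity="sha..." or crossorigin="anonymous" together with the
# whitespace in front of it. [^"]* stops at the attribute's closing quote.
SRI_ATTRS = re.compile(rb'\s+(?:integrity="sha[^"]*"|crossorigin="anonymous")')


def strip_sri(html: bytes) -> bytes:
    """Remove SRI-related attributes from a chunk of HTML."""
    return SRI_ATTRS.sub(b"", html)


def strip_sri_in_tree(root: str) -> int:
    """Rewrite every *.html file under root in place; return files changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        data = path.read_bytes()
        cleaned = strip_sri(data)
        if cleaned != data:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

A call such as `strip_sri_in_tree(".")` from inside the archive folder would mirror the scope of the `find` command.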

See:

  • https://www.smashingmagazine.com/2019/04/understanding-subresource-integrity/
  • https://designnotes.blog.gov.uk/2018/05/23/how-i-used-wget-to-make-a-copy-of-the-service-manual-for-user-research/