[GH-ISSUE #130] Extend WARC file with all requests made via all archive methods #1595

Open
opened 2026-03-01 17:52:02 +03:00 by kerem · 8 comments

Originally created by @pirate on GitHub (Jan 13, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/130

Right now the `FETCH_WARC` option only creates a simple HTML-file WARC with wget; it doesn't save the requests made dynamically by headless Chrome after JS executes.

We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.

In the ideal scenario, the WARC should include:

  • √ base html for the page
  • √ all assets like images, styles, fonts, js
  • all dynamically requested assets after JS executes in chrome (e.g. images, ajax requests, etc)
  • any media files requested

I think we can record the `wget` WARC first, then use `warcat` to merge it with a warcprox-created WARC containing all the headless Chrome requests.
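The merge step can be sketched without `warcat`, relying only on the fact that a WARC file is a plain sequence of records and that gzip members can be concatenated byte for byte. This is a minimal illustration, not ArchiveBox's implementation: `merge_warcs` and the file names are hypothetical, and it assumes every input uses the same compression (all `.warc.gz` or all plain `.warc`). It does not deduplicate records.

```python
import shutil
from pathlib import Path


def merge_warcs(sources, dest):
    """Concatenate several WARC files into a single file at dest.

    Hypothetical helper: assumes all inputs share the same compression.
    """
    dest = Path(dest)
    with dest.open("wb") as out:
        for src in sources:
            with Path(src).open("rb") as part:
                # Records (and gzip members) are appended unchanged,
                # so nothing is decompressed or re-encoded here.
                shutil.copyfileobj(part, out)
    return dest
```

For example, `merge_warcs(["wget.warc.gz", "chrome.warc.gz"], "unified.warc.gz")` would append the Chrome capture after the wget capture.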


@pirate commented on GitHub (Jan 28, 2019):

I've been investigating using `pywb`'s `wayback --proxy-record --proxy archivebox` and `google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security` to pipe all Chrome and wget requests into a WARC file.

So far it looks promising; I'm just resolving this before pushing it: https://github.com/webrecorder/pywb/issues/434
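A rough sketch of how both tools could be pointed at the same recording proxy. The proxy address and the three Chrome flags come from the comment above; the helper names, the added `--headless` and `--dump-dom` flags, and the use of the `http_proxy`/`https_proxy` environment variables for wget are assumptions made for illustration.

```python
import os


def chrome_proxy_cmd(url, proxy="http://localhost:8080", binary="google-chrome"):
    """Build a headless Chrome command that routes traffic through proxy."""
    return [
        binary,
        "--headless",      # assumption: not in the original command
        "--dump-dom",      # assumption: makes headless Chrome print the DOM and exit
        f"--proxy-server={proxy}",
        "--ignore-certificate-errors",
        "--disable-web-security",
        url,
    ]


def wget_proxy_env(proxy="http://localhost:8080", base_env=None):
    """Return an environment that points wget at the same proxy."""
    env = dict(os.environ if base_env is None else base_env)
    # wget reads these variables when choosing a proxy.
    env["http_proxy"] = proxy
    env["https_proxy"] = proxy
    return env
```

Either result can be passed to `subprocess.run`, for example `subprocess.run(chrome_proxy_cmd("https://example.com"))` or `subprocess.run(["wget", url], env=wget_proxy_env())`, with the pywb recorder already listening on the proxy port.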


@brandongalbraith commented on GitHub (Feb 6, 2019):

Have you considered swapping out wget for [ArchiveTeam/wpull](https://github.com/ArchiveTeam/wpull)? It's a Python re-implementation of wget built specifically for web crawling and archiving, and might provide the flexibility you seek.

@pirate commented on GitHub (Feb 6, 2019):

I have considered it; I just talked to the wpull authors this week in San Francisco. ~~For now I think we'll stick with `wget` because it's nice to keep dependencies to a minimum, many people already have wget installed.~~ We're switching to wpull in order to use pure Python dependencies when packaging via `pip`.

@pirate commented on GitHub (Feb 20, 2019):

A quick update: this is still blocked by https://github.com/python-hyper/brotlipy/issues/146


@pirate commented on GitHub (Feb 28, 2019):

This can now proceed as pywb now disables brotli when it's unavailable:
https://github.com/webrecorder/pywb/issues/434#issuecomment-468409590
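The general pattern of dropping brotli when the library is missing might look like the sketch below. This is not pywb's actual code: `accept_encoding` is a hypothetical helper, and the module name `brotli` is assumed to be what the installed brotli binding exposes.

```python
import importlib.util


def accept_encoding(module_name="brotli"):
    """Build an Accept-Encoding value, offering "br" only if brotli is importable."""
    encodings = ["gzip", "deflate"]
    # find_spec returns None when the top-level module is not installed.
    if importlib.util.find_spec(module_name) is not None:
        encodings.append("br")
    return ", ".join(encodings)
```

With this approach a missing brotli install degrades to gzip and deflate instead of failing at import time.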


@goelayu commented on GitHub (May 18, 2022):

Any updates on this?
Specifically "all requests made during the archiving process (using Chrome for example) are saved to a unified WARC file" seems like a really helpful feature.
@pirate


@pirate commented on GitHub (May 18, 2022):

There is a way to do this already right now:

  1. Uncomment the example pywb proxy server in the docker-compose file
  2. Point Chrome (and any other dependencies you want recorded) at that proxy via a CLI flag, using `archivebox config CHROME_ARGS`

@goelayu commented on GitHub (May 19, 2022):

Correct me if I am wrong, but I don't think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate
https://github.com/ArchiveBox/ArchiveBox/blob/49faec8f6dfc15075203ad332abfea0940f4e7b7/archivebox/util.py#L219-L263