mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #130] Extend WARC file with all requests made via all archive methods #3107
Originally created by @pirate on GitHub (Jan 13, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/130
Right now the `FETCH_WARC` option only creates a simple HTML-file WARC with wget; it doesn't save the requests made dynamically after JS executes in headless Chrome. We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.
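The warcprox setup being proposed might look roughly like the sketch below: run warcprox as a recording MITM proxy and point both wget and headless Chrome at it, so every request from every archive method lands in one set of WARCs. The port, output directory, and wget proxy flags are illustrative assumptions, not part of the issue.

```shell
# Sketch only -- port and --dir are illustrative, not from the issue.
pip install warcprox
warcprox --port 8000 --dir ./archive/warcs &   # writes recorded requests as WARC files

# Any tool proxied through it gets recorded. warcprox MITMs TLS with its own
# CA cert, so certificate checks must be relaxed on the client side:
http_proxy=http://127.0.0.1:8000 https_proxy=http://127.0.0.1:8000 \
  wget --no-check-certificate https://example.com
google-chrome --headless --proxy-server=http://127.0.0.1:8000 \
  --ignore-certificate-errors https://example.com
```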
In the ideal scenario, the WARC should include:
I think we can record the `wget` WARC first, then use `warcat` to merge it with a warcprox-created WARC containing all the headless Chrome requests.

@pirate commented on GitHub (Jan 28, 2019):
I've been investigating using pywb's `wayback --proxy-record --proxy archivebox` and `google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security` to pipe all Chrome and wget requests into a WARC file.

So far it looks promising; I'm just resolving this before pushing it: https://github.com/webrecorder/pywb/issues/434
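The pywb experiment above can be sketched end to end. The `wayback` and `google-chrome` invocations and the `archivebox` collection name are quoted from the comment; the install and collection-setup steps are my assumptions about the surrounding workflow.

```shell
# Sketch, assuming pywb is installed into the active environment.
pip install pywb
wb-manager init archivebox        # create a pywb collection named "archivebox"

# Run pywb in proxy-recording mode (default proxy port is 8080);
# everything fetched through the proxy is written into the collection's WARCs.
wayback --proxy archivebox --proxy-record &

# Point headless Chrome at the recording proxy (flags quoted from the comment):
google-chrome \
  --proxy-server=http://localhost:8080 \
  --ignore-certificate-errors \
  --disable-web-security \
  https://example.com
```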
@brandongalbraith commented on GitHub (Feb 6, 2019):
Have you considered swapping out wget for ArchiveTeam/wpull? It's a python re-implementation of wget specifically for web crawling and archiving, and might provide the flexibility you seek.
@pirate commented on GitHub (Feb 6, 2019):
I have considered it; I just talked to the wpull authors this week in San Francisco.

~~For now I think we'll stick with wget because it's nice to keep dependencies to a minimum; many people already have wget installed.~~ We're switching to wpull in order to use pure-python dependencies when packaging via pip.

@pirate commented on GitHub (Feb 20, 2019):
A quick update, this is still blocked by https://github.com/python-hyper/brotlipy/issues/146
@pirate commented on GitHub (Feb 28, 2019):
This can now proceed as pywb now disables brotli when it's unavailable:
https://github.com/webrecorder/pywb/issues/434#issuecomment-468409590
@goelayu commented on GitHub (May 18, 2022):
Any updates on this?
Specifically "all requests made during the archiving process (using Chrome for example) are saved to a unified WARC file" seems like a really helpful feature.
@pirate
@pirate commented on GitHub (May 18, 2022):
There is a way to do this already right now: `archivebox config CHROME_ARGS`

@goelayu commented on GitHub (May 19, 2022):
Correct me if I am wrong, but I don't think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate
github.com/ArchiveBox/ArchiveBox@49faec8f6d/archivebox/util.py (L219-L263)
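For context, the `CHROME_ARGS` route pirate points at would look roughly like the following in a release where that key is settable. The key name is quoted from pirate's reply; the `--set`/`--get` syntax is an assumption, and whether the linked `util.py` actually reads the key is exactly what this comment disputes.

```shell
# Sketch only (assumption: an ArchiveBox version where CHROME_ARGS is a
# recognized config key; at the commit linked above, util.py does not read it).
archivebox config --get CHROME_ARGS    # print the current value, if supported
archivebox config --set CHROME_ARGS='--proxy-server=http://localhost:8080'
```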