mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #130] Extend WARC file with all requests made via all archive methods #3107
Originally created by @pirate on GitHub (Jan 13, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/130
Right now the `FETCH_WARC` option only creates a simple HTML-file WARC with wget; it doesn't save the requests made dynamically after JS executes in headless Chrome. We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.
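The warcprox setup being proposed might look roughly like the sketch below: run warcprox as a recording MITM proxy and point both wget and headless Chrome at it, so every request from every archive method lands in one set of WARCs. The port, output directory, and wget proxy flags are illustrative assumptions, not part of the issue.

```shell
# Sketch only -- port and --dir are illustrative, not from the issue.
pip install warcprox
warcprox --port 8000 --dir ./archive/warcs &   # writes recorded requests as WARC files

# Any tool proxied through it gets recorded. warcprox MITMs TLS with its own
# CA cert, so certificate checks must be relaxed on the client side:
http_proxy=http://127.0.0.1:8000 https_proxy=http://127.0.0.1:8000 \
  wget --no-check-certificate https://example.com
google-chrome --headless --proxy-server=http://127.0.0.1:8000 \
  --ignore-certificate-errors https://example.com
```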
In the ideal scenario, the WARC should include:
I think we can record the `wget` WARC first, then use `warcat` to merge it with a warcprox-created WARC containing all the headless Chrome requests.

@pirate commented on GitHub (Jan 28, 2019):
I've been investigating using pywb's `wayback --proxy-record --proxy archivebox` and `google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security` to pipe all Chrome and wget requests into a WARC file.

So far it looks promising; I'm just resolving this before pushing it: https://github.com/webrecorder/pywb/issues/434
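The pywb experiment above can be sketched end to end. The `wayback` and `google-chrome` invocations and the `archivebox` collection name are quoted from the comment; the install and collection-setup steps are my assumptions about the surrounding workflow.

```shell
# Sketch, assuming pywb is installed into the active environment.
pip install pywb
wb-manager init archivebox        # create a pywb collection named "archivebox"

# Run pywb in proxy-recording mode (default proxy port is 8080);
# everything fetched through the proxy is written into the collection's WARCs.
wayback --proxy archivebox --proxy-record &

# Point headless Chrome at the recording proxy (flags quoted from the comment):
google-chrome \
  --proxy-server=http://localhost:8080 \
  --ignore-certificate-errors \
  --disable-web-security \
  https://example.com
```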
@brandongalbraith commented on GitHub (Feb 6, 2019):
Have you considered swapping out wget for ArchiveTeam/wpull? It's a python re-implementation of wget specifically for web crawling and archiving, and might provide the flexibility you seek.
@pirate commented on GitHub (Feb 6, 2019):
I have considered it; I just talked to the wpull authors this week in San Francisco.

~~For now I think we'll stick with wget because it's nice to keep dependencies to a minimum; many people already have wget installed.~~ We're switching to wpull in order to use pure-python dependencies when packaging via pip.

@pirate commented on GitHub (Feb 20, 2019):
A quick update, this is still blocked by https://github.com/python-hyper/brotlipy/issues/146
@pirate commented on GitHub (Feb 28, 2019):
This can now proceed as pywb now disables brotli when it's unavailable:
https://github.com/webrecorder/pywb/issues/434#issuecomment-468409590
@goelayu commented on GitHub (May 18, 2022):
Any updates on this?
Specifically "all requests made during the archiving process (using Chrome for example) are saved to a unified WARC file" seems like a really helpful feature.
@pirate
@pirate commented on GitHub (May 18, 2022):
There is a way to do this already right now: `archivebox config CHROME_ARGS`

@goelayu commented on GitHub (May 19, 2022):
Correct me if I am wrong, but I don't think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate
github.com/ArchiveBox/ArchiveBox@49faec8f6d/archivebox/util.py (L219-L263)
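For context, the `CHROME_ARGS` route pirate points at would look roughly like the following in a release where that key is settable. The key name is quoted from pirate's reply; the `--set`/`--get` syntax is an assumption, and whether the linked `util.py` actually reads the key is exactly what this comment disputes.

```shell
# Sketch only (assumption: an ArchiveBox version where CHROME_ARGS is a
# recognized config key; at the commit linked above, util.py does not read it).
archivebox config --get CHROME_ARGS    # print the current value, if supported
archivebox config --set CHROME_ARGS='--proxy-server=http://localhost:8080'
```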