[GH-ISSUE #6] Archive Method: Add WARC file output #1516

Closed
opened 2026-03-01 17:51:21 +03:00 by kerem · 11 comments
Owner

Originally created by @bmix on GitHub (May 5, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/6

An overview of existing projects to consume them is here:
http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#WARC_viewer


@pirate commented on GitHub (May 16, 2017):

See: https://github.com/pirate/pocket-archive-stream/issues/11


@eqyiel commented on GitHub (Jan 1, 2018):

Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.

https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive

WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC


@pirate commented on GitHub (Jan 1, 2018):

Oh I didn't know they were different, thanks @eqyiel.


@pirate commented on GitHub (Jan 9, 2018):

Just requires adding a new config `FETCH_WARC` option and `archive_method.fetch_warc`:

https://www.archiveteam.org/index.php/Wget_with_WARC_output


@anarcat commented on GitHub (Sep 13, 2018):

For the record, pywb has trouble reading wget's WARC files because wget's output is non-standard (https://github.com/webrecorder/pywb/issues/294). You might want to consider another crawler for the task, or wait for wget to fix its output first.


@f0086 commented on GitHub (Nov 23, 2018):

The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/

WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is, according to that article, a little broken in terms of WARC output, but there is a list of alternatives that work fine.

It would be very nice if bookmark-archiver got support for WARC archives.


@pirate commented on GitHub (Nov 23, 2018):

I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.

I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler (see https://github.com/pirate/bookmark-archiver/issues/113) and a headless browser, to cover all the edge cases around JS execution.


@FiloSottile commented on GitHub (Dec 1, 2018):

The easiest way to do this might be to just point the headless browser you already run to a WARC recording proxy. However brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer term goal.

EDIT: just noticed there's #113 for brozzler already.
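
A rough sketch of the recording-proxy idea, for illustration only. It assumes a WARC recording proxy (warcprox, for example) is already listening on `localhost:8000`, and that the browser binary is called `chromium-browser`; both values are assumptions and are not taken from this thread. The helper names are hypothetical. The proxy, not this script, is what would write the WARC.

```python
import subprocess


def build_headless_cmd(url, proxy="localhost:8000", chrome_binary="chromium-browser"):
    """Build a headless Chrome command line that routes traffic through a proxy.

    The proxy address and binary name are assumptions for this sketch.
    """
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        # Send every request through the recording proxy.
        f"--proxy-server={proxy}",
        # A recording proxy typically re-signs HTTPS traffic with its own CA,
        # so certificate checks fail unless that CA is trusted. Ignoring
        # errors is acceptable for a local experiment only.
        "--ignore-certificate-errors",
        # Print the rendered DOM to stdout once the page has loaded.
        "--dump-dom",
        url,
    ]


def render_through_proxy(url, timeout=60, **kwargs):
    """Run headless Chrome via the proxy and return the rendered DOM as text."""
    cmd = build_headless_cmd(url, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout
```

Because the page is loaded by a real browser, requests issued by JavaScript also pass through the proxy, which is the gap a plain wget fetch leaves open.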


@pirate commented on GitHub (Jan 11, 2019):

Wget WARC file output is now supported in e8808b0, via the `FETCH_WARC=True` flag (on by default).

I'll investigate creating another, more complete WARC that includes dynamic requests, using the Chrome proxy, but the simpler wget one is worth testing in the interim.
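
As a minimal sketch of what a flag-gated wget WARC fetch could look like (this is not the code from e8808b0): the function names, the flag-parsing rule, and the `archive` base name below are assumptions made for illustration. Only the wget options themselves (`--warc-file`, `--page-requisites`, `--no-verbose`) are real wget flags.

```python
import os
import shutil
import subprocess


def fetch_warc_enabled(env=None, default=True):
    """Read the FETCH_WARC flag from the environment.

    The accepted spellings are an assumption; the default mirrors the
    "on by default" behaviour described in the comment above.
    """
    env = os.environ if env is None else env
    raw = env.get("FETCH_WARC")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def build_wget_warc_cmd(url, warc_basename):
    """Build the wget command line; wget appends .warc.gz to the base name."""
    return [
        "wget",
        "--no-verbose",
        "--page-requisites",  # also fetch images, CSS and scripts the page references
        f"--warc-file={warc_basename}",
        url,
    ]


def fetch_warc(url, out_dir, env=None, timeout=60):
    """Fetch url into out_dir as a WARC, or return None when the flag is off."""
    if not fetch_warc_enabled(env):
        return None
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    os.makedirs(out_dir, exist_ok=True)
    cmd = build_wget_warc_cmd(url, "archive")
    # wget exits non-zero on some HTTP errors even when a WARC was written,
    # so the return code is deliberately not treated as fatal here.
    subprocess.run(cmd, cwd=out_dir, check=False, timeout=timeout)
    return os.path.join(out_dir, "archive.warc.gz")
```

Calling `fetch_warc("https://example.com", "output/example")` with the flag enabled would leave `archive.warc.gz` next to the files wget downloads.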


@FiloSottile commented on GitHub (Jan 13, 2019):

WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?


@pirate commented on GitHub (Jan 13, 2019):

Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get `warcprox` working with headless.

Here's the new issue to track the all-in-one WARC file https://github.com/pirate/ArchiveBox/issues/130
