mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #6] Archive Method: Add WARC file output #3025
Originally created by @bmix on GitHub (May 5, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/6
An overview of existing projects to consume them is here:
http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#WARC_viewer
@pirate commented on GitHub (May 16, 2017):
See: https://github.com/pirate/pocket-archive-stream/issues/11
@eqyiel commented on GitHub (Jan 1, 2018):
Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.
https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive
WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC
@pirate commented on GitHub (Jan 1, 2018):
Oh I didn't know they were different, thanks @eqyiel.
@pirate commented on GitHub (Jan 9, 2018):
Just requires adding a new config FETCH_WARC option and archive_method.fetch_warc: https://www.archiveteam.org/index.php/Wget_with_WARC_output
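A minimal sketch of what such a config-gated fetch step could look like, assuming wget is on the PATH. The function name `fetch_warc_command`, the environment-variable parsing, and the "off unless set" fallback are illustrative assumptions, not the project's actual code; only `--warc-file`, `--page-requisites`, and `--no-verbose` are real wget options.

```python
import os
from typing import List, Optional


def fetch_warc_command(
    url: str,
    warc_basename: str,
    enabled: Optional[bool] = None,
) -> Optional[List[str]]:
    """Build a wget argv that records the fetch into a WARC file.

    Returns None when WARC output is disabled. The FETCH_WARC lookup and
    its default below are assumptions made for this sketch.
    """
    if enabled is None:
        # Hypothetical config lookup: treat FETCH_WARC=True (any case) as on.
        enabled = os.environ.get("FETCH_WARC", "False").lower() == "true"
    if not enabled:
        return None
    return [
        "wget",
        "--no-verbose",
        "--page-requisites",              # also fetch images, CSS, scripts
        f"--warc-file={warc_basename}",   # wget appends the .warc.gz suffix
        url,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(fetch_warc_command("https://example.com", "example", enabled=True))
```

The returned list can be handed to `subprocess.run` by the caller; keeping command construction separate from execution makes the step easy to test without network access.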
@anarcat commented on GitHub (Sep 13, 2018):
For the record, pywb has trouble reading wget WARC files because wget's output is non-standard: https://github.com/webrecorder/pywb/issues/294. You might want to consider another crawler for the task, or wait until wget fixes its output.
@f0086 commented on GitHub (Nov 23, 2018):
The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/
WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is -- according to that article -- a little broken in terms of WARC output, but there is a list of alternatives that work fine.
It would be very nice if bookmark-archiver got support for WARC archives.
@pirate commented on GitHub (Nov 23, 2018):
I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.
I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.
@FiloSottile commented on GitHub (Dec 1, 2018):
The easiest way to do this might be to just point the headless browser you already run at a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer-term goal.
EDIT: just noticed there's #113 for brozzler already.
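As a rough sketch of the proxy approach described above: the headless browser is launched with its traffic routed through a recording proxy that is already running and writing WARC records. The function name, the `chromium-browser` binary name, and the `localhost:8000` proxy address are assumptions for illustration; `--headless`, `--proxy-server`, `--ignore-certificate-errors`, and `--dump-dom` are existing Chrome/Chromium switches.

```python
from typing import List


def headless_via_proxy_command(
    url: str,
    chrome_binary: str = "chromium-browser",  # assumed binary name
    proxy: str = "localhost:8000",            # assumed recording-proxy address
) -> List[str]:
    """Build a headless Chrome argv whose requests go through a proxy.

    The proxy itself (for example a WARC recording proxy) must be started
    separately; this function only constructs the browser command.
    """
    return [
        chrome_binary,
        "--headless",
        f"--proxy-server={proxy}",
        # A recording proxy typically intercepts TLS with its own certificate,
        # so the browser is told not to reject it. Only do this for archiving.
        "--ignore-certificate-errors",
        "--dump-dom",  # print the rendered DOM to stdout, then exit
        url,
    ]


if __name__ == "__main__":
    # Print the command rather than launching a browser.
    print(headless_via_proxy_command("https://example.com"))
```

Because the proxy sees every request the page makes while rendering, the resulting WARC can include dynamically loaded resources that a plain wget fetch would miss.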
@pirate commented on GitHub (Jan 11, 2019):
Wget WARC file output is now supported in e8808b0, via the FETCH_WARC=True flag (on by default).
I'll investigate creating another, more complete WARC that includes dynamic requests via the Chrome proxy, but for now the simpler wget one is worth testing.
@FiloSottile commented on GitHub (Jan 13, 2019):
WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?
@pirate commented on GitHub (Jan 13, 2019):
Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get warcprox working with headless.
Here's the new issue to track the all-in-one WARC file: https://github.com/pirate/ArchiveBox/issues/130