[GH-ISSUE #305] Question: Comparison between this and other archival products #1734

Closed
opened 2026-03-01 17:53:14 +03:00 by kerem · 6 comments

Originally created by @DonaldTsang on GitHub (Dec 25, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/305

See:

  • https://github.com/webrecorder/webrecorder
  • https://github.com/webrecorder/pywb
  • https://github.com/webrecorder/warcio

  1. Are they compatible with one another?
  2. Do they produce the same format, or are they different in nature?
kerem closed this issue 2026-03-01 17:53:14 +03:00

@crisdosaygo commented on GitHub (Dec 31, 2019):

22120 is able to archive everything you browse by hooking into the browser directly. Then offline, you can use your browser as normal, and your browser still works like you're online -- for the pages you've already browsed.

The archive format in 22120 is simple JSON files, organized into directories that each represent an origin. There is one JSON file for every resource the origin serves you.

You can zip your archive folder and copy it around, and you can create multiple archive folders. You can also specify domain patterns to exclude from your archive.

More information is in the README: https://github.com/dosyago/22120/blob/master/README.md

See https://github.com/dosyago/22120 for instructions on installing from source or from npm. The latest binaries are here:
https://github.com/dosyago/22120/releases/latest


@DonaldTsang commented on GitHub (Jan 1, 2020):

@crislin2046 hold on, I am asking for a comparison, not just a description of what each tool does, since ArchiveBox and the others are so similar.


@crisdosaygo commented on GitHub (Jan 1, 2020):

I don't know that much about ArchiveBox so my contribution to the comparison is to share what I know. Other people can then compare with ArchiveBox based on that, using what they know! 😄

@pirate commented on GitHub (Jan 5, 2020):

https://github.com/pirate/ArchiveBox#comparison-to-other-projects
https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects

@DonaldTsang commented on GitHub (Jan 6, 2020):

@pirate thanks for the explanation, but could you describe how WARC differs from the native ArchiveBox format?


@pirate commented on GitHub (Jan 7, 2020):

There isn't really an "ArchiveBox format" (yet). It does produce some JSON index files, but the main output is whatever the tools it calls to archive each site produce, e.g.:

  • png: screenshot of the site
  • pdf: PDF version of the site
  • html: DOM dump of the site rendered in Chrome after 2s of JS execution
  • WARC: wget WARC archive (same format as pywb output and other tools)
  • wget archive: static HTML + assets as archived by wget; it's normal HTML, not any proprietary format
  • media: e.g. mp3s, movies, git code repositories, etc. are all raw files, not in any proprietary format

You can find more info here: https://github.com/pirate/ArchiveBox/wiki/Usage#Disk-Layout
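
To illustrate why a WARC file written by one tool can be read by another, here is a minimal sketch of a reader that uses only the Python standard library. It is not ArchiveBox, wget, or warcio code. The names `iter_warc_records` and `build_record` and the sample records are made up for illustration, and the sample leaves out fields a real record must carry (WARC-Record-ID, WARC-Date).

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes, whatever it contains.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


def build_record(uri: str, payload: bytes) -> bytes:
    """Hypothetical helper: build a simplified record for the demo below."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# .warc.gz files are commonly written as one gzip member per record;
# gzip.open reads concatenated members as a single continuous stream.
sample = gzip.compress(build_record("http://example.com/", b"<html>hello</html>"))
sample += gzip.compress(build_record("http://example.com/a.css", b"body {}\n\n"))

with gzip.open(io.BytesIO(sample), "rb") as stream:
    records = list(iter_warc_records(stream))

for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For real archives, a dedicated library such as warcio (linked in the original question) is the safer choice, since it handles the parts of the format this sketch ignores.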
