[GH-ISSUE #160] Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #109

Open
opened 2026-03-01 14:40:42 +03:00 by kerem · 5 comments
Owner

Originally created by @pirate on GitHub (Mar 5, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/160

ArchiveBox should be able to load WARCs from outside sources, replay them with pywb, and re-archive them using all the redundant archive methods like Chrome Headless, Wget, etc.

This would be most useful when a user tries to archive a URL that is already down, or that is not accessible to the ArchiveBox server.

ArchiveBox should be able to ingest a user-provided .warc / .warc.gz, auto-fetch any available WARC from Archive.org / Archive.it / Archive.is / etc., or, as a last resort, auto-fetch from search engine caches (Google / Bing / Yahoo / Yandex / etc.).

Related issues:

  • #177 Setting up pyppeteer and pywb (needs to be finished before this can start)
  • #130 Record all ArchiveBox requests into the generated WARC files with pywb's proxy archiver
  • #146 Add ability to export ArchiveBox WARC files to 3rd party archiving services
  • #179 Add support for multiple snapshots of archived sites
  • #63 Adding support for HTTP proxy archiving

WARCs should be directly importable using archivebox add ~/Downloads/path/to/some/warc.gz, and ArchiveBox should be configurable to fall back to searching 3rd-party services automatically in the case of a 404/403/etc.
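A minimal sketch of the first thing such an import path would need to do: enumerate the URLs of the response records inside a user-provided WARC so each one can be re-archived with the other methods. Everything here is illustrative (the function name is hypothetical, and it only handles simple WARC/1.0 records); a real importer would use a proper WARC library such as warcio rather than this header-only scan.

```python
import gzip
import re

def list_response_urls(path):
    """List the WARC-Target-URI of every 'response' record in a WARC file.

    Minimal header-only scan for illustration; assumes WARC/1.0 records.
    A real importer should use a WARC library (e.g. warcio) instead.
    """
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        data = f.read()

    urls = []
    # Each WARC record begins with a version line followed by header lines.
    for block in data.split(b'WARC/1.0\r\n')[1:]:
        headers = block.split(b'\r\n\r\n', 1)[0].decode('utf-8', 'replace')
        rec_type = re.search(r'^WARC-Type:\s*(\S+)', headers, re.M)
        target = re.search(r'^WARC-Target-URI:\s*(\S+)', headers, re.M)
        if rec_type and rec_type.group(1) == 'response' and target:
            urls.append(target.group(1))
    return urls
```

The returned URL list is what would feed back into the normal archivebox add pipeline for redundant re-archiving.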

There are a few tools that may be helpful to integrate to achieve these goals:

  • https://github.com/jsvine/waybackpack
  • https://github.com/hartator/wayback-machine-downloader
  • https://en.archivarix.com

This should allow us to redundantly archive URLs using ArchiveBox even when the original sites are no longer available.

@muramasatheninja commented on GitHub (Apr 2, 2020):

Would very much like to see this feature. I have already made a bunch of WARC files and would love to have a way to bring them into ArchiveBox.


@TheAnachronism commented on GitHub (Jun 6, 2021):

What is the status on this?
I currently have the problem that I can't archive content from sites that have some kind of authentication or maturity filter, so I wanted to archive it manually and then upload it into ArchiveBox. But there doesn't seem to be any workflow for this.


@pirate commented on GitHub (Jun 7, 2021):

Right now there's no official workflow or concrete plan to add this in the short term, but for now you can put any manual WARCs inside archive/<timestamp>/warc/*.warc.gz and ArchiveBox won't touch them there. They won't show up in the UI, but ArchiveBox won't delete or move them either, so it's a safe place to put them. If you want, you can even manually create an ArchiveResult entry to track those WARC files on the Log page or via archivebox shell; that way they'll show up in the UI and can carry any metadata you want to attach about how/when you saved them.
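The manual workaround above can be scripted. This is only a sketch of the file-copy step (the function name and the example timestamp are made up for illustration); it does not create the ArchiveResult entry, which would still need to be done via archivebox shell.

```python
import shutil
from pathlib import Path

def stash_warc(archivebox_root, timestamp, warc_path):
    """Copy a manually captured WARC into a Snapshot's warc/ directory.

    Per the workaround above, ArchiveBox leaves extra files under
    archive/<timestamp>/ alone, so this is a safe drop location
    (the file just won't appear in the UI on its own).
    """
    dest = Path(archivebox_root) / 'archive' / str(timestamp) / 'warc'
    dest.mkdir(parents=True, exist_ok=True)
    return shutil.copy2(warc_path, dest)

# Example (hypothetical timestamp of an existing Snapshot):
# stash_warc('/data', '1623024000.0', '~/Downloads/mysite.warc.gz')
```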


@refparo commented on GitHub (Jul 4, 2021):

Looking forward to this being added. It would make it easier to get ArchiveBox working with browser extensions like https://github.com/machawk1/warcreate


@pirate commented on GitHub (Jun 13, 2023):

I don't have any updates on progress here, but I did just think of an idea that I think would be related to this feature: adding support for automatically finding 3rd party copies of pages on Archive.org/search engine caches/etc. and pulling them into ArchiveBox.

My ideal vision of this feature is that it covers the case where a user tries to archive a URL that is already down / no longer available from the original server.

The flow from there could be to:

  • try to find a copy on archive.org / archive.it / archive.is / etc. and save their warc to the ArchiveBox Snapshot
  • try to find copies of the page in search engine caches (Google, Bing, Yahoo, Yandex, etc.) and save that to our Snapshot
  • try to find alternative non-canonical URLs for the page using search engines, and attempt to archive those versions instead
  • allow the user to manually upload / ingest a WARC from a URL or local filesystem to save to the ArchiveBox Snapshot

These options should be disabled by default (because it's not safe to give the user the impression that the original page was archived when in fact it came from a 3rd-party mirror), but configurable via ArchiveBox.conf / env variables. I also imagine an option where users could enable these 3rd-party archive imports even when the original URL is up, so that they can save every available version of the site each time.
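The first fallback step (finding a copy on archive.org) can be sketched against the Wayback Machine's public availability endpoint, which returns the closest snapshot for a URL as JSON. The function name is hypothetical and the fetch callable is injectable so the network call can be stubbed out; error handling and the other fallbacks (archive.is, search engine caches) are omitted.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

WAYBACK_API = 'https://archive.org/wayback/available?url='

def find_wayback_copy(url, fetch=None):
    """Return the closest Wayback Machine snapshot URL for `url`, or None.

    `fetch` takes an API URL and returns the decoded JSON dict; by default
    it queries the public availability API over the network.
    """
    if fetch is None:
        def fetch(api_url):
            with urlopen(api_url, timeout=10) as resp:
                return json.load(resp)
    data = fetch(WAYBACK_API + quote(url, safe=''))
    closest = data.get('archived_snapshots', {}).get('closest', {})
    return closest.get('url') if closest.get('available') else None
```

If this returns a snapshot URL, the importer could fetch that page (or its WARC, where one is obtainable) and attach it to the Snapshot, clearly labeled as a 3rd-party copy.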
