mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #160] Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #1621
Originally created by @pirate on GitHub (Mar 5, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/160
ArchiveBox should be able to load WARCs from outside sources, replay them with pywb, and re-archive them using all the redundant archive methods like Chrome Headless, Wget, etc. This would be most useful when a user tries to archive a URL that is already down, or that is not accessible to the ArchiveBox server.

ArchiveBox should be able to ingest a user-provided `.warc`/`.warc.gz`, auto-fetch any available WARC from Archive.org / Archive.it / Archive.is / etc., or as a last resort auto-fetch from search engine caches (Google / Bing / Yahoo / Yandex / etc.).

Related issues:

WARCs should be easily importable directly using `archivebox add ~/Downloads/path/to/some/warc.gz`, or ArchiveBox should be configurable to fall back to searching 3rd-party services automatically in the case of a 404/403/etc.

There are a few tools that may be helpful to integrate to achieve these goals:

This should allow us to redundantly archive URLs using ArchiveBox even when the original sites are no longer available.
@muramasatheninja commented on GitHub (Apr 2, 2020):
Would very much like to see this feature. I have already made a bunch of WARC files and would love to have a way to bring them into ArchiveBox.
@TheAnachronism commented on GitHub (Jun 6, 2021):
What is the status on this?
I currently have the problem that I can't archive content from sites that have some kind of authentication or maturity filter, so I wanted to archive it manually and then upload it into ArchiveBox. But there doesn't seem to be any workflow for this.
@pirate commented on GitHub (Jun 7, 2021):
Right now there's no official workflow or concrete plan to add this in the short term, but for now you can put any manual WARCs inside `archive/<timestamp>/warc/*.warc.gz` and ArchiveBox won't touch them there. They won't show up in the UI, but AB won't delete/move them either, so it's a safe place to put them. If you want, you can even manually create an `ArchiveResult` entry to track those WARC files on the Log page or via `archivebox shell`; that way they'll show up in the UI and can have any metadata you want to attach about how/when you saved them.

@refparo commented on GitHub (Jul 4, 2021):

Looking forward to this being added. It would make it easier to get ArchiveBox to work with browser extensions like https://github.com/machawk1/warcreate
@pirate commented on GitHub (Jun 13, 2023):
I don't have any updates on progress here, but I did just think of an idea that I think would be related to this feature: adding support for automatically finding 3rd party copies of pages on Archive.org/search engine caches/etc. and pulling them into ArchiveBox.
My ideal vision of this feature is that it covers the case where a user tries to archive a URL that is already down / no longer available from the original server.
The flow from there could be to:
These options should be disabled by default (because it's not safe to give the user the impression that the original page was archived when in fact we got it from a 3rd-party mirror), but configurable via `ArchiveBox.conf` / environment variables. I also imagine having an option where users could enable these 3rd-party archive imports even when the original URL is up, so that they can save every version of the site that's available each time.