mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #439] Feature Request: archive.today family integration #294
Originally created by @jaw-sh on GitHub (Aug 13, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/439
The archive.today sites (including archive.is, archive.md, archive.vn, archive.fi, etc.) should have special integrations.
Type
What is the problem that your feature request solves
archive.today's webmaster uses its status for activism. Using browsers the webmaster does not like (Brave) will result in the site being unusable. I would like to locally archive all archive.today links.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
There is a .ZIP download available for every archive which can be downloaded, unzipped, and converted into the archive format.
What hacks or alternative solutions have you tried to solve the problem?
Currently, my attempts at adding archive.today links with archivebox add result in the archive failing.
How badly do you want this new feature?
@jaw-sh commented on GitHub (Aug 13, 2020):
When I try to archive an archive.today page, I get errors and the archive is a directory of junk instead of the actual page.
@cdvv7788 commented on GitHub (Aug 13, 2020):
@jaw-sh can you provide the command (with url) you are testing? I can give it a check (It is probably being blocked by the target url).
@jaw-sh commented on GitHub (Aug 13, 2020):
@cdvv7788 http://archive.is/nX7fq
@cdvv7788 commented on GitHub (Aug 13, 2020):
It works for me. How are you trying to run archivebox? Are you setting some environment variable or changing some configuration? Are you on master?
@jaw-sh commented on GitHub (Aug 13, 2020):
http://archive.vn/nX7fq
https://tinf.io/archive/1597333033/
All I really want is the static, non-interactive version of the page they already archived.
@jaw-sh commented on GitHub (Aug 13, 2020):
ArchiveBox really needs a way to capture the DOM at "first rest" when the page is fully loaded. With Twitter, the archive is completely mangled because it tries to totally replicate the entire Twitter living webpage. Instagram is also completely broken.
I can open a new issue for this and I am willing to put cash bounties on these things.
https://twitter.com/dril/status/134787490526658561

https://tinf.io/archive/1597335380/twitter.com/dril/status/134787490526658561.html
@pirate commented on GitHub (Aug 13, 2020):
@jaw-sh it does capture at first rest with 2 of the methods, the DOM dump and Singlefile. Have you tried looking at those outputs?
@jaw-sh commented on GitHub (Aug 13, 2020):
I have the single-file binary set. It didn't work at all before I set it.
@cdvv7788 commented on GitHub (Aug 13, 2020):
We have a fix in an incoming PR that will disable it by default. The Docker setup supports all of the extractors out of the box.
@jaw-sh commented on GitHub (Aug 13, 2020):
@cdvv7788 Sounds good.
What I really, really need is this:
I am willing to pay for this.
@pirate commented on GitHub (Aug 13, 2020):
This is already present, you can POST to https://127.0.0.1:8000/admin/core/snapshot/add/ with the following fields to archive a link:
url: str (a string containing any number of URLs)
depth: int (either 0 or 1, as detailed in archivebox add --help)
As mentioned above, this is already present, both the DOM dump and SingleFile methods archive "at first rest", i.e. ~1s after DOM.ready event fires.
The other methods do not execute JS, and so "page ready" is not a concept that applies to them.
Your screenshot is of the wget output method only, have you tried viewing the SingleFile or DOM dump outputs? They should generally work fine for twitter.
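For reference, the POST described above (to the admin add URL, with the url and depth fields) could be put together roughly as follows. This is only a sketch: the helper name build_add_request is hypothetical, joining multiple URLs with newlines is an assumption (the comment only says "a string containing any number of URLs"), and a real request will most likely also need an authenticated admin session cookie and a CSRF token, which are omitted here. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint quoted in the comment above; adjust host/port for your own install.
ADD_URL = "https://127.0.0.1:8000/admin/core/snapshot/add/"


def build_add_request(urls, depth=0, add_url=ADD_URL):
    """Build (but do not send) a form-encoded POST with the `url` and `depth` fields.

    Hypothetical helper: joining URLs with newlines is an assumption, and
    login/CSRF handling is deliberately left out of this sketch.
    """
    if depth not in (0, 1):
        raise ValueError("depth must be 0 or 1")
    body = urlencode({"url": "\n".join(urls), "depth": depth}).encode("utf-8")
    return Request(
        add_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    req = build_add_request(["http://archive.is/nX7fq"], depth=0)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to urllib.request.urlopen (or an equivalent HTTP client) once session and CSRF handling are in place.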
I'm afraid this is not easily possible, archive.today explicitly does not expose an API that allows users to download their snapshots. If they did have such an API, then that task would fall under the umbrella of this ticket: https://github.com/pirate/ArchiveBox/issues/160
If you are serious about this, be aware that funding development on this issue would be on the order of $5k USD or more. We run a software consultancy and you can find more info about hiring us here: Monadical.com.
Also related (for improving exporting to sites like archive.today/archive.org): https://github.com/pirate/ArchiveBox/issues/146
@jaw-sh commented on GitHub (Aug 13, 2020):
archive.today/is/vn/fi does not use the WARC format, they export a .zip download. Even if it's not easy, converting that .zip download into WARC and using it as a snapshot is something I would pay for. I have thousands of these links I would like to host myself.
I must be missing something re: the single file archive. Is there a special config setting I have to set to explicitly use single file? I believe I am already using it but Instagram and Twitter archives are malformed. I had to create a binary to get any archive to work.
@pirate commented on GitHub (Aug 13, 2020):
We might be able to download that ZIP and rehost it verbatim in the ArchiveBox index without converting it to WARC. ArchiveBox wouldn't be able to run any of its own extractors though (wget, youtubedl, git, chrome, etc.), you'd basically just see the archive.today version in the index with none of ArchiveBox's own functionality. Is that what you're asking for?
https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage
All archive methods (that are installed) are run for every URL, you can access them by clicking the favicon next to the title, or any of the icons in the "Files" column.
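As a rough illustration of the "download that ZIP and rehost it verbatim" idea above, the unpacking step might look like the sketch below. It assumes the ZIP has already been downloaded by hand (archive.today exposes no API for this, as noted earlier in the thread), makes no assumption about what files the ZIP contains, and uses a hypothetical helper name and destination folder. It is not an existing ArchiveBox feature.

```python
import zipfile
from pathlib import Path


def unpack_snapshot_zip(zip_path, dest_dir):
    """Extract an already-downloaded ZIP into dest_dir and return the extracted file paths.

    Hypothetical helper: entries whose paths would land outside dest_dir
    are rejected instead of being extracted.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            target = (dest / info.filename).resolve()
            # Refuse anything that resolves outside the destination folder.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in ZIP: {info.filename}")
            zf.extract(info, dest)
            if not info.is_dir():
                extracted.append(target)
    return extracted
```

The extracted folder could then be served as static files alongside the index; how (or whether) it would be registered as a snapshot is left open here.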
@pirate commented on GitHub (Jun 13, 2023):
I'm merging this feature with https://github.com/ArchiveBox/ArchiveBox/issues/160, which is a more general TODO to add support for searching/importing from 3rd party archiving platforms.
Please subscribe to that issue for progress updates / discussions.