mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #320] Feature Request: single-shot CLI filter functionality #232
Originally created by @gwern on GitHub (Feb 9, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/320
I'd like an ArchiveBox command which takes a single URL, archives it if not already archived, and prints the directory and/or directory+`output.html` on stdout (or errors on stderr). Something like:

This is a little like `oneshot` but integrates with one's existing corpus and provides the output path directly. (With ~20k URLs to go through, which is what I have, setting up a new AB instance for each URL, as `oneshot` does, is not a great idea.)

This would be useful for a lot of possible automation. In particular, I'd like it for my `gwern.net` scripts, where I am working on transparently rewriting external links to point to local ArchiveBox-generated static HTML mirrors. Right now with ArchiveBox, it's easy to request it to archive links on `gwern.net` (just run a link extraction script and pass to AB) and it's easy to browse one's full AB archive, but it's not easy to figure out where exactly a particular URL lives on disk, because the directory names are opaque. After reading the AB docs, I couldn't find any easy way to simply query AB for where the HTML file for a URL is, or tell it to archive a URL and return the result. (You can't pass in the URL as a regular argument because then AB will interpret it as a source of URLs, and archive a ton of URLs it shouldn't, and you have to pass the URL in on stdin or serialize it to a file, which ironically breaks the Haskell code I'm using which assumes that everything necessary can be passed as an argument.)

Possibly one could parse the regular colorful commandline output, but my current solution is to use `jq` to directly parse the JSON index and construct the file name:

That was not easy to figure out (I'm no JSON or `jq` guru) and will break if the JSON schema ever changes slightly. So it would be nice to have a simpler, more direct way of doing this rather than digging through AB internals.
@pirate commented on GitHub (Feb 13, 2020):
I think I can make the output more parsable or add this as a flag to `archivebox add`, rather than creating a whole new subcommand. FWIW the new `archivebox add` command also fixes that "pass as argument" vs "pass as stdin" confusion that has plagued the old design. The new design simply takes a `--depth=n` parameter to specify whether it should recursively archive or not.
@cdvv7788 commented on GitHub (Jul 17, 2020):
In the django branch (upcoming 0.4.0) you can run `archivebox add <url> --depth=0` or `archivebox add <url> --depth=1` depending on your needs. That command should be enough to archive a single page.

@pirate is there a command that returns the path of an archived URL? That is present in the index, so extracting it for display should be simple, and that could be very useful for automation. That bit is the only missing part of this issue (isn't it?).
@pirate commented on GitHub (Jul 17, 2020):
`archivebox list --csv link_dir 'https://example.com'` should do the trick. The output is quite customizable too, check `archivebox list --help`.
@cdvv7788 commented on GitHub (Sep 18, 2020):
@gwern is there anything that is not possible to do with the suggestions presented here? If that is the case, let me know and I will reopen this issue.
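For anyone scripting this, the two commands suggested above (`archivebox add <url> --depth=0` followed by `archivebox list --csv link_dir <url>`) could be chained roughly as in the sketch below. It is only an illustration under several assumptions: the helper names (`add_command`, `list_command`, `parse_link_dir`, `archive_and_locate`) are hypothetical, the commands are assumed to be run from inside the ArchiveBox data directory, and the CSV output is assumed to be one `link_dir` value per row, possibly preceded by a header row. Check `archivebox list --help` for the actual output format of your version.

```python
import csv
import io
import subprocess
from typing import List, Optional


def add_command(url: str, depth: int = 0) -> List[str]:
    """Build the argv for the `archivebox add <url> --depth=N` call from the thread."""
    return ["archivebox", "add", url, f"--depth={depth}"]


def list_command(url: str) -> List[str]:
    """Build the argv for the `archivebox list --csv link_dir <url>` call from the thread."""
    return ["archivebox", "list", "--csv", "link_dir", url]


def parse_link_dir(csv_text: str) -> Optional[str]:
    """Return the first directory found in the CSV text, or None.

    Assumption: one column per row; a literal "link_dir" header row,
    if present, is skipped.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip() and row[0].strip() != "link_dir":
            return row[0].strip()
    return None


def archive_and_locate(url: str, data_dir: str = ".") -> Optional[str]:
    """Archive one URL, then ask ArchiveBox where it lives on disk.

    Assumption: `data_dir` is an initialized ArchiveBox data directory.
    Failures raise CalledProcessError (check=True).
    """
    subprocess.run(add_command(url), cwd=data_dir, check=True)
    result = subprocess.run(
        list_command(url),
        cwd=data_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_link_dir(result.stdout)
```

Because the URL is passed as a plain argument, a caller that can only pass arguments (like the Haskell code mentioned in the original request) could invoke a script built around `archive_and_locate` and read the directory from its stdout.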
@pirate commented on GitHub (Jan 8, 2021):
@gwern `oneshot` was finally released in the latest version v0.5, check it out and let me know if you'd like anything additional in the way of stdout/functionality for your use case.
https://github.com/ArchiveBox/ArchiveBox/releases/tag/v0.5.3