[GH-ISSUE #320] Feature Request: single-shot CLI filter functionality #232

Closed
opened 2026-03-01 14:41:43 +03:00 by kerem · 5 comments

Originally created by @gwern on GitHub (Feb 9, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/320

I'd like an ArchiveBox command which takes a single URL, archives it if not already archived, and prints the directory and/or directory+output.html on stdout (or errors on stderr). Something like:

$ archive singleshot "https://www.framerated.co.uk/the-haunting-1963/"
/home/gwern/www/ArchiveBox/archive/1581204108.580/output.html

This is a little like oneshot but integrates with one's existing corpus and provides the output path directly. (With ~20k URLs to go through, which is what I have, setting up a new AB instance for each URL, as oneshot does, is not a great idea.)

This would be useful for a lot of possible automation, in particular, I'd like it for my gwern.net scripts where I am working on transparently rewriting external links to point to local ArchiveBox-generated static HTML mirrors; right now with ArchiveBox, it's easy to request it to archive links on gwern.net (just run a link extraction script and pass to AB) and it's easy to browse one's full AB archive, but it's not easy to figure out where exactly a particular URL lives on disk, because the directory names are opaque. After reading the AB docs, I couldn't find any easy way to simply query AB for where the HTML file for a URL is, or tell it to archive a URL and return the result. (You can't pass in the URL as a regular argument because then AB will interpret it as a source of URLs, and archive a ton of URLs it shouldn't, and you have to pass the URL in on stdin or serialize it to a file, which ironically breaks the Haskell code I'm using which assumes that everything necessary can be passed as an argument.)

Possibly one could parse the regular colorful commandline output, but my current solution is to use jq to directly parse the JSON index and construct the file name:

SAVED_DIR=$(jq --raw-output ".links[] | select(.url==\"$1\") | .history.title[].pwd" "$HOME/www/ArchiveBox/index.json")

That was not easy to figure out (I'm no JSON or jq guru) and will break if the JSON schema ever changes slightly. So it would be nice to have a simpler more direct way of doing this rather than digging through AB internals.
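
The one-liner above splices the URL into the jq program text, so a URL containing a double quote or backslash would break the filter. A minimal sketch of the same query with the URL passed through jq's `--arg` option instead, assuming the index layout used in that one-liner (`.links[]`, `.url`, `.history.title[].pwd`). The function name `saved_dir` and its optional second argument are hypothetical:

```bash
# Hypothetical helper: print the snapshot directory recorded for one URL.
# Assumes the index.json layout used in the one-liner above.
saved_dir() {
    local url="$1"
    local index="${2:-$HOME/www/ArchiveBox/index.json}"
    # --arg binds the URL to $url as a plain string, so it is never parsed as jq code.
    jq --raw-output --arg url "$url" \
        '.links[] | select(.url == $url) | .history.title[].pwd' \
        "$index"
}
```

Called as `saved_dir "https://www.framerated.co.uk/the-haunting-1963/"`, this should print the same directory as the original command. It is still tied to the JSON schema, so it does not remove the fragility described above.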


@pirate commented on GitHub (Feb 13, 2020):

I think I can make the output more parsable or add this as a flag to archivebox add, rather than creating a whole new subcommand. FWIW the new archivebox add command also fixes that "pass as argument" vs "pass as stdin" confusion that has plagued the old design. The new design simply takes a --depth=n parameter to specify whether it should recursively archive or not.


@cdvv7788 commented on GitHub (Jul 17, 2020):

In the django branch (upcoming 0.4.0) you can run archivebox add <url> --depth=0 or archivebox add <url> --depth=1 depending on your needs. That command should be enough to archive a single page.
@pirate is there a command that returns the path of an archived URL? The path is present in the index, so extracting it for display should be simple, and that could be very useful for automation. That is the only missing part of this issue, isn't it?


@pirate commented on GitHub (Jul 17, 2020):

archivebox list --csv link_dir 'https://example.com' should do the trick. The output is quite customizable too; check archivebox list --help.
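
Putting the two suggestions in this thread together, the requested behaviour can be approximated with a small wrapper. This is a sketch, not a tested recipe. The function name `singleshot` is hypothetical. It assumes it is run from inside an existing ArchiveBox data directory, that `archivebox list --csv link_dir <url>` prints one directory per matching snapshot with no header row, and that it prints nothing when the URL is not in the index:

```bash
# Hypothetical helper: archive one URL if it is not yet in the index,
# then print its snapshot directory on stdout (progress and errors go to stderr).
singleshot() {
    local url="$1" dir
    # Ask the index first, using the `list` command suggested above.
    dir=$(archivebox list --csv link_dir "$url" 2>/dev/null | head -n 1)
    if [ -z "$dir" ]; then
        # Not indexed yet: archive just this page, without recursion.
        archivebox add "$url" --depth=0 >&2 || return 1
        dir=$(archivebox list --csv link_dir "$url" 2>/dev/null | head -n 1)
    fi
    if [ -z "$dir" ]; then
        echo "no snapshot found for $url" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}
```

Because the URL is an ordinary argument and only the directory reaches stdout, a caller that can only pass arguments (such as the Haskell code mentioned in the issue) could invoke this as `singleshot "$url"` and read a single line back. How `archivebox list` matches its filter argument may vary by version, so `archivebox list --help` is worth checking first.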


@cdvv7788 commented on GitHub (Sep 18, 2020):

@gwern is there anything that is not possible to do with the suggestions presented here? If that is the case, let me know and I will reopen this issue.


@pirate commented on GitHub (Jan 8, 2021):

@gwern oneshot was finally released in the latest version v0.5, check it out and let me know if you'd like anything additional in the way of stdout/functionality for your use case.

archivebox oneshot https://example.com
archivebox oneshot --extract=pdf,singlefile,media https://example.com
archivebox oneshot --depth=1 https://example.com
archivebox oneshot --help   # for more info

https://github.com/ArchiveBox/ArchiveBox/releases/tag/v0.5.3
