[GH-ISSUE #1082] Call for public comments: Considering deprecating the archivebox oneshot command as of the 0.7 release #673

Closed
opened 2026-03-01 14:45:26 +03:00 by kerem · 6 comments

Originally created by @pirate on GitHub (Jan 11, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1082

Long long ago before archivebox was a Django app, it used to be a one-shot bash script called archive-pocket-stream.sh. When we moved to the Django system archivebox oneshot was provided as an escape hatch for users that did not like being forced to create collections and manage data directories all of a sudden. It allows the new fancy django archivebox to run in "oneshot" mode without creating a main index file, data dir, etc. and only outputting the results of one snapshot into PWD.

As you might imagine, it required tremendous haxx to run the new Django archivebox without a db file in this way, including instantiating a fake sqlite3 db in memory, filesystem write filtering, etc. and it's imposing a large maintenance burden by making it hard to refactor other subsystems.

Now that we have solidly been on Django for several major versions, I think we can safely retire archivebox oneshot?

If anyone is using it, speak up now and make a case for keeping it 😅 🤠👋


@ianrobertsFF commented on GitHub (Apr 7, 2023):

This is my entire use case. I need to be able to take single-page snapshots on a daily basis, and have no need for any of the other functionality of ArchiveBox. I'll be running it CLI only, using another tool to trigger the daily snapshots of the pages and pipe them into specific directories.

Without the ability to create multiple snapshots of the same URL in your normal use cases, this is the only way I can achieve it.
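
A workflow like the one described above could be sketched roughly as follows. This is a minimal illustration, not something from the thread: the helper names (`snapshot_dir`, `oneshot_command`, `take_daily_snapshot`) and the one-folder-per-date layout are assumptions, and it relies only on the behavior stated in the issue, namely that `archivebox oneshot` writes its results into the current working directory.

```python
import datetime
import subprocess
from pathlib import Path


def snapshot_dir(base, day):
    """Hypothetical layout: one subfolder per calendar day, e.g. base/2023-04-07."""
    return Path(base) / day.isoformat()


def oneshot_command(url):
    """Build the argv for a single oneshot run of the given URL."""
    return ["archivebox", "oneshot", url]


def take_daily_snapshot(url, base, day=None):
    """Run oneshot inside a dated directory so each day's output is kept separately."""
    day = day or datetime.date.today()
    out = snapshot_dir(base, day)
    out.mkdir(parents=True, exist_ok=True)
    # oneshot writes into the current directory, so the dated folder is used as cwd.
    subprocess.run(oneshot_command(url), cwd=out, check=True)
    return out
```

An external scheduler (cron, a systemd timer, or similar) would then call `take_daily_snapshot(...)` once a day for each page.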


@pirate commented on GitHub (Apr 7, 2023):

Good to know! Would your needs be satisfied if we add better native support for multiple snapshots in archivebox instead of keeping this older feature? @ianrobertsFF


@jvican commented on GitHub (Apr 9, 2023):

I'm also using this. I think it makes a lot of sense to keep a command like oneshot around because it's fairly self-contained, and it aligns well with the UNIX philosophy. It does one thing and it does it well, without the need for archive init and the use of the rest of the software. Please don't take it away.


@ianrobertsFF commented on GitHub (Apr 9, 2023):

My needs would be satisfied by multiple snapshots, although I still wouldn't be using any of the functionality that oneshot doesn't currently use, so it wouldn't be a better workflow, as oneshot does exactly what I need.

However assuming I can continue to take on-demand snapshots with the native support for multiple snapshots, this would be acceptable to me.


@jwmh commented on GitHub (Jul 31, 2023):

Q:
Would it be possible to fork this off into its own separate project/repo?
Would it even be desirable?

I appreciate @jvican's comments on this, and agree.


@pirate commented on GitHub (Dec 18, 2023):

Ok I've decided to keep oneshot because it ties in nicely with the ongoing refactor to move ArchiveBox towards an event-driven job queue model. The old oneshot will be renamed and joined by a new command to run a single extractor method:

#### `archivebox snapshot`

Can be run to snapshot an individual URL into the current directory (runs all extractors by default).

```bash
archivebox snapshot --methods=all 'https://example.com/somepage.html'
# creates a subfolder for each extractor method, and an index.html and index.json file in $PWD
```

This works the same way as `oneshot` does now, and I'll alias `oneshot` to the new command so we don't break backwards compatibility.

#### `archivebox extract`

This runs an individual extractor method and outputs into the current directory.

```bash
archivebox extract --method=PDF --method-args-here 'https://example.com/somepage.html'
# writes output.pdf (and an index.json containing cmd+output for each run) into $PWD using the headless browser
```

After the refactor, archivebox add will work by internally enqueuing a job that runs archivebox snapshot ... for each imported URL.
The snapshot job then in turn enqueues a job for each extractor needed on that URL.
Each extractor job then runs archivebox extract --method=... internally to write the output into the final archive directory.
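
The fan-out described above (add, then one snapshot job per URL, then one extract job per extractor) might look something like the toy sketch below. It is only an illustration of the queueing idea, not ArchiveBox's actual implementation: the `Job` class, the `EXTRACTORS` tuple, and `run_add` are hypothetical names, and the `archivebox extract --method=...` command line is the proposed interface from this comment rather than a released CLI. Nothing is executed; the function only returns the commands the extract jobs would run.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical extractor names, for illustration only.
EXTRACTORS = ("title", "pdf", "screenshot")


@dataclass(frozen=True)
class Job:
    kind: str                      # "snapshot" or "extract"
    url: str
    method: Optional[str] = None   # only set for extract jobs


def run_add(urls):
    """Simulate `archivebox add`: enqueue one snapshot job per imported URL,
    let each snapshot job enqueue one extract job per extractor, and collect
    the command line each extract job would run."""
    queue = deque(Job("snapshot", url) for url in urls)
    commands = []
    while queue:
        job = queue.popleft()
        if job.kind == "snapshot":
            # A snapshot job does no extraction itself; it only fans out.
            queue.extend(Job("extract", job.url, m) for m in EXTRACTORS)
        else:
            # An extract job is a leaf: record the command it would invoke.
            commands.append(["archivebox", "extract", f"--method={job.method}", job.url])
    return commands
```

Because every level is just a job on the same queue, a standalone snapshot of one URL and a bulk `add` go through the same code path, which appears to be the reason the oneshot-style command fits the refactor.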

Please subscribe to this issue for updates: https://github.com/ArchiveBox/ArchiveBox/issues/1289
