[GH-ISSUE #553] queuing for add #349

Closed
opened 2026-03-01 14:42:46 +03:00 by kerem · 2 comments

Originally created by @shepner on GitHub (Nov 28, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/553

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

archivebox add can be slow, and I typically just want a quick "fire and forget" way to submit new URLs. I'd also like this to be a multi-threaded process.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Implement a command (e.g. archivebox queue) that parses the input (similar to archivebox add) and places each URL into a message queue. On the other side of the message queue, have a process that kicks off archivebox add commands in the background, one per available CPU.
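
A minimal Python sketch of that producer/consumer shape, for illustration only. The helper names (`parse_urls`, `archive_one`, `drain`) are hypothetical and not part of ArchiveBox, the one-URL-per-line parsing is a simplification of what archivebox add accepts, and the sketch assumes that several archivebox add processes can safely run against the same data directory at once, which this thread does not confirm.

```python
import os
import queue
import subprocess
import threading


def parse_urls(text):
    """Return the non-empty, non-comment lines of `text` as a list of URLs."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def archive_one(url):
    """Run `archivebox add <url>` and return its exit code."""
    return subprocess.run(["archivebox", "add", url]).returncode


def drain(urls, worker=archive_one, workers=None):
    """Push every URL onto a queue and let worker threads consume it.

    `workers` defaults to the CPU count. Returns {url: worker result}.
    """
    workers = workers or os.cpu_count() or 1
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()

    def loop():
        # Each thread keeps pulling URLs until the queue is empty.
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return
            outcome = worker(url)
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Something like `drain(parse_urls(open("urls.txt").read()))` would then process a file of URLs. Threads are enough here because each worker spends its time waiting on a child process rather than running Python code.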

What hacks or alternative solutions have you tried to solve the problem?

While "doing it right" is a rather involved process, this could also be done outside archivebox itself, as a script within the Docker container. I've done "quick and dirty" variants similar to this a few times over the years with Python (and Perl) scripts.

In the simplest form, the message queue could just be a list or even a file. Running multiple threads can be as simple as just watching to ensure no more than N instances are running at any given time and pulling more entries from the queue when there are more slots open.
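
As a rough sketch of that simplest form: the queue is a plain text file with one URL per line, and a polling loop keeps at most N child processes alive. `run_from_file`, `archivebox_add`, and the parameter names are hypothetical, and the sketch again assumes concurrent archivebox add runs do not conflict with each other.

```python
import subprocess
import time


def archivebox_add(url):
    """Build the command line used to archive a single URL."""
    return ["archivebox", "add", url]


def run_from_file(path, max_running=2, build_cmd=archivebox_add, poll_seconds=0.5):
    """Start one process per line of `path`, never more than `max_running` at once.

    Returns a list of (url, exit_code) tuples in completion order.
    """
    with open(path, encoding="utf-8") as fh:
        waiting = [line.strip() for line in fh if line.strip()]

    running = []   # (url, Popen) pairs currently occupying a slot
    finished = []
    while waiting or running:
        # Fill any free slots from the front of the queue.
        while waiting and len(running) < max_running:
            url = waiting.pop(0)
            running.append((url, subprocess.Popen(build_cmd(url))))

        # Collect processes that have exited, which frees their slots.
        still_running = []
        for url, proc in running:
            code = proc.poll()
            if code is None:
                still_running.append((url, proc))
            else:
                finished.append((url, code))
        running = still_running

        if running:
            time.sleep(poll_seconds)
    return finished
```

Calling `run_from_file("urls.txt", max_running=4)` would work through the file four URLs at a time. The file is read once at startup, so URLs appended while the loop is running are not picked up in this version.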

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@cdvv7788 commented on GitHub (Nov 28, 2020):

@pirate this is related to the huey implementation, right?


@pirate commented on GitHub (Nov 28, 2020):

The message queue-style implementation is coming soon with Huey, but the behavior you want can already be achieved with:

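![image](https://user-images.githubusercontent.com/511499/100522341-f1a62980-3177-11eb-920c-26399f02e841.png)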

```bash
# add the URL to the index only, without running any of the archiving methods yet (effectively queuing it)
archivebox add --index-only https://example.com/some/url/here

...
# then run this later on / in a separate process to actually archive everything
archivebox update
```

Going to close this for now because the Huey implementation is already a long-running dev task we're tracking in other issues.
Feel free to reply if you still have questions / want help though and I'll continue answering here.
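
For anyone who wants to script the two-step workaround above, here is a small Python sketch. It uses only the two commands quoted in this comment (`archivebox add --index-only` and `archivebox update`). The function names `enqueue`, `archive_pending`, and `run_cli` are hypothetical, and the sketch assumes it is run from inside the ArchiveBox data directory.

```python
import subprocess


def run_cli(args):
    """Run one command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(args, check=True)


def enqueue(urls, run=run_cli):
    """Step 1: add each URL to the index only, which is quick and archives nothing."""
    for url in urls:
        run(["archivebox", "add", "--index-only", url])


def archive_pending(run=run_cli):
    """Step 2: run later or in a separate process to archive what was indexed."""
    run(["archivebox", "update"])
```

`enqueue([...])` can be called whenever new URLs come in, while `archive_pending()` might run from a cron job or a long-lived background process.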
