[GH-ISSUE #578] Feature Request: Scheduling Archival from the UI #3381

Open
opened 2026-03-14 22:31:30 +03:00 by kerem · 3 comments

Originally created by @BlipRanger on GitHub (Dec 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/578

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Currently, scheduling ingestion of new URLs requires writing a cron job external to the web UI (external to the Docker container in my case), which isn't ideal in a Docker/self-contained setup. I believe this would be a nice convenience feature for users who want to manage the entire operation of ArchiveBox (AB) from within the web UI.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

This feature would add a way to set up scheduled pulls from various data sources via the web UI rather than only externally via cron. I specifically imagine at least a way to specify an RSS feed to subscribe to and watch for new content (something like Wallabag in my particular imagined use case). Technically, I think this would involve a new menu/button in the UI and should dovetail with the internal scheduling processes already available.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
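As an illustration of the "watch an RSS feed for new content" idea in the request, here is a minimal sketch using only the Python standard library. It is not ArchiveBox code: the function name `new_feed_links`, the `SAMPLE_FEED` document, and the in-memory `seen` set are all hypothetical, and a real scheduler would fetch the feed over HTTP and persist what it has already imported.

```python
import xml.etree.ElementTree as ET


def new_feed_links(feed_xml, seen):
    """Return links from an RSS 2.0 document that are not yet in `seen`.

    `seen` is a set of already-imported URLs (hypothetical storage);
    it is updated in place so a second poll reports nothing new.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


# Tiny inline feed so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""

if __name__ == "__main__":
    already_imported = {"https://example.com/a"}
    print(new_feed_links(SAMPLE_FEED, already_imported))  # ['https://example.com/b']
```

Each poll returns only the links not seen before, which is the list a scheduled job would then hand to the archiver.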

@pirate commented on GitHub (Dec 10, 2020):

Yeah, this is definitely on our mind. It probably won't be added for a couple of versions, but it's something I've been planning.

It's blocked by adding a background queue system like Huey (https://github.com/coleifer/huey) or dramatiq: https://github.com/ArchiveBox/ArchiveBox/issues/91

In the meantime, I recommend using docker-compose instead of docker alone, as it lets you declaratively define your scheduled imports all in one place (see the commented-out section of docker-compose.yml for an example of how to do that).

@BlipRanger commented on GitHub (Dec 10, 2020):

Gotcha, I saw the future queuing system and that makes sense! And yes, currently using compose, so I'll look into doing that. Thanks!


@pirate commented on GitHub (Apr 16, 2021):

Here's my proposed implementation of a new model to track scheduled imports: https://github.com/ArchiveBox/ArchiveBox/pull/707/files

Remaining TODOs:

  • figure out which Python scheduler to use
    • huey + django-huey-monitor (https://github.com/boxine/django-huey-monitor) (my current favorite)
    • celery (ugh...)
    • APScheduler (will require lots of manual models and concurrency control code)
    • yacron (not sure if it can be configured dynamically)
    • dramatiq (doesn't support SQLite)
  • decide whether to continue supporting system crontab at all, or tear it out (imo we should just tear it out and move to using an internal scheduler)
  • fork the scheduled task worker off the server process automatically on startup, so there's no need to run a separate archivebox schedule --foreground process manually
  • figure out how to enforce an "at least once" or "at most once" concurrency model for scheduled tasks

Follow that PR for more updates as work progresses: https://github.com/ArchiveBox/ArchiveBox/pull/707

See this thread for my WIP design that moves us towards a message-passing / async job worker structure internally: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-871343428
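The last TODO above (enforcing "at most once" for scheduled tasks) could be approached in several ways. One possible sketch, assuming a SQLite-backed store, is to have each worker claim a (task, scheduled time) pair before running it. This is not taken from the PR: the `task_runs` table and the `open_claims_db` / `claim_run` helpers are hypothetical names used for illustration only.

```python
import sqlite3


def open_claims_db(path=":memory:"):
    """Open a database holding one row per claimed scheduled run (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_runs ("
        " task_name TEXT NOT NULL,"
        " scheduled_for TEXT NOT NULL,"
        " PRIMARY KEY (task_name, scheduled_for))"
    )
    return conn


def claim_run(conn, task_name, scheduled_for):
    """Try to claim a run; return True only for the first caller.

    The primary key makes a duplicate INSERT a no-op, so rowcount is 1
    for the winner and 0 for everyone else.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO task_runs (task_name, scheduled_for) VALUES (?, ?)",
            (task_name, scheduled_for),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    db = open_claims_db()
    print(claim_run(db, "import_feed", "2021-04-16T00:00"))  # True
    print(claim_run(db, "import_feed", "2021-04-16T00:00"))  # False
```

Claiming before the work starts gives "at most once": if the worker crashes after the claim, that run is skipped rather than retried. Recording the row only after the work finishes would lean towards "at least once" instead, at the cost of possible duplicate runs.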
Reference
starred/ArchiveBox#3381