mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1155] Scheduled jobs added in Docker with archivebox schedule ... don't persist when container restarts #2227
Originally created by @melyux on GitHub (Jun 8, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1155
Describe the bug
Adding a new schedule on a feed using
sudo docker compose run archivebox schedule --every=day --depth=1 https://domain.of/feed
says:
[√] Scheduled new ArchiveBox cron job for user: archivebox (1 jobs are active).
but subsequently running
sudo docker compose run archivebox schedule --show
says:
[X] There are no ArchiveBox cron jobs scheduled for your user (archivebox).
Steps to reproduce
docker compose yaml:
Schedule something. Then do schedule --show.
Screenshots or log output
ArchiveBox version
@pirate commented on GitHub (Jun 11, 2023):
The schedule commands write to your local crontab file. Can you check your user's crontab to see if they're in there?
@pirate commented on GitHub (Jun 11, 2023):
Wait, never mind, I see you're in Docker: you need to use the same container for the jobs to persist. Can you exec bash in the container and check everything in /etc/cron.d to see if they're in there? Let me know if you need help with the commands to do that.
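For reference, a minimal sketch of that check, assuming the compose service is named archivebox (as in the commands earlier in this thread) and is already running via docker compose up. The helper name show_container_cron is made up for illustration. Note that docker compose run starts a fresh container on each invocation, whereas docker compose exec runs inside the container that is already up:

```shell
# Hypothetical helper: list the cron drop-in directory inside the running
# "archivebox" service container (service name assumed from this thread).
show_container_cron() {
  docker compose exec archivebox ls -la /etc/cron.d
}
```

Calling show_container_cron from the directory that holds docker-compose.yml should print the directory listing from inside the container.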
@melyux commented on GitHub (Jun 11, 2023):
The contents of /etc/cron.d inside the container (after running the schedule command) are these:
.placeholder
e2scrub_all
@pirate commented on GitHub (Jun 13, 2023):
I think it may be more trouble than it's worth trying to use cron within the Docker container, as the container is not designed to persist any state other than the archivebox data folder (which is why your crontab gets wiped out every time the container is re-created).
The recommended approach is the one we have as a commented-out example in the provided docker-compose.yml. Using this pattern you would add a new yaml service definition block to your docker-compose.yml for every scheduled job you need. The example there shows archiving a pocket user's RSS feed and a subreddit homepage every day, but you can modify it / add more scheduled job blocks as needed. (Comment back here if you have any issues with this recommended pattern and I'll reopen the issue.)
This will be streamlined in the future once we add support for scheduling jobs via the UI: https://github.com/ArchiveBox/ArchiveBox/issues/578 (subscribe there for updates on progress). As part of that feature's implementation, jobs will be persisted in the DB anyway so they won't be wiped out when the container restarts, and this will become a non-issue / less of a UX wart.
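As a rough sketch of the one-service-per-job pattern described above (not a copy of the project's provided file), the snippet below writes a separate compose file with two scheduler services. The file name docker-compose.jobs.yml, the service names, the image tag, the ./data:/data mount, and the example.com feed URLs are all placeholders assumed for illustration.

```shell
# Write a hypothetical compose file with one scheduler service per job.
# Each service runs `archivebox schedule` in the foreground for a single URL.
cat > docker-compose.jobs.yml <<'EOF'
services:
  scheduler_feed_a:                       # hypothetical service name
    image: archivebox/archivebox:latest   # assumed image tag
    command: schedule --foreground --every=day --depth=1 'https://example.com/feed-a.xml'
    volumes:
      - ./data:/data                      # assumed data folder mount
  scheduler_feed_b:                       # hypothetical service name
    image: archivebox/archivebox:latest
    command: schedule --foreground --every=day --depth=1 'https://example.com/feed-b.xml'
    volumes:
      - ./data:/data
EOF
```

Under those assumptions it could then be started with docker compose -f docker-compose.jobs.yml up -d. Every extra job means another service block, which is why this gets unwieldy for large numbers of feeds.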
@melyux commented on GitHub (Jun 13, 2023):
Isn't it overkill to have a separate container for each feed? I have hundreds I was going to add, would this consume a lot of resources?
I saw in a different issue you had recommended maybe just one container for feeds that would read the added jobs, but I guess that would not work since the cron additions don't seem to be persisted even in the same container without restarting at all.
Looking forward to the streamlined solution! Subscribed.
@pirate commented on GitHub (Jun 13, 2023):
Ah yeah that would be a huge pain to add hundreds of scheduled tasks that way 😢 It wouldn't necessarily consume tons of resources, because the containers don't do anything while they're sleeping and will readily swap to disk in case of RAM shortages, but if you have many jobs running frequently on a small machine I could see it being a problem.
If that was your plan, may I ask what your plan was for storing all that data? The main reason I haven't built out tons of scheduled archiving features is because it tends to rapidly balloon to terabytes of storage, so most people aren't actually able to archive that much.
There definitely is a way to do scheduled jobs with one container, it's just a matter of figuring out where the docker crontab file is stored and mounting that path as a volume (maybe check /var/spool/cron/crontabs?). I'm happy to work with you to figure that out, but I want to make sure it's feasible/desirable for you to actually store that much content first :)
@melyux commented on GitHub (Jun 13, 2023):
It was just for tracking updates to blogs, which update only rarely, so disk space wouldn't be a problem even if I had only like 10 GB to spare. Maybe one new article to download every day out of all those hundreds of feeds. For me this would be a fairly primary use case for archivebox (set it and forget it!)
Let's do it! I couldn't get the scheduling to work inside the bash shell of the container because archivebox said it wouldn't run as root. If I mounted the crontab path, is there a way to schedule the feeds without running a separate container?
@pirate commented on GitHub (Jun 13, 2023):
Figured it out, just mount the crontabs directory as a volume: ./etc/crontabs:/var/spool/cron/crontabs, and then you only need one container running archivebox schedule --foreground to handle running all the scheduled jobs together.
docker-compose.yml:
I've updated our provided example docker-compose.yml too: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml
And the equivalent in plain docker without compose:
Click to expand...
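The collapsed commands are not preserved in this mirror. A rough sketch of what a plain-docker version could look like, assuming the image tag archivebox/archivebox:latest and a ./data:/data mount (neither is stated in this comment), with hypothetical helper names add_job and run_scheduler:

```shell
# Assumed image tag; the crontabs mapping mirrors the one given above.
IMAGE="archivebox/archivebox:latest"

# Hypothetical helper: add a daily job for the feed URL passed as $1.
# With the bind mount, the crontab entry should end up in ./etc/crontabs
# on the host instead of inside a throwaway container.
add_job() {
  docker run --rm \
    -v "$PWD/data:/data" \
    -v "$PWD/etc/crontabs:/var/spool/cron/crontabs" \
    "$IMAGE" schedule --every=day --depth=1 "$1"
}

# Hypothetical helper: run the foreground scheduler against the same mounts
# so it can see every job persisted in ./etc/crontabs.
run_scheduler() {
  docker run --rm \
    -v "$PWD/data:/data" \
    -v "$PWD/etc/crontabs:/var/spool/cron/crontabs" \
    "$IMAGE" schedule --foreground
}
```

For example, add_job https://domain.of/feed followed by run_scheduler would add one job and then start the scheduler with the same two mounts.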
Then exit both terminals, stop the container completely, remove it, and restart it from scratch with the volumes mounted.
You should be able to run it with the foreground scheduler in docker and it will pick up the scheduled task because it's persisted now.
Let me know if it works!
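Putting the comment above together, a minimal sketch of the single-scheduler setup as a generated compose file. Only the crontabs mapping and the schedule --foreground command come from this thread; the file name docker-compose.scheduler.yml, the service name archivebox_scheduler, the image tag, the port mapping, and the ./data:/data mount are assumptions for illustration.

```shell
# Write a hypothetical compose file: one web container plus one scheduler
# container, both sharing the data folder and the crontabs directory.
cat > docker-compose.scheduler.yml <<'EOF'
services:
  archivebox:
    image: archivebox/archivebox:latest   # assumed image tag
    ports:
      - "8000:8000"                       # assumed web UI port
    volumes:
      - ./data:/data                      # assumed data folder mount
      - ./etc/crontabs:/var/spool/cron/crontabs
  archivebox_scheduler:                   # hypothetical service name
    image: archivebox/archivebox:latest
    command: schedule --foreground
    volumes:
      - ./data:/data
      - ./etc/crontabs:/var/spool/cron/crontabs
EOF
```

With such a file, jobs added through docker compose -f docker-compose.scheduler.yml run archivebox schedule --every=day --depth=1 https://domain.of/feed should be written to ./etc/crontabs on the host, where the scheduler service can read them after docker compose -f docker-compose.scheduler.yml up -d.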
@melyux commented on GitHub (Jun 14, 2023):
This is great, testing now. Since we're adding a cron job, is the second scheduler container still necessary? Wouldn't the primary archivebox container run the scheduled cron jobs itself?
@pirate commented on GitHub (Jun 14, 2023):
The primary container has the cron package installed but doesn't have any init system / doesn't run cron itself, it only runs the webserver.
@melyux commented on GitHub (Jun 14, 2023):
Gotcha. Would be very cool if it did run cron ;)