Scheduled Archiving
ArchiveBox now stores schedules in the database and lets the orchestrator materialize them into queued Crawl records at the right time. You no longer need host cron, user crontabs, or a separate archivebox_scheduler container when archivebox server is running.
How It Works
- archivebox schedule ... creates a CrawlSchedule record plus a sealed template Crawl.
- The long-running global orchestrator inside archivebox server watches enabled schedules.
- When a schedule becomes due, the orchestrator creates a new queued Crawl.
- That queued crawl is processed the same way as UI/API-submitted work.
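The flow above can be pictured with a small, self-contained Python sketch. It is not ArchiveBox's actual implementation: the Schedule and Orchestrator classes, their field names, and the fixed-interval "due" check are all assumptions made for illustration, and the sealed template Crawl is reduced to a single URL.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Schedule:
    # Hypothetical stand-in for a CrawlSchedule row; field names are assumptions.
    url: str                      # stands in for the sealed template Crawl
    interval: timedelta           # simplification: fixed interval instead of cron
    enabled: bool = True
    last_run: Optional[datetime] = None


@dataclass
class Orchestrator:
    # Hypothetical stand-in for the long-running global orchestrator.
    schedules: List[Schedule]
    queue: List[str] = field(default_factory=list)  # stand-in for queued Crawl records

    def tick(self, now: datetime) -> int:
        """Queue one crawl per enabled schedule that is due; return how many were queued."""
        queued = 0
        for schedule in self.schedules:
            if not schedule.enabled:
                continue  # disabled schedules are never materialized
            never_ran = schedule.last_run is None
            if never_ran or now - schedule.last_run >= schedule.interval:
                self.queue.append(schedule.url)  # "create a new queued Crawl"
                schedule.last_run = now
                queued += 1
        return queued


daily = Schedule("https://example.com/feed.xml", timedelta(days=1))
paused = Schedule("https://example.org/", timedelta(days=1), enabled=False)
orchestrator = Orchestrator([daily, paused])

start = datetime(2024, 1, 1)
orchestrator.tick(start)                       # queues the daily schedule (never run before)
orchestrator.tick(start + timedelta(hours=6))  # nothing is due yet
orchestrator.tick(start + timedelta(days=1))   # the daily schedule is due again
```

The point of the sketch is the separation of concerns: creating a schedule only writes a record, and a separate long-running loop decides when to turn that record into queued work.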
One-shot foreground flows such as archivebox add ... continue to process only the crawl they were asked to run. They do not also sweep and execute unrelated scheduled crawls.
CLI Usage
cd ~/archivebox/data
archivebox schedule --every=daily --depth=1 https://example.com/feed.xml
archivebox schedule --every='0 */6 * * *' https://example.com/feed.xml
archivebox schedule --show
archivebox schedule --clear
archivebox schedule --run-all
archivebox schedule --foreground
Accepted schedule formats:
- Aliases: minute, hour, day, week, month, year, daily, weekly, monthly, yearly
- Cron expressions: e.g. 0 */6 * * *
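To make the two formats concrete, here is a rough Python sketch that tells an alias from a cron expression and expands a simple cron field into the values it selects. The function names classify and expand_field are hypothetical, the five-field check is only an approximation of cron validation, and none of this is ArchiveBox's own parser.

```python
ALIASES = {
    "minute", "hour", "day", "week", "month", "year",
    "daily", "weekly", "monthly", "yearly",
}


def classify(every: str) -> str:
    """Return 'alias', 'cron', or 'invalid' for an --every value (rough check only)."""
    value = every.strip()
    if value in ALIASES:
        return "alias"
    if len(value.split()) == 5:  # minute hour day-of-month month day-of-week
        return "cron"
    return "invalid"


def expand_field(cron_field: str, low: int, high: int) -> list:
    """Expand a simple cron field ('*', '*/n', 'a-b', 'a,b,c') into sorted values.

    No range validation is performed; this only illustrates the notation.
    """
    values = set()
    for part in cron_field.split(","):
        base, _, step = part.partition("/")
        step_size = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step_size))
    return sorted(values)


minute, hour, *_ = "0 */6 * * *".split()
print(classify("daily"))            # alias
print(classify("0 */6 * * *"))      # cron
print(expand_field(minute, 0, 59))  # [0]
print(expand_field(hour, 0, 23))    # [0, 6, 12, 18]
```

Read this way, 0 */6 * * * fires at minute 0 of hours 0, 6, 12, and 18, i.e. four times a day.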
archivebox schedule --run-all enqueues every enabled schedule immediately.
archivebox schedule --foreground runs the global orchestrator in the foreground, which is useful outside archivebox server if you want a dedicated long-running scheduler/worker process without the web UI.
Running archivebox schedule --every=day with no import_path creates a recurring maintenance schedule that queues archivebox://update crawls.
Docker Compose
With the new orchestrator flow, you only need the main archivebox service:
services:
  archivebox:
    image: archivebox/archivebox:dev
    command: server --quick-init 0.0.0.0:8000
    volumes:
      - ./data:/data
Create schedules with:
docker compose run --rm archivebox schedule --every=weekly --depth=1 https://example.com/feed.xml
docker compose run --rm archivebox schedule --show
If the main archivebox server container is already running, its orchestrator will pick up future scheduled runs automatically. There is no scheduler sidecar to restart.
Examples
Archive a Twitter mirror once a week:
archivebox schedule --every=weekly --depth=1 'https://nitter.net/ArchiveBoxApp'
Archive a subreddit and linked discussions once a week:
archivebox config --set URL_WHITELIST='^http(s)?:\/\/(.+)?teddit\.net\/?.*$'
archivebox schedule --every=weekly --overwrite --depth=1 'https://teddit.net/r/DataHoarder/'
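Before relying on a whitelist pattern for a recurring crawl, it can help to try it against a few sample URLs. The sketch below uses Python's re module with the pattern from the command above. Applying it with re.search against the full URL is an assumption about how the filter is evaluated, and the allowed helper is hypothetical.

```python
import re

# Pattern copied from the URL_WHITELIST example above.
WHITELIST = r'^http(s)?:\/\/(.+)?teddit\.net\/?.*$'


def allowed(url: str) -> bool:
    # Assumption: the pattern is matched with re.search against the whole URL.
    return re.search(WHITELIST, url) is not None


samples = [
    "https://teddit.net/r/DataHoarder/",     # the scheduled page itself
    "http://sub.teddit.net/",                # a subdomain
    "https://news.ycombinator.com/",         # an unrelated site
    "https://example.com/?next=teddit.net",  # teddit.net appears only in the query string
]

for url in samples:
    print(url, allowed(url))
```

The last sample is also accepted, because the (.+)? group can match any characters before teddit.net, including another host and path. If that is too permissive for your archive, tighten the group so it only matches subdomain labels.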
Archive Hacker News every day:
archivebox config --set URL_BLACKLIST='^http(s)?:\/\/(.+\.)?(youtube\.com|amazon\.com)(\/.*)?$'
archivebox schedule --every=daily --depth=1 'https://news.ycombinator.com'
Queue a daily maintenance update:
archivebox schedule --every=day
