[GH-ISSUE #1092] Question: How can I automatically remove archived pages after a certain amount of time #3707

Closed
opened 2026-03-15 00:05:26 +03:00 by kerem · 3 comments
Owner

Originally created by @DominoDrifter on GitHub (Feb 4, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1092

My use case for ArchiveBox is keeping an offline copy of the sites I've visited over the last 3 months. I don't really need anything beyond that; for me it just wastes space.

I can't find any option for auto-removing in the UI, so I looked to the CLI next so that I can set up a cron job to purge anything older than 3 months, but I don't know where to start. My thoughts are something like the command below, but the fact that I can't see anything in the docs for what I'd expect to be a fairly common feature is setting off alarm bells that I've missed something obvious:

`docker exec -ti archivebox archivebox list | sort <some sort options> | archivebox remove`


@pirate commented on GitHub (Feb 4, 2023):

First time I've ever gotten this feature request 🤷 ArchiveBox is built more for long-term storage than short-term, but you can definitely just run `archivebox remove` in a cron job.

`archivebox remove` has a number of built-in filtering options like `--before` and `--after`, so you don't need to chain `list | sort | remove`. Then just put your desired removal command in cron to run it every n days or so.

➜ ~/D/o/A/data ⎇ (dev) ❈2 # archivebox remove --help

[i] [2023-02-04 22:22:34] ArchiveBox v0.6.3: archivebox remove --help
    > /Users/squash/Documents/opt/ArchiveBox/data

usage: archivebox remove [-h] [--yes] [--delete] [--before BEFORE] [--after AFTER] [--filter-type {exact,substring,domain,regex,tag}] [filter_patterns ...]

Remove the specified URLs from the archive

positional arguments:
  filter_patterns       URLs matching this filter pattern will be removed from the index.

optional arguments:
  -h, --help            show this help message and exit
  --yes                 Remove links instantly without prompting to confirm.
  --delete              In addition to removing the link from the index, also delete its archived content and metadata folder.
  --before BEFORE       List only URLs bookmarked before the given timestamp.
  --after AFTER         List only URLs bookmarked after the given timestamp.
  --filter-type {exact,substring,domain,regex,tag}
                        Type of pattern matching to use when filtering URLs
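
As a rough sketch of the cron approach described above, the removal command could live in a small wrapper script. The script name and the `RETENTION_MONTHS` and `DRY_RUN` variables are hypothetical, GNU `date` is assumed, and it is an assumption here (not something the help text states) that `--before` accepts a Unix epoch timestamp.

```shell
#!/usr/bin/env bash
# purge-old-snapshots.sh (hypothetical name): remove snapshots older than N months.
set -euo pipefail

RETENTION_MONTHS="${RETENTION_MONTHS:-3}"   # assumed retention window, in months
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the command, do not run it

# Cutoff as Unix epoch seconds (GNU date syntax; assumes --before accepts this format).
cutoff="$(date -d "${RETENTION_MONTHS} months ago" '+%s')"

# --yes skips the confirmation prompt; --delete also removes the archived files on disk.
cmd=(archivebox remove --yes --delete --before "$cutoff")

if [ "$DRY_RUN" = "1" ]; then
    echo "Would run: ${cmd[*]}"
else
    "${cmd[@]}"
fi
```

A crontab entry along the lines of `0 3 * * 0 cd /path/to/data && DRY_RUN=0 /path/to/purge-old-snapshots.sh` would then run it weekly from the data directory (both paths are placeholders). Keeping the `date` format strings inside the script also avoids having to escape `%`, which crontab treats as a newline unless it is written as `\%`.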

@DominoDrifter commented on GitHub (Feb 4, 2023):

Wow, the man himself, thanks @pirate! I had no idea you could pass the help flag to a subcommand like `remove`; that explains a lot. I'll have a play!


@DominoDrifter commented on GitHub (Mar 25, 2023):

Just in case this comes in handy for anyone else with a similar use case to mine (i.e. using the browser extension to save every site I visit, but then only keeping certain sites for longer than 6 months).

Below is what you would use to purge anything from ArchiveBox older than 6 months, except for any URLs containing reddit or github:

`archivebox remove --before "$(date -d "$(date -d '-6 month' '+%Y-%m-%d') 00:00:00" +%s).0" --filter-type regex '^(?!.*(reddit|github)).*'`
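
The nested `date` calls are easier to follow when split into steps. The sketch below assumes GNU `date` (the `-d` option) and uses a hypothetical helper name, `cutoff_timestamp`. It should produce the same kind of value as the one-liner: local midnight of the day N months ago, as epoch seconds with a trailing `.0`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the epoch timestamp for local midnight N months ago.
cutoff_timestamp() {
    local months="${1:-6}"
    local day
    # Step 1: the calendar date N months back, as a YYYY-MM-DD string.
    day="$(date -d "-${months} month" '+%Y-%m-%d')"
    # Step 2: that date at 00:00:00 local time, as seconds since the epoch,
    # with ".0" appended to match the float-style value used above.
    printf '%s.0\n' "$(date -d "${day} 00:00:00" '+%s')"
}

cutoff_timestamp 6
```

The result can then be passed as `--before "$(cutoff_timestamp 6)"`, which keeps the quoting in the `archivebox remove` call itself much simpler.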

If you run ArchiveBox in Docker, rather than getting a bash shell in the container, you can run it from the host like this (useful for a daily cron job):

`docker exec --user archivebox archivebox archivebox remove --yes --delete --before "$(date -d "$(date -d '-6 month' '+%Y-%m-%d') 00:00:00" +%s).0" --filter-type regex '^(?!.*(reddit|github)).*'`
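
Because `--yes --delete` cannot be undone, it may be worth checking what the negative-lookahead pattern selects before scheduling the command. The sketch below tests the same regex against a few made-up URLs using `grep -P` (GNU grep with PCRE support is assumed). ArchiveBox's own regex engine may differ in details such as case sensitivity, so treat this only as a sanity check.

```shell
#!/usr/bin/env bash
# Same pattern as in the command above: match URLs that do NOT contain
# "reddit" or "github" anywhere in the string.
pattern='^(?!.*(reddit|github)).*'

# Made-up sample URLs, for illustration only.
urls='https://example.com/some/article
https://www.reddit.com/r/selfhosted/
https://github.com/ArchiveBox/ArchiveBox
https://news.example.org/2023/story'

# Lines that match the pattern are the ones the filter would select for removal.
matched="$(printf '%s\n' "$urls" | grep -P "$pattern")"
printf '%s\n' "$matched"
```

Only the URLs without `reddit` or `github` should be printed, which mirrors the intent of keeping those two sites while purging everything else.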
