mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #1092] Question: How can I automatically remove archived pages after a certain amount of time #2197
Originally created by @DominoDrifter on GitHub (Feb 4, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1092
My use case for ArchiveBox is keeping a copy of the last 3 months' worth of sites I've visited available offline. I don't really need anything beyond that, as for me it's just wasting space.
I can't find any option for auto-removing in the UI, so I next looked to the CLI so I can set up a cron job to purge anything older than 3 months, but I don't know where to start. My thoughts are something like this, but the fact that I can't see anything in the docs for what I'd expect to be a fairly common feature is setting off alarm bells that I've missed something obvious:
docker exec -ti archivebox archivebox list | sort <some sort options> | archivebox remove
@pirate commented on GitHub (Feb 4, 2023):
First time I've ever gotten this feature request 🤷 ArchiveBox is built more for long-term storage than short-term, but you can definitely just run archivebox remove in a cronjob. archivebox remove has a number of built-in filtering and sorting options like --before and --after, so you don't need to chain list | sort | remove. Then just put your desired removal command in cron to run it every n days or so.
@DominoDrifter commented on GitHub (Feb 4, 2023):
Wow, the man himself, thanks @pirate! I had no idea you could run the help flag on a subcommand like remove; that explains a lot. I'll have a play!
@DominoDrifter commented on GitHub (Mar 25, 2023):
Just in case this comes in handy for anyone else with a similar use case to mine (i.e. using the browser extension to save every site I visit, but then only keep certain sites for longer than 6 months)
Below is what you would use to purge anything from ArchiveBox older than 6 months, except for any URLs containing reddit or github:
archivebox remove --before "$(date -d "$(date -d '-6 month' '+%Y-%m-%d') 00:00:00" +%s).0" --filter-type regex '^(?!.*(reddit|github)).*'
If you run ArchiveBox in Docker, rather than getting a bash shell in the container you can run it from the host like this (useful for a daily cron job):
docker exec --user archivebox archivebox archivebox remove --yes --delete --before "$(date -d "$(date -d '-6 month' '+%Y-%m-%d') 00:00:00" +%s).0" --filter-type regex '^(?!.*(reddit|github)).*'
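For anyone putting this into cron, it may be easier to wrap the command in a small script. In a crontab entry, unescaped % characters are treated as newlines, so the date format strings above would need escaping if pasted in directly. The following is only a sketch under some assumptions: the function names cutoff_timestamp and purge_older_than are made up for illustration, GNU date is assumed (the -d option is not available in BSD/macOS date), and the container name and flags are simply copied from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper script; a sketch, not an official ArchiveBox tool.

# Prints the Unix timestamp of midnight N months ago, with the ".0" suffix
# used in the commands above. Assumes GNU date (-d is not portable).
cutoff_timestamp() {
    months="$1"
    day="$(date -d "-${months} month" '+%Y-%m-%d')"
    printf '%s.0\n' "$(date -d "${day} 00:00:00" '+%s')"
}

# Builds the same remove command as above. By default it only prints the
# command; pass --run as the second argument to actually execute it.
purge_older_than() {
    cutoff="$(cutoff_timestamp "$1")"
    mode="${2:-}"
    set -- docker exec --user archivebox archivebox archivebox remove \
        --yes --delete --before "$cutoff" \
        --filter-type regex '^(?!.*(reddit|github)).*'
    if [ "$mode" = "--run" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Dry run: prints the command without deleting anything.
purge_older_than 6
```

Once the printed command looks right, changing the last line to purge_older_than 6 --run and pointing a daily cron entry at the script should give the same behaviour as the one-liner above.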