mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #179] Add official support for taking multiple snapshots of websites over time #3145
Originally created by @pirate on GitHub (Mar 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/179
This is by far the most requested feature.
People want an easy way to take multiple snapshots of websites over time.
For people finding this issue via Google / incoming links: if you want a hacky solution to take a second snapshot of a site, you can add the link with a new hash, and it will be treated as a new page, so a new snapshot will be taken:
Edit: as of v0.6 there is now a button in the UI to do this ^

@n0ncetonic commented on GitHub (Mar 19, 2019):
Looking forward to this feature. Thanks for the hacky workaround as well; I have a few pages I'd like to continue monitoring for new content, but I was worried about the implications of my current backup being overwritten by a 404 page if the content went down.
@pirate commented on GitHub (Mar 19, 2019):
I just updated the README to make the current behavior clearer as well:
@alex9099 commented on GitHub (Aug 1, 2020):
Any updates on this? It would be really nice if it were possible to have versions, like the Wayback Machine does :)
@pirate commented on GitHub (Aug 1, 2020):
You can accomplish this right now still by adding a hash at the end of the URL, e.g.
Official first-class support for multiple snapshots is still on the roadmap, but don't expect it anytime in the next month or two, it's quite a large feature with big implications for how we store and dedupe snapshot data internally.
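The hash workaround pirate describes can be sketched as follows. This is a minimal illustration, not an official feature; the URL is a placeholder and the fragment can be any string not used before:

```shell
# Sketch of the hash workaround (hypothetical URL; assumes the ArchiveBox CLI
# is installed). A unique #fragment makes ArchiveBox treat the URL as a
# brand-new page, so it takes a fresh snapshot instead of deduplicating
# against the existing one.
URL="https://example.com"
STAMP=$(date +%F)                     # e.g. 2020-08-01
SNAP_URL="${URL}#${STAMP}"
echo "archivebox add '${SNAP_URL}'"   # the command to run for a fresh snapshot
```

Running the printed `archivebox add` command again later with a different fragment produces another independent snapshot of the same page.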
@TheOneValen commented on GitHub (Aug 4, 2020):
Would be nice if there also was a migration from the hash-date-hack to the first-class support.
@Spacewalker2 commented on GitHub (Jan 23, 2021):
Do I get this right? At the point this is available, I can for example add a URL (not a feed) like
archivebox schedule --every=day 'http://example.com/static.html'
and the URL gets archived every day. If there are changes, then ArchiveBox provides diffs of them. Will it be possible for ArchiveBox to notify me of changes, maybe by using the local MTA?
@pirate commented on GitHub (Jan 23, 2021):
Scheduled archiving will not re-archive the initial page if snapshots already exist. The archivebox schedule feature is meant to be used with the --depth=1 flag to pull in new links from a source like an RSS feed, bookmarks export file, or HTML page with some links in it, without re-archiving the initial page itself (it re-pulls it to do the crawl, but will not re-snapshot it).
ArchiveBox has no first-class support for taking multiple snapshots or any built-in diffing system, only the #hash hack mentioned above. It's still on the roadmap but not expected anytime soon due to the architectural complexity. If you absolutely need multiple snapshots of the same pages over time, I recommend checking out some of the other tools available on our community wiki: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives
@Spacewalker2 commented on GitHub (Jan 24, 2021):
Thanks for the quick answer and the very cool application! I already run an ArchiveBox instance on my FreeNAS, and it fits the purpose perfectly. Having the feature described above would be a nice extra. I asked because something like diffs is mentioned on the archivebox.io website itself. If archivebox schedule does not fit here, then maybe running archivebox add with some other time-based job scheduler would be possible too. I look forward to it.
BTW: I hope ArchiveBox will end up in the FreeNAS/TrueNAS plugins section some time. Having ArchiveBox available there with one or two clicks would be very nice.
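Spacewalker2's fallback idea (driving archivebox add from an external scheduler) could look like this hypothetical crontab entry, combining cron with the hash workaround from earlier in the thread. The URL is a placeholder; note that % is special in crontab syntax and must be escaped:

```shell
# Hypothetical crontab entry (not an official ArchiveBox feature):
# re-snapshot one page daily by appending the current date as a URL fragment.
# In crontab, an unescaped % is treated as a newline, so write \% instead.
0 3 * * * archivebox add "http://example.com/static.html#$(date +\%F)"
```

Each nightly run produces a URL with a new fragment, which ArchiveBox treats as a distinct page and snapshots separately.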
@pirate commented on GitHub (Apr 10, 2021):
This is now added in v0.6. It's not full support, but it's a step in the right direction. I just added a UI button labeled Re-snapshot that automates the process of creating a new snapshot with a bumped timestamp in the URL hash. I could also add a flag called --resnapshot or --duplicate that automates this step when archiving via the CLI too.
Then later, when we add real multi-snapshot support, we can migrate all the Snapshots with timestamps in their hashes to the new system automatically.
@GlassedSilver commented on GitHub (Jun 7, 2021):
Sometimes websites remove pages and redirect them to something completely different.
IDK, an example I could think of: if you try to visit the original URL for the Xbox 360 sub-page on xbox.com these days, I think you'll get redirected to the Xbox One S page, presumably on the logic of "this is old-ish and it's cheap-ish, have this instead???! kthx".
Try it for yourself:
http://xbox.com/en-US/xbox-360
Redirects at the time of writing to:
https://www.xbox.com/en-US/consoles/xbox-one-s
Not sure if the server sends some HTTP error code along with the redirect (???).
Just thought about this issue for a few days and wondered what the strategy is for either a no-error redirect and an error-returning redirect.
Also, I would consider being VERY careful about dropping URLs from the automated re-archival process on too many fails. It's not very uncommon for a site to go missing for months sometimes and then to come back. I'm not talking about the leagues of Microsoft, but fan sites, hobby projects, niche software developers who do it in their spare time and missed renewing their domain name registration and catching it a bit late, etc...
There are all sorts of sticks that can be thrown at you where a simple KO on 3 errors would lead to silent discontinuation of archiving of something that's only temporarily not there. Maybe asking for user confirmation at least per domain would be the best approach:
e.g.:
Edit: Microsoft does return a 301 Moved Permanently status. That's kind of them; not sure how much we can rely on this in the real world? Anyone with ample experience in this?
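On the question above about what a redirect actually sends: a crawler can see the 301 status itself by not following the redirect. A minimal sketch using only Python's standard library (this is an illustration of the HTTP mechanics, not ArchiveBox's actual implementation):

```python
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects, so the original
    status code (e.g. 301 Moved Permanently) stays visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError for 3xx

def check_redirect(url, timeout=10):
    """Return (status_code, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=timeout)
        return resp.status, None  # normal 2xx response, no redirect
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")

# Network example (behavior as reported in this thread; may have changed since):
# check_redirect("http://xbox.com/en-US/xbox-360")
```

A 301 vs. a 302 could then drive different re-archival policies (e.g. permanently update the stored URL vs. keep retrying the original).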
@agnosticlines commented on GitHub (Jul 24, 2022):
Hi again,
Would you consider making a Patreon or linking your PayPal? There are some features like this one I'd absolutely love to support, and I would be happy to give you a few hundred bucks for them. I understand it's not a big corporate sponsorship; I'm just one person who loves using ArchiveBox, and this plus the security fixes would make it perfect. If you're busy with other projects I totally understand, hopefully I'm not being too demanding. It's an incredible project and allows me to organise my research collections easily, it's a game changer :-)
@pirate commented on GitHub (Aug 2, 2022):
Thanks for the support @agnosticlines, I got your donation! <3 (All the donation info is here for future reference: https://github.com/ArchiveBox/ArchiveBox/wiki/Donations)
This is still high on my priority list, but development speed is slow these days; I only have a day or so per month to dedicate to this project, and most of it is taken up by bugfixes. Occasionally I have a month where I sprint and do a big release, but I can't make promises on the timeline for when this particular feature will be released.
@sysfu commented on GitHub (Apr 11, 2023):
Kennedy mentioned he was working on one a few years back https://github.com/ArchiveBox/ArchiveBox/issues/760#issuecomment-860121544
@pirate commented on GitHub (Jan 8, 2026):
This is now implemented on dev: we now allow multiple snapshots per URL (at the db level they just need to be under separate crawls)!
archivebox add 'https://example.com'
then a few seconds later
archivebox add 'https://example.com'
should work fine now; it will automatically create a separate crawl for every add run.
It's not stable yet, so don't rush to upgrade big collections, but testing help and bugfix PRs are always appreciated. Please open new issues for any specific bugs encountered with the new dev design!
There is also no migration needed from the old hash-date workaround; essentially, those will just remain as separate snapshots in the new system. We don't do anything to link snapshots of the same URL together, but if you search by URL they should all show up normally.