[GH-ISSUE #223] Architecture: Serverless oneshot archiving #1661

Closed
opened 2026-03-01 17:52:39 +03:00 by kerem · 3 comments

Originally created by @awendland on GitHub (Apr 27, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/223

First off, thanks for this amazing library! I've recently gotten into an archiving binge and this plugin is the ideal in almost every way.

Type

  • [ ] General Question or Discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves

To reduce maintenance requirements when hosting ArchiveBox, I was hoping to be able to leverage serverless solutions, such as AWS Lambda and S3 (Lambda letting me not care about OS patching and S3 providing redundant storage).

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

It would be awesome if ArchiveBox supported S3 as a storage backend. I believe the work in https://github.com/pirate/ArchiveBox/issues/177 will make the Lambda deployment easy enough.

AFAICT the website is generated but served statically from then on, so S3 would make a great host for the content. The only complications would come in needing to fetch the DB files, configs, and other data that the program touches when it archives a new site.

What hacks or alternative solutions have you tried to solve the problem?

I haven't explored the codebase yet to check out feasibility, but I wanted to get this on your mind now while you're thinking about the large refactors you're currently undertaking. I'll be looking into implementing this feature a couple weeks from now.

From a cursory glance, it looks like some file system operations would need to be abstracted behind an interface that exposes that data explicitly. Different implementations could then provide that interface, such as a file-system-backed implementation or an S3-backed implementation.
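
A minimal sketch of what such an abstraction could look like, in Python. Everything here is hypothetical and not taken from the ArchiveBox codebase: the `StorageBackend` interface, the `LocalStorage` and `InMemoryStorage` classes, the `save_snapshot` helper, and the `archive/<id>/index.html` key layout are all assumptions made for illustration. `InMemoryStorage` stands in for an object store such as S3 so the example runs without any cloud dependency.

```python
from pathlib import Path
from typing import Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical storage interface; not part of ArchiveBox."""

    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...
    def exists(self, key: str) -> bool: ...


class LocalStorage:
    """Stores each key as a file below a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / key

    def read(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self._path(key)
        # Create intermediate directories so nested keys work like object keys.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def exists(self, key: str) -> bool:
        return self._path(key).is_file()


class InMemoryStorage:
    """Stand-in for an object store such as S3: a flat key -> bytes mapping."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self.objects[key]

    def write(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def exists(self, key: str) -> bool:
        return key in self.objects


def save_snapshot(storage: StorageBackend, snapshot_id: str, html: bytes) -> str:
    """Write one archived page through whichever backend is supplied."""
    key = f"archive/{snapshot_id}/index.html"  # hypothetical key layout
    storage.write(key, html)
    return key
```

The calling code only depends on the three methods, so an S3-backed class with the same methods could be swapped in without touching `save_snapshot`.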

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

I'd love to work on this feature! I just want to make sure you're bought in so I can upstream things.


  • [x] I'm willing to contribute to development / fixing this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend

I'm hoping to create a Terraform/CloudFormation/etc. config that people can easily use to deploy a serverless version of ArchiveBox for themselves.

Thanks again for all the work you've put into this library, it's providing such a cool service!


@pirate commented on GitHub (Apr 30, 2019):

I think it's going to be quite tricky to do this serverless, as Chrome has a huge warmup time and a high memory footprint, and the archiving process is very stateful, with lots of the state shared between each page archiving process.

I'll leave this ticket open to track feasibility, but I wouldn't get your hopes up for having serverless archiving anytime soon.

One thing you may be interested in is archivebox oneshot, which is a new command added in >v0.4.0 that will allow archiving individual sites without creating a master index. oneshot could potentially run serverless, but you'd have to have a main indexer process running elsewhere at some point if you want to have the pretty HTML main index for all the archived pages.


@pirate commented on GitHub (Jul 24, 2020):

I'm going to close this and say we're officially not going to support S3 as a storage backend. It would add a lot of complexity for relatively little gain when you can use something like a FUSE filesystem to write to S3, or just write to a temp filesystem and upload to S3 after ArchiveBox is done running, using Duplicati or a home-written script.

That being said, the new archivebox oneshot command is still very much on the roadmap, you can follow progress on that here: https://github.com/pirate/ArchiveBox/issues/320.
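
For the "write to a temp filesystem, then upload" route mentioned above, a home-written script might look roughly like the following Python sketch. The `upload_tree` function, its parameters, and the `my-archive-bucket` name are hypothetical and not part of ArchiveBox. The upload step is passed in as a callable so the directory walk can be tried without AWS credentials; the commented lines show one way it could be wired to boto3's `upload_file`, assuming boto3 is installed and configured.

```python
from pathlib import Path
from typing import Callable, List


def upload_tree(root: Path, prefix: str, upload: Callable[[str, str], None]) -> List[str]:
    """Call upload(local_path, key) for every file under root and return the keys.

    Keys use forward slashes and are prefixed with `prefix`, mirroring how
    object stores such as S3 name objects.
    """
    root = Path(root)
    clean_prefix = prefix.strip("/")
    keys: List[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # directories have no object-store equivalent
        rel = path.relative_to(root).as_posix()
        key = f"{clean_prefix}/{rel}" if clean_prefix else rel
        upload(str(path), key)
        keys.append(key)
    return keys


# Possible wiring to S3 (assumes boto3 and a bucket you own; names are placeholders):
#
#   import boto3
#   s3 = boto3.client("s3")
#   upload_tree(Path("/tmp/archivebox-out"), "archive",
#               lambda local_path, key: s3.upload_file(local_path, "my-archive-bucket", key))
```

This only copies files one way after a run finishes; it does not delete remote objects or detect unchanged files, which a tool like Duplicati or a FUSE mount would handle differently.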


@reelsense commented on GitHub (Feb 9, 2021):

Cartulary is a digital archiver. But, you could also call it a social network in a box. It's an RSS reader, RSS aggregator, readability tool, article archiver, microblogger, social graph manager and reading list manager.

It looks like it's designed only for S3.
https://github.com/daveajones/cartulary
