mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #223] Architecture: Serverless oneshot archiving #1661
Originally created by @awendland on GitHub (Apr 27, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/223
First off, thanks for this amazing library! I've recently gotten into an archiving binge and this plugin is the ideal in almost every way.
What is the problem that your feature request solves?
To reduce maintenance requirements when hosting ArchiveBox, I was hoping to be able to leverage serverless solutions, such as AWS Lambda and S3 (Lambda letting me not care about OS patching and S3 providing redundant storage).
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
It would be awesome if ArchiveBox supported S3 as a storage backend. I believe the work in https://github.com/pirate/ArchiveBox/issues/177 will make the Lambda deployment easy enough.
AFAICT the website is generated but served statically from then on, so S3 would make a great host for the content. The only complications would come in needing to fetch the DB files, configs, and other data that the program touches when it archives a new site.
What hacks or alternative solutions have you tried to solve the problem?
I haven't explored the codebase yet to check out feasibility, but I wanted to get this on your mind now while you're thinking about the large refactors you're currently undertaking. I'll be looking into implementing this feature a couple weeks from now.
From a cursory glance, it looks like some file system operations would need to be abstracted behind an interface exposing that data explicitly, and then different implementations could provide that interface, such as a file system backed implementation or a S3 backed implementation.
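A minimal sketch of that abstraction idea (all names here are hypothetical and not part of ArchiveBox's actual codebase): a small storage interface the archiver writes through, with a local-filesystem implementation; an S3-backed class could implement the same three methods with boto3's get_object/put_object/head_object.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ArchiveStorage(ABC):
    """Hypothetical storage interface; not ArchiveBox's real API."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def exists(self, key: str) -> bool: ...


class FilesystemStorage(ArchiveStorage):
    """Stores archive files under a local root directory."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / key

    def read(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def exists(self, key: str) -> bool:
        return self._path(key).exists()


# An S3Storage implementing the same methods would let the archiver
# stay agnostic about where the bytes actually live.
```

With something like this in place, swapping backends becomes a configuration choice rather than a code change.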
How badly do you want this new feature?
I'd love to work on this feature! I just want to make sure you're bought in so I can upstream things.
I'm hoping to create a Terraform/CloudFormation/etc. config that people can easily use to deploy a serverless version of ArchiveBox for themselves.
Thanks again for all the work you've put into this library, it's providing such a cool service!
@pirate commented on GitHub (Apr 30, 2019):
I think it's going to be quite tricky to do this serverless, as Chrome has a huge warmup time and a high memory footprint, and the archiving process is very stateful, with much of the state shared between page archiving runs.
I'll leave this ticket open to track feasibility, but I wouldn't get your hopes up for having serverless archiving anytime soon.
One thing you may be interested in is archivebox oneshot, a new command added in >v0.4.0 that will allow archiving individual sites without creating a master index. oneshot could potentially run serverless, but you'd have to have a main indexer process running elsewhere at some point if you want to have the pretty HTML main index for all the archived pages.
@pirate commented on GitHub (Jul 24, 2020):
I'm going to close this and say we're officially not going to support S3 as a storage backend. It would add a lot of complexity for relatively little gain, when you can use something like a FUSE filesystem to write to S3, or just write to a temp filesystem and upload to S3 after ArchiveBox finishes running, using Duplicati or a homegrown script.
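The temp-filesystem-then-upload approach described above is straightforward to script. A minimal sketch (the function name and the local mirror directory are stand-ins of mine, not anything ArchiveBox ships): run archivebox normally into a scratch data directory, then push the results out when it exits.

```python
import shutil
from pathlib import Path


def upload_archive(data_dir: str, mirror_dir: str) -> list[str]:
    """Mirror every file under data_dir into mirror_dir and return
    the relative paths that were copied.

    The copy is a stand-in for the real upload step: in practice the
    shutil.copy2 call could be replaced by an S3 put (e.g. boto3's
    put_object, with the relative path as the object key), or the
    whole function by `aws s3 sync data_dir s3://bucket/`.
    """
    src, dst = Path(data_dir), Path(mirror_dir)
    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            rel = path.relative_to(src)
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(str(rel))
    return copied
```

Running archivebox inside the data directory and then calling something like upload_archive afterwards keeps the archiving process itself unchanged, which is what makes this pattern simpler than a native S3 backend.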
That being said, the new archivebox oneshot command is still very much on the roadmap; you can follow progress on that here: https://github.com/pirate/ArchiveBox/issues/320.
@reelsense commented on GitHub (Feb 9, 2021):
Cartulary is a digital archiver, but you could also call it a social network in a box. It's an RSS reader, RSS aggregator, readability tool, article archiver, microblogger, social graph manager, and reading list manager.
It looks like it's designed only for S3.
https://github.com/daveajones/cartulary