[GH-ISSUE #902] Refactor ArchiveResult filesystem calls to go through Django file storage backend #561

Open
opened 2026-03-01 14:44:34 +03:00 by kerem · 8 comments
Owner

Originally created by @pirate on GitHub (Dec 16, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/902

Instead of this:

from django.db import models

class ArchiveResult(models.Model):
    path = models.CharField(...)

ArchiveResult(path='./archive/warc/somefile.warc.gz')

We should be doing this:

from pathlib import Path
from django.db import models

class ArchiveResult(models.Model):
    path = models.FileField(...)

ArchiveResult(path=Path('./archive/warc/somefile.warc.gz'))

settings.py:

MEDIA_URL = 'archive/'   # Django requires a non-empty MEDIA_URL to end with a slash
MEDIA_ROOT = 'archive'

DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'
# or
# DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'   # or DigitalOcean, SFTP, etc.

https://django-storages.readthedocs.io/en/latest/
https://docs.djangoproject.com/en/4.0/topics/files/

related: https://github.com/ArchiveBox/ArchiveBox/issues/788
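
Note for readers on newer Django versions: `DEFAULT_FILE_STORAGE` was deprecated in Django 4.2 in favor of the `STORAGES` setting and removed in Django 5.1. A rough equivalent of the settings above might look like the sketch below. It is not taken from the ArchiveBox codebase, and the bucket name is a placeholder.

```python
# settings.py (sketch for Django >= 4.2; an assumption, not ArchiveBox's actual config)
MEDIA_URL = "archive/"   # a non-empty MEDIA_URL must end with a slash
MEDIA_ROOT = "archive"

STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
        # or, with django-storages installed:
        # "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
        # "OPTIONS": {"bucket_name": "example-archive-bucket"},  # placeholder bucket
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}
```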


@mAAdhaTTah commented on GitHub (Dec 16, 2021):

How do you imagine this interacting with the extractors? Will they write via Django storage?


@pirate commented on GitHub (Dec 16, 2021):

Yeah, I have a big re-architecting plan to move to a plugin-style hook system that allows components to modify any ArchiveBox behavior, including filesystem calls, extractors, parsing, etc.
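
To make the question about extractors concrete, here is a minimal sketch of an extractor that writes through a storage object instead of calling the filesystem directly. Everything in it is hypothetical: the `Storage` protocol is only loosely modeled on the `save`/`exists`/`open` methods of Django's storage API, and `LocalStorage`, `MemoryStorage`, and `run_warc_extractor` are illustrative names that do not come from ArchiveBox.

```python
import io
from pathlib import Path
from typing import BinaryIO, Protocol


class Storage(Protocol):
    """Hypothetical minimal interface, loosely modeled on Django's storage API."""

    def save(self, name: str, content: BinaryIO) -> str: ...
    def exists(self, name: str) -> bool: ...
    def open(self, name: str) -> BinaryIO: ...


class LocalStorage:
    """Stores files under a root directory on the local disk."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def save(self, name: str, content: BinaryIO) -> str:
        target = self.root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content.read())
        return name

    def exists(self, name: str) -> bool:
        return (self.root / name).exists()

    def open(self, name: str) -> BinaryIO:
        return (self.root / name).open("rb")


class MemoryStorage:
    """Keeps files in a dict; stands in for a remote backend such as S3."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def save(self, name: str, content: BinaryIO) -> str:
        self.files[name] = content.read()
        return name

    def exists(self, name: str) -> bool:
        return name in self.files

    def open(self, name: str) -> BinaryIO:
        return io.BytesIO(self.files[name])


def run_warc_extractor(storage: Storage, snapshot_id: str, payload: bytes) -> str:
    """Hypothetical extractor: it only talks to the storage object it is given."""
    name = f"{snapshot_id}/warc/somefile.warc.gz"
    return storage.save(name, io.BytesIO(payload))
```

Swapping `LocalStorage` for `MemoryStorage` (or a real remote backend) requires no change to the extractor. Extractors that shell out to external tools would presumably still need a local scratch directory whose contents are handed to the storage backend afterwards; that step is not shown here.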


@pirate commented on GitHub (Dec 16, 2021):

Actually, I'm curious what you think of the new architecture. Do you find this intuitive, @mAAdhaTTah? https://gist.github.com/pirate/7193ab54557b051aa1e3a83191b69793


@caj-larsson commented on GitHub (Jul 15, 2022):

Interesting. I've been thinking about building a pluggable extractor, but if you are already working on this at the scale of the whole application, I could help out with that instead.

I don't quite understand how available hooks in the internals will be defined.
Is it correct that, because of the deep merge behavior, the last plugin is called first?


@opentyler commented on GitHub (Aug 4, 2022):

This seems like a great solution to #940. Being able to use external storage would be incredible.

What do you think are the pain points for integrating this solution? I would love to be able to help.


@brendanberg commented on GitHub (Dec 2, 2023):

I'm very interested in the ability to configure storage backends—specifically S3. Is there a way I can help out with implementing this?


@pirate commented on GitHub (Dec 2, 2023):

The main blocker is figuring out how to migrate from the existing paths to the new storage backend style. I'd like to do a proper data migration using the Django migrations system, but it takes some work to make sure it's atomic or resumable in the case of interruption (if any files need to move on disk).

Ideally the first-pass implementation wouldn't move files around at all; it would just migrate the column to a file field. But I'd like to move all the extractor outputs into dedicated subfolders later, so I want to make sure safe file moving is possible.

If you want to investigate the best practices around file moving during migrations and report back, that would be helpful!
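
As one possible starting point for that investigation, the sketch below shows a common pattern for making file moves resumable: copy to a temporary name, rename into place with `os.replace` (atomic within a single filesystem), and record each finished move in a small journal file. The function names and the JSON journal format are assumptions made for illustration, not an agreed design. Such a helper could be called from a `RunPython` data migration, keeping in mind that a database rollback cannot undo file moves.

```python
import json
import os
import shutil
from pathlib import Path


def _write_journal(journal_path: Path, done: set[str]) -> None:
    """Persist the set of completed moves, replacing the old journal atomically."""
    tmp = journal_path.with_name(journal_path.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    os.replace(tmp, journal_path)


def move_files_resumably(moves: list[tuple[Path, Path]], journal_path: Path) -> None:
    """Move each (src, dst) pair, recording progress so an interrupted run can resume."""
    done: set[str] = set()
    if journal_path.exists():
        done = set(json.loads(journal_path.read_text()))

    for src, dst in moves:
        key = str(src)
        if key in done:
            continue  # already finished in an earlier run
        if src.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            tmp = dst.with_name(dst.name + ".partial")
            shutil.copy2(src, tmp)  # copy first so src stays intact on failure
            os.replace(tmp, dst)    # atomic rename within one filesystem
            src.unlink()
        elif not dst.exists():
            raise FileNotFoundError(f"neither {src} nor {dst} exists")
        # If src is gone and dst exists, the move completed but was never journaled.
        done.add(key)
        _write_journal(journal_path, done)
```

Re-running the function after an interruption skips journaled entries and repeats at most the one move that was in flight, which is safe because the source is only deleted after the destination is in place.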


@pirate commented on GitHub (Apr 10, 2024):

FYI all, I just created a new Wiki page covering how to set up ArchiveBox with a remote filesystem: https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage

I confirmed myself that it works with Amazon S3, Backblaze B2, SFTP, SMB, and NFS. It should also work with Google Drive, OneDrive, Dropbox, and all the other platforms that Rclone supports.

This allows us to use remote filesystems for now without having to change the codebase or implement the django-storages changes discussed earlier.
