starred/ArchiveBox


mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 17:16:00 +03:00

[GH-ISSUE #902] Refactor ArchiveResult filesystem calls to go through Django file storage backend #561


Open

opened 2026-03-01 14:44:34 +03:00 by kerem · 8 comments

kerem commented

2026-03-01 14:44:34 +03:00

Owner

Originally created by @pirate on GitHub (Dec 16, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/902

Instead of this:

```python3
from django.db import models

class ArchiveResult(models.Model):
    path = models.CharField(...)  # path stored as a plain string

ArchiveResult(path='./archive/warc/somefile.warc.gz')
```

We should be doing this:

```python3
from pathlib import Path
from django.db import models

class ArchiveResult(models.Model):
    path = models.FileField(...)  # path resolved through the storage backend

ArchiveResult(path=Path('./archive/warc/somefile.warc.gz'))
```

`settings.py`:

```python3
MEDIA_URL = 'archive/'   # Django requires MEDIA_URL to end with a slash
MEDIA_ROOT = 'archive'

DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'
# or
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'   # or DigitalOcean, SFTP, etc.
```

https://django-storages.readthedocs.io/en/latest/
https://docs.djangoproject.com/en/4.0/topics/files/
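
For anyone trying this on a newer Django: `DEFAULT_FILE_STORAGE` was deprecated in Django 4.2 and removed in 5.1 in favor of the `STORAGES` setting. The fragment below is a rough sketch of what the equivalent configuration might look like. The bucket name is a placeholder and `S3_DEFAULT` is only an illustrative variable, neither is an ArchiveBox setting, and the S3 variant assumes django-storages is installed.

```python
# settings.py (sketch only; values are placeholders, not ArchiveBox defaults)
MEDIA_URL = "archive/"
MEDIA_ROOT = "archive"

STORAGES = {
    # Backend used by FileField when no explicit storage is passed.
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    # Keep the staticfiles alias when overriding STORAGES.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}

# Remote variant: only the "default" entry would change.
S3_DEFAULT = {
    "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
    "OPTIONS": {"bucket_name": "example-archivebox-bucket"},  # hypothetical bucket
}
```

Switching backends would then mean replacing `STORAGES["default"]` with `S3_DEFAULT`, leaving model code untouched.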

related: https://github.com/ArchiveBox/ArchiveBox/issues/788

kerem added the labels touches: configuration, touches: data/schema/architecture, touches: dependencies/packaging on 2026-03-01 14:44:34 +03:00


@mAAdhaTTah commented on GitHub (Dec 16, 2021):

How do you imagine this interacting with the extractors? Will they write via Django storage?



@pirate commented on GitHub (Dec 16, 2021):

Yeah, I have a big re-architecting plan to move to a plugin-style hooks system that allows components to modify any ArchiveBox behavior, including filesystem calls, extractors, parsing, etc.
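
To make the extractor question concrete, here is one way it could look if extractors handed their output to a swappable storage object instead of writing into the archive folder directly. This is only a sketch under assumptions: `Storage`, `LocalStorage`, and `run_extractor` are hypothetical names, loosely modelled on the `save()`/`exists()` methods of Django's storage classes, and are not taken from ArchiveBox or from the planned hooks system.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """Hypothetical minimal storage interface (not Django's actual class)."""

    def save(self, name: str, content: bytes) -> str: ...
    def exists(self, name: str) -> bool: ...


class LocalStorage:
    """Stores files under a root directory on local disk."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def save(self, name: str, content: bytes) -> str:
        target = self.root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        return name

    def exists(self, name: str) -> bool:
        return (self.root / name).exists()


def run_extractor(storage: Storage, snapshot_id: str) -> str:
    """Hypothetical extractor: write output to a scratch dir, then hand it to storage."""
    with tempfile.TemporaryDirectory() as workdir:
        output = Path(workdir) / "somefile.warc.gz"
        output.write_bytes(b"fake warc bytes")  # stand-in for real extractor output
        return storage.save(f"{snapshot_id}/warc/{output.name}", output.read_bytes())


# Usage: the extractor never needs to know where the bytes end up.
archive_root = Path(tempfile.mkdtemp())
saved_name = run_extractor(LocalStorage(archive_root), "example-snapshot")
print(saved_name)
```

An S3-backed class with the same two methods could be passed to `run_extractor` without changing the extractor itself.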



@pirate commented on GitHub (Dec 16, 2021):

Actually, I'm curious what you think of the new architecture. Do you find this intuitive, @mAAdhaTTah? https://gist.github.com/pirate/7193ab54557b051aa1e3a83191b69793



@caj-larsson commented on GitHub (Jul 15, 2022):

Interesting. I've been thinking about building a pluggable extractor, but if you are already working on this at the scale of the whole application, I could help out with that instead.

I don't quite understand how available hooks in the internals will be defined.
Is it correct that, because of the deep merge behavior, the last plugin is called first?
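
To illustrate the ordering question, here is a small sketch of what a deep merge does to hook definitions. It is an assumption about the general technique, not a description of the gist: `deep_merge`, the plugin dicts, and the hook names are all hypothetical. Under this reading, the plugin loaded last overrides earlier values for the same key, so its hook is the one that gets called, and earlier hooks only run if it delegates to them.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where nested keys in `override` replace those in `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # later value wins
    return merged


def default_save(path: str) -> str:
    return f"local:{path}"


def s3_save(path: str) -> str:
    return f"s3:{path}"


# Hypothetical plugin definitions, listed in load order.
core_plugin = {"hooks": {"save_file": default_save, "parse": str.strip}}
s3_plugin = {"hooks": {"save_file": s3_save}}

resolved: dict[str, Any] = {}
for plugin in (core_plugin, s3_plugin):
    resolved = deep_merge(resolved, plugin)

# The last-loaded plugin's save_file replaces the core one; parse is untouched.
print(resolved["hooks"]["save_file"]("warc/somefile.warc.gz"))  # s3:warc/somefile.warc.gz
```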



@opentyler commented on GitHub (Aug 4, 2022):

This seems like a great solution to #940. Being able to use external storage would be incredible.

What do you think are the pain points for integrating this solution? I would love to help.



@brendanberg commented on GitHub (Dec 2, 2023):

I'm very interested in the ability to configure storage backends—specifically S3. Is there a way I can help out with implementing this?



@pirate commented on GitHub (Dec 2, 2023):

The main blocker is figuring out how to migrate from the existing paths to the new storage backend style. I'd like to do a proper data migration using the Django migrations system, but it takes some work to make sure it's atomic or resumable if interrupted (in case any files need to move on disk).

Ideally, the first-pass implementation wouldn't move files around at all; it would just migrate the column to a file field. But I'd like to move all the extractor outputs into dedicated subfolders later, so I want to make sure safe file moving is possible.

If you want to investigate the best practices around file moving during migrations and report back, that would be helpful!
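
As a starting point for that investigation, here is a minimal sketch of a file move that can be re-run safely after an interruption. It uses only the standard library. `move_file_resumable` and `migrate_paths` are hypothetical helpers, not ArchiveBox code, and the sketch assumes source and destination are on the same filesystem so that `os.replace` is an atomic rename (as on POSIX). Something like this could be called from a `RunPython` step in a Django data migration.

```python
import os
import shutil
from pathlib import Path


def move_file_resumable(src: Path, dst: Path) -> None:
    """Move src to dst so that re-running after an interruption is safe."""
    if not src.exists():
        if dst.exists():
            return  # already finished on an earlier run
        raise FileNotFoundError(f"neither {src} nor {dst} exists")

    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".partial")

    # Copy into a temporary name first, so dst is never a half-written file.
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # push the bytes to disk before publishing

    os.replace(tmp, dst)  # atomic rename within one filesystem
    src.unlink()          # drop the source only once dst is in place


def migrate_paths(moves: list[tuple[Path, Path]]) -> int:
    """Apply each (src, dst) move; returns how many entries were processed."""
    processed = 0
    for src, dst in moves:
        move_file_resumable(src, dst)
        processed += 1
    return processed
```

If the process dies during the copy, only a `.partial` file is left behind and the next run overwrites it. If it dies after the rename but before the unlink, the next run copies again and removes the source. A remote storage backend would need a different strategy, since rename is not atomic on most object stores.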



@pirate commented on GitHub (Apr 10, 2024):

FYI all, I just created a new Wiki page covering how to set up ArchiveBox with a remote filesystem: https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage

I confirmed myself that it works with Amazon S3, Backblaze B2, SFTP, SMB, and NFS. It should also work with Google Drive, OneDrive, Dropbox, and all the other platforms that Rclone supports.

This allows us to use remote filesystems for now without having to change the codebase or implement the `django-storages` changes discussed earlier.


kerem referenced this issue

2026-03-01 17:54:26 +03:00

[GH-ISSUE #561] Question: Import an Existing ArchiveBox Archive Folder? What Am I Missing?!? #1866

kerem referenced this issue

2026-03-14 22:30:18 +03:00

[GH-ISSUE #561] Question: Import an Existing ArchiveBox Archive Folder? What Am I Missing?!? #3375
