[GH-ISSUE #323] Feature Request: Provide a utility to sync archived folders and index #1743

Closed
opened 2026-03-01 17:53:18 +03:00 by kerem · 5 comments
Owner

Originally created by @Haocen on GitHub (Mar 2, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/323

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

  1. Clean up duplicate archive folders if they contain URLs that already exist in index.json.
  2. Add archive folders back to index.json if they are missing from it.
  3. Delete empty folders created by failed archive attempts.
  4. If a URL exists in index.json but the corresponding archive folder is missing, reset its index entry to "not yet archived".
  5. Report any other cases, including but not limited to folders with unidentified files in the output.
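
The five cases above can be sketched as a single reconciliation pass over the archive folders and the index. This is a minimal illustration only, not the code in the linked sync.py and not ArchiveBox's own implementation. The on-disk layout (each archive folder holding its own index.json with a "url" key), the function name `classify_archive`, and the report labels are all assumptions made for the example.

```python
import json
from pathlib import Path


def classify_archive(archive_dir: Path, indexed_urls: set) -> dict:
    """Sort archive folders and index entries into the five cases above.

    Hypothetical layout: every folder under archive_dir holds an
    index.json with a "url" key. Nothing is deleted or modified here.
    """
    report = {"duplicate": [], "orphaned": [], "empty": [], "missing": [], "other": []}
    seen = set()  # URLs that already have a readable folder on disk

    for folder in sorted(p for p in archive_dir.iterdir() if p.is_dir()):
        if not any(folder.iterdir()):
            report["empty"].append(folder.name)          # case 3
            continue
        try:
            url = json.loads((folder / "index.json").read_text())["url"]
            if not isinstance(url, str):
                raise TypeError("url must be a string")
        except (OSError, ValueError, KeyError, TypeError):
            report["other"].append(folder.name)          # case 5
            continue
        if url in seen:
            report["duplicate"].append(folder.name)      # case 1
        else:
            seen.add(url)
            if url not in indexed_urls:
                report["orphaned"].append(folder.name)   # case 2

    # Case 4: URLs listed in the index that have no folder on disk.
    report["missing"] = sorted(indexed_urls - seen)
    return report
```

The sketch only reports. Deciding which duplicate to keep, or whether to delete anything at all, is left to a separate step so that the pass is safe to run repeatedly.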

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

This feature is required so that users can keep the archive folders clean, avoiding archives that are missing from the index and duplicate archives that take up too much space.

What hacks or alternative solutions have you tried to solve the problem?

I've already implemented this feature here:
https://github.com/Haocen/ArchiveBox/blob/production/archivebox/sync.py

How badly do you want this new feature?

  • [x] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@pirate commented on GitHub (Mar 2, 2020):

Oh this already exists in v0.4.0, it does it out of the box.


@Haocen commented on GitHub (Mar 2, 2020):

Please excuse me if this is very obvious, but which branch is the most up to date with this specific feature?
The dev, django, or 0.5 branch?


@pirate commented on GitHub (Mar 2, 2020):

No worries, it's not obvious at all haha. I recommend trying the django branch first, or v0.4.3 if you have trouble. It's not production-ready (make sure you back up your archive data), but it should work for the most part.

archivebox init does all of this out-of-the-box, and can be run repeatedly whenever cleanup is needed.

You can also use these:

archivebox list --status=invalid
archivebox list --status=orphaned
archivebox list --status=duplicate

archivebox remove --status=invalid
archivebox remove --status=orphaned
...

  • https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-init
  • https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-list
  • https://github.com/pirate/ArchiveBox/wiki/Roadmap#-archivebox-remove

@Haocen commented on GitHub (Mar 3, 2020):

Amazing! I'll try it out now.


@Haocen commented on GitHub (Mar 3, 2020):

Hi @pirate,
I tried very hard, but I cannot figure out how to run the Django server with the django branch.
I don't think I can install archivebox via pip:

(ArchiveBox) root@09108a959768:/home/ArchiveBox/archivebox# pip3 install archivebox
ERROR: Could not find a version that satisfies the requirement archivebox (from versions: none)
ERROR: No matching distribution found for archivebox

And Django complains that it cannot find the archivebox module:

(ArchiveBox) root@09108a959768:/home/ArchiveBox/archivebox# python3 manage.py runserver
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 54, in wrapper
    fn(*args, **kwargs)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 109, in inner_run
    autoreload.raise_last_exception()
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 77, in raise_last_exception
    raise _exception[0](_exception[1]).with_traceback(_exception[2])
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 54, in wrapper
    fn(*args, **kwargs)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/__init__.py", line 24, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/apps/registry.py", line 114, in populate
    app_config.import_models()
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/apps/config.py", line 211, in import_models
    self.models_module = import_module(models_module_name)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ArchiveBox/archivebox/core/models.py", line 7, in <module>
    from ..util import parse_date
ModuleNotFoundError: No module named 'archivebox'

Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 60, in execute
    super().execute(*args, **options)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 95, in handle
    self.run(**options)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 102, in run
    autoreload.run_with_reloader(self.inner_run, **options)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 579, in run_with_reloader
    start_django(reloader, main_func, *args, **kwargs)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 564, in start_django
    reloader.run(django_main_thread)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/autoreload.py", line 272, in run
    get_resolver().urlconf_module
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/utils/functional.py", line 80, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/root/.local/share/virtualenvs/ArchiveBox-2ZjDHtaD/lib/python3.7/site-packages/django/urls/resolvers.py", line 564, in urlconf_module
    return import_module(self.urlconf_name)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ArchiveBox/archivebox/core/urls.py", line 9, in <module>
    from core.views import MainIndex, AddLinks, LinkDetails
  File "/home/ArchiveBox/archivebox/core/views.py", line 8, in <module>
    from core.models import Snapshot
  File "/home/ArchiveBox/archivebox/core/models.py", line 7, in <module>
    from ..util import parse_date
ModuleNotFoundError: No module named 'archivebox'

Is there anything I'm missing?

It would be great if you could point me to a Docker image that runs the django release out of the box. Thank you in advance.
