[GH-ISSUE #490] index not building? #321

Closed
opened 2026-03-01 14:42:26 +03:00 by kerem · 10 comments

Originally created by @ekiel on GitHub (Sep 25, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/490

I just updated the docker container to the latest (Digest:sha256:48a08a1c1e4a2e3480031f25817db57ae398f9720405f5d7495f92a92ced5659) and now when I add links, they seem to process, but the index is no longer rebuilding. I also tried the Django server and my newest archives are no longer showing up. Here is a snippet of my docker-compose.yml. Currently my archive method is as follows:

cat txtfileofurls.txt | docker-compose -f <location of docker-compose.yml> run archivebox add

This has been working great for me lately so it seems something has broken, but I'm not sure where to look.
docker-compose.yml:

version: '3.7'

services:
    archivebox:
        # build: .
        image: nikisweeting/archivebox:latest
        command: server 0.0.0.0:8000
        stdin_open: true
        tty: true
        ports:
            - 8006:8000
        environment:
            - USE_COLOR=True
            - SHOW_PROGRESS=False
            - SAVE_WARC=False
            - SAVE_PDF=False
            - SAVE_DOM=False
        volumes:
            - /data/archives/general:/data

kerem closed this issue 2026-03-01 14:42:26 +03:00

@ekiel commented on GitHub (Sep 26, 2020):

I reverted back to the 0.4.21 docker image and it seems to be building now. Is there somewhere I should look to see the diff? I've never deconstructed the layers before.


@ekiel commented on GitHub (Sep 28, 2020):

After continuing to test, I'm only able to replicate this problem on one of my archive sets. Using the same docker-compose.yml, I've found that this is what happens, all on the latest docker image:

  1. Add links via the same method (cat txt file | docker-compose run archivebox add).
    * If there are new links in the text file, the index is built and works as expected; otherwise the index is erased (shows 0 links in the archive).
    * If I then add a new link to the archive, the index is built as expected.
    * This is causing issues with the scheduled cron jobs that run the archive job: if there aren't any new links, the archive breaks.

However, this seems to be a weird issue: if I do the exact same thing on another archive, the index isn't blown away. Any ideas as to what I can look at?


@cdvv7788 commented on GitHub (Sep 28, 2020):

Hi. Thanks for reporting. What index are we talking about? `index.json`? I will try to reproduce it. The `index.json` file will not be generated automatically anymore. We are moving to the `sqlite` index in the current version. However, at this point it should still be working... can you try this command: `archivebox list --json --with-headers` and check if it outputs your links correctly?


@ekiel commented on GitHub (Sep 28, 2020):

It is actually the index.html that gets cleared out.


@cdvv7788 commented on GitHub (Sep 28, 2020):

That one will be removed too. You can generate it with `archivebox list --html --with-headers > index.html`.


@ekiel commented on GitHub (Sep 28, 2020):

OK, that worked, but is this expected behavior? I would expect that an "archivebox add" wouldn't touch the index if no links are added.


@cdvv7788 commented on GitHub (Sep 28, 2020):

This may be fixed briefly (generating it correctly after `archivebox add`), but this is something that will stop working this way in the short term.

We are in the middle of the index refactor. Those index files will be completely removed in the future, and they will not be updated or touched if they exist. You should not rely on them if you are using `v0.5.x`. The django server now has the list, and will be the central point of control. If you still need the `index.html` or `index.json`, you will need to generate it after you run your command using `archivebox list`.
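
For a cron-driven setup like the one described earlier in this thread, the "add, then regenerate" sequence could look roughly like the sketch below. It is only an illustration, not an official script. The function name `add_and_export` and the variables `COMPOSE_FILE`, `URL_LIST` and `INDEX_OUT` are made up for this example, and the `-T` flag on `docker-compose run` (no pseudo-tty) is an assumption about what keeps the redirected HTML clean. The only ArchiveBox commands used are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: the variable names and default paths are hypothetical placeholders.
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.yml}"
URL_LIST="${URL_LIST:-txtfileofurls.txt}"
INDEX_OUT="${INDEX_OUT:-index.html}"

add_and_export() {
    # Step 1: feed the URL list to `archivebox add`, as in the original report.
    # -T disables pseudo-tty allocation (assumed to be desirable for piped input/output).
    docker-compose -f "$COMPOSE_FILE" run -T archivebox add < "$URL_LIST" || return 1

    # Step 2: regenerate the static HTML index from the sqlite index.
    # Write to a temporary file first so a failed export never
    # replaces the existing index with an empty or partial one.
    tmp_index="${INDEX_OUT}.tmp"
    if docker-compose -f "$COMPOSE_FILE" run -T archivebox list --html --with-headers > "$tmp_index"; then
        mv "$tmp_index" "$INDEX_OUT"
    else
        rm -f "$tmp_index"
        return 1
    fi
}
```

A cron job would then set the three variables and call `add_and_export`. Because the index is only swapped in after a successful export, a run with no new links should leave a usable `index.html` behind either way.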


@ekiel commented on GitHub (Sep 28, 2020):

ok, thank you for clarifying - I haven't fully embraced the django server functionality as I have 2 archives, but I may consider merging the archives to simplify this usage.


@cdvv7788 commented on GitHub (Sep 28, 2020):

Please be careful with this process to avoid breaking stuff (copy archives before playing with them). With this change, we expect big archives to behave better. Old indexes were written pretty often, and with big archives this was a BIG bottleneck. The sqlite index should be faster to update, so the overall performance of archivebox should be more stable and less dependent on archive size.
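
One way to follow the "copy archives before playing with them" advice is a timestamped copy of the data directory, sketched below under some assumptions. The helper name `backup_archive` and the backup location are hypothetical. The example source path is the host side of the volume mapping in the compose file at the top of this issue.

```shell
#!/bin/sh
# Sketch only: backup_archive is a hypothetical helper, not part of ArchiveBox.
# Usage: backup_archive <archive_dir> <backup_root>
backup_archive() {
    src="$1"        # e.g. /data/archives/general (host path from the compose file)
    dest_root="$2"  # any directory with enough free space (assumed location)

    if [ ! -d "$src" ]; then
        echo "no such archive directory: $src" >&2
        return 1
    fi

    # Name the copy after the source directory plus a timestamp.
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dest_root/$(basename "$src")-$stamp"

    # -R copies recursively, -p preserves modes and timestamps.
    mkdir -p "$dest_root" && cp -Rp "$src" "$dest" && echo "$dest"
}
```

For example, `backup_archive /data/archives/general /data/backups` prints the path of the new copy. It is probably safest to run this while no `archivebox` command or container is writing to the archive, since a copy taken mid-write may not be consistent.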


@ekiel commented on GitHub (Sep 29, 2020):

OK thanks for your help - now that I know this is expected I'll close this issue.
