[GH-ISSUE #443] Remove VOLUME "$CODE_DIR" from Dockerfile #1806

Closed
opened 2026-03-01 17:53:49 +03:00 by kerem · 2 comments

Originally created by @jrruethe on GitHub (Aug 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/443

Hello! I'm a big fan of ArchiveBox, thank you for all your work!

The way I use ArchiveBox is to run `archivebox add --update-all --depth 1 http://shaarli` using cron, such that I can archive all my Shaarli bookmarks periodically. I have over 7000 links in my archive, and my index.json file is over 250MB.

I noticed that the index.json file is rewritten after each link is processed. I traced it to the `patch_main_index` function in `archivebox/extractors/__init__.py` and found that with that call commented out, my updates are much faster, and the main index is still written at the very end (instead of over and over again during the process).
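
To make the cost pattern concrete, here is a minimal sketch that counts full-index serializations when the index is rewritten after every link versus once at the end. It is not ArchiveBox's actual code: `write_index`, `update_all`, `make_links`, and the `patch_each_link` switch are hypothetical names used only for illustration.

```python
import json

def write_index(links, sink):
    """Serialize the whole index and record how many bytes were produced."""
    payload = json.dumps(links)
    sink.append(len(payload))

def update_all(links, patch_each_link):
    """Process every link; optionally rewrite the full index after each one."""
    writes = []
    for link in links:
        link["archived"] = True          # stand-in for running the extractors
        if patch_each_link:
            write_index(links, writes)   # full rewrite: O(n) work per link
    write_index(links, writes)           # the final write always happens
    return writes

def make_links(n):
    """Build n placeholder links (hypothetical data)."""
    return [{"url": f"http://example.com/{i}", "archived": False} for i in range(n)]

per_link = update_all(make_links(100), patch_each_link=True)
at_end = update_all(make_links(100), patch_each_link=False)
print(len(per_link), len(at_end))  # 101 1
```

With n links, the per-link variant serializes the whole index n + 1 times, so the total work grows roughly quadratically with the size of the archive.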

My simple fix for this was to use the following Dockerfile:

```
FROM nikisweeting/archivebox:0.4.13
RUN sed -i "s@patch_main_index(link)@pass@g" /app/archivebox/extractors/__init__.py
ENTRYPOINT ["dumb-init", "--", "/app/bin/docker_entrypoint.sh"]
```

What this does is replace `patch_main_index(link)` with `pass`, such that it doesn't get called. This worked great for my purposes with version `0.4.13`, and meant that I could use your existing Docker image.

I noticed that this broke with version `0.4.14` and after looking into it a bit, it was caused by this commit:
https://github.com/pirate/ArchiveBox/commit/e7948cf1616cff1b47f48194af07bfd7738416fd

Specifically, this line in the Dockerfile:

```
VOLUME "$CODE_DIR"
```

This change causes the /app directory to become a volume. Because Docker discards changes that later build steps make to a path after it has been declared as a volume, this makes it impossible to change that directory from derived images. See this thread for more details:
https://stackoverflow.com/questions/40074498/sed-inline-replacement-not-working-from-dockerfile
Unfortunately, I can't do anything about that in my derived Dockerfile, so I needed to open an issue to see if it could get fixed on your end.

Is there a reason that the /app directory needs to be exposed as a Docker volume? If not, may I suggest removing that line?

Alternatively, can the call to `patch_main_index(link)` be made conditional on an option? With a large index, it greatly slows down updating, and causes unnecessary disk I/O.
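
A rough sketch of what such an option could look like follows. The `PATCH_MAIN_INDEX` environment variable, `should_patch_index`, and `archive_link` are hypothetical names invented for this example, and `patch_main_index` here is a stand-in that only records the call; none of this is an existing ArchiveBox setting.

```python
import os

def patch_main_index(link, calls):
    """Stand-in for the real function; here it only records the call."""
    calls.append(link)

def should_patch_index(environ=os.environ):
    """Read a hypothetical PATCH_MAIN_INDEX option; defaults to enabled."""
    value = environ.get("PATCH_MAIN_INDEX", "True")
    return value.strip().lower() in ("1", "true", "yes")

def archive_link(link, calls, environ=os.environ):
    # ... the extractors would run here ...
    if should_patch_index(environ):
        patch_main_index(link, calls)

calls = []
archive_link("http://example.com/a", calls, environ={"PATCH_MAIN_INDEX": "False"})
archive_link("http://example.com/b", calls, environ={})
print(calls)  # ['http://example.com/b']
```

Defaulting to enabled keeps the existing behavior for anyone who does not set the option.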

Thank you again!

kerem closed this issue 2026-03-01 17:53:49 +03:00

@pirate commented on GitHub (Aug 15, 2020):

Ah sorry, we should be explaining our process more publicly to help save people the time of debugging this stuff!

We are already in the process of abolishing `patch_main_index` and the index.json altogether. You're completely right that those are slow hotspots. They existed because ArchiveBox tried to be threadsafe by atomically writing the entire index from memory after every update. While this got us easy thread-safety, it leads to slowness and wasteful IO, and SQLite is much better at handling this kind of workload.
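
For readers unfamiliar with the pattern, an atomic whole-file write is commonly done by writing to a temporary file and renaming it over the target. The sketch below assumes that approach; it is not necessarily how ArchiveBox implemented it, and `atomic_write_json` is a hypothetical helper name.

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write data to path so readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)   # swaps the new file in atomically
    except BaseException:
        os.unlink(tmp_path)          # clean up the temp file on failure
        raise

with tempfile.TemporaryDirectory() as d:
    index_path = os.path.join(d, "index.json")
    atomic_write_json(index_path, {"links": [{"url": "http://example.com"}]})
    with open(index_path) as f:
        loaded = json.load(f)
print(loaded)
```

The write is safe, but every call serializes the entire structure, which is where the slowness on a large index comes from.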

In v0.5.x, the SQLite db will become the single source of truth for the index, completing a 3-month-long refactor where we migrated people off the old formats to the new SQLite db.
`index.json` and `index.html` will only be written at the end of the archive if enabled, or you can run `archivebox export ...` to export the db in JSON or static HTML format at any time.
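
As a minimal sketch of that workflow using Python's standard `sqlite3` module: each link is saved as a single-row write, and JSON is produced once on demand. The `snapshot` table, its columns, and the `save_link` and `export_json` helpers are assumptions made for this example, not ArchiveBox's real schema or API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (url TEXT PRIMARY KEY, title TEXT)")

def save_link(conn, url, title):
    """Insert or update one row; only that row is written."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO snapshot (url, title) VALUES (?, ?)",
            (url, title),
        )

def export_json(conn):
    """Dump the whole table to JSON once, e.g. at the end of a run."""
    rows = conn.execute("SELECT url, title FROM snapshot ORDER BY url").fetchall()
    return json.dumps([{"url": u, "title": t} for u, t in rows])

save_link(conn, "http://example.com/a", "A")
save_link(conn, "http://example.com/b", "B")
save_link(conn, "http://example.com/a", "A (updated)")
exported = export_json(conn)
print(exported)
```

Updating a link touches one row instead of rewriting the whole index, and the export cost is paid only when a JSON or HTML copy is actually wanted.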

`VOLUME "$CODE_DIR"` was set because it makes it easier to develop ArchiveBox in an editor on the host, but that shouldn't be exposed to end-users so I removed it in bdd111d 👍.


@jrruethe commented on GitHub (Aug 15, 2020):

Sounds great, thank you again! The SQLite approach is a good one, and so far the 0.4 refactor is working great.
