mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #443] Remove VOLUME "$CODE_DIR" from Dockerfile #295
Originally created by @jrruethe on GitHub (Aug 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/443
Hello! I'm a big fan of ArchiveBox, thank you for all your work!
The way I use ArchiveBox is to run `archivebox add --update-all --depth 1 http://shaarli` using cron, such that I can archive all my Shaarli bookmarks periodically. I have over 7000 links in my archive, and my index.json file is over 250MB.
I noticed that after each link is processed, that index.json file is rewritten. I traced it to the `patch_main_index` function in `archivebox/extractors/__init__.py`, and I found that by commenting that out, my updates are much faster, and the main index is still written at the very end (instead of over and over again during the process).
My simple fix for this was to use the following Dockerfile:
What this does is replace `patch_main_index(link)` with `pass`, such that it doesn't get called. This worked great for my purposes with version `0.4.13`, and meant that I could use your existing Docker image.
I noticed that this broke with version `0.4.14`, and after looking into it a bit, it was caused by this commit: github.com/pirate/ArchiveBox@e7948cf161
Specifically, this line in the Dockerfile: `VOLUME "$CODE_DIR"`
This change causes the /app directory to become a volume, and due to the way that Docker behaves, this makes it impossible to change that directory from derived images. See this thread for more details:
https://stackoverflow.com/questions/40074498/sed-inline-replacement-not-working-from-dockerfile
Unfortunately, I can't do anything about that in my derived Dockerfile, so I needed to open an issue to see if it could get fixed on your end.
Is there a reason that the /app directory needs to be exposed as a Docker volume? If not, may I suggest removing that line?
Alternatively, can the calling of `patch_main_index(link)` be conditional on an option? With a large index, it greatly slows down updating, and causes unnecessary disk I/O.
Thank you again!
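To illustrate the cost described above, here is a minimal sketch (not ArchiveBox's actual code) of an archiving loop where the per-link index rewrite is gated behind an option. The names `archive_links`, `write_index`, and `write_index_each_link` are hypothetical and exist only for this example. It counts how many bytes are written to the index file in each mode.

```python
import json
import os
import tempfile


def write_index(index, index_path):
    """Rewrite the whole JSON index file and return the number of bytes written."""
    data = json.dumps(index)
    with open(index_path, "w", encoding="utf-8") as f:
        f.write(data)
    return len(data.encode("utf-8"))


def archive_links(links, index_path, write_index_each_link=True):
    """Process links and return the total bytes written to the index file.

    write_index_each_link is a hypothetical option: when True, the full
    index is rewritten after every link, and when False only once at the end.
    """
    index = []
    bytes_written = 0
    for link in links:
        index.append({"url": link})  # stand-in for actually archiving the link
        if write_index_each_link:
            bytes_written += write_index(index, index_path)
    # The index is always written once at the very end.
    bytes_written += write_index(index, index_path)
    return bytes_written


if __name__ == "__main__":
    links = [f"https://example.com/{i}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.json")
        eager = archive_links(links, path, write_index_each_link=True)
        lazy = archive_links(links, path, write_index_each_link=False)
        print(f"per-link rewrites: {eager} bytes, single final write: {lazy} bytes")
```

Because each rewrite serializes the whole index, the bytes written in the per-link mode grow roughly quadratically with the number of links, while the single final write grows linearly.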
@pirate commented on GitHub (Aug 15, 2020):
Ah sorry, we should be explaining our process more publicly to help save people the time of debugging this stuff!
We are already in the process of abolishing `patch_main_index` and the index.json altogether. You're completely right that those are slow hotspots. The reason they existed was because ArchiveBox tried to be threadsafe by atomically writing the entire index from memory after every update. While this got us easy thread-safety, it leads to slowness and wasteful IO, and SQLite is much better at handling this kind of workload.
In v0.5.x, the SQLite db will become the single-source of truth for the index, completing a 3-month-long refactor where we migrated people off the old formats to the new sqlite db.
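As a rough illustration of why SQLite suits this workload, the sketch below records each link's result as a single-row transaction and only produces JSON on demand. The `links` table, its columns, and the function names are assumptions made for this example; they are not ArchiveBox's real schema or API.

```python
import json
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a small example index database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def record_result(conn, url, title, status):
    """Insert or overwrite one link's row inside its own transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO links (url, title, status) VALUES (?, ?, ?)",
            (url, title, status),
        )


def export_json(conn):
    """Serialize the whole index to JSON only when an export is requested."""
    rows = conn.execute(
        "SELECT url, title, status FROM links ORDER BY url"
    ).fetchall()
    return json.dumps([{"url": u, "title": t, "status": s} for u, t, s in rows])


if __name__ == "__main__":
    conn = open_index()
    record_result(conn, "https://example.com/a", "Example A", "succeeded")
    record_result(conn, "https://example.com/b", None, "failed")
    print(export_json(conn))
```

Each update touches one row instead of rewriting the full index, and SQLite's transactions handle atomicity, so the full JSON serialization happens only when an export is actually needed.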
`index.json` and `index.html` will only be written at the end of the archive if enabled, or you can run `archivebox export ...` to export the db in json or static HTML format at any time.
`VOLUME "$CODE_DIR"` was set because it makes it easier to develop ArchiveBox in an editor on the host, but that shouldn't be exposed to end-users so I removed it in bdd111d 👍.
@jrruethe commented on GitHub (Aug 15, 2020):
Sounds great, thank you again! The SQLite approach is a good one, and so far the 0.4 refactor is working great.