[GH-ISSUE #1035] Cache get_dir_size() to avoid slow list rendering performance on /admin/core/snapshot/ #2158

Open
opened 2026-03-01 17:56:55 +03:00 by kerem · 1 comment
Owner

Originally created by @matthazinski on GitHub (Oct 8, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1035

I noticed listing snapshots is really slow. I found a few optimizations that seem to improve performance, but wanted to know whether there are any undesirable side effects:

  • get_dir_size() requires recursing the snapshot directories for every snapshot on a page. It seems archive_size() caches this, but only for the default Django cache TTL (300 seconds). Is there any reason we can't just set CACHES['default']['TIMEOUT'] = None to ensure these keys don't expire by default?
  • ArchiveBox doesn't expose any options for choosing an external cache, which isn't great when running in ephemeral containers. I've had luck with configuring django_redis in settings.py so the cache can be periodically written to disk. If there's interest, I can put up a PR which optionally enables this. (Alternatively, if there are no blockers to upgrading to Django>=4.0, we can use the built-in Redis cache client.)
  • With a warm cache, Snapshot.from_json() requires a round trip to both the DB and the cache. Calling self.tags_str() with nocache=False seems to cut DB latency by about half according to the Django debug toolbar.
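
For reference, the first two bullets can be sketched as a settings fragment. This is only an illustration, not ArchiveBox's actual settings.py: the helper name build_caches and its redis_url parameter are hypothetical, and the backend paths are Django's stock local-memory backend and the built-in Redis backend available from Django 4.0. Setting TIMEOUT to None tells Django that keys in that cache never expire by default.

```python
def build_caches(redis_url=None):
    """Return a Django CACHES dict whose default cache never expires keys.

    Hypothetical helper for illustration; not part of ArchiveBox.
    """
    if redis_url:
        # Built-in Redis backend, only available on Django >= 4.0.
        backend = {
            "BACKEND": "django.core.cache.backends.redis.RedisCache",
            "LOCATION": redis_url,
        }
    else:
        # Django's stock per-process in-memory cache, no extra services needed.
        backend = {
            "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        }
    # None disables expiry (Django's default TIMEOUT is 300 seconds).
    backend["TIMEOUT"] = None
    return {"default": backend}


# In settings.py this would be assigned to the CACHES setting.
CACHES = build_caches()
```

With no expiry, a cached directory size stays stale until something explicitly invalidates or overwrites the key, so this presumably needs a refresh step whenever a snapshot's files change.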
Author
Owner

@pirate commented on GitHub (Nov 18, 2022):

Note that our DB latency is almost nothing, as we use SQLite, which serves from memory + filesystem, so adding a cache for DB queries (e.g. from_json) is not worth it. But a cache is still useful to avoid heavy filesystem operations like get_dir_size.

If you want to open a PR for that I'd review it!

I don't want to add Redis by default or push people towards Redis; it's too much complexity for now. But I'm happy to add a filesystem cache, computed DB field, or in-memory cache that doesn't require additional config to set up.
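
As a rough sketch of the in-memory option, assuming nothing about how ArchiveBox's real get_dir_size is implemented: the names walk_dir_size, cached_dir_size, and _SIZE_CACHE below are hypothetical, and the cache is a plain per-process dict that needs no configuration.

```python
import os

# Hypothetical per-process cache: absolute directory path -> size in bytes.
_SIZE_CACHE = {}


def walk_dir_size(path):
    """Recursively sum file sizes under path (the expensive filesystem walk)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
            elif entry.is_dir(follow_symlinks=False):
                total += walk_dir_size(entry.path)
    return total


def cached_dir_size(path, refresh=False):
    """Return the cached size for path, walking the directory only on a miss.

    Pass refresh=True after the directory's contents change.
    """
    key = os.path.abspath(path)
    if refresh or key not in _SIZE_CACHE:
        _SIZE_CACHE[key] = walk_dir_size(key)
    return _SIZE_CACHE[key]
```

The trade-off is the same as with a non-expiring Django cache: the value is only as fresh as the last explicit refresh, and a per-process dict is lost on restart, which is where a filesystem cache or computed DB field would differ.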
