mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1035] Cache get_dir_size() to avoid slow list rendering performance on /admin/core/snapshot/ #646
Originally created by @matthazinski on GitHub (Oct 8, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1035
I noticed listing snapshots is really slow. I found a few optimizations that seem to increase performance, but wanted to know whether there are any undesirable side effects to this:
- `get_dir_size()` requires recursing the snapshot directories for every snapshot on a page. It seems `archive_size()` caches this, but only for the default Django cache TTL (300 seconds). Is there any reason we can't just set `CACHES['default']['TIMEOUT'] = None` to ensure these keys don't expire by default?
- `django_redis` in `settings.py` so the cache can be periodically written to disk. If there's interest, I can put up a PR which optionally enables this. (Alternatively, if there are no blockers to upgrading to Django>=4.0, we can use the built-in Redis cache client.)
- `Snapshot.from_json()` requires a round trip to both the DB and the cache. Calling `self.tags_str()` with `nocache=False` seems to cut down DB latency by about half according to the Django debug toolbar.

@pirate commented on GitHub (Nov 18, 2022):
Note that our DB latency is almost nothing as we use SQLite which serves from memory + filesystem, so adding cache for DB queries (e.g. `from_json`) is not worth it. But cache is still useful to avoid heavy filesystem operations like `get_dir_size`.
If you want to open a PR for that I'd review it!
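To make the idea of caching a heavy directory walk concrete, here is a minimal sketch in plain Python. It is not ArchiveBox's actual `get_dir_size()`; the names `dir_size`, `cached_dir_size`, and `_SIZE_CACHE` are hypothetical, and the cache is a simple in-process dict whose entries never expire until explicitly refreshed.

```python
import os
from pathlib import Path

# Hypothetical in-process cache: resolved directory path -> total size in bytes.
_SIZE_CACHE = {}


def dir_size(path: Path) -> int:
    """Recursively sum the sizes of all regular files under `path` (the slow part)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += dir_size(Path(entry.path))
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def cached_dir_size(path: Path, refresh: bool = False) -> int:
    """Return the cached size; walk the directory only on a miss or when refresh=True."""
    key = str(path.resolve())
    if refresh or key not in _SIZE_CACHE:
        _SIZE_CACHE[key] = dir_size(path)
    return _SIZE_CACHE[key]
```

The trade-off of a non-expiring entry is staleness: if files are added to a snapshot directory later, the cached value stays wrong until something calls `cached_dir_size(path, refresh=True)`, so a real implementation would need a refresh hook wherever snapshot contents change.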
I don't want to add redis by default / push people towards redis, it's too much complexity for now, but happy to add a filesystem cache, computed db field, or in-memory cache that doesn't require additional config to set up.
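For illustration, a filesystem cache with non-expiring keys could look roughly like the settings fragment below. This is a sketch, not the project's actual configuration: `CACHE_DIR` and its path are hypothetical, and the `MAX_ENTRIES` value is an arbitrary example. The backend path and the `TIMEOUT`/`OPTIONS` keys are standard Django cache settings.

```python
# settings.py (sketch): a file-based cache that needs no extra services.
from pathlib import Path

# Hypothetical location; any directory writable by the server process would do.
CACHE_DIR = Path("/tmp/archivebox_cache")

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": str(CACHE_DIR),
        # None means keys never expire by default (Django's default is 300 seconds).
        "TIMEOUT": None,
        # The file-based backend culls old entries once MAX_ENTRIES is exceeded
        # (default 300), so a large archive likely needs a higher, arbitrary limit.
        "OPTIONS": {"MAX_ENTRIES": 100_000},
    }
}
```

With `TIMEOUT` set to `None`, cached directory sizes survive restarts but also never refresh on their own, so whatever writes new files into a snapshot would need to delete or overwrite the matching cache key.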