mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #649] Improve DB query performance for rendering snapshot_icons on snapshot index pages #407
Labels
No labels
expected: maybe someday
expected: next release
expected: release after next
expected: unlikely unless contributed
good first ticket
help wanted
pull-request
scope: all users
scope: windows users
size: easy
size: hard
size: medium
size: medium
status: backlog
status: blocked
status: done
status: idea-phase
status: needs followup
status: wip
status: wontfix
touches: API/CLI/Spec
touches: configuration
touches: data/schema/architecture
touches: dependencies/packaging
touches: docs
touches: js
touches: views/replayers/html/css
why: correctness
why: functionality
why: performance
why: security
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/ArchiveBox#407
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @khimaros on GitHub (Feb 6, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/649
Describe the bug
Steps to reproduce
on a modern laptop with NVMe drive running Debian Testing:
archivebox initcat 3000urls.txt | archivebox addarchivebox server/public/page load takes 15+ seconds.
Screenshots or log output
i'm including a run with Django profiling middleware when hitting
/public/?prof.the majority of
cumtimeis spent insnapshot_icons:ArchiveBox version
@berezovskyi commented on GitHub (Feb 6, 2021):
Thanks for profiling this! I have the same problem but was thinking it was related to the slow HW I am using for running ArB.
Why does
snapshot_icons15.134s to execute?! I see that there's already some code commented out there, so perf issues must have been tracked down to that function before:github.com/ArchiveBox/ArchiveBox@04c951cdd5/archivebox/index/html.py (L154)@khimaros commented on GitHub (Feb 6, 2021):
it looks like per-call cost is relatively low at
0.005seconds, but it is being called once for each snapshot,3113times, even when only displaying 100 items per page.paginate_byis defined as it should be:github.com/ArchiveBox/ArchiveBox@cc80ceb0a2/archivebox/core/views.py (L94)i'm not an expert in django, but i suspect that
get_queryset()is being called prior to pagination and that it is not meant for doing this sort of heavy computation.https://docs.djangoproject.com/en/3.1/ref/class-based-views/mixins-multiple-object/#django.views.generic.list.MultipleObjectMixin.paginate_queryset
@khimaros commented on GitHub (Feb 6, 2021):
this may not be idiomatic for django, but i couldn't find any really clear documentation on how the overall call flow is meant to be for pagination. however, this seems to work and improves performance significantly:
@khimaros commented on GitHub (Feb 6, 2021):
actually, the above improves performance but is not correct. the snapshot data is not updated in the returned queryset. returning
pqshere isn't right either and breaks pagination. i feel there must be an equivalent to aget_paginated_queryset()where heavy lifting like this can be done but haven't found it.looking at the django implementation, it seems like it might be possible to do this in
get_context_data()but when i tried to add this to the PublicIndexView, i was not seeing any kwargs data in those calls.here's the django implementation:
github.com/django/django@92053acbb9/django/views/generic/list.py (L12)@pirate commented on GitHub (Feb 7, 2021):
snapshot_iconsis already a well-known hotspot in the code, the plan is to transition away from using FS access to determine if archied outputs are present, to using the DB exclusively.The idiomatic Django way to do this is with 2 bulk queries, one to fetch the queryset of Snapshots to show on the page, and one to select all the related recent ArchiveResults for each Snapshot returned by the first query (
ArchiveResult.objects.filter(snapshot__in=snapshots)all at once instead of a separate query to get the ArchiveResults on each snapshot). Then have some function that takes both QS's and returns the icons that should be shown for each snapshot. It's a relatively straightforward Django optimization process, but I haven't had a chance to get around to it myself as I've been focusing on correctness more than speed in the last sprint.If you want to take a crack at it, go for it, I'd be happy to review any PR to improve the speed here.
https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/index/html.py#L154
@pirate commented on GitHub (Feb 17, 2021):
I added Django caching and lots of other small query performance improvements. Brought load times on the main index down from 10-14sec for big collections to 110ms after the cache warms up.
https://github.com/ArchiveBox/ArchiveBox/pull/655, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-6c14c55129a45290db46e5bd8d2140bb22a9b7083a360c32998e59853e91ee2aR153-R163, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-74f28f9beb322ad19c72ad78d94288a2e9f19466834583c476f6862f8c678712R190, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-67b202908d4cd25c11c983552a64b29ed204d005fa1c5b55d8863a844494b7e9R216-R225, ...
Feel free to test that branch, but beware it's not totally finished, still some bugs left to fix so make sure you back up your archives first.
Let me know if that works.
@pirate commented on GitHub (Apr 6, 2021):
The new performance improvements will be out with the new v0.6 release (coming soon).