[GH-ISSUE #649] Improve DB query performance for rendering snapshot_icons on snapshot index pages #407

Closed
opened 2026-03-01 14:43:17 +03:00 by kerem · 7 comments
Owner

Originally created by @khimaros on GitHub (Feb 6, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/649

Describe the bug

Steps to reproduce

on a modern laptop with NVMe drive running Debian Testing:

  1. archivebox init
  2. cat 3000urls.txt | archivebox add
  3. cancel the addition, doesn't need to complete
  4. archivebox server
  5. visit /public/

page load takes 15+ seconds.

Screenshots or log output

i'm including a run with Django profiling middleware when hitting /public/?prof.

the majority of cumtime is spent in snapshot_icons:

         12839151 function calls (12677025 primitive calls) in 15.232 seconds

   Ordered by: internal time
   List reduced from 698 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    12453    3.313    0.000    3.314    0.000 {function SQLiteCursorWrapper.execute at 0x7fbfd54e65e0}
    14787    0.611    0.000    0.647    0.000 {built-in method posix.stat}
    63827    0.249    0.000    0.415    0.000 pathlib.py:63(parse_parts)
     6226    0.238    0.000    0.238    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     3113    0.189    0.000   15.134    0.005 html.py:118(snapshot_icons)
   155650    0.179    0.000    0.849    0.000 functional.py:218(wrapper)
778294/728484    0.176    0.000    0.505    0.000 {built-in method builtins.hasattr}
    18680    0.170    0.000    0.170    0.000 {method 'close' of 'sqlite3.Cursor' objects}
    49810    0.165    0.000    0.390    0.000 local.py:46(_get_context_id)
892363/892362    0.156    0.000    0.181    0.000 {built-in method builtins.isinstance}
   193006    0.155    0.000    0.200    0.000 safestring.py:50(mark_safe)
     9339    0.150    0.000    1.610    0.000 query.py:1203(build_filter)
    28019    0.135    0.000    0.295    0.000 query.py:1423(names_to_path)
   155650    0.135    0.000    0.220    0.000 __init__.py:12(escape)
348721/323816    0.126    0.000    0.341    0.000 {built-in method builtins.getattr}
    34243    0.121    0.000    1.324    0.000 html.py:107(format_html)
    46695    0.121    0.000    0.139    0.000 {method 'format' of 'str' objects}
   155650    0.119    0.000    0.477    0.000 html.py:33(escape)
   158763    0.117    0.000    1.007    0.000 html.py:92(conditional_escape)
    60720    0.116    0.000    0.546    0.000 pathlib.py:671(_parse_args)
    17894    0.112    0.000    0.692    0.000 schema.py:262(link_dir)
...

ArchiveBox version

$ archivebox version
ArchiveBox v0.5.4
Cpython Linux Linux-5.10.0-2-amd64-x86_64-with-glibc2.31 x86_64 (not in Docker)

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.5.4          valid     /home/...local/bin/archivebox                                         
 √  PYTHON_BINARY         v3.9.1          valid     /usr/bin/python3.9                                                          
 √  DJANGO_BINARY         v3.1.3          valid     /home/.../.local/lib/python3.9/site-packages/django/bin/django-admin.py 
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.0.0         valid     /home/.../.nvm/versions/node/v15.0.0/bin/node                           
 √  SINGLEFILE_BINARY     v0.3.6          valid     ./node_modules/archivebox/node_modules/single-file/cli/single-file          
 √  READABILITY_BINARY    v0.1.0          valid     ./node_modules/archivebox/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/archivebox/node_modules/@postlight/mercury-parser/cli.js     
 √  GIT_BINARY            v2.30.0         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.01.08     valid     /usr/bin/youtube-dl                                                         
 √  CHROME_BINARY         v88.0.4324.96   valid     /usr/bin/chromium                                                  
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/.../.local/lib/python3.9/site-packages/archivebox                 
 √  TEMPLATES_DIR         3 files         valid     /home/.../.local/lib/python3.9/site-packages/archivebox/templates       

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /home/.../scratch/archivebox                                            
 √  SOURCES_DIR           1 files         valid     ./sources                                                                   
 √  LOGS_DIR              0 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           20 files        valid     ./archive                                                                   
 √  CONFIG_FILE           164.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.3 MB          valid     ./index.sqlite3  
Originally created by @khimaros on GitHub (Feb 6, 2021). Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/649 <!-- Please fill out the following information, feel free to delete sections if they're not applicable or if long issue templates annoy you. (the only required section is the version information) --> #### Describe the bug <!-- A description of what the bug is, what you expected to happen, and any relevant context about issue. --> #### Steps to reproduce <!-- For example: 1. Ran ArchiveBox with the following config '...' 2. Saw this output during archiving '....' 3. UI didn't show the thing I was expecting '....' --> on a modern laptop with NVMe drive running Debian Testing: 1. `archivebox init` 2. `cat 3000urls.txt | archivebox add` 3. cancel the addition, doesn't need to complete 4. `archivebox server` 5. visit `/public/` page load takes 15+ seconds. #### Screenshots or log output <!-- If applicable, post any relevant screenshots or copy/pasted terminal output from ArchiveBox. If you're reporting a parsing / importing error, **you must paste a copy of your redacted import file here**. --> i'm including a run with [Django profiling middleware](https://medium.com/kami-people/profiling-in-django-9f4d403a394f) when hitting `/public/?prof`. the majority of `cumtime` is spent in `snapshot_icons`: ``` 12839151 function calls (12677025 primitive calls) in 15.232 seconds Ordered by: internal time List reduced from 698 to 100 due to restriction <100> ncalls tottime percall cumtime percall filename:lineno(function) 12453 3.313 0.000 3.314 0.000 {function SQLiteCursorWrapper.execute at 0x7fbfd54e65e0} 14787 0.611 0.000 0.647 0.000 {built-in method posix.stat} 63827 0.249 0.000 0.415 0.000 pathlib.py:63(parse_parts) 6226 0.238 0.000 0.238 0.000 {method 'fetchone' of 'sqlite3.Cursor' objects} 3113 0.189 0.000 15.134 0.005 html.py:118(snapshot_icons) 155650 0.179 0.000 0.849 0.000 functional.py:218(wrapper) 778294/728484 0.176 0.000 0.505 0.000 {built-in method builtins.hasattr} 18680 0.170 0.000 0.170 0.000 {method 'close' of 'sqlite3.Cursor' objects} 49810 0.165 0.000 0.390 0.000 local.py:46(_get_context_id) 892363/892362 0.156 0.000 0.181 0.000 {built-in method builtins.isinstance} 193006 0.155 0.000 0.200 0.000 safestring.py:50(mark_safe) 9339 0.150 0.000 1.610 0.000 query.py:1203(build_filter) 28019 0.135 0.000 0.295 0.000 query.py:1423(names_to_path) 155650 0.135 0.000 0.220 0.000 __init__.py:12(escape) 348721/323816 0.126 0.000 0.341 0.000 {built-in method builtins.getattr} 34243 0.121 0.000 1.324 0.000 html.py:107(format_html) 46695 0.121 0.000 0.139 0.000 {method 'format' of 'str' objects} 155650 0.119 0.000 0.477 0.000 html.py:33(escape) 158763 0.117 0.000 1.007 0.000 html.py:92(conditional_escape) 60720 0.116 0.000 0.546 0.000 pathlib.py:671(_parse_args) 17894 0.112 0.000 0.692 0.000 schema.py:262(link_dir) ... ``` #### ArchiveBox version <!-- Run the `archivebox version` command locally then copy paste the result here: --> ```logs $ archivebox version ArchiveBox v0.5.4 Cpython Linux Linux-5.10.0-2-amd64-x86_64-with-glibc2.31 x86_64 (not in Docker) [i] Dependency versions: √ ARCHIVEBOX_BINARY v0.5.4 valid /home/...local/bin/archivebox √ PYTHON_BINARY v3.9.1 valid /usr/bin/python3.9 √ DJANGO_BINARY v3.1.3 valid /home/.../.local/lib/python3.9/site-packages/django/bin/django-admin.py √ CURL_BINARY v7.74.0 valid /usr/bin/curl √ WGET_BINARY v1.21 valid /usr/bin/wget √ NODE_BINARY v15.0.0 valid /home/.../.nvm/versions/node/v15.0.0/bin/node √ SINGLEFILE_BINARY v0.3.6 valid ./node_modules/archivebox/node_modules/single-file/cli/single-file √ READABILITY_BINARY v0.1.0 valid ./node_modules/archivebox/node_modules/readability-extractor/readability-extractor √ MERCURY_BINARY v1.0.0 valid ./node_modules/archivebox/node_modules/@postlight/mercury-parser/cli.js √ GIT_BINARY v2.30.0 valid /usr/bin/git √ YOUTUBEDL_BINARY v2021.01.08 valid /usr/bin/youtube-dl √ CHROME_BINARY v88.0.4324.96 valid /usr/bin/chromium √ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg [i] Source-code locations: √ PACKAGE_DIR 23 files valid /home/.../.local/lib/python3.9/site-packages/archivebox √ TEMPLATES_DIR 3 files valid /home/.../.local/lib/python3.9/site-packages/archivebox/templates [i] Secrets locations: - CHROME_USER_DATA_DIR - disabled - COOKIES_FILE - disabled [i] Data locations: √ OUTPUT_DIR 9 files valid /home/.../scratch/archivebox √ SOURCES_DIR 1 files valid ./sources √ LOGS_DIR 0 files valid ./logs √ ARCHIVE_DIR 20 files valid ./archive √ CONFIG_FILE 164.0 Bytes valid ./ArchiveBox.conf √ SQL_INDEX 1.3 MB valid ./index.sqlite3 ``` <!-- Tickets without full version info will closed until it is provided, we need the full output here to help you solve your issue -->
Author
Owner

@berezovskyi commented on GitHub (Feb 6, 2021):

Thanks for profiling this! I have the same problem but was thinking it was related to the slow HW I am using for running ArB.

Why does snapshot_icons 15.134s to execute?! I see that there's already some code commented out there, so perf issues must have been tracked down to that function before:

github.com/ArchiveBox/ArchiveBox@04c951cdd5/archivebox/index/html.py (L154)

<!-- gh-comment-id:774444211 --> @berezovskyi commented on GitHub (Feb 6, 2021): Thanks for profiling this! I have the same problem but was thinking it was related to the slow HW I am using for running ArB. Why does `snapshot_icons` 15.134s to execute?! I see that there's already some code commented out there, so perf issues must have been tracked down to that function before: https://github.com/ArchiveBox/ArchiveBox/blob/04c951cdd50e9ab8f4257e0f702ee909506af01d/archivebox/index/html.py#L154
Author
Owner

@khimaros commented on GitHub (Feb 6, 2021):

it looks like per-call cost is relatively low at 0.005 seconds, but it is being called once for each snapshot, 3113 times, even when only displaying 100 items per page. paginate_by is defined as it should be:

github.com/ArchiveBox/ArchiveBox@cc80ceb0a2/archivebox/core/views.py (L94)

i'm not an expert in django, but i suspect that get_queryset() is being called prior to pagination and that it is not meant for doing this sort of heavy computation.

https://docs.djangoproject.com/en/3.1/ref/class-based-views/mixins-multiple-object/#django.views.generic.list.MultipleObjectMixin.paginate_queryset

<!-- gh-comment-id:774508369 --> @khimaros commented on GitHub (Feb 6, 2021): it looks like per-call cost is relatively low at `0.005` seconds, but it is being called once for each snapshot, `3113` times, even when only displaying 100 items per page. `paginate_by` is defined as it should be: https://github.com/ArchiveBox/ArchiveBox/blob/cc80ceb0a27d1aa0564f43e4d21d069272eab3c0/archivebox/core/views.py#L94 i'm not an expert in django, but i suspect that `get_queryset()` is being called prior to pagination and that it is not meant for doing this sort of heavy computation. https://docs.djangoproject.com/en/3.1/ref/class-based-views/mixins-multiple-object/#django.views.generic.list.MultipleObjectMixin.paginate_queryset
Author
Owner

@khimaros commented on GitHub (Feb 6, 2021):

this may not be idiomatic for django, but i couldn't find any really clear documentation on how the overall call flow is meant to be for pagination. however, this seems to work and improves performance significantly:

    def get_queryset(self, **kwargs): 
        qs = super().get_queryset(**kwargs) 
        query = self.request.GET.get('q')
        if query:
            qs = qs.filter(Q(title__icontains=query) | Q(url__icontains=query) | Q(timestamp__icontains=query) | Q(tags__name__icontains=query))

        _, _, pqs, _ = self.paginate_queryset(qs, self.paginate_by)
        for snapshot in pqs:
            snapshot.icons = snapshot_icons(snapshot)
        return qs
<!-- gh-comment-id:774508553 --> @khimaros commented on GitHub (Feb 6, 2021): this may not be idiomatic for django, but i couldn't find any really clear documentation on how the overall call flow is meant to be for pagination. however, this seems to work and improves performance significantly: ``` def get_queryset(self, **kwargs): qs = super().get_queryset(**kwargs) query = self.request.GET.get('q') if query: qs = qs.filter(Q(title__icontains=query) | Q(url__icontains=query) | Q(timestamp__icontains=query) | Q(tags__name__icontains=query)) _, _, pqs, _ = self.paginate_queryset(qs, self.paginate_by) for snapshot in pqs: snapshot.icons = snapshot_icons(snapshot) return qs ```
Author
Owner

@khimaros commented on GitHub (Feb 6, 2021):

actually, the above improves performance but is not correct. the snapshot data is not updated in the returned queryset. returning pqs here isn't right either and breaks pagination. i feel there must be an equivalent to a get_paginated_queryset() where heavy lifting like this can be done but haven't found it.

looking at the django implementation, it seems like it might be possible to do this in get_context_data() but when i tried to add this to the PublicIndexView, i was not seeing any kwargs data in those calls.

here's the django implementation: github.com/django/django@92053acbb9/django/views/generic/list.py (L12)

<!-- gh-comment-id:774511552 --> @khimaros commented on GitHub (Feb 6, 2021): actually, the above improves performance but is not correct. the snapshot data is not updated in the returned queryset. returning `pqs` here isn't right either and breaks pagination. i feel there must be an equivalent to a `get_paginated_queryset()` where heavy lifting like this can be done but haven't found it. looking at the django implementation, it seems like it might be possible to do this in `get_context_data()` but when i tried to add this to the PublicIndexView, i was not seeing any kwargs data in those calls. here's the django implementation: https://github.com/django/django/blob/92053acbb9160862c3e743a99ed8ccff8d4f8fd6/django/views/generic/list.py#L12
Author
Owner

@pirate commented on GitHub (Feb 7, 2021):

snapshot_icons is already a well-known hotspot in the code, the plan is to transition away from using FS access to determine if archied outputs are present, to using the DB exclusively.

The idiomatic Django way to do this is with 2 bulk queries, one to fetch the queryset of Snapshots to show on the page, and one to select all the related recent ArchiveResults for each Snapshot returned by the first query (ArchiveResult.objects.filter(snapshot__in=snapshots) all at once instead of a separate query to get the ArchiveResults on each snapshot). Then have some function that takes both QS's and returns the icons that should be shown for each snapshot. It's a relatively straightforward Django optimization process, but I haven't had a chance to get around to it myself as I've been focusing on correctness more than speed in the last sprint.

If you want to take a crack at it, go for it, I'd be happy to review any PR to improve the speed here.
https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/index/html.py#L154

<!-- gh-comment-id:774744778 --> @pirate commented on GitHub (Feb 7, 2021): `snapshot_icons` is already a well-known hotspot in the code, the plan is to transition away from using FS access to determine if archied outputs are present, to using the DB exclusively. The idiomatic Django way to do this is with 2 bulk queries, one to fetch the queryset of Snapshots to show on the page, and one to select all the related recent ArchiveResults for each Snapshot returned by the first query (`ArchiveResult.objects.filter(snapshot__in=snapshots)` all at once instead of a separate query to get the ArchiveResults on each snapshot). Then have some function that takes both QS's and returns the icons that should be shown for each snapshot. It's a relatively straightforward Django optimization process, but I haven't had a chance to get around to it myself as I've been focusing on correctness more than speed in the last sprint. If you want to take a crack at it, go for it, I'd be happy to review any PR to improve the speed here. https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/index/html.py#L154
Author
Owner

@pirate commented on GitHub (Feb 17, 2021):

I added Django caching and lots of other small query performance improvements. Brought load times on the main index down from 10-14sec for big collections to 110ms after the cache warms up.

https://github.com/ArchiveBox/ArchiveBox/pull/655, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-6c14c55129a45290db46e5bd8d2140bb22a9b7083a360c32998e59853e91ee2aR153-R163, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-74f28f9beb322ad19c72ad78d94288a2e9f19466834583c476f6862f8c678712R190, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-67b202908d4cd25c11c983552a64b29ed204d005fa1c5b55d8863a844494b7e9R216-R225, ...

Feel free to test that branch, but beware it's not totally finished, still some bugs left to fix so make sure you back up your archives first.

docker build -t archivebox:0.5.7rc https://github.com/ArchiveBox/ArchiveBox.git#debug-toolbar
docker run -v $PWD:/data archivebox:0.5.7rc ...

Let me know if that works.

Screen Shot 2021-02-18 at 6 32 21 a
Screen Shot 2021-02-18 at 6 31 06 a
Screen Shot 2021-02-18 at 6 30 49 a

<!-- gh-comment-id:780758347 --> @pirate commented on GitHub (Feb 17, 2021): I added Django caching and lots of other small query performance improvements. Brought load times on the main index down from 10-14sec for big collections to 110ms after the cache warms up. https://github.com/ArchiveBox/ArchiveBox/pull/655, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-6c14c55129a45290db46e5bd8d2140bb22a9b7083a360c32998e59853e91ee2aR153-R163, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-74f28f9beb322ad19c72ad78d94288a2e9f19466834583c476f6862f8c678712R190, https://github.com/ArchiveBox/ArchiveBox/pull/655/files#diff-67b202908d4cd25c11c983552a64b29ed204d005fa1c5b55d8863a844494b7e9R216-R225, ... Feel free to test that branch, but beware it's not totally finished, still some bugs left to fix so make sure you back up your archives first. ```bash docker build -t archivebox:0.5.7rc https://github.com/ArchiveBox/ArchiveBox.git#debug-toolbar docker run -v $PWD:/data archivebox:0.5.7rc ... ``` Let me know if that works. ![Screen Shot 2021-02-18 at 6 32 21 a](https://user-images.githubusercontent.com/511499/108410075-f5849d80-71f4-11eb-9132-53a33a0b5f51.png) ![Screen Shot 2021-02-18 at 6 31 06 a](https://user-images.githubusercontent.com/511499/108410149-09c89a80-71f5-11eb-8034-f61d0796650b.png) ![Screen Shot 2021-02-18 at 6 30 49 a](https://user-images.githubusercontent.com/511499/108410169-0fbe7b80-71f5-11eb-90ab-d6edeab2ac97.png)
Author
Owner

@pirate commented on GitHub (Apr 6, 2021):

The new performance improvements will be out with the new v0.6 release (coming soon).

<!-- gh-comment-id:813855448 --> @pirate commented on GitHub (Apr 6, 2021): The new performance improvements will be out with the new v0.6 release (coming soon).
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/ArchiveBox#407
No description provided.