[GH-ISSUE #1193] Bug: Search sometimes shows the same snapshot twice #2249

Closed
opened 2026-03-01 17:57:40 +03:00 by kerem · 7 comments
Owner

Originally created by @melyux on GitHub (Jul 26, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1193

Describe the bug

When doing a search (sonic), sometimes the same snapshot will show up in two rows right after one another.

Steps to reproduce

Not quite sure.

Screenshots or log output

N/A

ArchiveBox version

0.6.3
ArchiveBox v0.6.3 40ddd33 Cpython Linux Linux-6.1.0-10-amd64-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=False FS_PERMS=644 1000:1000 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.4         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v18.16.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.44         valid     /usr/lib/node_modules/single-file-cli/single-file                           
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.07.06     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v114.0.5735.198  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              


[i] Data locations:

@pirate commented on GitHub (Jul 28, 2023):

Thanks for reporting.

I'm not entirely surprised by this given how it works. We augment the default Django search of the DB fields with the Sonic results, so it's possible the dedupe step is failing, which would lead to a result showing twice if both its content and its metadata match.

PRs welcome, otherwise I'll probably get to this in the next bug-fixing pass after 0.7.0 is released.
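
For illustration only, here is a minimal sketch of what a dedupe step over two result sources could look like. It is plain Python, not ArchiveBox code: `Snapshot` and `merge_results` are hypothetical names, and the assumption is simply that both sources return records sharing a stable `id`.

```python
from typing import Iterable, NamedTuple


class Snapshot(NamedTuple):
    # Hypothetical stand-in for a snapshot row; not ArchiveBox's model.
    id: int
    url: str


def merge_results(db_matches: Iterable[Snapshot],
                  index_matches: Iterable[Snapshot]) -> list[Snapshot]:
    """Combine both result sets, keeping the first occurrence of each id."""
    seen: set[int] = set()
    merged: list[Snapshot] = []
    for snap in (*db_matches, *index_matches):
        if snap.id not in seen:
            seen.add(snap.id)
            merged.append(snap)
    return merged


# Snapshot 2 matches on metadata (DB search) and on content (search index).
db_matches = [Snapshot(1, "https://example.com/a"), Snapshot(2, "https://example.com/b")]
index_matches = [Snapshot(2, "https://example.com/b"), Snapshot(3, "https://example.com/c")]
merged = merge_results(db_matches, index_matches)
```

If the step keyed on `id` is skipped or fails, snapshot 2 would be listed twice, which matches the symptom described in the report.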


@neel-suthar commented on GitHub (Jan 18, 2024):

@pirate You mean handling dupes in the following function right? We just need to make sure it returns a distinct query set.

def get_queryset(self, **kwargs):
    qs = super().get_queryset(**kwargs)
    query = self.request.GET.get('q')
    if query and query.strip():
        qs = qs.filter(Q(title__icontains=query) | Q(url__icontains=query) | Q(timestamp__icontains=query) | Q(tags__name__icontains=query))
        try:
            qs = qs | query_search_index(query)
        except Exception as err:
            print(f'[!] Error while using search backend: {err.__class__.__name__} {err}')
    return qs

Will be happy to work on this. Please let me know.


@pirate commented on GitHub (Jan 19, 2024):

Yeah, that's the spot! If you wanna open a PR to change it to `return qs.distinct()` I'll approve + merge it into `dev` (0.7.3-rc). :)
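
One plausible source of the duplicate rows, not confirmed in this thread, is the `tags__name__icontains` filter: filtering across a many-to-many relation makes Django join the tag table, and a snapshot with several matching tags then comes back once per tag. The sketch below reproduces that effect with the standard-library `sqlite3` module rather than Django. The table and column names are made up for the demo, and `SELECT DISTINCT` plays the role that `qs.distinct()` plays on a queryset.

```python
import sqlite3

# In-memory database with a hypothetical schema (not ArchiveBox's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE snapshot_tags (snapshot_id INTEGER, tag_id INTEGER);
INSERT INTO snapshot VALUES (1, 'https://example.com/python', 'Python news');
INSERT INTO tag VALUES (1, 'python'), (2, 'python-tips');
INSERT INTO snapshot_tags VALUES (1, 1), (1, 2);
""")

# Same shape as an OR filter over title, url, and tag name.
JOIN_SQL = """
SELECT {select} snapshot.id, snapshot.url
FROM snapshot
LEFT JOIN snapshot_tags ON snapshot_tags.snapshot_id = snapshot.id
LEFT JOIN tag ON tag.id = snapshot_tags.tag_id
WHERE snapshot.title LIKE :q OR snapshot.url LIKE :q OR tag.name LIKE :q
ORDER BY snapshot.id
"""
params = {"q": "%python%"}

# Without DISTINCT the single snapshot appears once per joined tag row.
plain_rows = conn.execute(JOIN_SQL.format(select=""), params).fetchall()

# With DISTINCT the joined rows collapse back to one row per snapshot.
distinct_rows = conn.execute(JOIN_SQL.format(select="DISTINCT"), params).fetchall()

print(len(plain_rows), len(distinct_rows))
```

Here the one snapshot carries two tags that both match the query, so the plain query returns it twice and the `DISTINCT` query returns it once.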


@neel-suthar commented on GitHub (Jan 19, 2024):

@pirate I found another place where we have the same kind of logic but I am not sure if it requires a distinct call or not. Can you please confirm? Here is the code...

from django.contrib import messages

from archivebox.search import query_search_index

class SearchResultsAdminMixin:
    def get_search_results(self, request, queryset, search_term: str):
        """Enhances the search queryset with results from the search backend"""
        
        qs, use_distinct = super().get_search_results(request, queryset, search_term)

        search_term = search_term.strip()
        if not search_term:
            return qs, use_distinct
        try:
            qsearch = query_search_index(search_term)
            qs = qs | qsearch
        except Exception as err:
            print(f'[!] Error while using search backend: {err.__class__.__name__} {err}')
            messages.add_message(request, messages.WARNING, f'Error from the search backend, only showing results from default admin search fields - Error: {err}')
        
        return qs, use_distinct

@pirate commented on GitHub (Jan 19, 2024):

Good catch @neel-suthar, want to also add `return qs.distinct(), use_distinct` there? Sorry, I saw this after merging your first PR already, so you'll have to open another one.


@neel-suthar commented on GitHub (Jan 19, 2024):

@pirate will take care of it. This time I will try to reproduce the issue as well. Maybe will include a clip showing that the issue is fixed. Thanks.


@pirate commented on GitHub (Jan 20, 2024):

Closing as fixed, thanks @neel-suthar! Will be out in the next release, `0.7.3`.
