[GH-ISSUE #697] Bug: Sonic throwing invalid_meta_key exception when indexing snapshots with headers #3457

Closed
opened 2026-03-14 23:01:47 +03:00 by kerem · 2 comments

Originally created by @erob8 on GitHub (Apr 8, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/697

Describe the bug

I have about 12 iFixit article URLs that I've added snapshots for. Most of them throw an exception when I run `docker-compose run archivebox update --index-only`. These exceptions seem to break content indexing completely: even snapshots that didn't throw an exception no longer return search results for their content.

Steps to reproduce

  1. Enable sonic in docker-compose.yml file
  2. Add a snapshot for https://www.ifixit.com/Guide/Samsung+Galaxy+S9+Midframe+assembly+Replacement/120556 using the title, favicon, headers, and wget methods.
  3. Execute `docker-compose run archivebox update --index-only`
  4. Note the `[X] The search backend threw an exception=ERR invalid_meta_key(?[mp4a.40.2\",\"mime\":\"video\/mp4\",\"always_generate\":true},\"MP4_592\":{\"column\":\"MP4_592\",\"label\":\"Low\",\"encoding\":\"mp4\",\"width\":592,\"height\":444,\"ma"])` error in the output.

Screenshots or log output

➜  archivebox docker-compose run archivebox update --index-only
Creating archivebox_archivebox_run ... done
[i] [2021-04-08 21:25:27] ArchiveBox v0.6.0: archivebox update --index-only
    > /data

[*] Indexing url: https://lukelowrey.com/recommended-dotnet-libraries/ in the search index

[*] Indexing url: https://www.ifixit.com/Guide/Samsung+Galaxy+S9+Battery+Replacement/116660 in the search index

[*] Indexing url: https://www.ifixit.com/Guide/Samsung+Galaxy+S9+SIM+Card+or+SD+Card+Replacement/111135 in the search index


[X] The search backend threw an exception=ERR invalid_meta_key(?[mp4a.40.2\",\"mime\":\"video\/mp4\",\"always_generate\":false}},\"userLikeInfo\":{\"guid"])
:
[*] Indexing url: https://www.ifixit.com/Guide/Samsung+Galaxy+S9+Rear-Facing+Camera+Replacement/119319 in the search index


[X] The search backend threw an exception=ERR invalid_meta_key(?[mp4a.40.2\",\"mime\":\"video\/mp4\",\"always_generate\":false}},\"userLikeInfo\":{\"guides_119319\":{\"count\":0,\"notLoggedIn\":true,\"likes\":false}},\"FrameModules\":[\"LoginFrameModule\",\"ImageMenuFrameModule\",\"MediaLibraryFrameModule\",\"NotifyFrameModule\",\"WatchFrameModule\",\"NewsletterFrameModule\",\"UserLikeFrameModule\",\"CommentsWatchFrameModule\",\"CommentsFrameModule\",\"PageStatsFrameModule\",\"GuideApprovalNotifyFrameModule\",\"ImageMarkersFrameModule\",\"ImageCropFrameModule\",\"ModeratorVoteFrameModule\"],\"isProduction\":true,\"locale\":\"en_US\",\"isAdmin\":false,\"siteName\":\"ifixit\",\"useSecureCookies\":true,\"sameSiteValue\":\"None\",\"imageSizeWidths\":{\"mini\":56,\"thumbnail\":96,\"140x105\":140,\"200x150\":200,\"standar"])
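
The payloads in these errors look like fragments of escaped JSON embedded in the archived page text. One plausible reading, not confirmed anywhere in this thread, is that the backslash-escaped quotes end the quoted text argument early, so Sonic tries to parse the remainder as a meta key. A minimal defensive sketch under that assumption is to replace those characters with spaces before the text is pushed. The helper name `sanitize_for_index` is hypothetical and is not part of ArchiveBox or of any Sonic client library.

```python
import re

# Hypothetical helper for illustration only; not ArchiveBox or Sonic client code.
# Assumption: backslashes, double quotes and newlines are what upset the backend.
_UNSAFE = re.compile(r'[\\"\r\n]+')


def sanitize_for_index(text: str) -> str:
    """Replace runs of backslashes, double quotes and newlines with one space,
    then collapse any repeated whitespace."""
    cleaned = _UNSAFE.sub(" ", text)
    return " ".join(cleaned.split())
```

Dropping these characters should not hurt full-text search much, since they carry no searchable words, but it does mean the indexed text is no longer byte-identical to the archived output.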

ArchiveBox version

ArchiveBox v0.6.0
Cpython Linux Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.0          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.7          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.13.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.01     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v88.0.4324.182  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            13 files        valid     /data
 √  SOURCES_DIR           8 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           13 files        valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             460.0 KB        valid     ./index.sqlite3

@erob8 commented on GitHub (Apr 8, 2021):

I was wrong about this part: "These exceptions seem to break the content indexing completely and even snapshots that didn't exception are not returning search results for their content now."

Search works as expected from the /admin/core/snapshot search bar, even for content from snapshots that threw an exception during `archivebox update --index-only`, which is good. However, searching on content doesn't work from the /public/ search bar, which is not how I'd expect it to function since I have `PUBLIC_INDEX` set to `True`.

Is this part functioning as expected? I would like to enable the same search functionality from the public & admin view.


@pirate commented on GitHub (Apr 9, 2021):

~~The Full-text index is not connected to the public search yet actually. We'll likely push it in the next version after v0.6, I just haven't gotten around to it yet.~~
~~Currently the public site only searches the main Snapshot db fields (title, url, timestamp, tags, etc.).~~

Just added full-text search on the public index in v0.6 (89158d5). Also added handling to catch the index errors and bail out after 5 failures on files that don't support searching (3093057).

Give it a shot and comment back here if you're still having trouble and I'll reopen the issue.
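
The "catch the index errors and bail out after 5 failures" behavior described above can be sketched roughly as follows. This is an illustration of the general pattern, not the code from commit 3093057: the function `index_texts`, the `push` callback, and the choice to count failures across the whole run are all assumptions made for the example.

```python
from typing import Callable, Iterable, Tuple

MAX_INDEX_FAILURES = 5  # threshold taken from the comment above


def index_texts(
    push: Callable[[str, str], None],
    items: Iterable[Tuple[str, str]],
    max_failures: int = MAX_INDEX_FAILURES,
) -> Tuple[int, int]:
    """Push (snapshot_id, text) pairs to a search backend.

    `push` is a hypothetical stand-in for whatever call sends text to the
    backend. Errors are logged instead of aborting the run, and the loop
    stops once `max_failures` pushes have raised.
    Returns (indexed_count, failure_count).
    """
    indexed = 0
    failures = 0
    for snapshot_id, text in items:
        try:
            push(snapshot_id, text)
            indexed += 1
        except Exception as err:
            failures += 1
            print(f"[X] The search backend threw an exception={err}")
            if failures >= max_failures:
                print(f"[!] Giving up after {failures} indexing failures")
                break
    return indexed, failures
```

Catching the error per item keeps one bad page from hiding the results for the others, while the failure cap avoids spending a long time on content the backend will never accept.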
