[GH-ISSUE #1497] Sonic indexing not working on v0.7.2 bare metal install (needed pip install archivebox[sonic]) #3900

Closed
opened 2026-03-15 00:54:45 +03:00 by kerem · 13 comments
Owner

Originally created by @pirate on GitHub (Aug 27, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1497

Moved from: https://github.com/ArchiveBox/ArchiveBox/issues/1170#issuecomment-2313002236

@virtadpt posted:

I'm seeing the same problem with a bare-metal install of Archivebox (Python v3.11.6, Archivebox v0.7.2 installed with pip into a venv, Sonic v1.4.0 running on the same machine, listening on port 1491/tcp (and can be connected to with telnet)).

My ArchiveBox.conf file:

[SERVER_CONFIG]
SECRET_KEY = <redacted>
PUBLIC_INDEX = True
FOOTER_INFO = 
YOUTUBEDL_BINARY = /home/drwho/archivebox/bin/yt-dlp
RIPGREP_VERSION = 11.0.2
YOUTUBEDL_VERSION = 2024.08.06
TIMEOUT = 600
USE_CHROME = false
PUBLIC_SNAPSHOTS = False
PUBLIC_ADD_VIEW = False
BIND_ADDR = 0.0.0.0:8500

[ARCHIVE_METHOD_TOGGLES]
SAVE_SINGLEFILE = True
SAVE_READABILITY = False

[DEPENDENCY_CONFIG]
USE_YOUTUBEDL = False

[SEARCH_BACKEND_CONFIG]
SEARCH_BACKEND_TIMEOUT = 600
SEARCH_BACKEND_PASSWORD = <redacted>
SEARCH_BACKEND_PORT = 1491
SEARCH_BACKEND_ENGINE = sonic
SEARCH_BACKEND_HOST_NAME = localhost

I turned off a couple of things to minimize the number of variables to keep track of when debugging.

Setting loglevel=debug for Sonic just results in this over and over:

Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - running a tasker tick...
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for kv store pool items to janitor
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - done scanning for kv store pool items to janitor, expired 0 items, now has 0 items
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for fst store pool items to janitor
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - done scanning for fst store pool items to janitor, expired 0 items, now has 0 items
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for kv store pool items to flush to disk
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - no kv store pool items need to be flushed at the moment
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for fst store pool items to consolidate
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - no fst store pool items to consolidate in register
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - ran tasker tick (took 0s + 0ms)

...so it doesn't look like Archivebox is even contacting the Sonic server to send it stuff to index. Running DEBUG=True archivebox init in my archive directory doesn't result in anything interesting in logs/errors.log ("> /home/drwho/archivebox/bin/archivebox init; TS=2024-08-27__16:09:32 VERSION=0.7.2 IN_DOCKER=False IS_TTY=True") or any novel output to the terminal compared to without DEBUG=True (i.e., only "Verifying and updating existing ArchiveBox collection to v0.7.2...", "Verifying archive folder structure...", "Verifying main SQL index and running any migrations needed...", and so forth but no actual debugging output (the same thing happens if I put DEBUG=True in ArchiveBox.conf)).
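A reachability check like the telnet test above can also be scripted. The following is a minimal sketch, assuming the Sonic server sends a one-line greeting as soon as a client connects (the exact greeting text is not shown in this thread); `read_sonic_banner` is a hypothetical helper, and the host and port are taken from the ArchiveBox.conf posted above.

```python
import socket


def read_sonic_banner(host: str, port: int, timeout: float = 5.0) -> str:
    """Open a TCP connection and return the first line the server sends.

    Hypothetical helper for illustration; it only proves the port is
    reachable from this machine and that something answers on it.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        data = b""
        # Read until the end of the first line or until the peer closes.
        while not data.endswith(b"\n"):
            chunk = conn.recv(1024)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", errors="replace").strip()
```

Calling `read_sonic_banner("localhost", 1491)` from the same venv and user that runs ArchiveBox would raise an `OSError` if the port is unreachable. A successful greeting only rules out the network side; it says nothing about whether ArchiveBox itself ever tries to connect.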

Originally posted by @virtadpt in https://github.com/ArchiveBox/ArchiveBox/issues/1170#issuecomment-2313002236


@virtadpt commented on GitHub (Aug 27, 2024):

@pirate It does not. I even tried running Sonic in one terminal and archivebox update --index-only in another (both in debug mode) and Sonic never received any connections from archivebox.
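Since the title of this issue points at a missing optional extra (pip install archivebox[sonic]), one quick thing to check when Sonic sees no connections at all is whether a Sonic client module is importable inside the venv. This is a minimal sketch; `has_module` is a hypothetical helper, and the module name `sonic` used in the note below is an assumption about what the extra installs.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper; it does not import the module, it only asks
    the import system whether it could be located.
    """
    return importlib.util.find_spec(name) is not None
```

Running `has_module("sonic")` with the venv's own Python interpreter would show whether the client library is present there (again, assuming that is the module name).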

Readability... I thought that had been installed with archivebox setup but kept erroring out, which is why I turned it off. I just installed nodejs-readability-cli, turned it on in ArchiveBox.conf, and set READABILITY_BINARY = /usr/bin/readable to make sure it could be found. archivebox update --index-only is running right now and will take a while.

Incidentally, that seems like a thing that should be in the "how to configure search" documentation. Just what Readability is doesn't seem to be in the docs anywhere (and searching for it turns up a bunch of AI slop about readability in the context of writing papers and advertising and stuff).


@pirate commented on GitHub (Aug 27, 2024):

@virtadpt I need three things to help:

  1. the full output of archivebox version
  2. the full output of archivebox add 'https://example.com/#123456' (I need to see which methods are working / any errors)
  3. the output of archivebox update --index-only + check if that sends stuff to Sonic

Please note if you have SAVE_READABILITY = False there may not be anything to send to Sonic if any of the other text-based output methods fail, as Sonic only indexes specific text-based output types.


@virtadpt commented on GitHub (Aug 27, 2024):

(archivebox) {12:01:47 @ Tue Aug 27}
[drwho @ leandra:(7) ArchiveBox]$ archivebox version
0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-08-19 15:32:40 1724106760
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.6.8-arch1-1-x86_64-with-glibc2.38 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.6         valid     /usr/bin/python3.11
 
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /home/drwho/archivebox/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /home/drwho/archivebox/bin/archivebox

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                

 √  WGET_BINARY           v1.21.4         valid     /usr/bin/wget                

 √  NODE_BINARY           v21.5.0         valid     /usr/bin/node                

 X  SINGLEFILE_BINARY     ?               invalid   single-file                  

 √  READABILITY_BINARY    v2.4.5          valid     /usr/lib/node_modules/readability-cli/index.js

 √  MERCURY_BINARY        v1.0.0          valid     /usr/lib/node_modules/@postlight/parser/cli.js

 √  GIT_BINARY            v2.43.0         valid     /usr/bin/git                 

 -  YOUTUBEDL_BINARY      -               disabled  /home/drwho/archivebox/bin/yt-dlp
 -  CHROME_BINARY         -               disabled  /usr/bin/chromium            
 √  RIPGREP_BINARY        v14.1.0         valid     /usr/bin/rg                  

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/drwho/archivebox/lib/python3.11/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /home/drwho/archivebox/lib/python3.11/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                         

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                         
 -  COOKIES_FILE          -               disabled  None                         

[i] Data locations:
 √  OUTPUT_DIR            11 files        valid     /home/drwho/ArchiveBox
 √  SOURCES_DIR           93 files        valid     ./sources                    
 √  LOGS_DIR              1 files         valid     ./logs                       
 √  ARCHIVE_DIR           14061 files     valid     ./archive                    
 √  CONFIG_FILE           692.0 Bytes     valid     ./ArchiveBox.conf            
 √  SQL_INDEX             87.0 MB         valid     ./index.sqlite3              

[!] Warning: Missing 1 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False

@virtadpt commented on GitHub (Aug 27, 2024):

(archivebox) {12:04:21 @ Tue Aug 27}
[drwho @ leandra:(7) ArchiveBox]$ archivebox add 'https://example.com/#123456'
[i] [2024-08-27 19:04:36] ArchiveBox v0.7.2: archivebox add https://example.com/#123456
    > /home/drwho/ArchiveBox

[!] Warning: Missing 1 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False


[+] [2024-08-27 19:04:36] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1724785476-import.txt
    > Parsed 1 URLs from input (Generic TXT)
    > Found 1 new URLs not already in index

[*] [2024-08-27 19:04:36] Writing 1 links to main index...
    √ ./index.sqlite3

[*] [2024-08-27 19:04:37] Archiving 1/14062 URLs from added set...

[▶] [2024-08-27 19:04:37] Starting archiving of 1 snapshots in index...

[+] [2024-08-27 19:04:37] "example.com/#123456"
    https://example.com/#123456
    > ./archive/1724785476.865335 
      > favicon
      > headers
      > wget
      > title
      > readability
        Extractor failed:
             Readability was not able to archive the page (invalid JSON)
            Unknown argument: https://example.com/#123456
            index.js [source]
            Process HTML input
        Run to see full output:
            cd /home/drwho/ArchiveBox/archive/1724785476.865335;
            /usr/lib/node_modules/readability-cli/index.js ./{dom,singlefile}.html

      > mercury
      > htmltotext
      > archive_org
        Extractor failed:
             Failed to find "content-location" URL header in Archive.org response.
        Run to see full output:
            cd /home/drwho/ArchiveBox/archive/1724785476.865335;
            curl --silent --location --compressed --head --max-time 600 --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)" "https://web.archive.org/save/https://example.com/#123456"

        10 files (252.2 KB) in 0:00:04s 

[√] [2024-08-27 19:04:42] Update of 1 pages complete (4.82 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

@virtadpt commented on GitHub (Aug 27, 2024):

archivebox update --index-only is executing right now. When it's done I'll paste the output.

Incidentally, I installed Readability (I think) and enabled it (archivebox config --set READABILITY_BINARY=/usr/bin/readable) before this --index-only run.


@pirate commented on GitHub (Aug 27, 2024):

Ok so because both single-file and readability were missing, you were running on fumes as far as full-text generation goes. You were relying on only htmltotext and mercury, and mercury is deprecated: the company that built it unfortunately abandoned the project about a year ago (https://github.com/postlight/parser), so it hasn't been updated in a long time.

It's not something I document explicitly because search is constantly improving and evolving as we add, remove, and re-configure extractors. The new-ish htmltotext extractor was added as a fallback to cover this exact circumstance where there are no working JS-based extractors for example.

It looks like htmltotext and mercury both worked in your example though, so let's see why they're not sending anything to sonic.

Can you post the contents of ~/ArchiveBox/archive/1724785476.865335/index.json please? It will show exactly what texts were sent to sonic (if any). In the past almost all of the issues people have run into with sonic have been one of these two:

  1. archivebox is not successfully extracting any text that can be sent to Sonic
  2. sonic is not running / some network config is making it unreachable from ArchiveBox

I want to really rule out those two common causes before diving into more intricate debugging.
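To check the first cause quickly, the per-extractor results in a snapshot's index.json can be summarized. This is a rough sketch, assuming the layout posted later in this thread (a `history` object mapping each extractor name to a list of results with `status` and `index_texts` fields) and assuming the last entry in each list is the most recent run; `summarize_index_texts` is a hypothetical helper, not part of ArchiveBox.

```python
import json
from pathlib import Path


def summarize_index_texts(index: dict) -> dict:
    """Map each extractor to (status, number of index_texts entries).

    Hypothetical helper. Extractors with no recorded results are
    reported as ("no results", 0).
    """
    summary = {}
    for extractor, results in index.get("history", {}).items():
        if not results:
            summary[extractor] = ("no results", 0)
            continue
        latest = results[-1]  # assumption: last entry is the newest run
        texts = latest.get("index_texts") or []  # null and [] both count as 0
        summary[extractor] = (latest.get("status"), len(texts))
    return summary


def load_and_summarize(path: str) -> dict:
    """Read an index.json file from disk and summarize it."""
    return summarize_index_texts(json.loads(Path(path).read_text()))
```

If every extractor reports a count of 0, there was no text recorded for the search backend for that snapshot, whatever the state of the Sonic connection.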

Also note v0.8.x is bringing lots of improvements to the dependency management (single-file, readability, sonic, etc. should all auto-install more easily and reliably in the future).


@virtadpt commented on GitHub (Aug 27, 2024):

Oh! Something landed in logs/errors.log when I added that URL for example.com:

Exception in archive_methods.save_htmltotext(Link(url=https://example.com/#123456))
command=/home/drwho/archivebox/bin/archivebox add https://example.com/#123456;
ts=2024-08-27__19:04:41
cannot access local variable 'cmd' where it is not associated with a value
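For context, that last line is the message Python 3.11 gives for an UnboundLocalError: a local variable is read on a code path where it was never assigned. The sketch below reproduces the same class of error; it is an illustration only, `run_extractor` is a hypothetical function, and it is not a claim about what the actual save_htmltotext code looks like.

```python
def run_extractor(fail_early: bool) -> str:
    """Hypothetical extractor wrapper that reads cmd in its error path."""
    try:
        if fail_early:
            # Something goes wrong before cmd is ever assigned.
            raise RuntimeError("setup failed before cmd was built")
        cmd = ["echo", "hello"]
        return "ran " + " ".join(cmd)
    except RuntimeError:
        # On the fail_early path cmd has no value yet, so reading it
        # here raises UnboundLocalError instead of reporting the failure.
        return "failed " + " ".join(cmd)


def demo() -> str:
    """Trigger the error and return its message."""
    try:
        run_extractor(fail_early=True)
    except UnboundLocalError as exc:
        return str(exc)
    return ""
```

The practical upshot is that the original failure gets masked: the error handler itself crashes, so the extractor records nothing.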

@virtadpt commented on GitHub (Aug 27, 2024):

Contents of ~/ArchiveBox/archive/1724785476.865335/index.json:

{
    "archive_path": "archive/1724785476.865335",
    "base_url": "example.com/#123456",
    "basename": "",
    "bookmarked_date": "2024-08-27 19:04",
    "canonical": {
        "archive_org_path": "https://web.archive.org/web/example.com/#123456",
        "dom_path": "output.html",
        "favicon_path": "favicon.ico",
        "git_path": "git/",
        "google_favicon_path": "https://www.google.com/s2/favicons?domain=example.com",
        "headers_path": "headers.json",
        "htmltotext_path": "htmltotext.txt",
        "index_path": "index.html",
        "media_path": "media/",
        "mercury_path": "mercury/content.html",
        "pdf_path": "output.pdf",
        "readability_path": "readability/content.html",
        "screenshot_path": "screenshot.png",
        "singlefile_path": "singlefile.html",
        "warc_path": "warc/",
        "wget_path": "example.com/index.html"
    },
    "domain": "example.com",
    "extension": "",
    "hash": "1EKQDWKEYZXSX8802WM5",
    "history": {
        "archive_org": [
            {
                "cmd": [
                    "curl",
                    "--silent",
                    "--location",
                    "--compressed",
                    "--head",
                    "--max-time",
                    "600",
                    "--user-agent",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)",
                    "https://web.archive.org/save/https://example.com/#123456"
                ],
                "cmd_version": "curl 8.5.0 (x86_64-pc-linux-gnu)",
                "end_ts": "2024-08-27T19:04:41.773482+00:00",
                "index_texts": null,
                "output": "ArchiveError: Failed to find \"content-location\" URL header in Archive.org response.",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:41.078039+00:00",
                "status": "failed"
            }
        ],
        "dom": [],
        "favicon": [
            {
                "cmd": [
                    "curl",
                    "--silent",
                    "--location",
                    "--compressed",
                    "--max-time",
                    "600",
                    "--output",
                    "favicon.ico",
                    "--user-agent",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)",
                    "https://www.google.com/s2/favicons?domain=example.com"
                ],
                "cmd_version": "curl 8.5.0 (x86_64-pc-linux-gnu)",
                "end_ts": "2024-08-27T19:04:38.216098+00:00",
                "index_texts": null,
                "output": "favicon.ico",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:37.854612+00:00",
                "status": "succeeded"
            }
        ],
        "git": [],
        "headers": [
            {
                "cmd": [
                    "curl",
                    "--silent",
                    "--location",
                    "--compressed",
                    "--head",
                    "--max-time",
                    "600",
                    "--user-agent",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)",
                    "https://example.com/#123456"
                ],
                "cmd_version": "curl 8.5.0 (x86_64-pc-linux-gnu)",
                "end_ts": "2024-08-27T19:04:38.654441+00:00",
                "index_texts": null,
                "output": "headers.json",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:38.383034+00:00",
                "status": "succeeded"
            }
        ],
        "htmltotext": [],
        "media": [],
        "mercury": [
            {
                "cmd": [
                    "/usr/lib/node_modules/@postlight/parser/cli.js",
                    "https://example.com/#123456"
                ],
                "cmd_version": "1.0.0",
                "end_ts": "2024-08-27T19:04:40.865055+00:00",
                "index_texts": null,
                "output": "mercury",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:39.465976+00:00",
                "status": "succeeded"
            }
        ],
        "pdf": [],
        "readability": [
            {
                "cmd": [
                    "/usr/lib/node_modules/readability-cli/index.js",
                    "./{dom,singlefile}.html"
                ],
                "cmd_version": "readability-cli v2.4.5",
                "end_ts": "2024-08-27T19:04:39.345037+00:00",
                "index_texts": [],
                "output": "ArchiveError: Readability was not able to archive the page (invalid JSON)",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:39.212726+00:00",
                "status": "failed"
            }
        ],
        "screenshot": [],
        "singlefile": [],
        "title": [
            {
                "cmd": [
                    "curl",
                    "--silent",
                    "--location",
                    "--compressed",
                    "--max-time",
                    "600",
                    "--user-agent",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.5.0 (x86_64-pc-linux-gnu)",
                    "https://example.com/#123456"
                ],
                "cmd_version": "curl 8.5.0 (x86_64-pc-linux-gnu)",
                "end_ts": "2024-08-27T19:04:39.120976+00:00",
                "index_texts": null,
                "output": "Example Domain",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:39.052059+00:00",
                "status": "succeeded"
            }
        ],
        "wget": [
            {
                "cmd": [
                    "wget",
                    "--no-verbose",
                    "--adjust-extension",
                    "--convert-links",
                    "--force-directories",
                    "--backup-converted",
                    "--span-hosts",
                    "--no-parent",
                    "-e",
                    "robots=off",
                    "--timeout=600",
                    "--restrict-file-names=windows",
                    "--warc-file=/home/drwho/ArchiveBox/archive/1724785476.865335/warc/1724785478",
                    "--page-requisites",
                    "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.21.4",
                    "--compression=auto",
                    "https://example.com/#123456"
                ],
                "cmd_version": "GNU Wget 1.21.4",
                "end_ts": "2024-08-27T19:04:38.949628+00:00",
                "index_texts": null,
                "output": "example.com/index.html",
                "pwd": "/home/drwho/ArchiveBox/archive/1724785476.865335",
                "schema": "ArchiveResult",
                "start_ts": "2024-08-27T19:04:38.745495+00:00",
                "status": "succeeded"
            }
        ]
    },
    "icons": null,
    "is_archived": true,
    "is_static": false,
    "latest": {
        "archive_org": "ArchiveError: Failed to find \"content-location\" URL header in Archive.org response.",
        "dom": null,
        "favicon": "favicon.ico",
        "git": null,
        "media": null,
        "pdf": null,
        "screenshot": null,
        "singlefile": null,
        "title": "Example Domain",
        "warc": null,
        "wget": "example.com/index.html"
    },
    "link_dir": "/home/drwho/ArchiveBox/archive/1724785476.865335",
    "newest_archive_date": "2024-08-27T19:04:41.078039+00:00",
    "num_failures": 2,
    "num_outputs": 5,
    "oldest_archive_date": "2024-08-27T19:04:37.854612+00:00",
    "path": "/",
    "schema": "Link",
    "scheme": "https",
    "snapshot_id": "989a4828-fa0b-4566-80b5-824f447e0ee5",
    "sources": [
        "/home/drwho/ArchiveBox/sources/1724785476-import.txt"
    ],
    "tags": null,
    "tags_str": "",
    "timestamp": "1724785476.865335",
    "title": "Example Domain",
    "updated": "2024-08-27T19:04:37.850805+00:00",
    "updated_date": "2024-08-27 19:04",
    "url": "https://example.com/#123456"
}

@virtadpt commented on GitHub (Aug 27, 2024):

Incidentally, I'm really looking forward to v0.8.x because the REST API should be part of that.


@virtadpt commented on GitHub (Aug 27, 2024):

Is archivebox update --index-only supposed to create those /index.*/ files? I seem to recall reading in a couple of closed tickets' comments that this is deprecated and should not happen anymore.

{12:56:08 @ Tue Aug 27}
[drwho @ leandra:(7) ~]$ ls -ltr ArchiveBox/
drwxr-xr-x drwho drwho   20 B  Wed Aug 21 13:53:07 2024  logs
.rwxr-xr-x drwho drwho  277 B  Fri Aug 23 12:02:40 2024  archivebox.sh
drwxr-xr-x drwho drwho   10 B  Mon Aug 26 12:30:37 2024  sonic
.rw-r--r-- drwho drwho 1017 B  Mon Aug 26 14:46:07 2024  sonic.cfg
.rw-r--r-- drwho drwho   87 MB Tue Aug 27 11:53:32 2024  index.sqlite3
.rwxr-xr-x drwho drwho  692 B  Tue Aug 27 11:56:57 2024  ArchiveBox.conf
drwxr-xr-x drwho drwho  3.9 KB Tue Aug 27 12:04:36 2024  sources
.rw-r--r-- drwho drwho   32 KB Tue Aug 27 12:04:37 2024  index.sqlite3-shm
drwxr-xr-x drwho drwho  464 KB Tue Aug 27 12:04:37 2024  archive
.rw-r--r-- drwho drwho  418 KB Tue Aug 27 12:04:42 2024  index.sqlite3-wal
.rw-r--r-- drwho drwho  236 KB Tue Aug 27 12:56:13 2024  index.html
.rw------- drwho drwho   12 KB Tue Aug 27 12:56:13 2024  index.json

@virtadpt commented on GitHub (Aug 27, 2024):

It just failed on me. Stack trace:

Traceback (most recent call last):
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/search/__init__.py", line 25, in import_backend
    backend = import_module(backend_string)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/search/backends/sonic.py", line 3, in <module>
    from sonic import IngestClient, SearchClient
ModuleNotFoundError: No module named 'sonic'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/drwho/archivebox/bin/archivebox", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/cli/archivebox_update.py", line 119, in main
    update(
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)  
           ^^^^^^^^^^^^^^^^^^^^^  
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/main.py", line 818, in update
    index_links(all_links, out_dir=out_dir)
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)  
           ^^^^^^^^^^^^^^^^^^^^^  
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/search/__init__.py", line 108, in index_links
    write_search_index(link, texts, out_dir=out_dir)
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)  
           ^^^^^^^^^^^^^^^^^^^^^  
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/search/__init__.py", line 39, in write_search_index
    backend = import_backend()
              ^^^^^^^^^^^^^^^^
  File "/home/drwho/archivebox/lib/python3.11/site-packages/archivebox/search/__init__.py", line 27, in import_backend
    raise Exception("Could not load '%s' as a backend: %s" % (backend_string, err))
Exception: Could not load 'search.backends.sonic' as a backend: No module named 'sonic'

Is archivebox setup supposed to install the module sonic-client or pysonic-channel or python-sonic-client or something? Or is that something that pip install archivebox is supposed to do? Or...?

Basically, did I make a procedural mistake (when originally installing ArchiveBox), an operational mistake (after installing and setting up Sonic, should I have blown away the venv and reinstalled ArchiveBox, because archivebox setup needs to be re-run whenever something changes?), or is it a bug?
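
For readers puzzling over the two stacked tracebacks above: they look like the result of a dynamic import wrapped in a try/except that raises a new exception. The following is a minimal sketch of that pattern, not ArchiveBox's actual code; `load_backend` and the missing module name are hypothetical, and the sketch uses `raise ... from err` so the second traceback is explicitly linked to the first.

```python
from importlib import import_module


def load_backend(backend_string: str):
    """Import a backend module by dotted name, wrapping failures in a clearer error."""
    try:
        return import_module(backend_string)
    except ImportError as err:
        # `from err` records the ImportError as the direct cause, so Python
        # prints "The above exception was the direct cause of the following
        # exception" instead of "During handling of the above exception...".
        raise RuntimeError(
            f"Could not load {backend_string!r} as a backend: {err}"
        ) from err


# A stdlib module loads normally.
json_module = load_backend("json")

# A missing module (hypothetical name) surfaces as the wrapped error.
try:
    load_backend("no_such_backend_module_xyz")
except RuntimeError as exc:
    message = str(exc)
```

Either way, the last line of the output is the one to read first: it names the module that could not be imported.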


@virtadpt commented on GitHub (Aug 27, 2024):

Okay. So I installed the sonic-client module (pip install sonic-client) and re-ran archivebox update --index-only. It took just over an hour to execute, but it ran to completion as expected, with no errors reported. Moreover, the previously empty ArchiveBox/sonic/ directory structure now has 21 megabytes of indices in it. Sonic's logs have only one new entry ("(WARN) - took a lot of time: 476ms to process channel message") but that seems to be it.
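
One way to confirm that indexing actually wrote data is to measure the index directory before and after a run. This is a small sketch using only the standard library; `dir_size_bytes` is a hypothetical helper, and the demo builds a throwaway directory (with made-up subdirectory and file names) rather than touching a real Sonic store.

```python
import os
import tempfile


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Demo on a temporary directory; the layout below is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    empty_size = dir_size_bytes(tmp)
    os.makedirs(os.path.join(tmp, "store", "example"))
    with open(os.path.join(tmp, "store", "example", "data.bin"), "wb") as fh:
        fh.write(b"\0" * 2048)
    filled_size = dir_size_bytes(tmp)
```

Pointing `dir_size_bytes` at the data directory's sonic/ folder would show whether it is still empty after an update run.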


@pirate commented on GitHub (Aug 28, 2024):

Aha, yeah, you figured it out: to use Sonic on a bare metal install you need pip install sonic-client or pip install archivebox[sonic] (the Docker image comes with it included). archivebox setup does not install it; that command only handles extractor dependencies. I've just updated the docs (https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Search) to make sure that's clear.
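
A quick way to check for this before switching the search backend to Sonic is to ask Python whether the client module is importable in the active venv. This is a sketch under the assumption that the module is importable as `sonic`, which is what the traceback earlier in this thread shows; `module_available` and `sonic_client_hint` are hypothetical helper names.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


def sonic_client_hint() -> str:
    """Describe whether the Sonic client module is present in this environment."""
    if module_available("sonic"):
        return "sonic client module found"
    return (
        "sonic client module missing: try `pip install archivebox[sonic]` "
        "or `pip install sonic-client`"
    )


print(sonic_client_hint())
```

Run it with the same interpreter that runs archivebox (for example the venv's python), since a system-wide install of the module would not be visible inside the venv.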


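> cannot access local variable 'cmd' where it is not associated with a value

This is already fixed in v0.8.x: https://github.com/ArchiveBox/ArchiveBox/commit/6a4e568d1b9e18967278970039ae507144abdb54#diff-5fabc1178ec333515d1c30f5b93acafc9ec6c445b8a762208c9a9019c2dc9294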


> Is archivebox update --index-only supposed to create those /index.*/ files? I seem to recall reading in a couple of closed tickets' comments that this is deprecated and should not happen anymore.

It is indeed not supposed to create those; you are correct, this is a bug. I'll fix it in v0.8, thanks for reporting!

https://github.com/ArchiveBox/ArchiveBox/issues/1500


> Contents of ~/ArchiveBox/archive/1724785476.865335/index.json: ...

Note that your ~/ArchiveBox/archive/1724785476.865335/index.json shows that no indexable text was produced for that URL: every entry is either "index_texts": null or "index_texts": [], so that page won't have any text added to the Sonic index.
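
To check a snapshot for this without reading the JSON by eye, something like the following sketch could count how many results per extractor carry indexable text. It assumes only the shape visible in the index.json quoted above (a "history" mapping of extractor name to a list of results, each with an "index_texts" field); the function name and the trimmed sample are illustrative.

```python
import json


def extractors_with_index_texts(snapshot_index: dict) -> dict:
    """Map each extractor name to the number of its results with non-empty index_texts."""
    counts = {}
    for extractor, results in snapshot_index.get("history", {}).items():
        # Both null (None) and [] count as "no indexable text".
        counts[extractor] = sum(1 for r in results if r.get("index_texts"))
    return counts


# Trimmed-down sample mirroring the shape of the index.json in this thread.
sample = json.loads("""
{
  "history": {
    "readability": [{"index_texts": [], "status": "failed"}],
    "title": [{"index_texts": null, "status": "succeeded"}],
    "wget": [{"index_texts": null, "status": "succeeded"}],
    "singlefile": []
  }
}
""")

counts = extractors_with_index_texts(sample)
has_indexable_text = any(counts.values())
```

For a real snapshot, load the file with `json.load` and pass the resulting dict in; if `has_indexable_text` is False, there is nothing for the search backend to ingest for that page.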
