[GH-ISSUE #1170] Question: Sonic auto index #2235

Closed
opened 2026-03-01 17:57:32 +03:00 by kerem · 8 comments
Owner

Originally created by @ghost on GitHub (Jul 3, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1170

Hello there,
I've configured a Docker stack, but Sonic doesn't index new items.
I can search the content of newly archived sites only after I run "archivebox update --index-only" in Docker.
Could you please help me investigate and fix this?
I also use NginxProxyManager as a reverse proxy, so port 8000 is not exposed to the Docker host.

My config:

version: '2.4'

services:
    archivebox:
        image: ${DOCKER_IMAGE:-archivebox/archivebox:latest}
        restart: unless-stopped
        command: server --quick-init 0.0.0.0:8000
        #ports:
        #    - 8000:8000
        environment:
            - ALLOWED_HOSTS=${ALLOWED_HOSTS}                   # add any config options you want as env vars
            - MEDIA_MAX_SIZE=${MEDIA_MAX_SIZE}
            - SEARCH_BACKEND_ENGINE=${SEARCH_BACKEND_ENGINE}     # uncomment these if you enable sonic below
            - SEARCH_BACKEND_HOST_NAME=${SEARCH_BACKEND_HOST_NAME}
            - SEARCH_BACKEND_PASSWORD=${SEARCH_BACKEND_PASSWORD}
            - TIMEOUT=${TIMEOUT}
            - MEDIA_TIMEOUT=${MEDIA_TIMEOUT}
            - SAVE_FAVICON=${SAVE_FAVICON}
            - SAVE_WGET=${SAVE_WGET}
            - SAVE_WARC=${SAVE_WARC}
            - SAVE_PDF=${SAVE_PDF}
            - SAVE_SCREENSHOT=${SAVE_SCREENSHOT}
            - SAVE_DOM=${SAVE_DOM}
            - SAVE_SINGLEFILE=${SAVE_SINGLEFILE}
            - SAVE_READABILITY=${SAVE_READABILITY}
            - SAVE_MERCURY=${SAVE_MERCURY}
            - SAVE_GIT=${SAVE_GIT}
            - SAVE_MEDIA=${SAVE_MEDIA}
            - SAVE_ARCHIVE_DOT_ORG=${SAVE_ARCHIVE_DOT_ORG}
            - CHECK_SSL_VALIDITY=${CHECK_SSL_VALIDITY}
            - SAVE_WGET_REQUISITES=${SAVE_WGET_REQUISITES}
            - PUBLIC_INDEX=${PUBLIC_INDEX}
            - PUBLIC_SNAPSHOTS=${PUBLIC_SNAPSHOTS}
            - PUBLIC_ADD_VIEW=${PUBLIC_ADD_VIEW} 
        volumes:
            - /srv/archivebox/archivebox_data:/data
            - /srv/archivebox/archivebox_archive:/data/archive
            - /etc/localtime:/etc/localtime:ro
            - /etc/timezone:/etc/timezone:ro
        networks:
            - npm-external
            - archivebox-internal

    sonic:
        image: valeriansaliou/sonic:v1.3.0
        restart: unless-stopped
        depends_on:
            - archivebox
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=${SEARCH_BACKEND_PASSWORD}
        volumes:
            - /srv/archivebox/sonic_cfg/sonic.cfg:/etc/sonic.cfg:ro
            - /srv/archivebox/sonic_data/:/var/lib/sonic/store
            - /etc/localtime:/etc/localtime:ro
            - /etc/timezone:/etc/timezone:ro
        networks:
            - archivebox-internal

networks:
    npm-external:
      external: true
    archivebox-internal:
      external: true

@IvanVas commented on GitHub (Aug 11, 2023):

+1


@pirate commented on GitHub (Jan 19, 2024):

Can you set log_level = "debug" in sonic.cfg and DEBUG=True in the archivebox environment, then restart everything?

Then check the output of docker compose logs when you add a new URL, and post it here.

Also note that the docker-compose.yml file changed in v0.7.2. I recommend upgrading your container version and checking for any differences you might want to copy over.
I would also not sync /etc/timezone:/etc/timezone:ro with the archivebox container, as archivebox must always be run in UTC and does not support server-side timezone changes (only client-side).
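
As a rough sketch of what those two compose changes might look like when applied to the config from the original post (the service name and volume paths are copied from that post; the `log_level = "debug"` change goes in `sonic.cfg` itself and is not shown here):

```yaml
services:
    archivebox:
        environment:
            - DEBUG=True    # temporary, for troubleshooting only
            # ...existing ALLOWED_HOSTS, SEARCH_BACKEND_*, SAVE_* entries unchanged...
        volumes:
            - /srv/archivebox/archivebox_data:/data
            - /srv/archivebox/archivebox_archive:/data/archive
            - /etc/localtime:/etc/localtime:ro
            # the /etc/timezone:/etc/timezone:ro mount is dropped here, per the advice above
```

After restarting both containers, adding a new URL and watching `docker compose logs` should show whether ArchiveBox attempts to reach Sonic at all.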


@pirate commented on GitHub (Mar 1, 2024):

Closing as inactive for now. Feel free to comment back if you're still having issues on :dev and I can reopen it.


@dehlen commented on GitHub (Aug 22, 2024):

Hey, I am currently experiencing the same issue. Happy to provide more logs when I am back at my computer tomorrow. I am only adding new URLs via the scheduler. Could this be related? I just triggered update --index-only, but it is very time-consuming because I have quite a large library of snapshots. I stumbled upon this issue while searching my library for something I was very sure should be in the index. I found my saved article with a valid readability representation containing my keyword, but search did not return it.


@pirate commented on GitHub (Aug 22, 2024):

Ok, I can investigate but please open a new issue and share logs + archivebox version output when you get a chance.


@dehlen commented on GitHub (Aug 23, 2024):

Never mind, I had another look at my config this morning and I think I found the issue. Here is my assumption:
I use docker compose. In it, the ArchiveBox service has these environment variables:

- SEARCH_BACKEND_ENGINE=sonic
- SEARCH_BACKEND_HOST_NAME=sonic
- SEARCH_BACKEND_PASSWORD=MY_PASSWORD

I also set up the ArchiveBox scheduler by adding this volume to the ArchiveBox service:
- path/to/crontabs:/var/spool/cron/crontabs

Then I set up the sonic service:

sonic:
        image: valeriansaliou/sonic:latest
        container_name: sonic
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=MY_PASSWORD
        volumes:
            - path/to/sonic.cfg:/etc/sonic.cfg:ro
            - path/to/sonic:/var/lib/sonic/store

And I set up the ArchiveBox scheduler service:

archivebox_scheduler:
  build: path/to/Dockerfile
  command: schedule --foreground
  environment:
      - MEDIA_MAX_SIZE=750m # increase this number to allow archiving larger audio/video files
      - TIMEOUT=120 # increase if you see timeouts often during archiving / on slow networks
      - ONLY_NEW=True # set to False to retry previously failed URLs when re-adding instead of skipping them
      # - CHECK_SSL_VALIDITY=True         # set to False to allow saving URLs w/ broken SSL certs
      - PUID=1026 # set to your host user's UID & GID if you encounter permissions issues
      - PGID=100
      - SAVE_TITLE=True
      - SAVE_FAVICON=True
      - SAVE_WGET=True
      - SAVE_WARC=True
      - SAVE_PDF=False
      - SAVE_SCREENSHOT=False
      - SAVE_DOM=False
      - SAVE_SINGLEFILE=True
      - SAVE_READABILITY=True
      - SAVE_GIT=False
      - SAVE_MEDIA=False
      - SUBMIT_ARCHIVE_DOT_ORG=False
      - SAVE_ARCHIVE_DOT_ORG=False
  volumes:
      - path/to/data:/data
      - path/to/crontabs:/var/spool/cron/crontabs

What was missing was the SEARCH_BACKEND-related environment variables on my scheduler service. Whenever the scheduler ran, archivebox add was executed, but the scheduler service adding the URLs did not know about the Sonic backend. I added the 3 SEARCH_BACKEND-related env variables to my scheduler service as well, and now it seems to be working. I verified this by connecting to the scheduler service container and running archivebox schedule --run-all. This executed all my scheduled scripts and added new links, which were immediately findable via search. Previously this was not the case for URLs added via the scheduler, whereas URLs added from the web UI always worked flawlessly. So I think this is what caused the problems above.
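
A minimal sketch of the corrected scheduler service, assuming the same service name, paths, and placeholder password as in the snippets above (the remaining `environment:` entries are elided, not removed):

```yaml
services:
    archivebox_scheduler:
        build: path/to/Dockerfile
        command: schedule --foreground
        environment:
            # the same three values the main ArchiveBox service already has
            - SEARCH_BACKEND_ENGINE=sonic
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=MY_PASSWORD
            # ...existing MEDIA_MAX_SIZE, TIMEOUT, ONLY_NEW, SAVE_* entries unchanged...
        volumes:
            - path/to/data:/data
            - path/to/crontabs:/var/spool/cron/crontabs
```

The key point is that every container that runs archivebox add needs these values, not only the one serving the web UI.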


@pirate commented on GitHub (Aug 23, 2024):

That makes sense @dehlen. That's why I generally recommend using ArchiveBox.conf for config (which is shared between all archivebox containers), instead of docker-compose environment: lines (which are per-container).
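
One way this might look in practice, as a sketch only: it assumes `ArchiveBox.conf` sits in the data directory that both services mount at `/data`, and that the `SEARCH_BACKEND_*` keys are set in that file rather than in any `environment:` block. The service names and paths are the placeholders used earlier in this thread.

```yaml
services:
    archivebox:
        command: server --quick-init 0.0.0.0:8000
        volumes:
            - path/to/data:/data    # ArchiveBox.conf assumed to live in this directory
    archivebox_scheduler:
        command: schedule --foreground
        volumes:
            - path/to/data:/data    # same mount, so the same ArchiveBox.conf is read
            - path/to/crontabs:/var/spool/cron/crontabs
```

With this layout a search backend setting is written once and cannot drift between the server and the scheduler.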


@virtadpt commented on GitHub (Aug 27, 2024):

Moved to: https://github.com/ArchiveBox/ArchiveBox/issues/1497

I'm seeing the same problem with a bare-metal install of ArchiveBox: Python v3.11.6, ArchiveBox v0.7.2 installed with pip into a venv, and Sonic v1.4.0 running on the same machine, listening on port 1491/tcp (I can connect to it with telnet).

My ArchiveBox.conf file:

[SERVER_CONFIG]
SECRET_KEY = <redacted>
PUBLIC_INDEX = True
FOOTER_INFO = 
YOUTUBEDL_BINARY = /home/drwho/archivebox/bin/yt-dlp
RIPGREP_VERSION = 11.0.2
YOUTUBEDL_VERSION = 2024.08.06
TIMEOUT = 600
USE_CHROME = false
PUBLIC_SNAPSHOTS = False
PUBLIC_ADD_VIEW = False
BIND_ADDR = 0.0.0.0:8500

[ARCHIVE_METHOD_TOGGLES]
SAVE_SINGLEFILE = True
SAVE_READABILITY = False

[DEPENDENCY_CONFIG]
USE_YOUTUBEDL = False

[SEARCH_BACKEND_CONFIG]
SEARCH_BACKEND_TIMEOUT = 600
SEARCH_BACKEND_PASSWORD = <redacted>
SEARCH_BACKEND_PORT = 1491
SEARCH_BACKEND_ENGINE = sonic
SEARCH_BACKEND_HOST_NAME = localhost

I turned off a couple of things to minimize the number of variables to keep track of when debugging.

Setting loglevel=debug for Sonic just results in this over and over:

Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - running a tasker tick...
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for kv store pool items to janitor
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - done scanning for kv store pool items to janitor, expired 0 items, now has 0 items
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for fst store pool items to janitor
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - done scanning for fst store pool items to janitor, expired 0 items, now has 0 items
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for kv store pool items to flush to disk
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - no kv store pool items need to be flushed at the moment
Aug 26 14:45:27 leandra sonic[1062826]: (DEBUG) - scanning for fst store pool items to consolidate
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - no fst store pool items to consolidate in register
Aug 26 14:45:27 leandra sonic[1062826]: (INFO) - ran tasker tick (took 0s + 0ms)

...so it doesn't look like ArchiveBox is even contacting the Sonic server to send it anything to index. Running DEBUG=True archivebox init in my archive directory doesn't produce anything interesting in logs/errors.log ("> /home/drwho/archivebox/bin/archivebox init; TS=2024-08-27__16:09:32 VERSION=0.7.2 IN_DOCKER=False IS_TTY=True"), nor any terminal output beyond what I get without DEBUG=True: only "Verifying and updating existing ArchiveBox collection to v0.7.2...", "Verifying archive folder structure...", "Verifying main SQL index and running any migrations needed...", and so forth, with no actual debugging output. The same thing happens if I put DEBUG=True in ArchiveBox.conf.
