[GH-ISSUE #1626] Bug: ArchiveBox keeps re-adding a certain tag to items #3989

Open
opened 2026-03-15 01:12:48 +03:00 by kerem · 8 comments

Originally created by @FeverGyorn on GitHub (Dec 25, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1626

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

Hello everyone,

I'm running into an odd situation with my ArchiveBox installation that I couldn't find any existing information about.

I have around 600 snapshots in my instance at the moment. Every night, the server re-adds a certain tag (to-be-reviewed) to around 330 of them by itself. I can delete the tag manually, but it is back again the next day.

This does NOT happen for new items I'm currently adding!

The Docker Compose setup is mostly standard, apart from using an NFS-mounted archive volume.

Steps to reproduce

1. Access the ArchiveBox installation
2. Notice that the tag "to-be-reviewed" is present
3. Delete the tag via the corresponding menu. Issue resolved for the time being.
4. The server re-adds it at 00:20 (local time, of course) to the same items as before
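
The nightly 00:20 timing above can be investigated from the host. A debugging sketch (the service name is taken from the compose file further down; the commands assume the containers are running):

```shell
# List the cron entries registered inside the scheduler container
docker compose exec archivebox_scheduler crontab -l

# Review the scheduler's logs around the 00:20 window
docker compose logs --since 24h archivebox_scheduler
```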

Logs or errors

None apparent or related to the issue (apart from the usual, expected parsing messages)

ArchiveBox Version

[+] Creating 2/1
 ✔ Network 34_default              Created                                 0.2s 
 ✔ Volume "34_archivebox-archive"  Cre...                                  0.0s 
0.7.3
ArchiveBox v0.7.3 COMMIT_HASH=069aabc BUILD_TIME=2024-12-15 09:54:03 1734256443
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.1.0-27-amd64-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.11        valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.7.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.10.1         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.18.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 √  GIT_BINARY            v2.39.5         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2024.12.13     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v131.0.6778.33  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        


[i] Data locations: (not in a data directory)

How did you install the version of ArchiveBox you are using?

Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

  - [ ] data/ is on a local SSD or NVMe drive
  - [ ] data/ is on a spinning hard drive or external USB drive
  - [x] data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  - [ ] data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration

# Usage:
#     curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
#     docker compose up
#     docker compose run archivebox version
#     docker compose run -T archivebox add < urls_to_archive.txt
#     docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
#     docker compose run archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
#     docker compose run archivebox help
# Documentation:
#     https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#docker-compose

services:
    archivebox:
        image: archivebox/archivebox:latest
        ports:
            - 8009:8000
        volumes:
            - ./data:/data
            - archivebox-archive:/data/archive
            # ./data/personas/Default/chrome_profile/Default:/data/personas/Default/chrome_profile/Default
        environment:
            # - ADMIN_USERNAME=admin            # creates an admin user on first run with the given user/pass combo
            # - ADMIN_PASSWORD=SomeSecretPassword
            - CSRF_TRUSTED_ORIGINS=https://archivebox.example.com  # REQUIRED for auth, REST API, etc. to work
            - ALLOWED_HOSTS=*                   # set this to the hostname(s) from your CSRF_TRUSTED_ORIGINS
            - PUBLIC_INDEX=True                 # set to False to prevent anonymous users from viewing snapshot list
            - PUBLIC_SNAPSHOTS=True             # set to False to prevent anonymous users from viewing snapshot content
            - PUBLIC_ADD_VIEW=False             # set to True to allow anonymous users to submit new URLs to archive
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=SomeSecretPassword
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911                        # UID/GIDs <500 may clash with existing users and are not recommended
            # For options below, it's better to set using `docker compose run archivebox config --set SOME_KEY=someval` instead of setting here:
            # - MEDIA_MAX_SIZE=750m             # increase this filesize limit to allow archiving larger audio/video files
            # - TIMEOUT=60                      # increase this number to 120+ seconds if you see many slow downloads timing out
            # - CHECK_SSL_VALIDITY=True         # set to False to disable strict SSL checking (allows saving URLs w/ broken certs)
            # - SAVE_ARCHIVE_DOT_ORG=True       # set to False to disable submitting all URLs to Archive.org when archiving
            # - USER_AGENT="..."                # set a custom USER_AGENT to avoid being blocked as a bot
            # ...
            # For more info, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration
            
        # For ad-blocking during archiving, uncomment this section and the pihole service below
        # networks:
        #   - dns
        # dns:
        #   - 172.20.0.53


    ######## Optional Addons: tweak examples below as needed for your specific use case ########

    ### This optional container runs scheduled jobs in the background (and retries failed ones). To add a new job:
    #   $ docker compose run archivebox schedule --add --every=day --depth=1 'https://example.com/some/rss/feed.xml'
    # then restart the scheduler container to apply any changes to the scheduled task list:
    #   $ docker compose restart archivebox_scheduler
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving

    archivebox_scheduler:
        
        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911
            - TIMEOUT=120                       # use a higher timeout than the main container to give slow tasks more time when retrying
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=blablub
            # For other config it's better to set using `docker compose run archivebox config --set SOME_KEY=someval` instead of setting here
            # ...
            # For more info, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration
        volumes:
            - ./data:/data
        # cpus: 2                               # uncomment / edit these values to limit scheduler container resource consumption
        # mem_limit: 2048m
        # restart: always


    ### This runs the optional Sonic full-text search backend (much faster than default rg backend).
    # If Sonic is ever started after not running for a while, update its full-text index by running:
    #   $ docker-compose run archivebox update --index-only
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Search

    sonic:
        image: archivebox/sonic:latest
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=blablub
        volumes:
            - ./data/sonic.cfg:/etc/sonic.cfg:ro    # mount to customize: https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/stable/etc/sonic.cfg
            - ./data/sonic:/var/lib/sonic/store


    ### This optional container runs xvfb+noVNC so you can watch the ArchiveBox browser as it archives things,
    # or remote control it to set up a chrome profile w/ login credentials for sites you want to archive.
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#docker-vnc-setup

    novnc:
        image: theasp/novnc:latest
        environment:
            - DISPLAY_WIDTH=1920
            - DISPLAY_HEIGHT=1080
            - RUN_XTERM=no
        ports:
            # to view/control ArchiveBox's browser, visit: http://127.0.0.1:8080/vnc.html
            # restricted to access from localhost by default because it has no authentication
            - 127.0.0.1:8080:8080


    ### Example: Put Nginx in front of the ArchiveBox server for SSL termination and static file serving.
    # You can also use any other ingress provider for SSL like Apache, Caddy, Traefik, Cloudflare Tunnels, etc.

    # nginx:
    #     image: nginx:alpine
    #     ports:
    #         - 443:443
    #         - 80:80
    #     volumes:
    #         - ./etc/nginx.conf:/etc/nginx/nginx.conf
    #         - ./data:/var/www


    ### Example: To run pihole in order to block ad/tracker requests during archiving,
    # uncomment this optional block and set up pihole using its admin interface

    # pihole:
    #   image: pihole/pihole:latest
    #   ports:
    #     # access the admin HTTP interface on http://localhost:8090
    #     - 127.0.0.1:8090:80
    #   environment:
    #     - WEBPASSWORD=SET_THIS_TO_SOME_SECRET_PASSWORD_FOR_ADMIN_DASHBOARD
    #     - DNSMASQ_LISTENING=all
    #   dns:
    #     - 127.0.0.1
    #     - 1.1.1.1
    #   networks:
    #     dns:
    #       ipv4_address: 172.20.0.53
    #   volumes:
    #     - ./etc/pihole:/etc/pihole
    #     - ./etc/dnsmasq:/etc/dnsmasq.d


    ### Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel to avoid IP blocks.
    # You can also use any other VPN that works at the docker/IP level, e.g. Tailscale, OpenVPN, etc.

    # wireguard:
    #   image: linuxserver/wireguard:latest
    #   network_mode: 'service:archivebox'
    #   cap_add:
    #     - NET_ADMIN
    #     - SYS_MODULE
    #   sysctls:
    #     - net.ipv4.conf.all.rp_filter=2
    #     - net.ipv4.conf.all.src_valid_mark=1
    #   volumes:
    #     - /lib/modules:/lib/modules
    #     - ./wireguard.conf:/config/wg0.conf:ro

    ### Example: Run ChangeDetection.io to watch for changes to websites, then trigger ArchiveBox to archive them
    # Documentation: https://github.com/dgtlmoon/changedetection.io
    # More info: https://github.com/dgtlmoon/changedetection.io/blob/master/docker-compose.yml

    # changedetection:
    #     image: ghcr.io/dgtlmoon/changedetection.io
    #     volumes:
    #         - ./data-changedetection:/datastore


    ### Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox

    # pywb:
    #     image: webrecorder/pywb:latest
    #     entrypoint: /bin/sh -c '(wb-manager init default || test $$? -eq 2) && wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback;'
    #     environment:
    #         - INIT_COLLECTION=archivebox
    #     ports:
    #         - 8686:8080
    #     volumes:
    #         - ./data:/archivebox
    #         - ./data/wayback:/webarchive


networks:
    # network just used for pihole container to offer :53 dns resolving on fixed ip for archivebox container
    dns:
        ipam:
            driver: default
            config:
                - subnet: 172.20.0.0/24


# HOW TO: Set up cloud storage for your ./data/archive (e.g. Amazon S3, Backblaze B2, Google Drive, OneDrive, SFTP, etc.)
#   https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage
#
#   Follow the steps here to set up the Docker RClone Plugin https://rclone.org/docker/
#     $ docker plugin install rclone/docker-volume-rclone:amd64 --grant-all-permissions --alias rclone
#     $ nano /var/lib/docker-plugins/rclone/config/rclone.conf
#     [examplegdrive]
#     type = drive
#     scope = drive
#     drive_id = 1234567...
#     root_folder_id = 0Abcd...
#     token = {"access_token":...}

# volumes:
#     archive:
#         driver: rclone
#         driver_opts:
#             remote: 'examplegdrive:archivebox'
#             allow_other: 'true'
#             vfs_cache_mode: full
#             poll_interval: 0

volumes:
    archivebox-archive:
        driver_opts:
          type: "nfs"
          o: "addr=192.168.178.109,rw,nfsvers=4"
          device: ":/mnt/Private_Shared-Data/Archivebox-Archive"

ArchiveBox Configuration

[SERVER_CONFIG]
SECRET_KEY = blablabla

[SEARCH_BACKEND_CONFIG]
SEARCH_BACKEND_ENGINE = sonic
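For context on the nightly timing: the scheduler service in the compose file above runs with `--update`, so it re-runs `archivebox update` across existing snapshots every day. One debugging step (an assumption, not a confirmed diagnosis) would be to drop that flag temporarily and see whether the tag stops reappearing:

```yaml
# docker-compose.override.yml sketch: run the scheduler daemon without the
# nightly --update pass, to test whether that pass is what re-adds the tag
services:
    archivebox_scheduler:
        command: schedule --foreground
```

If the tag stops coming back, the re-tagging is happening during the scheduled update run rather than via some external client.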

@pirate commented on GitHub (Dec 26, 2024):

Can you share the output of:

docker compose run archivebox_scheduler --show


@FeverGyorn commented on GitHub (Dec 26, 2024):

docker compose run archivebox_scheduler --show
WARN[0000] Found orphan containers ([34-archivebox_scheduler-run-2112d17c1973 34-archivebox-run-bc4724b63987]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
usage: archivebox [--help] [--version]
                  [{help,version,init,config,setup,add,remove,update,list,status,shell,server,manage,oneshot,schedule}] ...
archivebox: error: unrecognized arguments: --show

Not sure the argument is really available here.


@pirate commented on GitHub (Dec 26, 2024):

Sorry typo'd a word, I meant:

docker compose run archivebox_scheduler schedule --show

# you should also share the output from these commands to see if there are any orphan snapshots/tags that it's trying to re-add:
docker compose run archivebox_scheduler status
docker compose run archivebox_scheduler config  # redact any keys/secrets

@FeverGyorn commented on GitHub (Dec 26, 2024):

Aye, here's the output.
Remark: I'll come back later to clean up the orphans and run init again (since I assume that would be the recommendation now?).

docker compose run archivebox_scheduler schedule --show
WARN[0000] Found orphan containers ([34-archivebox_scheduler-run-f814d8941c17 34-archivebox_scheduler-run-8e655660df23 34-archivebox_scheduler-run-2112d17c1973 34-archivebox-run-bc4724b63987]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
[i] [2024-12-26 07:23:09] ArchiveBox v0.7.3: archivebox schedule --show
    > /data

[X] There are no ArchiveBox cron jobs scheduled for your user (archivebox).
    To schedule a new job, run:
        archivebox schedule --every=[timeperiod] --depth=1 https://example.com/some/rss/feed.xml
docker compose run archivebox_scheduler status
WARN[0000] Found orphan containers ([34-archivebox_scheduler-run-4451707cd18a 34-archivebox_scheduler-run-d07a8df85137 34-archivebox_scheduler-run-46a1913c2d16 34-archivebox_scheduler-run-a0d1e3a7e678 34-archivebox_scheduler-run-f814d8941c17 34-archivebox_scheduler-run-8e655660df23 34-archivebox_scheduler-run-2112d17c1973 34-archivebox-run-bc4724b63987]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
[X] This collection was created with an older version of ArchiveBox and must be upgraded first.
    /data

    To upgrade it to the latest version and apply the 40 pending migrations, run:
        archivebox init
docker compose run archivebox_scheduler config
WARN[0000] Found orphan containers ([34-archivebox_scheduler-run-46a1913c2d16 34-archivebox_scheduler-run-a0d1e3a7e678 34-archivebox_scheduler-run-f814d8941c17 34-archivebox_scheduler-run-8e655660df23 34-archivebox_scheduler-run-2112d17c1973 34-archivebox-run-bc4724b63987]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
[i] [2024-12-26 07:26:42] ArchiveBox v0.7.3: archivebox config
    > /data

IS_TTY=True
USE_COLOR=True
SHOW_PROGRESS=True
IN_DOCKER=True
IN_QEMU=False
PUID=911
PGID=911
OUTPUT_DIR=/data
CONFIG_FILE=/data/ArchiveBox.conf
ONLY_NEW=True
TIMEOUT=120
MEDIA_TIMEOUT=3600
OUTPUT_PERMISSIONS=644
RESTRICT_FILE_NAMES=windows
URL_DENYLIST=\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$
URL_ALLOWLIST=None
ADMIN_USERNAME=None
ADMIN_PASSWORD=None
ENFORCE_ATOMIC_WRITES=True
TAG_SEPARATOR_PATTERN=[,]
SECRET_KEY=None
BIND_ADDR=0.0.0.0:8000
ALLOWED_HOSTS=*
DEBUG=False
PUBLIC_INDEX=True
PUBLIC_SNAPSHOTS=True
PUBLIC_ADD_VIEW=False
FOOTER_INFO=Content is hosted for personal archiving purposes only.  Contact server owner for any takedown requests.
SNAPSHOTS_PER_PAGE=40
CUSTOM_TEMPLATES_DIR=None
TIME_ZONE=UTC
TIMEZONE=UTC
REVERSE_PROXY_USER_HEADER=Remote-User
REVERSE_PROXY_WHITELIST=
LOGOUT_REDIRECT_URL=/
PREVIEW_ORIGINALS=True
LDAP=False
LDAP_SERVER_URI=None
LDAP_BIND_DN=None
LDAP_BIND_PASSWORD=None
LDAP_USER_BASE=None
LDAP_USER_FILTER=None
LDAP_USERNAME_ATTR=None
LDAP_FIRSTNAME_ATTR=None
LDAP_LASTNAME_ATTR=None
LDAP_EMAIL_ATTR=None
SAVE_TITLE=True
SAVE_FAVICON=True
SAVE_WGET=True
SAVE_WGET_REQUISITES=True
SAVE_SINGLEFILE=True
SAVE_READABILITY=True
SAVE_MERCURY=True
SAVE_HTMLTOTEXT=True
SAVE_PDF=True
SAVE_SCREENSHOT=True
SAVE_DOM=True
SAVE_HEADERS=True
SAVE_WARC=True
SAVE_GIT=True
SAVE_MEDIA=True
SAVE_ARCHIVE_DOT_ORG=True
RESOLUTION=1440,2000
GIT_DOMAINS=github.com,bitbucket.org,gitlab.com,gist.github.com
CHECK_SSL_VALIDITY=True
MEDIA_MAX_SIZE=750m
CURL_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.10.1 (x86_64-pc-linux-gnu)
WGET_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.21.3
CHROME_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.3 (+https://github.com/ArchiveBox/ArchiveBox/)
COOKIES_FILE=None
CHROME_USER_DATA_DIR=None
CHROME_TIMEOUT=0
CHROME_HEADLESS=True
CHROME_SANDBOX=False
YOUTUBEDL_ARGS=['--write-description', '--write-info-json', '--write-annotations', '--write-thumbnail', '--no-call-home', '--write-sub', '--write-auto-subs', '--convert-subs=srt', '--yes-playlist', '--continue', '--no-abort-on-error', '--ignore-errors', '--geo-bypass', '--add-metadata', '--format=(bv*+ba/b)[filesize<=750m][filesize_approx<=?750m]/(bv*+ba/b)']
WGET_ARGS=['--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off']
CURL_ARGS=['--silent', '--location', '--compressed']
GIT_ARGS=['--recursive']
SINGLEFILE_ARGS=[]
FAVICON_PROVIDER=https://www.google.com/s2/favicons?domain={}
USE_INDEXING_BACKEND=True
USE_SEARCHING_BACKEND=True
SEARCH_BACKEND_ENGINE=sonic
SEARCH_BACKEND_HOST_NAME=sonic
SEARCH_BACKEND_PORT=1491
SEARCH_BACKEND_PASSWORD=SomeSecretPassword
SEARCH_PROCESS_HTML=True
SONIC_COLLECTION=archivebox
SONIC_BUCKET=snapshots
SEARCH_BACKEND_TIMEOUT=90
FTS_SEPARATE_DATABASE=True
FTS_TOKENIZERS=porter unicode61 remove_diacritics 2
FTS_SQLITE_MAX_LENGTH=1000000000
USE_CURL=True
USE_WGET=True
USE_SINGLEFILE=True
USE_READABILITY=True
USE_MERCURY=True
USE_GIT=True
USE_CHROME=True
USE_NODE=True
USE_YOUTUBEDL=True
USE_RIPGREP=True
CURL_BINARY=curl
GIT_BINARY=git
WGET_BINARY=wget
SINGLEFILE_BINARY=/app/node_modules/.bin/single-file
READABILITY_BINARY=/app/node_modules/.bin/readability-extractor
MERCURY_BINARY=/app/node_modules/.bin/postlight-parser
YOUTUBEDL_BINARY=yt-dlp
NODE_BINARY=node
RIPGREP_BINARY=rg
CHROME_BINARY=chromium-browser
POCKET_CONSUMER_KEY=None
USER=archivebox
PACKAGE_DIR=/app/archivebox
TEMPLATES_DIR=/app/archivebox/templates
ARCHIVE_DIR=/data/archive
SOURCES_DIR=/data/sources
LOGS_DIR=/data/logs
URL_DENYLIST_PTN=re.compile('\\.(css|js|otf|ttf|woff|woff2|gstatic\\.com|googleapis\\.com/css)(\\?.*)?$', re.IGNORECASE|re.MULTILINE)
URL_ALLOWLIST_PTN=None
DIR_OUTPUT_PERMISSIONS=755
ARCHIVEBOX_BINARY=/usr/local/bin/archivebox
VERSION=0.7.3
COMMIT_HASH=069aabceb2d2cc72963727f7607b8f877e018302
BUILD_TIME=2024-12-15 09:54:03 1734256443
VERSIONS_AVAILABLE=None
CAN_UPGRADE=False
PYTHON_BINARY=/usr/local/bin/python3.11
PYTHON_ENCODING=UTF-8
PYTHON_VERSION=3.11.11
DJANGO_BINARY=/usr/local/lib/python3.11/site-packages/django/__init__.py
DJANGO_VERSION=3.1.14 final (0)
SQLITE_BINARY=/usr/local/lib/python3.11/sqlite3/dbapi2.py
SQLITE_VERSION=2.6.0
CURL_VERSION=curl 8.10.1 (x86_64-pc-linux-gnu)
WGET_VERSION=GNU Wget 1.21.3
WGET_AUTO_COMPRESSION=True
RIPGREP_VERSION=ripgrep 13.0.0
SINGLEFILE_VERSION=1.1.54
READABILITY_VERSION=0.0.11
MERCURY_VERSION=1.0.0
GIT_VERSION=git version 2.39.5
YOUTUBEDL_VERSION=2024.12.13
CHROME_VERSION=Chromium 131.0.6778.33
NODE_VERSION=v20.18.1
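As an aside on the config dump above: the URL_DENYLIST_PTN line is a plain Python regex, so it can be exercised standalone to see which URLs the denylist would filter out. The pattern below is copied verbatim from the dump; the assumption that ArchiveBox applies it with `.search()` per URL is mine.

```python
# URL denylist regex exactly as printed in URL_DENYLIST_PTN above.
import re

URL_DENYLIST_PTN = re.compile(
    r'\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$',
    re.IGNORECASE | re.MULTILINE,
)

def is_denied(url: str) -> bool:
    # assumption: ArchiveBox matches the pattern with .search() on each URL
    return bool(URL_DENYLIST_PTN.search(url))
```

With this, stylesheet/script/font URLs (and Google font hosts) are filtered, while ordinary page URLs pass through.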

@FeverGyorn commented on GitHub (Dec 26, 2024):

To reiterate on my previous comment: I think there was a misunderstanding on my end.
When I moved the archive collection from local storage to an NFS share, I did run archivebox init before, but INSIDE the Docker container.
That's why I wondered about the missing-migrations message above.

I re-ran it now on the host, and the message about missing migrations is gone. At first glance a few of the tags I've applied in the meantime seem to be missing, but working with a somewhat corrupt install is certainly my bad then.

I assume it's best to wait for the next night now?

docker compose run archivebox_scheduler status
WARN[0000] Found orphan containers ([34-archivebox-run-9150a39482f7]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
[*] Scanning archive main index...
    /data/* 
    Index size: 17.4 MB across 3 files

    > SQL Main Index: 687 links      (found in index.sqlite3)
    > JSON Link Details: 0 links     (found in archive/*/index.json)

[*] Scanning archive data directories...
    /data/archive/* 
    Size: 0.0 Bytes across 0 files in 0 directories

    > indexed: 687                   (indexed links without checking archive status or data directory validity)
      > archived: 0                  (indexed links that are archived with a valid data directory)
      > unarchived: 687              (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 0                     (dirs that actually exist in the archive/ folder)
      > valid: 0                     (dirs with a valid index matched to the main index and archived content)
      > invalid: 0                   (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 0               (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 0                (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 0            (dirs that don't contain recognizable archive data and aren't listed in the main index)

    Hint: You can list link data directories by status like so:
        archivebox list --status=<status>  (e.g. indexed, corrupted, archived, etc.)

[*] Scanning recent archive changes and user logins:
    /data/logs/* 
    UI users 0: 
    Last changes: 2024-12-26 07:45

@pirate commented on GitHub (Dec 26, 2024):

You can cancel the scan; it's not important, and things seem fine now. It looks like you briefly ran it with some of the data dir missing, and when it was run again with the full data dir it caught up.


@FeverGyorn commented on GitHub (Dec 27, 2024):

Small update: tags are still being changed unexpectedly. It's still a matter of re-adding tags I had already removed (going so far as to re-add a tag that doesn't even exist anymore). Tags that are added to items are NOT removed.

I've reviewed the logs and the compose definition. Doesn't the following config part re-fetch all links in the index? I can see this happening in the schedule.log.

    archivebox_scheduler:
        
        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911
            - TIMEOUT=120                       # use a higher timeout than the main container to give slow tasks more time when retrying
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=blablub
            # For other config it's better to set using `docker compose run archivebox config --set SOME_KEY=someval` instead of setting here
            # ...
            # For more info, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration
        volumes:
            - ./data:/data
        # cpus: 2                               # uncomment / edit these values to limit scheduler container resource consumption
        # mem_limit: 2048m
        # restart: always

I commented out the command line for the next night. This of course doesn't really explain why tags are being re-added. Messages about missing migrations (or anything similar) no longer appear.
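For anyone following along, the debugging step described above (disabling the scheduler's command) looks roughly like this in the compose file. This is a minimal sketch of the user's own workaround, not a recommended permanent config; note that with no `command:` set, the container falls back to the image's default command instead.

```yaml
archivebox_scheduler:
    image: archivebox/archivebox:latest
    # command: schedule --foreground --update --every=day
    # ^ disabled while debugging: with the command commented out, this
    #   container no longer triggers the nightly update run, so any tag
    #   changes that still happen must come from somewhere else.
    volumes:
        - ./data:/data
```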


@pirate commented on GitHub (Jan 3, 2025):

The scheduled daily update only fetches URLs that are missing or that have failed previously; it shouldn't touch any URLs that have already been downloaded successfully.

Even then, when it retries failed/incomplete snapshots it should never re-add tags to them; I'm not sure why it's doing that for you.

The missing migrations message is very suspicious, whatever container was producing that was likely the cause of the issue. A worker running out of sync with the schema in your DB could cause problems like this in theory, though it still doesn't fully explain why deleted tags would get re-added.

I'm still investigating this, but I may prioritize the more complete fix that's coming in v0.8.x: I'm migrating to a [new tagging system entirely](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/tags/models.py#L125).
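Since the thread never pins down where the tag comes back from, one hypothesis worth testing is that the deleted tag still survives in the per-snapshot archive/*/index.json files (the "JSON Link Details" the status output refers to) and gets re-imported by the nightly update. Here is a minimal stdlib scan to check for that; it assumes tags are stored as a comma-separated string under a "tags" key, and the helper name is hypothetical, not an ArchiveBox command.

```python
# Hypothetical diagnostic: scan each snapshot's archive/<timestamp>/index.json
# for a tag that keeps reappearing (e.g. "to-be-reviewed"). If the tag is
# still present in these JSON files after being deleted in the web UI, a
# scheduled update run could plausibly be re-importing it from there.
import json
from pathlib import Path

def snapshots_with_tag(archive_dir: str, tag: str) -> list[str]:
    """Return sorted snapshot dir names whose index.json mentions `tag`."""
    hits = []
    for index_file in Path(archive_dir).glob("*/index.json"):
        try:
            link = json.loads(index_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or corrupt entries
        # assumption: tags are a comma-separated string under a "tags" key
        tags = [t.strip() for t in (link.get("tags") or "").split(",")]
        if tag in tags:
            hits.append(index_file.parent.name)
    return sorted(hits)
```

If this reports the same ~330 snapshots that keep regaining the tag, the stale JSON indexes would be a strong suspect.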
