[GH-ISSUE #1605] Bug: large number of chromium-browser processes spawned by the application causes high memory consumption #2469

Closed
opened 2026-03-01 17:59:15 +03:00 by kerem · 2 comments

Originally created by @comatory on GitHub (Nov 28, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1605

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

My server is struggling because of a large number of chromium-browser processes. They cause the server to run out of memory and swap heavily; see the htop output:

Image (htop screenshot): https://github.com/user-attachments/assets/9be72e03-82ca-4849-98f8-c0b27065f2e5

There are a great many chromium-browser processes running and I'm not sure why. I was trying to import a Pinboard archive (see https://github.com/ArchiveBox/ArchiveBox/issues/1588), and maybe that somehow caused some jobs to stall.

All of these processes look like this (there are tens, maybe hundreds of them):

chromium-browser --allow-pre-commit-input --disable-background-networking --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-field-trial-config --disable-hang-monitor --disable-infobars --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-search-engine-choice-screen --disable-sync --enable-automation --export-tagged-pdf --force-color-profile=srgb --metrics-recording-only --no-first-run --password-store=basic --use-mock-keychain --disable-features=Translate,AcceptCHFrame,MediaRouter,OptimizationHints,ProcessPerSiteUpToMainFrameThreshold --enable-features=NetworkServiceInProcess2 --headless --hide-scrollbars --mute-audio about:blank --headless=new --no-sandbox --no-zygote --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --window-size=1440,2000 --autoplay-policy=no-user-gesture-required --no-first-run --use-fake-ui-for-media-stream --use-fake-device-for-media-stream --disable-sync --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/) --window-size=1440,2000 --disable-web-security --no-pings --window-size=1280,720 --remote-debugging-port=0 --user-data-dir=/tmp/puppeteer_dev_chrome_profile-XXXXXXQeqOCT

Tearing down the application with compose down does not help.
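To put a number on "tens, maybe hundreds", the process list can be summarized with a few lines of Python. This is only a rough sketch: the helper names `summarize_chromium` and `read_ps` are made up for illustration, and it assumes a Linux `ps` that accepts `-eo pid,rss,args` and reports RSS in KiB.

```python
import subprocess


def summarize_chromium(ps_output: str, needle: str = "chromium-browser") -> tuple[int, int]:
    """Return (process_count, total_rss_kib) for commands containing `needle`."""
    count = 0
    total_rss = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, rss, full command line
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header row and malformed lines
        if needle in parts[2]:
            count += 1
            total_rss += int(parts[1])
    return count, total_rss


def read_ps() -> str:
    """Capture `ps` output for all processes (assumes procps-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `summarize_chromium(read_ps())` on the host returns the number of matching processes and their combined resident memory, which makes it easier to tell whether the count keeps growing between runs.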

Steps to reproduce

1. download the official `docker-compose.yml`
2. `docker compose up -d`

Hard to tell what caused it though :(

Logs or errors

The request's session was deleted before the request completed. The user may have logged out in a concurrent request, for example.

Exception in archive_methods.save_htmltotext(Link(url=https://joshcollinsworth.com/blog/antiquated-react)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2024-11-27__07:06:38
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=http://www.vimregex.com/)) command=/usr/local/bin/archivebox update; ts=2024-11-27__07:13:22
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://rominirani.com/docker-tutorial-series-a7e6ff90a023)) command=/usr/local/bin/archivebox update; ts=2024-11-27__07:13:30
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://solid.mit.edu/)) command=/usr/local/bin/archivebox update; ts=2024-11-27__07:45:48
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.codementor.io/mayowa.a/how-to-build-a-simple-session-based-authentication-system-with-nodejs-from-scratch-6vn67mcy3)) command=/usr/local/bin/archivebox update; ts=2024-11-27__07:56:18
cannot access local variable 'cmd' where it is not associated with a value
You're accessing the development server over HTTPS, but it only supports HTTP.

You're accessing the development server over HTTPS, but it only supports HTTP.


Exception in archive_methods.save_htmltotext(Link(url=http://cb.vu/unixtoolbox.xhtml)) command=/usr/local/bin/archivebox update; ts=2024-11-27__08:32:21
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=http://www.faqs.org/docs/artu/index.html)) command=/usr/local/bin/archivebox update; ts=2024-11-27__08:52:48
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://blog.svpino.com/2015/05/07/five-programming-problems-every-software-engineer-should-be-able-to-solve-in-less-than-1-hour)) command=/usr/local/bin/archivebox update; ts=2024-11-27__08:53:04
cannot access local variable 'cmd' where it is not associated with a value
"PRI * HTTP/2.0" 505 -
"PRI * HTTP/2.0" 505 -
"PRI * HTTP/2.0" 505 -
"PRI * HTTP/2.0" 505 -

> /usr/local/bin/archivebox update; TS=2024-11-28__00:00:26 VERSION=0.7.2 IN_DOCKER=True IS_TTY=False
You're accessing the development server over HTTPS, but it only supports HTTP.
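The repeated "cannot access local variable 'cmd'" line is the text Python 3.11+ uses for an `UnboundLocalError`. The sketch below shows one common way such an error arises: a variable is assigned inside a `try` block and then referenced in the handler after the assignment failed. It is a hypothetical illustration, not ArchiveBox's actual `save_htmltotext` code; `run_extractor` and `failing_builder` are invented names.

```python
def run_extractor(build_command):
    """Hypothetical extractor wrapper; NOT ArchiveBox's real implementation."""
    try:
        cmd = build_command()  # if this raises, `cmd` is never bound
        return f"ran {cmd}"
    except Exception as err:
        # Referencing `cmd` here raises UnboundLocalError whenever the
        # failure happened before the assignment above completed.
        return f"failed {cmd}: {err}"


def failing_builder():
    """Stand-in for a step that fails before a command can be built."""
    raise RuntimeError("dependency missing")


message = ""
try:
    run_extractor(failing_builder)
except UnboundLocalError as err:
    message = str(err)

print(message)
```

If the real cause is similar, the original exception is being masked by the error handler, so the log line says little about why the extractor failed in the first place.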

ArchiveBox Version

ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:47:02 1713998822
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-125-generic-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                             
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py           
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox                             

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                                         
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                         
 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node                                         
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file         
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js            
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git                                          
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /usr/local/bin/yt-dlp                                 
 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser                             
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                       
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                             
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                  

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                  
 -  COOKIES_FILE          -               disabled  None                                                  

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                 
 √  SOURCES_DIR           19 files        valid     ./sources                                             
 √  LOGS_DIR              2 files         valid     ./logs                                                
 √  ARCHIVE_DIR           364 files       valid     ./archive                                             
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                     
 √  SQL_INDEX             13.6 MB         valid     ./index.sqlite3

How did you install the version of ArchiveBox you are using?

Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

  • [x] data/ is on a local SSD or NVMe drive
  • [ ] data/ is on a spinning hard drive or external USB drive
  • [ ] data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  • [ ] data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration

# Usage:
#     curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
#     docker compose up
#     docker compose run archivebox version
#     echo 'https://example.com' | docker compose run -T archivebox add
#     docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
#     docker compose run archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
#     docker compose run archivebox status
#     docker compose run archivebox help
# Documentation:
#     https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#docker-compose

services:
    archivebox:
        image: archivebox/archivebox:latest
        ports:
            - 8000:8000
        volumes:
            - ./data:/data
            # ./data/personas/Default/chrome_profile/Default:/data/personas/Default/chrome_profile/Default
        environment:
            # - ADMIN_USERNAME=admin            # create an admin user on first run with the given user/pass combo
            # - ADMIN_PASSWORD=SomeSecretPassword
            - CSRF_TRUSTED_ORIGINS=https://archivebox.example.com  # REQUIRED for auth, REST API, etc. to work
            - ALLOWED_HOSTS=*                   # set this to the hostname(s) from your CSRF_TRUSTED_ORIGINS
            - PUBLIC_INDEX=False                 # set to False to prevent anonymous users from viewing snapshot list
            - PUBLIC_SNAPSHOTS=False            # set to False to prevent anonymous users from viewing snapshot content
            - PUBLIC_ADD_VIEW=False             # set to True to allow anonymous users to submit new URLs to archive
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=SomeSecretPassword
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911                        # UID/GIDs <500 may clash with existing users and are not recommended
            # - MEDIA_MAX_SIZE=750m             # increase this filesize limit to allow archiving larger audio/video files
            # - TIMEOUT=60                      # increase this number to 120+ seconds if you see many slow downloads timing out
            # - CHECK_SSL_VALIDITY=True         # set to False to disable strict SSL checking (allows saving URLs w/ broken certs)
            # - SAVE_ARCHIVE_DOT_ORG=True       # set to False to disable submitting all URLs to Archive.org when archiving
            # - USER_AGENT="..."                # set a custom USER_AGENT to avoid being blocked as a bot
            # ...
            # add further configuration options from archivebox/config.py as needed (to apply them only to this container)
            # or set using `docker compose run archivebox config --set SOME_KEY=someval` (to persist config across all containers)
        # For ad-blocking during archiving, uncomment this section and pihole service section below
        # networks:
        #   - dns
        # dns:
        #   - 172.20.0.53
        restart: always


    ######## Optional Addons: tweak examples below as needed for your specific use case ########

    ### This optional container runs any scheduled tasks in the background, add new tasks like so:
    #   $ docker compose run archivebox schedule --add --every=day --depth=1 'https://example.com/some/rss/feed.xml'
    # then restart the scheduler container to apply any changes to the scheduled task list:
    #   $ docker compose restart archivebox_scheduler
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving

    archivebox_scheduler:

        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            - TIMEOUT=120                       # use a higher timeout than the main container to give slow tasks more time when retrying
            # - PUID=502                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=20
        volumes:
            - ./data:/data
        # cpus: 2                               # uncomment / edit these values to limit scheduler container resource consumption
        # mem_limit: 2048m
        restart: always


    ### This runs the optional Sonic full-text search backend (much faster than default rg backend).
    # If Sonic is ever started after not running for a while, update its full-text index by running:
    #   $ docker-compose run archivebox update --index-only
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Search

    sonic:
        image: valeriansaliou/sonic:latest
        build:
            # custom build just auto-downloads archivebox's default sonic.cfg as a convenience
            # not needed after first run / if you already have ./etc/sonic.cfg present
            dockerfile_inline: |
                FROM quay.io/curl/curl:latest AS config_downloader
                RUN curl -fsSL 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/stable/etc/sonic.cfg' > /tmp/sonic.cfg
                FROM valeriansaliou/sonic:latest
                COPY --from=config_downloader /tmp/sonic.cfg /etc/sonic.cfg
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=SomeSecretPassword
        volumes:
            #- ./sonic.cfg:/etc/sonic.cfg:ro    # use this if you prefer to download the config on the host and mount it manually
            - ./data/sonic:/var/lib/sonic/store
        restart: always


    ### This container runs xvfb+noVNC so you can watch the ArchiveBox browser as it archives things,
    # or remote control it to set up a chrome profile w/ login credentials for sites you want to archive.
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#docker-vnc-setup

    novnc:
        image: theasp/novnc:latest
        environment:
            - DISPLAY_WIDTH=1920
            - DISPLAY_HEIGHT=1080
            - RUN_XTERM=no
        ports:
            # to view/control ArchiveBox's browser, visit: http://127.0.0.1:8080/vnc.html
            # restricted to access from localhost by default because it has no authentication
            - 127.0.0.1:8080:8080
        restart: always


    ### Example: Put Nginx in front of the ArchiveBox server for SSL termination and static file serving.
    # You can also use any other ingress provider for SSL like Apache, Caddy, Traefik, Cloudflare Tunnels, etc.

    # nginx:
    #     image: nginx:alpine
    #     ports:
    #         - 443:443
    #         - 80:80
    #     volumes:
    #         - ./etc/nginx.conf:/etc/nginx/nginx.conf
    #         - ./data:/var/www


    ### Example: To run pihole in order to block ad/tracker requests during archiving,
    # uncomment this block and set up pihole using its admin interface

    # pihole:
    #   image: pihole/pihole:latest
    #   ports:
    #     # access the admin HTTP interface on http://localhost:8090
    #     - 127.0.0.1:8090:80
    #   environment:
    #     - WEBPASSWORD=SET_THIS_TO_SOME_SECRET_PASSWORD_FOR_ADMIN_DASHBOARD
    #     - DNSMASQ_LISTENING=all
    #   dns:
    #     - 127.0.0.1
    #     - 1.1.1.1
    #   networks:
    #     dns:
    #       ipv4_address: 172.20.0.53
    #   volumes:
    #     - ./etc/pihole:/etc/pihole
    #     - ./etc/dnsmasq:/etc/dnsmasq.d


    ### Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel to avoid IP blocks.
    # You can also use any other VPN that works at the docker IP level, e.g. Tailscale, OpenVPN, etc.

    # wireguard:
    #   image: linuxserver/wireguard:latest
    #   network_mode: 'service:archivebox'
    #   cap_add:
    #     - NET_ADMIN
    #     - SYS_MODULE
    #   sysctls:
    #     - net.ipv4.conf.all.rp_filter=2
    #     - net.ipv4.conf.all.src_valid_mark=1
    #   volumes:
    #     - /lib/modules:/lib/modules
    #     - ./wireguard.conf:/config/wg0.conf:ro

    ### Example: Run ChangeDetection.io to watch for changes to websites, then trigger ArchiveBox to archive them
    # Documentation: https://github.com/dgtlmoon/changedetection.io
    # More info: https://github.com/dgtlmoon/changedetection.io/blob/master/docker-compose.yml

    # changedetection:
    #     image: ghcr.io/dgtlmoon/changedetection.io
    #     volumes:
    #         - ./data-changedetection:/datastore


    ### Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox

    # pywb:
    #     image: webrecorder/pywb:latest
    #     entrypoint: /bin/sh -c '(wb-manager init default || test $$? -eq 2) && wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback;'
    #     environment:
    #         - INIT_COLLECTION=archivebox
    #     ports:
    #         - 8686:8080
    #     volumes:
    #         - ./data:/archivebox
    #         - ./data/wayback:/webarchive


networks:
    # network just used for pihole container to offer :53 dns resolving on fixed ip for archivebox container
    dns:
        ipam:
            driver: default
            config:
                - subnet: 172.20.0.0/24


# HOW TO: Set up cloud storage for your ./data/archive (e.g. Amazon S3, Backblaze B2, Google Drive, OneDrive, SFTP, etc.)
#   https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage
#
#   Follow the steps here to set up the Docker RClone Plugin https://rclone.org/docker/
#     $ docker plugin install rclone/docker-volume-rclone:amd64 --grant-all-permissions --alias rclone
#     $ nano /var/lib/docker-plugins/rclone/config/rclone.conf
#     [examplegdrive]
#     type = drive
#     scope = drive
#     drive_id = 1234567...
#     root_folder_id = 0Abcd...
#     token = {"access_token":...}

# volumes:
#     archive:
#         driver: rclone
#         driver_opts:
#             remote: 'examplegdrive:archivebox'
#             allow_other: 'true'
#             vfs_cache_mode: full
#             poll_interval: 0

ArchiveBox Configuration


kerem closed this issue 2026-03-01 17:59:15 +03:00

@pirate commented on GitHub (Nov 28, 2024):

Duplicate of: https://github.com/ArchiveBox/ArchiveBox/issues/746

(It's an old, well-known issue with a fix in progress.)


@comatory commented on GitHub (Nov 28, 2024):

Cool 👍 I'll wait for the new release then. Strange, I'd have expected restarting the containers to help, but it doesn't.
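
For anyone hitting the same thing, the leftover processes can be counted from the host to confirm how much memory they hold. This is only a rough sketch, assuming a Linux host with `ps` and `awk` available; the `summarize` helper is a made-up name for illustration and is not part of ArchiveBox:

```shell
# Hypothetical helper (not part of ArchiveBox): reads "<rss in KiB> <command line>"
# lines on stdin and reports how many mention chromium-browser, plus their
# combined resident memory. The [r] keeps awk's own command line from matching.
summarize() {
    awk '/chromium-browse[r]/ { n++; kb += $1 }
         END { printf "%d processes, %d MiB RSS\n", n, kb / 1024 }'
}

# List every process as "<rss in KiB> <full command line>" and summarize it.
ps -e -o rss= -o args= | summarize
```

RSS counts shared pages once per process, so the total should be read as an upper bound rather than an exact figure.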
