[GH-ISSUE #1651] Bug: Archivebox will expand to crash host machine when a lot of crawls are scheduled #2498

Closed
opened 2026-03-01 17:59:27 +03:00 by kerem · 1 comment

Originally created by @AramZS on GitHub (Feb 4, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1651

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

Image: https://github.com/user-attachments/assets/38cc1409-9f32-42de-9c98-03602e22afef

I may be running ArchiveBox on a machine a little below what is expected (an RPi 3), but it regularly consumes all available RAM until it crashes the host machine. This seems undesirable. I can't find anything about it in the documentation, but I think a good solution would be a Docker Compose setting, for anyone running the self-hosted browser-facing version of the application, that limits the number of Chrome instances the system can launch and keep running simultaneously, so that it stays under the memory limit. 10 gigs of memory is not huge, but it isn't nothing either, and being able to avoid taking down a machine with that much memory available seems like a desirable feature.

I'd be glad to try my hand at this if you can point me in the right direction. I'm just getting familiar with the codebase, but I would like to contribute if this seems like a good place to start. Also, since I'm just starting, maybe I just screwed up the config somewhere.
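
One way to picture the cap on simultaneous Chrome instances requested above is a counting semaphore wrapped around each browser launch. The following is a minimal Python sketch, not ArchiveBox's actual code: `MAX_BROWSERS`, `ConcurrencyLimiter`, `fake_browser_job`, and `archive_all` are hypothetical names, and a short sleep stands in for a real Chrome run.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_BROWSERS = 2  # hypothetical cap; not an existing ArchiveBox setting


class ConcurrencyLimiter:
    """Lets at most `limit` jobs run at once and records the peak observed."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run(self, job, *args):
        # Block here until one of the `limit` slots is free.
        with self._sem:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_browser_job(url):
    # Stand-in for launching a headless browser against `url`.
    time.sleep(0.05)
    return f"archived {url}"


def archive_all(urls, limit=MAX_BROWSERS, workers=8):
    # Many worker threads may be queued, but only `limit` jobs run concurrently.
    limiter = ConcurrencyLimiter(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda u: limiter.run(fake_browser_job, u), urls))
    return results, limiter.peak
```

With a cap like this, a depth-1 crawl that queues dozens of links would still run only `limit` browser jobs at a time; the rest wait for a slot instead of adding to memory pressure.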

Steps to reproduce

1. `sudo docker compose up --remove-orphans` 
2. Accessed the system through its web interface 
3. Crawled this URL with depth set to 1: https://www.todayintabs.com/p/the-that-funny-feeling-coup (but I have managed to cause this crash pretty regularly). 
4. I watched for about 1.5 hours as the number of Chrome clients increased while it continued to run archiving operations, until memory and swap both spiked to their maximum. It ran for a little while in that state and then the machine crashed.

Logs or errors

It looks like the most recent run crashed right after Mercury timed out.

Here are the last messages in the log:


Exception in archive_methods.save_htmltotext(Link(url=https://www.fec.gov/help-candidates-and-committees/registering-candidate/testing-the-waters-possible-candidacy/)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:13:43
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.gq.com/story/game-of-thrones-donald-trump-theory)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:18:50
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.jezebel.com/what-the-fuck-are-democrats-doing)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:24:29
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.newyorker.com/news/daily-comment/the-world-according-to-elon-musks-grandfather)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:31:11
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.patreon.com/posts/121361763?utm_campaign=postshare_fan)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:37:35
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.politico.com/news/2025/02/03/bessent-musk-doge-treasury-payments-00202278)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:44:05
cannot access local variable 'cmd' where it is not associated with a value

Exception in archive_methods.save_htmltotext(Link(url=https://www.thedailybeast.com/the-doge-musketeers-the-secret-team-elon-wants-to-keep-in-the-shadows/)) command=/usr/local/bin/archivebox server --quick-init 0.0.0.0:8000; ts=2025-02-04__21:52:06
cannot access local variable 'cmd' where it is not associated with a value
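
The repeated message above is the wording Python 3.11 uses for an `UnboundLocalError`: a local variable is read on a code path where it was never assigned. The sketch below shows one common way this arises and one way to guard against it. It is an illustration under assumptions, not the actual `save_htmltotext` code; `save_buggy`, `save_fixed`, and the command list are hypothetical.

```python
def save_buggy(fail_early):
    try:
        if fail_early:
            # Something fails before `cmd` is ever assigned.
            raise TimeoutError("earlier step timed out")
        cmd = ["hypothetical-extractor", "--flag"]
        return "ok"
    except Exception:
        # Reading `cmd` here raises UnboundLocalError when the failure
        # happened before the assignment above.
        return f"failed while running {cmd}"


def save_fixed(fail_early):
    cmd = None  # give the name a value on every path before the try block
    try:
        if fail_early:
            raise TimeoutError("earlier step timed out")
        cmd = ["hypothetical-extractor", "--flag"]
        return "ok"
    except Exception:
        if cmd is None:
            return "failed before a command was built"
        return f"failed while running {cmd}"
```

In the buggy variant, the secondary `UnboundLocalError` replaces the original failure in the log, which may be why each entry shows only the `cmd` message rather than the underlying error.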

ArchiveBox Version

0.7.3
ArchiveBox v0.7.3 COMMIT_HASH=069aabc BUILD_TIME=2024-12-15 09:54:01 1734256441
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.6.51+rpt-rpi-v8-aarch64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.11        valid     /usr/local/bin/python3.11
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.3          valid     /usr/local/bin/archivebox

 √  CURL_BINARY           v8.10.1         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget
 √  NODE_BINARY           v20.18.1        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js
 √  GIT_BINARY            v2.39.5         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2024.12.13     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v131.0.6778.33  valid     /usr/bin/chromium-browser
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None

[i] Data locations:
 √  OUTPUT_DIR            8 files @       valid     /data
 √  SOURCES_DIR           105 files       valid     ./sources
 √  LOGS_DIR              2 files         valid     ./logs
 √  ARCHIVE_DIR           384 files       valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             6.8 MB          valid     ./index.sqlite3

How did you install the version of ArchiveBox you are using?

Docker (or Podman/LXC/K8s/TrueNAS/Proxmox/etc)

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

  • [x] some of data/ is on a local SSD or NVMe drive
  • [ ] some of data/ is on a spinning hard drive or external USB drive
  • [ ] some of data/ is on a network mount (e.g. NFS/SMB/Ceph/GlusterFS/etc.)
  • [ ] some of data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/Google Drive/Dropbox/etc.)

Docker Compose Configuration

services:
    archivebox:
        image: archivebox/archivebox:latest
        ports:
            - 8000:8000
        volumes:
            - ./data:/data
            # ./data/personas/Default/chrome_profile/Default:/data/personas/Default/chrome_profile/Default
        environment:
            # - ADMIN_USERNAME=admin            # creates an admin user on first run with the given user/pass combo
            # - ADMIN_PASSWORD=SomeSecretPassword
            - CSRF_TRUSTED_ORIGINS=https://archivebox.example.com  # REQUIRED for auth, REST API, etc. to work
            - ALLOWED_HOSTS=*                   # set this to the hostname(s) from your CSRF_TRUSTED_ORIGINS
            - PUBLIC_INDEX=True                 # set to False to prevent anonymous users from viewing snapshot list
            - PUBLIC_SNAPSHOTS=True             # set to False to prevent anonymous users from viewing snapshot content
            - PUBLIC_ADD_VIEW=False             # set to True to allow anonymous users to submit new URLs to archive
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=[password redacted]

    archivebox_scheduler:

        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911
            - TIMEOUT=120                       # use a higher timeout than the main container to give slow tasks more time when retrying
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=[password redacted]
            # For other config it's better to set using `docker compose run archivebox config --set SOME_KEY=someval` instead of setting here
            # ...
            # For more info, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration
        volumes:
            - ./data:/data

    sonic:
        image: archivebox/sonic:latest
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=[password redacted]
        volumes:
            #- ./sonic.cfg:/etc/sonic.cfg:ro    # mount to customize: https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/stable/etc/sonic.cfg
            - ./data/sonic:/var/lib/sonic/store

ArchiveBox Configuration


kerem closed this issue 2026-03-01 17:59:28 +03:00

@pirate commented on GitHub (Feb 5, 2025):

Thanks for reporting. This is actually a well-known issue; please subscribe over there for updates: https://github.com/ArchiveBox/ArchiveBox/issues/746
