[GH-ISSUE #1548] Bug: Archivebox-Scheduler running full-bore when no tasks scheduled #3940

Open
opened 2026-03-15 01:03:27 +03:00 by kerem · 1 comment

Originally created by @JPeroutek on GitHub (Oct 18, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1548

Describe the bug

While running ArchiveBox via docker-compose, after a few hours archivebox_scheduler fully saturates multiple cores on my system, even though it shows no tasks scheduled.

Steps to reproduce

Run ArchiveBox via the docker-compose file provided on the homepage.

Screenshots or log output

Docker status of the archivebox_scheduler container

results of archivebox schedule --show

ArchiveBox version

0.7.2
ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:47:02 1713998822
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11

 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py

 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py

 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox


 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl

 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget

 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node

 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file

 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor

 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js

 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git

 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /usr/local/bin/yt-dlp

 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser

 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg


[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox

 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates

 -  CUSTOM_TEMPLATES_DIR  -               disabled  None


[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None

 -  COOKIES_FILE          -               disabled  None


[i] Data locations:
 √  OUTPUT_DIR            6 files @       valid     /data

 √  SOURCES_DIR           116 files       valid     ./sources

 √  LOGS_DIR              2 files         valid     ./logs

 √  ARCHIVE_DIR           969 files       valid     ./archive

 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf

 √  SQL_INDEX             10.8 MB         valid     ./index.sqlite3

@pirate commented on GitHub (Oct 21, 2024):

Older Chrome versions would sometimes fail to end their child processes after a snapshot, eventually leading to high CPU/memory usage from all the zombie processes.

If you're willing to try a BETA this should be improved in the latest releases.

Back up your archive first and give archivebox/archivebox:dev a shot.

Otherwise, as a workaround, you can set up a cronjob to restart the scheduler container every 24hr; you can remove it when you upgrade to the next stable release (v0.9.x).
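
The cronjob workaround could look roughly like the crontab entry below, added on the Docker host. This is only a sketch: the service name `archivebox_scheduler` is taken from the container named in this issue, while the project path `/opt/archivebox` and the 04:00 run time are placeholders to adapt to your own setup.

```crontab
# Hypothetical host crontab entry (edit with `crontab -e`).
# Restarts the scheduler service once a day at 04:00, which also ends any
# leftover Chrome child processes running inside that container.
# /opt/archivebox is an assumed location of your docker-compose.yml.
0 4 * * * cd /opt/archivebox && docker compose restart archivebox_scheduler
```

If your host only has the older standalone binary, `docker-compose restart archivebox_scheduler` should behave the same way. The user owning the crontab needs permission to talk to the Docker daemon.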
