[GH-ISSUE #550] Bugfix: Long-running update ends up with orphan Chromium processes #351

Closed
opened 2026-03-01 14:42:46 +03:00 by kerem · 2 comments
Owner

Originally created by @mAAdhaTTah on GitHub (Nov 25, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/550

Describe the bug

I imported my Pocket library, totaling 27k+ links, and have been archiving those links on and off for a week. I went away for a few days and figured I'd let the process run on my server. When I returned, I found the RAM on the box (16 GB) completely maxed out and dozens of stray Chromium processes still running. ArchiveBox was running in Docker, so I was able to kill the container and reclaim the RAM rather than having to kill each process individually.

My theory is that the timeout doesn't kill the underlying process properly, so it just stays open, but I'm not 100% sure.
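If that theory holds, the standard POSIX fix is to start the browser in its own process group and, on timeout, kill the whole group rather than just the direct child. The sketch below illustrates that pattern only; it is not ArchiveBox's actual implementation, and the commands and timeout values are illustrative:

```python
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout):
    """Run cmd, killing its entire process tree if it exceeds timeout."""
    # start_new_session=True puts the child in a fresh process group,
    # so its own children (e.g. Chromium renderer processes) share it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # A plain proc.kill() would leave grandchildren orphaned;
        # killing the group takes the whole tree down.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the killed child
        return None

# A shell that spawns a grandchild: the group kill reaches both.
status = run_with_timeout(["sh", "-c", "sleep 30 & wait"], timeout=1)
# status is None: the timeout fired and the whole group was killed
```

Without `start_new_session=True` and `os.killpg`, only the direct child (`sh` here) dies on timeout, while the backgrounded `sleep` keeps running, which is exactly the stray-process symptom described above.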

Steps to reproduce

  1. Create large ArchiveBox db.
  2. Set a low timeout for archiving.
  3. Run archivebox update.
  4. Wait a while.
  5. Watch for stray Chromium processes.

Screenshots or log output

I can pull some logs if needed.

Software versions

  • OS: Ubuntu
  • ArchiveBox version: archivebox/archivebox:latest
  • Python version: 3.8 (whatever's in the Dockerfile)
  • Chrome version: Not sure (same as above, whatever's in the Dockerfile)
Author
Owner

@berezovskyi commented on GitHub (Feb 6, 2021):

This does not solve the problem but here is a workaround that I have developed on my system. I have a crontab entry to run this a few times at night:

#!/usr/bin/env bash
set -euxo pipefail

LOG=/home/driib/var/log/archivebox.log
LOG_PROGRESS=/home/driib/var/log/archivebox-update.log
REPEAT=10

touch "$LOG"

# Note: {1..$REPEAT} does not expand variables in bash, so use seq instead.
for n in $(seq 1 "$REPEAT"); do
    # Restart the container to reap any orphaned Chromium processes.
    docker restart archivebox
    sleep 10

    # Resume from the last snapshot timestamp recorded in the log.
    RESUME_ID=$( tail -n 1 "$LOG" )
    echo "[$(date -Iseconds)] Restarting from $RESUME_ID" >> "$LOG_PROGRESS"
    RESUME="--resume $RESUME_ID"
    if [[ -z "$RESUME_ID" ]]; then
        RESUME=""
    fi

    # No -t flag: cron has no TTY. Capture the exit status explicitly,
    # since `set -e -o pipefail` would otherwise abort the script here.
    STATUS=0
    docker exec -i -u archivebox archivebox archivebox update $RESUME \
        | grep -oP '\d{10}\.\d{6}' | sed -e 's/^[ \t]*//' >> "$LOG" || STATUS=$?

    # Stop retrying if the update was killed by a signal (exit code > 128).
    test "$STATUS" -gt 128 && break
done

If you want to update the whole archive, delete the $LOG file.

You should also enable swap limit support: https://docs.docker.com/engine/install/linux-postinstall/#your-kernel-does-not-support-cgroup-swap-limit-capabilities and set CPU/RAM limits.
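One concrete way to apply such limits, if the container is run via docker-compose, is shown below (a sketch only; field names follow the Compose specification, and the values are illustrative, not recommendations):

```yaml
services:
  archivebox:
    image: archivebox/archivebox:latest
    mem_limit: 4g        # hard RAM cap for the container
    memswap_limit: 4g    # equal to mem_limit, so no additional swap
    cpus: 2              # number of CPU cores the container may use
```

With a hard memory cap in place, runaway Chromium processes get OOM-killed inside the container instead of exhausting the host's RAM.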

Author
Owner

@pirate commented on GitHub (Apr 6, 2021):

I think I fixed this in e7c7a8f. Comment back here if you're still seeing issues with orphan child processes after v0.6 is released and I'll reopen the issue.
