[GH-ISSUE #746] singlefile and other Chrome extractors leave behind zombie orphan chromium processes that never exit #1977

Closed
opened 2026-03-01 17:55:30 +03:00 by kerem · 44 comments
Owner

Originally created by @ghost on GitHub (May 13, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/746

Originally assigned to: @pirate on GitHub.

I notice a huge resource allocation hole on Ubuntu Server 18.04 w/ docker-compose... the system never seems to go back to idle unless I reboot... up to 4.00+ system load after archiving has stopped.


@pirate commented on GitHub (May 13, 2021):

Please post the full output of `archivebox --version`.

I suspect it's the Chrome orphan child problem we've seen before: https://github.com/ArchiveBox/ArchiveBox/issues/550 (Chrome is naughty about forking child processes that it doesn't clean up on exit).

The fix is usually just to upgrade your ArchiveBox + Chrome version, but on some setups the issue persists and needs to be fixed by adding the `--no-zygote` and `--single-process` flags.
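For readers wanting to verify whether these flags are actually in effect, one option is to inspect the running process's command line. A minimal sketch (using a hard-coded sample command line rather than a live `/proc/<pid>/cmdline`, which is NUL-separated and would need `tr '\0' ' '` first):

```shell
# Sketch: check whether a Chrome command line includes the orphan-process fix flags.
# The sample cmdline below is illustrative; on a live system, read /proc/<pid>/cmdline instead.
cmdline='/usr/lib/chromium/chromium --headless --no-sandbox --single-process --no-zygote about:blank'
for flag in --no-zygote --single-process; do
  case " $cmdline " in
    *" $flag "*) echo "$flag present" ;;
    *)           echo "$flag MISSING" ;;
  esac
done
```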

<!-- gh-comment-id:840387066 -->

@ghost commented on GitHub (May 13, 2021):

```
$ docker-compose run archivebox --version
ArchiveBox v0.6.2
Cpython Linux Linux-4.15.0-142-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data
 √  SOURCES_DIR           15 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           8 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             348.0 KB        valid     ./index.sqlite3
```
<!-- gh-comment-id:840389204 -->

@pirate commented on GitHub (May 13, 2021):

Hmm, it's the latest version, which should have the fix. Can you check whether the high resource use is by chromium or a different process, using `htop`? (Sort by CPU / try viewing in tree mode to see which procs in the container are using the most CPU/mem.)

<!-- gh-comment-id:840400885 -->

@ghost commented on GitHub (May 13, 2021):

It does appear to be chromium; here's a screencap of htop. (This is after restarting the system and running a couple of archives. The system load seems to be OK-ish, but there are a lot of chromium processes listed?)

![htop-archivebox](https://user-images.githubusercontent.com/79405840/118100227-36c7bc00-b3a4-11eb-9905-f8a962266632.png)

here is sorted w/ tree: (edit, actually uploaded the right cap)

![htop-archivebox2](https://user-images.githubusercontent.com/79405840/118100649-b05faa00-b3a4-11eb-8421-5e9ba27a179e.png)

<!-- gh-comment-id:840408874 -->

@pirate commented on GitHub (May 13, 2021):

Yeah ok, this is an annoying known bug we've seen a few times with Chromium.

Can you try the fix I just put on dev by building a new docker container and running from that:

```bash
docker build -t archivebox:dev 'https://github.com/ArchiveBox/ArchiveBox.git#dev'
```

Then update your docker-compose.yml to use archivebox:dev:

```yaml
services:
    archivebox:
        image: 'archivebox/archivebox:dev'
        ...
```

```bash
docker-compose down
docker-compose up
```

Let me know if that fixes it or not.
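A side note, not part of the fix above: if leftover processes show up as `<defunct>` zombies, Docker's built-in init process (tini) can reap them. Enabling it is a standard one-line Compose change; whether it helps here depends on whether the orphans are dead zombies awaiting reaping (which init fixes) or still-running processes (which it cannot stop):

```yaml
services:
    archivebox:
        image: 'archivebox/archivebox:dev'
        # hypothetical addition, not part of the suggested fix above:
        # run an init process as PID 1 so exited children get reaped
        init: true
```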

<!-- gh-comment-id:840426913 -->

@ghost commented on GitHub (May 13, 2021):

Will update -- thank you for your help, much appreciated

<!-- gh-comment-id:840427841 -->

@ghost commented on GitHub (May 13, 2021):

Update: chromium no longer causing problems. All is well. Thanks again!

edit: lengthy tasks still cause this bug

<!-- gh-comment-id:840533084 -->

@ghost commented on GitHub (May 13, 2021):

It seems to happen on archive tasks that take a while to complete. Performance has improved noticeably, though, since the last change.

<!-- gh-comment-id:840566587 -->

@pirate commented on GitHub (May 13, 2021):

Can you screenshot htop again, but scroll over to the right a bit more to see the full chrome args? I think that'll tell us which extractor is running chrome.

I suspect SingleFile is the odd one out not using our `--no-zygote` fix, because it handles launching its own copy of Chrome.

As a test, you can try temporarily disabling the SingleFile extractor with `archivebox config --set SAVE_SINGLEFILE=False`.
If that stops it, then we know it's SingleFile's Chrome. Then I can look into changing the SingleFile chrome args by getting them into `singlefile --browser-executable-path=...`.

<!-- gh-comment-id:840724117 -->

@ghost commented on GitHub (May 13, 2021):

The full args list is huge, I don't have enough screens -- but I copied the line for the chromium PID:

```
13484 caddy 20 0 37.6G 226M 157M S 0.0 11.4 0:00.00 /usr/lib/chromium/chromium --show-component-extension-options --enable-gpu-rasterization --no-default-browser-check --disable-pings --media-router=0 --enable-remote-extensions --load-extension= --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=TranslateUI --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --enable-blink-features=IdleDetection --headless --hide-scrollbars --mute-audio about:blank --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --single-process --no-zygote --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) --window-size=1440,2000 --disable-web-security --no-pings --window-size=1280,720 --remote-debugging-port=0 --user-data-dir=/tmp/puppeteer_dev_chrome_profile-T7r5l7
```

<!-- gh-comment-id:840732800 -->

@ghost commented on GitHub (May 13, 2021):

I threw 6 URLs at it and so far it seems to be working without leaving chromium hanging (about 10 mins now OK)

edit: yeah now it's chugging along nicely, performance jump as well.

<!-- gh-comment-id:840740729 -->

@pirate commented on GitHub (May 13, 2021):

Looks like you have `--single-process --no-zygote` in the args anyway, so my idea was wrong. That chrome proc shouldn't leave behind any orphan processes with `--single-process --no-zygote` (in theory).

<!-- gh-comment-id:840791850 -->

@axb21 commented on GitHub (May 14, 2021):

Just wanted to throw in here that I've observed the same thing. The load on the machine running archivebox jumped to very high levels, mostly caused by hundreds of chromium processes, after I submitted ~150 URLs to be archived.

```
me@machine:/path/archivebox$ docker-compose run archivebox --version
ArchiveBox v0.6.2
Cpython Linux Linux-4.15.0-142-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data
 √  SOURCES_DIR           64 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           373 files       valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             4.9 MB          valid     ./index.sqlite3
```
<!-- gh-comment-id:841310085 -->

@agg23 commented on GitHub (Jun 9, 2021):

I am also experiencing the same issue. I am automatically archiving pages as I visit them, and finally got around to checking on the slowdown of my Docker host. `ps aux | grep -c chromium` reports that there are 443 instances of Chromium open right now...

<!-- gh-comment-id:858038496 -->

@pirate commented on GitHub (Jun 11, 2021):

This problem will go away when we switch to using long-running playwright workers instead of spawning chrome headless 3x for each URL https://github.com/ArchiveBox/ArchiveBox/issues/51

In the meantime, to work around this Chrome bug we may have to add machinery to keep track of all forked subprocesses and kill them after every extractor, but that's a lot of additional time, code complexity, and hassle. Hopefully Chromium fixes it first, but if not I may just focus on switching to Playwright faster rather than write subprocess-killing code that will have to be torn out anyway.
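In a minimal form, such subprocess-killing machinery could amount to running each extractor in its own process group and signalling the whole group afterwards. A hedged sketch of the mechanism only (not ArchiveBox code; `sleep 300` stands in for an extractor that may fork children):

```shell
# Sketch of group-kill cleanup: run a command in its own session/process
# group via setsid, then signal the entire group when done, so any
# children it forked are terminated along with it.
setsid sleep 300 &                         # stand-in for an extractor process
pid=$!
sleep 0.2                                  # give setsid a moment to create the new group
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
kill -TERM -- "-$pgid"                     # negative pid targets the whole process group
wait "$pid" 2>/dev/null || true            # reap the terminated child
kill -0 "$pid" 2>/dev/null && echo "still alive" || echo "reaped"
```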

<!-- gh-comment-id:859200454 -->

@cdzombak commented on GitHub (Nov 16, 2021):

Ran into this last night trying to import my Pinboard archive of ~8000 links. When I woke up this morning, the machine was almost entirely out of RAM (8GB) & swap (16GB), and load averages were around 350-400.

A representative line from `ps`:

```
archive+ 23719  5.3  2.4 41847012 399996 ?     SLsl 01:34  22:33 /usr/lib/chromium/chromium --show-component-extension-options --enable-gpu-rasterization --no-default-browser-check --disable-pings --media-router=0 --enable-remote-extensions --load-extension= --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=TranslateUI --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --enable-blink-features=IdleDetection --headless --hide-scrollbars --mute-audio about:blank --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --single-process --no-zygote --user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15 --window-size=1440,2000 --disable-web-security --no-pings --window-size=1280,720 --remote-debugging-port=0 --user-data-dir=/tmp/puppeteer_dev_chrome_profile-IfuaOP
```

Running the `archivebox/dev` tag I pulled yesterday via docker-compose:

```
$ docker-compose run --rm archivebox version
ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-0.bpo.8-amd64-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.8          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /data
 √  SOURCES_DIR           5 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           681 files       valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             9.2 MB          valid     ./index.sqlite3
```
<!-- gh-comment-id:970302769 -->

@pirate commented on GitHub (Nov 16, 2021):

Ok, I can't promise a fix immediately but will take a look soon at why `--no-zygote` isn't working as it should.

<!-- gh-comment-id:970605254 -->

@cdzombak commented on GitHub (Nov 16, 2021):

> Ok, I can't promise a fix immediately …

Understood, this isn't a life-or-death issue for me 😄

<!-- gh-comment-id:970611760 -->

@salykin commented on GitHub (Nov 17, 2022):

Hello! Did anybody find anything?

<!-- gh-comment-id:1319166042 -->

@salykin commented on GitHub (Nov 17, 2022):

If anyone wants to work around the problem somehow, take a look at this bash script:

```bash
start_time() {
  hz=$(getconf CLK_TCK)
  uptime=$(awk '{print $1}' < /proc/uptime)
  starttime=$(awk '{print $22}' < /proc/$1/stat)
  echo $(( ${uptime%.*} - $starttime / $hz ))
}

kill_old_chrome_processes() {
  # the [/] bracket keeps grep from matching its own process in the ps output
  for pid in $(ps -ef | grep '[/]usr/lib/chromium/chromium' | awk '{print $2}') ; do
    st=$(start_time $pid);
    if (( st > $1 )); then
      kill $pid
      echo killed $pid
    fi
  done
}

PROCESS_LIVE_SECONDS_THRESHOLD=600
CLEANUP_SECONDS_INTERVAL=300

while true
do
  echo vvvvvvvvvvvvvvvvv
  kill_old_chrome_processes $PROCESS_LIVE_SECONDS_THRESHOLD
  echo ^^^^^^^^^^^^^^^^^
  echo waiting $CLEANUP_SECONDS_INTERVAL seconds
  echo
  sleep $CLEANUP_SECONDS_INTERVAL
done
```

The script loops over chromium processes every 5 minutes, looks for long-lived ones (older than 10 minutes), and kills them.
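For clarity, the `start_time` arithmetic converts field 22 of `/proc/<pid>/stat` (the process's start time, in clock ticks since boot) into an age in seconds. A worked example with assumed sample values:

```shell
# Worked example of the start_time calculation (values are assumed, not from a real system):
hz=100            # getconf CLK_TCK is typically 100 on Linux
uptime=5000       # integer part of /proc/uptime: seconds since boot
starttime=400000  # field 22 of /proc/<pid>/stat: clock ticks at process start
echo $(( uptime - starttime / hz ))   # age in seconds: 5000 - 4000 = 1000
```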

<!-- gh-comment-id:1319253170 -->

@pirate commented on GitHub (Jun 13, 2023):

Are y'all still seeing this issue? There have been many updates to chromium and archivebox/archivebox:dev since 2022/11, and I'm curious whether this is still a concern.

I personally haven't noticed major issues with zombie Chrome processes on our demo server or my personal ArchiveBox instance in the last few months, so let me know if you're seeing it on your machines still.

Author
Owner

@msalmasi commented on GitHub (Jul 19, 2023):

Hi @pirate

I seem to be having this issue on Chromium v114 and ArchiveBox 0.6.3 (dev branch), running in docker. It leaves behind a SingletonLock file in the user data profile that prevents private archiving from working, and ultimately corrupts the profile.

Author
Owner

@mAAdhaTTah commented on GitHub (Jul 19, 2023):

I can also confirm I'm still experiencing this and periodically restart the archivebox docker container to compensate.

Author
Owner

@gmsotavio commented on GitHub (Aug 25, 2023):

I can also confirm I'm still experiencing this and periodically restart the archivebox FreeBSD **jail** to compensate.

I have been using the dev branch.

> [85067:55623680:0825/101611.186656:ERROR:process_singleton_posix.cc(334)] Failed to create /mnt/archivebox/data/chromium/.config/chromium/SingletonLock: File exists (17)

![image](https://github.com/ArchiveBox/ArchiveBox/assets/10669626/fbad605c-d74b-420e-93b7-a3ce57d4d62c)
Author
Owner

@pirate commented on GitHub (Aug 25, 2023):

I'm currently working on a refactor to use a long-running scrapy-playwright based worker inside of a huey job queue system. It should solve this issue for good: even if chrome misbehaves, the worker can periodically restart on its own to clear out zombie processes and release leaked memory. It's complex, but it's looking like it might be a big upgrade for the project, maybe finally warranting an ArchiveBox 1.0 version.

Author
Owner

@sclu1034 commented on GitHub (Jun 14, 2024):

I seem to be running into this issue as well, but I haven't been able to check the process list for chrome, yet.

![image](https://github.com/ArchiveBox/ArchiveBox/assets/4508454/c570eb9f-d586-424c-8efd-1f696b26f23b)

The upward spikes coincide with importing large sets of URLs; the drops at "06/12 00:00" and at the very end of the chart are both manual restarts. The imports had long since finished by that time.

Author
Owner

@clb92 commented on GitHub (Jun 19, 2024):

I'm having this problem now: I noticed all 40 threads on my server pegged at 100%. Running top shows that it's a lot of Chrome processes belonging to ArchiveBox. This happens even though there aren't very many pending items in ArchiveBox. I basically have to restart the ArchiveBox container daily to clear the Chrome processes.

Author
Owner

@krosseyed commented on GitHub (Jul 1, 2024):

I think I am also running into this issue, as my search results landed me here. Please let me know if this should be a different ticket.

I recently turned on SingleFile support after ensuring that chromium works using docker-compose run version, and now I want to run docker-compose run archivebox update to get my existing (183 total) snapshots updated.

However, I can only get through the first 10 or so of my existing links before it starts to error out after 60 seconds whenever creating a singlefile or a pdf is attempted. I can go one at a time by doing something like archivebox update -t timestamp 1717948555.850022, and the little Raspberry Pi I am using doesn't fall over after doing about 20 or so.

I hope this sheds some light on the issue, and on whether what I am running into is the same bug.

Author
Owner

@srd424 commented on GitHub (Nov 2, 2024):

Still seeing this, or something very like it, with 0.7.2 (docker image). Would systemd-run help here? It's very good at dumping everything in its own cgroup, which makes ~~mercilessly killing~~ cleaning up orphans a bit easier...

Author
Owner

@jmeggitt commented on GitHub (Nov 9, 2024):

After some debugging, I found the source of the issue.

The orphaned Chrome processes are getting spawned by /app/node_modules/single-file-cli/single-file. Normally it properly cleans up its child processes when it finishes, but that does not happen if the node process is killed. Since ArchiveBox runs single-file via Python's subprocess.run API with a timeout, Python sends SIGKILL to the process when the timeout is reached.
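The orphaning can be reproduced in miniature with plain Python, no chrome or single-file needed. This is a sketch, with `sh` standing in for node and `sleep` standing in for the chrome children:

```python
import os
import signal
import subprocess

# subprocess.run()'s timeout SIGKILLs only the direct child (the shell here,
# node in ArchiveBox's case); the grandchild ("sleep", standing in for chrome)
# survives and is reparented, never to be cleaned up.
try:
    subprocess.run(
        ['sh', '-c', 'sleep 30 & echo $!; wait'],
        stdout=subprocess.PIPE, timeout=1,
    )
except subprocess.TimeoutExpired as exc:
    grandchild_pid = int(exc.stdout)  # PID of the backgrounded sleep

os.kill(grandchild_pid, 0)               # no error: the orphan is still alive
os.kill(grandchild_pid, signal.SIGTERM)  # clean it up by hand
```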

Reproduction Steps

1. Find a server with a long response time

First, you will need to find a page that will take long enough to render that you can kill it before it finishes. If you have trouble finding one, you can use this Python script to run a local HTTP server with an artificially long delay.

```python
#!/usr/bin/python3
from http.server import HTTPServer, BaseHTTPRequestHandler
from time import sleep

class LongRunningHandler(BaseHTTPRequestHandler):
    def handle_one_request(self):
        print("Received request")
        sleep(60)

server_address = ('', 5555)
server = HTTPServer(server_address, LongRunningHandler)
server.serve_forever()
```

2. Run single-file

Since I use docker compose for ArchiveBox, I used the following command to run single-file within my existing container.

```bash
$ sudo docker compose exec archivebox \
    /app/node_modules/single-file-cli/single-file \
    --browser-executable-path=chromium-browser \
    '--browser-args=["--headless=new", "--no-sandbox", "--no-zygote", "--disable-dev-shm-usage", "--disable-software-rasterizer", "--run-all-compositor-stages-before-draw", "--hide-scrollbars", "--window-size=1440,2000", "--autoplay-policy=no-user-gesture-required", "--no-first-run", "--use-fake-ui-for-media-stream", "--use-fake-device-for-media-stream", "--disable-sync", "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)", "--window-size=1440,2000"]' \
    http://192.168.1.102:5555/ /dev/null
```

3. Kill the single-file process before it finishes

Kill the process running single-file. Make sure to use kill -9 to send SIGKILL to the process. This bug will not be triggered if you attempt to kill single-file with ctrl+C.

```bash
$ sudo kill -9 $(pgrep -f 'node /app/node_modules/single-file-cli/single-file')
```

4. Check running processes

At this point, the bug should have been hit. During my testing I had a 100% hit rate, so I expect it will not be difficult for others to reproduce. pstree can be used to verify that dangling chrome processes are still running.

```
$ pstree -anpT $(pgrep -f 'docker_entrypoint.sh archivebox')
dumb-init,768013 -- /app/bin/docker_entrypoint.sh archivebox server --quick-init 0.0.0.0:8000
  ├─archivebox,768076 /usr/local/bin/archivebox server --quick-init 0.0.0.0:8000
  ├─chrome,775781 --allow-pre-commit-input --disable-background-networking ...
  │   ├─chrome,775804
  │   ├─chrome,775809
  │   ├─chrome,775834
  │   ├─chrome,775855
  │   └─chrome,776000
  ├─chrome_crashpad,775783 --monitor-self--monitor-self-annotation=ptype=crashpad
  └─chrome_crashpad,775785 --no-periodic-tasks--monitor-self-annotation=ptype=cra
```

Possible Fixes

1. Manually terminate the process when the timeout is reached.

single-file has code to handle the signals SIGTERM and SIGINT (code). ArchiveBox could add some logic to perform the timeout manually, then send one of these signals when the timeout is reached. That might look something like this:

```python
from subprocess import Popen, CompletedProcess, TimeoutExpired

def run_or_terminate(cmd: list[str], timeout: int = 60, **kwargs) -> CompletedProcess:
    # Note: Popen() itself takes no timeout argument; the timeout belongs on
    # communicate()/wait() instead.
    with Popen(cmd, **kwargs) as child:
        try:
            out, err = child.communicate(timeout=timeout)
            return CompletedProcess(child.args, child.returncode, stdout=out, stderr=err)
        except TimeoutExpired:
            # Send SIGTERM to the child and give it another 0.5x the requested
            # timeout to exit cleanly; fall back to SIGKILL in the event it has
            # issues handling the signal we send it.
            child.terminate()
            try:
                child.wait(timeout=0.5 * timeout)
            except TimeoutExpired:
                child.kill()
            raise
```

2. Make single-file kill subprocesses on exit

This means it would need to do something like the following in each child it spawns (right after forking), so the OS knows what to do with its children when it dies. I am not sure what the Node.js equivalent of this is, but I imagine there is probably some sort of API for it. Additionally, I only looked into the details for Linux; I am not sure what happens on other platforms.

```c
#include <sys/prctl.h>
#include <signal.h>

/* deliver SIGKILL to this process as soon as its parent dies */
prctl(PR_SET_PDEATHSIG, SIGKILL);
```
Author
Owner

@pirate commented on GitHub (Nov 12, 2024):

Thanks for this debugging @jmeggitt. What I'd ideally like to do is handle this in a general way: start extractors in a process group and kill the entire group when the timeout is hit, so we don't need to rely on extractors cleaning up their children properly on SIGINT.

Author
Owner

@jmeggitt commented on GitHub (Nov 12, 2024):

Good point. Process groups would definitely be a more complete solution. I have not worked directly with process groups before, so I can't say for sure, but I think subprocess.run(..., start_new_session=True) may achieve this. If so, this should not be difficult to fix.

https://docs.python.org/3/library/subprocess.html#subprocess.Popen
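For reference, `start_new_session=True` runs the child through `setsid()`, so it becomes the leader of a fresh session and process group; `os.killpg()` on that group then reaches every descendant, not just the direct child. A minimal sketch (using `sleep` as a stand-in for an extractor):

```python
import os
import signal
import subprocess

# start_new_session=True -> the child calls setsid() before exec, so it
# leads its own process group, separate from ours.
child = subprocess.Popen(['sleep', '30'], start_new_session=True)
assert os.getpgid(child.pid) == child.pid       # child is its own group leader
assert os.getpgid(child.pid) != os.getpgid(0)   # distinct from our group

# killing the group reaches the child and any grandchildren it spawned
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
child.wait()
```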

Author
Owner

@pirate commented on GitHub (Nov 12, 2024):

Yup 💯 , the v0.8.5rc already uses start_new_session=True for some workers (just supervisord at the moment), and I plan on implementing it within the new actors system that runs extractor jobs: https://github.com/ArchiveBox/ArchiveBox/blob/a9a3b153b11e8070d23f6aed5eb7169c60eb3a5e/archivebox/actors/actor.py#L229

```python
import os
import signal
import subprocess
import sys
from subprocess import PIPE

proc = None
try:
    proc = subprocess.Popen(['/bin/bash', 'arg1', 'arg2'], stdout=PIPE, stdin=PIPE, text=True, start_new_session=True)
    stdout, stderr = proc.communicate(timeout=60)
    print(f'Subprocess completed! ({proc.returncode})', stdout, stderr)
except (subprocess.TimeoutExpired, KeyboardInterrupt, BrokenPipeError, OSError, BaseException) as err:
    print(f'Subprocess was killed due to {err}', file=sys.stderr)
finally:
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # always kill the entire child process group at the end, even if it succeeded
    except BaseException:
        pass   # ignore pgroup kill failures; the process group might already have exited / might never have existed
```
Author
Owner

@comatory commented on GitHub (Dec 6, 2024):

Is this one addressed in https://github.com/ArchiveBox/ArchiveBox/pull/1311 ?

I'd even run the RC, because ArchiveBox is unfortunately hogging my server 😅

Author
Owner

@pirate commented on GitHub (Jan 8, 2025):

PSA to everyone following this: a little while back I did some testing and confirmed this is an underlying bug in Chrome's implementation of their new --headless=new mode on some platforms, and it happens reliably outside of ArchiveBox, even if ArchiveBox is not installed.

![Image](https://github.com/user-attachments/assets/5efb37d2-d015-43c0-a8cc-815f42b69775)

You can see my full analysis + steps to reproduce the issue outside of ArchiveBox here:

  • https://github.com/cypress-io/cypress/issues/27264#issuecomment-1972167140

🚨 Please help us get this fixed upstream by commenting / upvoting the issue over on the Chromium bug tracker: ➡️ https://issues.chromium.org/issues/327583144 🚨


In the meantime I'm working on 3 workarounds to fix the issue in the v0.8.x dev branch:

  1. launching chrome processes inside a process group with start_new_session=True, and killing the entire process group at the end of each extractor run (as described above)

  2. always running chrome in headful mode (aka CHROME_HEADLESS=False) and connecting it to a virtual Xvfb display (which also allows archiving to be watched in realtime in the browser using novnc); this works because the hang-before-exit bug only happens in headless mode on macOS when not using a user data dir, and non-headless mode on Linux when using a user data dir, AFAICT.

  3. always running chrome with a --user-data-dir=... by creating a new empty user data dir when one is not provided, and always copying any provided user-data-dir to a new folder that's unique per chrome instance to avoid SingletonLock contention (using copy-on-write when available on the filesystem to avoid actually duplicating the entire chrome user data dir)

I'm not yet done with fix 3., but fix 1. and fix 2. (currently available only when running >=v0.8.5rc51 on docker-compose only) are partially completed on dev. I'll post updates here as more progress is made.

Author
Owner

@scttnlsn commented on GitHub (Jan 17, 2025):

@pirate I was running into this issue so I tried out 0.8.5rc51 but it looks like some migrations are missing:

```
archivebox  | [*] Verifying main SQL index and running any migrations needed...
archivebox  |     Operations to perform:
archivebox  |     Apply all migrations: admin, api, auth, contenttypes, core, huey_monitor, machine, sessions
archivebox  |     Running migrations:
archivebox  |     No migrations to apply.
archivebox  |     Your models in app(s): 'crawls', 'seeds' have changes that are not yet reflected in a migration, and so won't be applied.
archivebox  |     Run 'manage.py makemigrations' to make new migrations, and then re-run 'manage.py migrate' to apply them.
archivebox  |     Operations to perform:
archivebox  |     Apply all migrations: huey_monitor
archivebox  |     Running migrations:
archivebox  |     No migrations to apply.
archivebox  |     Your models in app(s): 'crawls', 'seeds' have changes that are not yet reflected in a migration, and so won't be applied.
archivebox  |     Run 'manage.py makemigrations' to make new migrations, and then re-run 'manage.py migrate' to apply them.
archivebox  |
archivebox  |     √ ./index.sqlite3
```

I'm running via Docker Compose. I get related errors when taking various actions in the app as well:

```
Error occurred while loading the page: no such table: crawls_outlink
```

I tried pulling the latest code and running Django's makemigrations command myself but was getting some other errors related to circular imports (the same error seen in CI here: https://github.com/ArchiveBox/ArchiveBox/actions/runs/12681404211/job/35345049585). I realize things are in flux so wasn't sure if you want any contributions to fix any of this right now.


EDIT: I see this is already documented here: https://github.com/ArchiveBox/ArchiveBox/issues/1566

Author
Owner

@pirate commented on GitHub (Feb 6, 2025):

The last few weeks I've been testing alternative browser-driver solutions that take care of cleaning up their processes on their own (among many other things I don't want to have to build). Here are the top contenders:

  • browserless: https://github.com/browserless/browserless
  • browsertrix-crawler: https://github.com/webrecorder/browsertrix-crawler
  • selenium-grid: https://github.com/SeleniumHQ/docker-selenium#experimental-multi-arch-amd64aarch64armhf-images

The main things I'm looking for to be useful to ArchiveBox:

  • ability to auto-install/use latest version of chrome/chromium/brave without complex packaging
  • ability to pass custom chrome CLI launch flags, connect to both debug port and wss:// remote debugger, update cookies, etc. for up to 16 sessions at a time per host
  • ability to open and close new tabs, connect via puppeteer/playwright/etc.
  • ability to save in-browser downloads, screenshots, PDFs, byte-arrays to the local filesystem and retrieve them
  • ability to trigger browser extension actions, listen for and connect to service workers / background contexts
Author
Owner

@pirate commented on GitHub (Feb 16, 2025):

Quick update: the chromium team is finally taking a look at our bug report! They've assigned it to a team member and they're working on it now. https://issues.chromium.org/issues/327583144

![Image](https://github.com/user-attachments/assets/966b8ddd-e559-4c80-ab5c-cb0f8056fbed)

@comatory commented on GitHub (Feb 17, 2025):

@pirate do you have a link so I could subscribe to the updates? Or if you can keep us updated here that would be much appreciated 🙏


@JustTooKrul commented on GitHub (Apr 11, 2025):

After spending quite a while getting chromium and novnc to work to properly set up the browser, it seems like migrating to something that has the flexibility to run different browsers, with the entire process operating in one container, would be a big benefit. It's especially intriguing given how restrictive chrome-flavored browsers are becoming with the manifest change. Using uBlock in the headless browser to filter ads and bypass paywalls is a benefit that chrome is about to lose; having other browser options in a nice package could help those collection methods.


@comatory commented on GitHub (May 22, 2025):

I would love to start using ArchiveBox again; I wanted it as a replacement for another bookmarking service. I'm just unable to, because every time I archive something, it crashes my server.

Is there an easy way to disable this behaviour with the current version? I see the last release was in Dec 2024; not sure if a new release is in the works.

I see this as a planned work for 0.8 milestone, so 🤞


@pirate commented on GitHub (May 23, 2025):

see recent news post here: https://github.com/ArchiveBox/ArchiveBox/issues/191#issuecomment-2848370416


@pirate commented on GitHub (Dec 29, 2025):

this should be fixed on dev; we now auto-kill zombie chrome processes any time we launch a new one

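The auto-kill approach described above can be sketched as follows. This is a hypothetical illustration of the technique, not the actual dev-branch code; it assumes the pattern reported throughout this thread, where Chrome children outlive their parent and get reparented to PID 1:

```python
import os
import signal

# Process names we treat as browser processes (an assumption for this sketch)
BROWSER_NAMES = ('chrome', 'chromium', 'chromium-browser')

def find_orphan_browser_pids(procs):
    """procs: iterable of (pid, ppid, comm) tuples, e.g. parsed from
    `ps -eo pid,ppid,comm` or psutil. Returns PIDs of browser processes
    whose original parent has exited (they were reparented to PID 1)."""
    return [pid for pid, ppid, comm in procs
            if ppid == 1 and any(name in comm for name in BROWSER_NAMES)]

def kill_orphan_browsers(procs):
    """Force-kill orphaned browser processes before launching a new one."""
    for pid in find_orphan_browser_pids(procs):
        try:
            os.kill(pid, signal.SIGKILL)  # SIGKILL: orphaned renderers often ignore SIGTERM
        except ProcessLookupError:
            pass  # process exited between the scan and the kill
```

Note that filtering on `ppid == 1` is a heuristic: inside a container where the browser supervisor runs as PID 1, or on systems where orphans are reparented to a subreaper instead of init, the check would need adjusting.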

@clb92 commented on GitHub (Dec 29, 2025):

That's good news, thanks! Then I won't need to restart the container daily anymore.
