[GH-ISSUE #376] Bugfix: Chrome crashes inside of docker #3279

Closed
opened 2026-03-14 21:55:06 +03:00 by kerem · 7 comments
Owner

Originally created by @1n5aN1aC on GitHub (Jul 21, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/376

Describe the bug

PDF and Screenshot generation failing. (It seems like some of them complete eventually on subsequent attempts)
400MB+ "core" files are left in directories when those failures happen.

Steps to reproduce

This is a new install of ArchiveBox (first time trying it) using the Docker Hub image (nikisweeting/archivebox).
The default capture in the docker-compose fails as well: (command: bash -c 'echo "https://github.com/pirate/ArchiveBox" | /bin/archive; tail -f /dev/null')

  Archivebox:
    image: nikisweeting/archivebox
    container_name: archivebox
    restart: ${RESTART_MODE}
    command: bash -c 'echo "https://github.com/pirate/ArchiveBox" | /bin/archive; tail -f /dev/null'
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /mnt/data/Archive/ArchiveBox:/data
    environment:
      - USE_COLOR=False
      - SHOW_PROGRESS=False

Screenshots or log output

First entry succeeds (at least for pdf+screenshot) second does not:

[*] [2020-07-21 14:09:23] "Opinion: The unspoken truth about managing geeks | Computerworld"
    https://www.computerworld.com/article/2527153/opinion-the-unspoken-truth-about-managing-geeks.html
    √ /data/archive/1595363993
      > archive_org
        Failed: Failed to find "content-location" URL header in Archive.org response.
        Run to see full output:
            cd /data/archive/1595363993;
            curl --location --head --user-agent "ArchiveBox/6c4c6862e (+https://github.com/pirate/ArchiveBox/)" --max-time 60 https://web.archive.org/save/https://www.computerworld.com/article/2527153/opinion-the-unspoken-truth-about-managing-geeks.html

[*] [2020-07-21 14:10:17] "GitHub - pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more..."
    https://github.com/pirate/ArchiveBox
    √ /data/archive/1595353459
      > pdf
        Failed:Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1595353459;
            google-chrome-unstable --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://github.com/pirate/ArchiveBox
      > screenshot
        Failed:Exception Failed to chmod: screenshot.png does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1595353459;
            google-chrome-unstable --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot https://github.com/pirate/ArchiveBox
      > archive_org
        Failed: Failed to find "content-location" URL header in Archive.org response.
        Run to see full output:
            cd /data/archive/1595353459;
            curl --location --head --user-agent "ArchiveBox/6c4c6862e (+https://github.com/pirate/ArchiveBox/)" --max-time 60 https://web.archive.org/save/https://github.com/pirate/ArchiveBox
[√] [2020-07-21 14:11:09] Update of 3 pages complete (3.00 min)
    - 0 links skipped
    - 0 links updated
    - 3 links had errors
    To view your archive, open: /data/index.html

Software versions

  • OS: Linux 5.7
  • Docker: Docker version 19.03.12, build 48a66213fe
  • ArchiveBox version: 10799e4
  • Python version: Python 3.5.3
  • Chrome version: unknown
kerem closed this issue 2026-03-14 21:55:11 +03:00
Author
Owner

@pirate commented on GitHub (Jul 21, 2020):

This is sometimes due to lack of RAM on low-powered devices, or the older Chrome build in that docker image being incompatible with an unusual architecture (When Chrome crashes it core-dumps to disk, which leaves those large files).

Can you try on v0.4.3 (the next release, coming out this week)?

git checkout django
git pull
docker build . -t archivebox:0.4.3
docker run -v "$(pwd)/output":/data archivebox:0.4.3 update
docker run -v "$(pwd)/output":/data archivebox:0.4.3 add "https://example.com"
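
Since a crashing Chrome leaves large core dumps behind, it can help to list them before retrying. The following is a minimal sketch, not part of ArchiveBox: `find_core_dumps` is a hypothetical helper, and the file names `core` / `core.*` and the 100M default threshold are assumptions (the actual dump name depends on the host's `kernel.core_pattern` setting).

```shell
# Hypothetical helper: list files that look like core dumps under a directory.
# Assumption: dumps are named "core" or "core.<something>"; adjust the patterns
# if your kernel.core_pattern produces different names.
find_core_dumps() {
    dir="$1"
    min_size="${2:-100M}"   # only report files larger than this (find -size syntax)
    find "$dir" -type f \( -name 'core' -o -name 'core.*' \) -size +"$min_size"
}
```

For example, `find_core_dumps /mnt/data/Archive/ArchiveBox` would print any matching files under the data directory used in this report. Review the list before deleting anything.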
Author
Owner

@1n5aN1aC commented on GitHub (Jul 21, 2020):

This is on a "relatively-standard" device (amd64 / Ryzen) and should have plenty of ram. (32GB Installed, most of that available)

The Chrome thing seems somewhat likely, I will try it now.

Author
Owner

@1n5aN1aC commented on GitHub (Jul 21, 2020):

Seems to be the same result as before:

root@CENTAUR:~/deleteme/ArchiveBox# docker run -v "/mnt/data/Archive/ArchiveBox":/data archivebox:0.4.3 init
[i] [2020-07-21 22:15:03] ArchiveBox v0.4.3: archivebox init < /dev/stdin
    > /data

[+] Initializing a new ArchiveBox collection in this folder...
    /data
------------------------------------------------------------------

[+] Building archive folder structure...
    √ /data/sources
    √ /data/archive
    √ /data/logs
    √ /data/ArchiveBox.conf

[+] Building main SQL index and running migrations...
    √ /data/index.sqlite3

    Operations to perform:
    Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying contenttypes.0001_initial... OK
    Applying auth.0001_initial... OK
    Applying admin.0001_initial... OK
    Applying admin.0002_logentry_remove_auto_add... OK
    Applying admin.0003_logentry_add_action_flag_choices... OK
    Applying contenttypes.0002_remove_content_type_name... OK
    Applying auth.0002_alter_permission_name_max_length... OK
    Applying auth.0003_alter_user_email_max_length... OK
    Applying auth.0004_alter_user_username_opts... OK
    Applying auth.0005_alter_user_last_login_null... OK
    Applying auth.0006_require_contenttypes_0002... OK
    Applying auth.0007_alter_validators_add_error_messages... OK
    Applying auth.0008_alter_user_username_max_length... OK
    Applying auth.0009_alter_user_last_name_max_length... OK
    Applying auth.0010_alter_group_name_max_length... OK
    Applying auth.0011_update_proxy_permissions... OK
    Applying core.0001_initial... OK
    Applying core.0002_auto_20200625_1521... OK
    Applying core.0003_auto_20200630_1034... OK
    Applying core.0004_auto_20200713_1552... OK
    Applying sessions.0001_initial... OK

[*] Collecting links from any existing indexes and archive folders...

[*] [2020-07-21 22:15:10] Writing 0 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

------------------------------------------------------------------
[√] Done. A new ArchiveBox collection was initialized (0 links).

    Hint: To view your archive index, run:
        archivebox server  # then visit http://127.0.0.1:8000

    To add new links, you can run:
        archivebox add ~/some/path/or/url/to/list_of_links.txt

    For more usage and examples, run:
        archivebox help
root@CENTAUR:~/deleteme/ArchiveBox# docker run -v "/mnt/data/Archive/ArchiveBox":/data archivebox:0.4.3 add "https://github.com/pirate/ArchiveBox"
[i] [2020-07-21 22:15:17] ArchiveBox v0.4.3: archivebox add https://github.com/pirate/ArchiveBox < /dev/stdin
    > /data

[+] [2020-07-21 22:15:18] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1595369718-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-07-21 22:15:18] Writing 1 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

[▶] [2020-07-21 22:15:19] Collecting content for 1 Snapshots in archive...

[+] [2020-07-21 22:15:19] "github.com/pirate/ArchiveBox"
    https://github.com/pirate/ArchiveBox
    > ./archive/1595369718
      > title
      > favicon
      > wget
      > pdf
        Failed:
            Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1595369718;
            google-chrome --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://github.com/pirate/ArchiveBox

      > screenshot
        Failed:
            Exception Failed to chmod: screenshot.png does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1595369718;
            google-chrome --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot https://github.com/pirate/ArchiveBox

      > dom
      > git
      > media
      > archive_org
        Failed:
             Failed to find "content-location" URL header in Archive.org response.
        Run to see full output:
            cd /data/archive/1595369718;
            curl --silent --location --head --max-time 60 --user-agent "ArchiveBox/0.4.3 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/https://github.com/pirate/ArchiveBox


[√] [2020-07-21 22:16:23] Update of 1 pages complete (1.07 min)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /data/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-07-21 22:16:23] Writing 1 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

root@LOCALHOST:~/deleteme/ArchiveBox# uname -a
Linux LOCALHOST 5.7.0-1-amd64 #1 SMP Debian 5.7.6-1 (2020-06-24) x86_64 GNU/Linux

Author
Owner

@pirate commented on GitHub (Jul 21, 2020):

Ok, I suspect it's running out of shared memory (the docker default is 64MB regardless of how much RAM you have).
Try launching it with the --shm-size 512M flag like so:

docker run -v "$(pwd)/output":/data --shm-size 512M archivebox:0.4.3 update

If that works, then it confirms that shared memory was the limitation, and we can either add that flag to the docker setup or tell Chrome not to use shared memory with --disable-software-rasterizer --disable-dev-shm-usage.
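
To check whether the flag took effect, the size of /dev/shm can be read from inside the container. This is a minimal sketch under stated assumptions: `shm_size_kb` and `shm_at_least` are hypothetical helpers, and they assume a POSIX `df` and `awk` are available in the container.

```shell
# Hypothetical helper: print the total size of a mount in KiB (default: /dev/shm).
# df -P forces one line per filesystem, so the size is field 2 of line 2.
shm_size_kb() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

# Hypothetical helper: succeed if the mount is at least $1 KiB in size.
shm_at_least() {
    [ "$(shm_size_kb "${2:-/dev/shm}")" -ge "$1" ]
}
```

With the 64MB docker default, `shm_size_kb` should report roughly 65536; with --shm-size 512M, roughly 524288. How you get a shell inside the container depends on the image's entrypoint.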

Author
Owner

@1n5aN1aC commented on GitHub (Jul 22, 2020):

Very sorry for forgetting to set a name for the issue!

Yes, --shm-size 512M does resolve the issue, thank you very much!

Here was my solution to configure this in docker-compose:

build:
  context: .
  shm_size: '512m'

Note, however (and this could perhaps be added to the documentation), that this solution requires your docker-compose.yml file to be version: '3.5' at a minimum, not just '3'.

Author
Owner

@pirate commented on GitHub (Jul 22, 2020):

Fixed in 8cb5302. You can remove the shm_size: '512m' flag from your docker setup; I disabled SHM usage inside containers using Chrome CLI args.

Author
Owner

@ttimasdf commented on GitHub (Sep 22, 2020):

Tested on b18bbf88749984d10b04d1c7cfe0cae34257d6e4: the shm_size parameter is still needed for Chromium to work. Screenshot is okay without it, but it's still required for PDF output.
