[GH-ISSUE #1599] Bug: "Archive Again" with multiple URLs breaks all Chromium based archival methods, and others #2465

Open
opened 2026-03-01 17:59:14 +03:00 by kerem · 2 comments

Originally created by @nguyenmp on GitHub (Nov 18, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1599

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

When I submit multiple URLs for archival, they are processed serially, which I think is intentional and good.

However, when I select multiple snapshots and click "Archive Again", they are very noticeably processed in parallel, and this breaks the Chromium profile. It sometimes leaves the Chromium lock files in /data/personas/Default/chrome_profile/Singleton*, which prevent future Chromium launches. Pretty much all archival attempts fail on the second run, and even single URLs fail after the Chromium lockfile issue has been triggered.

I'm not exactly sure what the knock-on effects are, but the following extractors fail very consistently once I get into this state:

  • archive_org
  • htmltotext
  • readability
  • title
  • dom
  • screenshot
  • pdf
  • singlefile

The workaround is to delete the Chrome profile and submit only one URL at a time:

rm -r data/personas/
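
A narrower alternative to deleting the whole personas directory might be to remove only the leftover Singleton* entries. The following is a minimal sketch, not part of ArchiveBox: the function name `clear_stale_singleton_files` is hypothetical, and it assumes no Chromium process is still using the profile (otherwise removing the lock could allow the very corruption Chromium is guarding against).

```python
from pathlib import Path


def clear_stale_singleton_files(profile_dir: Path) -> list[Path]:
    """Remove leftover Chromium Singleton* entries; return the paths removed.

    Hypothetical helper. Only call it when no Chromium process is running
    against this profile directory.
    """
    if not profile_dir.is_dir():
        return []
    removed = []
    for entry in sorted(profile_dir.glob("Singleton*")):
        # On Linux these entries are typically symlinks, often dangling once
        # the owning process is gone, so test is_symlink() before is_file().
        if entry.is_symlink() or entry.is_file():
            entry.unlink()
            removed.append(entry)
    return removed


# Example call, using the profile path from this report (relative to the
# directory that contains data/):
# clear_stale_singleton_files(Path("data/personas/Default/chrome_profile"))
```

Unlike `rm -r data/personas/`, this leaves the rest of the profile (cookies, preferences) in place.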

Steps to reproduce

  1. docker run -v "./data/:/data/" archivebox/archivebox:dev archivebox init
  2. docker run -v "./data/:/data/" -it archivebox/archivebox:dev archivebox manage createsuperuser
  3. docker run -p "8000:8000" -v "./data/:/data/" archivebox/archivebox:dev
  4. Visit http://localhost:8000/add/ and add 3 urls, all at once:
    1. https://google.com
    2. https://wikipedia.com
    3. https://reddit.com
  5. Let those finish archiving (about 2 minutes)
  6. Select all and "Archive Again" from http://localhost:8000/admin/core/snapshot/

Logs or errors

From `worker_scheduler.log`:

      > screenshot
        Extractor failed:
             Failed to save screenshot
            [1437:1437:1118/224805.720253:ERROR:process_singleton_posix.cc(340)] Failed to create /data/personas/Default/chrome_profile/SingletonLock: File exists (17)
            [1437:1437:1118/224805.720577:ERROR:chrome_main_delegate.cc(594)] Failed to create a ProcessSingleton for your profile directory. This means that running multiple instances would start multiple browser processes rather than opening a new window in the existing process. Aborting now to avoid profile corruption.

ArchiveBox Version

0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=63bf902 BUILD_TIME=2024-10-24 06:30:40 1729751440
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.4-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
EUID=911:0 UID=911:0 PUID=911:0 FS_UID=911:0 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=False SUDO=False ID=9f373648:efbea00e SEARCH_BACKEND=ripgrep LDAP=False

 Binary Dependencies:
 √  python                3.11.10      sys_pip    /usr/local/bin/python3.11
 √  django                5.1.2        sys_pip    /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.0.0       sys_pip    /usr/local/bin/pip
 √  pipx                  1.1.0        sys_pip    /usr/bin/pipx
 √  node                  22.10.0      apt        /usr/bin/node
 √  npm                   10.9.0       apt        /usr/bin/npm
 √  npx                   10.9.0       apt        /usr/bin/npx
 √  playwright            1.48.0       sys_pip    /usr/local/bin/playwright
 √  puppeteer             23.6.0       lib_npm    ~/.npm/bin/puppeteer
 √  ldap                  3.4.4        sys_pip    /usr/local/lib/python3.11/site-packages/ldap/__init__.py
 √  rg                    13.0.0       apt        /usr/bin/rg
 √  sonic                 1.4.9        env        /usr/local/bin/sonic
 √  chrome                130.0.6723   env        /usr/bin/chromium-browser
 √  curl                  8.10.1       apt        /usr/bin/curl
 √  git                   2.39.5       apt        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    ~/.npm/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ~/.npm/bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ~/.npm/bin/single-file
 √  wget                  1.21.3       apt        /usr/bin/wget
 √  yt-dlp                2024.10.22   sys_pip    /usr/local/bin/yt-dlp
 √  ffmpeg                5.1.6        env        /usr/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=911  P…
 √  apt         /usr/bin/apt-get                                     UID=0    P…
 -  brew        not available                                        UID=911  P…
 √  sys_pip     /usr/local/bin/pip                                   UID=911  P…
 -  venv_pip    not available                                        UID=911  P…
 -  lib_pip     not available                                        UID=911  P…
 √  sys_npm     /usr/bin/npm                                         UID=911  P…
 -  lib_npm     /usr/bin/npm                                         UID=911  P…
 √  playwright  /usr/local/bin/playwright                            UID=0    P…
 √  puppeteer   /usr/bin/npx                                         UID=911  P…

 Code locations:
 √  PACKAGE_DIR           39 files        valid     /app/archivebox             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates   
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates            
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins              
 √  LIB_DIR               0 files         valid     /usr/share/archivebox/lib   

 Data locations:
 √  DATA_DIR              17 files @      valid     /data                       
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf           
 √  SQL_INDEX             476.0 KB        valid     ./index.sqlite3             
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3             
 √  ARCHIVE_DIR           9 files         valid     ./archive                   
 √  SOURCES_DIR           6 files         valid     ./sources                   
 √  PERSONAS_DIR          2 files         valid     ./personas                  
 √  LOGS_DIR              5 files         valid     ./logs                      
 √  TMP_DIR               0 files         valid     /tmp/archivebox

How did you install the version of ArchiveBox you are using?

Docker (or other container system like podman/LXC/Kubernetes or TrueNAS/Cloudron/YunoHost/etc.)

What operating system are you running on?

macOS (including Docker on macOS)

What type of drive are you using to store your ArchiveBox data?

  • [x] data/ is on a local SSD or NVMe drive
  • [ ] data/ is on a spinning hard drive or external USB drive
  • [ ] data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  • [ ] data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration

N/A

ArchiveBox Configuration

# Converted from INI to TOML format: https://toml.io/en/

[SERVER_CONFIG]
SECRET_KEY = "abcdefg"

@TobiasHonscha commented on GitHub (Nov 22, 2024):

I have the same problem!


@pirate commented on GitHub (Nov 22, 2024):

Yup, this is a known old issue that the new Personas system (https://github.com/ArchiveBox/ArchiveBox/tree/dev/archivebox/personas, WIP) is being built to address.

In the upcoming release it will copy the entire Chrome profile directory to a unique tmp dir before starting a new Chrome instance, which should allow parallel Chrome instances to run at once without stepping on each other's lockfiles.
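
The copy-to-a-unique-tmp-dir idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Personas implementation: the name `isolated_profile` is hypothetical, and skipping the Singleton* entries during the copy is an extra assumption so that a stale lock is not carried into the copy.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_profile(base_profile: Path):
    """Yield a private throwaway copy of base_profile; delete it on exit.

    Hypothetical helper, not ArchiveBox's API.
    """
    # mkdtemp() creates a uniquely named directory, so parallel callers
    # never share a profile directory (or its lock files).
    tmp_root = Path(tempfile.mkdtemp(prefix="chrome_profile_"))
    copy_dir = tmp_root / "profile"
    try:
        shutil.copytree(
            base_profile,
            copy_dir,
            symlinks=True,  # copy symlinks as links instead of following them
            ignore=shutil.ignore_patterns("Singleton*"),  # drop stale locks
        )
        yield copy_dir
    finally:
        shutil.rmtree(tmp_root, ignore_errors=True)


# Possible usage: point Chromium at the copy via its --user-data-dir flag.
# with isolated_profile(Path("/data/personas/Default/chrome_profile")) as p:
#     subprocess.run(["/usr/bin/chromium-browser", f"--user-data-dir={p}", ...])
```

One trade-off of this design is that anything Chromium writes to the copy (new cookies, cache) is discarded when the block exits unless it is copied back explicitly.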

I'm going to initially soft-limit it to a maximum of 4 instances running in parallel per-machine-per-collection to avoid hitting too many rate limits, but let users configure it higher so people can scale up if they have more advanced infrastructure (e.g. VPNs/proxies/extra CPUs/etc.) that can handle it.
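
A configurable cap like this could be sketched with a bounded semaphore. This is an assumption-laden illustration only: `ChromeSlots`, `run_all`, and `MAX_PARALLEL_CHROME` are hypothetical names, and the sketch limits threads inside a single process, whereas a real per-machine-per-collection limit would need a cross-process mechanism.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_CHROME = 4  # default taken from the comment above; name is hypothetical


class ChromeSlots:
    """Allow at most `limit` jobs to run at once (single process only)."""

    def __init__(self, limit: int = MAX_PARALLEL_CHROME):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # jobs currently running
        self.peak = 0    # highest concurrency observed so far

    def run(self, job, *args):
        # Block here until one of the `limit` slots is free.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self.active -= 1


def run_all(slots: ChromeSlots, job, items, workers: int = 16):
    """Run job(item) for every item; the pool may be larger than the cap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: slots.run(job, item), items))
```

Raising the limit would then be a matter of constructing `ChromeSlots(limit=...)` from a user-supplied setting.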
