[GH-ISSUE #1606] Bug: Adding many URLs at once results in OperationalErrors #3973

Open
opened 2026-03-15 01:09:45 +03:00 by kerem · 1 comment
Owner

Originally created by @hatzka-nezumi on GitHub (Nov 30, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1606

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

I tried to add a list of 22k URLs. ArchiveBox read them all in and started queuing them, but after a few seconds it started spewing tracebacks. Unfortunately, it looks like different threads are racing to print the tracebacks, so they end up overlapping and aren't all that helpful, but the tail of the last traceback is below.

(For reference, I am the person who asked about using git-annex a little bit ago, but the issue occurs outside of any git repository, so that shouldn't be related.)

Steps to reproduce

1. Install ArchiveBox and run `archivebox install`. (I tried pipx first, then raw pip; both have the same issue. Also, `archivebox install` complains about `ldap`, but it works anyway and the `version` output confirms that `ldap` is available, so.)
2. Pipe 22k URLs into `archivebox add`.

Logs or errors

On stderr:

  File "/Users/nasado/Library/Python/3.13/lib/python/site-packages/django/db/backends/base/base.py", line 279, in ensure_connection
    self.connect()
    ~~~~~~~~~~~~^^
  File "/Users/nasado/Library/Python/3.13/lib/python/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/Users/nasado/Library/Python/3.13/lib/python/site-packages/django/db/backends/base/base.py", line 256, in connect
    self.connection = self.get_new_connection(conn_params)
                      ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/nasado/Library/Python/3.13/lib/python/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/Users/nasado/Library/Python/3.13/lib/python/site-packages/django/db/backends/sqlite3/base.py", line 200, in get_new_connection
    conn = Database.connect(**conn_params)
django.db.utils.OperationalError: unable to open database file
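One possible cause of SQLite's "unable to open database file" under heavy concurrency is the process running out of file descriptors. That is an assumption here, not something this report confirms. A minimal sketch for checking the per-process limit on macOS or Linux, using only the standard-library `resource` module (the helper name `nofile_limits` is made up for illustration):

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # A low soft limit combined with many worker threads, each opening its own
    # database connection, could plausibly exhaust descriptors. This is a guess
    # to rule in or out, not a diagnosis.
    print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit turns out to be small, raising it in the shell with `ulimit -n` before running `archivebox add` would be one way to test the hypothesis.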

In logs/errors.log:

> /Users/nasado/Library/Python/3.13/bin/archivebox add; TS=2024-11-30__03:38:52 VERSION=0.8.5rc51 IN_DOCKER=False IS_TTY=True

> /Users/nasado/Library/Python/3.13/bin/archivebox config; TS=2024-11-30__03:54:12 VERSION=0.8.5rc51 IN_DOCKER=False IS_TTY=True

> /Users/nasado/Library/Python/3.13/bin/archivebox update; TS=2024-11-30__03:54:30 VERSION=0.8.5rc51 IN_DOCKER=False IS_TTY=True

ArchiveBox Version

0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=unknown BUILD_TIME=2024-11-29 21:36:17 1732937777
IN_DOCKER=False IN_QEMU=False ARCH=arm64 OS=Darwin PLATFORM=macOS-15.1.1-arm64-arm-64bit-Mach-O PYTHON=Cpython
EUID=501:20 UID=501:20 PUID=501:20 FS_UID=501:20 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=False
DEBUG=False IS_TTY=True SUDO=False ID=75407311:6998f8a3 SEARCH_BACKEND=ripgrep LDAP=False

 Binary Dependencies:
 √  python                3.13.0       sys_pip    /opt/homebrew/opt/python@3.13/bin/python3.13
 √  django                5.1.3        sys_pip    ~/Library/Python/3.13/lib/python/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    ~/Library/Python/3.13/lib/python/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.3.1       lib_pip    ./lib/arm64-darwin/pip/venv/bin/pip
 √  pipx                  1.7.1        sys_pip    /opt/homebrew/bin/pipx
 √  node                  23.3.0       brew       /opt/homebrew/bin/node
 √  npm                   10.9.0       brew       /opt/homebrew/bin/npm
 √  npx                   10.9.0       brew       /opt/homebrew/bin/npx
 √  playwright            1.49.0       lib_pip    ./lib/arm64-darwin/pip/venv/bin/playwright
 √  puppeteer             23.9.0       lib_npm    ./lib/arm64-darwin/npm/node_modules/.bin/puppeteer
 √  ldap                  3.4.4        lib_pip    ./lib/arm64-darwin/pip/venv/lib/python3.13/site-packages/ldap/__init__.py
 √  rg                    14.1.1       brew       /opt/homebrew/bin/rg
 √  sonic                 1.4.9        brew       /opt/homebrew/bin/sonic
 √  chrome                131.0.6778   env        /Applications/Chromium.app/Contents/MacOS/Chromium
 √  curl                  8.7.1        env        /usr/bin/curl
 √  git                   2.47.1       brew       /opt/homebrew/bin/git
 √  postlight-parser      2.2.3        lib_npm    ./lib/arm64-darwin/npm/node_modules/.bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ./lib/arm64-darwin/npm/node_modules/.bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ./lib/arm64-darwin/npm/node_modules/.bin/single-file
 √  wget                  1.25.0       brew       /opt/homebrew/bin/wget
 √  yt-dlp                2024.11.18   sys_pip    /opt/homebrew/bin/yt-dlp
 √  ffmpeg                7.1.0        brew       /opt/homebrew/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=501  PATH=/opt/homebrew/opt/python@3.13/bin:~/.local/bin:~/Library/Python/3.13/bin:/opt…
 -  apt         not available                                        UID=0    PATH=
 √  brew        /opt/homebrew/bin/brew                               UID=501  PATH=/usr/local/bin:/home/linuxbrew/.linuxbrew/bin:/opt/homebrew/bin
 -  sys_pip     not available                                        UID=501  PATH=/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Version…
 -  venv_pip    not available                                        UID=501  PATH=/tmp/NotInsideAVenv/lib/bin
 √  lib_pip     ./lib/arm64-darwin/pip/venv/bin/pip                  UID=501  PATH=./lib/arm64-darwin/pip/venv/bin
 √  sys_npm     /opt/homebrew/bin/npm                                UID=501  PATH=/opt/homebrew/bin
 √  lib_npm     /opt/homebrew/bin/npm                                UID=501  PATH=./lib/arm64-darwin/npm/node_modules/.bin:./node_modules/.bin:/opt/homebrew/bin
 √  playwright  ./lib/arm64-darwin/pip/venv/bin/playwright           UID=501  PATH=./lib/arm64-darwin/bin:/opt/homebrew/opt/python@3.13/bin:~/.local/bin:~/Libra…
 √  puppeteer   /opt/homebrew/bin/npx                                UID=501  PATH=./lib/arm64-darwin/bin

 Code locations:
 √  PACKAGE_DIR           36 files        valid     ~/Library/Python/3.13/lib/python/site-packages/archivebox                   
 √  TEMPLATES_DIR         4 files         valid     ~/Library/Python/3.13/lib/python/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates                       
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins                         
 √  LIB_DIR               4 files         valid     ./lib/arm64-darwin                     

 Data locations:
 √  DATA_DIR              11 files        valid     ~/収集/私的/未加工/archivebox                                                      
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf                      
 √  SQL_INDEX             5.2 MB          valid     ./index.sqlite3                        
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3                        
 √  ARCHIVE_DIR           0 files         valid     ./archive                              
 √  SOURCES_DIR           1 files         valid     ./sources                              
 √  PERSONAS_DIR          1 files         valid     ./personas                             
 √  LOGS_DIR              1 files         valid     ./logs                                 
 √  TMP_DIR               0 files         valid     ./tmp/75407311

How did you install the version of ArchiveBox you are using?

pip

What operating system are you running on?

macOS (including Docker on macOS)

What type of drive are you using to store your ArchiveBox data?

  • [x] data/ is on a local SSD or NVMe drive
  • [ ] data/ is on a spinning hard drive or external USB drive
  • [ ] data/ is on a network mount (e.g. NFS/SMB/CIFS/etc.)
  • [ ] data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration


ArchiveBox Configuration

Just the secret key.

Author
Owner

@pirate commented on GitHub (Dec 1, 2024):

0.8.6rc0 (dev) has some WIP improvements that should allow up to 100k URLs in one go. It's not yet released, but keep an eye out for it.
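
Until that release is available, one thing that might be worth trying is feeding the URLs to `archivebox add` in smaller batches instead of all at once. The sketch below is untested against this bug. It assumes `archivebox add` reads newline-separated URLs from stdin (as in the reproduction steps) and that it is run from inside the data directory. The batch size of 500 and the helper names `chunked` and `add_in_batches` are arbitrary choices for illustration.

```python
import subprocess
import sys
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def add_in_batches(
    urls: Iterable[str],
    batch_size: int = 500,  # arbitrary assumption, tune as needed
    command: Sequence[str] = ("archivebox", "add"),
) -> int:
    """Pipe each batch to `command` on stdin; return the number of batches sent."""
    sent = 0
    for batch in chunked(urls, batch_size):
        # check=True stops at the first failing batch instead of piling up errors.
        subprocess.run(list(command), input="\n".join(batch) + "\n", text=True, check=True)
        sent += 1
    return sent


if __name__ == "__main__":
    # Read one URL per line from stdin, skipping blank lines.
    url_list = [line.strip() for line in sys.stdin if line.strip()]
    print(f"sent {add_in_batches(url_list)} batches")
```

Saved under a hypothetical name such as `batch_add.py`, it would be run as `python batch_add.py < urls.txt` from the ArchiveBox data directory.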
