[GH-ISSUE #781] Database is locked and other weird behavior when doing simultaneous adds #2004

Closed
opened 2026-03-01 17:55:45 +03:00 by kerem · 3 comments

Originally created by @jgoerzen on GitHub (Jul 5, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/781

Describe the bug

Simultaneous invocations of "archivebox add" crash with a "database is locked" error, or, with enough retrying, begin to interfere with each other.

Steps to reproduce

The first invocation of "archivebox add" works normally.

While it continues to run, subsequent invocations crash with a "database is locked" error at the point where they attempt to insert into the master index. Oddly, they do seem to manage to insert SOME data into the master index. Rerunning the add with the same source causes the number of items left to add to the master index to shrink each time, until eventually the second add starts executing as well (this may take dozens of attempts for large sets).

At that point, however, both running "add" processes begin to develop mysterious errors in the processing stages, and I am unsure of the reliability of the results.

Screenshots or log output

The failed "archivebox add" error looks like this:

docker run --rm -i -v /opt/deleteme:/data archivebox/archivebox add < ~/tmp/foo
[i] [2021-07-05 20:13:01] ArchiveBox v0.6.2: archivebox add
    > /data

[+] [2021-07-05 20:13:02] Adding 156589 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1625515982-import.txt
    > Parsed 23649 URLs from input (Generic TXT)
    > Found 3809 new URLs not already in index

[*] [2021-07-05 20:13:55] Writing 3809 links to main index...
Traceback (most recent call last):
  File "/app/archivebox/index/sql.py", line 41, in write_link_to_sql_index
    info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 602, in add
    write_main_index(links=new_links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/__init__.py", line 232, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 88, in write_sql_main_index
    write_link_to_sql_index(link)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 46, in write_link_to_sql_index
    snapshot, _ = Snapshot.objects.update_or_create(url=link.url, defaults=info)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
    obj, created = self._create_object_from_params(kwargs, params, lock=True)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
    updated = self._save_table(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
    return manager._insert(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.OperationalError: database is locked

I have observed that if the first downloader is busy with something big, like downloading from YouTube, subsequent invocations may proceed without an error.
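The underlying failure can be demonstrated at the SQLite level without ArchiveBox at all: a second connection that tries to write while another connection holds the write lock raises the same `OperationalError`. A minimal sketch (the table name and paths here are illustrative, not ArchiveBox's actual schema):

```python
import os
import sqlite3
import tempfile

# Throwaway database standing in for index.sqlite3
db = os.path.join(tempfile.mkdtemp(), "index.sqlite3")

# isolation_level=None: autocommit mode, so we manage the transaction explicitly
writer = sqlite3.connect(db, isolation_level=None)
writer.execute("CREATE TABLE snapshot (url TEXT PRIMARY KEY, ts TEXT)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock, like a long-running add
writer.execute("INSERT INTO snapshot VALUES ('https://example.com/a', '1')")

# A second concurrent "add": give up almost immediately instead of waiting
second = sqlite3.connect(db, timeout=0.1)
try:
    second.execute("INSERT INTO snapshot VALUES ('https://example.com/b', '2')")
except sqlite3.OperationalError as err:
    print(err)  # database is locked
```

SQLite's built-in busy handler (the `timeout` argument, 5 seconds by default in Python) only retries for a fixed window, so a writer that holds its transaction open longer than that, as a large `archivebox add` does, still surfaces to other processes as `database is locked`.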

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.10.0-0.bpo.7-amd64-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data                                                                       
 √  SOURCES_DIR           4 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           8 files         valid     ./archive                                                                   
 √  CONFIG_FILE           136.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             3.6 MB          valid     ./index.sqlite3                                                             
kerem closed this issue 2026-03-01 17:55:45 +03:00

@pirate commented on GitHub (Jul 5, 2021):

Expected behavior, fully parallel add is not yet supported, see https://github.com/ArchiveBox/ArchiveBox/issues/91.

It's complicated to implement because SQLite does not support multiple writers, so we have to do inter-process coordination and use a job queue system to serialize writes at the application level, which is a major refactor not expected for at least another 6 months.
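The inter-process coordination described above can be sketched with an advisory file lock that serializes index writes across processes. This is a hypothetical workaround, not ArchiveBox's actual code; the function name and lock path are made up for illustration (and `fcntl` is Unix-only):

```python
import contextlib
import fcntl
import os

@contextlib.contextmanager
def index_write_lock(lock_path):
    """Hold an exclusive advisory lock so only one process writes the index.

    Unix-only (fcntl.flock). Other processes calling this block until the
    current holder releases the lock, serializing SQLite writes at the
    application level.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any other writer releases
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Hypothetical usage: wrap each batch of index writes
# with index_write_lock("/data/index.lock"):
#     write_main_index(links)
```

An advisory lock only helps if every writer opts in, and it does nothing for the extractor stages; a real fix still needs the job-queue refactor described above.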


@jgoerzen commented on GitHub (Jul 5, 2021):

Ah. Well, over in the docs here:

https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#large-archives

it was explicitly suggested, so maybe that doc page needs fixing?


@pirate commented on GitHub (Apr 12, 2022):

Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.
