[GH-ISSUE #236] Bugfixes: Large crawls eventually crash during json loading/dumping #1675

Closed
opened 2026-03-01 17:52:44 +03:00 by kerem · 14 comments
Owner

Originally created by @anarcat on GitHub (May 7, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/236

Describe the bug

This is yet another 0.4.1 bug, feel free to close it but do notice that I can't upgrade either. ;)

Steps to reproduce

  1. Run ArchiveBox with around 10,000 URLs to crawl
  2. Wait around 3 hours
  3. Crawl eventually crashes with: TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'

Screenshots or log output

Full backtrace:

[+] [2019-05-07 03:38:34] "www.varnish-software.com/blog/introducing-varnish-massive-storage-engine"
    https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine
    > ./archive/1557189662.144
      > title
        Failed:
            HTTPError HTTP Error 404: Not Found
        Run to see full output:
            cd /srv/backup/archive/archivebox/archive/1557189662.144;
            curl https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine | grep <title

      > favicon
      > wget
        Failed:
             Got an error from the server
            Got wget response code: 8.
            https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine:
            2019-05-07 03:38:35 erreur 404 : Not Found.
        Run to see full output:
            cd /srv/backup/archive/archivebox/archive/1557189662.144;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557200315 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine

      > pdf
      > screenshot
      > dom
      > media
    ! Failed to archive link: TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
                                     
Traceback (most recent call last):   
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>
    sys.exit(main())                 
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main
    pwd=pwd or OUTPUT_DIR,           
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
    out_dir=pwd or OUTPUT_DIR,       
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add
    archive_link(link, out_dir=link.link_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 84, in archive_link
    write_link_details(link, out_dir=link.link_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 345, in write_link_details
    write_json_link_details(link, out_dir=out_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 89, in write_json_link_details
    atomic_write(link._asdict(extended=True), path)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 73, in atomic_write
    pyjson.dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
  File "/usr/lib/python3.7/json/__init__.py", line 179, in dump
    for chunk in iterable:           
  File "/usr/lib/python3.7/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 438, in _iterencode
    o = _default(o)                  
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 250, in default
    return obj._asdict()             
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/schema.py", line 36, in _asdict
    return asdict(self)              
  File "/usr/lib/python3.7/dataclasses.py", line 1044, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.7/dataclasses.py", line 1051, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.7/dataclasses.py", line 1085, in _asdict_inner
    return copy.deepcopy(obj)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 274, in _reconstruct
    y = func(*args)
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Command exited with non-zero status 1
2466.01user 359.38system 2:57:42elapsed 26%CPU (0avgtext+0avgdata 258044maxresident)k
316056inputs+48205712outputs (1181major+28465380minor)pagefaults 0swaps
"time archivebox add wallabag.list " took 2 hours 57 mins 43 secs

I have tried upgrading archivebox to the django branch but then it fails with:

$ time archivebox add --update-all wallabag.list
    > ./sources/wallabag.list-1557227903.txt

[*] [2019-05-07 11:18:24] Parsing new links from output/sources/wallabag.list-1557227903.txt...
    > Parsed 10317 links as Plain Text (0 new links added)

[*] [2019-05-07 11:19:02] Writing 10262 links to main index...
Traceback (most recent call last):
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 11, in <module>
    load_entry_point('archivebox', 'console_scripts', 'archivebox')()
  File "/home/anarcat/dist/ArchiveBox/archivebox/__main__.py", line 10, in main
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox.py", line 58, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox_add.py", line 55, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/main.py", line 509, in add
    write_main_index(links=all_links, out_dir=out_dir)
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/index/__init__.py", line 233, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/index/sql.py", line 37, in write_sql_main_index
    Snapshot.objects.create(**info)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 422, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 741, in save
    force_update=force_update, update_fields=update_fields)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 779, in save_base
    force_update, using, update_fields,
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 870, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 908, in _do_insert
    using=using, raw=raw)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 1186, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1332, in execute_sql
    cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp

I suspect the database structure has changed but it's not immediately obvious to me how to fix that...
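The SQLite message above means an insert tried to reuse a timestamp value that is already present in core_snapshot. A minimal sketch of that failure mode, assuming a simplified stand-in table and made-up links (not ArchiveBox's real schema or index format); `find_duplicate_timestamps` is a hypothetical helper for spotting the offending entries before writing:

```python
import sqlite3
from collections import Counter

# Hypothetical, simplified stand-in for the real core_snapshot table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE core_snapshot (url TEXT, timestamp TEXT UNIQUE)")

# Made-up sample links; the first and third share a timestamp.
links = [
    {"url": "https://example.com/a", "timestamp": "1557189662.144"},
    {"url": "https://example.com/b", "timestamp": "1557189662.145"},
    {"url": "https://example.com/c", "timestamp": "1557189662.144"},
]


def find_duplicate_timestamps(links):
    """Return the timestamps that occur more than once (hypothetical helper)."""
    counts = Counter(link["timestamp"] for link in links)
    return sorted(ts for ts, n in counts.items() if n > 1)


error = None
try:
    for link in links:
        conn.execute(
            "INSERT INTO core_snapshot (url, timestamp) VALUES (?, ?)",
            (link["url"], link["timestamp"]),
        )
except sqlite3.IntegrityError as exc:
    # The third insert reuses a timestamp, so SQLite rejects it.
    error = str(exc)

duplicates = find_duplicate_timestamps(links)
print(error)
print(duplicates)
```

Whether duplicate timestamps in the parsed links are what triggered the error in this report is not confirmed in the thread; the sketch only shows what the constraint rejects.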

Software versions

  • OS: Debian buster 10
  • ArchiveBox version: django branch, installed through pip -e in a virtualenv
  • Python version: 3.7.3rc3?
  • Chrome version: N/A

@pirate commented on GitHub (May 8, 2019):

Run archivebox init to migrate the db to the latest version, and definitely don't continue using v0.4.1, it's full of bugs (my latest local version is already v0.4.5, but I'm hesitant to release more alpha versions as there's still lots of stuff unfinished and I don't want to ruin people's archives).


@anarcat commented on GitHub (May 9, 2019):

On 2019-05-08 16:35:11, Nick Sweeting wrote:

Run archivebox init to migrate the db to the latest version,

I'll try that, thanks.

and definitely don't continue using v0.4.1, it's full of bugs (my latest local version is already v0.4.5, but I'm hesitant to release more alpha versions as there's still lots of stuff unfinished and I don't want to ruin people's archives).

I'd say you should release what you have. :)


@anarcat commented on GitHub (May 9, 2019):

archivebox init doesn't solve the problem, i still get:

django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp

to be honest, i'd be fine with flushing this entire archive and starting from scratch - i have no history to keep, really, so it's not a big deal. if you're confident the __init__ bug is fixed and the UNIQUE stuff is just a weird fluke, i'm happy to close this and move on.

i'm just worried it would do the same thing after crawling 20GB for three hours. ;)


@anarcat commented on GitHub (Jun 3, 2019):

any suggestion on where i should start to debug this? should i scrap the 20GB archive and/or database and start from scratch?

thanks in advance! :)


@pirate commented on GitHub (Jul 9, 2019):

Sorry for the long delay @anarcat, I'm still swamped by my day job. I'm going to try to get to this in the next couple of months, but it may be tricky with upcoming travel and client meetings.
Whatever you do, don't scrap that archive, it's 100% recoverable. I'm sure there's a simple fix I can add for this in v0.4, I just need a solid block of time to figure it out.


@anarcat commented on GitHub (Jul 9, 2019):

awesome, thanks for the update! no rush, of course :)


@pirate commented on GitHub (May 9, 2020):

Closing this for now, I think I've fixed a few of these bugs in the django branch, and the atomicity/corruption issue is moved to #234.


@dvpc commented on GitHub (May 16, 2020):

I was running into the exact same problem (tested both v.0.4.2 and v.0.4.3 branches) yesterday and
noticed that the type error (below) occurs when a link couldn't be processed (e.g. 404).

...
  File "/usr/lib/python3.7/dataclasses.py", line 1085, in _asdict_inner
    return copy.deepcopy(obj)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 274, in _reconstruct
    y = func(*args)
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'

tldr:
The output field in the class ArchiveResult must always (I guess) contain a string value. In case of an error it instead holds the error object itself, which makes the deepcopy operation at the end of the JSON serialization throw the TypeError.

Solution:
in archivebox/extractors/title.py (line 62)
Change the value of output from err to str(err).

def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
...
    except Exception as err:
        status = 'failed'
        output = str(err)
    finally:
        timer.end()
...

I don't know if I overlooked something else, but this appears to fix the error.
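The failure and the str(err) fix described above can be reproduced outside ArchiveBox. This is a minimal sketch under stated assumptions: `FakeHTTPError` is a hypothetical stand-in exception whose stored args do not match its constructor signature (it is not urllib's real HTTPError), and `Result` is a hypothetical, simplified stand-in for ArchiveResult. dataclasses.asdict deep-copies field values it does not otherwise recognise, so the exception object breaks it while its string form does not:

```python
from dataclasses import dataclass, asdict
from typing import Any


class FakeHTTPError(Exception):
    """Hypothetical stand-in: requires three arguments but stores only one."""

    def __init__(self, url, code, msg):
        super().__init__(msg)  # .args becomes (msg,), not (url, code, msg)
        self.url, self.code, self.msg = url, code, msg


@dataclass
class Result:
    """Hypothetical, simplified stand-in for ArchiveResult."""

    status: str
    output: Any


def try_asdict(result):
    """Return the dict form, or the TypeError text if conversion fails."""
    try:
        return asdict(result)
    except TypeError as exc:
        return f"TypeError: {exc}"


err = FakeHTTPError("https://example.com/missing", 404, "Not Found")

# deepcopy tries to rebuild the exception from .args and cannot.
broken = try_asdict(Result(status="failed", output=err))

# Storing the string form avoids copying the exception object at all.
fixed = try_asdict(Result(status="failed", output=str(err)))

print(broken)
print(fixed)
```

The exact wording of the TypeError varies between Python versions, so the sketch only reports it rather than matching it.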


@pirate commented on GitHub (May 18, 2020):

Thanks, good catch @dvpc.


@dvpc commented on GitHub (May 18, 2020):

You're welcome :)

I happened to look into the code of the other extractors and i guess they should convert the error responses as well.

Should i create a patch?


@pirate commented on GitHub (May 21, 2020):

If you can that would be awesome, otherwise if you're ok waiting an indefinite amount of time I've already saved this issue in the queue of the long list of things left for me to do in the 0.4 release.


@dvpc commented on GitHub (Sep 9, 2020):

Sorry for the delay. If it's still relevant, I've attached the patch file (from branch origin/v0.4.3) here:
fix_json_type_error.patch.txt (https://github.com/pirate/ArchiveBox/files/5196076/fix_json_type_error.patch.txt)

PS
I had to rename the file (add a .txt suffix), which I guess is the "wrong" workflow (I didn't fork the project yet).
Hope it helps anyway.

@cdvv7788 commented on GitHub (Sep 9, 2020):

@dvpc Can you please describe a way to reliably reproduce it, and create a PR if this happens in the latest version?
There have been a LOT of changes, so I would like to make sure it still makes sense. Thanks!


@dvpc commented on GitHub (Sep 10, 2020):

@cdvv7788 Sorry, I don't have any time now. If it helps, I took a quick look at, for example,
https://github.com/pirate/ArchiveBox/blob/v0.5.0/archivebox/extractors/archive_org.py and line 73 still says: output = err, which reliably leads to the error described in the first post of this thread:
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
So it seems it still makes sense.
(Assuming that v0.5.0 is the latest version, which I can't say since I didn't follow the progress.)

All changes in the patch are one-liners for all extractors. Take a look at the patch file and my first post. It is very simple.
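As an alternative to touching every extractor, the same conversion could in principle happen once, where the result object is built. This is only a sketch of that idea and is not what the attached patch does; `Result` is a hypothetical, simplified stand-in for ArchiveResult, and the `__post_init__` guard is an assumption about where such a check could live:

```python
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class Result:
    """Hypothetical, simplified stand-in for ArchiveResult."""

    status: str
    output: Any = None

    def __post_init__(self):
        # Coerce any exception to its string form so the dataclass only
        # ever holds values that can be copied and serialized safely.
        if isinstance(self.output, BaseException):
            self.output = str(self.output)


# An extractor that forgets str(err) still ends up with a string.
result = Result(status="failed", output=ValueError("boom"))
as_dict = asdict(result)
print(as_dict)
```

The trade-off is that the original exception object is no longer available to later code, which is also true of the per-extractor str(err) change.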
