[GH-ISSUE #703] Init fails when an unrecognized/invalid data/index.json is present in data directory root #1952

Closed
opened 2026-03-01 17:55:17 +03:00 by kerem · 2 comments

Originally created by @dohlin on GitHub (Apr 12, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/703

Describe the bug

Running docker-compose run archivebox init --setup (both before and after upgrading from 0.5.6 to 0.6.0) crashes with a traceback.

Steps to reproduce

Run docker-compose run archivebox init --setup. The same behavior occurs before and after upgrading from 0.5.6 to 0.6.0.

Screenshots or log output

This is from after the upgrade to 0.6.0, but the end result is the same as 0.5.6.

dave@Bookmarks01:~/archivebox$ docker-compose run archivebox init --setup
Creating archivebox_sonic_1 ... done
Creating archivebox_archivebox_run ... done
[i] [2021-04-12 18:44:21] ArchiveBox v0.6.0: archivebox init --setup
    > /data

[!] This folder contains a JSON index. It is deprecated, and will no longer be kept up to date automatically.
    You can run `archivebox list --json --with-headers > index.json` to manually generate it.
[^] Verifying and updating existing ArchiveBox collection to v0.6.0...
----------------------------------------------------------------------

[*] Verifying archive folder structure...
    + ./archive, ./sources, ./logs...
    + ./ArchiveBox.conf...

[*] Verifying main SQL index and running any migrations needed...
    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying core.0009_auto_20210216_1038... OK
    Applying core.0010_auto_20210216_1055... OK
    Applying core.0011_auto_20210216_1331... OK
    Applying core.0012_auto_20210216_1425... OK
    Applying core.0013_auto_20210218_0729... OK
    Applying core.0014_auto_20210218_0729... OK
    Applying core.0015_auto_20210218_0730... OK
    Applying core.0016_auto_20210218_1204... OK
    Applying core.0017_auto_20210219_0211... OK
    Applying core.0018_auto_20210327_0952... OK
    Applying core.0019_auto_20210401_0654... OK
    Applying core.0020_auto_20210410_1031... OK

    √ ./index.sqlite3

[*] Checking links from indexes and archive folders (safe to Ctrl+C)...
    √ Loaded 2171 links from existing main index.
Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_init.py", line 43, in main
    init(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 392, in init
    orphaned_json_links = {
  File "/app/archivebox/main.py", line 392, in <dictcomp>
    orphaned_json_links = {
  File "/app/archivebox/index/json.py", line 63, in parse_json_main_index
    links = pyjson.load(f)['links']
KeyError: 'links'
ERROR: 1

ArchiveBox version

dave@Bookmarks01:~/archivebox$ docker-compose run archivebox --version
Creating archivebox_archivebox_run ... done
ArchiveBox v0.6.0
Cpython Linux Linux-5.8.0-48-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.0          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.8          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.13.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.07     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /data                                                                       
 √  SOURCES_DIR           37 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           2171 files      valid     ./archive                                                                   
 √  CONFIG_FILE           250.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             34.1 MB         valid     ./index.sqlite3

Thoughts on why this is crashing and what I can do to fix it?


@pirate commented on GitHub (Apr 12, 2021):

According to this line in the output:

[!] This folder contains a JSON index. It is deprecated, and will no longer be kept up to date automatically.
    You can run `archivebox list --json --with-headers > index.json` to manually generate it.

It looks like you have a rogue index.json file in the root of your data folder, which init is trying to parse as an old v0.4.x archive main index.
It's failing because that file is not a valid main index file as far as ArchiveBox is concerned. Perhaps it was added by accident, or it's a static export you generated with archivebox list rather than an old main index file?

Can you rename or move that file somewhere else and try again?

ls ./data/index.json
mv ./data/index.json ./data/2021-04-12_old-index.json

docker-compose run --rm archivebox init

I also just added a case to catch this in the code and show a better error message instead of blocking the init: 50b341b. You can use it early, before the next release, by changing your docker-compose.yml to use the image: archivebox/archivebox:dev container.
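
To illustrate the failure mode, here is a minimal sketch of the kind of guard that avoids this crash. It assumes only what the traceback shows: the parser reads index.json and indexes ['links'] on the result, so any JSON object without a top-level links key raises KeyError. The helper name load_main_index_links is hypothetical, and this is not the code from 50b341b.

```python
import json
import tempfile
from pathlib import Path


def load_main_index_links(index_path: Path) -> list:
    """Return the 'links' list from a legacy JSON main index.

    Hypothetical helper (not ArchiveBox code): returns [] when the file is
    missing, is not valid JSON, or lacks a top-level 'links' list.
    """
    try:
        with open(index_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return []  # missing, unreadable, or not JSON at all

    # The traceback's `pyjson.load(f)['links']` assumes a dict with a
    # 'links' key; check that assumption instead of letting it raise.
    if not isinstance(data, dict) or not isinstance(data.get("links"), list):
        return []
    return data["links"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rogue = Path(tmp) / "index.json"
        # A JSON object with no 'links' key, standing in for the rogue file.
        rogue.write_text(json.dumps({"foo": "bar"}))
        print(load_main_index_links(rogue))
```

Returning an empty list keeps the sketch short; a real fix might instead warn the user that the file was skipped, as the commit above is described as doing.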


@dohlin commented on GitHub (Apr 12, 2021):

Sure enough, that worked. Thank you!!
