[GH-ISSUE #962] Bug: Running archivebox update --index-only doesn't upgrade Snapshot index.{html,json} files #600

Closed
opened 2026-03-01 14:44:53 +03:00 by kerem · 2 comments

Originally created by @mwnoo on GitHub (Apr 7, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/962

Describe the bug

I tried to update the data/archive/<timestamp>/index.{json,html} files by running the archivebox update --index-only command as described in #544. I expected that this command would update the index.{json,html} files in the data/archive/<timestamp>/ folders, but the updated index.{json,html} files are only written to OUTPUT_DIR and not to the timestamp folder. The updated index.{json,html} files in the OUTPUT_DIR are probably only used by sonic to create the search index.

Steps to reproduce

  1. Use git on the archive folder
  2. Run archivebox update --index-only (also tried: archivebox update --index-only --overwrite but same result)
  3. Updated index.{json,html} only written to OUTPUT_DIR
  4. Running git status shows no changes to the archive folder
  5. data/archive/<timestamp>/index.{json,html} files are not updated
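
Steps 4 and 5 can also be checked without git by looking at file modification times. The following is a minimal sketch, assuming the data/archive/<timestamp>/index.{json,html} layout described above. The function name stale_index_files and its cutoff argument are hypothetical and not part of ArchiveBox:

```python
from pathlib import Path


def stale_index_files(archive_dir, cutoff):
    """Return snapshot index files last modified before `cutoff` (a Unix timestamp).

    Assumes each snapshot lives in archive_dir/<timestamp>/ and holds
    index.json and index.html, as described in this report.
    """
    stale = []
    for snapshot_dir in sorted(Path(archive_dir).iterdir()):
        if not snapshot_dir.is_dir():
            continue  # skip stray files next to the snapshot folders
        for name in ("index.json", "index.html"):
            index_file = snapshot_dir / name
            if index_file.exists() and index_file.stat().st_mtime < cutoff:
                stale.append(index_file)
    return stale
```

To use it, record time.time() just before running archivebox update --index-only, then call stale_index_files("data/archive", recorded_time) afterwards. Any file it returns was not rewritten by the command.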

Screenshots or log output

N/A

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.13.0-39-generic-x86_64-with-glibc2.29 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.8.10         valid     /usr/bin/python3.8                                                          
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.8/dist-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.68.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v14.19.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.32         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.25.1         valid     /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v100.0.4896.60  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v11.0.2         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.8/dist-packages/archivebox                           
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.8/dist-packages/archivebox/templates                 
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  34 files        valid     ./chromium                                                                  
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            13 files        valid     /archive_data/archivebox/data                                               
 √  SOURCES_DIR           112 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           812 files       valid     ./archive                                                                   
 √  CONFIG_FILE           460.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             6.8 MB          valid     ./index.sqlite3                                                             

Config

[SERVER_CONFIG]
SECRET_KEY = XXXX
SNAPSHOTS_PER_PAGE = 100

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = False
SAVE_MEDIA = False
SAVE_WGET = False
SAVE_READABILITY = True
SAVE_MERCURY = True
SAVE_DOM = True

[ARCHIVE_METHOD_OPTIONS]
CHROME_USER_DATA_DIR = chromium/

[GENERAL_CONFIG]
TIMEOUT = 180

[SEARCH_BACKEND_CONFIG]
SEARCH_BACKEND_ENGINE = sonic
SEARCH_BACKEND_HOST_NAME = localhost
kerem closed this issue 2026-03-01 14:44:54 +03:00

@pirate commented on GitHub (Apr 11, 2022):

This is by design: for safety and performance on large collections, the timestamp folder index files are only updated lazily, when they actually need to change. If you want to update them all, select all the snapshot rows in the UI and click the update button.

I've added more notes to the Wiki page on upgrading to explain this: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#merge-two-or-more-existing-archives

I've also added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.
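
The "lazily updated" behaviour described above can be pictured as a write-if-changed check. This is only an illustrative sketch of that general idea, not ArchiveBox's actual code. The helper write_index_if_changed is hypothetical:

```python
import json
from pathlib import Path


def write_index_if_changed(index_path, snapshot_data):
    """Rewrite an index.json file only when its serialized content differs.

    Returns True if the file was written, False if it was left untouched.
    """
    index_path = Path(index_path)
    new_text = json.dumps(snapshot_data, indent=4, sort_keys=True)
    if index_path.exists() and index_path.read_text(encoding="utf-8") == new_text:
        return False  # nothing changed, so the file (and its mtime) stays as-is
    index_path.write_text(new_text, encoding="utf-8")
    return True
```

With a scheme like this, an untouched file produces no diff in git status, which matches what the report observed for the timestamp folders.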


@mwnoo commented on GitHub (Apr 15, 2022):

Thanks @pirate for the UI suggestion (I had focused mainly on the CLI options).
Great project!
