[GH-ISSUE #1477] Bug: archivebox add --update creates a different snapshot directory #3893

Open
opened 2026-03-15 00:53:06 +03:00 by kerem · 4 comments

Originally created by @blastrock on GitHub (Aug 4, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1477

Describe the bug

I used archivebox add --update, but I discovered that every time I run it for a single URL, a new archive folder is created. I now have a lot of invalid snapshots.

I expected archivebox add to not put my archive into an inconsistent state, and just update the links that were already added.

Steps to reproduce

This is what I (think I) did; not all steps may be necessary:

  1. Add a link with archivebox add --index-only
  2. Add the same link a second time with archivebox add --extract headers,title,favicon,media --update
  3. Repeat step 2, for good measure

Every time you repeat step 2, a new snapshot with a different timestamp is created. Only the first one (which is empty) appears in archivebox list.
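
The steps above can be sketched as a small script. This is only an illustration: it uses the flags quoted in the steps, while the helper names `repro_commands` and `run_all` and the example URL are hypothetical. It assumes `archivebox` is on the PATH and that it is run from inside the data directory.

```python
import subprocess


def repro_commands(url: str, repeats: int = 2) -> list[list[str]]:
    """Build the command sequence from the steps above (nothing is executed here)."""
    # Step 1: add the link to the index only.
    first = ["archivebox", "add", "--index-only", url]
    # Steps 2 and 3: add the same link again with a subset of extractors and --update.
    update = ["archivebox", "add", "--extract", "headers,title,favicon,media", "--update", url]
    return [first] + [list(update) for _ in range(repeats)]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(repro_commands("https://example.com"))` on a throwaway collection and then comparing the entries under archive/ against archivebox list should show whether each repeat leaves a new timestamped directory behind.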

Screenshots or log output

I now have a lot of invalid archive dirs that archivebox init ignores.

[*] Scanning archive main index...
    /data/* 
    Index size: 5.9 MB across 3 files

    > SQL Main Index: 363 links      (found in index.sqlite3)
    > JSON Link Details: 708 links   (found in archive/*/index.json)

[*] Scanning archive data directories...
    /data/archive/* 
    Size: 52.0 GB across 55445 files in 22387 directories

    > indexed: 363                   (indexed links without checking archive status or data directory validity)
      > archived: 276                (indexed links that are archived with a valid data directory)
      > unarchived: 87               (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 708                   (dirs that actually exist in the archive/ folder)
      > valid: 311                   (dirs with a valid index matched to the main index and archived content)
      > invalid: 397                 (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 345             (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 397              (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 0            (dirs that don't contain recognizable archive data and aren't listed in the main index)

Do you have an idea about how to fix this? I'd like to keep only the latest snapshot for each URL.

ArchiveBox version

0.7.2
ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:47:02 1713998822
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-117-generic-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=1000:1000 FS_PERMS=644
DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        

[i] Data locations:
 √  OUTPUT_DIR            5 files @       valid     /data                                                                       
 √  SOURCES_DIR           162 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           710 files       valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             5.9 MB          valid     ./index.sqlite3

I am using the archivebox/archivebox:main docker image.


@huyz commented on GitHub (Aug 4, 2024):

Sounds like my problem: https://github.com/ArchiveBox/ArchiveBox/discussions/1474


@pirate commented on GitHub (Aug 5, 2024):

Yes, I believe it's probably the same issue. I'll ensure it's fixed in the upcoming 0.8 mega release. Apologies for the trouble, and thanks for reporting.


@huyz commented on GitHub (Aug 5, 2024):

Is there an easy way to identify and nuke the invalid directories?


@pirate commented on GitHub (Aug 6, 2024):

archivebox list --status=invalid --csv=timestamp,url

--status=xyz can be any of the statuses listed in archivebox status (e.g. invalid, orphaned, duplicate, etc.)

--csv=timestamp will show the timestamp as the first column in the output. The archive dirs are stored in ./data/archive/<ts>, so you can use that to delete those dirs.
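
A minimal sketch of that cleanup, assuming the output of the list command has been saved as text with the timestamp in the first column. The function names `snapshot_dirs` and `prune` are hypothetical and not part of ArchiveBox, and the exact CSV quoting and whether a header row is present are assumptions, so rows whose first column is not a plain number are skipped. It defaults to a dry run.

```python
import csv
import io
import re
import shutil
from pathlib import Path

# Assumption: snapshot directory names are numeric timestamps such as 1722787200.0
TIMESTAMP = re.compile(r"\d+(\.\d+)?")


def snapshot_dirs(csv_text: str, archive_root: Path) -> list[Path]:
    """Map the first CSV column (the timestamp) to archive_root/<timestamp>."""
    dirs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # blank line
        ts = row[0].strip()
        # Skips a possible header row and anything that is not a plain number,
        # so a malformed row can never point outside archive_root.
        if TIMESTAMP.fullmatch(ts):
            dirs.append(archive_root / ts)
    return dirs


def prune(csv_text: str, archive_root: Path, delete: bool = False) -> list[str]:
    """Report, and optionally delete, the directory for each listed timestamp."""
    log = []
    for path in snapshot_dirs(csv_text, archive_root):
        if not path.is_dir():
            log.append(f"missing {path}")
        elif delete:
            shutil.rmtree(path)
            log.append(f"removed {path}")
        else:
            log.append(f"would remove {path}")
    return log
```

For example, after redirecting the list command above into a file such as invalid.csv, `prune(Path("invalid.csv").read_text(), Path("./data/archive"))` returns the dry-run report. Passing `delete=True` removes the directories, so it is worth backing up the archive and reviewing the report first.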
