[GH-ISSUE #1511] Bug: Migration to 0.8.3-rc results in "Created at" timestamps to all be modified against the present time the migration happened #2402

Open
opened 2026-03-01 17:58:48 +03:00 by kerem · 3 comments

Originally created by @jessienab on GitHub (Sep 6, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1511

Describe the bug

Following a migration to 0.8.3, the Created At timestamps of all snapshots that predate 0.8.3 show the time the migration ran, rather than the imported date/time at which each snapshot was actually taken.

Steps to reproduce

  1. Run archivebox 0.8.3 on a pre-0.8.3 instance.
  2. Wait for migration to complete
  3. Check Created at timestamps

Screenshots or log output

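Screenshot from 2024-09-06 11-09-23 (https://github.com/user-attachments/assets/37d1eb4c-5b30-44c5-bdc4-6976a516a623)
Screenshot from 2024-09-06 11-10-32 (https://github.com/user-attachments/assets/770e67ae-2344-4dbe-a23b-040177f57ce9)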

Older entries created before 0.8.3 also show this warning:

The ABID is not in sync with this snapshot! [extra verbosity/info]

ArchiveBox version

# archivebox version
0.8.3
ArchiveBox v0.8.3 COMMIT_HASH=31576e2 BUILD_TIME=2024-09-06 13:14:49 1725628489
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.6.47-1-lts-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=0:0 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v5.1.1          valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.8.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.9.1          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.17.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2024.8.6       valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v128.0.6613     valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           34 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                                       
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             168.7 MB        valid     ./index.sqlite3                                                             
 √  ARCHIVE_DIR           4995 files      valid     ./archive                                                                   
 √  SOURCES_DIR           1712 files      valid     ./sources                                                                   
 X  PERSONAS_DIR          missing         invalid   ./personas                                                                  
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 X  CACHE_DIR             missing         invalid   ./cache                                                                     
 X  CUSTOM_TEMPLATES_DIR  missing         invalid   ./templates

@pirate commented on GitHub (Sep 6, 2024):

created_at and modified_at are new fields on Snapshot, so it makes sense that they got fresh default values. However, I should probably copy over the old values from .added/.updated in the migration. Will fix it.
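
For readers who want to see what "copy over the old values" could look like, here is a minimal sketch using Python's built-in sqlite3 module. It is not the actual ArchiveBox migration. It assumes that the legacy .added/.updated fields are stored in columns named added and updated on the core_snapshot table, and the function name backfill_timestamps is made up for illustration. Falling back to added when updated is NULL is also an assumption.

```python
import sqlite3


def backfill_timestamps(conn: sqlite3.Connection) -> int:
    """Copy legacy added/updated values into created_at/modified_at.

    Assumes (not confirmed by the thread) that core_snapshot still has
    `added` and `updated` columns next to the new ones.
    Returns the number of rows that were rewritten.
    """
    cur = conn.execute(
        """
        UPDATE core_snapshot
           SET created_at  = added,
               -- assumption: reuse `added` when a row was never updated
               modified_at = COALESCE(updated, added)
         WHERE added IS NOT NULL
        """
    )
    conn.commit()
    return cur.rowcount
```

Try it on a copy of index.sqlite3 first, since the UPDATE touches every row that has a non-NULL added value.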


@sclu1034 commented on GitHub (Sep 11, 2024):

In case someone (*innocent look*) were to run into this with their live data, and was too lazy to pick out the correct backup snapshot to restore, here's a quick jq filter to turn data from index.json into SQL commands:

# `(downloaded_at - created_at) = 0` as a cheap trick to pick out only those entries where all fields have been overwritten
# to identical values (near identical, technically, as they differ in a few nanoseconds).
filter="\"UPDATE core_snapshot SET \" +
        (if .updated != null then \"modified_at = '\" + .updated + \"', \" else \"\" end) +
        (if .bookmarked_date != null then \"created_at = '\" + .bookmarked_date + \"', bookmarked_at = '\" + .bookmarked_date + \"', \" else \"\" end) +
        (if .oldest_archive_date != null then \"downloaded_at = '\" + .oldest_archive_date + \"'\" else \"\" end) +
    \" WHERE timestamp = '\" + .timestamp + \"' AND (downloaded_at - created_at) = 0;\""

# Used like this (remove `-readonly` flag once you're certain you want to run this)
find <path/to/archive> -name index.json -exec jq -r "$filter" "{}" \; | sqlite3 -echo -readonly <path/to/index.sqlite3>
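
The same idea can be sketched in Python for anyone who prefers not to splice SQL strings in jq. The index.json keys, column names, and WHERE guard below are taken from the filter above. The names FIELD_MAP, build_update, and iter_updates are hypothetical. The SET clause is joined from a list because the jq version appears to leave a trailing comma before WHERE when oldest_archive_date is null. Values are passed as query parameters instead of being quoted by hand.

```python
import json
from pathlib import Path
from typing import Iterator, Optional

# index.json key -> core_snapshot columns it restores (mapping from the jq filter)
FIELD_MAP = {
    "updated": ["modified_at"],
    "bookmarked_date": ["created_at", "bookmarked_at"],
    "oldest_archive_date": ["downloaded_at"],
}


def build_update(entry: dict) -> Optional[tuple[str, list]]:
    """Return (sql, params) for one index.json entry, or None if nothing to restore."""
    assignments, params = [], []
    for key, columns in FIELD_MAP.items():
        value = entry.get(key)
        if value is None:
            continue
        for column in columns:
            assignments.append(f"{column} = ?")
            params.append(value)
    if not assignments or entry.get("timestamp") is None:
        return None
    sql = (
        "UPDATE core_snapshot SET " + ", ".join(assignments)
        # same guard as the jq filter: only touch rows the migration overwrote
        + " WHERE timestamp = ? AND (downloaded_at - created_at) = 0"
    )
    return sql, params + [entry["timestamp"]]


def iter_updates(archive_dir: Path) -> Iterator[tuple[str, list]]:
    """Yield one UPDATE per index.json found under archive_dir."""
    for index_file in sorted(archive_dir.rglob("index.json")):
        entry = json.loads(index_file.read_text(encoding="utf-8"))
        if not isinstance(entry, dict):
            continue  # skip files that are not a single snapshot object
        update = build_update(entry)
        if update is not None:
            yield update
```

For a dry run, print each (sql, params) pair from iter_updates(Path("archive")). Once the output looks right, pass the pairs to execute() on a sqlite3 connection opened on a copy of index.sqlite3.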

@pirate commented on GitHub (Sep 12, 2024):

@sclu1034 that's actually a great idea. I might take inspiration from you and use the index.json to fix several potential edge cases with the recent (fairly complex) migrations.
