[GH-ISSUE #1510] Bug: Attempting to remove failed "Archive again" result relating to a pre-0.8.3 snapshot results in ArchiveBox attempting to delete EVERY entry in the database?! #2401

Open
opened 2026-03-01 17:58:47 +03:00 by kerem · 4 comments

Originally created by @jessienab on GitHub (Sep 6, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1510

Describe the bug

Following up to: #1509

It seems ArchiveBox did eventually generate the "Archive again" entries for pre-0.8.3 snapshots, however it didn't archive them properly. When attempting to delete these, the following happened:

  1. The server.py/daphne was killed?
daphne.server Application instance <Task pending name='Task-311' coro=<ProtocolTypeRouter.__call__() running at /usr/local/lib/python3.11/site-packages/channels/routing.py:62> wait_for=<Task cancelling name='Task-314' coro=<ASGIHandler.handle.<locals>.process_request() running at /usr/local/lib/python3.11/site-packages/django/core/handlers/asgi.py:185> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.11/asyncio/futures.py:387, Task.task_wakeup()]> cb=[Task.task_wakeup()]>> for connection <WebRequest at 0x772983088090 method=POST uri=/admin/core/snapshot/ clientproto=HTTP/1.1> took too long to shut down and was killed.
daphne.server Application instance <Task cancelling name='Task-311' coro=<ProtocolTypeRouter.__call__() running at /usr/local/lib/python3.11/site-packages/channels/routing.py:62> wait_for=<_GatheringFuture pending cb=[Task.task_wakeup()]>> for connection <WebRequest at 0x772983088090 method=POST uri=/admin/core/snapshot/ clientproto=HTTP/1.1> took too long to shut down and was killed.
  2. ArchiveBox then reports the following:

[i] Found 10958 matching URLs to remove.
10958 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
(9829 data folders with 70489 archived files will be deleted!)

I immediately killed ArchiveBox to prevent further damage, but at this point I'll have to restore from an older backup + manually re-grab a possibly large number of URLs for sites that weren't archived in that backup... 😮‍💨

My fault! 🤦‍♀️

Steps to reproduce

  1. Attempt to re-snapshot a pre-0.8.3 snapshot
  2. It should fail with a 500 error and a specific error message
  3. The snapshots should eventually appear within the snapshot listings, but will not have been archived at all
  4. Attempt to delete those entries
  5. Depending on the number of entries, ArchiveBox will then report through logs that it will delete effectively every entry...

Screenshots or log output

See above

ArchiveBox version

# archivebox version
0.8.3
ArchiveBox v0.8.3 COMMIT_HASH=31576e2 BUILD_TIME=2024-09-06 13:14:49 1725628489
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.6.47-1-lts-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=0:0 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v5.1.1          valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.8.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.9.1          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.17.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2024.8.6       valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v128.0.6613     valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           34 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                                       
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             168.7 MB        valid     ./index.sqlite3                                                             
 √  ARCHIVE_DIR           4995 files      valid     ./archive                                                                   
 √  SOURCES_DIR           1712 files      valid     ./sources                                                                   
 X  PERSONAS_DIR          missing         invalid   ./personas                                                                  
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 X  CACHE_DIR             missing         invalid   ./cache                                                                     
 X  CUSTOM_TEMPLATES_DIR  missing         invalid   ./templates

@pirate commented on GitHub (Sep 6, 2024):

Ah shit, looks like some bug in the form parsing for the submit action selected all the snapshots?!

I'll investigate immediately, sorry about messing up your archive. I have several integration tests around the CLI commands that should prevent this type of thing, but this shows I need to improve them to cover more of the UI button actions.
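
As an illustration of the kind of guard a UI bulk action could use against this failure mode, here is a minimal sketch. It is not ArchiveBox's actual code: `UnsafeBulkDelete`, `ids_safe_to_delete`, and both of its parameters are hypothetical names invented for this example. The idea is simply to refuse the delete whenever the backend filter resolves to rows the user never ticked.

```python
class UnsafeBulkDelete(Exception):
    """Raised when a bulk delete would touch rows the user did not select."""


def ids_safe_to_delete(selected_ids, matched_ids):
    """Return the IDs to delete, or raise if the match exceeds the selection.

    selected_ids: IDs explicitly ticked in the UI (hypothetical input).
    matched_ids:  IDs the backend filter resolved to (hypothetical input).
    """
    selected = set(selected_ids)
    matched = set(matched_ids)

    if not selected:
        # An empty selection must never fall through to "match everything".
        raise UnsafeBulkDelete("no snapshots were explicitly selected")

    unexpected = matched - selected
    if unexpected:
        raise UnsafeBulkDelete(
            f"filter matched {len(unexpected)} snapshot(s) that were not selected"
        )
    return sorted(matched)
```

A test for a UI action could then assert that posting an empty or malformed selection raises instead of deleting anything.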


@jessienab commented on GitHub (Sep 9, 2024):

> Ah shit, looks like some bug in the form parsing for the submit action selected all the snapshots?!
>
> I'll investigate immediately, sorry about messing up your archive. I have several integration tests around the CLI commands that should prevent this type of thing, but this shows I need to improve them to cover more of the UI button actions.

No worries!! My fault not having functioning backups :)
I managed to grab an older DB (3 months out of date), compiled all the URLs from sources/ up to now, and am just re-grabbing them. It seems no website data was deleted, so in the worst case, if a website is now missing from the archive index, the older archived data is still present on disk (I can grep around to find it 👍 )

Thanks again and I guess lesson for me to make a backup (as you indicated and I did not read hehe) before running betas!!!
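
For anyone in the same spot, compiling the URLs from sources/ can be sketched roughly as below. This is an assumption-laden illustration, not the exact procedure used here: `collect_urls` and `URL_PATTERN` are made-up names, and the regex is a loose match that may keep trailing punctuation, so the output is worth a quick review before re-importing.

```python
import re
from pathlib import Path

# Loose pattern: stops at whitespace, quotes and angle brackets.
# It may still capture trailing punctuation such as "," or ")".
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def collect_urls(sources_dir):
    """Return de-duplicated URLs found in files under sources_dir, in first-seen order."""
    seen = {}  # dict keys preserve insertion order
    for path in sorted(Path(sources_dir).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in URL_PATTERN.findall(text):
            seen.setdefault(url, None)
    return list(seen)
```

The resulting list can be written to a text file, one URL per line, and fed back into the archive.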


@pirate commented on GitHub (Sep 9, 2024):

If the older data is still present on disk, running `archivebox init` should also re-import it, as it will scan the `archive/` folder for snapshot entries not in the DB and re-create them from the `archive/<id>/index.json` file saved with each snapshot output.
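
To check ahead of time which snapshot folders such a re-import would pick up, something like the following sketch could help. It is not how `archivebox init` is implemented; `find_orphaned_snapshots` is a hypothetical helper, and it assumes each `index.json` is a JSON object with a `url` key, which should be verified against your own data first.

```python
import json
from pathlib import Path


def find_orphaned_snapshots(archive_dir, known_urls):
    """Return (folder_name, url) pairs for snapshot folders whose URL is not in known_urls.

    Assumes archive_dir/<id>/index.json holds a JSON object with a "url" key.
    """
    known = set(known_urls)
    orphans = []
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        try:
            data = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt index file: skip it
        url = data.get("url") if isinstance(data, dict) else None
        if url and url not in known:
            orphans.append((index_path.parent.name, url))
    return orphans
```

Here `known_urls` would be whatever list of URLs the restored database currently contains, exported by any means you prefer.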


@jessienab commented on GitHub (Sep 10, 2024):

> If the older data is still present on disk, running `archivebox init` should also re-import it, as it will scan the `archive/` folder for snapshot entries not in the DB and re-create them from the `archive/<id>/index.json` file saved with each snapshot output.

As luck would have it, I had set up rsnapshot, and I found the backup it made the day before I nuked ArchiveBox; everything restored! yay :D
