[PR #2543] fix: cleanup orphaned assets if crawler crashes before database transaction commits #2153

Open
opened 2026-03-02 12:00:46 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/karakeep-app/karakeep/pull/2543
Author: @VedantMadane
Created: 3/1/2026
Status: 🔄 Open

Base: main ← Head: fix/cleanup-orphaned-assets-on-error


📝 Commits (3)

  • 8970dee fix: cleanup orphaned assets if crawler crashes before database transaction commits
  • 5ffc216 fix: ensure cleanup only happens if transaction fails and delay asset deletion until after commit
  • fc53d9b fix: address code review feedback - clear tracking variables after DB transaction

📊 Changes

1 file changed (+337 additions, -278 deletions)

📝 apps/workers/workers/crawlerWorker.ts (+337 -278)

📄 Description

Fixes #2519.

Problem

When the background crawler saves assets (screenshots, PDFs, images) to storage (S3 or local disk), there is a window of time before the database transaction commits the assetId to the assets table. If the worker crashes or runs out of memory during this window, the assets remain in storage with no database record pointing to them and are effectively orphaned.

Solution

  • Added try...catch blocks around the core asset processing logic in crawlAndParseUrl and handleAsAssetBookmark.
  • Introduced a tracking array, newAssetIds, to collect all successfully stored assets during a single crawl attempt.
  • If an error occurs before the database transaction completes, the catch block iterates through newAssetIds and calls silentDeleteAsset to remove them from storage.
  • This keeps the storage layer and the database consistent when a crawl attempt fails with an error before the transaction commits.
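
The tracking-and-cleanup pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the in-memory storage map and the saveAsset, crawlWithCleanup, and commitToDatabase names are hypothetical, and silentDeleteAsset is only a stand-in with an assumed signature for the helper named in the description.

```typescript
// Hypothetical in-memory stand-in for the asset store (S3 or local disk in the PR).
const storage = new Map<string, Uint8Array>();

// Stand-in for the silentDeleteAsset helper named in the description (signature assumed):
// deletes an asset and swallows any error so cleanup never masks the original failure.
async function silentDeleteAsset(assetId: string): Promise<void> {
  try {
    storage.delete(assetId);
  } catch {
    // Intentionally ignored.
  }
}

// Hypothetical helper: stores a blob and returns its new asset ID.
let nextId = 0;
async function saveAsset(data: Uint8Array): Promise<string> {
  const assetId = `asset-${nextId++}`;
  storage.set(assetId, data);
  return assetId;
}

// Stores every blob, then runs the database step. If anything throws before
// that step has finished, the assets stored so far are deleted again.
async function crawlWithCleanup(
  blobs: Uint8Array[],
  commitToDatabase: (assetIds: string[]) => Promise<void>,
): Promise<string[]> {
  const newAssetIds: string[] = []; // tracks assets written during this attempt
  try {
    for (const blob of blobs) {
      newAssetIds.push(await saveAsset(blob));
    }
    await commitToDatabase(newAssetIds);
    return newAssetIds;
  } catch (error) {
    // Roll back storage writes, then surface the original error to the caller.
    for (const assetId of newAssetIds) {
      await silentDeleteAsset(assetId);
    }
    throw error;
  }
}
```

In this sketch the cleanup runs only on the error path, so assets whose IDs were committed are left in place. A catch block only runs when an error is thrown inside the process; if the process is killed outright, the cleanup loop never executes.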

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
