[PR #2452] [MERGED] feat: Add separate queue for import link crawling #2114

Closed
opened 2026-03-02 12:00:37 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/karakeep-app/karakeep/pull/2452
Author: @MohamedBassem
Created: 2/4/2026
Status: Merged
Merged: 2/8/2026
Merged by: @MohamedBassem

Base: main ← Head: claude/separate-import-crawl-queue-5nqwz


📝 Commits (6)

  • fb05e8c feat: add separate queue for import link crawling
  • ac80f26 feat: add separate queue for import link crawling
  • 5078982 refactor: rename import crawler queue to low priority crawler queue
  • 1aa38cc simplify
  • bd8f4a6 Merge branch 'main' into claude/separate-import-crawl-queue-5nqwz
  • 9f41406 review comment

📊 Changes

5 files changed (+70 additions, -36 deletions)


📝 apps/workers/index.ts (+6 -1)
📝 apps/workers/workers/crawlerWorker.ts (+43 -32)
📝 apps/workers/workers/importWorker.ts (+2 -2)
📝 packages/shared-server/src/queues.ts (+12 -0)
📝 packages/trpc/routers/bookmarks.ts (+7 -1)

📄 Description

Summary

This PR introduces a separate queue for link crawling operations triggered during imports, preventing import crawling from impacting the parallelism and performance of the main crawler queue.

Key Changes

  • New Queue: Created ImportLinkCrawlerQueue in packages/shared-server/src/queues.ts with identical configuration to LinkCrawlerQueue but isolated for import operations
  • Dual Worker Management: Updated CrawlerWorker to manage both the main crawler queue and import crawler queue simultaneously using a combined worker interface
  • Router Logic: Modified the bookmarks router to route crawl requests to ImportLinkCrawlerQueue when they originate from an import session (input.importSessionId), otherwise using the main LinkCrawlerQueue
  • Code Refactoring: Extracted common runner callbacks and options into reusable variables to avoid duplication between the two queue runners
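
The router-side decision described above can be sketched as follows. This is a hedged, self-contained illustration, not the repository's actual code: `Queue` and `pickCrawlerQueue` are hypothetical stand-ins for the real queue type and the inline branch in `packages/trpc/routers/bookmarks.ts`.

```typescript
// Minimal stand-in for the real queue objects defined in
// packages/shared-server/src/queues.ts.
interface Queue {
  name: string;
}

const LinkCrawlerQueue: Queue = { name: "link_crawler" };
const ImportLinkCrawlerQueue: Queue = { name: "import_link_crawler" };

// Mirrors the bookmarks-router logic: crawl requests that carry an
// importSessionId are routed to the isolated import queue; all other
// requests go to the main crawler queue.
function pickCrawlerQueue(input: { importSessionId?: string }): Queue {
  return input.importSessionId ? ImportLinkCrawlerQueue : LinkCrawlerQueue;
}
```

Keeping the branch at enqueue time means no crawl-job payloads change shape, which is what makes the split backward compatible.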

Implementation Details

  • Both queues share identical job configurations (5 retries, no failed job retention) and runner options (1000ms poll interval, configurable timeout and concurrency)
  • The combined worker returns an object with run() and stop() methods that manage both runners in parallel
  • Import-triggered crawls are identified by the presence of importSessionId in the request, providing a clean separation point
  • This approach maintains backward compatibility while improving resource isolation for bulk import operations
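
The combined-worker shape described above can be sketched like this. It is an illustrative approximation under stated assumptions: `Runner` is a placeholder for the actual queue-runner type, and `buildCombinedWorker` is a hypothetical name for the logic inside `crawlerWorker.ts`.

```typescript
// Placeholder for the real queue runner interface used by the workers.
interface Runner {
  run(): Promise<void>;
  stop(): void;
}

// Wraps the main and import crawler runners behind a single worker
// interface, matching the run()/stop() contract described in the PR.
function buildCombinedWorker(main: Runner, importCrawler: Runner) {
  return {
    // Start both runners concurrently; resolves when both have finished.
    run: () => Promise.all([main.run(), importCrawler.run()]),
    // Stop both runners.
    stop: () => {
      main.stop();
      importCrawler.stop();
    },
  };
}
```

Because both runners share the same callbacks and options, extracting them into reusable variables (as the PR does) keeps the two queue configurations from drifting apart.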

https://claude.ai/code/session_01KQrq8PTAMCmBDfsGrDsZqc


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
