[PR #2452] [MERGED] feat: Add separate queue for import link crawling #2114

Closed
opened 2026-03-02 12:00:37 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/karakeep-app/karakeep/pull/2452
Author: @MohamedBassem
Created: 2/4/2026
Status: Merged
Merged: 2/8/2026
Merged by: @MohamedBassem

Base: main ← Head: claude/separate-import-crawl-queue-5nqwz


📝 Commits (6)

  • fb05e8c feat: add separate queue for import link crawling
  • ac80f26 feat: add separate queue for import link crawling
  • 5078982 refactor: rename import crawler queue to low priority crawler queue
  • 1aa38cc simplify
  • bd8f4a6 Merge branch 'main' into claude/separate-import-crawl-queue-5nqwz
  • 9f41406 review comment

📊 Changes

5 files changed (+70 additions, -36 deletions)


📝 apps/workers/index.ts (+6 -1)
📝 apps/workers/workers/crawlerWorker.ts (+43 -32)
📝 apps/workers/workers/importWorker.ts (+2 -2)
📝 packages/shared-server/src/queues.ts (+12 -0)
📝 packages/trpc/routers/bookmarks.ts (+7 -1)

📄 Description

Summary

This PR introduces a separate queue for link crawling operations triggered during imports, preventing import crawling from impacting the parallelism and performance of the main crawler queue.

Key Changes

  • New Queue: Created ImportLinkCrawlerQueue in packages/shared-server/src/queues.ts with identical configuration to LinkCrawlerQueue but isolated for import operations
  • Dual Worker Management: Updated CrawlerWorker to manage both the main crawler queue and import crawler queue simultaneously using a combined worker interface
  • Router Logic: Modified the bookmarks router to route crawl requests to ImportLinkCrawlerQueue when they originate from an import session (input.importSessionId), otherwise using the main LinkCrawlerQueue
  • Code Refactoring: Extracted common runner callbacks and options into reusable variables to avoid duplication between the two queue runners
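
The router-side decision described above can be sketched as follows. This is a hedged, self-contained illustration, not the repository's actual code: `Queue` and `pickCrawlerQueue` are hypothetical stand-ins for the real queue type and the inline branch in `packages/trpc/routers/bookmarks.ts`.

```typescript
// Minimal stand-in for the real queue objects defined in
// packages/shared-server/src/queues.ts.
interface Queue {
  name: string;
}

const LinkCrawlerQueue: Queue = { name: "link_crawler" };
const ImportLinkCrawlerQueue: Queue = { name: "import_link_crawler" };

// Mirrors the bookmarks-router logic: crawl requests that carry an
// importSessionId are routed to the isolated import queue; all other
// requests go to the main crawler queue.
function pickCrawlerQueue(input: { importSessionId?: string }): Queue {
  return input.importSessionId ? ImportLinkCrawlerQueue : LinkCrawlerQueue;
}
```

Keeping the branch at enqueue time means no crawl-job payloads change shape, which is what makes the split backward compatible.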

Implementation Details

  • Both queues share identical job configurations (5 retries, no failed job retention) and runner options (1000ms poll interval, configurable timeout and concurrency)
  • The combined worker returns an object with run() and stop() methods that manage both runners in parallel
  • Import-triggered crawls are identified by the presence of importSessionId in the request, providing a clean separation point
  • This approach maintains backward compatibility while improving resource isolation for bulk import operations
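
The combined-worker shape described above can be sketched like this. It is an illustrative approximation under stated assumptions: `Runner` is a placeholder for the actual queue-runner type, and `buildCombinedWorker` is a hypothetical name for the logic inside `crawlerWorker.ts`.

```typescript
// Placeholder for the real queue runner interface used by the workers.
interface Runner {
  run(): Promise<void>;
  stop(): void;
}

// Wraps the main and import crawler runners behind a single worker
// interface, matching the run()/stop() contract described in the PR.
function buildCombinedWorker(main: Runner, importCrawler: Runner) {
  return {
    // Start both runners concurrently; resolves when both have finished.
    run: () => Promise.all([main.run(), importCrawler.run()]),
    // Stop both runners.
    stop: () => {
      main.stop();
      importCrawler.stop();
    },
  };
}
```

Because both runners share the same callbacks and options, extracting them into reusable variables (as the PR does) keeps the two queue configurations from drifting apart.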

https://claude.ai/code/session_01KQrq8PTAMCmBDfsGrDsZqc


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
