[GH-ISSUE #1395] [FEAT]: Configurable concurrency for OpenAI / LLM tagging #887

Closed
opened 2026-03-02 11:53:30 +03:00 by kerem · 0 comments
Owner

Originally created by @mratsim on GitHub (May 11, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1395

Describe the feature you'd like

I'd like a configurable concurrency limit in the OpenAI worker to increase throughput, with an ENV variable exposed for Docker.

It is currently hardcoded to 1 in line 86.

https://github.com/karakeep-app/karakeep/blob/c03dcfdbbc5a99abdb7517a03482bccf875d1953/apps/workers/openaiWorker.ts#L63-L94
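
As a rough illustration of the request, here is a minimal sketch of wiring the limit through an environment variable. The variable name `INFERENCE_JOB_CONCURRENCY` and the worker-options shape are assumptions for illustration, not the actual Karakeep config API:

```ts
// Hypothetical sketch: read the concurrency limit from the environment
// instead of hardcoding it to 1. The env variable name and options shape
// are assumptions, not Karakeep's real config surface.
import { z } from "zod";

const envSchema = z.object({
  // Default to 1 so existing deployments keep the current behaviour.
  INFERENCE_JOB_CONCURRENCY: z.coerce.number().int().positive().default(1),
});

const env = envSchema.parse(process.env);

// Wherever the worker is constructed (around the linked lines), the
// hardcoded `concurrency: 1` would become:
const workerOptions = {
  concurrency: env.INFERENCE_JOB_CONCURRENCY,
  // ...other existing options
};
```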

Describe the benefits this would bring to existing Karakeep users

I've imported tons of saved links from Pocket and am running them through Ollama on an RTX 5090; the latency to tag with gemma3:27b at Q4 quantization is between 200 ms and 1.2 s.
And I have 28K items pending, so I'm looking at a minimum of 2 hours and up to 10 hours of tagging.

Image: https://github.com/user-attachments/assets/bc91a116-d9ff-42c9-a597-02260ee2049c

Single-request LLM inference is famously memory-bound: predicting one token at a time reduces internally to matrix-vector multiplications, which cannot fully occupy the compute/SIMD units of a CPU or GPU.
However, if we submit a batch of requests, throughput increases roughly linearly until the compute/SIMD units are fully occupied.

Concretely, with batched inference instead of generating ~60 tok/s, I see rates of ~340 tok/s with the vLLM backend (Ollama doesn't support in-flight batching AFAIK), a 5.7x throughput improvement.
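
To show why a client-side concurrency limit above 1 lets a batching backend such as vLLM group requests in flight, here is a hedged sketch. The `tagBookmark` function, the placeholder IDs, and the limit of 8 are hypothetical stand-ins, not Karakeep's actual worker code:

```ts
// Hypothetical illustration: submit tagging requests with bounded concurrency
// so a batching backend (e.g. vLLM) can serve them as one in-flight batch.
// `tagBookmark` stands in for the actual per-bookmark LLM call.
async function tagBookmark(id: string): Promise<void> {
  // ... call the OpenAI-compatible endpoint for this bookmark ...
}

async function runWithConcurrency<T>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<void>,
): Promise<void> {
  let next = 0;
  // Start `limit` lanes; each lane pulls the next pending item when free.
  // JS is single-threaded, so the index check/increment is race-free.
  const lanes = Array.from({ length: limit }, async () => {
    while (next < items.length) {
      const item = items[next++];
      await worker(item);
    }
  });
  await Promise.all(lanes);
}

async function main() {
  const pendingBookmarkIds = ["b1", "b2", "b3"]; // placeholder IDs
  // With limit = 1 this degenerates to today's sequential behaviour; with a
  // higher limit the server sees overlapping requests it can batch together.
  await runWithConcurrency(pendingBookmarkIds, 8, tagBookmark);
}

main();
```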

Can the goal of this request already be achieved via other means?

No

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

Additional context

No response

kerem closed this issue 2026-03-02 11:53:31 +03:00