[GH-ISSUE #1586] AI summarization not working with very slow local LLM setups #989

Closed
opened 2026-03-02 11:54:12 +03:00 by kerem · 5 comments
Owner

Originally created by @ubff389 on GitHub (Jun 10, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1586

Describe the Bug

I am using the "summarize" feature of Karakeep, running a 4B model on very low-powered hardware. I expect inference to take a long time, so I set the timeout variables like so:

INFERENCE_JOB_TIMEOUT_SEC=600
INFERENCE_FETCH_TIMEOUT_SEC=600

As far as I understand, this should mean that Karakeep waits for 10 minutes for the Ollama server to finish processing. However, it fails after seemingly exactly 5 minutes with the following error message:

web-1          | TypeError: fetch failed
web-1          |     at node:internal/deps/undici/undici:13510:13
web-1          |     ... 8 lines matching cause stack trace ...
web-1          |     at async G (/app/apps/web/.next/server/chunks/269.js:4:45731) {
web-1          |   cause: TypeError: fetch failed
web-1          |       at node:internal/deps/undici/undici:13510:13
web-1          |       at async i (/app/apps/web/.next/server/chunks/269.js:19:52320)
web-1          |       at async M.processStreamableRequest (/app/apps/web/.next/server/chunks/269.js:19:53793)
web-1          |       at async V.runModel (/app/apps/web/.next/server/chunks/6815.js:1:50686)
web-1          |       at async V.inferFromText (/app/apps/web/.next/server/chunks/6815.js:1:51417)
web-1          |       at async /app/apps/web/.next/server/chunks/6815.js:12:203
web-1          |       at async X.h.middlewares (/app/apps/web/.next/server/chunks/269.js:4:46337)
web-1          |       at async F (/app/apps/web/.next/server/chunks/269.js:7:68)
web-1          |       at async F (/app/apps/web/.next/server/chunks/269.js:7:68)
web-1          |       at async G (/app/apps/web/.next/server/chunks/269.js:4:45731) {
web-1          |     [cause]: HeadersTimeoutError: Headers Timeout Error
web-1          |         at FastTimer.onParserTimeout [as _onTimeout] (node:internal/deps/undici/undici:6249:32)
web-1          |         at Timeout.onTick [as _onTimeout] (node:internal/deps/undici/undici:2210:17)
web-1          |         at listOnTimeout (node:internal/timers:588:17)
web-1          |         at process.processTimers (node:internal/timers:523:7) {
web-1          |       code: 'UND_ERR_HEADERS_TIMEOUT'
web-1          |     }
web-1          |   },
web-1          |   code: 'INTERNAL_SERVER_ERROR',
web-1          |   name: 'TRPCError'
web-1          | }

This is what I see in my Ollama journal output:

Jun 10 14:25:09 ollama-host ollama[156]: time=2025-06-10T14:25:09.887Z level=WARN source=runner.go:128 msg="truncating input prompt" limit=512 prompt=609 keep=4 new=512
Jun 10 14:29:57 ollama-host ollama[156]: [GIN] 2025/06/10 - 14:29:57 | 200 |          5m3s |    10.99.99.105 | POST     "/api/chat"

It looks like there is a "headers timeout" that is currently not influenced by these env vars. Worth noting, I think, is that similar behaviour might also happen for normal AI tags: I see some "fetch failed" errors, and my Ollama output shows similar requests that mostly last 5 minutes, yet the tags then somehow get created anyway.
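The 5-minute cutoff lines up with a Node.js default rather than any Karakeep setting. A minimal sketch, assuming Karakeep's web app uses Node's built-in `fetch` (which the stack trace shows is backed by undici): undici enforces its own headers timeout, 300000 ms by default, separately from any abort signal the env vars control, and extending it requires passing a custom dispatcher. The URL below is a placeholder.

```javascript
// undici's default headersTimeout is 300000 ms (5 minutes), which matches
// the observed cutoff exactly. It is NOT governed by AbortSignal, so
// INFERENCE_FETCH_TIMEOUT_SEC=600 cannot extend it on its own.
const UNDICI_DEFAULT_HEADERS_TIMEOUT_MS = 300_000;

// Raising the limit means giving fetch a custom undici dispatcher
// (hypothetical helper; requires the `undici` package to be installed):
async function fetchWithLongHeadersTimeout(url, opts = {}) {
  const { Agent } = await import("undici");
  const dispatcher = new Agent({
    headersTimeout: 600_000, // wait up to 10 min for response headers
    bodyTimeout: 600_000,    // and up to 10 min between body chunks
  });
  return fetch(url, { ...opts, dispatcher });
}
```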

Steps to Reproduce

  1. Set up an Ollama server that is slow
  2. Attempt to use the AI summarize feature via this server

Expected Behaviour

The AI summary should be created; the job shouldn't be interrupted after 5 minutes when the actual timeout is configured as 10 minutes in the env.

Screenshots or Additional Context

No response

Device Details

No response

Exact Karakeep Version

v0.25.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
kerem 2026-03-02 11:54:12 +03:00
Author
Owner

@Eragos commented on GitHub (Jun 10, 2025):

Hey!

here are my working inference settings:

OLLAMA_BASE_URL=http://192.168.178.69:9001
INFERENCE_TEXT_MODEL=gemma3:4b
INFERENCE_IMAGE_MODEL=gemma3:4b
EMBEDDING_TEXT_MODEL=gemma3:4b
INFERENCE_CONTEXT_LENGTH=512
INFERENCE_LANG=deutsch
INFERENCE_JOB_TIMEOUT_SEC=180
INFERENCE_FETCH_TIMEOUT_SEC=300
INFERENCE_SUPPORTS_STRUCTURED_OUTPUT=true
INFERENCE_ENABLE_AUTO_TAGGING=true
INFERENCE_ENABLE_AUTO_SUMMARIZATION=true

I had to play with INFERENCE_CONTEXT_LENGTH; 1024 was too big :-/

Hopefully it helps a little bit.

Best Michael

Author
Owner

@ubff389 commented on GitHub (Jun 10, 2025):

Hi Michael,
You're using 1024, I'm using 512 -- it's still too slow :)
I'm not complaining about the speed, but rather the fact that the 600s limits don't seem to work properly.
Do you know what exactly the EMBEDDING_TEXT_MODEL env var is needed for? The documentation doesn't say much about whether it even needs to be set at all.

Author
Owner

@ubff389 commented on GitHub (Jun 10, 2025):

Moreover, this doesn't look like a purely performance issue. I added a new bookmark, which triggered the auto tag creation job. This is what I see in the Ollama logs:

Jun 10 16:26:34 ollama-host ollama[156]: time=2025-06-10T16:26:34.091Z level=WARN source=runner.go:128 msg="truncating input prompt" limit=512 prompt=620 keep=4 new=512
Jun 10 16:31:21 ollama-host ollama[156]: [GIN] 2025/06/10 - 16:31:21 | 200 |          5m0s |    10.99.99.105 | POST     "/api/chat"
Jun 10 16:31:22 ollama-host ollama[156]: time=2025-06-10T16:31:22.403Z level=WARN source=runner.go:128 msg="truncating input prompt" limit=512 prompt=620 keep=4 new=512
Jun 10 16:34:06 ollama-host ollama[156]: [GIN] 2025/06/10 - 16:34:06 | 200 |         2m44s |    10.99.99.105 | POST     "/api/chat"

It's apparent that the first request was aborted at 5 minutes without completing, while the second request for the same content (note the identical prompt=620 token count) went through and finished in 2 minutes and 44 seconds. I will investigate whether this is an issue of the model taking too long to load the first time, although I am using an SSD.
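The cold-start theory above can be tested directly: Ollama lets you preload a model by sending a generate request with no prompt, and its `keep_alive` request option controls how long the model stays resident afterwards. A minimal sketch (the host, port, and 30-minute value are placeholders):

```javascript
// Preload a model into memory and keep it resident, so the first real
// inference request doesn't pay the load cost. `keep_alive` is a real
// Ollama request option; a generate request with no prompt just loads
// the model without producing output.
async function preloadModel(model, base = "http://ollama-host:11434") {
  const res = await fetch(`${base}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, keep_alive: "30m" }),
  });
  if (!res.ok) throw new Error(`preload failed: ${res.status}`);
  return res.json();
}
// e.g. await preloadModel("gemma3:4b");
```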

Author
Owner

@ubff389 commented on GitHub (Jun 11, 2025):

It was, indeed, an issue with the model: because I was using such a small model, it hallucinated when tasked with summarization and output the same sentence many hundreds of times; the request took a whopping 1 hour and 30 minutes. I think the solution is to just not use summarize with very slow hardware and very small models :D
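For anyone hitting the same runaway generation: Ollama's per-request `options` object accepts `num_predict` (a hard cap on generated tokens) and `repeat_penalty`, which bound how long a looping model can run. A sketch of a chat request body; the model name and values are illustrative, not Karakeep's defaults:

```javascript
// Cap generation length so a hallucinating model cannot loop for hours.
// num_predict and repeat_penalty are real Ollama model options.
const chatRequest = {
  model: "gemma3:4b",
  messages: [{ role: "user", content: "Summarize the following page: ..." }],
  stream: false,
  options: {
    num_predict: 512,    // stop after at most 512 generated tokens
    repeat_penalty: 1.3, // discourage verbatim repetition
  },
};
// POST this as JSON to /api/chat on the Ollama server.
```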

Author
Owner

@noctrex commented on GitHub (Aug 31, 2025):

A 4B model is too much for a small system. Better-suited models are:

  • llama3.2:1b
  • gemma3:270m
  • qwen2.5:0.5b
  • qwen3:0.6b (a thinking model that generates <think> tags in the description, but it is smarter)

I've got those running in an Ollama container on a Synology NAS, and they seem to work much faster than 4B models. For an image model I use Qwen2.5-VL-3B-Instruct, but it's slower since it's 3B rather than 1B.
