[GH-ISSUE #1622] [BUG] Inference task that takes a really long time to pre-process (>10 minutes) caused by slow tokenization #1015

Closed
opened 2026-03-02 11:54:24 +03:00 by kerem · 9 comments
Owner

Originally created by @pdc1 on GitHub (Jun 16, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1622

Describe the Bug

I have a bookmark that causes an inference task that never ends, with a `node` process stuck using 100% of one processor. The task never completes and never appears to time out. To recover, I have to delete `queue.db` and restart the container.

The bookmark was originally imported in 0.24.1, which did not have this issue. I am guessing that the error is in the new tokenization, with the exception not properly caught by the processing thread, but that's just a guess.

The logs look like this:

2025-06-16T17:24:48.341Z info: [Crawler][11] Will crawl "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/" for link with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-16T17:24:48.342Z info: [Crawler][11] Attempting to determine the content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/
2025-06-16T17:24:48.635Z info: [Crawler][11] Content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/ is "text/html; charset=UTF-8"
2025-06-16T17:24:51.800Z info: [Crawler][11] Successfully navigated to "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/". Waiting for the page to load ...
2025-06-16T17:24:53.581Z info: [Crawler][11] Finished waiting for the page to load.
2025-06-16T17:24:53.891Z info: [Crawler][11] Successfully fetched the page content.
2025-06-16T17:24:54.665Z info: [Crawler][11] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-06-16T17:24:54.676Z info: [Crawler][11] Will attempt to extract metadata from page ...
2025-06-16T17:25:02.020Z info: [Crawler][11] Will attempt to extract readable content ...
Error: Could not parse CSS stylesheet
    at exports.createStylesheet (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:37:21)
    at HTMLStyleElementImpl._updateAStyleBlock (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:68:5)
    at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10)
    at JSDOMParse5Adapter.onItemPop (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/browser/parser/html.js:175:43)
    at Parser.onItemPop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:158:90)
    at OpenElementStack.pop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/open-element-stack.js:89:22)
    at endTagInText (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:2287:20)
    at Parser._endTagOutsideForeignContent (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:931:17)
    at Parser.onEndTag (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:897:18)
    at Tokenizer.emitCurrentTagToken (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/tokenizer/index.js:402:26) 
            .responsive-img{position:relative;overflow:hidden}.responsive-img img{position:absolute;top:0;left:0;width:100%;height:100%}

[... 800KB of CSS data(!) omitted...]

 .swiper-slide-shadow-bottom,.swiper-container-flip .swiper-slide-shadow-left,.swiper-container-flip .swiper-slide-shadow-right,.swiper-container-flip .swiper-slide-shadow-top{z-index:0;backface-visibility:hidden}.swiper-container-coverflow .swiper-wrapper{-ms-perspective:1200px}
        
2025-06-16T17:25:07.227Z info: [Crawler][11] Done extracting readable content.
2025-06-16T17:25:07.285Z info: [Crawler][11] Stored the screenshot as assetId: cca04d75-bf3b-4d11-8287-860d0281c22e
2025-06-16T17:25:07.394Z info: [Crawler][11] Done extracting metadata from the page.
2025-06-16T17:25:07.394Z info: [Crawler][11] Downloading image from "https://static1.xdaimages.com/wordpress/wp-content/uploads/2024/05/awii_launch_049.png"
2025-06-16T17:25:08.275Z info: [Crawler][11] Downloaded image as assetId: f4e149a7-1e83-4d86-adb6-c77418e2b322
2025-06-16T17:25:08.408Z info: [Crawler][11] Completed successfully
2025-06-16T17:25:09.299Z info: [inference][12] Starting an inference job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"

Steps to Reproduce

  1. Add a bookmark for "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/"
  2. I think that's it?

Expected Behaviour

Content is processed or produces an error.

Screenshots or Additional Context

I can provide the full log file if it will help.

Device Details

No response

Exact Karakeep Version

0.25.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
kerem 2026-03-02 11:54:24 +03:00

@pdc1 commented on GitHub (Jun 17, 2025):

I forgot to mention, the ollama server never saw a request, so the job is stuck on the karakeep_web side. I am also willing to share my database if that would help with debugging.


@pdc1 commented on GitHub (Jun 18, 2025):

I have found a few more sites that produce the `Could not parse CSS stylesheet` error, but none of them result in the stuck inference job. It is strange, since the logs say it extracted readable content, but some time later report that the operation timed out.


@Quack6765 commented on GitHub (Jun 19, 2025):

Same thing here with the following link: https://www.canadiantire.ca/en/pdp/ninja-creami-swirl-12-in-1-ice-cream-frozen-treat-maker-soft-serve-maker-nc701c-4992174p.html

Error: Could not parse CSS stylesheet
    at exports.createStylesheet (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:37:21)
    at HTMLStyleElementImpl._updateAStyleBlock (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:68:5)
    at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10) [...]

Running v0.25.0


@pdc1 commented on GitHub (Jun 19, 2025):

I tried another experiment, this time regenerating only the summaries. There was no CSS error, though this time there was an indexing timeout, and after 10 minutes the inference task did time out. It then tried again, and presumably will move on at some point. So it's not as bad as I originally thought, but it would be really nice to understand what is causing the timeout. As I mentioned, ollama is not contacted, so it appears to be something on the web app side.

Note that I do not have a value set for INFERENCE_FETCH_TIMEOUT_SEC so I was expecting a timeout of 5 min, not 10 min.

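For reference, a fetch timeout driven by that env var could look like the sketch below. This is a hypothetical illustration, not karakeep's actual implementation; the 300-second default mirrors the 5-minute expectation mentioned above and is an assumption.

```typescript
// Hypothetical sketch (not karakeep's actual code): a fetch timeout taken
// from INFERENCE_FETCH_TIMEOUT_SEC, defaulting to 300 seconds (5 minutes)
// when the variable is unset.
const timeoutSec = Number(process.env.INFERENCE_FETCH_TIMEOUT_SEC ?? "300");

async function fetchWithTimeout(url: string): Promise<Response> {
  // AbortSignal.timeout() aborts the request once the deadline passes,
  // so the HTTP request itself is actually cancelled, not just abandoned.
  return fetch(url, { signal: AbortSignal.timeout(timeoutSec * 1000) });
}
```

Note that a timeout like this only covers the fetch itself; it would not explain a job that stalls before any request is sent.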

@pdc1 commented on GitHub (Jun 20, 2025):

Hi there, me again. I looked through the summary logs for the bookmark above that always gets stuck, and found something interesting:

2025-06-19T21:23:29.451Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:34:00.987Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:34:01.026Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:43:49.626Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:43:49.683Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:53:58.017Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7197 tokens.
2025-06-19T21:53:58.157Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:53:58.226Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T22:04:50.091Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7286 tokens.
2025-06-19T22:04:50.283Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T22:04:55.665Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7194 tokens.
2025-06-19T22:05:01.530Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7207 tokens.

What is interesting is that there are four timeouts, but also four success messages, although MUCH later. Could it really be that the full processing is taking from 21:23:29 to 21:53:58?! (half an hour!!) It looks like the last run takes more like 15 minutes, perhaps because the other retries were no longer running in parallel.

I would love to know what the system is doing to use that much CPU before even sending the inference job to ollama! And only ~7000 tokens does not sound like a lot, once it got sent.

I'm going to try with a 1/2 hour timeout 😄


@pdc1 commented on GitHub (Jun 21, 2025):

Okay, I made a really long timeout and it helped quite a bit. This time, since the longer timeout meant there were no multiple competing jobs, the processing "only" took around 10.5 minutes for tags, and another 11 minutes for the summary. In both cases the ollama processing itself was minimal: 4 seconds for tags and 6 seconds for the summary.

To summarize, something about this bookmark or site takes a REALLY long time for karakeep to tokenize, but if the timeouts are set long enough the processing works. If the timeouts are set too short, karakeep runs multiple versions of each job, each of which keeps running in the background after the timeout expires.
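The "timed-out jobs keep running in the background" behaviour described above is what you get when a timeout is implemented as a promise race rather than a cancellation. A minimal sketch, assuming a `Promise.race`-style timeout (karakeep's actual implementation may differ):

```typescript
// Sketch (assumption, not karakeep's actual code): racing the work against
// a timer rejects the awaiting caller with "Timeout", but it does NOT stop
// the underlying work. The slow tokenization keeps burning CPU, so a retry
// starts while the previous attempt is still running, matching the
// overlapping retries in the logs above.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("Timeout")), ms),
  );
  return Promise.race([work, timeout]);
}
```

Actually stopping the work would require threading a cancellation signal (e.g. an `AbortSignal`) into the tokenization step itself, which is why these jobs outlive their timeouts.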


@MohamedBassem commented on GitHub (Jun 21, 2025):

I can reproduce the issue and it seems to be indeed coming from the tokenizer, which is very surprising. Will try to dig deeper to understand why it happens.


@MohamedBassem commented on GitHub (Jun 21, 2025):

The problem seems to be happening because the extracted content of this website contains a ton of white spaces, which I assume kills the performance of the tokenizer. One easy fix can be to eliminate consecutive white spaces, but let me first check if there's an open issue for this on the tokenizer lib we're using.

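The proposed easy fix (eliminating consecutive whitespace before tokenizing) can be sketched like this; the helper name is illustrative, not karakeep's actual API:

```typescript
// Collapse every run of whitespace (spaces, tabs, newlines) into a single
// space, and trim the ends, before handing the text to the tokenizer.
function collapseWhitespace(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

const padded = "Ray   tracing\n\n\t\t is   worth\n it";
console.log(collapseWhitespace(padded)); // "Ray tracing is worth it"
```

A pass like this is exactly the kind of transformation that shrank the stored text in the follow-up comment below from 136K to 14K.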

@pdc1 commented on GitHub (Jun 21, 2025):

> The problem seems to be happening because the extracted content of this website contains a ton of white spaces, which I assume kills the performance of the tokenizer. One easy fix can be to eliminate consecutive white spaces, but let me first check if there's an open issue for this on the tokenizer lib we're using.

I took a look in the database and wow! The text is 136K, and when I replaced every run of spaces/tabs/newlines with a single space, the size dropped to just 14K, a 90% reduction in size!

Thanks for taking a look!
