[GH-ISSUE #1622] [BUG] Inference task that takes a really long time to pre-process (>10 minutes) caused by slow tokenization #1015

Closed
opened 2026-03-02 11:54:24 +03:00 by kerem · 9 comments
Owner

Originally created by @pdc1 on GitHub (Jun 16, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1622

Describe the Bug

I have a bookmark that causes an inference task that never ends, with a `node` process stuck using 100% of one processor. The task never completes and never appears to time out. To recover, I have to delete `queue.db` and restart the container.

The bookmark was originally imported in 0.24.1, which did not have this issue. I am guessing that the error is in the new tokenization, with the exception not properly caught by the processing thread, but that's just a guess.

The logs look like this:

2025-06-16T17:24:48.341Z info: [Crawler][11] Will crawl "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/" for link with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-16T17:24:48.342Z info: [Crawler][11] Attempting to determine the content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/
2025-06-16T17:24:48.635Z info: [Crawler][11] Content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/ is "text/html; charset=UTF-8"
2025-06-16T17:24:51.800Z info: [Crawler][11] Successfully navigated to "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/". Waiting for the page to load ...
2025-06-16T17:24:53.581Z info: [Crawler][11] Finished waiting for the page to load.
2025-06-16T17:24:53.891Z info: [Crawler][11] Successfully fetched the page content.
2025-06-16T17:24:54.665Z info: [Crawler][11] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-06-16T17:24:54.676Z info: [Crawler][11] Will attempt to extract metadata from page ...
2025-06-16T17:25:02.020Z info: [Crawler][11] Will attempt to extract readable content ...
Error: Could not parse CSS stylesheet
    at exports.createStylesheet (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:37:21)
    at HTMLStyleElementImpl._updateAStyleBlock (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:68:5)
    at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10)
    at JSDOMParse5Adapter.onItemPop (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/browser/parser/html.js:175:43)
    at Parser.onItemPop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:158:90)
    at OpenElementStack.pop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/open-element-stack.js:89:22)
    at endTagInText (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:2287:20)
    at Parser._endTagOutsideForeignContent (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:931:17)
    at Parser.onEndTag (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:897:18)
    at Tokenizer.emitCurrentTagToken (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/tokenizer/index.js:402:26) 
            .responsive-img{position:relative;overflow:hidden}.responsive-img img{position:absolute;top:0;left:0;width:100%;height:100%}

[... 800KB of CSS data(!) omitted...]

 .swiper-slide-shadow-bottom,.swiper-container-flip .swiper-slide-shadow-left,.swiper-container-flip .swiper-slide-shadow-right,.swiper-container-flip .swiper-slide-shadow-top{z-index:0;backface-visibility:hidden}.swiper-container-coverflow .swiper-wrapper{-ms-perspective:1200px}
        
2025-06-16T17:25:07.227Z info: [Crawler][11] Done extracting readable content.
2025-06-16T17:25:07.285Z info: [Crawler][11] Stored the screenshot as assetId: cca04d75-bf3b-4d11-8287-860d0281c22e
2025-06-16T17:25:07.394Z info: [Crawler][11] Done extracting metadata from the page.
2025-06-16T17:25:07.394Z info: [Crawler][11] Downloading image from "https://static1.xdaimages.com/wordpress/wp-content/uploads/2024/05/awii_launch_049.png"
2025-06-16T17:25:08.275Z info: [Crawler][11] Downloaded image as assetId: f4e149a7-1e83-4d86-adb6-c77418e2b322
2025-06-16T17:25:08.408Z info: [Crawler][11] Completed successfully
2025-06-16T17:25:09.299Z info: [inference][12] Starting an inference job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"

Steps to Reproduce

  1. Add a bookmark for "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/"
  2. I think that's it?

Expected Behaviour

Content is processed or produces an error.

Screenshots or Additional Context

I can provide the full log file if it will help.

Device Details

No response

Exact Karakeep Version

0.25.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
kerem 2026-03-02 11:54:24 +03:00

@pdc1 commented on GitHub (Jun 17, 2025):

I forgot to mention, the ollama server never saw a request, so the job is stuck on the karakeep_web side. I am also willing to share my database if that would help with debugging.


@pdc1 commented on GitHub (Jun 18, 2025):

I have found a few more sites that produce the `Could not parse CSS stylesheet` error, but none of them result in the stuck inference job. It is strange, since the logs say it extracted readable content, but some time later report that the operation timed out.


@Quack6765 commented on GitHub (Jun 19, 2025):

Same thing here with the following link: https://www.canadiantire.ca/en/pdp/ninja-creami-swirl-12-in-1-ice-cream-frozen-treat-maker-soft-serve-maker-nc701c-4992174p.html

Error: Could not parse CSS stylesheet
    at exports.createStylesheet (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:37:21)
    at HTMLStyleElementImpl._updateAStyleBlock (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:68:5)
    at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10) [...]

Running v0.25.0


@pdc1 commented on GitHub (Jun 19, 2025):

I tried another experiment, this time regenerating only the summaries. There was no CSS error, though this time there was an indexing timeout, and after 10 minutes the inference task did time out. It then tried again, and presumably will move on at some point. So it's not as bad as I originally thought, but it would be really nice to understand what is causing the timeout. As I mentioned, ollama is not contacted, so it appears to be something on the web app side.

Note that I do not have a value set for INFERENCE_FETCH_TIMEOUT_SEC so I was expecting a timeout of 5 min, not 10 min.

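For reference, a fetch timeout driven by that env var could look like the sketch below. This is a hypothetical illustration, not karakeep's actual implementation; the 300-second default mirrors the 5-minute expectation mentioned above and is an assumption.

```typescript
// Hypothetical sketch (not karakeep's actual code): a fetch timeout taken
// from INFERENCE_FETCH_TIMEOUT_SEC, defaulting to 300 seconds (5 minutes)
// when the variable is unset.
const timeoutSec = Number(process.env.INFERENCE_FETCH_TIMEOUT_SEC ?? "300");

async function fetchWithTimeout(url: string): Promise<Response> {
  // AbortSignal.timeout() aborts the request once the deadline passes,
  // so the HTTP request itself is actually cancelled, not just abandoned.
  return fetch(url, { signal: AbortSignal.timeout(timeoutSec * 1000) });
}
```

Note that a timeout like this only covers the fetch itself; it would not explain a job that stalls before any request is sent.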

@pdc1 commented on GitHub (Jun 20, 2025):

Hi there, me again. I looked through the summary logs for the bookmark above that always gets stuck, and found something interesting:

2025-06-19T21:23:29.451Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:34:00.987Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:34:01.026Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:43:49.626Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:43:49.683Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T21:53:58.017Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7197 tokens.
2025-06-19T21:53:58.157Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T21:53:58.226Z info: [inference][353] Starting a summary job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-19T22:04:50.091Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7286 tokens.
2025-06-19T22:04:50.283Z error: [inference][353] inference job failed: Error: Timeout
2025-06-19T22:04:55.665Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7194 tokens.
2025-06-19T22:05:01.530Z info: [inference][353] Generated summary for bookmark "a6iubzxc6diiqp1a005jsgxe" using 7207 tokens.

What is interesting is that there are four timeouts, but also four success messages, although MUCH later. Could it really be that the full processing is taking from 21:23:29 to 21:53:58?! (half an hour!!) It looks like the last run takes more like 15 minutes, perhaps because the other retries were no longer running in parallel.

I would love to know what the system is doing to use that much CPU before even sending the inference job to ollama! And only ~7000 tokens does not sound like a lot, once it got sent.

I'm going to try with a 1/2 hour timeout 😄


@pdc1 commented on GitHub (Jun 21, 2025):

Okay, I made a really long timeout and it helped quite a bit. This time, since the longer timeout meant there were no multiple competing jobs, the processing "only" took around 10.5 minutes for tags, and another 11 minutes for the summary. In both cases the ollama processing itself was minimal: 4 seconds for tags and 6 seconds for the summary.

To summarize, something about this bookmark or site takes a REALLY long time for karakeep to tokenize, but if the timeouts are set long enough the processing works. If the timeouts are set too short, karakeep runs multiple versions of each job, each of which keeps running in the background after the timeout expires.
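The "timed-out jobs keep running in the background" behaviour described above is what you get when a timeout is implemented as a promise race rather than a cancellation. A minimal sketch, assuming a `Promise.race`-style timeout (karakeep's actual implementation may differ):

```typescript
// Sketch (assumption, not karakeep's actual code): racing the work against
// a timer rejects the awaiting caller with "Timeout", but it does NOT stop
// the underlying work. The slow tokenization keeps burning CPU, so a retry
// starts while the previous attempt is still running, matching the
// overlapping retries in the logs above.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("Timeout")), ms),
  );
  return Promise.race([work, timeout]);
}
```

Actually stopping the work would require threading a cancellation signal (e.g. an `AbortSignal`) into the tokenization step itself, which is why these jobs outlive their timeouts.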


@MohamedBassem commented on GitHub (Jun 21, 2025):

I can reproduce the issue and it seems to be indeed coming from the tokenizer, which is very surprising. Will try to dig deeper to understand why it happens.


@MohamedBassem commented on GitHub (Jun 21, 2025):

The problem seems to be happening because the extracted content of this website contains a ton of white spaces, which I assume kills the performance of the tokenizer. One easy fix can be to eliminate consecutive white spaces, but let me first check if there's an open issue for this on the tokenizer lib we're using.

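The proposed easy fix (eliminating consecutive whitespace before tokenizing) can be sketched like this; the helper name is illustrative, not karakeep's actual API:

```typescript
// Collapse every run of whitespace (spaces, tabs, newlines) into a single
// space, and trim the ends, before handing the text to the tokenizer.
function collapseWhitespace(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

const padded = "Ray   tracing\n\n\t\t is   worth\n it";
console.log(collapseWhitespace(padded)); // "Ray tracing is worth it"
```

A pass like this is exactly the kind of transformation that shrank the stored text in the follow-up comment below from 136K to 14K.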

@pdc1 commented on GitHub (Jun 21, 2025):

> The problem seems to be happening because the extracted content of this website contains a ton of white spaces, which I assume kills the performance of the tokenizer. One easy fix can be to eliminate consecutive white spaces, but let me first check if there's an open issue for this on the tokenizer lib we're using.

I took a look in the database and wow! The text is 136K, and when I replaced every run of spaces/tabs/newlines with a single space, the size dropped to just 14K, a 90% reduction in size!

Thanks for taking a look!
