[GH-ISSUE #1518] Attempting to determine the content-type for the url #950

Open
opened 2026-03-02 11:53:56 +03:00 by kerem · 0 comments
Owner

Originally created by @graealex on GitHub (Jun 2, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1518

Describe the Bug

With certain URLs, Karakeep fails early on when trying to determine the content-type:

web-1  | 2025-06-02T20:35:13.360Z info: [Crawler][121] Will crawl "https://www.st.com/en/microcontrollers-microprocessors/stm32f303re.html" for link with id "wla2tnfhiky3lud0pthchjmc"
web-1  | 2025-06-02T20:35:13.361Z info: [Crawler][121] Attempting to determine the content-type for the url https://www.st.com/en/microcontrollers-microprocessors/stm32f303re.html
web-1  | 2025-06-02T20:35:18.367Z error: [Crawler][121] Failed to determine the content-type for the url https://www.st.com/en/microcontrollers-microprocessors/stm32f303re.html: AbortError: The operation was aborted.
web-1  | 2025-06-02T20:35:30.742Z info: [Crawler][121] Successfully navigated to "https://www.st.com/en/microcontrollers-microprocessors/stm32f303re.html". Waiting for the page to load ...
web-1  | 2025-06-02T20:35:32.643Z info: [Crawler][121] Finished waiting for the page to load.
web-1  | 2025-06-02T20:35:32.824Z info: [Crawler][121] Successfully fetched the page content.
web-1  | 2025-06-02T20:35:35.710Z info: [Crawler][121] Finished capturing page content and a screenshot. FullPageScreenshot: true
web-1  | 2025-06-02T20:35:35.734Z info: [Crawler][121] Will attempt to extract metadata from page ...
web-1  | 2025-06-02T20:35:50.675Z info: [Crawler][121] Will attempt to extract readable content ...
web-1  | 2025-06-02T20:36:02.172Z info: [Crawler][121] Done extracting readable content.
web-1  | 2025-06-02T20:36:04.654Z info: [Crawler][121] Stored the screenshot as assetId: 9da6f2ac-ebc4-43ee-ac83-92d62ce47d0e
web-1  | 2025-06-02T20:45:13.358Z error: [Crawler][121] Crawling job failed: Error: Timed-out after 600 secs
web-1  | Error: Timed-out after 600 secs

The bookmark in the web UI remains blank. AI inference never runs. The screenshot is not saved either, even though the log reports that it was stored.

In addition, the job gets stuck in a loop: it tries to parse the page, fails, and retries, until the service is restarted.
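To make the timing in the log easier to read, here is a small sketch that computes the gaps between the relevant log lines. The timestamps are copied from the log; the helper `secondsBetween` and the variable names are only illustrative and are not part of Karakeep. The numbers suggest that the content-type probe was aborted after roughly 5 seconds, and that the job then sat for about 9 minutes after the screenshot line until the 600-second job timeout fired. The log does not show what happened during that gap.

```typescript
// Hypothetical helper (not Karakeep code): seconds between two ISO-8601 timestamps.
function secondsBetween(startIso: string, endIso: string): number {
  return (Date.parse(endIso) - Date.parse(startIso)) / 1000;
}

// Timestamps copied verbatim from the log in this report.
const crawlStart = "2025-06-02T20:35:13.360Z";       // "Will crawl ..."
const probeStart = "2025-06-02T20:35:13.361Z";       // "Attempting to determine the content-type ..."
const probeAbort = "2025-06-02T20:35:18.367Z";       // "Failed to determine the content-type ... AbortError"
const screenshotStored = "2025-06-02T20:36:04.654Z"; // "Stored the screenshot as assetId ..."
const jobFailed = "2025-06-02T20:45:13.358Z";        // "Crawling job failed: Error: Timed-out after 600 secs"

const probeSeconds = secondsBetween(probeStart, probeAbort);
const totalSeconds = secondsBetween(crawlStart, jobFailed);
const silentSeconds = secondsBetween(screenshotStored, jobFailed);

console.log(probeSeconds);  // 5.006   (content-type probe aborted after about 5 s)
console.log(totalSeconds);  // 599.998 (matches the 600 s job timeout)
console.log(silentSeconds); // 548.704 (no log output between the screenshot line and the timeout)
```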

Steps to Reproduce

  1. Try to bookmark https://www.st.com/en/microcontrollers-microprocessors/stm32f303re.html

Expected Behaviour

The specified URL should be crawled, all the specified assets should be created, and the AI inference tasks should run.

If the URL cannot be crawled, the job should eventually stop retrying, or at least let all other pending requests go ahead of it in the queue.
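The behaviour I have in mind could look roughly like the sketch below. This is not Karakeep's actual queue implementation; `CrawlJob`, `CrawlFn`, `drainQueue` and `maxAttempts` are made-up names, and the crawl function is a synchronous stand-in so the example stays self-contained. A failing job is put at the back of the queue so that pending jobs run first, and it is dropped once it reaches the attempt limit.

```typescript
// Hypothetical job shape; not Karakeep's real queue types.
interface CrawlJob {
  url: string;
  attempts: number;
}

// Stand-in for a crawl: returns true on success, false on failure or timeout.
type CrawlFn = (url: string) => boolean;

interface DrainResult {
  done: string[];   // URLs crawled successfully
  failed: string[]; // URLs given up on after maxAttempts
  order: string[];  // order in which URLs were attempted
}

function drainQueue(urls: string[], crawl: CrawlFn, maxAttempts: number): DrainResult {
  const queue: CrawlJob[] = urls.map((url) => ({ url, attempts: 0 }));
  const result: DrainResult = { done: [], failed: [], order: [] };

  while (queue.length > 0) {
    const job = queue.shift()!;
    job.attempts += 1;
    result.order.push(job.url);

    if (crawl(job.url)) {
      result.done.push(job.url);
    } else if (job.attempts < maxAttempts) {
      // Retry later, behind every job that is still pending.
      queue.push(job);
    } else {
      // Attempt limit reached: stop retrying instead of looping forever.
      result.failed.push(job.url);
    }
  }
  return result;
}

// Usage with placeholder URLs: the first one always fails.
const result = drainQueue(
  ["https://example.com/broken", "https://example.com/a", "https://example.com/b"],
  (url) => !url.endsWith("/broken"),
  3,
);
console.log(result.order);
```

With a limit of 3, the broken URL is attempted three times in total, but the two healthy URLs are processed right after its first failure instead of waiting behind it.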

Screenshots or Additional Context

No response

Device Details

No response

Exact Karakeep Version

Karakeep v0.24.1 nightly

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem