[GH-ISSUE #815] Failed to determine the content-type for the url #533

Open
opened 2026-03-02 11:50:38 +03:00 by kerem · 13 comments
Owner

Originally created by @GreenMonito on GitHub (Jan 3, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/815

Describe the Bug

Neither the tags nor the image are generated for any URL under https://www.elcorteingles.es/.
In the logs I see the following error:

2025-01-03T14:17:38.157Z error: [Crawler] [564] Failed to determine the content-type for the URL https://www.elcorteingles.es/: AbortError: The operation was aborted.

Steps to Reproduce

Add these URLs to Hoarder

Expected Behaviour

That the tags and the image are generated

Screenshots or Additional Context

Screenshot: ElCorteIngles (https://github.com/user-attachments/assets/3e56388c-4c14-4419-aa1f-eb14a6edea1e)

Device Details

Docker, Chrome

Exact Hoarder Version

v0.20.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@DerekParks commented on GitHub (Jan 10, 2025):

I'm running into a very similar issue. After matching the logs against the code paths, I think my HTML crawl is succeeding but my banner image crawl is not (maybe because it does two fetches too quickly?).

Edit: Setting CRAWLER_DOWNLOAD_BANNER_IMAGE="true" allowed the crawl to succeed in my case.


@MohamedBassem commented on GitHub (Jan 11, 2025):

@DerekParks CRAWLER_DOWNLOAD_BANNER_IMAGE is enabled by default. Did you explicitly disable it before?
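
To illustrate why the question matters, here is a minimal sketch of a boolean setting that is on unless explicitly turned off. `readBoolEnv` is a hypothetical helper written for this example, not Karakeep's actual configuration code. The only thing taken from this thread is that the banner download is enabled by default.

```typescript
// Hypothetical helper (not Karakeep's real config code): reads a boolean
// setting from an environment-like map and falls back to a default when
// the variable is unset or empty.
function readBoolEnv(
  env: Record<string, string | undefined>,
  name: string,
  defaultValue: boolean,
): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return defaultValue;
  }
  return raw.trim().toLowerCase() === "true";
}

const NAME = "CRAWLER_DOWNLOAD_BANNER_IMAGE";

// Variable not set at all: the default applies, so the download stays on.
const whenUnset = readBoolEnv({}, NAME, true);
// Variable explicitly disabled in the container environment.
const whenDisabled = readBoolEnv({ [NAME]: "false" }, NAME, true);
// Variable explicitly enabled, as in the edit above.
const whenEnabled = readBoolEnv({ [NAME]: "true" }, NAME, true);

console.log(whenUnset, whenDisabled, whenEnabled);
```

Under these assumptions, setting the variable to "true" only changes behaviour if it had previously been set to something else.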


@MohamedBassem commented on GitHub (Jan 11, 2025):

@GreenMonito Can you share the full log when this link is added?


@reinhardt-bit commented on GitHub (Jan 11, 2025):

I looked at the Docker logs to try to figure out what the problem could be. The main issue I found is this:

TypeError: fetch failed
at node:internal/deps/undici/undici:12345:11

This is a problem in the container itself, and I found a fix here: https://medium.com/@pawanjotkaurbaweja/solved-typeerror-fetch-failed-4f7d304c0c68
It is an old Node.js dependency problem. Could you please look into this? Hope this helps.

I am running Hoarder in Docker and I have Ollama installed locally on my machine.


@GreenMonito commented on GitHub (Jan 12, 2025):

> @GreenMonito Can you share the full log when this link is added?

Sorry for the delay. These are the logs from adding the domain:

2025-01-12T19:01:17.706Z info: [Crawler][685] Will crawl "https://www.elcorteingles.es/" for link with id "l02d6h4erym2632v78s44f4n"
2025-01-12T19:01:17.706Z info: [Crawler][685] Attempting to determine the content-type for the url https://www.elcorteingles.es/
2025-01-12T19:01:17.929Z info: [search][686] Attempting to index bookmark with id l02d6h4erym2632v78s44f4n ...
2025-01-12T19:01:18.088Z info: [search][686] Completed successfully
2025-01-12T19:01:22.708Z error: [Crawler][685] Failed to determine the content-type for the url https://www.elcorteingles.es/: AbortError: The operation was aborted.
2025-01-12T19:01:25.844Z info: [Crawler][685] Successfully navigated to "https://www.elcorteingles.es/". Waiting for the page to load ...
2025-01-12T19:01:26.844Z info: [Crawler][685] Finished waiting for the page to load.
2025-01-12T19:01:26.875Z info: [Crawler][685] Successfully fetched the page content.
2025-01-12T19:01:27.021Z info: [Crawler][685] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-12T19:01:27.032Z info: [Crawler][685] Will attempt to extract metadata from page ...
2025-01-12T19:01:28.481Z info: [Crawler][685] Will attempt to extract readable content ...
2025-01-12T19:01:30.049Z info: [Crawler][685] Done extracting readable content.
2025-01-12T19:01:30.102Z info: [Crawler][685] Stored the screenshot as assetId: 335395ff-5658-466d-a9e1-bf81ffecf658
2025-01-12T19:02:17.702Z error: [Crawler][685] Crawling job failed: Error: Timed-out after 60 secs
Error: Timed-out after 60 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
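
The timestamps in this log can be checked with a small script. This is only a sketch for reading the log, and `logTimeMs` and `elapsedMs` are names made up for this example. It suggests that the content-type check was aborted about five seconds after it started, and that the whole job failed about sixty seconds after the crawl began, even though the page content and screenshot had already been captured.

```typescript
// Extracts the leading ISO-8601 timestamp of a log line as epoch milliseconds.
function logTimeMs(line: string): number {
  const stamp = line.split(" ", 1)[0];
  const ms = Date.parse(stamp);
  if (Number.isNaN(ms)) {
    throw new Error(`No timestamp at start of line: ${line}`);
  }
  return ms;
}

// Milliseconds elapsed between two log lines.
function elapsedMs(startLine: string, endLine: string): number {
  return logTimeMs(endLine) - logTimeMs(startLine);
}

// Three lines copied from the log above (the last one without its stack trace).
const attempt =
  "2025-01-12T19:01:17.706Z info: [Crawler][685] Attempting to determine the content-type for the url https://www.elcorteingles.es/";
const aborted =
  "2025-01-12T19:01:22.708Z error: [Crawler][685] Failed to determine the content-type for the url https://www.elcorteingles.es/: AbortError: The operation was aborted.";
const jobFailed =
  "2025-01-12T19:02:17.702Z error: [Crawler][685] Crawling job failed: Error: Timed-out after 60 secs";

console.log(elapsedMs(attempt, aborted)); // 5002
console.log(elapsedMs(attempt, jobFailed)); // 59996
```

The crawl continues after the AbortError (navigation succeeds at 19:01:25), so the content-type failure by itself does not appear to be what ends the job. Nothing is logged between storing the screenshot at 19:01:30 and the 60-second timeout.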


@hussion commented on GitHub (Jul 16, 2025):

same problem


@klausmcm commented on GitHub (Oct 1, 2025):

I also get the Failed to determine the content-type error when trying to add this link: https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514

I hope the logs below are useful:

2025-10-01T03:15:49.702Z info: [Crawler][6945] Will crawl "https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514" for link with id "uchiw1c155i9qdrevt48t9qg"
2025-10-01T03:15:49.702Z info: [Crawler][6945] Attempting to determine the content-type for the url https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514
2025-10-01T03:15:49.801Z info: [search][6947] Attempting to index bookmark with id uchiw1c155i9qdrevt48t9qg ...
2025-10-01T03:15:49.902Z info: [webhook][6948] Starting a webhook job for bookmark with id "uchiw1c155i9qdrevt48t9qg for operation "created"
2025-10-01T03:15:49.902Z info: [webhook][6948] Completed successfully
2025-10-01T03:15:50.096Z info: [ruleEngine][6946] Completed successfully
2025-10-01T03:15:50.767Z info: <-- GET /api/trpc/bookmarks.getBookmark?batch=1&input=%7B%220%22%3A%7B%22json%22%3A%7B%22bookmarkId%22%3A%22uchiw1c155i9qdrevt48t9qg%22%7D%7D%7D
2025-10-01T03:15:50.769Z info: --> GET /api/trpc/bookmarks.getBookmark?batch=1&input=%7B%220%22%3A%7B%22json%22%3A%7B%22bookmarkId%22%3A%22uchiw1c155i9qdrevt48t9qg%22%7D%7D%7D 200 2ms
2025-10-01T03:15:50.849Z info: [search][6947] Completed successfully
2025-10-01T03:15:50.957Z info: [Crawler] The Playwright browser got disconnected. Will attempt to launch it again.
2025-10-01T03:15:50.957Z info: [Crawler] Connecting to existing browser websocket address: ws://chrome:3000/chrome/playwright?token=qruyq9a8j06xaxlx3ijghas8cnp6ssdv
2025-10-01T03:15:54.704Z error: [Crawler][6945] Failed to determine the content-type for the url https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514: AbortError: The operation was aborted.
2025-10-01T03:15:55.547Z info: [Crawler][6945] Navigating to "https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514"
2025-10-01T03:15:56.882Z info: [Crawler][6945] Successfully navigated to "https://www.cbc.ca/news/canada/british-columbia/strong-early-response-oxford-astrazeneca-rollout-metro-van-55-65s-1.5971514". Waiting for the page to load ...
2025-10-01T03:15:58.132Z info: <-- GET /api/health
2025-10-01T03:15:58.133Z info: --> GET /api/health 200 0ms
2025-10-01T03:16:00.912Z info: [Crawler][6945] Finished waiting for the page to load.
2025-10-01T03:16:01.031Z info: [Crawler][6945] Successfully fetched the page content.
2025-10-01T03:16:02.281Z info: [Crawler][6945] Finished capturing page content and a screenshot. FullPageScreenshot: true
2025-10-01T03:16:02.564Z info: [Crawler][6945] Will attempt to extract metadata from page ...
2025-10-01T03:16:02.692Z info: [Crawler][6945] Will attempt to extract readable content ...
2025-10-01T03:16:02.838Z info: [Crawler][6945] Done extracting readable content.
2025-10-01T03:16:02.857Z info: [Crawler][6945] Stored the screenshot as assetId: 096eaf7a-7f5d-4066-ba1d-a6c974ecf0e3
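
In the log above, the AbortError arrives about five seconds after the content-type attempt starts (03:15:49.702 to 03:15:54.704). That pattern is what a request guarded by an abort timeout would produce. The following is a minimal sketch of that technique and not Karakeep's actual crawler code: `runWithTimeout` and `probeContentType` are hypothetical names, and the HEAD request and the 5000 ms default are assumptions made for illustration.

```typescript
// Hypothetical helper: runs `op` and aborts it through an AbortController
// if it has not settled within `timeoutMs`.
async function runWithTimeout<T>(
  op: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await op(controller.signal);
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}

// Hypothetical probe: asks the server for headers only and returns the
// Content-Type, or null if the request fails or is aborted.
// The 5000 ms default is an assumption based on the gap seen in the logs.
async function probeContentType(
  url: string,
  timeoutMs = 5000,
): Promise<string | null> {
  try {
    return await runWithTimeout(async (signal) => {
      const res = await fetch(url, { method: "HEAD", signal });
      return res.headers.get("content-type");
    }, timeoutMs);
  } catch (e) {
    // An aborted fetch typically rejects with an error named "AbortError".
    console.error(
      `Failed to determine the content-type for the url ${url}: ${e}`,
    );
    return null;
  }
}

// Usage (performs a real network request, so it is left commented out):
// probeContentType("https://ntfy.sh/").then((type) => console.log(type));
```

With a design like this, a site that is slow to answer or that ignores HEAD requests would trigger the abort while a full browser navigation could still succeed, which would match the successful navigation that follows in the log.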

@stelle007 commented on GitHub (Oct 17, 2025):

Same issue for me.


@brodieferguson commented on GitHub (Dec 1, 2025):

I also get the Failed to determine the content-type error when trying to add Washington Post links:

2025-12-01T19:52:47.697Z info: [Crawler][12704:2] The page has been precrawled. Will use the precrawled archive instead.
2025-12-01T19:52:47.717Z info: [Crawler][12704:2] Will attempt to extract metadata from page ...
2025-12-01T19:52:59.751Z info: <-- GET /api/health
2025-12-01T19:52:59.752Z info: --> GET /api/health 200 1ms
2025-12-01T19:53:29.814Z info: <-- GET /api/health
2025-12-01T19:53:29.815Z info: --> GET /api/health 200 0ms
2025-12-01T19:53:42.694Z error: [Crawler][12704] Crawling job failed: Error: Timeout
Error: Timeout
    at Timeout._onTimeout (file:///app/apps/workers/node_modules/.pnpm/liteque@0.7.0_@opentelemetry+api@1.9.0_@types+better-sqlite3@7.6.13_@types+react@19.2.5_bette_j25tbpstiiqwo32nscmvntyxcu/node_modules/liteque/dist/index.js:263:28)
    at listOnTimeout (node:internal/timers:588:17)
    at process.processTimers (node:internal/timers:523:7)
2025-12-01T19:53:43.628Z info: [Crawler][12704:3] Will crawl "https://www.washingtonpost.com/national-security/2025/12/01/trump-habba-us-attorney-ruling/" for link with id "l6x93k0nrseoqc246vn75bba"
2025-12-01T19:53:43.628Z info: [Crawler][12704:3] Attempting to determine the content-type for the url https://www.washingtonpost.com/national-security/2025/12/01/trump-habba-us-attorney-ruling/
2025-12-01T19:53:48.628Z error: [Crawler][12704:3] Failed to determine the content-type for the url https://www.washingtonpost.com/national-security/2025/12/01/trump-habba-us-attorney-ruling/: AbortError: The operation was aborted.


@anuram2k commented on GitHub (Dec 18, 2025):

Same issue for me too.


@anuram2k commented on GitHub (Dec 22, 2025):

Here is the log I see:

[Crawler][55:0] Failed to determine the content-type for the url https://developer.mozilla.org/en-US/docs/Web/API/WebOTP_API: AbortError: The operation was aborted.


@anuram2k commented on GitHub (Jan 14, 2026):

Is there any plan to fix this issue?


@anuram2k commented on GitHub (Jan 20, 2026):

Here is the error for some other site:

[Crawler][168:2] Failed to determine the content-type for the url https://ntfy.sh/: AbortError: The operation was aborted.
