[GH-ISSUE #806] Crawler job times out when adding www.dxl.com #528

Open
opened 2026-03-02 11:50:35 +03:00 by kerem · 2 comments

Originally created by @jhakonen on GitHub (Jan 2, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/806

Describe the Bug

The crawler seems to fail with a timeout when I try to add https://www.dxl.com to Hoarder.

Log output:

```
2025-01-02T11:27:34.698Z info: [Crawler][3261] Will crawl "https://www.dxl.com/" for link with id "ijiaans6quvns5pnfamzv9bw"
2025-01-02T11:27:34.699Z info: [Crawler][3261] Attempting to determine the content-type for the url https://www.dxl.com/
2025-01-02T11:27:34.716Z info: [search][3262] Attempting to index bookmark with id ijiaans6quvns5pnfamzv9bw ...
2025-01-02T11:27:34.839Z info: [search][3262] Completed successfully
2025-01-02T11:27:39.701Z error: [Crawler][3261] Failed to determine the content-type for the url https://www.dxl.com/: AbortError: The operation was aborted.
2025-01-02T11:27:42.848Z info: [Crawler][3261] Successfully navigated to "https://www.dxl.com/". Waiting for the page to load ...
2025-01-02T11:27:44.371Z info: [Crawler][3261] Finished waiting for the page to load.
2025-01-02T11:27:44.401Z info: [Crawler][3261] Successfully fetched the page content.
2025-01-02T11:27:44.555Z info: [Crawler][3261] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-02T11:27:44.559Z info: [Crawler][3261] Will attempt to extract metadata from page ...
2025-01-02T11:27:45.082Z info: [Crawler][3261] Will attempt to extract readable content ...
2025-01-02T11:27:45.364Z info: [Crawler][3261] Done extracting readable content.
2025-01-02T11:27:45.444Z info: [Crawler][3261] Stored the screenshot as assetId: 280d162c-4867-4d80-af5a-5cf98fe91f9d
2025-01-02T11:28:34.696Z error: [Crawler][3261] Crawling job failed: Error: Timed-out after 60 secs
Error: Timed-out after 60 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1544)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
```

This repeats as Hoarder keeps retrying to add the site, and it fails every time.
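A side note on the `AbortError` at 11:27:39: it fires five seconds after the content-type probe starts, which suggests the probe runs under its own short deadline and the crawler then falls back to a full browser navigation (which is why navigation still succeeds a few seconds later). A minimal sketch of that pattern, assuming a Node 18+ `fetch`; the function name and the 5-second budget are inferred from the log timestamps, not taken from Hoarder's source:

```typescript
// Hypothetical reconstruction of the content-type probe seen in the log.
// Returns the Content-Type header, or null if the request fails or the
// deadline expires (the "AbortError: The operation was aborted" case).
async function probeContentType(
  url: string,
  timeoutMs = 5000, // inferred from the ~5 s gap in the log timestamps
): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: "HEAD", signal: controller.signal });
    return res.headers.get("content-type");
  } catch {
    // Mirrors the logged failure; the crawler proceeds with Puppeteer anyway.
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```

Note that this probe failing is not the bug itself: the job continues past it and only dies later, during metadata extraction.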

Steps to Reproduce

  1. Add https://www.dxl.com to Hoarder
  2. Follow Hoarder's log output

Expected Behaviour

Hoarder successfully adds the website.

Screenshots or Additional Context

No response

Device Details

No response

Exact Hoarder Version

0.20.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@MohamedBassem commented on GitHub (Jan 2, 2025):

Seems like the metadata extraction is getting stuck indeed.


@MohamedBassem commented on GitHub (Jan 2, 2025):

This usually indicates a bug in `metascraper`, the library we're using to extract the metadata. We can try upgrading the metascraper deps and see if it helps.
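Besides upgrading the deps, one way to contain such hangs is to give the metadata-extraction step its own, shorter deadline, so a stuck `metascraper` call fails (or is skipped) instead of eating the whole 60-second job budget. A minimal sketch; `withTimeout` is a hypothetical helper, not Hoarder's actual API, and the real caller would pass the `metascraper({ html, url })` promise as `work`:

```typescript
// Hypothetical helper: race a promise against a deadline.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// A stuck extraction (a promise that never settles) now fails quickly with
// a step-specific error instead of the generic "Timed-out after 60 secs".
const stuck = new Promise<Record<string, string>>(() => {});
withTimeout(stuck, 50, "metadata extraction").catch((err: Error) =>
  console.error(err.message),
);
```

One caveat: `Promise.race` only abandons the hung work, it does not cancel it, so the worker would still want to cap retries or skip metadata extraction after repeated failures on the same URL.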
