[GH-ISSUE #806] Crawler job times out when adding www.dxl.com #528

Open
opened 2026-03-02 11:50:35 +03:00 by kerem · 2 comments

Originally created by @jhakonen on GitHub (Jan 2, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/806

Describe the Bug

The crawler seems to fail with a timeout when I try to add https://www.dxl.com to Hoarder.

Log output:

```
2025-01-02T11:27:34.698Z info: [Crawler][3261] Will crawl "https://www.dxl.com/" for link with id "ijiaans6quvns5pnfamzv9bw"
2025-01-02T11:27:34.699Z info: [Crawler][3261] Attempting to determine the content-type for the url https://www.dxl.com/
2025-01-02T11:27:34.716Z info: [search][3262] Attempting to index bookmark with id ijiaans6quvns5pnfamzv9bw ...
2025-01-02T11:27:34.839Z info: [search][3262] Completed successfully
2025-01-02T11:27:39.701Z error: [Crawler][3261] Failed to determine the content-type for the url https://www.dxl.com/: AbortError: The operation was aborted.
2025-01-02T11:27:42.848Z info: [Crawler][3261] Successfully navigated to "https://www.dxl.com/". Waiting for the page to load ...
2025-01-02T11:27:44.371Z info: [Crawler][3261] Finished waiting for the page to load.
2025-01-02T11:27:44.401Z info: [Crawler][3261] Successfully fetched the page content.
2025-01-02T11:27:44.555Z info: [Crawler][3261] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-02T11:27:44.559Z info: [Crawler][3261] Will attempt to extract metadata from page ...
2025-01-02T11:27:45.082Z info: [Crawler][3261] Will attempt to extract readable content ...
2025-01-02T11:27:45.364Z info: [Crawler][3261] Done extracting readable content.
2025-01-02T11:27:45.444Z info: [Crawler][3261] Stored the screenshot as assetId: 280d162c-4867-4d80-af5a-5cf98fe91f9d
2025-01-02T11:28:34.696Z error: [Crawler][3261] Crawling job failed: Error: Timed-out after 60 secs
Error: Timed-out after 60 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1544)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
```

This repeats as Hoarder keeps retrying to add the site, and it fails every time.
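A side note on the `AbortError` at 11:27:39: it fires five seconds after the content-type probe starts, which suggests the probe runs under its own short deadline and the crawler then falls back to a full browser navigation (which is why navigation still succeeds a few seconds later). A minimal sketch of that pattern, assuming a Node 18+ `fetch`; the function name and the 5-second budget are inferred from the log timestamps, not taken from Hoarder's source:

```typescript
// Hypothetical reconstruction of the content-type probe seen in the log.
// Returns the Content-Type header, or null if the request fails or the
// deadline expires (the "AbortError: The operation was aborted" case).
async function probeContentType(
  url: string,
  timeoutMs = 5000, // inferred from the ~5 s gap in the log timestamps
): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: "HEAD", signal: controller.signal });
    return res.headers.get("content-type");
  } catch {
    // Mirrors the logged failure; the crawler proceeds with Puppeteer anyway.
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```

Note that this probe failing is not the bug itself: the job continues past it and only dies later, during metadata extraction.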

Steps to Reproduce

  1. Add https://www.dxl.com to Hoarder
  2. Follow Hoarder's log output

Expected Behaviour

Hoarder successfully adds the website.

Screenshots or Additional Context

No response

Device Details

No response

Exact Hoarder Version

0.20.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@MohamedBassem commented on GitHub (Jan 2, 2025):

Seems like the metadata extraction is getting stuck indeed.


@MohamedBassem commented on GitHub (Jan 2, 2025):

This usually indicates a bug in `metascraper`, the library we're using to extract the metadata. We can try upgrading the metascraper deps and see if it helps.
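Besides upgrading the deps, one way to contain such hangs is to give the metadata-extraction step its own, shorter deadline, so a stuck `metascraper` call fails (or is skipped) instead of eating the whole 60-second job budget. A minimal sketch; `withTimeout` is a hypothetical helper, not Hoarder's actual API, and the real caller would pass the `metascraper({ html, url })` promise as `work`:

```typescript
// Hypothetical helper: race a promise against a deadline.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// A stuck extraction (a promise that never settles) now fails quickly with
// a step-specific error instead of the generic "Timed-out after 60 secs".
const stuck = new Promise<Record<string, string>>(() => {});
withTimeout(stuck, 50, "metadata extraction").catch((err: Error) =>
  console.error(err.message),
);
```

One caveat: `Promise.race` only abandons the hung work, it does not cancel it, so the worker would still want to cap retries or skip metadata extraction after repeated failures on the same URL.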
