[GH-ISSUE #1097] [Bug] Failed to capture the screenshot #723

Closed
opened 2026-03-02 11:52:10 +03:00 by kerem · 3 comments
Owner

Originally created by @startle09 on GitHub (Mar 7, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1097

Describe the Bug

Failed to fetch the specific webpage, URL: https://github.com/HqWu-HITCS/Awesome-Chinese-LLM

Other webpages are fetched normally (including other GitHub repositories).

The bug can be reproduced multiple times. Please let me know if you need more information.

Steps to Reproduce

pass

Expected Behaviour

pass

Screenshots or Additional Context

Log:

2025-03-07T05:51:46.188Z info: [Crawler][74] Will crawl "https://github.com/HqWu-HITCS/Awesome-Chinese-LLM" for link with id "qaxgbqoluf49akiqs5q9rbar"
2025-03-07T05:51:46.188Z info: [Crawler][74] Attempting to determine the content-type for the url https://github.com/HqWu-HITCS/Awesome-Chinese-LLM
2025-03-07T05:51:46.502Z info: [webhook][76] Starting a webhook job for bookmark with id "qaxgbqoluf49akiqs5q9rbar"
2025-03-07T05:51:46.502Z info: [webhook][76] Completed successfully
2025-03-07T05:51:46.759Z info: [search][75] Attempting to index bookmark with id qaxgbqoluf49akiqs5q9rbar ...
2025-03-07T05:51:46.826Z info: [search][75] Completed successfully
2025-03-07T05:51:48.976Z info: [Crawler][74] Content-type for the url https://github.com/HqWu-HITCS/Awesome-Chinese-LLM is "text/html; charset=utf-8"
2025-03-07T05:51:52.985Z info: [Crawler][74] Successfully navigated to "https://github.com/HqWu-HITCS/Awesome-Chinese-LLM". Waiting for the page to load ...
2025-03-07T05:51:54.380Z info: [Crawler][74] Finished waiting for the page to load.
2025-03-07T05:51:54.469Z info: [Crawler][74] Successfully fetched the page content.
--> 2025-03-07T05:51:59.470Z warn: [Crawler][74] Failed to capture the screenshot.
2025-03-07T05:52:01.068Z info: [Crawler][74] Will attempt to extract metadata from page ...
2025-03-07T05:52:02.530Z info: [Crawler][74] Will attempt to extract readable content ...
2025-03-07T05:52:04.596Z info: [Crawler][74] Done extracting readable content.
2025-03-07T05:52:04.597Z info: [Crawler][74] Skipping storing the screenshot as it's empty.
2025-03-07T05:52:05.379Z info: [Crawler][74] Done extracting metadata from the page.
2025-03-07T05:52:05.379Z info: [Crawler][74] Downloading image from "https://opengraph.githubassets.com/e5554c3072197574abc8846fb5cff36fa325b3e1699cbea42f2c1883e8eeabfd/HqWu-HITCS/Awesome-Chinese-LLM"
2025-03-07T05:52:06.049Z info: [Crawler][74] Downloaded image as assetId: 05e1e551-904f-464f-ad6d-34907ab661d6
2025-03-07T05:52:06.060Z info: [Crawler][74] Completed successfully

Device Details

Debian 12

Exact Hoarder Version

0.22.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@Summon528 commented on GitHub (Mar 23, 2025):

Hi, do you have full page screenshot enabled? If so, the screenshot may have timed out. I've submitted a pull request to add a CRAWLER_SCREENSHOT_TIMEOUT_SEC setting. You can wait for the nightly to drop, or you can try ghcr.io/summon528/hoarder@sha256:7c466c0e69945a090982ed2c0aa2ab2035723b7a654fc6250e11f4d58f2b0f02 and set the parameter to see if it fixes your issue.
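
For readers unfamiliar with how such a setting would typically behave, here is a minimal sketch in Python of a capture step bounded by a configurable timeout. It is not Karakeep's implementation (the project itself is not written in Python). The function names, the `fake_capture` stand-in, and the 5-second fallback (taken from the limit mentioned later in this thread) are assumptions for illustration; only the variable name CRAWLER_SCREENSHOT_TIMEOUT_SEC comes from the comment above.

```python
import asyncio
import os
from typing import Awaitable, Mapping, Optional

# Assumption: fall back to the 5-second limit described in this thread.
DEFAULT_TIMEOUT_SEC = 5.0


def screenshot_timeout_sec(env: Mapping[str, str] = os.environ) -> float:
    """Read CRAWLER_SCREENSHOT_TIMEOUT_SEC, falling back to the assumed default."""
    raw = env.get("CRAWLER_SCREENSHOT_TIMEOUT_SEC", "").strip()
    try:
        value = float(raw)
    except ValueError:
        # Unset or non-numeric values use the default.
        return DEFAULT_TIMEOUT_SEC
    return value if value > 0 else DEFAULT_TIMEOUT_SEC


async def capture_with_timeout(
    capture: Awaitable[bytes], timeout_sec: float
) -> Optional[bytes]:
    """Await the capture, returning None if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(capture, timeout=timeout_sec)
    except asyncio.TimeoutError:
        # The caller can log a warning and skip storing the screenshot.
        return None


async def fake_capture(delay_sec: float) -> bytes:
    """Hypothetical stand-in for a browser screenshot call."""
    await asyncio.sleep(delay_sec)
    return b"png-bytes"
```

With this shape, a capture that outlasts the limit yields `None` instead of failing the whole crawl, which is consistent with the log above, where the crawler warns, skips the empty screenshot, and still completes successfully.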


@startle09 commented on GitHub (Mar 24, 2025):

@Summon528 Thank you for your reply. The issue occurs whether full page screenshot is enabled or disabled. Even with full page screenshot turned off, it may still happen if the screenshot is taken after the page has fully loaded, right?


@Summon528 commented on GitHub (Mar 25, 2025):

It will happen if the screenshot takes more than 5 seconds. So if you have a relatively slow machine or you are taking a full screenshot of a long web page, you will get the error.
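
The 5-second figure can be checked against the log in the original report. The short Python sketch below computes the gap between the "Successfully fetched the page content" line and the "Failed to capture the screenshot" warning. The helper names are illustrative, and treating those two lines as the start and end of the screenshot attempt is an assumption based on their order in the log.

```python
from datetime import datetime


def parse_log_timestamp(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    return (parse_log_timestamp(end) - parse_log_timestamp(start)).total_seconds()


# Timestamps copied from the log in the issue description.
gap = seconds_between("2025-03-07T05:51:54.469Z", "2025-03-07T05:51:59.470Z")
print(f"{gap:.3f}")  # prints 5.001
```

A gap of almost exactly 5 seconds is what a fixed 5-second timeout would produce, rather than a failure inside the screenshot call itself.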
