[GH-ISSUE #2073] Page not hoarded #1292

Open
opened 2026-03-02 11:56:18 +03:00 by kerem · 4 comments

Originally created by @anupamck on GitHub (Oct 26, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/2073

Describe the Bug

When I try to hoard the following page: https://www.jamesshore.com/v2/blog/2025/the-accountability-problem, it doesn't get hoarded.

Steps to Reproduce

  1. Hoard this page https://www.jamesshore.com/v2/blog/2025/the-accountability-problem

Expected Behaviour

When I preview the hoarded page on my Karakeep server, I should be able to read a preview of it. However, said preview is absent.

Screenshots or Additional Context

Here's the preview on my self-hosted server.

[Screenshot: https://github.com/user-attachments/assets/92cf2539-7071-4830-9960-6546c24793d2]

Here are the logs on my server (they appear to be fine):

web-1  | 2025-10-26T18:33:11.586Z info: [Crawler][19076] Navigating to "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem"
web-1  | 2025-10-26T18:33:12.331Z info: [Crawler][19076] Successfully navigated to "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem". Waiting for the page to load ...
web-1  | 2025-10-26T18:33:13.345Z info: [Crawler][19076] Finished waiting for the page to load.
web-1  | 2025-10-26T18:33:13.363Z info: [Crawler][19076] Successfully fetched the page content.
web-1  | 2025-10-26T18:33:13.581Z info: [Crawler][19076] Finished capturing page content and a screenshot. FullPageScreenshot: false
web-1  | 2025-10-26T18:33:13.599Z info: [Crawler][19076] Will attempt to extract metadata from page ...
web-1  | 2025-10-26T18:33:13.659Z info: [Crawler][19076] Will attempt to extract readable content ...
web-1  | 2025-10-26T18:33:13.956Z info: [Crawler][19076] Done extracting readable content.
web-1  | 2025-10-26T18:33:14.031Z info: [Crawler][19076] Stored the screenshot as assetId: a072ff23-51f0-4f5b-ba79-81f128927325

Device Details

No response

Exact Karakeep Version

0.27.1

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@cloudchristoph commented on GitHub (Oct 27, 2025):

Interesting.

Tried to hoard this page on my system.
Error log shows a timeout.

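I also tried another blog post from the site and the homepage itself:

  • https://www.jamesshore.com/v2/blog/2008/the-decline-and-fall-of-agile
  • https://www.jamesshore.com/

None of these pages gets crawled. Maybe something on the page or web server is blocking headless Chrome?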

web-1          | 2025-10-27T08:34:01.779Z info: [Crawler][204220] Will crawl "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem" for link with id "chcvae84vfjyrjxekn3p7bck"
web-1          | 2025-10-27T08:34:01.779Z info: [Crawler][204220] Attempting to determine the content-type for the url https://www.jamesshore.com/v2/blog/2025/the-accountability-problem
web-1          | 2025-10-27T08:34:02.237Z info: [Crawler][204220] Content-type for the url https://www.jamesshore.com/v2/blog/2025/the-accountability-problem is "text/html"
chrome-1       | [1027/083402.257046:WARNING:runtime_features.cc(728)] AttributionReportingCrossAppWeb cannot be enabled in this configuration. Use --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb in addition.
web-1          | 2025-10-27T08:34:02.450Z info: [Crawler][204220] Navigating to "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem"
web-1          | 2025-10-27T08:34:03.244Z info: [Crawler][204220] Successfully navigated to "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem". Waiting for the page to load ...
web-1          | 2025-10-27T08:34:04.455Z info: [Crawler][204220] Finished waiting for the page to load.
web-1          | 2025-10-27T08:34:04.483Z info: [Crawler][204220] Successfully fetched the page content.
web-1          | 2025-10-27T08:34:04.689Z info: [Crawler][204220] Finished capturing page content and a screenshot. FullPageScreenshot: false
web-1          | 2025-10-27T08:34:04.703Z info: [Crawler][204220] Will attempt to extract metadata from page ...
web-1          | 2025-10-27T08:34:04.756Z info: [Crawler][204220] Will attempt to extract readable content ...
web-1          | 2025-10-27T08:34:04.990Z info: [Crawler][204220] Done extracting readable content.
web-1          | 2025-10-27T08:34:05.047Z info: [Crawler][204220] Stored the screenshot as assetId: 4b2c7e10-3d2b-4ae2-81e1-0d6b0e725410
web-1          | 2025-10-27T08:35:01.781Z error: [Crawler][204220] Crawling job failed: Error: Timeout
web-1          | Error: Timeout
web-1          |     at Timeout._onTimeout (file:///app/apps/workers/node_modules/.pnpm/liteque@0.6.0_@opentelemetry+api@1.9.0_@types+better-sqlite3@7.6.13_@types+react@19.1.11_bett_q46iwo2f32qriczoxxudwy6mbu/node_modules/liteque/dist/index.js:231:28)
web-1          |     at listOnTimeout (node:internal/timers:588:17)
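
To see where the time goes in a log like the one above, the gaps between consecutive timestamps can be computed. The following is a minimal Python sketch, not part of Karakeep; `parse` and `largest_gap` are hypothetical helper names, and `LOG` is an excerpt of the lines above. On this excerpt the largest gap sits between the screenshot being stored and the timeout error, and the whole job lasts almost exactly 60 seconds. That hints at a 60-second job timeout, but the value is inferred from the timestamps and is not confirmed anywhere in this thread.

```python
import re
from datetime import datetime

# Excerpt of the crawler log above (lines copied verbatim, some omitted).
LOG = """\
web-1 | 2025-10-27T08:34:01.779Z info: [Crawler][204220] Will crawl "https://www.jamesshore.com/v2/blog/2025/the-accountability-problem" for link with id "chcvae84vfjyrjxekn3p7bck"
web-1 | 2025-10-27T08:34:04.703Z info: [Crawler][204220] Will attempt to extract metadata from page ...
web-1 | 2025-10-27T08:34:04.756Z info: [Crawler][204220] Will attempt to extract readable content ...
web-1 | 2025-10-27T08:34:04.990Z info: [Crawler][204220] Done extracting readable content.
web-1 | 2025-10-27T08:34:05.047Z info: [Crawler][204220] Stored the screenshot as assetId: 4b2c7e10-3d2b-4ae2-81e1-0d6b0e725410
web-1 | 2025-10-27T08:35:01.781Z error: [Crawler][204220] Crawling job failed: Error: Timeout
"""

# Timestamp, log level, and message of one log line.
LINE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z (\w+): (.*)")


def parse(log):
    """Return (timestamp, message) pairs for every line that carries a timestamp."""
    events = []
    for line in log.splitlines():
        match = LINE.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((when, match.group(3)))
    return events


def largest_gap(events):
    """Return (seconds, message_before, message_after) for the longest silence."""
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(events, events[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


events = parse(LOG)
gap, before, after = largest_gap(events)
total = (events[-1][0] - events[0][0]).total_seconds()

print(f"largest gap: {gap:.3f}s")    # 56.734s
print(f"  after:  {before}")
print(f"  before: {after}")
print(f"job duration: {total:.3f}s")  # 60.002s
```

The same two helpers work on any log in this thread, because they only rely on the ISO timestamp and the `level:` prefix.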

@cloudchristoph commented on GitHub (Oct 28, 2025):

For what it's worth, even the SingleFile browser extension is unable to store this page directly.
Pretty sure it's something going on on the webpage itself.


@MohamedBassem commented on GitHub (Nov 1, 2025):

> Will attempt to extract metadata from page ...

Seems like it's metadata extraction with metascraper that's getting stuck for some reason. Those usually end up being a metascraper bug.
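
One way to make a hang like this easier to attribute is to give each stage its own deadline, so the error names the stage instead of the whole job timing out. The following is a minimal Python sketch of that idea under stated assumptions. It is not Karakeep's code (the stack traces in this thread show the workers run on Node.js) and it does not call metascraper; `run_stage`, `extract_metadata`, `extract_readable`, and the 0.05-second limit are all hypothetical stand-ins.

```python
import asyncio


class StageTimeout(Exception):
    """Raised when a named stage exceeds its own deadline."""


async def run_stage(name, coro, limit_s):
    """Await one stage, converting a timeout into an error that names the stage."""
    try:
        return await asyncio.wait_for(coro, timeout=limit_s)
    except asyncio.TimeoutError:
        raise StageTimeout(f"stage '{name}' exceeded {limit_s}s") from None


async def extract_readable():
    # Hypothetical stand-in for a stage that finishes quickly.
    await asyncio.sleep(0)
    return "readable text"


async def extract_metadata():
    # Hypothetical stand-in for a stage that never finishes on its own.
    await asyncio.sleep(3600)
    return {"title": "never reached"}


async def crawl():
    """Run every stage under its own deadline and record per-stage outcomes."""
    results = {}
    stages = [("readable", extract_readable), ("metadata", extract_metadata)]
    for name, stage in stages:
        try:
            results[name] = await run_stage(name, stage(), limit_s=0.05)
        except StageTimeout as err:
            results[name] = f"failed: {err}"
    return results


results = asyncio.run(crawl())
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

A deadline of this kind only helps when the stuck step yields control, for example while waiting on I/O. A CPU-bound loop would not be interrupted and would need a separate process or thread.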


@SamDickinsonReece commented on GitHub (Nov 10, 2025):

I'm getting a similar issue on every page I try to grab. I'm routing traffic through Cloudflare WARP (not sure if that's related, but mentioning it anyway) because many sites were rejecting requests from my Oracle Cloud host. Could this be related to the new blocking change?

| 2025-11-10T23:19:22.457Z info: [Crawler][3602:3] Will crawl "https://github.com/mayanayza/netvisor" for link with id "ko3yiq117mw56j74j68oq8r2"
| 2025-11-10T23:19:22.457Z info: [Crawler][3602:3] Attempting to determine the content-type for the url https://github.com/mayanayza/netvisor
 | 2025-11-10T23:19:22.639Z info: [Crawler][3602:2] Stored large HTML content (51722897 bytes) as asset: 8dd14af2-61ef-442f-875a-75a4d4497695
| 2025-11-10T23:19:22.748Z info: [search][3612] Completed successfully
| 2025-11-10T23:19:23.130Z info: [Crawler][3602:3] Content-type for the url https://github.com/mayanayza/netvisor is "text/html"
| 2025-11-10T23:19:23.130Z info: [Crawler][3602:3] The page has been precrawled. Will use the precrawled archive instead.
| 2025-11-10T23:19:23.332Z info: [Crawler][3602:3] Will attempt to extract metadata from page ...
| 2025-11-10T23:19:35.919Z info: [Crawler][3602:3] Done extracting metadata from the page.
| 2025-11-10T23:19:35.919Z info: [Crawler][3602:3] Will attempt to extract readable content ...
| 2025-11-10T23:24:09.989Z info: [Crawler][3602:3] Done extracting readable content.
| 2025-11-10T23:24:09.996Z info: [Crawler][3602:3] Skipping storing the screenshot as it's empty.
| 2025-11-10T23:24:10.228Z error: [Crawler][3602] Crawling job failed: Error: Timeout
| Error: Timeout
|     at Timeout._onTimeout (file:///app/apps/workers/node_modules/.pnpm/liteque@0.6.2_@opentelemetry+api@1.9.0_@types+better-sqlite3@7.6.13_@types+react@19.2.2_bette_ieiqokhfddhykknqx65h5kqwm4/node_modules/liteque/dist/index.js:263:28)
|     at listOnTimeout (node:internal/timers:588:17)
|     at process.processTimers (node:internal/timers:523:7)
| 2025-11-10T23:24:10.253Z info: [Crawler][3602:4] Will crawl "https://github.com/mayanayza/netvisor" for link with id "ko3yiq117mw56j74j68oq8r2"
| 2025-11-10T23:24:10.253Z info: [Crawler][3602:4] Attempting to determine the content-type for the url https://github.com/mayanayza/netvisor
| 2025-11-10T23:24:10.425Z info: [Crawler][3602:3] Stored large HTML content (51722897 bytes) as asset: 11d10e69-3545-4333-b78d-1e90674260b1
| 2025-11-10T23:24:11.028Z info: [Crawler][3602:4] Content-type for the url https://github.com/mayanayza/netvisor is "text/html"
| 2025-11-10T23:24:11.028Z info: [Crawler][3602:4] The page has been precrawled. Will use the precrawled archive instead.
| 2025-11-10T23:24:11.232Z info: [Crawler][3602:4] Will attempt to extract metadata from page ...
| 2025-11-10T23:24:23.836Z info: [Crawler][3602:4] Done extracting metadata from the page.
| 2025-11-10T23:24:23.837Z info: [Crawler][3602:4] Will attempt to extract readable content ...
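
Unlike the earlier logs in this thread, this one prints a completion line for both extraction stages, so per-stage durations can be read off by pairing each "Will attempt to extract ..." line with its "Done extracting ..." line. The following is a minimal Python sketch, not part of Karakeep; `stage_durations` is a hypothetical helper and `LOG` holds the four relevant lines from attempt `3602:3` above. On these lines, metadata extraction takes about 12.6 seconds and readable-content extraction about 274 seconds. The log also reports a stored HTML size of 51722897 bytes, which may be why both stages are slow here, though the thread does not confirm that.

```python
import re
from datetime import datetime

# The four stage lines of attempt 3602:3, copied verbatim from the log above.
LOG = """\
| 2025-11-10T23:19:23.332Z info: [Crawler][3602:3] Will attempt to extract metadata from page ...
| 2025-11-10T23:19:35.919Z info: [Crawler][3602:3] Done extracting metadata from the page.
| 2025-11-10T23:19:35.919Z info: [Crawler][3602:3] Will attempt to extract readable content ...
| 2025-11-10T23:24:09.989Z info: [Crawler][3602:3] Done extracting readable content.
"""

# Timestamp, start/finish wording, and stage name of one stage line.
LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+)Z info: \[Crawler\]\[[\d:]+\] "
    r"(Will attempt to extract|Done extracting) (metadata|readable content)"
)


def stage_durations(log):
    """Map each stage name to the seconds between its start and finish lines."""
    started = {}
    durations = {}
    for line in log.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        wording, stage = match.group(2), match.group(3)
        if wording.startswith("Will"):
            started[stage] = when
        elif stage in started:
            durations[stage] = (when - started.pop(stage)).total_seconds()
    return durations


durations = stage_durations(LOG)
for stage, seconds in durations.items():
    print(f"{stage}: {seconds:.3f}s")
# metadata: 12.587s
# readable content: 274.070s
```

A stage that starts but never logs a completion line is simply absent from the result, which is itself a useful signal when comparing attempts.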

My compose file has the following modifications (this setup was working before the weekend nightly):

  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --proxy-server=http://cloudflare-warp:8888
      - --proxy-bypass-list=172.17.0.1/16,10.0.0.200/24
    networks:
      - default
      - tunnel
  cloudflare-warp:
    image: ghcr.io/jazzxp/cloudflare-warp-proxy:latest
    cap_add:
      - NET_ADMIN
    networks:
      - tunnel