[GH-ISSUE #876] Duplicate download #570

Closed
opened 2026-03-02 11:50:57 +03:00 by kerem · 3 comments

Originally created by @debackerl on GitHub (Jan 13, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/876

Describe the Bug

I noticed that when a crawl job times out, Hoarder retries the download, but the retry also repeats every earlier step that had already succeeded, including inference. I use Ollama, so the extra cost for me is small, but people using OpenAI could end up with a larger bill.

Steps to Reproduce

  1. Use container ghcr.io/hoarder-app/hoarder:release (it's currently 0.21.0)
  2. Set these environment variables:
    BROWSER_WEB_URL: http://127.0.0.1:9222
    CRAWLER_FULL_PAGE_ARCHIVE: true
    OLLAMA_BASE_URL: http://xxxxx:11434
    INFERENCE_TEXT_MODEL: llama3.2:3b-instruct-q5_K_M
    INFERENCE_IMAGE_MODEL: llama3.2-vision:11b-instruct-q4_K_M
    INFERENCE_CONTEXT_LENGTH: 32768
    CRAWLER_NUM_WORKERS: 1
    CRAWLER_JOB_TIMEOUT_SEC: 15
  3. Use container gcr.io/zenika-hub/alpine-chrome:latest (it's currently 124)
  4. Run it with these flags:
    --disable-gpu
    --disable-dev-shm-usage
    --remote-debugging-address=127.0.0.1
    --remote-debugging-port=9222
    --hide-scrollbars
    --no-sandbox
  5. Bookmark http://www.keystone-europe.com/

Expected Behaviour

Download the page only once, and run inference only once.
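
A minimal sketch of that expectation, assuming the job keeps a record of which steps have already succeeded so that a retry skips them. Every name here (`JobState`, `runAttempt`, `runWithRetries`, the step names) is hypothetical and is not taken from Hoarder's code; the steps are synchronous stubs purely for illustration.

```typescript
// Hypothetical sketch: none of these names come from the Hoarder codebase.
type StepName = "crawl" | "inference" | "archive";

interface JobState {
  // Steps that finished successfully in an earlier attempt.
  completed: Set<StepName>;
}

interface Step {
  name: StepName;
  run: () => void;
}

// Runs the steps in order, skipping any step already marked as completed.
function runAttempt(state: JobState, steps: Step[]): void {
  for (const step of steps) {
    if (state.completed.has(step.name)) {
      continue; // succeeded in a previous attempt, do not repeat it
    }
    step.run(); // if this throws, the step is not marked as completed
    state.completed.add(step.name);
  }
}

// Retries the job up to maxAttempts times, keeping the completed set between attempts.
function runWithRetries(state: JobState, steps: Step[], maxAttempts: number): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      runAttempt(state, steps);
      return true;
    } catch (_err) {
      // Deliberately keep state.completed so the next attempt resumes, not restarts.
    }
  }
  return false;
}

// Demo: the archive step fails once (standing in for the timeout), then succeeds.
const calls: Record<StepName, number> = { crawl: 0, inference: 0, archive: 0 };
let archiveShouldFail = true;

const steps: Step[] = [
  { name: "crawl", run: () => { calls.crawl++; } },
  { name: "inference", run: () => { calls.inference++; } },
  {
    name: "archive",
    run: () => {
      calls.archive++;
      if (archiveShouldFail) {
        archiveShouldFail = false;
        throw new Error("Timed-out after 15 secs");
      }
    },
  },
];

const state: JobState = { completed: new Set<StepName>() };
const succeeded = runWithRetries(state, steps, 3);
console.log(succeeded, calls); // crawl and inference ran once, archive ran twice
```

In this sketch only the step that failed is repeated, which is the behaviour I would expect after a time-out.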

Screenshots or Additional Context

2025-01-13T08:32:49.473Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:32:49.473Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:32:49.860Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:32:50.477Z info: [search][148750] Completed successfully
2025-01-13T08:32:52.908Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:32:53.910Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:32:54.043Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:32:54.850Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:32:54.893Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:32:55.294Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:32:55.643Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:32:55.729Z info: [Crawler][141486] Stored the screenshot as assetId: 80bfb1fb-53fc-4922-acf8-5ef0e0e502d9
2025-01-13T08:32:55.814Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:32:55.815Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:32:55.903Z info: [Crawler][141486] Downloaded image as assetId: e1881a89-c440-405b-8dc8-53e6fcc7ccb3
2025-01-13T08:32:56.215Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:32:56.830Z info: [inference][148751] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:32:56.871Z info: [VideoCrawler][148753] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:32:56.894Z info: [VideoCrawler][148753] Video Download Completed successfully
2025-01-13T08:32:56.970Z info: [search][148752] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:32:57.275Z info: [Crawler][141484] Done archiving the page as assetId: 9e475209-4b2b-4eba-add3-3b259c62a35e
2025-01-13T08:32:57.420Z info: [inference][148751] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 635 tokens and inferred: electronic components,Keystone Electronics,Keystone Europe,interconnect components,hardware,industrial suppliers,electronics industry,trade shows
2025-01-13T08:32:57.718Z info: [inference][148751] Completed successfully
2025-01-13T08:32:57.795Z info: [search][148752] Completed successfully
2025-01-13T08:32:57.868Z info: [search][148754] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:32:58.671Z info: [search][148754] Completed successfully
2025-01-13T08:33:04.468Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
2025-01-13T08:33:04.565Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:04.566Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:33:04.891Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:33:05.844Z info: [Crawler][141486] Done archiving the page as assetId: b09d8591-a57b-42d2-908c-771c5eb48682
2025-01-13T08:33:07.926Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:33:08.928Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:33:08.943Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:33:09.715Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:33:09.731Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:33:09.904Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:33:10.105Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:33:10.133Z info: [Crawler][141486] Stored the screenshot as assetId: 7f037d9f-8380-4145-a025-d53de9a66c28
2025-01-13T08:33:10.221Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:33:10.221Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:33:10.310Z info: [Crawler][141486] Downloaded image as assetId: 2d65cc73-2ee6-40ce-9aca-d9fa8eadb710
2025-01-13T08:33:10.636Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:33:10.761Z info: [search][148756] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:11.176Z info: [inference][148755] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:11.223Z info: [VideoCrawler][148757] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:33:11.227Z info: [VideoCrawler][148757] Video Download Completed successfully
2025-01-13T08:33:11.452Z info: [search][148756] Completed successfully
2025-01-13T08:33:11.696Z info: [inference][148755] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 630 tokens and inferred: Keystone Europe,electronic components,battery holders,interconnect components,Hardware,electronics,trade shows,Europe,India,MEA,Keystone Electronics
2025-01-13T08:33:11.898Z info: [inference][148755] Completed successfully
2025-01-13T08:33:12.545Z info: [search][148758] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:13.453Z info: [search][148758] Completed successfully
2025-01-13T08:33:19.564Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
2025-01-13T08:33:19.670Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:19.670Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:33:20.029Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:33:20.080Z info: [Crawler][141486] Done archiving the page as assetId: 19c625d0-dd72-4a3c-8f24-91ec0eeba0f1
2025-01-13T08:33:23.134Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:33:24.135Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:33:24.148Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:33:24.806Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:33:24.817Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:33:25.010Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:33:25.219Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:33:25.247Z info: [Crawler][141486] Stored the screenshot as assetId: 0525ec3b-523d-42b3-80ec-2995f5026f7c
2025-01-13T08:33:25.334Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:33:25.334Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:33:25.426Z info: [Crawler][141486] Downloaded image as assetId: 1a472fbd-3430-478d-9f9a-4a44cb1029c7
2025-01-13T08:33:25.767Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:33:25.823Z info: [search][148760] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:26.303Z info: [inference][148759] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:26.363Z info: [VideoCrawler][148761] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:33:26.363Z info: [VideoCrawler][148761] Video Download Completed successfully
2025-01-13T08:33:26.631Z info: [search][148760] Completed successfully
2025-01-13T08:33:26.775Z info: [inference][148759] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 626 tokens and inferred: electronic components,Keystone Electronics,interconnect components,MEA+India,electronics,hardware,trade shows,electronics industry,Keystone Europe
2025-01-13T08:33:26.989Z info: [inference][148759] Completed successfully
2025-01-13T08:33:27.748Z info: [search][148762] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:28.493Z info: [search][148762] Completed successfully
2025-01-13T08:33:34.667Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
2025-01-13T08:33:34.754Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:34.754Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:33:34.797Z info: [Crawler][141486] Done archiving the page as assetId: ddf0d64d-2568-4b60-bbbf-80e5df3777dd
2025-01-13T08:33:35.062Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:33:38.214Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:33:39.215Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:33:39.229Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:33:39.957Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:33:40.000Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:33:40.168Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:33:40.366Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:33:40.400Z info: [Crawler][141486] Stored the screenshot as assetId: fcd04162-2288-483b-93e4-b0eb496a354d
2025-01-13T08:33:40.486Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:33:40.486Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:33:40.576Z info: [Crawler][141486] Downloaded image as assetId: 89c5c380-6bc6-48d0-8715-36ac5ed5ae0b
2025-01-13T08:33:40.892Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:33:41.448Z info: [inference][148763] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:41.496Z info: [VideoCrawler][148765] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:33:41.497Z info: [VideoCrawler][148765] Video Download Completed successfully
2025-01-13T08:33:41.591Z info: [search][148764] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:42.094Z info: [inference][148763] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 649 tokens and inferred: Keystone Electronics,electronic components,battery holders,interconnect components,India,MEA,electronics industry,Keystone Europe,Keystone Electronics products,trade shows,distribution network,precision electronic interconnect components,quality products,competitive prices
2025-01-13T08:33:42.513Z info: [inference][148763] Completed successfully
2025-01-13T08:33:42.565Z info: [search][148764] Completed successfully
2025-01-13T08:33:42.648Z info: [search][148766] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:43.443Z info: [search][148766] Completed successfully
2025-01-13T08:33:49.752Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
2025-01-13T08:33:49.836Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:49.836Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:33:50.146Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:33:50.387Z info: [Crawler][141486] Done archiving the page as assetId: 90d8fc06-0595-4392-8df4-fd62c83693a2
2025-01-13T08:33:53.340Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:33:54.340Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:33:54.396Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:33:55.180Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:33:55.221Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:33:55.416Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:33:55.618Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:33:55.660Z info: [Crawler][141486] Stored the screenshot as assetId: dd9a7406-00d5-4f1f-ad3a-2a2d041e150c
2025-01-13T08:33:55.807Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:33:55.807Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:33:56.001Z info: [Crawler][141486] Downloaded image as assetId: 3ec8e3d2-c781-46f1-9c64-99d25e9737ca
2025-01-13T08:33:56.300Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:33:56.682Z info: [search][148768] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:56.729Z info: [inference][148767] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:33:56.782Z info: [VideoCrawler][148769] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:33:56.782Z info: [VideoCrawler][148769] Video Download Completed successfully
2025-01-13T08:33:57.229Z info: [inference][148767] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 633 tokens and inferred: electronic components,Keystone Europe,India,Middle East Africa,interconnect components,hardware,electronics industry,trade shows
2025-01-13T08:33:57.512Z info: [inference][148767] Completed successfully
2025-01-13T08:33:57.565Z info: [search][148768] Completed successfully
2025-01-13T08:33:57.630Z info: [search][148770] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:33:58.373Z info: [search][148770] Completed successfully
2025-01-13T08:34:04.832Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
2025-01-13T08:34:04.909Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:34:04.909Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/
2025-01-13T08:34:05.486Z info: [Crawler][141486] Done archiving the page as assetId: 34add227-e1da-423f-a831-2132467078a2
2025-01-13T08:34:05.660Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8"
2025-01-13T08:34:08.838Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ...
2025-01-13T08:34:09.840Z info: [Crawler][141486] Finished waiting for the page to load.
2025-01-13T08:34:09.891Z info: [Crawler][141486] Successfully fetched the page content.
2025-01-13T08:34:10.557Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-01-13T08:34:10.607Z info: [Crawler][141486] Will attempt to extract metadata from page ...
2025-01-13T08:34:10.781Z info: [Crawler][141486] Will attempt to extract readable content ...
2025-01-13T08:34:11.149Z info: [Crawler][141486] Done extracting readable content.
2025-01-13T08:34:11.239Z info: [Crawler][141486] Stored the screenshot as assetId: 32a37682-689c-426a-a14c-e6a3d227b96a
2025-01-13T08:34:11.327Z info: [Crawler][141486] Done extracting metadata from the page.
2025-01-13T08:34:11.327Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:34:11.598Z info: [Crawler][141486] Downloaded image as assetId: 2ed34185-5001-4d7d-b064-4e710c080445
2025-01-13T08:34:11.700Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:34:12.245Z info: [VideoCrawler][148773] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:34:12.245Z info: [VideoCrawler][148773] Video Download Completed successfully
2025-01-13T08:34:12.615Z info: [search][148772] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:34:12.716Z info: [inference][148771] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:34:13.126Z info: [inference][148771] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 630 tokens and inferred: Keystone Europe,electronic components,interconnect components,battery holders,Hardware,MEA+India,Keystone Electronics,electronics,trade shows,DIY
2025-01-13T08:34:13.263Z info: [inference][148771] Completed successfully
2025-01-13T08:34:13.384Z info: [search][148772] Completed successfully
2025-01-13T08:34:13.404Z info: [search][148774] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:34:14.134Z info: [search][148774] Completed successfully
2025-01-13T08:34:19.908Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
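
To put a rough number on the repeated inference, the token counts from the six "Inferring tag" log lines above can be added up. This is only arithmetic on the logged values; the variable names are illustrative, and actual cost would depend on the provider's pricing, which is not part of this log.

```typescript
// Token counts copied from the six "Inferring tag ... used N tokens" log lines above.
const tokensPerAttempt: number[] = [635, 630, 626, 649, 633, 630];

// Total tokens spent across all attempts.
const totalTokens = tokensPerAttempt.reduce((sum, n) => sum + n, 0);

// Tokens spent by every attempt after the first, i.e. the repeated inference runs.
const redundantTokens = tokensPerAttempt.slice(1).reduce((sum, n) => sum + n, 0);

console.log(
  `attempts=${tokensPerAttempt.length} total=${totalTokens} redundant=${redundantTokens}`
);
// prints "attempts=6 total=3803 redundant=3168"
```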

Device Details

Docker env

Exact Hoarder Version

v0.21.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
2025-01-13T08:33:55.807Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png" 2025-01-13T08:33:56.001Z info: [Crawler][141486] Downloaded image as assetId: 3ec8e3d2-c781-46f1-9c64-99d25e9737ca 2025-01-13T08:33:56.300Z info: [Crawler][141486] Will attempt to archive page ... 2025-01-13T08:33:56.682Z info: [search][148768] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ... 2025-01-13T08:33:56.729Z info: [inference][148767] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8" 2025-01-13T08:33:56.782Z info: [VideoCrawler][148769] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config. 2025-01-13T08:33:56.782Z info: [VideoCrawler][148769] Video Download Completed successfully 2025-01-13T08:33:57.229Z info: [inference][148767] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 633 tokens and inferred: electronic components,Keystone Europe,India,Middle East Africa,interconnect components,hardware,electronics industry,trade shows 2025-01-13T08:33:57.512Z info: [inference][148767] Completed successfully 2025-01-13T08:33:57.565Z info: [search][148768] Completed successfully 2025-01-13T08:33:57.630Z info: [search][148770] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ... 
2025-01-13T08:33:58.373Z info: [search][148770] Completed successfully 2025-01-13T08:34:04.832Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs Error: Timed-out after 15 secs at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025) at listOnTimeout (node:internal/timers:594:17) at process.processTimers (node:internal/timers:529:7) 2025-01-13T08:34:04.909Z info: [Crawler][141486] Will crawl "http://www.keystone-europe.com/" for link with id "p0kuqkszcvyshulnvbb97cj8" 2025-01-13T08:34:04.909Z info: [Crawler][141486] Attempting to determine the content-type for the url http://www.keystone-europe.com/ 2025-01-13T08:34:05.486Z info: [Crawler][141486] Done archiving the page as assetId: 34add227-e1da-423f-a831-2132467078a2 2025-01-13T08:34:05.660Z info: [Crawler][141486] Content-type for the url http://www.keystone-europe.com/ is "text/html; charset=UTF-8" 2025-01-13T08:34:08.838Z info: [Crawler][141486] Successfully navigated to "http://www.keystone-europe.com/". Waiting for the page to load ... 2025-01-13T08:34:09.840Z info: [Crawler][141486] Finished waiting for the page to load. 2025-01-13T08:34:09.891Z info: [Crawler][141486] Successfully fetched the page content. 2025-01-13T08:34:10.557Z info: [Crawler][141486] Finished capturing page content and a screenshot. FullPageScreenshot: false 2025-01-13T08:34:10.607Z info: [Crawler][141486] Will attempt to extract metadata from page ... 2025-01-13T08:34:10.781Z info: [Crawler][141486] Will attempt to extract readable content ... 2025-01-13T08:34:11.149Z info: [Crawler][141486] Done extracting readable content. 2025-01-13T08:34:11.239Z info: [Crawler][141486] Stored the screenshot as assetId: 32a37682-689c-426a-a14c-e6a3d227b96a 2025-01-13T08:34:11.327Z info: [Crawler][141486] Done extracting metadata from the page. 
2025-01-13T08:34:11.327Z info: [Crawler][141486] Downloading image from "https://www.keystone-europe.com/wp-content/uploads/2016/11/logo-keystone-europe-mea-india.png"
2025-01-13T08:34:11.598Z info: [Crawler][141486] Downloaded image as assetId: 2ed34185-5001-4d7d-b064-4e710c080445
2025-01-13T08:34:11.700Z info: [Crawler][141486] Will attempt to archive page ...
2025-01-13T08:34:12.245Z info: [VideoCrawler][148773] Skipping video download from "http://www.keystone-europe.com/", because it is disabled in the config.
2025-01-13T08:34:12.245Z info: [VideoCrawler][148773] Video Download Completed successfully
2025-01-13T08:34:12.615Z info: [search][148772] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:34:12.716Z info: [inference][148771] Starting an inference job for bookmark with id "p0kuqkszcvyshulnvbb97cj8"
2025-01-13T08:34:13.126Z info: [inference][148771] Inferring tag for bookmark "p0kuqkszcvyshulnvbb97cj8" used 630 tokens and inferred: Keystone Europe,electronic components,interconnect components,battery holders,Hardware,MEA+India,Keystone Electronics,electronics,trade shows,DIY
2025-01-13T08:34:13.263Z info: [inference][148771] Completed successfully
2025-01-13T08:34:13.384Z info: [search][148772] Completed successfully
2025-01-13T08:34:13.404Z info: [search][148774] Attempting to index bookmark with id p0kuqkszcvyshulnvbb97cj8 ...
2025-01-13T08:34:14.134Z info: [search][148774] Completed successfully
2025-01-13T08:34:19.908Z error: [Crawler][141486] Crawling job failed: Error: Timed-out after 15 secs
Error: Timed-out after 15 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
```

### Device Details

Docker env

### Exact Hoarder Version

v0.21.0

### Have you checked the troubleshooting guide?

- [X] I have checked the troubleshooting guide and I haven't found a solution to my problem

kerem closed this issue 2026-03-02 11:50:57 +03:00

@debackerl commented on GitHub (Jan 13, 2025):

I wonder whether this also explains https://github.com/hoarder-app/hoarder/issues/742, because I also end up with many screenshots, banners, and full-page archives per bookmark.
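
To illustrate the expected behaviour (each step runs once, and a retry only redoes what failed), here is a minimal sketch. It is not Hoarder's code: `Step`, `StepLedger`, and `runAttempt` are hypothetical names, and the ledger lives in memory only for the sake of the example. It also does not cover cancelling a timed-out attempt that is still running in the background, which the log above suggests happens ("Done archiving the page" is logged after the next attempt has already started).

```typescript
// Hypothetical sketch: none of these names come from the Hoarder codebase.
type Step = { name: string; run: () => void };

// Records which steps have already completed for each bookmark.
class StepLedger {
  private done = new Map<string, Set<string>>();

  isDone(bookmarkId: string, step: string): boolean {
    return this.done.get(bookmarkId)?.has(step) ?? false;
  }

  markDone(bookmarkId: string, step: string): void {
    const steps = this.done.get(bookmarkId) ?? new Set<string>();
    steps.add(step);
    this.done.set(bookmarkId, steps);
  }
}

// Runs the steps in order, skipping any that a previous attempt finished.
// Returns the names of the steps that actually executed in this attempt.
function runAttempt(bookmarkId: string, steps: Step[], ledger: StepLedger): string[] {
  const executed: string[] = [];
  for (const step of steps) {
    if (ledger.isDone(bookmarkId, step.name)) continue;
    step.run(); // may throw; the step is then not marked as done
    ledger.markDone(bookmarkId, step.name);
    executed.push(step.name);
  }
  return executed;
}

// Usage: the archive step fails once (standing in for the timeout), then succeeds.
let screenshotCalls = 0;
let archiveCalls = 0;
const steps: Step[] = [
  { name: "screenshot", run: () => { screenshotCalls++; } },
  {
    name: "archive",
    run: () => {
      archiveCalls++;
      if (archiveCalls === 1) throw new Error("Timed-out");
    },
  },
];

const ledger = new StepLedger();
let firstAttemptFailed = false;
try {
  runAttempt("p0kuqkszcvyshulnvbb97cj8", steps, ledger);
} catch (e) {
  firstAttemptFailed = true;
}
// The retry skips the screenshot and only re-runs the archive step.
const secondAttempt = runAttempt("p0kuqkszcvyshulnvbb97cj8", steps, ledger);
```

With something along these lines, a retry after a timeout would not produce a second screenshot or a second inference run; a real implementation would presumably need to persist the ledger alongside the bookmark.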

@MohamedBassem commented on GitHub (Jan 13, 2025):

yes, it's very likely related to the same bug

@MohamedBassem commented on GitHub (Jan 18, 2025):

Let's track it in #742 instead.
