[GH-ISSUE #273] (Docker:nightly) Everything works but: "Crawling job failed: Error: EXDEV: cross-device link not permitted" leads to multiple retries and "failed crawling jobs" #182

Closed
opened 2026-03-02 11:47:23 +03:00 by kerem · 2 comments

Originally created by @Deathproof76 on GitHub (Jul 4, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/273

Hello! 🙋‍♂️

I'm running the current nightly from July 2nd and have run into an almost cosmetic bug: fetching, crawling, inference, etc. all seem to be in working order. This hasn't happened with my setup before, and sadly I can't pin down the exact commit that introduced it.

Every time I hoard a new page or recrawl an old one (to try out a new LLM running with Ollama), I get this error message from the workers container:

2024-07-04T10:12:15.501Z error: [Crawler][3904] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/43e6a047-5a13-41ad-b6d9-af89eea2f7a1' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/43e6a047-5a13-41ad-b6d9-af89eea2f7a1/asset.bin'

Although the hoarding doesn't appear to have failed, the workers treat it as a failure and retry multiple times, which leads to:

2024-07-04T09:44:45.028Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:44:45.029Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:44:45.512Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:44:46.576Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:44:49.551Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:44:49.872Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:44:49.875Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:44:50.002Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:44:50.094Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:44:50.096Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:44:50.110Z info: [Crawler][3902] Stored the screenshot as assetId: ffa51dc7-2eeb-4e08-a832-87928459f2a4
2024-07-04T09:44:50.110Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:44:50.216Z info: [Crawler][3902] Downloaded image as assetId: c7d82832-7638-486d-9b3d-c388ff29df43
2024-07-04T09:44:50.234Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:44:50.238Z info: [inference][5113] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:44:50.239Z info: [search][11018] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:44:50.295Z info: [search][11018] Completed successfully
2024-07-04T09:44:51.062Z info: [inference][5113] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 27 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper,HuggingFace
2024-07-04T09:44:51.080Z info: [inference][5113] Completed successfully
2024-07-04T09:44:51.080Z info: [search][11019] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:44:51.135Z info: [search][11019] Completed successfully
2024-07-04T09:44:58.421Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/08d38192-37d4-4628-8da4-4f1671c44520' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/08d38192-37d4-4628-8da4-4f1671c44520/asset.bin'
2024-07-04T09:45:00.461Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:00.461Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:00.941Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:01.784Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:04.787Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:05.119Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:05.121Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:05.248Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:05.331Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:05.332Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:05.333Z info: [Crawler][3902] Stored the screenshot as assetId: 4e014c81-8a79-4877-ae9b-c99287434a13
2024-07-04T09:45:05.333Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:05.439Z info: [Crawler][3902] Downloaded image as assetId: 48ecb83e-a5ec-4d11-9164-1b01a97ffa71
2024-07-04T09:45:05.456Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:05.460Z info: [inference][5114] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:05.460Z info: [search][11020] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:05.517Z info: [search][11020] Completed successfully
2024-07-04T09:45:06.330Z info: [inference][5114] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 27 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper Model,Distillation
2024-07-04T09:45:06.348Z info: [inference][5114] Completed successfully
2024-07-04T09:45:06.348Z info: [search][11021] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:06.403Z info: [search][11021] Completed successfully
2024-07-04T09:45:22.300Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/2aaff50d-f4c7-4fe3-8b04-2c004a83c21a' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/2aaff50d-f4c7-4fe3-8b04-2c004a83c21a/asset.bin'
2024-07-04T09:45:26.316Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:26.316Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:26.602Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:27.698Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:30.709Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:31.065Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:31.068Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:31.187Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:31.271Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:31.273Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:31.274Z info: [Crawler][3902] Stored the screenshot as assetId: d2a1689b-b7c1-4129-aed7-be6231d8be83
2024-07-04T09:45:31.274Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:31.386Z info: [Crawler][3902] Downloaded image as assetId: dda5ffc3-34fd-4cff-8a3b-1066cb3b7e2a
2024-07-04T09:45:31.415Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:31.419Z info: [inference][5115] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:31.419Z info: [search][11022] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:31.475Z info: [search][11022] Completed successfully
2024-07-04T09:45:32.750Z info: [inference][5115] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 45 tokens and inferred: Machine Learning,Natural Language Processing,Speech Recognition,German Language,HuggingFace
2024-07-04T09:45:32.765Z info: [inference][5115] Completed successfully
2024-07-04T09:45:32.765Z info: [search][11023] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:32.821Z info: [search][11023] Completed successfully
2024-07-04T09:45:39.336Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/f6c22412-7fb5-44d4-b47c-750b3a81c516' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/f6c22412-7fb5-44d4-b47c-750b3a81c516/asset.bin'
2024-07-04T09:45:47.359Z info: [Crawler][3902] Will crawl "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd" for link with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:47.359Z info: [Crawler][3902] Attempting to determine the content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd
2024-07-04T09:45:47.629Z info: [Crawler][3902] Content-type for the url https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd is "text/html; charset=utf-8"
2024-07-04T09:45:48.461Z info: [Crawler][3902] Successfully navigated to "https://huggingface.co/aTrain-core/distil-whisper-large-v3-de-kd". Waiting for the page to load ...
2024-07-04T09:45:51.536Z info: [Crawler][3902] Finished waiting for the page to load.
2024-07-04T09:45:51.866Z info: [Crawler][3902] Finished capturing page content and a screenshot. FullPageScreenshot: true
2024-07-04T09:45:51.869Z info: [Crawler][3902] Will attempt to extract metadata from page ...
2024-07-04T09:45:52.010Z info: [Crawler][3902] Will attempt to extract readable content ...
2024-07-04T09:45:52.102Z info: [Crawler][3902] Done extracting readable content.
2024-07-04T09:45:52.103Z info: [Crawler][3902] Done extracting metadata from the page.
2024-07-04T09:45:52.104Z info: [Crawler][3902] Stored the screenshot as assetId: f18e3c0d-6ae7-48ae-af10-9c43eeff1be6
2024-07-04T09:45:52.104Z info: [Crawler][3902] Downloading image from "https://cdn-thumbnails.huggingface.co/social-thumbnails/models/aTrain-core/distil-whisper-large-v3-de-kd.png"
2024-07-04T09:45:52.248Z info: [Crawler][3902] Downloaded image as assetId: b88ba0ce-7cd2-4739-8e59-2a7eba1f4ed7
2024-07-04T09:45:52.255Z info: [Crawler][3902] Will attempt to archive page ...
2024-07-04T09:45:52.260Z info: [inference][5116] Starting an inference job for bookmark with id "n68vikut13h1d0jo7x2cewfs"
2024-07-04T09:45:52.260Z info: [search][11024] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:52.316Z info: [search][11024] Completed successfully
2024-07-04T09:45:52.873Z info: [inference][5116] Inferring tag for bookmark "n68vikut13h1d0jo7x2cewfs" used 28 tokens and inferred: Machine Learning,Natural Language Processing,German Language,Whisper Model,HuggingFace
2024-07-04T09:45:52.877Z info: [inference][5116] Completed successfully
2024-07-04T09:45:52.878Z info: [search][11025] Attempting to index bookmark with id n68vikut13h1d0jo7x2cewfs ...
2024-07-04T09:45:52.934Z info: [search][11025] Completed successfully
2024-07-04T09:46:03.330Z error: [Crawler][3902] Crawling job failed: Error: EXDEV: cross-device link not permitted, rename '/tmp/e0b27956-2f2c-4b5d-a663-4e04b61276bb' -> '/data/assets/m886fg03flmkf2e9m9mo5tlk/e0b27956-2f2c-4b5d-a663-4e04b61276bb/asset.bin'

until the workers give up.

I tried recrawling and reindexing everything, which in the end marked all jobs as failed

image

even though they did not. The latest grabs for example:

image

All the other containers in the stack show no errors. My volumes are mounted directly. I tried mounting /tmp to /dev/shm/hoarder on the system drive, and also to a folder next to the /data folder on the same disk as the other mounts; neither helped.
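
One plausible reason why putting the /tmp mount on the same disk makes no difference: on Linux, rename(2) fails with EXDEV whenever the source and destination sit under different mount points, even if both mounts are backed by the same filesystem. Below is a minimal TypeScript sketch of a probe that checks whether a rename between two directories would succeed. The function name `canRenameBetween` and the demo directories are hypothetical and not part of Hoarder.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: returns true if a file can be renamed from srcDir
// into dstDir, false if the kernel rejects the rename with EXDEV.
function canRenameBetween(srcDir: string, dstDir: string): boolean {
  const name = `exdev-probe-${process.pid}-${Date.now()}`;
  const src = path.join(srcDir, name);
  const dst = path.join(dstDir, name);
  fs.writeFileSync(src, "probe");
  try {
    fs.renameSync(src, dst);
    fs.unlinkSync(dst); // clean up the probe file after a successful rename
    return true;
  } catch (err) {
    fs.rmSync(src, { force: true }); // clean up the probe file left behind
    if ((err as NodeJS.ErrnoException).code === "EXDEV") return false;
    throw err; // anything else (permissions, missing directory) is a real error
  }
}

// Demo with two scratch directories under the OS temp dir (same mount).
const probeA = fs.mkdtempSync(path.join(os.tmpdir(), "probe-a-"));
const probeB = fs.mkdtempSync(path.join(os.tmpdir(), "probe-b-"));
console.log(`rename possible: ${canRenameBetween(probeA, probeB)}`);
```

Inside the workers container, the same probe could be pointed at /tmp and the data directory to confirm whether the two are separate mounts.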

.env

MEILI_MASTER_KEY=6L0m********************************************hCzm
NEXTAUTH_SECRET=/RfD5U******************************************Eqr9OZ+ciD
NEXTAUTH_URL=http://192.168.0.208:3111
CRAWLER_NUM_WORKERS=1
CRAWLER_FULL_PAGE_SCREENSHOT=true
CRAWLER_FULL_PAGE_ARCHIVE=true
CRAWLER_JOB_TIMEOUT_SEC=90
CRAWLER_NAVIGATE_TIMEOUT_SEC=45
MAX_ASSET_SIZE_MB=20
OLLAMA_BASE_URL=http://192.168.0.208:11434
DISABLE_SIGNUPS=true
INFERENCE_LANG=english

compose:

version: "3.8"
services:
  web:
    image: ghcr.io/mohamedbassem/hoarder-web:latest
    restart: unless-stopped
    volumes:
      - /mnt/Dockerspace/hoarder/web:/data
    ports:
      - 3111:3000
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    environment:
      MEILI_ADDR: http://192.168.0.208:7700
      DATA_DIR: /data
      REDIS_HOST: rediss
  rediss:
    image: redis:7.2-alpine
    restart: unless-stopped
    volumes:
      - /mnt/Dockerspace/hoarder/redis:/data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    ports:
      - 9222:9222
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
  meilisearch:
    image: getmeili/meilisearch:v1.6
    restart: unless-stopped
    environment:
      MEILI_NO_ANALYTICS: true
    ports:
      - 7700:7700
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    volumes:
      - /mnt/Dockerspace/hoarder/meilisearch:/meili_data
  workers:
    image: ghcr.io/mohamedbassem/hoarder-workers:latest
    restart: unless-stopped
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    volumes:
      - /mnt/Dockerspace/hoarder/web:/data
    environment:
      REDIS_HOST: rediss
      MEILI_ADDR: http://192.168.0.208:7700
      BROWSER_WEB_URL: http://192.168.0.208:9222
      DATA_DIR: /data
      INFERENCE_TEXT_MODEL: gemma-2-9b-it-IQ3_XXS.gguf:latest
      INFERENCE_IMAGE_MODEL: moondream:1.8b-v2-q6_K
    depends_on:
      web:
        condition: service_started
kerem closed this issue and added the bug label on 2026-03-02 11:47:23 +03:00.

@MohamedBassem commented on GitHub (Jul 5, 2024):

Yeah, the culprit here is CRAWLER_FULL_PAGE_ARCHIVE=true. I intentionally made it the last step in the crawler worker so that if it fails, it doesn't impact the crawler's other responsibilities.
I'll need to dig deeper to understand why the rename doesn't work. For now, disabling full page archives should solve your issue.
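
For readers hitting the same error, a common way to make a move robust against EXDEV is to try the rename first and fall back to copy-then-delete when it crosses mount points. The TypeScript sketch below illustrates that general technique. It is not necessarily the fix that went into the project, and the name `moveFileSync` and the demo paths are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: move src to dest, falling back to copy + unlink
// when the two paths are on different mounts (rename fails with EXDEV).
function moveFileSync(src: string, dest: string): void {
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  try {
    fs.renameSync(src, dest); // fast path: atomic within a single mount
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "EXDEV") throw err;
    fs.copyFileSync(src, dest); // slow path: copy the bytes across mounts
    fs.unlinkSync(src); // then remove the original
  }
}

// Demo: move a scratch file into a nested directory under the OS temp dir.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "move-demo-"));
const demoSrc = path.join(demoDir, "upload.tmp");
const demoDest = path.join(demoDir, "assets", "asset.bin");
fs.writeFileSync(demoSrc, "asset bytes");
moveFileSync(demoSrc, demoDest);
```

Note that the fallback is not atomic: a crash between the copy and the unlink can leave both files in place, which is usually acceptable for temporary assets.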


@MohamedBassem commented on GitHub (Jul 5, 2024):

I managed to repro the problem and I think I have a fix. Perfect timing before the release, thanks for the report!
