[GH-ISSUE #1265] Proxy for Chrome not working anymore #813

Closed
opened 2026-03-02 11:52:56 +03:00 by kerem · 18 comments
Owner

Originally created by @maidou-00 on GitHub (Apr 14, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1265

Hello everyone. I just recently upgraded to v0.23.2 (nightly build), and the proxy for Chrome is not working anymore... I live in an internet-censored place, so a proxy is a must.

Here is my config:

```yaml
chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    container_name: Hoarder-CHROME
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --proxy-server='https=172.21.0.1:1080'   # Note: it was --proxy-server=172.21.0.1:1080 before, and that worked
```

I also tried redeploying/restarting, the usual drills.
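One thing worth noting: in a compose block sequence there is no shell involved, so the single quotes in `--proxy-server='https=172.21.0.1:1080'` are passed to Chrome as part of the value, which makes the proxy specification invalid. A sketch of the flag without the stray quotes, based on Chrome's documented `--proxy-server` syntax (per-scheme mappings are `scheme=host:port` pairs, semicolon-separated):

```yaml
command:
  # one proxy for all schemes (the form that worked before):
  - --proxy-server=172.21.0.1:1080
  # or a per-scheme mapping, unquoted:
  - --proxy-server=https=172.21.0.1:1080
```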

Logs (trying to access Google):

```
2025-04-14T10:20:22.380Z info: [Crawler][2909] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:20:22.380Z info: [Crawler][2909] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:20:27.382Z error: [Crawler][2909] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:22:27.492Z error: [Crawler][2909] Crawling job failed: TimeoutError: Navigation timeout of 120000 ms exceeded
TimeoutError: Navigation timeout of 120000 ms exceeded
    at new Deferred (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:65:60)
    at CdpFrame.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:136:29)
    at CdpFrame.<anonymous> (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:590:43)
    at crawlPage (/app/apps/workers/crawlerWorker.ts:3:2115)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async crawlAndParseUrl (/app/apps/workers/crawlerWorker.ts:3:9435)
    at async runCrawler (/app/apps/workers/crawlerWorker.ts:3:13098)
2025-04-14T10:22:30.921Z info: [Crawler][2909] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:22:30.922Z info: [Crawler][2909] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:22:35.924Z error: [Crawler][2909] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:24:36.035Z error: [Crawler][2909] Crawling job failed: TimeoutError: Navigation timeout of 120000 ms exceeded
TimeoutError: Navigation timeout of 120000 ms exceeded
    at new Deferred (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:65:60)
    at CdpFrame.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:136:29)
    at CdpFrame.<anonymous> (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:590:43)
    at crawlPage (/app/apps/workers/crawlerWorker.ts:3:2115)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async crawlAndParseUrl (/app/apps/workers/crawlerWorker.ts:3:9435)
    at async runCrawler (/app/apps/workers/crawlerWorker.ts:3:13098)
```

Update

With the proxy set back to `--proxy-server=172.21.0.1:1080`, the logs are as follows. It seems the only problem is "Failed to determine the content-type for the url https://www.google.com"; Chrome is able to navigate and read the page, but the crawl still doesn't complete successfully.

```
2025-04-14T10:49:09.299Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2025-04-14T10:49:09.305Z info: [Crawler] Successfully resolved IP address, new address: http://172.83.0.17:9222/
2025-04-14T10:49:17.801Z info: [Crawler][2913] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:49:17.801Z info: [Crawler][2913] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:49:22.801Z error: [Crawler][2913] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:49:35.272Z info: [Crawler][2913] Successfully navigated to "https://www.google.com". Waiting for the page to load ...
2025-04-14T10:49:37.079Z info: [Crawler][2913] Finished waiting for the page to load.
2025-04-14T10:49:37.095Z info: [Crawler][2913] Successfully fetched the page content.
2025-04-14T10:49:38.262Z info: [Crawler][2913] Finished capturing page content and a screenshot. FullPageScreenshot: true
2025-04-14T10:49:38.269Z info: [Crawler][2913] Will attempt to extract metadata from page ...
2025-04-14T10:49:39.257Z info: [Crawler][2913] Will attempt to extract readable content ...
2025-04-14T10:49:40.088Z info: [Crawler][2913] Done extracting readable content.
2025-04-14T10:49:40.268Z info: [Crawler][2913] Stored the screenshot as assetId: 70e73c50-d5b5-4a92-a168-97589ed3d483
2025-04-14T10:51:48.809Z info: [Crawler][2913] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:51:48.809Z info: [Crawler][2913] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:51:48.820Z error: [Crawler][2913] Crawling job failed: Error: Timed-out after 150 secs
Error: Timed-out after 150 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
```

My env variables:

![Env variables](https://github.com/user-attachments/assets/5f254092-7183-4c92-8244-fc85bc9a64bb)
kerem closed this issue 2026-03-02 11:52:56 +03:00

@maidou-00 commented on GitHub (Apr 15, 2025):

Hello, I made it work by setting up the proxy in the router, but I still have a question.
My setup: I have a proxy client (shadowsocks) deployed as a Docker container on my NAS; the client connects to a proxy server on Google Cloud. Containers with environment variables such as `HTTP/HTTPS_PROXY=proxy-client.IP:port` are able to bypass the censorship. I also have a proxy client running on my router, but I usually don't enable it.

I suspect the problem in my case was: the crawler tries to access the webpage and determine its type before Chrome is initiated; if the crawler fails to determine the type or cannot reach the target page, it won't proceed to crawling even if Chrome itself can read the page. (Only speculation, correct me if I'm wrong.)

Strangely, I later tried to test my theory (turning off the proxy in the router first) by setting environment variables such as HTTP/HTTPS_PROXY (I even tried the lowercase ones) under the hoarder_web section, but it still didn't work.

Not sure what's going on, but I thought it was worth mentioning @MohamedBassem


@nickhelion commented on GitHub (Apr 21, 2025):

New to Karakeep but hitting the same problem. What is the latest version that works? Considering a downgrade.

> Hello, I made it work by setting up the proxy in the router, but I still have a question.
> My setup: I have a proxy client (shadowsocks) deployed as a Docker container on my NAS; the client connects to a proxy server on Google Cloud. Containers with environment variables such as `HTTP/HTTPS_PROXY=proxy-client.IP:port` are able to bypass the censorship. I also have a proxy client running on my router, but I usually don't enable it.
>
> I suspect the problem in my case was: the crawler tries to access the webpage and determine its type before Chrome is initiated; if the crawler fails to determine the type or cannot reach the target page, it won't proceed to crawling even if Chrome itself can read the page. (Only speculation, correct me if I'm wrong.)
>
> Strangely, I later tried to test my theory (turning off the proxy in the router first) by setting environment variables such as HTTP/HTTPS_PROXY (I even tried the lowercase ones) under the hoarder_web section, but it still didn't work.
>
> Not sure what's going on, but I thought it was worth mentioning @MohamedBassem


@MohamedBassem commented on GitHub (Apr 21, 2025):

> I suspect the problem in my case was, the crawler will try to access the webpage and determine the type before Chrome starts to initiate; if the crawler failed to determine the type/unable to access the target webpage, it won't proceed to crawling even if Chrome is able to read the webpage. (Only speculation, correct me if I'm wrong)

@maidou-00 you're half correct. The container will first attempt to figure out the content type (this step indeed doesn't use a proxy). If that fails (as it does in your case), it continues the crawl with Chrome just fine.

What seems to be happening in your case is that the step:

```
Will attempt to extract metadata from page ...
```

seems to be timing out for some reason. Does this happen for all websites? Can you try some other website and share the logs?
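The content-type probe described above runs in the worker process as a plain HTTP request, so Chrome's `--proxy-server` flag never applies to it. As a minimal sketch of what a proxied worker-side fetch would look like (using Python's standard library purely for illustration; the proxy address is the one from the compose file above, and Karakeep's actual HTTP client may differ):

```python
import urllib.request

def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes http/https requests through an explicit proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

opener = build_proxied_opener("http://172.21.0.1:1080")
# A content-type probe through the proxy would then be something like:
# opener.open("https://www.google.com", timeout=5).headers.get("Content-Type")
```

Note this only works if the proxy speaks HTTP CONNECT; a bare SOCKS endpoint (port 1080 is the classic SOCKS port) would need a SOCKS-capable client instead.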


@maidou-00 commented on GitHub (Apr 22, 2025):

> > I suspect the problem in my case was, the crawler will try to access the webpage and determine the type before Chrome starts to initiate; if the crawler failed to determine the type/unable to access the target webpage, it won't proceed to crawling even if Chrome is able to read the webpage. (Only speculation, correct me if I'm wrong)
>
> @maidou-00 you're half correct. The container will first attempt to figure out the content type (this step indeed doesn't use a proxy). If that fails (as it does in your case), it continues the crawl with Chrome just fine.
>
> What seems to be happening in your case is that the step:
>
> ```
> Will attempt to extract metadata from page ...
> ```
>
> seems to be timing out for some reason. Does this happen for all websites? Can you try some other website and share the logs?

Hi Mohamed, thanks for replying!

This "timing out" only happens for censored websites such as Google/YouTube etc.; the logs are just like the one I posted. Will upload some more logs later today.

For uncensored websites, the crawling process runs smoothly and usually takes less than a minute. Will upload some logs.

- My device: I run my Docker containers on a Synology NAS (20 GB of memory, but the CPU is a low-power one).
- Internet speed: my connection is about 600 Mbps, and browsing is fast enough for me. For the censored websites, with the proxy enabled, the speed is pretty fast too (I can watch 2K YouTube videos smoothly).

@nickhelion commented on GitHub (May 2, 2025):

I resolved my issue (Chrome not going through the flag-specified proxy server) by setting proxy environment variables inside the Docker container and then passing the `--proxy-auto-detect` flag to Chrome. The `--proxy-server` flag does not work in my setup.

The chrome section of my compose file looks like:

```yaml
chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    ports:
      - 9222:9222
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --proxy-auto-detect
    environment:
      HTTP_PROXY: http://my-proxy-host:7890
      HTTPS_PROXY: http://my-proxy-host:7890
      ALL_PROXY: http://my-proxy-host:7890
      http_proxy: http://my-proxy-host:7890
      https_proxy: http://my-proxy-host:7890
      all_PROXY: http://my-proxy-host:7890 # still needs investigating whether Chrome reads upper or lower case, so I set both
      NO_PROXY: "*.local;127.0.0.1;localhost" # not sure if it's correct format
```
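Two small notes on the environment block above, offered as assumptions to verify rather than a definitive fix: the lowercase counterpart of `ALL_PROXY` is conventionally `all_proxy` (not `all_PROXY`), and tools that honor these variables (curl, wget, and Chrome's Linux proxy resolver among them) generally expect `NO_PROXY` to be comma-separated rather than semicolon-separated:

```yaml
environment:
  all_proxy: http://my-proxy-host:7890
  NO_PROXY: "localhost,127.0.0.1,.local"
  no_proxy: "localhost,127.0.0.1,.local"
```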

@maidou-00 commented on GitHub (May 3, 2025):

> I resolved my issue (Chrome not going through the flag-specified proxy server) by setting proxy environment variables inside the Docker container and then passing the `--proxy-auto-detect` flag to Chrome. The `--proxy-server` flag does not work in my setup.
>
> The chrome section of my compose file looks like:
>
> ```yaml
> chrome:
>     image: gcr.io/zenika-hub/alpine-chrome:123
>     restart: unless-stopped
>     ports:
>       - 9222:9222
>     command:
>       - --no-sandbox
>       - --disable-gpu
>       - --disable-dev-shm-usage
>       - --remote-debugging-address=0.0.0.0
>       - --remote-debugging-port=9222
>       - --hide-scrollbars
>       - --proxy-auto-detect
>     environment:
>       HTTP_PROXY: http://my-proxy-host:7890
>       HTTPS_PROXY: http://my-proxy-host:7890
>       ALL_PROXY: http://my-proxy-host:7890
>       http_proxy: http://my-proxy-host:7890
>       https_proxy: http://my-proxy-host:7890
>       all_PROXY: http://my-proxy-host:7890
>       NO_PROXY: "*.local;127.0.0.1;localhost"
> ```

thanks Nick, I'll give it a try!


@maidou-00 commented on GitHub (May 14, 2025):

Hello Mohamed @MohamedBassem, sorry for the late reply. I've been trying to fix it but no luck so far. Here are some additional logs. I have:

1. disabled the VPN in my router
2. updated the proxy env variables as Nickhelion suggested, i.e. `--proxy-auto-detect`, and included the additional `ALL_PROXY`
3. set the timeout to 300 seconds, with 5 crawler workers; my device is a low-power NAS with 20 GB of memory

Logs from extracting YouTube:

```
2025-05-14T11:08:43.360Z info: [Crawler][3389] Will crawl "https://www.youtube.com" for link with id "ab35k06ojykaag4l8qgzsgqt"
2025-05-14T11:08:43.361Z info: [Crawler][3389] Attempting to determine the content-type for the url https://www.youtube.com
2025-05-14T11:08:48.365Z error: [Crawler][3389] Failed to determine the content-type for the url https://www.youtube.com: AbortError: The operation was aborted.
2025-05-14T11:08:51.695Z info: [Crawler][3389] Successfully navigated to "https://www.youtube.com". Waiting for the page to load ...
2025-05-14T11:08:56.697Z info: [Crawler][3389] Finished waiting for the page to load.
2025-05-14T11:08:56.757Z info: [Crawler][3389] Successfully fetched the page content.
2025-05-14T11:08:57.264Z info: [Crawler][3389] Finished capturing page content and a screenshot. FullPageScreenshot: true
2025-05-14T11:08:57.282Z info: [Crawler][3389] Will attempt to extract metadata from page ...
2025-05-14T11:09:01.137Z info: [Crawler][3389] Will attempt to extract readable content ...
2025-05-14T11:09:03.899Z info: [Crawler][3389] Done extracting readable content.
2025-05-14T11:09:04.644Z info: [Crawler][3389] Stored the screenshot as assetId: 8c777957-9488-41b8-a434-e6854bf4d39d
 ⨯ Error: ENOENT: no such file or directory, open '/data/assets/uutwl3rjowf2fezaspgd2ca7/bb180a59-c728-402a-8eeb-faaae39055f6/metadata.json'
    at async open (node:internal/fs/promises:633:25)
    at async w (/app/apps/web/.next/server/chunks/6815.js:1:1914)
    at async Object.readFile (node:internal/fs/promises:1237:14)
    at async Promise.all (index 0)
    at async q (/app/apps/web/.next/server/app/api/assets/[assetId]/route.js:1:2329)
    at async /app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:38411
    at async e_.execute (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:27880)
    at async e_.handle (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:39943)
    at async doRender (/app/node_modules/next/dist/server/base-server.js:1366:42)
    at async cacheEntry.responseCache.get.routeKind (/app/node_modules/next/dist/server/base-server.js:1588:28) {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: '/data/assets/uutwl3rjowf2fezaspgd2ca7/bb180a59-c728-402a-8eeb-faaae39055f6/metadata.json'
}
2025-05-14T11:13:43.063Z info: [Crawler][3389] Will crawl "https://www.youtube.com" for link with id "ab35k06ojykaag4l8qgzsgqt"
2025-05-14T11:13:43.064Z info: [Crawler][3389] Attempting to determine the content-type for the url https://www.youtube.com
Error: Timed-out after 300 secs
2025-05-14T11:13:43.329Z error: [Crawler][3389] Crawling job failed: Error: Timed-out after 300 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:588:17)
    at process.processTimers (node:internal/timers:523:7)
```

@maidou-00 commented on GitHub (Jun 13, 2025):

@nickhelion hey Nick, thanks for the tip. I've already tried your settings, but they don't seem to work; the crawler always gets stuck... thanks anyway


@nickhelion commented on GitHub (Jun 16, 2025):

> @nickhelion hey Nick, thanks for the tip. I've already tried your settings, but they don't seem to work; the crawler always gets stuck... thanks anyway

You're welcome. I don't understand the problem myself. I spent a lot of time coming up with a configuration I can use, and now I dare not touch it.


@mapleshadow commented on GitHub (Jul 2, 2025):

> Hello Mohamed, sorry for the late reply. I've been trying to fix it but no luck so far. Here are some additional logs: I have:
>
>   1. disabled the VPN in my router
>   2. updated the proxy env variables like Nickhelion suggested, i.e. `--proxy-auto-detect`, and included an additional `ALL_PROXY`
>   3. set the timeout to 300 seconds, with 5 crawler workers. My device is a low-power NAS with 20G of memory
>   4. the logs from extracting Youtube (the same logs quoted earlier in this thread, ending in the ENOENT error on metadata.json and the 300-second timeout)

I ran into the same problem, and my guess is that it's a bug. Once an asset folder doesn't exist (in your case, bb180a59-c728-402a-8eeb-faaae39055f6), the whole thing gets stuck in an endless loop. I manually created the missing folder and copy-pasted the bin and json files over from another folder; interestingly, the system broke out of the loop... problem solved.
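The manual workaround above can be sketched in a few shell commands. This is a hedged illustration, not an official karakeep procedure: the `/<user-id>/<asset-id>/` layout and the ids are taken from the ENOENT error in the logs, the seeded file names are assumptions, and a scratch directory stands in for the real mounted assets dir (e.g. `/data/assets`).

```shell
# Recreate a missing asset folder so the worker stops looping on ENOENT.
# A scratch directory stands in for the real assets dir; on a live install
# you would point ASSETS_DIR at the mounted data volume instead.
ASSETS_DIR="$(mktemp -d)"
USER_ID="uutwl3rjowf2fezaspgd2ca7"               # from the error path above
MISSING="bb180a59-c728-402a-8eeb-faaae39055f6"   # the asset id that ENOENTs
DONOR="8c777957-9488-41b8-a434-e6854bf4d39d"     # any healthy asset folder

# Simulate the healthy donor asset (already present on a real install)
mkdir -p "$ASSETS_DIR/$USER_ID/$DONOR"
echo '{}' > "$ASSETS_DIR/$USER_ID/$DONOR/metadata.json"

# The workaround: create the missing folder and seed it from the donor
# (the commenter also copied the .bin payload alongside the json)
mkdir -p "$ASSETS_DIR/$USER_ID/$MISSING"
cp "$ASSETS_DIR/$USER_ID/$DONOR/metadata.json" "$ASSETS_DIR/$USER_ID/$MISSING/"

ls "$ASSETS_DIR/$USER_ID/$MISSING"   # metadata.json
```

Whether the placeholder content is semantically valid for the stuck bookmark is uncertain; per the comment above, the point is only to break the ENOENT loop.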


@mapleshadow commented on GitHub (Jul 2, 2025):

> @nickhelion hey Nick, thanks for the tip. I've already tried your settings, but they don't seem to work; the crawler always gets stuck... thanks anyway

For your problem, you can try my approach above.


@maidou-00 commented on GitHub (Jul 3, 2025):

> > Hello Mohamed, sorry for the late reply. I've been trying to fix it but no luck so far. Here are some additional logs: I have: disabled the VPN in my router; updated the proxy env variables like Nickhelion suggested, i.e. `--proxy-auto-detect`, with an additional `ALL_PROXY`; set the timeout to 300 seconds, with 5 crawler workers. (The logs from extracting Youtube are the same as quoted earlier in this thread.)
>
> I ran into the same problem, and my guess is that it's a bug. Once an asset folder doesn't exist (in your case, bb180a59-c728-402a-8eeb-faaae39055f6), the whole thing gets stuck in an endless loop. I manually created the missing folder and copy-pasted the bin and json files over from another folder; interestingly, the system broke out of the loop... problem solved.

Thanks for the reply! Do you have to create a folder for the missing one by hand every time?


@mapleshadow commented on GitHub (Jul 3, 2025):

> > I ran into the same problem, and my guess is that it's a bug. Once an asset folder doesn't exist (in your case, bb180a59-c728-402a-8eeb-faaae39055f6), the whole thing gets stuck in an endless loop. I manually created the missing folder and copy-pasted the bin and json files over from another folder; interestingly, the system broke out of the loop... problem solved.
>
> Thanks for the reply! Do you have to create a folder for the missing one by hand every time?

Usually it doesn't come to that; I've only run into it once. I have, however, hit crawl failures many times for no clear reason. The fix is: in the user settings, find the failed ones, delete them, and re-share them to karakeep so it crawls again. Or find the failed ones in the list (the ones stuck showing only the URL with no content), delete them, and re-crawl.
I feel karakeep's database layer isn't up to the job. To make it reliable, I suspect it needs Redis integration, or an additional PostgreSQL database hooked up, because I keep running into database lockups.
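Since broken bookmarks apparently have to be hunted down one by one, a small scan can first list asset folders that lack `metadata.json` before applying the manual fix above. A minimal sketch, assuming the `/<user-id>/<asset-id>/metadata.json` layout seen in the earlier ENOENT error; the fake directories only illustrate the loop.

```shell
# List asset directories that are missing metadata.json.
# A scratch layout stands in for the real assets dir (e.g. /data/assets).
ASSETS_DIR="$(mktemp -d)"
mkdir -p "$ASSETS_DIR/user1/good" "$ASSETS_DIR/user1/broken"
echo '{}' > "$ASSETS_DIR/user1/good/metadata.json"

# Every directory printed here is a candidate for the manual fix above
for d in "$ASSETS_DIR"/*/*/; do
  [ -f "${d}metadata.json" ] || echo "missing metadata.json: $d"
done
```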


@mapleshadow commented on GitHub (Jul 4, 2025):

> Thanks for the reply! Do you have to create a folder for the missing one by hand every time?

One more note: do not use 0.50.0; it really does have a bug that gets stuck. Switch to `latest` instead.


@MohamedBassem commented on GitHub (Jul 4, 2025):

I don't know what's going on in this discussion, but all I can say is that native proxy support is coming in the next release.


@maidou-00 commented on GitHub (Jul 9, 2025):

I don't know what's going on in this discussion, but all I can say is that native proxy support is coming in the next release.

I've looked through most of the proxy-related issues and tried the various workarounds suggested by people with similar problems. Unfortunately, nothing worked for me.
Very glad to hear that native proxy support is coming, and very much looking forward to it!
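As an aside for anyone retrying the compose route: in a compose `command:` list there is no shell, so YAML passes each item to the process verbatim, quotes included. That means `--proxy-server='https=172.21.0.1:1080'` hands Chrome a value with literal single quotes in it, which Chrome will likely treat as an invalid proxy rule and ignore. A sketch of the unquoted forms (address and port are taken from the config in this thread; this is not a confirmed fix):

```yaml
command:
  - --no-sandbox
  - --remote-debugging-address=0.0.0.0
  - --remote-debugging-port=9222
  # YAML list items reach the process verbatim, so no quotes around the value
  - --proxy-server=172.21.0.1:1080
  # or per-scheme proxy rules (again without quotes):
  # - --proxy-server=https=172.21.0.1:1080;http=172.21.0.1:1080
```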

<!-- gh-comment-id:3050981767 -->

@maidou-00 commented on GitHub (Jul 9, 2025):

Hi Mohamed, sorry for the late reply. I've been trying to fix it, but no luck so far. Here are some additional logs. I have:

  1. Disabled the VPN in my router
  2. Updated the proxy environment variable Nickhelion suggested (i.e. `ALL_PROXY`), and included an extra `--proxy-auto-detect` flag
  3. Set the timeout to 300 seconds with 5 crawler workers. My device is a low-power NAS with 20 GB of RAM
  4. Logs from crawling YouTube:
```
2025-05-14T11:08:43.360Z info: [Crawler][3389] Will crawl "https://www.youtube.com" for link with id "ab35k06ojykaag4l8qgzsgqt"
2025-05-14T11:08:43.361Z info: [Crawler][3389] Attempting to determine the content-type for the url https://www.youtube.com
2025-05-14T11:08:48.365Z error: [Crawler][3389] Failed to determine the content-type for the url https://www.youtube.com: AbortError: The operation was aborted.
2025-05-14T11:08:51.695Z info: [Crawler][3389] Successfully navigated to "https://www.youtube.com". Waiting for the page to load ...
2025-05-14T11:08:56.697Z info: [Crawler][3389] Finished waiting for the page to load.
2025-05-14T11:08:56.757Z info: [Crawler][3389] Successfully fetched the page content.
2025-05-14T11:08:57.264Z info: [Crawler][3389] Finished capturing page content and a screenshot. FullPageScreenshot: true
2025-05-14T11:08:57.282Z info: [Crawler][3389] Will attempt to extract metadata from page ...
2025-05-14T11:09:01.137Z info: [Crawler][3389] Will attempt to extract readable content ...
2025-05-14T11:09:03.899Z info: [Crawler][3389] Done extracting readable content.
2025-05-14T11:09:04.644Z info: [Crawler][3389] Stored the screenshot as assetId: 8c777957-9488-41b8-a434-e6854bf4d39d
 ⨯ Error: ENOENT: no such file or directory, open '/data/assets/uutwl3rjowf2fezaspgd2ca7/bb180a59-c728-402a-8eeb-faaae39055f6/metadata.json'
    at async open (node:internal/fs/promises:633:25)
    at async w (/app/apps/web/.next/server/chunks/6815.js:1:1914)
    at async Object.readFile (node:internal/fs/promises:1237:14)
    at async Promise.all (index 0)
    at async q (/app/apps/web/.next/server/app/api/assets/[assetId]/route.js:1:2329)
    at async /app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:38411
    at async e_.execute (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:27880)
    at async e_.handle (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:6:39943)
    at async doRender (/app/node_modules/next/dist/server/base-server.js:1366:42)
    at async cacheEntry.responseCache.get.routeKind (/app/node_modules/next/dist/server/base-server.js:1588:28) {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: '/data/assets/uutwl3rjowf2fezaspgd2ca7/bb180a59-c728-402a-8eeb-faaae39055f6/metadata.json'
}
2025-05-14T11:13:43.063Z info: [Crawler][3389] Will crawl "https://www.youtube.com" for link with id "ab35k06ojykaag4l8qgzsgqt"
2025-05-14T11:13:43.064Z info: [Crawler][3389] Attempting to determine the content-type for the url https://www.youtube.com
Error: Timed-out after 300 secs
2025-05-14T11:13:43.329Z error: [Crawler][3389] Crawling job failed: Error: Timed-out after 300 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:588:17)
    at process.processTimers (node:internal/timers:523:7)
```
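The ENOENT above points at a single asset folder that went missing on disk. One way to enumerate every asset folder that lacks its `metadata.json` is sketched below. The `/data/assets/<userId>/<assetId>/` layout is inferred from the error path, and the snippet builds a throwaway sandbox so it can run anywhere; on a live deployment you would point `DATA` at the real mount instead.

```shell
# Sandbox stand-in for /data/assets; replace with the real mount path.
DATA=$(mktemp -d)

# Fake one healthy and one broken asset folder for demonstration.
mkdir -p "$DATA/user1/asset-ok" "$DATA/user1/asset-broken"
echo '{}' > "$DATA/user1/asset-ok/metadata.json"

# Report every <userId>/<assetId>/ folder with no metadata.json inside.
# Here only asset-broken is reported.
for asset in "$DATA"/*/*/; do
  [ -f "${asset}metadata.json" ] || echo "missing: $asset"
done
```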

> I ran into the same problem; I'll just write in Chinese. It's probably a bug. Once a folder doesn't exist, e.g. your bb180a59-c728-402a-8eeb-faaae39055f6 is missing, the whole thing gets stuck in an infinite loop. I manually created the missing folder and copy-pasted the bin and json files from another folder; interestingly, the system broke out of the loop... problem solved...

> Thanks for the reply! Do you have to create a folder by hand every time one is missing?

> One more note: do not use 0.50.0, it really does have a bug that gets stuck. Switch to `latest` instead.

Thanks for the reply. Unfortunately I am still stuck... it looks like Mohamed will include native proxy support in the next release, I'm just gonna wait for it haha.
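For reference, the manual workaround quoted above (recreate the missing folder and copy a sibling asset's files into it) can be sketched as follows. Everything here is an example: the snippet builds a sandbox instead of touching `/data`, and on a real deployment `ASSETS` would be `/data/assets/<userId>` with `MISSING` set to the assetId from the ENOENT error.

```shell
# Sandbox stand-in for /data/assets/<userId>; replace with the real path.
ASSETS=$(mktemp -d)

# A "donor" asset folder to copy from (any existing sibling would do).
mkdir -p "$ASSETS/donor"
echo '{}' > "$ASSETS/donor/metadata.json"
: > "$ASSETS/donor/asset.bin"

# The assetId the server complained about (example value from the log).
MISSING="bb180a59-c728-402a-8eeb-faaae39055f6"

# Recreate the missing folder and seed it so the lookup stops failing.
mkdir -p "$ASSETS/$MISSING"
cp "$ASSETS/donor/metadata.json" "$ASSETS/donor/asset.bin" "$ASSETS/$MISSING/"
```

Note that the copied metadata belongs to a different asset, so this only unblocks the loop (as the commenter reported); it does not restore the lost content.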

<!-- gh-comment-id:3050987861 -->

@mapleshadow commented on GitHub (Jul 10, 2025):

> Thanks for the reply. Unfortunately I am still stuck... it looks like Mohamed will include native proxy support in the next release, I'm just gonna wait for it haha.

Yes, it still gets stuck; this bug has always been there. I've switched to linkwarden for now, and it's been pretty good so far: it has everything except the crawling feature.

<!-- gh-comment-id:3057130294 -->
Reference
starred/karakeep#813