[GH-ISSUE #331] Unable to crawl any site #215

Closed
opened 2026-03-02 11:47:41 +03:00 by kerem · 6 comments

Originally created by @djl0 on GitHub (Jul 27, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/331

I'm creating a new issue after I originally (incorrectly) thought it was related to https://github.com/hoarder-app/hoarder/issues/327

When adding a bookmark, either via the web UI or the CLI, it can't fetch anything from the target site. Looking at the container logs, the worker container can't connect to the chrome container:

```
2024-07-27T21:45:35.867Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
2024-07-27T21:45:36.609Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
2024-07-27T21:45:36.666Z info: [Crawler][6] Will crawl "https://core.telegram.org/api/" for link with id "hgfhadzqyrs96i1z1l1zmv2t"
2024-07-27T21:45:36.666Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
2024-07-27T21:45:37.316Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
2024-07-27T21:45:37.317Z error: [Crawler][6] Crawling job failed: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:

  logger.info(
  `[Crawler][${jobId}] Successfully navigated to "${url}". Waiting for the page to load ...`,
  )

2024-07-27T21:45:38.527Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-07-27T21:45:38.528Z info: [Crawler] Successfully resolved IP address, new address: http://172.18.0.5:9222/
2024-07-27T21:45:48.531Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
```

The chrome container logs an error about dbus. I'm not sure how important that is. It's discussed as an issue [here](https://github.com/cypress-io/cypress/issues/4925) for another project, but I couldn't run `apt` inside the container, so I couldn't try installing dbus.

```
[0727/213832.985728:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.009490:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.009634:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0727/213833.231216:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
[0727/213833.303223:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0727/213833.303483:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
[0727/213833.347854:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.

DevTools listening on ws://0.0.0.0:9222/devtools/browser/d3983fdb-6ddf-4b23-901b-226e2fa783ce
```

Any guidance would be greatly appreciated!

kerem 2026-03-02 11:47:41 +03:00 · closed this issue · added the `question` label

@MohamedBassem commented on GitHub (Jul 27, 2024):

Ok, now it's yet another error.

> 2024-07-27T21:45:48.531Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs

Can you share your compose file? Did you rename the chrome container name?
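For context, the worker dials the browser by service name (`http://chrome:9222` in the logs above), so the compose service name has to match the host in the worker's configured browser URL. A minimal sketch of what such a service typically looks like; the image tag and flags below are illustrative, so check the project's shipped `docker-compose.yml` for the authoritative version:

```yaml
# Illustrative compose excerpt (not the project's exact file).
# The service name "chrome" must match the hostname the worker
# is configured to connect to for remote debugging.
chrome:
  image: gcr.io/zenika-hub/alpine-chrome:123
  restart: unless-stopped
  command:
    - --no-sandbox
    - --disable-gpu
    - --disable-dev-shm-usage
    - --remote-debugging-address=0.0.0.0
    - --remote-debugging-port=9222
    - --hide-scrollbars
```

Renaming this service without updating the worker's browser URL breaks DNS resolution of `chrome`, which produces exactly the "Failed to connect to the browser instance" retry loop above.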


@hongruilin commented on GitHub (Jul 29, 2024):

> I'm creating a new issue after I originally (incorrectly) thought it was related to #327
>
> When adding a bookmark either in web or CLI, it's not able to fetch anything from the target site. Looking at the container logs, the worker container can't connect to the chrome container:
>
> ```
> 2024-07-27T21:45:35.867Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
> 2024-07-27T21:45:36.609Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
> 2024-07-27T21:45:36.666Z info: [Crawler][6] Will crawl "https://core.telegram.org/api/" for link with id "hgfhadzqyrs96i1z1l1zmv2t"
> 2024-07-27T21:45:36.666Z info: [Crawler][6] Attempting to determine the content-type for the url https://core.telegram.org/api/
> 2024-07-27T21:45:37.316Z info: [Crawler][6] Content-type for the url https://core.telegram.org/api/ is "text/html; charset=utf-8"
> 2024-07-27T21:45:37.317Z error: [Crawler][6] Crawling job failed: AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
>
>   logger.info(
>   `[Crawler][${jobId}] Successfully navigated to "${url}". Waiting for the page to load ...`,
>   )
>
> 2024-07-27T21:45:38.527Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
> 2024-07-27T21:45:38.528Z info: [Crawler] Successfully resolved IP address, new address: http://172.18.0.5:9222/
> 2024-07-27T21:45:48.531Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
> ```
>
> The chrome container has an error regarding dbus. Not sure how important that is. It's discussed being an issue [here](https://github.com/cypress-io/cypress/issues/4925) for another project, but I couldn't run `apt` inside the container, so I couldn't try installing dbus.
>
> ```
> [0727/213832.985728:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
> [0727/213833.009490:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
> [0727/213833.009634:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
> [0727/213833.231216:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
> [0727/213833.303223:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
> [0727/213833.303483:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
> [0727/213833.347854:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
>
> DevTools listening on ws://0.0.0.0:9222/devtools/browser/d3983fdb-6ddf-4b23-901b-226e2fa783ce
> ```
>
> Any guidance would be greatly appreciated!

Has your issue been resolved? I encountered the same problem when deploying with Docker on Windows; even changing the Chrome container name didn't solve it. Currently, I can only run hoarder normally on Linux.


@Antebios commented on GitHub (Aug 2, 2024):

I am having a very similar issue. I am using the Kubernetes templates provided by this project. My bookmarks are imported, but they cannot be crawled:

```
2024-08-02T17:26:24.210Z info: [Crawler][8239] Will crawl "http://forum.xda-developers.com/showthread.php?t=838448" for link with id "zbqya9md0clzrahb6vls2qda"
2024-08-02T17:26:24.210Z info: [Crawler][8239] Attempting to determine the content-type for the url http://forum.xda-developers.com/showthread.php?t=838448
2024-08-02T17:26:24.727Z info: [Crawler][8239] Content-type for the url http://forum.xda-developers.com/showthread.php?t=838448 is "text/html; charset=utf-8"
2024-08-02T17:26:24.729Z error: [Crawler][8239] Crawling job failed: AssertionError [ERR_ASSERTION]: undefined == true
```

The chrome container logs are these:

```
[0802/171801.792156:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0802/171801.792183:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
```

@jacobsandersen commented on GitHub (Aug 7, 2024):

I am experiencing the same issue.


@MohamedBassem commented on GitHub (Aug 21, 2024):

For anyone who was facing crawling issues on Kubernetes: apparently the chrome service was missing from the templates, and it's now fixed in https://github.com/hoarder-app/hoarder/pull/358/files

cc. @Antebios
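In other words, the worker needs a Service object so that the hostname `chrome` resolves inside the cluster. A hedged sketch of what such a manifest looks like; the names and labels here are illustrative assumptions, so see the linked PR for the actual fix:

```yaml
# Illustrative Kubernetes Service (names/labels are assumptions, not the
# exact PR contents). It gives the browser pod a stable DNS name "chrome"
# so the worker can reach http://chrome:9222.
apiVersion: v1
kind: Service
metadata:
  name: chrome
spec:
  selector:
    app: chrome
  ports:
    - name: remote-debugging
      port: 9222
      targetPort: 9222
```

Without a Service, the worker's DNS lookup of `chrome` fails even if the browser pod itself is healthy, which matches the `AssertionError` symptoms reported above.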


@MohamedBassem commented on GitHub (Sep 22, 2024):

Closing this issue for now, feel free to open it if you're still facing the issue.
