[GH-ISSUE #414] Bypassing cookie and GDPR banner #266

Open
opened 2026-03-02 11:48:13 +03:00 by kerem · 63 comments

Originally created by @TurbulenceDeterministe on GitHub (Sep 23, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/414

When I use Hoarder on a YouTube link, the crawler gets stuck on the cookie banner. Any idea how to solve this?

![image](https://github.com/user-attachments/assets/0c50b791-6abc-45f0-9894-4b81e08da983)


@CrypticC3s4r commented on GitHub (Oct 5, 2024):

@Dwelled2593 can you provide some example links?


@pix commented on GitHub (Oct 13, 2024):

It's a Cookies / GDPR notice:

![image](https://github.com/user-attachments/assets/08fbca24-15e7-40d2-b7f1-f333e160f04f)

From: https://www.youtube.com/watch?v=E-5b1iGNraM (probably an EU thing)


@vhsdream commented on GitHub (Nov 18, 2024):

Can confirm this also happens with other types of sites that have notices a human has to click - for instance [this article](https://www.nytimes.com/2024/11/18/us/politics/trump-military-mass-deportation.html?unlocked_article_code=1.a04.HpN3.IOW_OSerqqC5) from the New York Times.

As a slightly humorous (and kind of irritating) aside, when I asked my local LLM (Llama3.2) to summarize the article, this was its response:

![NYT](https://github.com/user-attachments/assets/9bacebf1-ee39-4b89-90ba-88fe6bd7f505)

Edit: forgot to show what that news link appears as:

![image](https://github.com/user-attachments/assets/4bf21ffe-2bde-4bef-8f83-5c75b5702b67)


@hedger commented on GitHub (Dec 12, 2024):

As someone with a massive amount of YT links in bookmarks, I'd love to see this fixed.
ArchiveBox seems to handle it correctly when being fed direct YT links.


@ctschach commented on GitHub (Jan 6, 2025):

Okay, I'm running into the same issue.

One way to get around this would be the ability to provide a cookie file that is used when crawling sites. With this in place, YT would know that the cookies are already accepted. You can use a Chrome extension called "Get cookies.txt locally" to get the required cookies. This would probably be helpful for other sites too.
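The exported cookie file can be sanity-checked outside the crawler first. A hedged sketch using yt-dlp's `--cookies` flag (the file path and URL here are just examples, not anything Hoarder reads today):

```shell
# Export cookies.txt with the browser extension, then verify the file is
# accepted by a tool that understands the Netscape cookie format.
# --skip-download fetches only metadata, so this is a cheap check.
yt-dlp --cookies ./cookies.txt \
  --skip-download --write-thumbnail \
  "https://www.youtube.com/watch?v=E-5b1iGNraM"
```

If this works, the same cookies.txt would in principle be usable by any crawler that can import Netscape-format cookie files.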


@cakonopka commented on GitHub (Jan 20, 2025):

Got the same issue with Instagram posts and X posts.


@fcorvelo commented on GitHub (Jan 22, 2025):

I don't know if this is the best solution, but I managed to get this to work using the "i-dont-care-about-cookies" Chrome extension. This extension automatically accepts and hides all the cookie banners.

The way I made this work was to make the following changes in the Chrome container:

  1. Use the `--headless=new` flag to allow loading extensions in headless mode.
  2. Create a bind mount between a folder on the host containing the unzipped extension code and some folder inside the container.
  3. Use the `--load-extension=/i-dont-care-about-cookies` flag with the path to the previously mounted folder to tell Chrome where to load the extension from.

And that's it! This way, when Hoarder calls Chrome it will load the extension and the extension will accept all the cookie banners, making them disappear.
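For reference, the three steps above roughly translate into a single `docker run` invocation like this (an untested sketch; the host folder name is an example, and the image/flags mirror the compose setups shown elsewhere in this thread):

```shell
# Run the alpine-chrome image with remote debugging exposed and the
# unpacked extension bind-mounted, then loaded via --load-extension.
docker run --rm -p 9222:9222 \
  -v "$PWD/i-dont-care-about-cookies:/i-dont-care-about-cookies" \
  gcr.io/zenika-hub/alpine-chrome:123 \
  --no-sandbox \
  --headless=new \
  --remote-debugging-address=0.0.0.0 \
  --remote-debugging-port=9222 \
  --load-extension=/i-dont-care-about-cookies
```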

One caveat (and this is why I say that maybe this is not the best solution) is that enabling the `--headless=new` flag causes the following error on the Hoarder web container, which keeps appearing in the logs every 5 seconds. I didn't see any issues using Hoarder despite this error, but perhaps there is something that I'm not aware of.

```
[Crawler] Failed to connect to the browser instance, will retry in 5 secs: TypeError: Failed to fetch browser webSocket URL from http://172.18.0.3:9222/json/version: fetch failed
```

So, I hope that this can help while there is no official way of doing it 😃


@kassyss commented on GitHub (Jan 23, 2025):

@fcorvelo you made my day!
I was eager for a workaround until this issue gets resolved, and you delivered. Who knows, maybe the final fix will be based on a Chromium extension.

Thank you very much


@Deathproof76 commented on GitHub (Jan 24, 2025):

Without trying the workaround, I don't seem to have the same problem for the link previews. Also located in the EU.

For visibility (because of the giant screenshots), and maybe for debugging, here's the relevant part of my .env:

```
CRAWLER_FULL_PAGE_SCREENSHOT=true
CRAWLER_FULL_PAGE_ARCHIVE=true
CRAWLER_DOWNLOAD_BANNER_IMAGE=true
CRAWLER_FULL_PAGE_ARCHIVE=true
```

Another thing I do differently from the standard compose.yaml is that I exposed the Chrome port directly:

```
chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    ports:
      - 9222:9222
```

And connected the chrome instance via `BROWSER_WEB_URL: http://192.168.0.208:9222` (LAN IP of the server, directly in the compose.yaml for the hoarder container).

![Image](https://github.com/user-attachments/assets/c0c962aa-98a4-4bcd-9fe1-1c31456ff002)

but of course:

![Image](https://github.com/user-attachments/assets/d30e9f4c-811a-406a-b3e2-197d81ba6889)

![Image](https://github.com/user-attachments/assets/cfbc790b-56dd-47e4-bd57-1630a25934b8)


@Deathproof76 commented on GitHub (Jan 24, 2025):

@fcorvelo If you have a minute:
I actually tried the workaround, but am most likely not savvy enough to get the instructions right and need some guidance.

  1. downloaded the latest release of https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/releases/tag/v1.1.4 - this could be the first mistake, but I couldn't find an unpacked extension of the standard "I don't care..." on the fly, so I used that one
  2. unzipped the contents of https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/releases/download/v1.1.4/ISDCAC-chrome-source.zip

![Image](https://github.com/user-attachments/assets/17c53d7c-27c0-48de-b75b-36f387b6a62e)

  3. I then renamed the folder from ISDCAC-chrome-source to `cookies`, so the folder with the unpacked extension files is in `/mnt/Dockerspace/hoarder/cookies`, at its root looking just like the first screenshot
  4. added the commands from the workaround and the bind mount like this in the compose:
```
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    ports:
      - 9222:9222
    volumes:
      - /mnt/Dockerspace/hoarder/cookies:/cookies
    command:
      - --no-sandbox
      - --headless=new
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --load-extension=/cookies
```

I can see the mounted folder including contents in the chrome container at its root /cookies

which leads to

![Image](https://github.com/user-attachments/assets/908b97cb-5eb9-48ea-8b9a-933902668057)

Otherwise the preview seemed to work

![Image](https://github.com/user-attachments/assets/ace8482f-7e0b-4a20-af35-83995991e8d2)

and the archive looked like this:

![Image](https://github.com/user-attachments/assets/96439de1-fe05-4855-bd58-6d202d720113)

edit:

logs of the chrome container:

```
[1:25:0124/091356.185723:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[1:22:0124/091356.189722:ERROR:bus.cc(407)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[1:1:0124/091356.215888:ERROR:policy_logger.cc(157)] :components/enterprise/browser/controller/chrome_browser_cloud_management_controller.cc(161) Cloud management controller initialization aborted as CBCM is not enabled. Please use the `--enable-chrome-browser-cloud-management` command line flag to enable it if you are not using the official Google Chrome build.
DevTools listening on ws://127.0.0.1:9222/devtools/browser/4f1d34fa-705a-41c5-8a15-f661e88459f8
```

logs of the hoarder container upon hoarding:

```
2025-01-24T09:20:10.159Z info: [Crawler] Successfully resolved IP address, new address: http://192.168.0.208:9222/
2025-01-24T09:20:10.160Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs: TypeError: Failed to fetch browser webSocket URL from http://192.168.0.208:9222/json/version: fetch failed
    at node:internal/deps/undici/undici:13484:13
    at async getWSEndpoint (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:94:24)
    at async getConnectionTransport (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:81:31)
    at async _connectToBrowser (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:50:50)
    at async PuppeteerExtra.connect (/app/apps/workers/node_modules/.pnpm/puppeteer-extra@3.3.6_puppeteer@22.3.0_typescript@5.3.3_/node_modules/puppeteer-extra/dist/index.cjs.js:151:25)
    at async /app/apps/workers/crawlerWorker.ts:2:4664
2025-01-24T09:20:14.068Z info: [Crawler][7355] Will crawl "https://www.youtube.com/watch?v=qDdjGRhkM0M" for link with id "tftr7o4gasebymn71tb5uxjj"
2025-01-24T09:20:14.068Z info: [Crawler][7355] Attempting to determine the content-type for the url https://www.youtube.com/watch?v=qDdjGRhkM0M
2025-01-24T09:20:14.251Z info: [search][7356] Attempting to index bookmark with id tftr7o4gasebymn71tb5uxjj ...
2025-01-24T09:20:14.361Z info: [search][7356] Completed successfully
2025-01-24T09:20:14.588Z info: [Crawler][7355] Content-type for the url https://www.youtube.com/watch?v=qDdjGRhkM0M is "text/html; charset=utf-8"
2025-01-24T09:20:14.589Z info: [Crawler][7355] Running in browserless mode. Will do a plain http request to "https://www.youtube.com/watch?v=qDdjGRhkM0M". Screenshots will be disabled.
2025-01-24T09:20:14.624Z info: [Crawler][7355] Successfully fetched the content of "https://www.youtube.com/watch?v=qDdjGRhkM0M". Status: 200, Size: 0
2025-01-24T09:20:15.160Z info: [Crawler] Connecting to existing browser instance: http://192.168.0.208:9222
2025-01-24T09:20:15.160Z info: [Crawler] Successfully resolved IP address, new address: http://192.168.0.208:9222/
2025-01-24T09:20:15.161Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs: TypeError: Failed to fetch browser webSocket URL from http://192.168.0.208:9222/json/version: fetch failed
    at node:internal/deps/undici/undici:13484:13
    at async getWSEndpoint (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:94:24)
    at async getConnectionTransport (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:81:31)
    at async _connectToBrowser (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/common/BrowserConnector.js:50:50)
    at async PuppeteerExtra.connect (/app/apps/workers/node_modules/.pnpm/puppeteer-extra@3.3.6_puppeteer@22.3.0_typescript@5.3.3_/node_modules/puppeteer-extra/dist/index.cjs.js:151:25)
    at async /app/apps/workers/crawlerWorker.ts:2:4664
2025-01-24T09:20:15.231Z info: [Crawler][7355] Will attempt to extract metadata from page ...
2025-01-24T09:20:15.658Z info: [Crawler][7355] Will attempt to extract readable content ...
2025-01-24T09:20:15.823Z info: [Crawler][7355] Done extracting readable content.
2025-01-24T09:20:15.823Z info: [Crawler][7355] Skipping storing the screenshot as it's empty.
2025-01-24T09:20:15.838Z info: [Crawler][7355] Done extracting metadata from the page.
2025-01-24T09:20:15.838Z info: [Crawler][7355] Downloading image from "https://i.ytimg.com/vi/qDdjGRhkM0M/hqdefault.jpg"
2025-01-24T09:20:15.917Z info: [Crawler][7355] Downloaded image as assetId: bdebf978-0726-4139-bb0c-cefc57454381
2025-01-24T09:20:15.928Z info: [Crawler][7355] Will attempt to archive page ...
2025-01-24T09:20:16.839Z info: [search][7358] Attempting to index bookmark with id tftr7o4gasebymn71tb5uxjj ...
2025-01-24T09:20:16.847Z info: [VideoCrawler][7359] Attempting to download a file from "https://www.youtube.com/watch?v=qDdjGRhkM0M" to "/tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329" using the following arguments: "https://www.youtube.com/watch?v=qDdjGRhkM0M,-f,best[filesize<800M],-o,/tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329,--no-playlist"
2025-01-24T09:20:16.885Z info: [inference][7357] Starting an inference job for bookmark with id "tftr7o4gasebymn71tb5uxjj"
2025-01-24T09:20:16.959Z info: [search][7358] Completed successfully
shortMessage=Command failed with exit code 1: yt-dlp 'https://www.youtube.com/watch?v=qDdjGRhkM0M' -f 'best[filesize<800M]' -o /tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329 --no-playlist command=yt-dlp https://www.youtube.com/watch?v=qDdjGRhkM0M -f best[filesize<800M] -o /tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329 --no-playlist escapedCommand=yt-dlp 'https://www.youtube.com/watch?v=qDdjGRhkM0M' -f 'best[filesize<800M]' -o /tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329 --no-playlist cwd=/app/apps/workers durationMs=2114.489444 failed=true timedOut=false isCanceled=false isGracefullyCanceled=false isTerminated=false isMaxBuffer=false isForcefullyTerminated=false exitCode=1 stdout=[youtube] Extracting URL: https://www.youtube.com/watch?v=qDdjGRhkM0M
[youtube] qDdjGRhkM0M: Downloading webpage
[youtube] qDdjGRhkM0M: Downloading ios player API JSON
[youtube] qDdjGRhkM0M: Downloading mweb player API JSON
[youtube] qDdjGRhkM0M: Downloading m3u8 information stderr=ERROR: [youtube] qDdjGRhkM0M: Requested format is not available. Use --list-formats for a list of available formats stdio=[null,"[youtube] Extracting URL: https://www.youtube.com/watch?v=qDdjGRhkM0M\n[youtube] qDdjGRhkM0M: Downloading webpage\n[youtube] qDdjGRhkM0M: Downloading ios player API JSON\n[youtube] qDdjGRhkM0M: Downloading mweb player API JSON\n[youtube] qDdjGRhkM0M: Downloading m3u8 information","ERROR: [youtube] qDdjGRhkM0M: Requested format is not available. Use --list-formats for a list of available formats"] ipcOutput=[] pipedFrom=[]
2025-01-24T09:20:18.962Z error: [VideoCrawler][7359] Failed to download a file from "https://www.youtube.com/watch?v=qDdjGRhkM0M" to "/tmp/video_downloads/56719752-c7f0-4933-96d7-b03e58a58329"
2025-01-24T09:20:18.963Z info: [VideoCrawler][7359] Video Download Completed successfully
2025-01-24T09:20:20.161Z info: [Crawler] Connecting to existing browser instance: http://192.168.0.208:9222
2025-01-24T09:20:20.161Z info: [Crawler] Successfully resolved IP address, new address: http://192.168.0.208:9222/
2025-01-24T09:20:21.017Z info: [inference][7357] Inferring tag for bookmark "tftr7o4gasebymn71tb5uxjj" used 345 tokens and inferred: Radiohead,Live Performance,Music,YouTube,2003,Live Show
2025-01-24T09:20:21.042Z info: [inference][7357] Completed successfully
2025-01-24T09:20:21.968Z info: [search][7360] Attempting to index bookmark with id tftr7o4gasebymn71tb5uxjj ...
2025-01-24T09:20:22.084Z info: [search][7360] Completed successfully
2025-01-24T09:20:22.639Z info: [Crawler][7355] Done archiving the page as assetId: 9b6fb846-0f73-442a-9401-f89237b07b97
2025-01-24T09:20:22.646Z info: [Crawler][7355] Completed successfully
```
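As an aside on the yt-dlp failure in the log above: `-f best[filesize<800M]` only matches premuxed single-file formats, and the filter may also reject formats that don't report a filesize, which could explain "Requested format is not available". A hedged troubleshooting sketch (the merge variant assumes ffmpeg is installed):

```shell
# List the formats YouTube actually offers for this video.
yt-dlp -F "https://www.youtube.com/watch?v=qDdjGRhkM0M"

# Retry with a selector that can merge separate video+audio streams
# and falls back to the best premuxed file if merging isn't possible.
yt-dlp -f "bv*[filesize<800M]+ba/b[filesize<800M]/b" \
  -o /tmp/video_downloads/test.mp4 --no-playlist \
  "https://www.youtube.com/watch?v=qDdjGRhkM0M"
```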


@Deathproof76 commented on GitHub (Jan 24, 2025):

For other sites I get the cached content and the archive, but video download and the full-page screenshot seem to be broken. So I've definitely done something wrong with the chrome container.


@fcorvelo commented on GitHub (Jan 24, 2025):

@Deathproof76 I just checked again on my end and I can confirm that I see the same behavior as you do. So it appears that this workaround is not as good as I thought it would be.

For some sites like YouTube it does skip the banner and picks up the correct title and image for the card list, as you can see below. And this was what I was personally more interested in achieving. But as you mention, the content itself and the screenshots seem to be broken.

Before:

![Image](https://github.com/user-attachments/assets/7a94f1df-3943-4e37-9e65-a44860759589)

After:

![Image](https://github.com/user-attachments/assets/2243a2c2-3b03-4120-98cd-18b804e76a66)

@Deathproof76 commented on GitHub (Jan 24, 2025):

@fcorvelo If it's mostly about the previews on the main Hoarder page, which are working for me... maybe it's due to an env setting difference?

this is my full compose:

```
services:
  hoarder:
    image: ghcr.io/hoarder-app/hoarder:release
    container_name: hoarder
    restart: unless-stopped
    volumes:
      - /mnt/Dockerspace/hoarder/web:/data
    ports:
      - 3111:3000
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    environment:
      MEILI_ADDR: http://192.168.0.208:7700
      INFERENCE_CONTEXT_LENGTH: 4096
      DATA_DIR: /data
      BROWSER_WEB_URL: http://192.168.0.208:9222
      INFERENCE_TEXT_MODEL: ministral:perplexica
      INFERENCE_IMAGE_MODEL: minicpm-v:latest
      EMBEDDING_TEXT_MODEL: snowflake-arctic-embed-l-v2.0.F16.gguf:latest
      OLLAMA_KEEP_ALIVE: 0m
      INFERENCE_JOB_TIMEOUT_SEC: 30
      OCR_LANGS: eng,deu,spa
      CRAWLER_VIDEO_DOWNLOAD: true
      CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE: 800
      CRAWLER_ENABLE_ADBLOCKER: true
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    container_name: chrome
    restart: unless-stopped
    ports:
      - 9222:9222
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
  meilisearch:
    image: getmeili/meilisearch:v1.6
    container_name: meilisearch
    restart: unless-stopped
    environment:
      MEILI_NO_ANALYTICS: true
    ports:
      - 7700:7700
    env_file:
      - /mnt/Dockerspace/hoarder/.env
    volumes:
      - /mnt/Dockerspace/hoarder/meilisearch:/meili_data
```

and here's the .env; I removed some privacy-related stuff:

```
MEILI_MASTER_KEY=***********************************
NEXTAUTH_SECRET=************************************
NEXTAUTH_URL=https://hoarder.******.******
CRAWLER_NUM_WORKERS=2
CRAWLER_FULL_PAGE_SCREENSHOT=true
CRAWLER_FULL_PAGE_ARCHIVE=true
CRAWLER_DOWNLOAD_BANNER_IMAGE=true
CRAWLER_JOB_TIMEOUT_SEC=60
CRAWLER_NAVIGATE_TIMEOUT_SEC=45
MAX_ASSET_SIZE_MB=100
OLLAMA_BASE_URL=http://192.168.0.208:11434
OCR_CACHE_DIR=/data
INFERENCE_LANG=in the same language as the content
```

🤷‍♂️ With this exact config I simply do

Image

and it looks like this

Image

<!-- gh-comment-id:2612419615 --> @Deathproof76 commented on GitHub (Jan 24, 2025): @fcorvelo If it's mostly about the previews on the main hoarder page, which are working for me ... Maybe it could be due to an env setting difference? this is my full compose: ``` services: hoarder: image: ghcr.io/hoarder-app/hoarder:release container_name: hoarder restart: unless-stopped volumes: - /mnt/Dockerspace/hoarder/web:/data ports: - 3111:3000 env_file: - /mnt/Dockerspace/hoarder/.env environment: MEILI_ADDR: http://192.168.0.208:7700 INFERENCE_CONTEXT_LENGTH: 4096 DATA_DIR: /data BROWSER_WEB_URL: http://192.168.0.208:9222 INFERENCE_TEXT_MODEL: ministral:perplexica INFERENCE_IMAGE_MODEL: minicpm-v:latest EMBEDDING_TEXT_MODEL: snowflake-arctic-embed-l-v2.0.F16.gguf:latest OLLAMA_KEEP_ALIVE: 0m INFERENCE_JOB_TIMEOUT_SEC: 30 OCR_LANGS: eng,deu,spa CRAWLER_VIDEO_DOWNLOAD: true CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE: 800 CRAWLER_ENABLE_ADBLOCKER: true chrome: image: gcr.io/zenika-hub/alpine-chrome:123 container_name: chrome restart: unless-stopped ports: - 9222:9222 command: - --no-sandbox - --disable-gpu - --disable-dev-shm-usage - --remote-debugging-address=0.0.0.0 - --remote-debugging-port=9222 - --hide-scrollbars - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb meilisearch: image: getmeili/meilisearch:v1.6 container_name: meilisearch restart: unless-stopped environment: MEILI_NO_ANALYTICS: true ports: - 7700:7700 env_file: - /mnt/Dockerspace/hoarder/.env volumes: - /mnt/Dockerspace/hoarder/meilisearch:/meili_data ``` and heres the .env, I removed some privacy related stuff: ``` MEILI_MASTER_KEY=*********************************** NEXTAUTH_SECRET=************************************ NEXTAUTH_URL=https://hoarder.******.****** CRAWLER_NUM_WORKERS=2 CRAWLER_FULL_PAGE_SCREENSHOT=true CRAWLER_FULL_PAGE_ARCHIVE=true CRAWLER_DOWNLOAD_BANNER_IMAGE=true CRAWLER_JOB_TIMEOUT_SEC=60 CRAWLER_NAVIGATE_TIMEOUT_SEC=45 MAX_ASSET_SIZE_MB=100 
OLLAMA_BASE_URL=http://192.168.0.208:11434 OCR_CACHE_DIR=/data INFERENCE_LANG=in the same language as the content ``` 🤷‍♂️with this exact config I simply do ![Image](https://github.com/user-attachments/assets/d637aeb0-aaf1-44a5-9042-b9906a7811f7) and it looks like this ![Image](https://github.com/user-attachments/assets/2810adaf-7474-41e5-aa62-bc5536650028)
Author
Owner

@fcorvelo commented on GitHub (Jan 25, 2025):

@Deathproof76 That's very strange. I tried your environment variables (except the ones related to Inference because I'm not using that) and I still see the YouTube cookie banner.

Here is my compose stack (I'm using Portainer):

```yaml
version: "3.8"
services:
  web:
    image: ghcr.io/hoarder-app/hoarder:${HOARDER_VERSION:-release}
    restart: unless-stopped
    volumes:
      - data:/data
    ports:
      - 3000:3000
    env_file:
      - stack.env
    environment:
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222
      DATA_DIR: /data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars
      - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb
  meilisearch:
    image: getmeili/meilisearch:v1.11.1
    restart: unless-stopped
    env_file:
      - stack.env
    environment:
      MEILI_NO_ANALYTICS: "true"
    volumes:
      - meilisearch:/meili_data

volumes:
  meilisearch:
  data:
```

stack.env:

```
HOARDER_VERSION=0.21.0
NEXTAUTH_SECRET=***
MEILI_MASTER_KEY=***
NEXTAUTH_URL=http://hoarder.***.com
DISABLE_SIGNUPS=true
CRAWLER_VIDEO_DOWNLOAD=true
CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE=800
CRAWLER_ENABLE_ADBLOCKER=true
CRAWLER_NUM_WORKERS=2
CRAWLER_FULL_PAGE_SCREENSHOT=true
CRAWLER_FULL_PAGE_ARCHIVE=true
CRAWLER_DOWNLOAD_BANNER_IMAGE=true
CRAWLER_JOB_TIMEOUT_SEC=60
CRAWLER_NAVIGATE_TIMEOUT_SEC=45
MAX_ASSET_SIZE_MB=100
```
<!-- gh-comment-id:2613687530 --> @fcorvelo commented on GitHub (Jan 25, 2025): @Deathproof76 That's very strange. I tried your environment variables (except the ones related to Inference because I'm not using that) and I still see the YouTube cookie banner. Here is my compose stack (I'm using Portainer): ```yml version: "3.8" services: web: image: ghcr.io/hoarder-app/hoarder:${HOARDER_VERSION:-release} restart: unless-stopped volumes: - data:/data ports: - 3000:3000 env_file: - stack.env environment: MEILI_ADDR: http://meilisearch:7700 BROWSER_WEB_URL: http://chrome:9222 DATA_DIR: /data chrome: image: gcr.io/zenika-hub/alpine-chrome:123 restart: unless-stopped command: - --no-sandbox - --disable-gpu - --disable-dev-shm-usage - --remote-debugging-address=0.0.0.0 - --remote-debugging-port=9222 - --hide-scrollbars - --enable-features=ConversionMeasurement,AttributionReportingCrossAppWeb meilisearch: image: getmeili/meilisearch:v1.11.1 restart: unless-stopped env_file: - stack.env environment: MEILI_NO_ANALYTICS: "true" volumes: - meilisearch:/meili_data volumes: meilisearch: data: ``` stack.env: ``` HOARDER_VERSION=0.21.0 NEXTAUTH_SECRET=*** MEILI_MASTER_KEY=*** NEXTAUTH_URL=http://hoarder.***.com DISABLE_SIGNUPS=true CRAWLER_VIDEO_DOWNLOAD=true CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE=800 CRAWLER_ENABLE_ADBLOCKER=true CRAWLER_NUM_WORKERS=2 CRAWLER_FULL_PAGE_SCREENSHOT=true CRAWLER_FULL_PAGE_ARCHIVE=true CRAWLER_DOWNLOAD_BANNER_IMAGE=true CRAWLER_JOB_TIMEOUT_SEC=60 CRAWLER_NAVIGATE_TIMEOUT_SEC=45 MAX_ASSET_SIZE_MB=100 ```
Author
Owner

@Deathproof76 commented on GitHub (Jan 25, 2025):

@fcorvelo (currently on the phone, so just as an experiment): I use this exact URL https://youtu.be/_7VXPS7q00Y?si=Nr8fnq37dQWAsHgv, hoard it, and it works on my end. Nice preview in the app, archived full-page screenshot, got cookies, etc.
For comparison, here's the log of my hoarder container; it would most likely be helpful to know where it differs on your end. Sorry for the screenshot:

Image

<!-- gh-comment-id:2613709220 --> @Deathproof76 commented on GitHub (Jan 25, 2025): @fcorvelo *currently on the phone. So just as an experiment. I use this exact url https://youtu.be/_7VXPS7q00Y?si=Nr8fnq37dQWAsHgv hoard it and it works on my end. Nice preview in app, archived full page screenshot got cookies etc. for comparison the log of my hoarder container, would most likely be helpful to know where it differs on your end, sorry for the screenshot: ![Image](https://github.com/user-attachments/assets/60688ffe-f4cb-4b52-9b1e-8d8e6963440d)
Author
Owner

@Deathproof76 commented on GitHub (Jan 25, 2025):

In the app again for reference

Image

<!-- gh-comment-id:2613709949 --> @Deathproof76 commented on GitHub (Jan 25, 2025): In the app again for reference ![Image](https://github.com/user-attachments/assets/7253eb6c-3455-4b1b-a29b-196b0d3de947)
Author
Owner

@fcorvelo commented on GitHub (Jan 25, 2025):

@Deathproof76 I just found out that there are different behaviors depending on the YouTube link. If I use a link to a video, it works. But if I use a link to a channel, it doesn't. Can you try the channel of that video you sent? Like this: https://www.youtube.com/@VivaLaDirtLeague

<!-- gh-comment-id:2613712195 --> @fcorvelo commented on GitHub (Jan 25, 2025): @Deathproof76 I just found out that there are different behaviors depending on the YouTube link. If I use a link to a video, it works. But if I use a link to a channel, it doesn't. Can you try the channel of that video you sent? Like this: https://www.youtube.com/@VivaLaDirtLeague
Author
Owner

@Deathproof76 commented on GitHub (Jan 25, 2025):

@fcorvelo Well, and there it is 😅. At least we're finally on the same page and know how to reproduce this:

Image

I'd guesstimate that it has something to do with the crawler trying to find a nice banner: it settles on the YouTube .svg logo and chokes on it

Image

And due to that, we get to see the cookie screenshot as a backup, which is the "primary" screenshotable layer. Or something like that.
Maybe a workaround could be to adjust the crawler logic to specifically ignore .svg files and/or look for the biggest hoarder-supported image format.

See #141 and https://github.com/hoarder-app/hoarder/issues/128

<!-- gh-comment-id:2613714705 --> @Deathproof76 commented on GitHub (Jan 25, 2025): @fcorvelo well, and there it is 😅 at least we're finally on the same page and know how to reproduce this ![Image](https://github.com/user-attachments/assets/33ec5a7d-618f-4fe2-b56c-8bc15c07dad2) I'd guesstimate that it has something to do with crawler trying to find a nice banner, settles on the YouTube.svg logo and chokes on it ![Image](https://github.com/user-attachments/assets/09aca0f7-1f20-4eee-b267-cbf5efcad7ec) And due to that we'll get to see the cookie screenshot as a backup, which is the "primary" screenshotable layer. Or something like that. maybe a workaround could be to adjust the crawler logic to specifically ignore .svg files and/or possibly look for the biggest hoarder supported image format. See #141 and https://github.com/hoarder-app/hoarder/issues/128
Author
Owner

@Deathproof76 commented on GitHub (Jan 25, 2025):

Just spitballing on the side: Maybe it'd be possible to incorporate the integrated puppeteer adblock and enhance the blocklist for cookie requests? Mmm, that'll most likely block the whole page; I'd have to read up on the possibilities.

Also, maybe beyond the DNS-block scope, but with something like uBlock Origin, for example, it's possible to cosmetically filter specific annoyances like the cookie stuff away. Maybe it'd be possible to create filter templates for known offenders like YouTube via puppeteer in the hoarder container itself.

Another possibility would be to find out why the hoarder container can't connect (or has problems connecting) to a Chromium instance with a loaded extension, and solve that.

<!-- gh-comment-id:2613734838 --> @Deathproof76 commented on GitHub (Jan 25, 2025): Just spitballing on the side: Maybe I'd be possible to incorporate the integrated puppeteer adblock and enhance the blocklist for cookie requests? Mmm, that'll most likely block the whole page, I'd have to read up on the possibilities. Also, maybe beyond the dns block scope, but with something like ublock origin for example it's possible to cosmetically filter specific annoyances away like the cookie stuff. Maybe it'd be possible to create filtertemplates for known offenders like YouTube via puppeteer in the hoarder container itself. another possibility would be to find out why the hoarder container can't connect/or has problems to connect a chromium instance with a loaded extension and solve that
Author
Owner

@pcasalinho commented on GitHub (Jan 26, 2025):

Don't know why this only happens with some YouTube videos:
This link starts playing the video and then shows the cookie disclaimer: https://www.youtube.com/watch?v=CbIASgzUIUU
This video does not play until the cookie disclaimer is accepted: https://www.youtube.com/shorts/t67qpQFMDUw
You can view this in a private-mode browser without cookies.

In the first youtube link, hoarder does everything well.
On the second link, it "hoards" the disclaimer.

<!-- gh-comment-id:2614561342 --> @pcasalinho commented on GitHub (Jan 26, 2025): Don't know how this only happens in some youtube videos: This link starts playing the video and then shows the cookie disclaimer: <https://www.youtube.com/watch?v=CbIASgzUIUU> This video does not play until the cookie disclaimer is accepted: <https://www.youtube.com/shorts/t67qpQFMDUw> You can view this in a private mode browser without cookies In the first youtube link, hoarder does everything well. On the second link, it "hoards" the disclaimer.
Author
Owner

@ctschach commented on GitHub (Feb 3, 2025):

Do we have any progress or updates on this topic?

<!-- gh-comment-id:2631808122 --> @ctschach commented on GitHub (Feb 3, 2025): Do we have any progression or updates on this topic?
Author
Owner

@gercollo commented on GitHub (Mar 13, 2025):

+1 on this. For me, it's happening with YouTube Shorts and recently started also with links from Google Maps.

<!-- gh-comment-id:2721639691 --> @gercollo commented on GitHub (Mar 13, 2025): +1 on this. For me, it's happening with YouTube Shorts and recently started also with links from Google Maps.
Author
Owner

@kafmees commented on GitHub (Mar 17, 2025):

It also uses the cookies consent page for tagging. So every link is tagged as Cookies, Privacy policy, consent, advertising. I'll try the workaround as this makes the app worthless atm.

<!-- gh-comment-id:2727887482 --> @kafmees commented on GitHub (Mar 17, 2025): It also uses the cookies consent page for tagging. So every link is tagged as Cookies, Privacy policy, consent, advertising. I'll try the workaround as this makes the app worthless atm.
Author
Owner

@terribium commented on GitHub (Mar 17, 2025):

+1 for me too.

Tested these two YouTube URLs and both worked in incognito windows with the "I still don't care about cookies" extension in Vivaldi (Chrome).

URLs:
https://www.youtube.com/watch?v=CbIASgzUIUU
https://www.youtube.com/shorts/t67qpQFMDUw
Extension: https://chromewebstore.google.com/detail/i-still-dont-care-about-c/edibdbjcniadpccecjdfdjjppcpchdlm

Is this something we should and could incorporate in Hoarder?

> Don't know how this only happens in some youtube videos: This link starts playing the video and then shows the cookie disclaimer: https://www.youtube.com/watch?v=CbIASgzUIUU This video does not play until the cookie disclaimer is accepted: https://www.youtube.com/shorts/t67qpQFMDUw You can view this in a private mode browser without cookies
>
> In the first youtube link, hoarder does everything well. On the second link, it "hoards" the disclaimer.

<!-- gh-comment-id:2728526029 --> @terribium commented on GitHub (Mar 17, 2025): +1 for me too. Tested these 2 urls to YouTube and both worked in incognito windows with "I still don't care about cookies" -extension in Vivaldi (Chrome). URLS: https://www.youtube.com/watch?v=CbIASgzUIUU https://www.youtube.com/shorts/t67qpQFMDUw Extension: https://chromewebstore.google.com/detail/i-still-dont-care-about-c/edibdbjcniadpccecjdfdjjppcpchdlm Is this something we should and could incorporate in Hoarder? > Don't know how this only happens in some youtube videos: This link starts playing the video and then shows the cookie disclaimer: https://www.youtube.com/watch?v=CbIASgzUIUU This video does not play until the cookie disclaimer is accepted: https://www.youtube.com/shorts/t67qpQFMDUw You can view this in a private mode browser without cookies > > In the first youtube link, hoarder does everything well. On the second link, it "hoards" the disclaimer.
Author
Owner

@mobiledude commented on GitHub (Mar 22, 2025):

Is there, in general, a way to bypass cookie banners, not only YouTube-related ones? I hoard a lot from this site, but its cookie consent message makes it useless. https://www.totaaltv.nl/nieuws/kpn-slaat-handen-ineen-met-netflix-en-stelt-ook-nietklanten-in-staat-hiervan-te-profiteren/

<!-- gh-comment-id:2744988168 --> @mobiledude commented on GitHub (Mar 22, 2025): is there in general a way to bypass cookies not only youtube related? I am hoarding a lot from this site but it's cookie consent message makes it useless. https://www.totaaltv.nl/nieuws/kpn-slaat-handen-ineen-met-netflix-en-stelt-ook-nietklanten-in-staat-hiervan-te-profiteren/
Author
Owner

@dennisvanderpool commented on GitHub (Apr 20, 2025):

Is it possible to take over the browser and just log in to some sites? Then you could make it work for almost all sites, even ones where you really have to log in.

Or, as an alternative, expose a folder that cookies can be copied into? Would it then be possible to copy cookies from my desktop browser into that folder?

<!-- gh-comment-id:2817323759 --> @dennisvanderpool commented on GitHub (Apr 20, 2025): Is it possible to take over the browser and just login to some sites? Then you can let it work for almost all sites, even ones where you really have to login. Or as an alternative expose a folder where cookies can be copied into? Is it then possible to copy cookies from my desktop browser into that folder?
Author
Owner

@fspv commented on GitHub (Apr 27, 2025):

I came up with this custom docker image, with adblock and "I still don't care about cookies" loaded

```yaml
  chrome:
    restart: unless-stopped
    container_name: chrome
    build:
      context: .
      dockerfile_inline: |
        # Use a lightweight Debian-based image
        FROM debian:bullseye-slim

        # Install dependencies
        RUN apt-get update \
          && apt-get install -y wget gnupg curl unzip dbus dbus-x11 xvfb upower x11vnc novnc python3-websockify fluxbox \
          && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
          && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
          && apt-get update \
          && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \
            --no-install-recommends \
          && rm -rf /var/lib/apt/lists/*

        # RUN mkdir -p /etc/chromium/policies/managed/
        # RUN echo '{"ExtensionInstallForcelist": ["edibdbjcniadpccecjdfdjjppcpchdlm"]}' > /etc/chromium/policies/managed/policy.json

        # Create a non-root user
        RUN groupadd -r chromiumuser && useradd -u 1000 -rm -g chromiumuser -G audio,video chromiumuser

        RUN mkdir /run/dbus
        RUN chmod 777 /run/dbus
        RUN echo 01234567890123456789012345678901 > /etc/machine-id

        USER chromiumuser

        # Create extension directory
        RUN mkdir -p /tmp/chrome/extensions
        RUN mkdir -p /tmp/chrome/profile

        # Download and unzip "I Still Don't Care About Cookies"
        RUN curl -L -o /tmp/isdcac.zip https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/releases/download/v1.1.4/ISDCAC-chrome-source.zip && \
            unzip /tmp/isdcac.zip -d /tmp/chrome/extensions/isdcac && \
            rm /tmp/isdcac.zip

        # Download and unzip uBlock Origin
        RUN curl -L -o /tmp/ublock.zip https://github.com/uBlockOrigin/uBOL-home/releases/download/uBOLite_2025.4.13.1188/uBOLite_2025.4.13.1188.chromium.mv3.zip && \
            unzip /tmp/ublock.zip -d /tmp/chrome/extensions/ublock && \
            rm /tmp/ublock.zip

        ENV DBUS_SESSION_BUS_ADDRESS autolaunch:

        RUN x11vnc -storepasswd 123 /tmp/vnc-password

        RUN echo '#!/bin/bash -uex\n\
        Xvfb :1 -screen 0 1024x768x16 -ac -nolisten tcp -nolisten unix & \
        DISPLAY=:1 fluxbox & \
        DISPLAY=:1 x11vnc -nopw -forever -localhost -shared -rfbport 5900 -rfbportv6 5900 & \
        DISPLAY=:1 websockify -D --web=/usr/share/novnc 7900 localhost:5900 & \
        dbus-daemon --system --fork --print-address 1 > /tmp/dbus-session-addr.txt && \
        export DBUS_SESSION_BUS_ADDRESS=$(cat /tmp/dbus-session-addr.txt) && \
        DISPLAY=:1 google-chrome --disable-gpu --no-default-browser-check --no-first-run --disable-3d-apis --disable-dev-shm-usage \
        --load-extension=/tmp/chrome/extensions/isdcac,/tmp/chrome/extensions/ublock \
        --remote-debugging-address=0.0.0.0 --remote-debugging-port=9222 --user-data-dir=$(mktemp -d) \
        --proxy-server=http://myproxy:1081 --proxy-bypass-list=.example.com \
        "$@"' > /tmp/run-chrome.sh && chmod +x /tmp/run-chrome.sh

        # Set the entrypoint
        ENTRYPOINT ["/tmp/run-chrome.sh"]
    cap_add:
      - SYS_ADMIN
```
Works well for me and also exposes the Chrome UI on port 7900, where you can see an actual browser. This is basically the way it is done in the Selenium docker image, for example. I just added a couple more tricks to make the extensions load.

<!-- gh-comment-id:2833345424 --> @fspv commented on GitHub (Apr 27, 2025): I came up with this custom docker image, with adblock and "I still don't care about cookies" loaded ```yaml chrome: restart: unless-stopped container_name: chrome build: context: . dockerfile_inline: | # Use a lightweight Debian-based image FROM debian:bullseye-slim # Install dependencies RUN apt-get update \ && apt-get install -y wget gnupg curl unzip dbus dbus-x11 xvfb upower x11vnc novnc python3-websockify fluxbox \ && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \ && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \ && apt-get update \ && apt-get install -y google-chrome-stable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf libxss1 \ --no-install-recommends \ && rm -rf /var/lib/apt/lists/* # RUN mkdir -p /etc/chromium/policies/managed/ # RUN echo '{"ExtensionInstallForcelist": ["edibdbjcniadpccecjdfdjjppcpchdlm"]}' > /etc/chromium/policies/managed/policy.json # Create a non-root user RUN groupadd -r chromiumuser && useradd -u 1000 -rm -g chromiumuser -G audio,video chromiumuser RUN mkdir /run/dbus RUN chmod 777 /run/dbus RUN echo 01234567890123456789012345678901 > /etc/machine-id USER chromiumuser # Create extension directory RUN mkdir -p /tmp/chrome/extensions RUN mkdir -p /tmp/chrome/profile # Download and unzip "I Still Don't Care About Cookies" RUN curl -L -o /tmp/isdcac.zip https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/releases/download/v1.1.4/ISDCAC-chrome-source.zip && \ unzip /tmp/isdcac.zip -d /tmp/chrome/extensions/isdcac && \ rm /tmp/isdcac.zip # Download and unzip uBlock Origin RUN curl -L -o /tmp/ublock.zip https://github.com/uBlockOrigin/uBOL-home/releases/download/uBOLite_2025.4.13.1188/uBOLite_2025.4.13.1188.chromium.mv3.zip && \ unzip /tmp/ublock.zip -d /tmp/chrome/extensions/ublock && \ rm /tmp/ublock.zip 
ENV DBUS_SESSION_BUS_ADDRESS autolaunch: RUN x11vnc -storepasswd 123 /tmp/vnc-password RUN echo '#!/bin/bash -uex\n\ Xvfb :1 -screen 0 1024x768x16 -ac -nolisten tcp -nolisten unix & \ DISPLAY=:1 fluxbox & \ DISPLAY=:1 x11vnc -nopw -forever -localhost -shared -rfbport 5900 -rfbportv6 5900 & \ DISPLAY=:1 websockify -D --web=/usr/share/novnc 7900 localhost:5900 & \ dbus-daemon --system --fork --print-address 1 > /tmp/dbus-session-addr.txt && \ export DBUS_SESSION_BUS_ADDRESS=$(cat /tmp/dbus-session-addr.txt) && \ DISPLAY=:1 google-chrome --disable-gpu --no-default-browser-check --no-first-run --disable-3d-apis --disable-dev-shm-usage \ --load-extension=/tmp/chrome/extensions/isdcac,/tmp/chrome/extensions/ublock \ --remote-debugging-address=0.0.0.0 --remote-debugging-port=9222 --user-data-dir=$(mktemp -d) \ --proxy-server=http://myproxy:1081 --proxy-bypass-list=.example.com \ "$@"' > /tmp/run-chrome.sh && chmod +x /tmp/run-chrome.sh # Set the entrypoint ENTRYPOINT ["/tmp/run-chrome.sh"] cap_add: - SYS_ADMIN ``` Works well for me and also exposes chrome UI on port 7900 where you can see an actual browser. This is basically the way it is done in selenium docker image, for example. I just added a couple of more tricks to make extensions load
Author
Owner

@kafmees commented on GitHub (May 3, 2025):

Disclaimer: I'm a total noob, I don't know how to code, and I use LLMs for these kinds of things. So I did not write any of this, but please don't call me a "vibe coder", as I'm working hard to learn and do more and more stuff myself.

As I don't want to use a custom docker image, I've created a solution for tagging YouTube Shorts which works for me for the time being.

Specifically, I set up a webhook in Karakeep (triggered on created events), which is handled by a small Python Flask service running in Docker. When a new bookmark is added:

  1. If the URL is a YouTube Shorts link (/shorts/xyz), the script rewrites it to a regular YouTube URL (/watch?v=xyz)
  2. It then creates a new bookmark with the rewritten URL (which Karakeep tags correctly)
  3. It tags the original Shorts link with a custom tag: "Opsodemieteren"

Because I added a rule in Karakeep to automatically archive bookmarks with that tag, the incorrectly-tagged link gets archived right away.

I'm using this workaround until there's a more structural solution in place. Two things that would make this cleaner:

  1. Being able to delete bookmarks via the API using a tag filter
  2. Or having a rule action that allows deleting bookmarks based on tags

I know this isn't ideal, but it works for now, and maybe it helps someone else facing the same issue.

You can find the full stand alone script here: https://gist.github.com/kafmees/eb5f6705b29ca80d34e1fbd1817d4ab7

<!-- gh-comment-id:2848782075 --> @kafmees commented on GitHub (May 3, 2025): **Disclaimer:** I'm a total noob, don't know how to code and I use LLMs for these kind of things. So I did not write any of this, but please don't call me a "vibe coder", as I'm working hard to learn and do more and more stuff myself. As I don't want to use a custom docker image, I've created a solution for tagging youtube shorts which works for me for the time being. Specifically, I set up a webhook in Karakeep (triggered on created events), which is handled by a small Python Flask service running in Docker. When a new bookmark is added: 1. If the URL is a YouTube Shorts link (/shorts/xyz), the script rewrites it to a regular YouTube URL (/watch?v=xyz) 2. It then creates a new bookmark with the rewritten URL (which Karakeep tags correctly) 3. It tags the original Shorts link with a custom tag: "Opsodemieteren" Because I added a rule in Karakeep to automatically archive bookmarks with that tag, the incorrectly-tagged link gets archived right away. I'm using this workaround until there's a more structural solution in place. Two things that would make this cleaner: 1. Being able to delete bookmarks via the API using a tag filter 2. Or having a rule action that allows deleting bookmarks based on tags I know this isn't ideal, but it works for now, and maybe it helps someone else facing the same issue. You can find the full stand alone script here: [https://gist.github.com/kafmees/eb5f6705b29ca80d34e1fbd1817d4ab7](https://gist.github.com/kafmees/eb5f6705b29ca80d34e1fbd1817d4ab7)
Author
Owner

@ballerbude commented on GitHub (May 7, 2025):

@fspv How do I use that in the context of Karakeep? Could you provide your full docker-compose file? Thanks in advance.

<!-- gh-comment-id:2860067122 --> @ballerbude commented on GitHub (May 7, 2025): @fspv How do I use that in the context of Karakeep? Could you provide your full docker-compose file? Thanks in advance.
Author
Owner

@fspv commented on GitHub (May 11, 2025):

@ballerbude just replace the "chrome" service in your docker compose with this one

<!-- gh-comment-id:2869776480 --> @fspv commented on GitHub (May 11, 2025): @ballerbude just replace the "chrome" service in your docker compose with this one
Author
Owner

@Deathproof76 commented on GitHub (May 12, 2025):

@fspv I can't seem to get this running. Tried building outside of the stack and also as a drop-in for the standard chrome, made sure every container is in the same network. Also tried removing the proxy line. Tried opening ports and made sure the docker network resolves the IP correctly.

The VNC instance via 7900 is accessible from within the LAN (browser in a browser). But the 9222 remote-debugging port can only be curled from within the built chrome container itself. Otherwise it's just "can't connect to browser" from Karakeep on repeat.

Maybe a permissions issue? Some more guidance would be greatly appreciated.

<!-- gh-comment-id:2874296897 --> @Deathproof76 commented on GitHub (May 12, 2025): @fspv I can't seem to be able to get this running. Tried building outside of the stack and also as a drop in for the standard chrome, made sure every container is in the same network. Also tried removing the proxy line. Tried opening ports and made sure the docker network resolves the ip correctly. The vnc instance via 7900 is accessible from within the lan (browser in a browser). But the 9222 remote debugging port is can only be curled from from within the built chrome container itself. Otherwise it's just "can't connect to browser" from karakeep on repeat. Maybe a permissions issue? Some more guidance would be greatly appreciated.
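One possible explanation for this symptom (an assumption, not a confirmed diagnosis): recent headful Chrome builds bind the DevTools socket to 127.0.0.1 only, effectively ignoring `--remote-debugging-address=0.0.0.0`, so the port is reachable from inside the container but not from other containers; the headless alpine-chrome image doesn't hit this. If that's the cause, a small `socat` forwarder inside the container is a common workaround (add `socat` to the image's `apt-get install` line; the 9223 port number is illustrative):

```shell
# In run-chrome.sh: start Chrome on a loopback-only DevTools port
# (--remote-debugging-port=9223), then forward the externally
# published 9222 to it so other containers can reach it.
socat TCP-LISTEN:9222,fork,reuseaddr TCP:127.0.0.1:9223 &
```

With this in place, Karakeep's `BROWSER_WEB_URL` can keep pointing at port 9222 on the chrome container.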
Author
Owner

@jncmney commented on GitHub (May 28, 2025):

The same thing happens on other sites. I tried this page: https://comicvine.gamespot.com/revival/4050-50379/characters/ and cannot get Karakeep to skip/ignore the consent form.

Image Image
<!-- gh-comment-id:2914635323 --> @jncmney commented on GitHub (May 28, 2025): The same thing happens on other sites. I tried this page: https://comicvine.gamespot.com/revival/4050-50379/characters/ and cannot get Karakeep to skip/ignore the consent form. <img width="1401" alt="Image" src="https://github.com/user-attachments/assets/e2ca7741-b354-4ce5-ace5-9d2e12bdb73a" /> <img width="1432" alt="Image" src="https://github.com/user-attachments/assets/acc1461c-bd38-4c28-a989-d52b90d64ab0" />
Author
Owner

@ballerbude commented on GitHub (May 29, 2025):

Man, this is very frustrating and it makes this otherwise great tool somewhat obsolete if you live in the EU and visit EU sites. Readeck skips these consent banners and can scrape the text from the mentioned pages. So technically, it's very much possible.

<!-- gh-comment-id:2919461296 --> @ballerbude commented on GitHub (May 29, 2025): Man, this is very frustrating and making this otherwise great tool somewhat obsolete if you live in the EU and visit EU sites. Readeck skips this consent banners and can scrape the text from the mentioned pages. So technically, it's very much possible.
Author
Owner

@teddyfresco commented on GitHub (Jun 7, 2025):

> Man, this is very frustrating and it makes this otherwise great tool somewhat obsolete if you live in the EU and visit EU sites. Readeck skips these consent banners and can scrape the text from the mentioned pages. So technically, it's very much possible.

I've tried adapting every suggestion in this issue; nothing worked, maybe because I am a useless newbie. But it seems Readeck is perfectly able to deal with it in some way, as this blog post suggests:
https://cyb.org.uk/2025/06/05/self-hosted-bookmarking.html?ref=selfh.st
I'd very much like to use Karakeep, which I consider far superior, if not for this problem, which is a big hindrance...

<!-- gh-comment-id:2952361366 -->

@MohamedBassem commented on GitHub (Jun 7, 2025):

Folks, I hear your pain. I'll make this the main feature of the 0.26 release. I'm almost done with the 0.25 (should be out this weekend hopefully).

<!-- gh-comment-id:2952373544 -->

@pdc1 commented on GitHub (Jun 7, 2025):

Along the same lines, I was wondering if it would be possible to allow custom adblock filter lists? I have a series of lists I use with Brave, and I'd like to match that on the crawler side.

<!-- gh-comment-id:2952920961 -->

@CrazyWolf13 commented on GitHub (Jul 4, 2025):

Hi!
Just checking in, I'm one of the maintainers of community-scripts. I'd love this feature too, especially for TikTok reels, though that will possibly be even harder, as they currently serve a captcha:

![Image](https://github.com/user-attachments/assets/f907083e-8402-4cfc-b1ec-a794d6dd55e3)
![Image](https://github.com/user-attachments/assets/b20eb038-6c89-4a18-864c-9e4363346bcc)
<!-- gh-comment-id:3036519828 -->

@Tandem1000 commented on GitHub (Jul 23, 2025):

> Folks, I hear your pain. I'll make this the main feature of the 0.26 release. I'm almost done with the 0.25 (should be out this weekend hopefully).

I updated to 0.26 yesterday. Unfortunately, the ‘GDPR notice’ problem still persists (here: youtube.com).

<!-- gh-comment-id:3110269906 -->

@fspv commented on GitHub (Jul 26, 2025):

Hey, I replied in this thread earlier with a docker compose file for the chrome image. I had some time recently to package it properly. You can find the result in this repo: https://github.com/fspv/crawl-browser

Here is the minimal docker compose definition for the chrome target, which should work:

```yaml
services:
  chrome:
    image: nuhotetotniksvoboden/crawl-browser:latest
    command:
      - --no-sandbox
```

There was a mention in this thread of a bug where port 9222 is not available remotely. I have managed to fix it. Feel free to try it out and let me know whether it works.

It comes with the https://github.com/uBlockOrigin/uBOL-home and https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies extensions by default, so it should block most banners and cookie notices. You can add other extensions as well (see the readme and examples in the repo).

<!-- gh-comment-id:3123493891 -->

@miraculix95 commented on GitHub (Jul 28, 2025):

> Hey, I replied in this thread earlier with a docker compose file for the chrome image. I had some time recently to package it properly. You can find the result in this repo: https://github.com/fspv/crawl-browser
>
> Here is the minimal docker compose definition for the chrome target, which should work:
>
> ```yaml
> services:
>   chrome:
>     image: nuhotetotniksvoboden/crawl-browser:latest
>     command:
>       - --no-sandbox
> ```
>
> There was a mention in this thread of a bug where port 9222 is not available remotely. I have managed to fix it. Feel free to try it out and let me know whether it works.
>
> It comes with the https://github.com/uBlockOrigin/uBOL-home and https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies extensions by default, so it should block most banners and cookie notices. You can add other extensions as well (see the readme and examples in the repo).

Hi, thanks for the effort. Unfortunately it was not working for me.
What I did:

1. just replaced the image for the karakeep chrome service with the one you provided
2. replaced the image for the karakeep chrome service and also commented out all the command parameters after `--no-sandbox`

Results:

1. failure to fetch content
2. the same as with the vanilla karakeep image: the cookie banner is still in the screenshot

This is a major issue because these days 80% of websites have a cookie banner.
With this issue the software is unusable.

<!-- gh-comment-id:3127491938 -->

@fspv commented on GitHub (Jul 28, 2025):

Hmm, result 2 is interesting. I wouldn't expect that to be the case; I'll try it myself. Maybe karakeep just doesn't wait long enough for the extensions to do the job. Any chance you can connect to your container via port 7900, open vnc.html, and see what's actually going on when karakeep tries to fetch something?

<!-- gh-comment-id:3127544223 -->

@miraculix95 commented on GitHub (Jul 28, 2025):

> Hmm, result 2 is interesting. I wouldn't expect that to be the case; I'll try it myself. Maybe karakeep just doesn't wait long enough for the extensions to do the job. Any chance you can connect to your container via port 7900, open vnc.html, and see what's actually going on when karakeep tries to fetch something?

Unfortunately I am only semi-literate when it comes to development.

This is my current docker-compose.yml (caddy is started with another yml). Please let me know what I should change:

```yaml
name: karakeep

services:
  web:
    image: ghcr.io/karakeep-app/karakeep:${KARAKEEP_VERSION:-release}
    restart: unless-stopped
    container_name: karakeep-web
    volumes:
      # By default, the data is stored in a docker volume called "data".
      # If you want to mount a custom directory, change the volume mapping to:
      # - /path/to/your/directory:/data
      - data:/data
    # ports:
    #   - 3000:3000
    expose:
      - "3000"
    env_file:
      - .env
    environment:
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222
      # OPENAI_API_KEY: ...

      # You almost never want to change the value of the DATA_DIR variable.
      # If you want to mount a custom directory, change the volume mapping above instead.
      DATA_DIR: /data # DON'T CHANGE THIS
    networks:
      - web
    labels:
      - "caddy=karakeep.${DOMAIN}"
      - "caddy.reverse_proxy={{upstreams 3000}}"

  chrome:
    # alternative image to circumvent cookie banners, suggested in
    # https://github.com/karakeep-app/karakeep/issues/414 and https://github.com/fspv/crawl-browser
    image: nuhotetotniksvoboden/crawl-browser:latest
    # image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    container_name: karakeep-chrome
    command:
      - --no-sandbox
      # - --disable-gpu
      # - --disable-dev-shm-usage
      # - --remote-debugging-address=0.0.0.0
      # - --remote-debugging-port=9222
      # - --hide-scrollbars
    networks:
      - web

  meilisearch:
    image: getmeili/meilisearch:v1.13.3
    restart: unless-stopped
    env_file:
      - .env
    environment:
      MEILI_NO_ANALYTICS: "true"
    volumes:
      - meilisearch:/meili_data
    networks:
      - web

volumes:
  meilisearch:
  data:

networks:
  web:
    external: true
    name: web
```

<!-- gh-comment-id:3127630629 -->

@fspv commented on GitHub (Jul 29, 2025):

Okay, so I've just tested it myself, and indeed my image doesn't help. I tested it using news.google.com and it successfully closes the consent banner, but it takes a few seconds to do that.

The problem is that karakeep closes the website before extensions have a chance to do anything. I use this image for different purposes and it works well there, because I have added an artificial 10s sleep before page load and content grabbing. I don't see an option in karakeep to do that.

UPD: it kinda should be handled by this: https://github.com/karakeep-app/karakeep/blob/afcc27d5578377d66d79506a147ef8e9fd668783/apps/workers/workers/crawlerWorker.ts#L414. Not sure, though, why it doesn't work. Maybe logs can provide some insight, but I don't have much time to look at this as I'm not an active karakeep user at the moment.
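For context, the wait at that crawler line races a Playwright `networkidle` wait against a 5-second cap. A standalone sketch of the pattern with an added grace period for consent-blocking extensions; the stub and the extra delay are illustrative, not upstream Karakeep code:

```javascript
// Sketch of the crawler's wait pattern plus a hypothetical grace period.
async function waitForQuietPage(waitForLoadState, extraDelayMs) {
  // Wait until the network is relatively idle, or give up after 5 seconds.
  await Promise.race([
    waitForLoadState("networkidle", { timeout: 5000 }).catch(() => ({})),
    new Promise((resolve) => setTimeout(resolve, 5000)),
  ]);
  // Extra delay so extensions (uBOL, ISDCAC) have time to close banners.
  await new Promise((resolve) => setTimeout(resolve, extraDelayMs));
  return "done";
}

// Stub standing in for Playwright's page.waitForLoadState; it resolves
// after a short simulated delay so the sketch runs without a browser.
const stubLoadState = () => new Promise((resolve) => setTimeout(resolve, 50));

waitForQuietPage(stubLoadState, 100).then((result) => console.log(result)); // prints "done"
```

In real crawler code, the extra delay would go right after the existing `Promise.race`, which is exactly what the index.mjs hack later in this thread does.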

<!-- gh-comment-id:3133118809 -->

@miraculix95 commented on GitHub (Jul 30, 2025):

> Okay, so I've just tested it myself, and indeed my image doesn't help. I tested it using news.google.com and it successfully closes the consent banner, but it takes a few seconds to do that.
>
> The problem is that karakeep closes the website before extensions have a chance to do anything. I use this image for different purposes and it works well there, because I have added an artificial 10s sleep before page load and content grabbing. I don't see an option in karakeep to do that.
>
> UPD: it kinda should be handled by this: https://github.com/karakeep-app/karakeep/blob/afcc27d5578377d66d79506a147ef8e9fd668783/apps/workers/workers/crawlerWorker.ts#L414 ("Wait until network is relatively idle or timeout after 5 seconds"). Not sure, though, why it doesn't work. Maybe logs can provide some insight, but I don't have much time to look at this as I'm not an active karakeep user at the moment.

Thanks a lot anyway ... a pity ... weird that nobody else is looking into this ...

<!-- gh-comment-id:3134436918 -->

@pdc1 commented on GitHub (Jul 31, 2025):

I have been doing some testing (hacking, really) and have found a few things that have helped. Unfortunately this is more for @MohamedBassem to integrate and not something a non-developer user would likely be able to do.

The first part is to get a browser image that persists its profile state (e.g. mount a local volume for profile) and also allows VNC access to the container so you can log in to accounts, solve captchas and whatnot as needed. I built my own (using Brave browser) but others were mentioned above.

The other thing that I found helped was to reuse the browser context, which effectively opens a new tab on the browser so it has access to the cookies from the session where you did the login/captcha/etc. Basically in crawlPage, instead of having the browser object call newContext, do something like this:

```javascript
const existingContexts = browser.contexts();
if (existingContexts.length > 0) {
  logger_default.debug(`[Crawler]${existingContexts.length} contexts, reusing first one...`);
  // Reuse the first existing context
  context = existingContexts[0];
} else {
  // Fallback to creating a new incognito context
  context = await browser.newContext({
  [...]
```

Then at the end, instead of closing the context (since we want to keep reusing it), just close the page (this effectively closes a tab that was opened for the scraping):

```javascript
} finally {
  if (page) {
    logger_default.debug(`[Crawler][${jobId}] closing page..`);
    await page.close();
  }
  // Only close the browser if it was created on demand
  if (serverConfig.crawler.browserConnectOnDemand) {
    await browser.close();
  }
}
```

I also found it helped to turn on some Playwright debugging messages:

```yaml
environment:
  [...]
  DEBUG: pw:api,pw:error
  PLAYWRIGHT_LOG: trace
```
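The reuse logic above can be exercised standalone with a stubbed browser object; the stub below is purely illustrative (real code would get `browser` from Playwright's CDP connection), but it shows why a second crawl sees the cookies of the first:

```javascript
// Stub "browser" mimicking the two Playwright calls the snippet above uses:
// browser.contexts() and browser.newContext().
function makeStubBrowser() {
  const contexts = [];
  return {
    contexts: () => contexts,
    newContext: async () => {
      const ctx = { id: contexts.length };
      contexts.push(ctx);
      return ctx;
    },
  };
}

async function getContext(browser) {
  const existing = browser.contexts();
  if (existing.length > 0) {
    return existing[0]; // reuse: keeps cookies/logins from earlier sessions
  }
  return browser.newContext(); // first call: create a fresh context
}

(async () => {
  const browser = makeStubBrowser();
  const first = await getContext(browser);
  const second = await getContext(browser);
  console.log(first === second); // prints "true": the context is reused
})();
```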
<!-- gh-comment-id:3141142544 -->

@pdc1 commented on GitHub (Aug 1, 2025):

@miraculix95 if you could send me some links you are having problems with, I will try them in my hacked setup.

<!-- gh-comment-id:3144688498 -->

@ballerbude commented on GitHub (Aug 1, 2025):

Everything here is too complicated. The most streamlined solution would probably be something like what the miniflux RSS reader has implemented: just let me insert my cookies for sites I define myself. That way even paywalls could be circumvented.
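On the crawler side, a miniflux-style feature would amount to loading user-supplied cookies into the browser context before navigation (Playwright exposes `browserContext.addCookies` for this). A hypothetical sketch of the parsing half, assuming cookies exported in the Netscape cookies.txt format; `parseCookiesTxt` and the sample data are made up for illustration:

```javascript
// Parse Netscape cookies.txt lines into objects shaped for a browser context.
// Field order per the cookies.txt format:
// domain, includeSubdomains, path, secure, expiry, name, value
function parseCookiesTxt(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith("#"))
    .map((line) => {
      const [domain, , path, secure, expires, name, value] = line.split("\t");
      return {
        name,
        value,
        domain,
        path,
        expires: Number(expires),
        secure: secure === "TRUE",
      };
    });
}

const sample =
  "# Netscape HTTP Cookie File\n" +
  ".heise.de\tTRUE\t/\tTRUE\t1767225600\tconsent\taccepted\n";
console.log(parseCookiesTxt(sample));
```

The resulting array could then be handed to `context.addCookies(...)` so per-site consent or login cookies survive across crawls.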

<!-- gh-comment-id:3145444056 -->

@ballerbude commented on GitHub (Aug 2, 2025):

@pdc1 could you try heise.de and golem.de

<!-- gh-comment-id:3146474346 -->

@pdc1 commented on GitHub (Aug 2, 2025):

It's looking good; I think Brave is doing a good job with the cookie blocking. Note that I also run a Pi-hole ad blocker on my home network, which may skew the results a bit.

Here's an example from heise.de: [Titanic: VR-Erfahrung zeigt Untergang aus Passagiersicht](https://www.heise.de/news/Titanic-VR-Erfahrung-zeigt-Untergang-aus-Passagiersicht-10507527.html)

The thumbnail looks like this:
![Image](https://github.com/user-attachments/assets/befbdda9-4644-4656-867a-304d08c60367)

And the first paragraph in Karakeep is:

> Auf dem Bootsdeck drängen sich Passagiere der 1. Klasse. Es ist 1:10 Uhr und die Evakuierung verläuft nur schleppend. Ein Crewmitglied beruhigt einen besorgten Passagier mit dem Hinweis, es handle sich lediglich um eine Übung. Doch die Anspannung ist greifbar.

(Roughly: first-class passengers crowd the boat deck. It is 1:10 a.m. and the evacuation is proceeding only sluggishly. A crew member calms a worried passenger by saying it is merely a drill. But the tension is palpable.)

And here is the screenshot that Karakeep saves:
![Image](https://github.com/user-attachments/assets/6df32f24-7d8b-4654-8110-24f350040d7e)

golem.de got similar good results. I will try to package up a docker image you could try.

<!-- gh-comment-id:3146516954 -->

@pdc1 commented on GitHub (Aug 2, 2025):

Here's my setup, I hope I included everything people need to experiment.

1. In `docker-compose.yml` replace the `chrome` section with this:

```yaml
# Use Brave instead
chrome:
  container_name: brave-chrome
  build: .
  ports:
    - "9222:9222"             # Chrome DevTools for Playwright
    - "127.0.0.1:5901:5900"   # VNC access, localhost only
  volumes:
    - ./chrome-config:/data   # Persistent Brave user data
    - ./plugins:/plugins      # Optional plugin folder
  environment:
    - DISPLAY=:0
  shm_size: "1gb"
  restart: unless-stopped
```

2. `Dockerfile`. You'll need to replace the first line with your appropriate architecture (`arm64v8` is for my 64-bit Raspberry Pi 4). I'm a docker newbie (this was a ChatGPT assist 😉) so maybe there is a more generic Debian image.

```dockerfile
FROM arm64v8/debian:bookworm

ENV DEBIAN_FRONTEND=noninteractive

# Install system + GUI packages (slim install)
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget curl gnupg2 ca-certificates \
    xvfb tigervnc-standalone-server tigervnc-common fluxbox xterm xclip \
    dbus-x11 \
    libnss3 libatk-bridge2.0-0 libxss1 libasound2 libgbm1 \
    netcat-openbsd socat sudo \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# From https://brave.com/linux/
RUN curl -fsSLo /usr/share/keyrings/brave-browser-archive-keyring.gpg \
    https://brave-browser-apt-release.s3.brave.com/brave-browser-archive-keyring.gpg
RUN curl -fsSLo /etc/apt/sources.list.d/brave-browser-release.sources \
    https://brave-browser-apt-release.s3.brave.com/brave-browser.sources
RUN apt update
RUN apt install -y --no-install-recommends brave-browser

# Create user for GUI session
RUN useradd -ms /bin/bash chromium && \
    echo "chromium ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

# Create /tmp/.X11-unix with proper permissions (for Xvfb)
RUN mkdir -p /tmp/.X11-unix && chmod 1777 /tmp/.X11-unix
RUN chown root /tmp/.X11-unix

# Ensure /data exists and is owned correctly (for Brave profile)
RUN mkdir -p /data && chown chromium:chromium /data

# Switch to non-root user
USER chromium
WORKDIR /home/chromium

# Add start script
COPY --chown=chromium:chromium start.sh /start.sh
RUN chmod +x /start.sh

CMD ["/start.sh"]
```
3. And here is the `start.sh` script:

```bash
#!/bin/bash

export DISPLAY=:0

# Force remove Brave lock file
rm -f /data/SingletonLock

# Avoid dbus errors (for the most part)
eval $(dbus-launch)
export DBUS_SESSION_BUS_ADDRESS

# Avoid ALSA errors
cat > ~/.asoundrc <<EOF
pcm.!default {
  type null
}
ctl.!default {
  type null
}
EOF

# Start VNC server (which also sets up its own X server)
vncserver :0 -geometry 1280x800 -depth 24 -SecurityTypes None -localhost no --I-KNOW-THIS-IS-INSECURE &

# Wait briefly to ensure X session is ready
sleep 2

# Start lightweight window manager
fluxbox > /dev/null 2>&1 &

# Start socat in background with a wait loop for Brave
(
  echo "Waiting for Brave debugger on port 9221..."
  while ! nc -z localhost 9221; do sleep 0.1; done
  echo "Debugger is up, starting socat..."
  socat TCP-LISTEN:9222,fork,reuseaddr TCP:127.0.0.1:9221
) &

# Launch Brave with remote debugging
brave-browser \
  --no-sandbox \
  --remote-debugging-port=9221 \
  --disable-gpu \
  --user-data-dir=/data \
  --disable-extensions-file-access-check \
  --disable-background-networking \
  --metrics-recording-only \
  --no-first-run \
  --mute-audio \
  --disable-blink-features=AutomationControlled \
  --autoplay-policy=no-user-gesture-required
```

All three files go in the same directory. Create a chrome-config directory for the persistent Brave user profile.

4. Run `docker compose up --build -d` and it should do its thing. The new container will be called `brave-chrome`.

5. Now hopefully everything is running, so it's time to configure Brave. Install `tigervnc-viewer` on the host (for me it was `sudo apt install tigervnc-viewer`), and connect using `xtigervncviewer -SecurityTypes None localhost:5901`.

6. In the VNC session, Brave should be running. Click the upper right menu and select Settings. Then click the upper left settings menu and select "Shields". Scroll near the bottom and select Content Filtering (Enable custom filters). Check the first 6 boxes (all the "Fanboy" options). You can also experiment with other filters as needed (I saw an EasyList Germany for example).

If you leave the tigervnc session open you can watch the webpage load as Karakeep does its thing! This might help pinpoint problems with timing, for example.

7. That should be it. It sounds like a lot, but once set up, things should run without further interaction needed.

I hope this helps. I'm not sure how much of all this could be pre-configured into a tidy container image for production, but it's good proof of concept at this point.

If you find you need a longer delay to allow the page to finish loading, I can help you with a Karakeep hack to add a delay.

<!-- gh-comment-id:3146537943 --> @pdc1 commented on GitHub (Aug 2, 2025):

Here's my setup; I hope I included everything people need to experiment.

1. In `docker-compose.yml`, replace the `chrome` section with this:

   ```yaml
   # Use Brave instead
   chrome:
     container_name: brave-chrome
     build: .
     ports:
       - "9222:9222"           # Chrome DevTools for Playwright
       - "127.0.0.1:5901:5900" # VNC access, localhost only
     volumes:
       - ./chrome-config:/data # Persistent Brave user data
       - ./plugins:/plugins    # Optional plugin folder
     environment:
       - DISPLAY=:0
     shm_size: "1gb"
     restart: unless-stopped
   ```

2. `Dockerfile`. You'll need to replace the first line with your appropriate architecture (`arm64v8` is for my 64-bit Raspberry Pi 4). I'm a Docker newbie (this was a ChatGPT assist 😉) so maybe there is a more generic Debian image.

   ```dockerfile
   FROM arm64v8/debian:bookworm
   ENV DEBIAN_FRONTEND=noninteractive

   # Install system + GUI packages (slim install)
   RUN apt-get update && apt-get install -y --no-install-recommends \
       wget curl gnupg2 ca-certificates \
       xvfb tigervnc-standalone-server tigervnc-common fluxbox xterm xclip \
       dbus-x11 \
       libnss3 libatk-bridge2.0-0 libxss1 libasound2 libgbm1 \
       netcat-openbsd socat sudo \
       && apt-get clean && rm -rf /var/lib/apt/lists/*

   # From https://brave.com/linux/
   RUN curl -fsSLo /usr/share/keyrings/brave-browser-archive-keyring.gpg \
       https://brave-browser-apt-release.s3.brave.com/brave-browser-archive-keyring.gpg
   RUN curl -fsSLo /etc/apt/sources.list.d/brave-browser-release.sources \
       https://brave-browser-apt-release.s3.brave.com/brave-browser.sources
   RUN apt update
   RUN apt install -y --no-install-recommends brave-browser

   # Create user for GUI session
   RUN useradd -ms /bin/bash chromium && \
       echo "chromium ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

   # Create /tmp/.X11-unix with proper permissions (for Xvfb)
   RUN mkdir -p /tmp/.X11-unix && chmod 1777 /tmp/.X11-unix
   RUN chown root /tmp/.X11-unix

   # Ensure /data exists and is owned correctly (for Brave profile)
   RUN mkdir -p /data && chown chromium:chromium /data

   # Switch to non-root user
   USER chromium
   WORKDIR /home/chromium

   # Add start script
   COPY --chown=chromium:chromium start.sh /start.sh
   RUN chmod +x /start.sh

   CMD ["/start.sh"]
   ```

3. And here is the `start.sh` script:

   ```bash
   #!/bin/bash
   export DISPLAY=:0

   # Force remove Brave lock file
   rm -f /data/SingletonLock

   # Avoid dbus errors (for the most part)
   eval $(dbus-launch)
   export DBUS_SESSION_BUS_ADDRESS

   # Avoid ALSA errors
   cat > ~/.asoundrc <<EOF
   pcm.!default { type null }
   ctl.!default { type null }
   EOF

   # Start VNC server (which also sets up its own X server)
   vncserver :0 -geometry 1280x800 -depth 24 -SecurityTypes None -localhost no --I-KNOW-THIS-IS-INSECURE &

   # Wait briefly to ensure X session is ready
   sleep 2

   # Start lightweight window manager
   fluxbox > /dev/null 2>&1 &

   # Start socat in background with a wait loop for Brave
   (
     echo "Waiting for Brave debugger on port 9221..."
     while ! nc -z localhost 9221; do sleep 0.1; done
     echo "Debugger is up, starting socat..."
     socat TCP-LISTEN:9222,fork,reuseaddr TCP:127.0.0.1:9221
   ) &

   # Launch Brave with remote debugging
   brave-browser \
     --no-sandbox \
     --remote-debugging-port=9221 \
     --disable-gpu \
     --user-data-dir=/data \
     --disable-extensions-file-access-check \
     --disable-background-networking \
     --metrics-recording-only \
     --no-first-run \
     --mute-audio \
     --disable-blink-features=AutomationControlled \
     --autoplay-policy=no-user-gesture-required
   ```

   All three files go in the same directory. Create a `chrome-config` directory for the persistent Brave user profile.

4. Run `docker compose up --build -d` and it should do its thing. The new container will be called `brave-chrome`.

5. Hopefully everything is now running; next, configure Brave. Install `tigervnc-viewer` on the host (for me that was `sudo apt install tigervnc-viewer`) and connect using `xtigervncviewer -SecurityTypes None localhost:5901`.

6. In the VNC session, Brave should be running. Click the upper-right menu and select Settings, then click the upper-left settings menu and select "Shields". Scroll near the bottom and select Content Filtering (Enable custom filters). Check the first 6 boxes (all the "Fanboy" options). You can also experiment with other filters as needed (I saw an EasyList Germany, for example). If you leave the TigerVNC session open, you can watch the webpage load as Karakeep does its thing! This might help pinpoint problems with timing, for example.

7. That should be it. It sounds like a lot, but once set up, things should run without further interaction needed.

I hope this helps. I'm not sure how much of this could be pre-configured into a tidy container image for production, but it's a good proof of concept at this point. If you find you need a longer delay to allow the page to finish loading, I can help you with a Karakeep hack to add a delay.
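For anyone wondering how Karakeep reaches this container: the crawler connects to a Chrome DevTools endpoint given by the `BROWSER_WEB_URL` environment variable (the variable discussed later in the thread). With the compose file above, the wiring would look something like the following sketch; the service name `brave-chrome` and port `9222` are taken from that file, so adjust them to your own setup:

```yaml
# Sketch: point Karakeep's crawler at the Brave container's DevTools port
# exposed by socat. Service/port names are assumptions from the setup above.
services:
  web:
    environment:
      - BROWSER_WEB_URL=http://brave-chrome:9222
```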
Author
Owner

@pdc1 commented on GitHub (Aug 2, 2025):

Okay, I have some time, so here is the delay hack if you need it 😄. Note that my Docker image is installed in `karakeep_app`, so my Karakeep container is called `karakeep-app-web-1`. Your mileage may vary; `docker ps` lists the running containers.

  1. Get a copy of the Karakeep compiled JavaScript from the container: `docker cp karakeep-app-web-1:/app/apps/workers/dist/index.mjs index-orig.mjs`

  2. Copy `index-orig.mjs` to `index.mjs` and edit it in your favorite text editor.

  3. Look for the phrase `Waiting for the page to load ...` (it's a large file, so that might take a while).

  4. Add the lines starting with `+` (*do not* include the `+`):

```JavaScript
		logger_default.info(`[Crawler][${jobId}] Successfully navigated to "${url$1}". Waiting for the page to load ...`);
  		await Promise.race([page.waitForLoadState("networkidle", { timeout: 5e3 }).catch(() => ({})), new Promise((resolve$1) => setTimeout(resolve$1, 5e3))]);
+
+ 		// Add an extra delay to ensure dynamic content finishes
+ 		logger_default.info(`[Crawler][${jobId}] Additional wait for the page to load ...`);
+ 		await page.waitForTimeout(5000);
```

That is 5 seconds, but you can extend it as needed. You could even allow enough time to interactively solve a captcha if there's a page you *really* want to capture.

  5. Copy the `index.mjs` file back to the Karakeep container: `docker cp index.mjs karakeep-app-web-1:/app/apps/workers/dist/index.mjs`

  6. Restart the Karakeep container: `docker restart karakeep-app-web-1`

  7. Watch the logs: `docker logs -f -n 20 karakeep-app-web-1`. You can hit ^C to stop when you're done; it won't affect your Karakeep server, it's just stopping the log command.

I think that should be it, happy hacking!

<!-- gh-comment-id:3146545504 --> @pdc1 commented on GitHub (Aug 2, 2025): Okay, I have some time so here is the delay hack if you need it 😄. Note that my docker image is installed in `karakeep_app` so my Karakeep container is called `karakeep-app-web-1`. Your mileage may vary. `docker ps` lists the running images. 1. Get a copy of the Karakeep compiled JavaScript from the container: `docker cp karakeep-app-web-1:/app/apps/workers/dist/index.mjs index-orig.mjs` 2. Copy the `index-orig.mjs` to `index.mjs` and edit in your favorite text editor. 3. Look for the phrase `Waiting for the page to load ...` (it's a large file so that might take a while) 4. Add the lines starting with `+` (*do not* include the `+`) ```JavaScript logger_default.info(`[Crawler][${jobId}] Successfully navigated to "${url$1}". Waiting for the page to load ...`); await Promise.race([page.waitForLoadState("networkidle", { timeout: 5e3 }).catch(() => ({})), new Promise((resolve$1) => setTimeout(resolve$1, 5e3))]); + + // Add an extra delay to ensure dynamic content finishes + logger_default.info(`[Crawler][${jobId}] Additional wait for the page to load ...`); + await page.waitForTimeout(5000); ``` That is 5 seconds, but you can extend as needed. You could even give enough time to interactively solve a captcha if there's a page you *really* want to capture. 5. Copy the `index.mjs` file back to the Karakeep container: `docker cp index.mjs karakeep-app-web-1:/app/apps/workers/dist/index.mjs` 6. Restart the Karakeep container `docker restart karakeep-app-web-1` 7. And watch the logs: `docker logs -f -n 20 karakeep-app-web-1`. You can hit ^C to stop when you're done, it won't affect your Karakeep server, it's just stopping the log command. I think that should be it, happy hacking!
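The `Promise.race` pattern in the patch above generalizes to a small helper: wait on a page-load promise, but never longer than a ceiling, and swallow failures so the crawl continues. Here is a minimal sketch in plain Node (no Playwright needed; `waitAtMost` is an illustrative name, not part of Karakeep):

```javascript
// waitAtMost: resolve with the promise's value if it settles within `ms`,
// otherwise resolve with `fallback`. Rejections also resolve to `fallback`,
// mirroring the .catch(() => ({})) in the patched crawler code.
function waitAtMost(promise, ms, fallback = null) {
  return Promise.race([
    promise.catch(() => fallback),
    new Promise((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);
}
```

In the crawler, `promise` would be `page.waitForLoadState(...)` and `fallback` an empty object; the extra `page.waitForTimeout(5000)` then adds the unconditional delay on top.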
Author
Owner

@CrazyWolf13 commented on GitHub (Aug 2, 2025):

@MohamedBassem seeing that there is a great demand for this and already some good looking hacked together concepts, is there any way we can get something officially supported, or a preferred way from your side for this to be implemented?

<!-- gh-comment-id:3146571145 --> @CrazyWolf13 commented on GitHub (Aug 2, 2025): @MohamedBassem seeing that there is a great demand for this and already some good looking hacked together concepts, is there any way we can get something officially supported, or a preferred way from your side for this to be implemented?
Author
Owner

@MohamedBassem commented on GitHub (Aug 2, 2025):

There are indeed a lot of good ideas here. I'll evaluate the different options and see what we can do. The brave idea is smart and didn't come to mind for example. The waiting interval until ublock kicks in is something that I remember experimenting with before but didn't work, will give it a try one more time now that we're on playwright. Using persistent contexts I was trying to avoid as it can have security implications but I might maybe be able to consider a one context per user approach or something. Will go through the comments and report back. Please keep the good ideas coming!

<!-- gh-comment-id:3146576485 --> @MohamedBassem commented on GitHub (Aug 2, 2025): There are indeed a lot of good ideas here. I'll evaluate the different options and see what we can do. The brave idea is smart and didn't come to mind for example. The waiting interval until ublock kicks in is something that I remember experimenting with before but didn't work, will give it a try one more time now that we're on playwright. Using persistent contexts I was trying to avoid as it can have security implications but I might maybe be able to consider a one context per user approach or something. Will go through the comments and report back. Please keep the good ideas coming!
Author
Owner

@ewanly commented on GitHub (Oct 18, 2025):

I hope this will be resolved soon. I just started using Karakeep, and I didn't like how most of my links are presented.

<!-- gh-comment-id:3418534850 --> @ewanly commented on GitHub (Oct 18, 2025): I hope this will be resolved soon. I just started using Karakeep, and I didn't like how most of my links are presented.
Author
Owner

@kassyss commented on GitHub (Oct 18, 2025):

Hi @ewanly
In the meantime you can use the SingleFile extension (available for almost all browsers, as well as iOS Safari).
It works pretty well, but I agree, I would prefer to rely only on the Karakeep app and extension.
Best regards

<!-- gh-comment-id:3418581264 --> @kassyss commented on GitHub (Oct 18, 2025): Hi @ewanly In the meantime you can use Singlefile extension (available for almost all browsers as well as an iOS safari). It Works pretty well but i agree, i would prefer to only rely on Karakeep app and extension. Best regards
Author
Owner

@maltokyo commented on GitHub (Nov 8, 2025):

How do I do that @kassyss ? Thanks!

<!-- gh-comment-id:3506548578 --> @maltokyo commented on GitHub (Nov 8, 2025): How do I do that @kassyss ? Thanks!
Author
Owner

@kassyss commented on GitHub (Nov 8, 2025):

Hi @maltokyo, you just need to install the SingleFile extension (Firefox, Chrome, iOS…) and read the documentation

<!-- gh-comment-id:3506672245 --> @kassyss commented on GitHub (Nov 8, 2025): Hi @maltokyo, you just need to install singlefile extension (Firefox, chrome, iOS…) and read the [documentation](https://docs.karakeep.app/guides/singlefile/)
Author
Owner

@maltokyo commented on GitHub (Nov 8, 2025):

Done, thank you!

<!-- gh-comment-id:3506681226 --> @maltokyo commented on GitHub (Nov 8, 2025): Done, thank you!
Author
Owner

@alexbelgium commented on GitHub (Jan 16, 2026):

@pdc1 hi, how do you connect your brave browser to Karakeep? thanks

<!-- gh-comment-id:3761541197 --> @alexbelgium commented on GitHub (Jan 16, 2026): @pdc1 hi, how do you connect your brave browser to Karakeep? thanks
Author
Owner

@kafmees commented on GitHub (Jan 16, 2026):

I'm not sure what you mean. You can use https://chromewebstore.google.com/detail/karakeep/kgcjekpmcjjogibpjebkhaanilehneje?pli=1 as an add-on. But I use the Android app and share links with it.

<!-- gh-comment-id:3761839654 --> @kafmees commented on GitHub (Jan 16, 2026): I'm not sure what you mean. You can use https://chromewebstore.google.com/detail/karakeep/kgcjekpmcjjogibpjebkhaanilehneje?pli=1 as an add-on. But I use the android app and share kinks with it.
Author
Owner

@alexbelgium commented on GitHub (Jan 16, 2026):

Hi, thanks. I meant that pdc1 was using a Brave browser to circumvent cookie and GDPR banners. I also installed a Brave-based Docker container but can't find how to connect it with the `BROWSER_WEB_URL` env variable.

Edit: great extension btw

<!-- gh-comment-id:3761862490 --> @alexbelgium commented on GitHub (Jan 16, 2026): Hi, thanks. I meant that pdc1 was using a brave browser to circumvent cookies and gdpr. I also installed a brave based docker but can’t find how to connect it with the BROWSER_WEB_URL env variable Edit : great extension btw
Author
Owner

@pdc1 commented on GitHub (Jan 17, 2026):

> Hi, thanks. I meant that pdc1 was using a brave browser to circumvent cookies and gdpr. I also installed a brave based docker but can’t find how to connect it with the BROWSER_WEB_URL env variable
>
> Edit : great extension btw

There's a section above in this thread (https://github.com/karakeep-app/karakeep/issues/414#issuecomment-3146537943) where I document the changes. Since Brave is based on Chromium, it can be swapped with Chromium, but that requires some hacking of the Dockerfile. It also required some changes to the core Karakeep code, so it's more of a proof of concept than a configuration change.

<!-- gh-comment-id:3762521615 --> @pdc1 commented on GitHub (Jan 17, 2026): > Hi, thanks. I meant that pdc1 was using a brave browser to circumvent cookies and gdpr. I also installed a brave based docker but can’t find how to connect it with the BROWSER_WEB_URL env variable > > Edit : great extension btw There's a section above in this thread (https://github.com/karakeep-app/karakeep/issues/414#issuecomment-3146537943) where I document the changes. Since brave is based on chromium it can be swapped with chromium, but requires some hacking of the docker file. It also required some changes to the core karakeep code, so it's more of a proof of concept than a configuration change.
Author
Owner

@Sleywill commented on GitHub (Feb 23, 2026):

Adding another option for people who find cookie banners a recurring issue: if you're using an external screenshot API, most of the good ones handle this at the API level.

SnapAPI, for example, has a `blockCookieBanners: true` param that uses a CSS selector blocklist to hide consent dialogs before capture, so you don't have to build and maintain your own banner-detection logic:

```json
{
  "url": "https://example.com",
  "blockCookieBanners": true,
  "blockAds": true
}
```

Not sure if Karakeep is moving toward an API-based approach, but might be useful context for the discussion.

<!-- gh-comment-id:3945712279 --> @Sleywill commented on GitHub (Feb 23, 2026): Adding another option for people who find cookie banners a recurring issue: if you're using an external screenshot API, most of the good ones handle this at the API level. [SnapAPI](https://snapapi.pics) for example has a `blockCookieBanners: true` param that uses a CSS selector blocklist to hide consent dialogs before capture — so you don't have to build and maintain your own banner detection logic: ```json { "url": "https://example.com", "blockCookieBanners": true, "blockAds": true } ``` Not sure if karakeep is moving toward an API-based approach, but might be useful context for the discussion.
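The selector/blocklist idea also works at the click level: auto-consent tools typically scan visible button labels and prefer a "reject" action over "accept" so the banner goes away either way. A toy sketch of that heuristic (the pattern lists and `pickConsentLabel` are illustrative, not taken from any tool mentioned here):

```javascript
// Prefer reject-style buttons; fall back to accept-style ones so the
// banner is still dismissed. Patterns are a small illustrative subset.
const REJECT_PATTERNS = [/reject all/i, /decline/i, /only necessary/i];
const ACCEPT_PATTERNS = [/accept all/i, /i agree/i];

function pickConsentLabel(labels) {
  for (const patterns of [REJECT_PATTERNS, ACCEPT_PATTERNS]) {
    const hit = labels.find((label) => patterns.some((p) => p.test(label)));
    if (hit) return hit; // first label matching the preferred tier
  }
  return null; // no consent-looking button found
}
```

A crawler would collect `labels` from the rendered page (e.g. via Playwright's `getByRole('button')`) and click the element whose label is returned.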