starred/karakeep

mirror of https://github.com/karakeep-app/karakeep.git synced 2026-04-25 16:06:04 +03:00

[GH-ISSUE #778] Videos do not download (or do not show up properly in UI) when CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE set to -1 #510

Closed

opened 2026-03-02 11:50:27 +03:00 by kerem · 7 comments

kerem commented

2026-03-02 11:50:27 +03:00

Owner

Originally created by @bverkron on GitHub (Dec 28, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/778

Describe the Bug

When CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE is set to -1, videos downloaded from YouTube (for example https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg) don't work. I get the following instead of a playable video:

https://github.com/user-attachments/assets/bcbe630f-3e83-48e4-9585-164bc79bdfa4

Setting CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE back to another value, such as 1000, makes it work. The default value (i.e. not including the env var) also works.

Steps to Reproduce

1. Set CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE to -1
2. Add url https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg to Hoarder
3. Attempt to view the video after it's downloaded / processed

Expected Behaviour

Video is playable

Screenshots or Additional Context

Log entries from successful download (CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE set to 1000)

2024-12-28T01:39:12.708Z info: [Crawler][45] Will crawl "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg" for link with id "totdckcx63gx6xnlwtcbiozi"
2024-12-28T01:39:12.709Z info: [Crawler][45] Attempting to determine the content-type for the url https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg
2024-12-28T01:39:12.772Z info: [search][46] Attempting to index bookmark with id totdckcx63gx6xnlwtcbiozi ...
2024-12-28T01:39:12.922Z info: [search][46] Completed successfully
2024-12-28T01:39:12.992Z info: [Crawler][45] Content-type for the url https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg is "text/html; charset=utf-8"
2024-12-28T01:39:16.053Z info: [Crawler][45] Successfully navigated to "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg". Waiting for the page to load ...
2024-12-28T01:39:21.057Z info: [Crawler][45] Finished waiting for the page to load.
2024-12-28T01:39:21.265Z info: [Crawler][45] Successfully fetched the page content.
2024-12-28T01:39:21.609Z info: [Crawler][45] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-12-28T01:39:21.619Z info: [Crawler][45] Will attempt to extract metadata from page ...
2024-12-28T01:39:26.436Z info: [Crawler][45] Will attempt to extract readable content ...
2024-12-28T01:39:29.317Z info: [Crawler][45] Done extracting readable content.
2024-12-28T01:39:29.378Z info: [Crawler][45] Stored the screenshot as assetId: 0c0b7315-29d1-488c-ab33-602b9eefd7d5
2024-12-28T01:39:29.436Z info: [Crawler][45] Done extracting metadata from the page.
2024-12-28T01:39:29.437Z info: [Crawler][45] Downloading image from "https://i.ytimg.com/vi/Lw9Y_A5rzOs/maxres2.jpg?sqp=-oaymwEoCIAKENAF8quKqQMcGADwAQH4AbYIgAKAD4oCDAgAEAEYciBLKEAwDw==&rs=AOn4CLBYL4uSqtx5DMs9e-sE5MbFW6XmtA"
2024-12-28T01:39:29.521Z info: [Crawler][45] Downloaded image as assetId: 34d0f0f2-4bfb-478b-9578-1865b673eb09
2024-12-28T01:39:29.602Z info: [Crawler][45] Completed successfully
2024-12-28T01:39:30.415Z debug: [inference][47] No inference client configured, nothing to do now
2024-12-28T01:39:30.416Z info: [inference][47] Completed successfully
2024-12-28T01:39:30.470Z info: [search][48] Attempting to index bookmark with id totdckcx63gx6xnlwtcbiozi ...
2024-12-28T01:39:30.482Z info: [VideoCrawler][49] Attempting to download a file from "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg" to "/tmp/video_downloads/bdbaf00b-9b02-4fa4-9369-8e4e632f7c9d" using the following arguments: "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg,-f,best[filesize<1000M],-o,/tmp/video_downloads/bdbaf00b-9b02-4fa4-9369-8e4e632f7c9d,--no-playlist"
2024-12-28T01:39:30.574Z info: [search][48] Completed successfully
2024-12-28T01:39:35.136Z info: [VideoCrawler][49] Finished downloading a file from "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg" to "/tmp/video_downloads/bdbaf00b-9b02-4fa4-9369-8e4e632f7c9d"
2024-12-28T01:39:35.177Z info: [VideoCrawler][49] Finished downloading video from "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg" and adding it to the database
2024-12-28T01:39:35.178Z info: [VideoCrawler][49] Video Download Completed successfully

Log when set to -1

2024-12-28T01:44:48.903Z info: [Crawler][51] Will crawl "https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr" for link with id "uskue5v4bdwpl8jzgbmcfh64"
2024-12-28T01:44:48.905Z info: [Crawler][51] Attempting to determine the content-type for the url https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr
2024-12-28T01:44:49.071Z info: [search][52] Attempting to index bookmark with id uskue5v4bdwpl8jzgbmcfh64 ...
2024-12-28T01:44:49.143Z info: [Crawler][51] Content-type for the url https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr is "text/html; charset=utf-8"
2024-12-28T01:44:49.151Z info: [search][52] Completed successfully
2024-12-28T01:44:51.860Z info: [Crawler][51] Successfully navigated to "https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr". Waiting for the page to load ...
2024-12-28T01:44:56.861Z info: [Crawler][51] Finished waiting for the page to load.
2024-12-28T01:44:57.093Z info: [Crawler][51] Successfully fetched the page content.
2024-12-28T01:44:57.390Z info: [Crawler][51] Finished capturing page content and a screenshot. FullPageScreenshot: false
2024-12-28T01:44:57.403Z info: [Crawler][51] Will attempt to extract metadata from page ...
2024-12-28T01:45:02.529Z info: [Crawler][51] Will attempt to extract readable content ...
2024-12-28T01:45:05.685Z info: [Crawler][51] Done extracting readable content.
2024-12-28T01:45:05.745Z info: [Crawler][51] Stored the screenshot as assetId: cb93da1a-00e6-438c-af44-db72f37456a1
2024-12-28T01:45:05.789Z info: [Crawler][51] Done extracting metadata from the page.
2024-12-28T01:45:05.789Z info: [Crawler][51] Downloading image from "https://i.ytimg.com/vi/Lw9Y_A5rzOs/maxres2.jpg?sqp=-oaymwEoCIAKENAF8quKqQMcGADwAQH4AbYIgAKAD4oCDAgAEAEYciBLKEAwDw==&rs=AOn4CLBYL4uSqtx5DMs9e-sE5MbFW6XmtA"
2024-12-28T01:45:05.857Z info: [Crawler][51] Downloaded image as assetId: ccd5259b-49c8-4afc-8303-76078e2ca57d
2024-12-28T01:45:05.927Z info: [Crawler][51] Completed successfully
2024-12-28T01:45:06.777Z debug: [inference][53] No inference client configured, nothing to do now
2024-12-28T01:45:06.778Z info: [inference][53] Completed successfully
2024-12-28T01:45:06.834Z info: [search][54] Attempting to index bookmark with id uskue5v4bdwpl8jzgbmcfh64 ...
2024-12-28T01:45:06.848Z info: [VideoCrawler][55] Attempting to download a file from "https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr" to "/tmp/video_downloads/454d5edb-f75d-4b56-8203-ad40613563b8" using the following arguments: "https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr,-o,/tmp/video_downloads/454d5edb-f75d-4b56-8203-ad40613563b8,--no-playlist"
2024-12-28T01:45:06.937Z info: [search][54] Completed successfully
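
The two VideoCrawler "Attempting to download" lines above differ in one respect: the successful run passes a `-f` format selector and the -1 run does not. A minimal sketch for pulling that selector out of a logged argument string follows. It assumes the arguments are logged as a comma-joined list, as they appear above; `extractFormatSelector` is a hypothetical helper and not part of Hoarder.

```typescript
// Hypothetical helper (not Hoarder code): given the comma-joined argument
// string from a VideoCrawler log line, return the value passed to "-f",
// or undefined when no format selector was passed.
// Assumption: no individual argument contains a comma.
function extractFormatSelector(loggedArgs: string): string | undefined {
  const args = loggedArgs.split(",");
  const flagIndex = args.indexOf("-f");
  if (flagIndex < 0 || flagIndex + 1 >= args.length) {
    return undefined;
  }
  return args[flagIndex + 1];
}

// Argument strings copied from the two log excerpts.
const withLimit =
  "https://youtu.be/Lw9Y_A5rzOs?si=tDY6iGdnSK_pm4vg,-f,best[filesize<1000M],-o,/tmp/video_downloads/bdbaf00b-9b02-4fa4-9369-8e4e632f7c9d,--no-playlist";
const withoutLimit =
  "https://youtu.be/Lw9Y_A5rzOs?si=m1mYS19NUmXzkexr,-o,/tmp/video_downloads/454d5edb-f75d-4b56-8203-ad40613563b8,--no-playlist";

console.log(extractFormatSelector(withLimit));
console.log(extractFormatSelector(withoutLimit));
```

With no selector passed, the choice of format is left entirely to yt-dlp's defaults.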

Device Details

Safari 17.6 on macOS

Exact Hoarder Version

v0.20.0


kerem closed this issue

2026-03-02 11:50:27 +03:00

@bverkron commented on GitHub (Dec 28, 2024):

Workaround is setting it to a very high number that will likely never be reached, like 9999999999999. Effectively the same as having no limit.

@kamtschatka commented on GitHub (Dec 29, 2024):

Works fine for me. Have you tried other browsers, to see whether Safari maybe does not support the video format?

@bverkron commented on GitHub (Dec 29, 2024):

It appears to work in Brave (i.e. Chrome), but Hoarder must be doing something different with the video (aside from the compression, I assume) when set to -1, since any other value (even an arbitrarily large one like 9999999999999) makes it playable in Safari.

@kamtschatka commented on GitHub (Dec 29, 2024):

Hoarder merely passes this parameter to yt-dlp, which then chooses which file to download.
When -1 is provided, we skip adding the filesize filter; otherwise we add best[filesize<${maxVideoDownloadSize}M] to the arguments.
So it seems yt-dlp simply chooses a different version of the video in that case, and Safari really does have issues with some video formats.
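
The behaviour described in this comment can be sketched roughly as follows. This is an illustration only, not Hoarder's actual source: the function name `buildYtDlpArgs` and its parameters are hypothetical, and the argument order is taken from the "using the following arguments" log lines in the issue body.

```typescript
// Sketch under stated assumptions: function and parameter names are
// hypothetical; only the resulting argument order is taken from the logs.
function buildYtDlpArgs(
  url: string,
  outputPath: string,
  maxVideoDownloadSize: number,
): string[] {
  const args: string[] = [url];
  // -1 means "no limit": no filesize filter is added, so yt-dlp falls back
  // to its own default format selection.
  if (maxVideoDownloadSize !== -1) {
    args.push("-f", `best[filesize<${maxVideoDownloadSize}M]`);
  }
  args.push("-o", outputPath, "--no-playlist");
  return args;
}

// Placeholder URL and path, for illustration only.
console.log(buildYtDlpArgs("https://example.com/v", "/tmp/out", 1000).join(","));
console.log(buildYtDlpArgs("https://example.com/v", "/tmp/out", -1).join(","));
```

Note that a very large limit still adds the `-f best[...]` selector, which is why the 9999999999999 workaround behaves like the default rather than like -1.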

@bverkron commented on GitHub (Dec 30, 2024):

Looks like with -1 set it's downloading the video in AV1 format. Perhaps that's the original format YouTube is serving up, and with any other value passed it's converting to MP4. In either case, AV1 is relatively new and not widely supported yet, which probably explains the playback problems in Safari.

Video format with CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE = -1
Screenshot "negative 1": https://github.com/user-attachments/assets/e5303fd8-7a0b-40b7-b0d7-52475f373f19

Video format with CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE = 9999999999999
Screenshot "99999999999": https://github.com/user-attachments/assets/b76e7aef-9052-4700-be1b-82479bba0c3f

Video format with CRAWLER_VIDEO_DOWNLOAD_MAX_SIZE not set, I think. Same result as 9999999999999
Screenshot "50": https://github.com/user-attachments/assets/ce5d57f7-9808-4ccb-90d5-b6df4fc11035

This kind of issue may be solved by proxy if https://github.com/hoarder-app/hoarder/issues/775 were to add options to control the resolution, format, etc. Hoarder could then be set to produce consistent formats, so these kinds of inconsistencies are avoided.
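
As a rough illustration of what a consistent format setting could look like, the sketch below builds a yt-dlp format selector that prefers an H.264 video stream (codec IDs starting with `avc1`) and falls back to the best available file. This is an assumption about one possible approach, not what #775 specifies; `buildFormatSelector` is a hypothetical helper, and it relies only on yt-dlp's documented `[vcodec^=...]` and `[filesize<...]` filters and the `/` fallback operator.

```typescript
// Hypothetical helper (not Hoarder code): build a yt-dlp "-f" selector that
// prefers H.264 video, optionally capped by a size limit in megabytes.
// A limit of -1 is treated as "no limit", mirroring the env var's meaning.
function buildFormatSelector(maxSizeMb: number): string {
  const sizeFilter = maxSizeMb === -1 ? "" : `[filesize<${maxSizeMb}M]`;
  // First choice: best single file whose video codec starts with "avc1".
  // Fallback after "/": best single file regardless of codec.
  return `best[vcodec^=avc1]${sizeFilter}/best${sizeFilter}`;
}

console.log(buildFormatSelector(-1));
console.log(buildFormatSelector(1000));
```

One trade-off of this sketch: `best` only considers files that already contain both video and audio, which may mean a lower resolution than yt-dlp's default selection would pick.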

@kamtschatka commented on GitHub (Dec 30, 2024):

Yeah, I don't think it makes sense to track this separately; it should be fixed as part of #775.
Safari truly is the new Internet Explorer of the Internet...

@bverkron commented on GitHub (Dec 30, 2024):

Closing in favour of #775
