mirror of https://github.com/karakeep-app/karakeep.git synced 2026-04-25 07:56:05 +03:00

[GH-ISSUE #1563] [FR] Cache and store images in the extracted reader content #978

Open

opened 2026-03-02 11:54:07 +03:00 by kerem · 3 comments

kerem commented

2026-03-02 11:54:07 +03:00

Owner

Originally created by @MohamedBassem on GitHub (Jun 8, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1563

Currently, only full-page archives inline images for offline access; the default extracted content doesn't. This means that if the bookmarked page goes offline, the reader view of the bookmark breaks. We should change that.

For the reader content, we can't inline the images (the way SingleFile does) because the DB size will explode. We should, however, parse the HTML, extract the images, download them as attached assets, and then replace the links with asset URLs.
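
The parse-and-replace step described above could look roughly like the following. This is a minimal sketch, not Karakeep's implementation: `ImageRewriter`, `rewrite_images`, and the `to_asset_url` callback are hypothetical names, the download itself is left to the caller, and only `<img src>` is handled (`srcset`, CSS backgrounds, and doctype declarations are ignored). It uses only Python's standard `html.parser`.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, List, Tuple


class ImageRewriter(HTMLParser):
    """Re-emit an HTML fragment, remapping every <img src> value."""

    def __init__(self, to_asset_url: Callable[[str], str]) -> None:
        # Keep character references as-is so text is re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.to_asset_url = to_asset_url  # hypothetical: original URL -> asset URL
        self.out: List[str] = []
        self.found: List[str] = []  # original image URLs, in document order

    def _emit_tag(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if tag == "img" and name == "src" and value:
                self.found.append(value)
                value = self.to_asset_url(value)
            # The parser unescapes attribute values, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def rewrite_images(html: str, to_asset_url: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Return the rewritten HTML and the list of original image URLs."""
    rewriter = ImageRewriter(to_asset_url)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out), rewriter.found
```

The caller would download each URL in the returned list, store it as an asset, and have `to_asset_url` return whatever URL the asset system serves it under; the shape of that URL is an assumption here.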

kerem added the feature request, status/approved, and pri/medium labels 2026-03-02 11:54:07 +03:00

kerem commented

2026-03-02 11:54:08 +03:00

Author

Owner

@digithree commented on GitHub (Jun 8, 2025):

Some thoughts on the spec so far, @MohamedBassem:

1. Do you intend the above to run on the server (not on mobile, for example) and to use the existing asset management system (used for screenshots, etc.)?
2. When in the process would the images in the extracted content be downloaded as assets? The easiest hook is at content extraction time, as part of that process, but asset storage may then also balloon; it will just be a little more manageable.
3. I can imagine users (or rather, admins) wanting reader assets to be separately deletable or limited, as distinct from bookmark screenshots (used for thumbnails), which they would likely want to be more permanent, since the two will compete for storage.
4. Similarly, admins/users may want to disable this functionality or limit its storage capacity.

kerem commented

2026-03-02 11:54:08 +03:00

Author

Owner

@MohamedBassem commented on GitHub (Jun 8, 2025):

1. Yes, exactly. Probably a new asset type called `contentImages` that can be used to store those.
2. It's going to happen as part of the crawler worker, yes, or maybe the crawler can spin up a new job to do the image downloads so it doesn't degrade the reliability of the crawler worker.
3. I agree that they'll probably spam/bloat the assets attached to a bookmark, so we'll need to ensure that they don't degrade the UX on the frontend and exclude them in those places.
4. We can add a way to disable them, similar to full page archives.

I agree in general that they'll bloat the existing asset management system, and that's part of the reason I didn't do them before. But maybe that's not that big of a concern? We can also have an artificial limit of maybe 50 images per link or something.
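
The disable flag and the per-link cap mentioned above could be combined in one small selection step before any download happens. This is only a sketch under assumptions: `ContentImageConfig`, its field names, and `select_images_to_store` are hypothetical and not existing Karakeep settings; the default of 50 is just the figure floated in this comment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContentImageConfig:
    # Hypothetical settings, named for illustration only.
    enabled: bool = True
    max_images_per_link: int = 50  # the "maybe 50" suggested in the thread


def select_images_to_store(image_urls: List[str], config: ContentImageConfig) -> List[str]:
    """Return the image URLs worth downloading, in first-seen order."""
    if not config.enabled:
        return []
    # Drop repeated URLs while keeping document order, then apply the cap.
    unique = list(dict.fromkeys(image_urls))
    return unique[: config.max_images_per_link]
```

Images beyond the cap would simply keep their original remote URLs in the reader content, so the view degrades gracefully rather than failing.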

kerem commented

2026-03-02 11:54:08 +03:00

Author

Owner

@digithree commented on GitHub (Jun 10, 2025):

I'm not sure; I think it could potentially bloat quite a lot. I've realised what the core dissonance is for me: I had envisioned it as a cache, not a permanent asset library for the reader mode files. They're two different approaches that serve different needs, so once that is clear, a lot of the concerns melt away.

Either way though I think empowering admins with a disable flag and limits per user or at the site level will mean it doesn't catch anyone by surprise.
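
Per-user and site-level limits of the kind suggested above could be enforced with a simple check before each image is stored. A minimal sketch, assuming byte-based quotas; `StorageLimits` and `can_store` are hypothetical names, and how the current usage figures are obtained is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageLimits:
    # Hypothetical knobs; None means "no limit at this level".
    per_user_bytes: Optional[int] = None
    site_wide_bytes: Optional[int] = None


def can_store(size: int, user_used: int, site_used: int, limits: StorageLimits) -> bool:
    """Return True if storing `size` more bytes stays within both limits."""
    if limits.per_user_bytes is not None and user_used + size > limits.per_user_bytes:
        return False
    if limits.site_wide_bytes is not None and site_used + size > limits.site_wide_bytes:
        return False
    return True
```

Under the cache interpretation, a failed check could trigger eviction of older reader images instead of a refusal; under the permanent-library interpretation, it would just skip the download.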
