[GH-ISSUE #1563] [FR] Cache and store images in the extracted reader content #978

Open
opened 2026-03-02 11:54:07 +03:00 by kerem · 3 comments

Originally created by @MohamedBassem on GitHub (Jun 8, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1563

Currently, only full page archives inline the images for offline access. The default extracted content doesn't. This means that if the bookmarked page goes offline, the reader view of the bookmark breaks. We should change that.

For the reader content, we can't inline the images (as singlefile does) because the db size will explode. We should, however, parse the HTML, extract the images, download them as attached assets, and then replace the links with asset URLs.
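
The extract-and-replace steps described above could be sketched roughly as follows. This is only an illustration, not Karakeep's implementation: the helper names (`resolveSrc`, `extractImageUrls`, `rewriteImageUrls`) and the `/api/assets/<id>` URL shape are assumptions, and a regex is used only to keep the sketch dependency-free where a real implementation would likely use a proper HTML parser. The download step is left to the caller, which would pass back a map from image URL to stored asset ID.

```typescript
// Hypothetical helpers; none of these names are taken from the Karakeep codebase.
// Matches <img ... src="..."> and captures: (1) everything up to the quote, (2) the quote, (3) the src value.
const IMG_SRC = /(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2/gi;

/** Resolve a possibly relative src against the page URL; null for invalid or non-http(s) values. */
function resolveSrc(src: string, pageUrl: string): string | null {
  try {
    const url = new URL(src, pageUrl);
    return url.protocol === "http:" || url.protocol === "https:" ? url.href : null;
  } catch {
    return null;
  }
}

/** Collect the distinct absolute image URLs referenced by <img> tags, in document order. */
function extractImageUrls(html: string, pageUrl: string): string[] {
  const found = new Set<string>();
  const re = new RegExp(IMG_SRC.source, "gi"); // fresh regex so lastIndex state is not shared
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    const resolved = resolveSrc(match[3], pageUrl);
    if (resolved) found.add(resolved);
  }
  return Array.from(found);
}

/** Point each stored image at its asset URL; images without a stored asset keep their original src. */
function rewriteImageUrls(
  html: string,
  pageUrl: string,
  assetIds: Map<string, string>, // absolute image URL -> stored asset ID
): string {
  return html.replace(IMG_SRC, (whole: string, prefix: string, quote: string, src: string) => {
    const resolved = resolveSrc(src, pageUrl);
    const assetId = resolved ? assetIds.get(resolved) : undefined;
    // "/api/assets/<id>" is an assumed URL shape for illustration.
    return assetId ? `${prefix}${quote}/api/assets/${assetId}${quote}` : whole;
  });
}
```

Leaving failed or skipped images on their original URL means the reader view degrades no further than it does today when a download fails.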


@digithree commented on GitHub (Jun 8, 2025):

Some thoughts on the spec so far @MohamedBassem:

  1. Do you intend the above to execute on the server (not on mobile, for example) and to use the existing asset management system (used for screenshots, etc.)?
  2. When in the process would the images of extracted content be downloaded as assets? The easiest hook is at content extraction time, as part of that process, but asset storage may then also balloon; it will just be a little more manageable.
  3. I can imagine users (or rather, admins) wanting reader assets to be separately deletable or limited, as distinct from bookmark screenshots (used for thumbnails), which they would likely want to be more permanent, since the reader assets will compete with them.
  4. Similarly, admins/users may want to disable this functionality or limit its storage capacity.

@MohamedBassem commented on GitHub (Jun 8, 2025):

  1. Yes, exactly. Probably a new asset type called `contentImages` that can be used to store those.
  2. Yes, it's going to happen as part of the crawler worker, or maybe the crawler can spin up a new job to do the image downloads so as not to degrade the reliability of the crawler worker.
  3. I agree that they'll probably spam/bloat the assets attached to a bookmark, so we'll need to ensure that they don't degrade the UX on the frontend and exclude them in those places.
  4. We can add a way to disable them, similar to full page archives.

I agree in general that they'll bloat the existing asset management system, and that's part of the reason I didn't do them before. But maybe that's not that big of a concern? We can also have an artificial limit of maybe 50 images per link or something.
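
The disable switch and the per-link cap floated above could look something like the sketch below. The names `ContentImageConfig`, `enabled`, `maxImagesPerLink`, and `selectImagesToStore` are hypothetical, and the default of 50 simply mirrors the number suggested in this comment rather than any decided value.

```typescript
// Hypothetical settings; names and defaults are illustrative assumptions only.
interface ContentImageConfig {
  enabled: boolean; // admin switch, analogous to disabling full page archives
  maxImagesPerLink: number; // artificial per-bookmark cap
}

const defaultConfig: ContentImageConfig = { enabled: true, maxImagesPerLink: 50 };

/** Pick which image URLs to download: none when disabled, otherwise the first N distinct URLs. */
function selectImagesToStore(
  urls: string[],
  config: ContentImageConfig = defaultConfig,
): string[] {
  if (!config.enabled) return [];
  const distinct = Array.from(new Set(urls)); // a Set keeps first-seen order
  return distinct.slice(0, Math.max(0, config.maxImagesPerLink));
}
```

Taking the first N images in document order is just one possible policy; images past the cap would keep pointing at their original URLs.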


@digithree commented on GitHub (Jun 10, 2025):

I'm not sure; I think it could potentially bloat quite a lot. I've realised what the core dissonance is for me: I had envisioned it as a cache, not a permanent asset library for the reader mode files. They're two different approaches that serve different needs, so once that is clear, a lot of the concerns melt away.

Either way, though, I think empowering admins with a disable flag and limits per user or at the site level will mean it doesn't catch anyone by surprise.
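
As a rough illustration of per-user and site-level limits, a storage check might look like the sketch below. The quota model, the `StorageLimits` and `StorageUsage` types, and `canStoreImage` are all assumptions made for the example; nothing here reflects an existing Karakeep setting.

```typescript
// Hypothetical quota model: byte limits per user and for the whole site; undefined means unlimited.
interface StorageLimits {
  perUserBytes?: number;
  siteBytes?: number;
}

interface StorageUsage {
  userBytes: number; // bytes of reader images already stored for this user
  siteBytes: number; // bytes of reader images already stored site-wide
}

/** Return true only if storing one more image keeps both the user and the site within their limits. */
function canStoreImage(sizeBytes: number, usage: StorageUsage, limits: StorageLimits): boolean {
  const withinUser =
    limits.perUserBytes === undefined || usage.userBytes + sizeBytes <= limits.perUserBytes;
  const withinSite =
    limits.siteBytes === undefined || usage.siteBytes + sizeBytes <= limits.siteBytes;
  return withinUser && withinSite;
}
```

Under the cache interpretation, a failed check could trigger eviction of older images instead of skipping the download; under the permanent-library interpretation, it would simply stop further downloads.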
