[GH-ISSUE #471] Crawler issues - lacking images and code blocks #303

Closed
opened 2026-03-02 11:48:38 +03:00 by kerem · 2 comments

Originally created by @Capsup on GitHub (Oct 4, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/471

Hey yo!

I've been looking for a system like this for a long time, but they all tend to have the same problem: crawling the websites I save usually ends up with a low-quality local version of the website, which makes it useless to me for combating link rot.

As an example, I am trying to locally cache [this](https://ashy.vargur.dev/fit-kitchen-the-tengu/) link, and the code blocks, which are the most important part of the site, do not get saved locally. Neither do the images below them.

Real site:
![image](https://github.com/user-attachments/assets/544f4e75-48fb-4a2c-a3bb-94f1f70c11b5)

Cached site:
![image](https://github.com/user-attachments/assets/93811851-aab8-48a7-bb81-fb5fcd88e6e6)

Another problem like it can be observed when crawling reddit links:
![image](https://github.com/user-attachments/assets/30a9753a-6b93-4f74-a24f-26a58a2e1733)

In general, many of the websites I tried crawling are missing content that is important to me. It's especially code blocks that seem not to be cached locally.

Are there any opportunities for me to configure Monolith to aid me in caching the content that I require?
Or can hoarder itself do something to improve the quality of the cache?

kerem closed this issue 2026-03-02 11:48:38 +03:00

@kamtschatka commented on GitHub (Oct 4, 2024):

You are looking at the "Cached Content" tab, not at the "Archive" tab.
"Cached Content" extracts the HTML and renders it in hoarder, so it has a lot of limitations. If you enable monolith archiving via the environment variables, there will be an archive (see the dropdown above). How does it look if you use that?


@Capsup commented on GitHub (Oct 4, 2024):

It seems you are right, that is my bad. I tried looking in the config for that option but must have missed it, and ended up assuming "Cached" was the equivalent.

To anyone who might run into the same issue: setting `CRAWLER_FULL_PAGE_ARCHIVE` to `true` activated the archive functionality, and it works as expected.

May I suggest we add a section to the docs on the website about enabling this option, specifically in the installation section?
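For anyone following along, a minimal sketch of what this looks like in an env-file-based setup (the variable name and value come from this thread; the file name `.env` and the comment wording are illustrative, not taken from the official docs):

```shell
# .env fragment for the hoarder/karakeep container:
# enable full-page archiving with monolith, in addition to the default cached-content extraction
CRAWLER_FULL_PAGE_ARCHIVE=true
```

After restarting the container with this set, new bookmarks should offer an "Archive" entry in the content dropdown alongside "Cached Content".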
