[GH-ISSUE #471] Crawler issues - lacking images and code blocks #303

Closed
opened 2026-03-02 11:48:38 +03:00 by kerem · 2 comments

Originally created by @Capsup on GitHub (Oct 4, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/471

Hey yo!

I've been looking for a system like this for a long time, but they all tend to have the same problem: crawling the websites I save usually ends up with a low-quality local version of the website, which makes it useless to me for combating link rot.

As an example, I am trying to locally cache [this](https://ashy.vargur.dev/fit-kitchen-the-tengu/) link, and the code blocks, which are the most important part of the site, do not get saved locally. Neither do the images below them.

Real site:
![image](https://github.com/user-attachments/assets/544f4e75-48fb-4a2c-a3bb-94f1f70c11b5)

Cached site:
![image](https://github.com/user-attachments/assets/93811851-aab8-48a7-bb81-fb5fcd88e6e6)

Another problem like it can be observed when crawling reddit links:
![image](https://github.com/user-attachments/assets/30a9753a-6b93-4f74-a24f-26a58a2e1733)

In general, many of the websites I tried crawling are missing content that is important to me. It's especially code blocks that seem not to be cached locally.

Are there any opportunities for me to configure Monolith to aid me in caching the content that I require?
Or can hoarder itself do something to improve the quality of the cache?

kerem closed this issue 2026-03-02 11:48:38 +03:00

@kamtschatka commented on GitHub (Oct 4, 2024):

You are looking at the "Cached Content" tab, not at the "Archive" tab.
"Cached Content" extracts the HTML and renders it in hoarder, so it has a lot of limitations. If you enable monolith archiving via the environment variables, there will be an archive (see the dropdown above). How does it look if you use that?


@Capsup commented on GitHub (Oct 4, 2024):

It seems you are right, that is my bad. I tried looking in the config for that option but must have missed it, and ended up assuming "Cached" was the equivalent.

To anyone who might run into the same issue: setting `CRAWLER_FULL_PAGE_ARCHIVE` to `true` activated the archive functionality, and it works as expected.

May I suggest we add a section to the docs on the website about enabling this option, specifically in the installation section?
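For anyone following along, a minimal sketch of what this looks like in an env-file-based setup (the variable name and value come from this thread; the file name `.env` and the comment wording are illustrative, not taken from the official docs):

```shell
# .env fragment for the hoarder/karakeep container:
# enable full-page archiving with monolith, in addition to the default cached-content extraction
CRAWLER_FULL_PAGE_ARCHIVE=true
```

After restarting the container with this set, new bookmarks should offer an "Archive" entry in the content dropdown alongside "Cached Content".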
