[GH-ISSUE #742] Worker timeouts during full-page archives are not cleaned up properly, causing duplicate assets and excessive disk usage #482

Closed
opened 2026-03-02 11:50:13 +03:00 by kerem · 13 comments
Owner

Originally created by @maya329 on GitHub (Dec 19, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/742

Describe the Bug

I'm currently running Hoarder in Docker, and today I found it's taking up 40 GB even though I only have 119 bookmarks; the disk usage seems very unusual. I ran ncdu and the results are attached below.

How do I find out which bookmarks are taking up 1GB of space?

Steps to Reproduce

  1. Run ncdu on the system and browse into the data folder
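As a scriptable alternative to browsing with ncdu, here is a small sketch that ranks the immediate subdirectories of a folder by total size. Point it at the assets folder inside your data directory; the exact path depends on your install, so treat the paths as assumptions:

```python
import os

def dir_sizes(root: str) -> list[tuple[str, int]]:
    """Return (subdirectory name, total bytes) pairs under root, largest first.

    A scriptable stand-in for browsing with ncdu; aim `root` at the
    assets folder inside the Hoarder data directory (path varies by install).
    """
    out = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        total = 0
        # Walk the subtree and sum every file's size.
        for dirpath, _dirs, files in os.walk(path):
            for f in files:
                total += os.path.getsize(os.path.join(dirpath, f))
        out.append((name, total))
    return sorted(out, key=lambda t: t[1], reverse=True)
```

The directory names it prints are the asset UUIDs referenced later in this thread.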

Expected Behaviour

Lower disk usage

Screenshots or Additional Context

![image](https://github.com/user-attachments/assets/27c49554-304b-447b-be16-b26cc44cff28)

Device Details

No response

Exact Hoarder Version

v0.19.0

kerem 2026-03-02 11:50:13 +03:00
  • closed this issue
  • added the
    bug
    label

@ctschach commented on GitHub (Dec 20, 2024):

Have you enabled video download? This is what happened to me….


@MohamedBassem commented on GitHub (Dec 20, 2024):

As @ctschach mentioned, this looks like video downloads being enabled indeed.

If you go inside the large folder and run `cat metadata.json`, it should tell you the type of the asset.

And if you want to know which asset this is, you can go to:

`https://<addr>/api/assets/<UUID>`

If you want to know which bookmark this asset belongs to, you can run the following query against the sqlite database:

`select bookmarkId from assets where id = '<UUID>';`

@maya329 commented on GitHub (Dec 20, 2024):

> Have you enabled video download? This is what happened to me….

No, that was my first thought too, but I checked all the YouTube bookmarks and the "video" tab in the dropdown is greyed out for all of them.


@maya329 commented on GitHub (Dec 20, 2024):

Seems like I have a ton of full page archives... Is there a quick way to clear these and only keep 1?

![image](https://github.com/user-attachments/assets/d041ef14-84ae-43c5-a77a-db5af656672b)

@MohamedBassem commented on GitHub (Dec 20, 2024):

how did you end up with that many? 😅 If you don't care about the bookmark, the easiest would be to remove it and re-add it.


@maya329 commented on GitHub (Dec 20, 2024):

I have no idea, haha. I just exported the entire Hoarder data to JSON, nuked the container, made a new one, then imported it back in. I'll keep a lookout and see what happens.


@maya329 commented on GitHub (Dec 20, 2024):

Alright, I've tested enough to conclude that it's that specific webpage causing the problem. It seems to time out during archiving, causing an endless loop as the worker tries again:

![image](https://github.com/user-attachments/assets/8f771791-7a75-43e5-9006-1c9101cfa56e)

The webpage I have bookmarked, in case you want to test it, is: https://www.interaction-design.org/literature/topics/visual-hierarchy


@kamtschatka commented on GitHub (Dec 20, 2024):

OK so I guess the reason is two-fold:

  • We are not properly handling worker timeouts, which causes the full-page archive job to be scheduled again and again, each run adding another asset to this bookmark.
  • You have not set `CRAWLER_JOB_TIMEOUT_SEC` high enough to give the crawler a chance to finish in time. Can you try increasing it and see if the problem persists, so we can confirm?
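For a docker-compose deployment, the variable can be raised via the service environment. The service name `web` here is an assumption about your compose file; 300 seconds is the value the reporter later confirms worked:

```yaml
# docker-compose.yml excerpt (service name is illustrative)
services:
  web:
    environment:
      - CRAWLER_JOB_TIMEOUT_SEC=300
```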

@maya329 commented on GitHub (Dec 20, 2024):

I increased the timeout to 300 and it's working well now. Thanks for the solution!


@debackerl commented on GitHub (Jan 18, 2025):

It not only creates duplicate assets but also orphan tags: the inference task can run multiple times because of a timeout, and different tags may be generated each time. Sometimes the 1st inference run creates new tags, but the 2nd run won't reuse them, generating orphan tags.

I think that in case of a timeout, completed tasks should not be retried. It costs money to rerun inference, and it takes time to rerun full-page archiving.

I believe a retry is worthwhile for HTTP errors, but not for a timeout. It might be worth making this a parameter in case someone wants to keep retrying slow websites.
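The proposed policy can be sketched as follows; `Failure` and `should_retry` are hypothetical names, and the opt-in flag models the suggested parameter for people who do want to keep retrying slow websites:

```python
import enum

class Failure(enum.Enum):
    """Why a crawl/inference job failed (hypothetical classification)."""
    TIMEOUT = "timeout"
    HTTP_ERROR = "http_error"

def should_retry(failure: Failure, retry_timeouts: bool = False) -> bool:
    """Retry policy from the suggestion above: always retry HTTP errors,
    skip timeouts unless the opt-in retry_timeouts flag is set."""
    if failure is Failure.HTTP_ERROR:
        return True
    return retry_timeouts
```

The key property is that a retried job would also skip subtasks (inference, archiving) that already completed, so a timeout in one step never re-runs the others.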


@petrm commented on GitHub (Feb 28, 2025):

> how did you end up with that many? 😅 If you don't care about the bookmark, the easiest would be to remove it and re-add it.

Is there any way to identify the orphaned duplicates? I am stuck with 40GB of those.


@MohamedBassem commented on GitHub (Mar 2, 2025):

@petrm The nightly build has a new `Manage assets` tab in the settings page. That will help you figure out the large assets. If you think you have "orphaned assets", you can reap those by running the `tidy asset` admin job.
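For anyone who wants to preview what an orphan cleanup would touch before running the admin job, here is a hypothetical read-only sketch. It assumes the on-disk layout of one directory per asset UUID and the `assets` table seen earlier in this thread; review the output before deleting anything:

```python
import os
import sqlite3

def orphaned_asset_dirs(assets_dir: str, db_path: str) -> list[str]:
    """List asset directories whose UUID has no row in the assets table.

    A manual, read-only approximation of an orphan check; assets_dir is
    the folder containing one subdirectory per asset UUID, db_path the
    karakeep sqlite database (both paths vary by install).
    """
    con = sqlite3.connect(db_path)
    try:
        known = {row[0] for row in con.execute("SELECT id FROM assets")}
    finally:
        con.close()
    return sorted(
        name
        for name in os.listdir(assets_dir)
        if os.path.isdir(os.path.join(assets_dir, name)) and name not in known
    )
```

It only lists candidates; actually reclaiming the space is what the admin job is for.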
