mirror of
https://github.com/karakeep-app/karakeep.git
synced 2026-04-26 00:16:03 +03:00
[GH-ISSUE #742] Worker timeouts during full page archives don't get cleaned properly causing duplicates and large space usage #482
Closed
opened 2026-03-02 11:50:13 +03:00 by kerem
·
13 comments
No Branch/Tag specified
main
refactor/use-npm-singlefile
onetab
claude/issue-2596-20260321-1401
claude/fix-docs-button-responsive-V3aBQ
claude/review-import-backpressure-D4ArJ
claude/fix-archived-bookmarks-mobile-P9OJW
claude/issue-1189-20260211-1601
claude/fix-nested-smart-lists-3uFkt
claude/issue-2298-20251223-1704
feat/import-v3
claude/add-cli-search-subcommand-6kIe0
claude/add-bookmark-indexing-timestamps-96bPj
claude/auto-disable-failing-feeds-fkDhP
claude/add-tag-search-aliases-HzESD
feat/docker-compose-dev
claude/add-attachedby-tags-endpoint-01WYfemMGHJJjXsPYLvUJAno
claude/fix-crawler-memory-leaks-NE7Ct
bookmark-debugger
claude/issue-2352-20260106-1120
claude/issue-1977-20260102-2348
claude/add-banner-rendering-JeLUk
claude/add-descendant-qualifier-cUm26
claude/skip-metadata-refresh-archives-CAo4Y
claude/fix-archive-pending-banner-pAyGM
claude/add-embeddings-support-h2swV
claude/nested-manage-lists-QVV85
claude/privacy-type-system-MG1bT
claude/add-action-menu-icons-6hNKw
claude/issue-2299-20251223-1711
claude/bookmark-indexing-progress-QwZSI
claude/migrate-bookmark-attachments-3O2te
claude/add-2025-wrapped-feature-tIUIh
claude/improve-ai-settings-design-639tq
claude/add-youtube-metascraper-plugin-0lWC7
claude/add-problem-reporting-gSSEV
claude/add-mobile-list-menus-spcS7
claude/shadcn-bookmark-cards-WWHzP
claude/add-extensions-link-HTeXc
claude/add-onboarding-screens-hsYMO
claude/fix-settings-switch-overflow-nlzM4
claude/clamp-bookmark-titles-diAEz
claude/port-stats-mobile-expo-MuXAn
claude/whats-new-base-version-vrv8C
claude/fix-settings-auth-checks-jgyD8
claude/add-server-version-display-3sGa2
claude/fix-tag-editor-scrolling-rzdbG
claude/add-company-pricing-card-y5mHY
claude/audit-optimize-transactions-xpDVc
codex/ensure-consistent-ui-experience-across-app-pages
claude/plan-opentelemetry-integration-01Jx183mz1Ev8h8JoYj97Auw
libsql
db-indicies
claude/export-import-lists-01UuCWwdaqduAd35NppvjnMD
claude/configurable-worker-timeout-0198GQh6YrrRzqG62xnogyrz
claude/check-import-quota-01CPdxTpHp18Ba62bYcBTVbA
claude/scraper-worker-thread-01FEHen6MGrQHmdBstJSuiyA
claude/customize-dialog-styling-01CVjEv2KgyZJSpCg3mqkvR7
claude/add-asset-cache-headers-0175WhNcqwiwurrmjj52jnLT
claude/add-db-search-plugin-017Xxd4Jq3MfjWT788vgfbaq
benchmarks-2
claude/add-filtered-deletion-01DTxWNcg3hhqdNpeNLa3s6L
claude/actionbutton-loading-spinner-015DY5ZTvgPgFAXTZz3UGaYv
claude/add-broken-links-qualifier-01S31X1LsKiYb9gE1dXTKvi3
claude/docker-release-tag-trigger-01UmzFXEumhK2jdmRGtMcueo
claude/spread-feed-fetch-scheduling-01EihUtmZSyqeE1HfRMessxW
restate-idempotency
claude/align-android-ios-colors-01GJfkhEyZVBReohVioPa8ok
claude/improve-mobile-app-colors-0155LzHfkd5HyJr6YyZMsus5
codex/add-autocomplete-for-search-query-language
claude/add-bookmark-backups-016L2A8Z94n7tDgDdMPdFuAd
claude/restrict-binary-user-permissions-01FSGyy2RXGZvE26YbAejzGi
effect-ts
claude/prepare-trpc-npm-publish-0193EjfwpxSNVNcLXqXjs6Ln
shared-list-sidebar
claude/lazy-load-tiktoken-017UTNpJPTcMMQvNEBa1aFwo
codex/fix-asset-pre-processing-worker-abort-signals
add-groupid
claude/add-bookmark-list-button-01VF7uXYNLsVDzqdozWMXP5M
claude/extract-shared-ui-components-01DSVfaCr6WRqAyx1vJTZk9r
claude/migrate-shadcn-sidebar-01DKjpg9MD5PJ2potemSnbvW
claude/add-collaborators-rate-limits-01VjXyRWWPUkGQKa8d8D8qKj
claude/modernize-dark-mode-01FRfE81PAY5C44pFu1cYocf
claude/add-signed-url-bookmark-01PjYT1ZhvLK2FPJNTAhJsWf
restate-group-id
claude/add-highlights-page-012vhHpn8fVNp3gf7gBeW14s
claude/disable-shared-bookmark-features-01B9fiGUdu6NyWaxSQFsQBxP
claude/mobile-bookmark-grid-layouts-018cGBBMhPJVq6PJVRBpqT2r
claude/add-mobile-bookmark-summary-01494LYoh4sJW5Fj4GPm62Vj
claude/add-mobile-tags-screen-01WRADt4ZzvXVew1Y9vqF8SV
claude/add-highlight-notes-01LpanRLS4a2YMnT1qB5GTqX
claude/add-search-bar-014k2ngaqjwYRVSvqmbuECqr
claude/hide-collaborator-emails-01TQrkkMupC7CR9BTuDkireg
claude/list-invitation-approval-0129V89M1riXW6JqmoF74VfM
claude/add-bookmark-archive-sort-018VbGPGvtmsGgXFEERoAX7B
claude/add-mobile-smart-lists-01251tYo9u1SywE6XFezAv9e
claude/bookmark-drag-drop-01DmWq286ogHpDGHKcXjKr3z
claude/add-rss-import-01DH1Q2axcDeq8nQJR5MWjPJ
claude/mobile-inapp-browser-auth-01KiT6bwyntRPQ1X4oTtAveC
claude/offline-mode-react-query-01D1rE2bdBEPw2teGqunr5Gd
claude/add-singlefile-extension-support-01BEB9QQZABzwfZDvR9Bz5b2
claude/custom-list-slugs-01VxcfkNUXZ97FNpNVURopMq
claude/issue-2148-20251118-1133
claude/add-groupid-queue-fairness-011CV1r8Wb46HuGAg5o95i3m
claude/hide-viewer-shared-lists-01Fst6NBvdxrXXnDhUmjsNDP
claude/collaborative-lists-013AvDvMqkoszDVcSoCYgBcM
claude/implement-feature-01LT5XzGsbEhZkYXNEjEwdui
claude/fix-bookmark-loading-state-01AgF4H2drxwuTCJDB2Xgiu4
claude/admin-user-edit-013tbiRmb1KX2fhSYqmGKCu8
claude/expose-all-api-01YTruEW72WQYMtq4iZoaPkA
claude/add-doc-link-main-016NYLxShpKuH6R8XCBgeZtc
claude/fix-issue-2133-019JLvdSRAUbU4FtjQztcM6S
claude/explore-effect-ts-integration-01F7xb1dWwP1ma4LnLbFGfDD
claude/optimize-dockerfile-build-011CV5gDnPZbdbbVSPDofC4e
claude/add-custom-headers-guide-011CV249t16aWDRb1mCrzQdC
claude/mobile-app-signup-011CUxPtCXgU6U3T8GShTR2Q
claude/crawler-worker-fetch-browser-011CUvcRc24XEr9DTWDW6MX8
claude/fix-issue-784-011CUvubQrcZHG9S3KjpCKbK
codex/add-user-settings-for-inference-language-and-screenshots
claude/fix-mobile-signin-server-address-011CUnaUWwY2Fhq5Xbwhgr8H
better-auth-2
claude/issue-2028-20251012-1429
claude/issue-1010-20251012-1154
codex/update-feed-refresh-job-idempotency-key
restate
import-v2
fix-public-lists
recurse-delete-list
abort-dangling-processing
tag-pagination
ratelimit-plugin
claude/issue-1937-20250914-0912
codex/implement-title-search-query-qualifier
copilot/add-edit-button-for-notes
cookie-path
ai-tag-cleanup
codex/add-allowlist-and-blocklist-env-variables
mobile-retheme
expo-next-upgrade
opencode/issue1788-20250727215611
fix-trailing-slash-deduplication
edit-bookmark-dialog
bookmark-embeddings
rag
nextjs-15
bookmark-hover-bar
sapling-pr-archive-MohamedBassem
track-bookmark-assets
json-cli
admin-settings
mobile-dark-mode
android/v1.9.2-0
ios/v1.9.1-1
android/v1.9.1-0
ios/v1.9.1-0
ios/v1.9.0-2
ios/v1.9.0-1
android/v1.9.0-1
extension/v1.2.9
cli/v0.31.0
sdk/v0.31.0
mcp/v0.31.0
android/v1.9.0-0
ios/v1.9.0-0
v0.31.0
android/v1.8.5-0
cli/v0.30.0
sdk/v0.30.0
ios/v1.8.4-0
android/v1.8.4-0
v0.30.0
cli/v0.29.1
v0.29.3
v0.29.2
v0.29.1
sdk/v0.29.0
cli/v0.29.0
mcp/v0.29.0
ios/v1.8.3-0
android/v1.8.3-0
extension/v1.2.8
v0.29.0
android/v1.8.2-2
android/v1.8.2-1
ios/v1.8.2-0
android/v1.8.2-0
extension/v1.2.7
android/v1.8.1-0
ios/v1.8.1-0
v0.28.0
cli/v0.27.1
cli/v0.27.0
v0.27.1
sdk/v0.27.0
v0.27.0
android/v1.8.0-1
ios/v1.8.0-1
mcp/v0.26.0
sdk/v0.26.0
v0.26.0
cli/v0.25.0
ios/v1.7.0-1
mcp/v0.25.0
v0.25.0
extension/v1.2.6
ios/v1.7.0-0
android/v1.7.0-0
v0.24.1
v0.24.0
mcp/v0.23.10
mcp/v0.23.9
mcp/v0.23.8
extension/v1.2.5
mcp/v0.23.7
mcp/v0.23.6
mcp/v0.23.5
mcp/v0.23.4
sdk/v0.23.2
cli/v0.23.0
extension/v1.2.4
android/v1.6.9-1
ios/v1.6.9-1
v0.23.2
v0.23.1
sdk/v0.23.0
v0.23.0
ios/v1.6.9-0
sdk/v0.22.0
v0.22.0
android/v1.6.8-0
ios/v1.6.8-0
sdk/v0.21.2
sdk/v0.21.1
sdk/v0.21.0
v0.21.0
cli/v0.20.0
v0.20.0
ios/v1.6.7-4
android/v1.6.7-4
ios/v1.6.7-3
android/v1.6.7-3
android/v1.6.7-2
ios/v1.6.7-2
android/v1.6.7-1
ios/v1.6.7-1
ios/v1.6.7-0
android/v1.6.7-0
v0.19.0
android/v1.6.6-0
android/v1.6.5-0
ios/v1.6.5-0
ios/v1.6.4-0
android/v1.6.4-0
v0.18.0
v0.17.1
v0.17.0
ios/v1.6.3-0
android/v1.6.3-0
extension/v1.2.3
ios/v1.6.2-1
android/v1.6.2-1
ios/v1.6.2-0
android/v1.6.2-0
v0.16.0
ios/v1.6.1-3
android/v1.6.1-3
ios/v1.6.1-2
android/v1.6.1-2
android/v1.6.1-1
ios/v1.6.1-1
android/v1.6.1-0
ios/v1.6.1-0
extension/v1.2.2
android/v1.6.0-1
ios/v1.6.0-1
ios/v1.6.0
android/v1.6.0
cli/v0.13.7
cli/v0.13.6
v0.15.0
cli/v0.13.5
extension/v1.2.1
v0.14.0
cli/v0.13.3
cli/v0.13.2
cli/v0.13.1
cli/v0.13.0
v0.13.1
v0.13.0
mobile-v1.5.0
mobile-v1.4.0
v0.12.2
v0.12.1
v0.12.0
v0.11.1
v0.11.0
v0.10.1
v0.10.0
v0.9.0
v0.8.0
v0.7.0
v0.6.0
v0.5.0
v0.4.1
v.0.4.0
v.0.3.1
v0.3.0
v0.2.2
v0.2.1
v0.2.0
v0.1.0
Labels
Clear labels
Mirrored from GitHub Pull Request
UI/UX
android
bug
dependencies
documentation
documentation
extension
feature request
feature request
good first issue
ios
long-term
performance
pri/high
pri/low
pri/medium
pull-request
Mirrored from GitHub Pull Request
question
status/approved
status/icebox
status/pending_clarification
status/untriaged
No labels
UI/UX
android
bug
dependencies
documentation
documentation
extension
feature request
feature request
good first issue
ios
long-term
performance
pri/high
pri/low
pri/medium
pull-request
question
status/approved
status/icebox
status/pending_clarification
status/untriaged
Milestone
Clear milestone
No items
No milestone
Projects
Clear projects
No items
No project
Assignees
Clear assignees
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".
No due date set.
Dependencies
No dependencies set.
Reference
starred/karakeep#482
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @maya329 on GitHub (Dec 19, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/742
Describe the Bug
I'm currently running hoarder within docker and today I checked and found out it's taking up 40GB while I only have 119 bookmarks, the file size seems to be very unusual. I did an ncdu and the results are attached below.
How do I find out which bookmarks are taking up 1GB of space?
Steps to Reproduce
Expected Behaviour
Less file size
Screenshots or Additional Context
Device Details
No response
Exact Hoarder Version
v0.19.0
@ctschach commented on GitHub (Dec 20, 2024):
Have you enabled video download? This is what happened to me….
@MohamedBassem commented on GitHub (Dec 20, 2024):
As @ctschach mentioned, this looks like video downloads being enabled indeed.
If you go inside the large folder and run
cat metadata.jsonit should tell you the type of the asset.And if you want to know which asset this is, you can go to:
https://<addr>/api/assets/<UUID>If you want to know to which bookmark does this asset belong, you can run the following query against the sqlite database:
@maya329 commented on GitHub (Dec 20, 2024):
No, this was my first thought, but I checked all the youtube bookmarks and the "video" tab in the dropdown are all greyed out.
@maya329 commented on GitHub (Dec 20, 2024):
Seems like I have a ton of full page archives... Is there a quick way to clear these and only keep 1?
@MohamedBassem commented on GitHub (Dec 20, 2024):
how did you end up with that many? 😅 If you don't care about the bookmark, the easiest would be to remove it and re-add it.
@MohamedBassem commented on GitHub (Dec 20, 2024):
how did you end up with that many? 😅 If you don't care about the bookmark, the easiest would be to remove it and re-add it.
@maya329 commented on GitHub (Dec 20, 2024):
I have no idea... Haha, I just exported the entire hoarder data to JSON, nuked the container and made a new one then imported in. I'll keep a lookout and see what happens after.
@maya329 commented on GitHub (Dec 20, 2024):
Alright, I've tested enough and come to the conclusion it's that specific webpage that is causing problems. Seems like it's timing out during archiving and causing an endless loop as the worker try again:
The webpage I have bookmarked is: https://www.interaction-design.org/literature/topics/visual-hierarchy
If you want to test.
@kamtschatka commented on GitHub (Dec 20, 2024):
OK so I guess the reason is two-fold:
@maya329 commented on GitHub (Dec 20, 2024):
I increased the timeout to 300 and it's working well now. Thanks for the solution!
@debackerl commented on GitHub (Jan 18, 2025):
It not only create duplicate assets but also orphan tags: it can happen that inference task is run multiple times because of a timeout, but different tags could be generated each time. Sometimes, the 1st inferno task created new tags, but 2nd inference will bot use them again generating orphan tags.
I think that in case of a timeout, completed tasks should not be retried. It costs to rerun inference, and takes time to rerun full page archiving.
I believe that a retry is worth in case of HTTP errors, but not of a timeout. Maybe worth making it a parameter if someone want to keep retrying slow websites.
@petrm commented on GitHub (Feb 28, 2025):
Is there any way to identify the orphaned duplicates? I am stuck with 40GB of those.
@MohamedBassem commented on GitHub (Mar 2, 2025):
@petrm The nightly build has a new
Manage assetstab in the settings page. Those will help you figure out the large assets. If you think you have "orphaned assets", those you can reap by running thetidy assetadmin job.