[GH-ISSUE #364] Non-webpage bookmarks duplication problem #237

Open
opened 2026-03-02 11:47:52 +03:00 by kerem · 11 comments
Owner

Originally created by @Haze-sh on GitHub (Aug 27, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/364

When adding links that point to non-webpage content (like PDFs or GIFs), the bookmark gets duplicated in the database. I don't think this is the expected behaviour.


@kamtschatka commented on GitHub (Aug 31, 2024):

I have added a GIF and a PDF to hoarder and got only 1 entry. Please provide more information on what you are doing and where you are seeing the duplication.


@Haze-sh commented on GitHub (Sep 1, 2024):

Sorry, this wasn't really clear. I was importing a list of URLs through the command line. When I reimport the same list, the PDFs and GIFs get re-added to hoarder, but webpages are not duplicated. They are also not searchable.


@kamtschatka commented on GitHub (Sep 15, 2024):

Can you please explain exactly what you are doing?
I ran `bookmarks add --link https://asdf.readthedocs.io/_/downloads/en/2.9.2/pdf/` and got the response `alreadyExists: false`.
Then I ran it again and got the response `alreadyExists: true`, and the bookmark was not added again.

Which commands are you calling to get the duplication?


@Haze-sh commented on GitHub (Oct 5, 2024):

I can't replicate this with the exact same command above! I don't have a clue what could be different on my machine.


@Haze-sh commented on GitHub (Oct 5, 2024):

For further clarification, the above command gives `alreadyExists: false` on every re-run.


@kamtschatka commented on GitHub (Oct 5, 2024):

Are you on the latest version (0.17.1)?
I have just rerun the test and I can't reproduce it. Maybe someone else can reproduce it, but I don't really understand why the same command would lead to different outcomes on our machines, as it is quite simple^^


@kamtschatka commented on GitHub (Oct 5, 2024):

OK, now I can reproduce it: if you wait long enough for the crawl to complete, then this happens. I will have a look.


@Haze-sh commented on GitHub (Jan 10, 2025):

I can confirm this is still reproducible as of version 0.21.0.


@andygeorge commented on GitHub (May 18, 2025):

Running into this as well, using the nightly Docker image. I have an automated import of "top X popular links" from a site, and deduplication seems to work great for everything but PDFs - duplicate PDFs can be imported many times in this situation.


@thiswillbeyourgithub commented on GitHub (May 21, 2025):

This got me thinking. Duplicate handling is actually not at all obvious and is not directly documented in the FAQ.

It would be nice to get the maintainer's view on the following:

  1. If I use karakeep for years, I may well bookmark the same page twice. For example, say you read and highlight the Wikipedia page of a politician. Then they get elected, and their Wikipedia page changes drastically. It would not be surprising to want to re-add it and read it again a year from now.
    • Should karakeep refuse to add the second bookmark because it has the same URL?
    • Should karakeep overwrite the first bookmark because it prioritizes updated content over past highlights?
    • Should karakeep add the second bookmark as an entirely different bookmark that just happens to share a URL with another one?
    • Should karakeep be able to handle multiple versions of the content per bookmark? In that case we'd have a single bookmark with two versions of the webpage, so we could highlight both the previous and current versions without losing anything.
  2. This time about PDFs: is the deduplication done on a sourceUrl basis? A hash? If the file is the same, will the metadata be updated (for example the createdAt metadata)?
  3. About videos: if I add the same YouTube URL twice but have changed the yt-dlp arguments, what will happen?

I think those questions are bound to be asked by anyone using karakeep for long enough, so it would be good to give the detailed answers in the documentation (a FAQ?).

Edit: Related to #859, where the maintainer provided a [helpful answer](https://github.com/karakeep-app/karakeep/issues/859#issuecomment-2585221088), but not for all my questions.
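
To make the PDF question above more concrete, here is a minimal sketch of the two candidate deduplication keys it mentions. This is not karakeep's implementation: `urlKey` and `contentKey` are hypothetical helpers, and dropping the URL fragment is only an illustrative normalisation choice.

```typescript
import { createHash } from "node:crypto";

// Option 1: key on the source URL. Dropping the fragment is an illustrative
// normalisation choice, not something karakeep is known to do.
function urlKey(rawUrl: string): string {
  const u = new URL(rawUrl);
  u.hash = "";
  return u.toString();
}

// Option 2: key on the file contents via a SHA-256 digest.
function contentKey(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

const encoder = new TextEncoder();
const fileV1 = encoder.encode("%PDF-1.7 first version");
const fileV1Copy = encoder.encode("%PDF-1.7 first version");
const fileV2 = encoder.encode("%PDF-1.7 revised version");

// Same file served from two URLs: URL keys differ, content keys match.
const mirrorsShareUrlKey =
  urlKey("https://a.example/paper.pdf") === urlKey("https://b.example/paper.pdf");
const mirrorsShareContentKey = contentKey(fileV1) === contentKey(fileV1Copy);

// Same URL, revised file: the URL key is unchanged, content keys differ.
const revisionSharesContentKey = contentKey(fileV1) === contentKey(fileV2);

console.log({ mirrorsShareUrlKey, mirrorsShareContentKey, revisionSharesContentKey });
```

A URL key treats a revised file at the same address as a duplicate, while a content key treats it as a new item. Which of the two is wanted is exactly the design question raised here.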


@akhdanfadh commented on GitHub (Jan 22, 2026):

Totally agree with @thiswillbeyourgithub, deduplication is not obvious and needs a design decision / documentation.

I ran into this while building my HN [syncer](https://github.com/akhdanfadh/hnkeep). Some HN submissions are PDFs, so this is relatable. In the meantime, as a workaround, I've implemented deduplication client-side by comparing the `sourceUrl` (nullable in the asset schema, which makes sense). As of v0.30.0, server-side deduplication only covers the link type. Even when I create the bookmark via a link-type request, the crawler sees the PDF and changes the bookmark type to asset automatically.

https://github.com/karakeep-app/karakeep/blob/d472a3a1c428bad8ce2ddc0822fb5b327e9465d4/packages/trpc/routers/bookmarks.ts#L91-L106
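
The behaviour described in this comment, and the client-side workaround, can be sketched roughly as follows. This is a simplified in-memory model, not karakeep's actual code: the `Bookmark` type and the functions `findExistingLinkOnly`, `findExistingByAnyUrl`, and `convertToAsset` are hypothetical names, and the model assumes the crawler copies the original URL into `sourceUrl` when it converts a link into an asset.

```typescript
// Hypothetical, simplified bookmark shapes; not karakeep's real schema.
type Bookmark =
  | { id: string; type: "link"; url: string }
  | { id: string; type: "asset"; sourceUrl: string | null };

// Models a duplicate check that only considers link-type bookmarks.
function findExistingLinkOnly(all: Bookmark[], url: string): Bookmark | undefined {
  return all.find((b) => b.type === "link" && b.url === url);
}

// Client-side workaround: also compare against the sourceUrl of assets.
function findExistingByAnyUrl(all: Bookmark[], url: string): Bookmark | undefined {
  return all.find((b) => (b.type === "link" ? b.url === url : b.sourceUrl === url));
}

// Models the crawler turning a link bookmark into an asset once it sees a PDF.
// Assumption: the original URL is preserved as sourceUrl.
function convertToAsset(all: Bookmark[], id: string): Bookmark[] {
  return all.map((b): Bookmark =>
    b.id === id && b.type === "link"
      ? { id: b.id, type: "asset", sourceUrl: b.url }
      : b,
  );
}

const pdfUrl = "https://example.com/paper.pdf";
let store: Bookmark[] = [{ id: "1", type: "link", url: pdfUrl }];

// Before the crawl finishes, the link-only check still finds the bookmark.
const beforeCrawl = findExistingLinkOnly(store, pdfUrl) !== undefined;

store = convertToAsset(store, "1");

// After conversion the link-only check misses it, so a re-import would duplicate.
const afterCrawlLinkOnly = findExistingLinkOnly(store, pdfUrl) !== undefined;
// The sourceUrl comparison still finds the converted bookmark.
const afterCrawlAnyUrl = findExistingByAnyUrl(store, pdfUrl) !== undefined;

console.log({ beforeCrawl, afterCrawlLinkOnly, afterCrawlAnyUrl });
```

In this model the duplicate only becomes possible after the conversion step, which would be consistent with the earlier observation that the problem shows up once the crawl has completed. Because `sourceUrl` is nullable, an asset without one can never match, so uploads with no recorded source stay outside this kind of check.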

Reference
starred/karakeep#237