[GH-ISSUE #263] Calculate hash to prevent duplicates from being imported #213

Closed
opened 2026-02-25 21:31:27 +03:00 by kerem · 1 comment
Owner

Originally created by @croontje on GitHub (Dec 20, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/263

I was testing a little bit today, and I noticed that I can upload the same PDF over and over again.
I think it would be better to calculate a hash (e.g. MD5, SHA, ...) and check on import whether a document with that hash already exists.
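
A minimal sketch of the kind of check suggested here, not Papermerge's actual import code. The names `file_digest`, `import_document` and `known_digests` are hypothetical, and SHA-256 is an assumed choice; a real implementation would store digests in the database rather than in an in-memory set.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_document(path, known_digests):
    """Hypothetical import hook: skip files whose digest was already seen.

    Returns True if the file is accepted, False if it is a byte-for-byte
    duplicate of a previously imported file.
    """
    digest = file_digest(path)
    if digest in known_digests:
        return False
    known_digests.add(digest)
    return True
```

This only catches byte-identical uploads: two scans of the same sheet of paper produce different bytes and therefore different digests.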

It would be even better to detect two versions of the same document, but I think that's almost impossible...
For example, if you scan the same document twice, the two scans would be flagged as duplicates. But as I said, this seems almost impossible to me :)

kerem 2026-02-25 21:31:27 +03:00
Author
Owner

@ciur commented on GitHub (Dec 21, 2020):

@croontje, whether this feature is useful or not is a matter of debate.

There is a duplicate issue #167 on this topic.

For me personally, including this feature in core would make development mode more complex: while fixing bugs, or just during development, I often upload the same 2-3 documents I have at hand several times to "simulate" a multitude of documents. With this hashing thingy I would need to adjust the defaults to get around the duplicate check.

Proper de-duplication (I mean where similar documents are detected, which may not necessarily be 100% identical) is a good thing to have and it will be added later.
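
One way such similarity detection could work, sketched under loud assumptions: it compares the OCR-extracted text of two documents using Jaccard similarity over word shingles. This is an illustrative technique, not what Papermerge implements; `shingles`, `jaccard`, `looks_like_duplicate` and the 0.8 threshold are all hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_duplicate(text_a, text_b, threshold=0.8):
    """Flag two texts as probable duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Because it works on extracted text rather than file bytes, two scans of the same page can match even when OCR misreads a word or two; the threshold controls how much disagreement is tolerated.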

Starting with Papermerge 2.0, you will be able to create a separate app (app == extension == plugin) which will provide this functionality. Actually, I will create a hashing app as an example of how to write external apps that extend core functionality.


Reference
starred/papermerge#213