[GH-ISSUE #466] Edit OCRed text #362

Open
opened 2026-02-25 21:31:46 +03:00 by kerem · 3 comments
Owner

Originally created by @testingsw1 on GitHub (Jul 19, 2022).
Original GitHub issue: https://github.com/ciur/papermerge/issues/466

Originally assigned to: @ciur on GitHub.

For me, OCR results are not that good and, sadly, sometimes really bad. I would like to have at least a few important keywords OCRed correctly.
Is it possible to add an "Edit OCRed text" option (a button in the GUI, under "View OCRed text" and above Tags) so we can fix the text manually?


@ciur commented on GitHub (Jul 19, 2022):

This feature makes sense. However, editing the OCR text steps on the toes of the versioning concept.
Say you have a document X.pdf which you OCRed. You then have version 0 (the original) and version 1 with OCRed text (of poor quality, i.e. some words were not detected properly).
If you decide to edit the OCRed text, some later OCR runs will discard your text corrections. If you choose to manually (re-)run OCR, the newer version (version 3, created when you clicked "run OCR") will again have poor-quality OCR with missing keywords. Similarly, if you rotate one page within the document, the entire document will be OCRed again, which results in a newer version (version 3) without the corrected text.

On the other hand, moving pages around, deleting pages, reordering pages, and merging documents will increase the document's version but will preserve the corrected text.
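
The trade-off can be illustrated with a small toy model. This is only a sketch: `ToyDocument` and the operation names are hypothetical labels for the cases described above, not Papermerge's actual code, and it assumes that a manual text correction is itself stored as a new version.

```python
# Toy model only: the operation names below are hypothetical labels for the
# cases described above, not Papermerge's real API.
RERUNS_OCR = {"run_ocr", "rotate_page"}
KEEPS_TEXT = {"move_pages", "delete_pages", "reorder_pages", "merge_documents"}


class ToyDocument:
    """Tracks only the text layer of each version of one document."""

    def __init__(self, ocr_output):
        self.ocr_output = ocr_output  # what the OCR engine yields on every run
        # version 0: original upload (no text layer), version 1: first OCR run
        self.versions = [None, ocr_output]

    def edit_text(self, corrected):
        # Assumption: a manual correction is stored as a new version.
        self.versions.append(corrected)

    def apply(self, operation):
        if operation in RERUNS_OCR:
            # Fresh OCR output replaces whatever text the last version had.
            self.versions.append(self.ocr_output)
        elif operation in KEEPS_TEXT:
            # New version, but the text is carried over unchanged.
            self.versions.append(self.versions[-1])
        else:
            raise ValueError(f"unknown operation: {operation}")
        return len(self.versions) - 1  # number of the version just created


doc = ToyDocument("lnvoice 2O22")
doc.edit_text("Invoice 2022")        # version 2: corrected text
kept = doc.apply("reorder_pages")    # version 3: correction preserved
lost = doc.apply("run_ocr")          # version 4: correction discarded
```

In this model the correction survives the reorder but is gone again after the OCR re-run, which is the behaviour the comment warns about.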

@testingsw1
If the trade-off described above sounds OK to you, then I am perfectly fine going forward and implementing this feature.


@testingsw1 commented on GitHub (Jul 19, 2022):

Thank you! Of course I am fine with this. I think most of us keep these documents as archives, so there is not a lot of versioning. This feature will really help! I have a scanner with Abbyy software that does a great job on some documents; with this feature I can easily replace wrong text in papermerge (just copy the correct text from Abbyy). Can't wait for the new version! Thank you again!

P.S. Is it possible to delete unneeded versions via the console? I can get to the right document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.


@ciur commented on GitHub (Jul 20, 2022):

> P.S. Is it possible to delete unneeded versions via the console? I can get to the right document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.

Yes, just keep in mind that I've never tested what happens when you delete a document version, and there is no such "official feature" :)

In the shell:

In [2]: from papermerge.core.models import Document
In [3]: doc = Document.objects.first()
In [4]: doc_version = doc.versions.get(number=2)
In [5]: doc_version
Out[5]: <DocumentVersion: id=f6298a48-a991-4b8a-a75c-5fde6d899c4b number=2>
In [6]: doc_version.delete()
Out[6]: (3, {'core.Page': 2, 'core.DocumentVersion': 1})

Here number=2 is the version number of the document instance. The last line of the output above means that 3 objects were deleted from the database: one DocumentVersion and two associated Page model instances.
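
For reference, Django's `delete()` returns a tuple of the total number of deleted objects and a dictionary of counts per model label. A minimal sketch of reading that tuple follows; the helper name `describe_delete_result` is made up for illustration and is not part of Django or Papermerge.

```python
def describe_delete_result(result):
    """Summarize the (total, per_model_counts) tuple returned by delete()."""
    total, per_model = result
    # Sanity check: the total should equal the sum of the per-model counts.
    if total != sum(per_model.values()):
        raise ValueError("total does not match the per-model counts")
    parts = [f"{count} x {label}" for label, count in sorted(per_model.items())]
    return f"{total} objects deleted: " + ", ".join(parts)


# The value shown in Out[6] above.
summary = describe_delete_result((3, {'core.Page': 2, 'core.DocumentVersion': 1}))
print(summary)  # 3 objects deleted: 1 x core.DocumentVersion, 2 x core.Page
```

Checking the per-model counts before trusting a deletion is a cheap way to notice when a cascade removed more than expected.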
