mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 03:55:58 +03:00
[GH-ISSUE #466] Edit OCRed text #362
Originally created by @testingsw1 on GitHub (Jul 19, 2022).
Original GitHub issue: https://github.com/ciur/papermerge/issues/466
Originally assigned to: @ciur on GitHub.
For me, OCR results are not that good, and sadly sometimes really bad. I would like to have at least a few important keywords OCRed.
Is it possible to add an "Edit OCRed text" option (a button in the GUI, under "View OCRed text", above Tags) so we can manually fix the text?
@ciur commented on GitHub (Jul 19, 2022):
This feature makes sense. However, "editing the OCR text" steps on the toes of the versioning concept.
Say you have a document X.pdf which you OCRed. Thus you have version 0 (the original) and version 1 with OCRed text (of poor quality, i.e. some words were not detected properly).
If you decide to edit the OCRed text, then some later OCR runs will discard your text "corrections". If you choose to manually (re-)run OCR, the newer version (version 3, created when you clicked "run OCR") will again have "bad quality OCR with missing keywords". Similarly, if you rotate one page within the document, the entire document will be OCRed again, which will result in a newer version (version 3) without the corrected text.
On the other hand, moving pages around, deleting pages, reordering pages, and merging documents will increase the document's version but will preserve the corrected text.
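To make the trade-off concrete, here is a toy model of the version history in plain Python. It is only a sketch of the behaviour described above, not Papermerge code: the `Document` and `Version` classes, the operation names, and the assumption that a manual edit creates its own version are all hypothetical, and it ignores that page operations would change the text of the affected pages.

```python
from dataclasses import dataclass, field

# Hypothetical operation names, grouped as described in the comment above.
REOCR_OPS = {"run_ocr", "rotate_page"}  # re-run OCR, so corrections are lost
PRESERVING_OPS = {"move_pages", "delete_pages", "reorder_pages", "merge"}


@dataclass
class Version:
    number: int
    text: str
    manually_corrected: bool = False


@dataclass
class Document:
    ocr_output: str  # what the OCR engine produces for this file
    versions: list = field(default_factory=list)

    def __post_init__(self):
        self.versions.append(Version(0, ""))               # version 0: original
        self.versions.append(Version(1, self.ocr_output))  # version 1: OCRed

    @property
    def latest(self):
        return self.versions[-1]

    def edit_text(self, new_text):
        # Assumption: a manual correction is stored as a new version.
        self.versions.append(Version(self.latest.number + 1, new_text, True))

    def apply(self, op):
        number = self.latest.number + 1
        if op in REOCR_OPS:
            # Fresh OCR output replaces whatever text was there before.
            new = Version(number, self.ocr_output, False)
        elif op in PRESERVING_OPS:
            # Text, including manual corrections, is carried over.
            new = Version(number, self.latest.text, self.latest.manually_corrected)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.versions.append(new)


doc = Document(ocr_output="lnvoice 2O22")  # poor OCR result
doc.edit_text("Invoice 2022")              # version 2: corrected by hand
doc.apply("reorder_pages")                 # version 3: correction preserved
text_after_reorder = doc.latest.text
doc.apply("run_ocr")                       # version 4: correction discarded
text_after_ocr = doc.latest.text
```

In this model the corrected text survives the page reorder but is replaced by the raw OCR output as soon as OCR runs again.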
@testingsw1
If the trade-off described above sounds OK to you, then I am perfectly fine going forward and implementing this feature.
@testingsw1 commented on GitHub (Jul 19, 2022):
Thank you! Of course I will be fine with this. I think most of us keep these documents as archives and there is not a lot of versioning. This feature will really help! I have a scanner with Abbyy software that does a great job on some documents. With this feature I can easily replace wrong text in papermerge (just copy the correct text from Abbyy). Can't wait for the new version! Thank you again!
P.S. Is it possible to delete unneeded versions via the console? I can get to the correct document via manage.py shell --> Document.objects.get(id=XX), but I am not sure how to get (and delete) a specific version.
@ciur commented on GitHub (Jul 20, 2022):
Yes, just keep in mind that I've never tested what happens when you delete a document version, and there is no such "official feature" :)
In shell>
Where number=2 is the version number of the document instance. The last line of the above output means that 3 objects were deleted from the database: one DocumentVersion and two associated Page model instances.
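For anyone wondering why deleting a single version removes three objects, the sketch below reproduces the same count with Python's standard `sqlite3` module. It is not the Papermerge schema or the Django shell session from this comment: the table and column names are made up for illustration, and the cascade is done by SQLite here, whereas Django normally handles `on_delete=CASCADE` itself and reports the result of `delete()` as a total plus per-model counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical tables standing in for DocumentVersion and Page.
conn.executescript("""
CREATE TABLE document_version (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    number INTEGER NOT NULL
);
CREATE TABLE page (
    id INTEGER PRIMARY KEY,
    version_id INTEGER NOT NULL
        REFERENCES document_version(id) ON DELETE CASCADE,
    number INTEGER NOT NULL
);
""")

# One document (id=1) with versions 1 and 2, two pages per version.
for version_number in (1, 2):
    cur = conn.execute(
        "INSERT INTO document_version (document_id, number) VALUES (?, ?)",
        (1, version_number),
    )
    for page_number in (1, 2):
        conn.execute(
            "INSERT INTO page (version_id, number) VALUES (?, ?)",
            (cur.lastrowid, page_number),
        )


def count_rows(connection):
    """Return (number of versions, number of pages) currently stored."""
    versions = connection.execute("SELECT COUNT(*) FROM document_version").fetchone()[0]
    pages = connection.execute("SELECT COUNT(*) FROM page").fetchone()[0]
    return versions, pages


before = count_rows(conn)
# Delete only version number 2 of document 1; its pages go with it.
conn.execute(
    "DELETE FROM document_version WHERE document_id = ? AND number = ?", (1, 2)
)
after = count_rows(conn)

deleted = {
    "document_version": before[0] - after[0],
    "page": before[1] - after[1],
}
total_deleted = sum(deleted.values())
print(total_deleted, deleted)  # 3 {'document_version': 1, 'page': 2}
```

The other version and its two pages stay untouched, which is the behaviour you want when pruning a single unneeded version.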