mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 12:05:58 +03:00
[GH-ISSUE #250] Guess the language(s) of a document and choose the correct OCR language(s) to process it #203
Originally created by @wechsler42 on GitHub (Dec 8, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/250
Originally assigned to: @ciur on GitHub.
Is your feature request related to a problem? Please describe.
To OCR a document, I have to select the OCR language from the installed OCR languages. This is somewhat tedious if I have to scan and OCR documents in different languages.
Describe the solution you'd like
I suggest implementing an algorithm that guesses the language(s) of a document and chooses the correct OCR language. If the correct OCR language is missing, the algorithm cannot guess the right language, or the algorithm is unsure, then I would be prompted to decide what to do with the OCR scan of the document in question.
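As a rough sketch of the decision rule described above (not existing Papermerge code): given per-language probabilities from some detector and the set of installed OCR languages, either return a language or signal that the user should be prompted. The function name `choose_ocr_language`, the `min_confidence` threshold and its default of 0.9 are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def choose_ocr_language(
    probabilities: Dict[str, float],
    installed: Set[str],
    min_confidence: float = 0.9,  # hypothetical threshold, tune as needed
) -> Optional[str]:
    """Return the OCR language to use, or None if the user should be asked."""
    if not probabilities:
        return None  # the detector could not guess any language
    best_lang, best_prob = max(probabilities.items(), key=lambda item: item[1])
    if best_prob < min_confidence:
        return None  # the detector is unsure
    if best_lang not in installed:
        return None  # the matching OCR language is not installed
    return best_lang
```

A return value of `None` covers all three cases from the request (language missing, no guess, unsure guess), so the caller only needs a single "prompt the user" branch.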
@ciur commented on GitHub (Dec 8, 2020):
@wechsler42
This is definitely possible, but I need to mark it as out of scope at this stage. Features like this will make perfect sense once Papermerge has grown into a full-fledged, mature and stable DMS.
At this point Papermerge is missing "must" features, like versioning (coming in 2.0).
@ciur commented on GitHub (Dec 8, 2020):
Another argument why it is out of scope is that language detection must be performed by a 3rd party tool. It is not the task of a DMS to guess languages, just as it is not the task of a DMS to OCR; that is performed by a 3rd party tool.
If you are aware of such an open source language guessing tool, I will gladly consider it.
@wechsler42 commented on GitHub (Dec 9, 2020):
Thanks for discussing the feature suggestion. I would like to suggest three Python projects dealing with language detection:
On "Detect an Unknown Language using Python" there are different examples of language detection. However, the trade-off between time to detect and accuracy seems to differ between the projects, and it matters for guessing the correct OCR language in Papermerge.
@DocLambda commented on GitHub (Jan 8, 2021):
@wechsler42 well, your examples work well once you have text! However, to get the text you first need to OCR, which ... needs a language! ;-) So the only real solution is to OCR with all possible languages and then use e.g. langdetect to decide which one is most probable. How many languages are we talking about here?
If this is something @ciur will accept as a PR (it will be computationally heavy), I'm willing to do it.
Another alternative is to specify multiple languages to the OCR. I tested with ocrmypdf (also see my PR in papermerge-core) and it works for German and English when specifying --language deu+eng. However, sometimes results are better when giving the correct language of the document...
@ciur commented on GitHub (Jan 8, 2021):
@DocLambda is perfectly right:
I didn't realize this point at the beginning. It is called infinite recursion which results in stack overflow... :)))
Joking aside, I think this ticket can be closed.
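For reference, the approach @DocLambda describes above (OCR once per candidate language, then let a language detector pick the most plausible result) could look roughly like the sketch below. It is only an illustration under stated assumptions: `guess_ocr_language`, `run_ocr`, `detect` and the `TESSERACT_TO_ISO` table are hypothetical names, and the OCR and detection backends are passed in as callables so the sketch does not depend on any particular tool.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Tesseract uses three-letter language codes, while langdetect reports
# two-letter ISO 639-1 codes. Illustrative subset only, not a complete table.
TESSERACT_TO_ISO: Dict[str, str] = {"deu": "de", "eng": "en", "fra": "fr"}


def guess_ocr_language(
    candidates: Iterable[str],
    run_ocr: Callable[[str], str],
    detect: Callable[[str], List[Tuple[str, float]]],
) -> Optional[Tuple[str, float]]:
    """OCR once per candidate language and keep the most self-consistent one.

    run_ocr(lang) returns the text recognised with that OCR language.
    detect(text) returns (iso_code, probability) pairs for that text.
    Returns (ocr_language, probability), or None if nothing usable came back.
    """
    best: Optional[Tuple[str, float]] = None
    for lang in candidates:
        iso = TESSERACT_TO_ISO.get(lang)
        if iso is None:
            continue  # no known mapping to a detector code, skip
        text = run_ocr(lang)
        if not text.strip():
            continue  # empty OCR output cannot be scored
        # Probability that the text really is in the language we OCRed with.
        prob = dict(detect(text)).get(iso, 0.0)
        if best is None or prob > best[1]:
            best = (lang, prob)
    return best
```

In a real setup, `run_ocr` could wrap something like `pytesseract.image_to_string(image, lang=lang)`, and `detect` could wrap `langdetect.detect_langs(text)`, converting each result to a `(lang, prob)` pair. The cost grows linearly with the number of candidate languages, which is the "computationally heavy" part mentioned in the thread.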