[GH-ISSUE #250] Guess the language(s) of a document and choose the correct OCR language(s) to process it #203

Closed
opened 2026-02-25 21:31:25 +03:00 by kerem · 5 comments

Originally created by @wechsler42 on GitHub (Dec 8, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/250

Originally assigned to: @ciur on GitHub.

Is your feature request related to a problem? Please describe.
OCR currently requires me to select the language from the installed OCR languages. This is somewhat tedious when I have to scan and OCR documents in different languages.

Describe the solution you'd like
I suggest implementing an algorithm that guesses the language(s) of a document and chooses the correct OCR language. If the correct OCR language is missing, or the algorithm cannot guess the language or is unsure, I would be prompted to decide how to proceed with OCR for the document in question.
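
The decision rule described above (pick a language when confident, otherwise prompt the user) could be sketched roughly as follows. This is only an illustration under stated assumptions: `choose_ocr_language`, `Decision`, the `ISO_TO_OCR` table and the `0.8` threshold are hypothetical names and values, not part of Papermerge, and the guesses are assumed to arrive as `(ISO code, probability)` pairs from some detector.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from two-letter ISO 639-1 codes (as many detectors
# report them) to Tesseract-style three-letter OCR language codes.
ISO_TO_OCR = {"de": "deu", "en": "eng", "fr": "fra", "es": "spa"}


@dataclass
class Decision:
    ocr_lang: Optional[str]  # OCR language to use, or None if the user must decide
    ask_user: bool           # True when the user should be prompted
    reason: str              # short explanation for the prompt or the log


def choose_ocr_language(guesses, installed, min_confidence=0.8):
    """Pick an OCR language from a list of (iso_code, probability) guesses.

    `installed` is the set of OCR language codes available on the system.
    The threshold is an assumed value and would need tuning.
    """
    if not guesses:
        return Decision(None, True, "no language could be guessed")
    iso, prob = max(guesses, key=lambda g: g[1])
    if prob < min_confidence:
        return Decision(None, True, f"unsure: best guess {iso!r} at {prob:.2f}")
    ocr_lang = ISO_TO_OCR.get(iso)
    if ocr_lang is None or ocr_lang not in installed:
        return Decision(None, True, f"no installed OCR language for {iso!r}")
    return Decision(ocr_lang, False, "confident match")
```

All three fallback cases from the request (language missing, no guess, unsure guess) end in `ask_user=True`, so the caller only needs to check that one flag before starting OCR.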


@ciur commented on GitHub (Dec 8, 2020):

@wechsler42

This is definitely possible, but I need to mark it as out of scope at this stage. Features like this will make perfect sense once Papermerge grows into a full-fledged, mature and stable DMS.
At this point Papermerge is still missing "must-have" features, such as versioning (coming in 2.0).


@ciur commented on GitHub (Dec 8, 2020):

Another reason it is out of scope: language detection must be performed by a third-party tool. It is not the task of a DMS to guess languages, just as it is not the task of a DMS to perform OCR; that is also done by a third-party tool.
If you are aware of such an open source language-guessing tool, I will gladly consider it.


@wechsler42 commented on GitHub (Dec 9, 2020):

Thanks for discussing the feature suggestion. I would like to suggest three Python projects dealing with language detection:

- Python langdetect: https://pypi.org/project/langdetect/ : It seems a bit outdated, although it is (was) rather widely used
- Python spaCy: https://github.com/explosion/spaCy : It seems up to date, well maintained and documented
- Python TextBlob: https://github.com/sloria/textblob : It seems up to date, well maintained and documented

["Detect an Unknown Language using Python"](https://www.geeksforgeeks.org/detect-an-unknown-language-using-python/) offers different examples of language detection. However, the balance between detection time and accuracy seems to differ between the projects, and that balance matters for guessing the correct OCR language in the Papermerge project.

@DocLambda commented on GitHub (Jan 8, 2021):

@wechsler42 well, your examples work well **once you have text**! However to get the text you first need to OCR which ... **needs a language**! ;-) So the only real solution is to OCR with all possible languages and then use e.g. `langdetect` to decide which one is most probable. How many languages are we talking about here?
If this is something @ciur will accept as a PR (it will be computationally heavy), I'm willing to do it.
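
A rough sketch of that "OCR with every candidate, then score the result" idea might look like the following. The function name and the `ocr`, `detect_probs` and `iso_of` parameters are hypothetical placeholders and not Papermerge, Tesseract or langdetect APIs. In practice `ocr` might wrap an OCR engine call, and `detect_probs` might be built on a detector such as langdetect's `detect_langs`.

```python
def pick_language_by_trial(page, candidates, ocr, detect_probs, iso_of):
    """OCR `page` once per candidate language and return the best-scoring one.

    ocr(page, lang) -> str      hypothetical OCR wrapper
    detect_probs(text) -> dict  hypothetical detector: ISO code -> probability
    iso_of                      maps OCR codes (e.g. "deu") to detector codes (e.g. "de")
    """
    best_lang, best_score = None, 0.0
    for lang in candidates:
        text = ocr(page, lang)  # one full OCR pass per candidate language
        if not text.strip():
            continue            # nothing recognised, nothing to score
        # How strongly does the detector agree that the text is in `lang`?
        score = detect_probs(text).get(iso_of[lang], 0.0)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score
```

The cost grows linearly with the number of candidate languages, which is the "computationally heavy" part: ten installed languages mean ten OCR passes per page before the real one.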

Another alternative is to pass multiple languages to the OCR. I tested this with ocrmypdf (also see my PR in papermerge-core), and it works for German and English by specifying `--language deu+eng`. However, results are sometimes better when giving the correct language of the document...
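
For illustration, the multi-language invocation could be assembled like this. It is a minimal sketch: `ocrmypdf_command` is a hypothetical helper and the file names are placeholders. Only the `--language deu+eng` form comes from the comment above.

```python
def ocrmypdf_command(input_pdf, output_pdf, languages):
    """Build an ocrmypdf argument list with several OCR languages joined by '+'."""
    if not languages:
        raise ValueError("at least one OCR language is required")
    return ["ocrmypdf", "--language", "+".join(languages), input_pdf, output_pdf]
```

Assuming ocrmypdf and the relevant language data are installed, the resulting list could be passed to `subprocess.run(..., check=True)`, e.g. with `ocrmypdf_command("in.pdf", "out.pdf", ["deu", "eng"])`.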


@ciur commented on GitHub (Jan 8, 2021):

@DocLambda is perfectly right:

> However to get the text you first need to OCR which ... needs a language! ;-)

I didn't realize this at the beginning. It is called infinite recursion, which results in a stack overflow... :)))
Jokes aside, I think this ticket can be closed.
