mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 03:55:58 +03:00
[GH-ISSUE #583] Gujarati, Hindi and Sanskrit Language OCR not working #456
Originally created by @vikithakar on GitHub (Jan 22, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/583
Originally assigned to: @ciur on GitHub.
Description of Issue
After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading files and running OCR on them produces OCR text that is not correct. I think tesseract-ocr is consistent in the text output it gives for the file, but it seems that Papermerge does not have the fonts or character sets needed to display the OCR text in those languages.
Build Details
Dockerfile to add tesseract-ocr to papermerge
Info:
@ciur commented on GitHub (Jan 23, 2024):
Thank you for reporting the issue!
@ciur commented on GitHub (Jan 30, 2024):
@vikithakar
In order to make this work, I need to include the Gujarati, Hindi and Sanskrit codes here and here. For the second list, I need the respective language name written in the original language; for example, fra in French is "Français"; ell in Greek is "Ελληνικά".
Could you please provide the original writing of the language name for Gujarati, Hindi and Sanskrit?
guj in Gujarati is "..." ?
hin in Hindi is "..." ?
san in Sanskrit is "..." ?
@vikithakar commented on GitHub (Jan 30, 2024):
@ciur
Original Language Name
guj in Gujarati is ગુજરાતી
hin in Hindi is हिंदी
san in Sanskrit is संस्कृत
@ciur commented on GitHub (Jan 31, 2024):
@vikithakar
PR for adding above mentioned languages.
The change will be available in the 3.0.3 release.
Note that you will still need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable so that imported documents are OCRed in the "default OCR" language.
In the screenshot you uploaded to the ticket, you can see that the document was OCRed with the OCR language set to "German" (the deu code corresponds to the German language). That is why those strange characters appear.
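To illustrate the point about the default OCR language, here is a minimal Python sketch of how a language code could be resolved from the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable and checked against a list of supported codes. It is not Papermerge's actual code: the names OCR_LANGUAGE_NAMES and resolve_ocr_language are hypothetical, the "deu" fallback is only chosen to mirror the screenshot, and the "Deutsch" entry is added here for illustration. The other native names come from this thread.

```python
import os

# Tesseract language codes mapped to the language name in its own script.
# Hypothetical table for illustration; not Papermerge's real data structure.
OCR_LANGUAGE_NAMES = {
    "deu": "Deutsch",
    "fra": "Français",
    "ell": "Ελληνικά",
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def resolve_ocr_language(environ=None, fallback="deu"):
    """Return (code, native_name) for the configured default OCR language.

    `fallback` is an assumption made for this sketch: it mimics the
    situation in the ticket, where documents ended up OCRed as German.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("PAPERMERGE__OCR__DEFAULT_LANGUAGE", fallback)
    code = raw.strip().lower()
    if code not in OCR_LANGUAGE_NAMES:
        # An unknown code means the language list was never extended.
        raise ValueError(f"Unsupported OCR language code: {code!r}")
    return code, OCR_LANGUAGE_NAMES[code]


if __name__ == "__main__":
    # With the variable set, Gujarati documents get the matching OCR language.
    print(resolve_ocr_language({"PAPERMERGE__OCR__DEFAULT_LANGUAGE": "guj"}))
    # Without it, the fallback applies and Gujarati text would be OCRed as German.
    print(resolve_ocr_language({}))
```

When the variable is missing, a Gujarati, Hindi or Sanskrit scan is processed with the fallback language model, which is one way to end up with the strange characters described above.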
@ciur commented on GitHub (Jan 31, 2024):
@vikithakar
Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):