[GH-ISSUE #583] Gujarati, Hindi and Sanskrit Language OCR not working #456

Closed
opened 2026-02-25 21:31:57 +03:00 by kerem · 5 comments

Originally created by @vikithakar on GitHub (Jan 22, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/583

Originally assigned to: @ciur on GitHub.

Screenshot from 2024-01-22 18-03-00: https://github.com/ciur/papermerge/assets/1167994/bf37ee5d-f18e-4996-9e88-eb92cfd660d6

Description of Issue

After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading files and running OCR on them produces incorrect OCR text. I think tesseract-ocr itself is consistent in the text it outputs for the file, but it seems that Papermerge lacks the fonts or character sets needed to display the text in the OCR language.

Build Details

Dockerfile to add tesseract-ocr to papermerge

FROM papermerge/papermerge:3.0.2
RUN apt install tesseract-ocr-hin tesseract-ocr-guj tesseract-ocr-san -y

Info:

  • Papermerge Version 3.0.2

@ciur commented on GitHub (Jan 23, 2024):

Thank you for reporting the issue!


@ciur commented on GitHub (Jan 30, 2024):

@vikithakar

To make this work, I need to add the Gujarati, Hindi and Sanskrit codes in tasks.py (https://github.com/papermerge/papermerge-core/blob/master/papermerge/core/schemas/tasks.py#L6) and in cconstants.ts (https://github.com/papermerge/papermerge-core/blob/master/ui/src/cconstants.ts#L10). For the second list, I need each language's name written in that language; for example, fra in French is "Français" and ell in Greek is "Ελληνικά".

Could you please provide the native spelling of the language name for Gujarati, Hindi and Sanskrit?

  • guj in Gujarati is "..." ?
  • hin in Hindi is "..." ?
  • san in Sanskrit is "..." ?

@vikithakar commented on GitHub (Jan 30, 2024):

@ciur
Original Language Name

  • guj in Gujarati is ગુજરાતી
  • hin in Hindi is हिंदी
  • san in Sanskrit is संस्कृत
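
A minimal sketch of how the names above could feed a code-to-name lookup. The dictionary and function names are hypothetical and are not taken from Papermerge's source; only the three codes and their native spellings come from this thread.

```python
from typing import Dict

# Hypothetical mapping: tesseract language code -> language name in its own script.
# The identifiers are illustrative only; Papermerge's real lists may be shaped differently.
NEW_OCR_LANGUAGES: Dict[str, str] = {
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def native_name(code: str) -> str:
    """Return the native-script name for a code, or the code itself if it is unknown."""
    return NEW_OCR_LANGUAGES.get(code, code)
```

Falling back to the bare code for unknown entries keeps a UI label usable even when a language has not been added to the list yet.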

@ciur commented on GitHub (Jan 31, 2024):

@vikithakar

PR for adding the above-mentioned languages: https://github.com/papermerge/papermerge-core/pull/323

The change will be available in the 3.0.3 release.

Note that you will still need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable (https://docs.papermerge.io/3.0/settings/ocr/?h=papermerge__ocr__default_language#ocr__default_language) so that imported documents are OCRed in the default OCR language.

In the screenshot you uploaded to this ticket, you can see that the document was OCRed with the OCR language set to German (the deu code corresponds to German). That is why those strange characters appear.

lang-codes: https://github.com/ciur/papermerge/assets/24827601/e2f1861b-953c-4671-90b3-2295ce945bc3
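
As a rough sketch of how one might sanity-check this setting, the snippet below compares PAPERMERGE__OCR__DEFAULT_LANGUAGE against the languages reported by `tesseract --list-langs`. The helper names are hypothetical, and it assumes the `tesseract` binary is on PATH and prints a "List of available languages" header followed by one language code per line.

```python
import subprocess
from typing import List, Mapping, Sequence

ENV_VAR = "PAPERMERGE__OCR__DEFAULT_LANGUAGE"


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output, skipping the header."""
    lines = [line.strip() for line in output.splitlines()]
    return [
        line
        for line in lines
        if line and not line.startswith("List of available languages")
    ]


def installed_languages() -> List[str]:
    """Ask the local tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: older tesseract builds may print the list to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


def default_language_is_installed(env: Mapping[str, str], langs: Sequence[str]) -> bool:
    """True if the default OCR language is set and its language pack is available."""
    code = env.get(ENV_VAR)
    return code is not None and code in langs
```

Calling `default_language_is_installed(os.environ, installed_languages())` inside the container (after `import os`) would return False if, for example, the variable is unset or set to hin while tesseract-ocr-hin is not installed.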


@ciur commented on GitHub (Jan 31, 2024):

@vikithakar

Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):

papermerge-with-hindi-text: https://github.com/ciur/papermerge/assets/24827601/fe79438a-9223-45a4-8adf-7b88899869fc