starred/papermerge

mirror of https://github.com/ciur/papermerge.git synced 2026-04-25 03:55:58 +03:00

[GH-ISSUE #583] Gujarati, Hindi and Sanskrit Language OCR not working #456

Closed

opened 2026-02-25 21:31:57 +03:00 by kerem · 5 comments

kerem commented

2026-02-25 21:31:57 +03:00

Owner

Originally created by @vikithakar on GitHub (Jan 22, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/583

Originally assigned to: @ciur on GitHub.

Description of Issue

After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading a file and running OCR on it produces incorrect OCR text. I think tesseract-ocr itself is consistent in the text it outputs for the file, but it seems that Papermerge does not have the fonts or character sets needed to display the OCR text in these languages.

Build Details

Dockerfile to add tesseract-ocr to papermerge

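FROM papermerge/papermerge:3.0.2
# Refresh the package lists first, then install the Hindi, Gujarati and Sanskrit Tesseract models
RUN apt-get update && apt-get install -y tesseract-ocr-hin tesseract-ocr-guj tesseract-ocr-san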

Info:

Papermerge Version 3.0.2
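One way to confirm that the image built from the Dockerfile above actually contains the three language models is to inspect the output of `tesseract --list-langs`. The following is only a minimal sketch: the helper names and the sample output are illustrative assumptions, not taken from the reporter's system or from Papermerge.

```python
# Hypothetical helper: checks `tesseract --list-langs` output for the
# language packs installed in the Dockerfile above.
REQUIRED_LANGS = {"guj", "hin", "san"}


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(output: str, required=REQUIRED_LANGS) -> set[str]:
    """Return the required codes that are absent from the output."""
    return set(required) - parse_list_langs(output)


# Illustrative sample only (assumed, not captured from a real container).
SAMPLE_OUTPUT = """List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (5):
deu
eng
guj
hin
osd
"""
```

Inside the container, the real output could be obtained with `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)`; depending on the Tesseract version, the list may be written to stdout or stderr, so both are worth checking.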

Screenshot from 2024-01-22 18-03-00: https://github.com/ciur/papermerge/assets/1167994/bf37ee5d-f18e-4996-9e88-eb92cfd660d6

kerem (2026-02-25 21:31:57 +03:00) closed this issue and added the labels confirmed, medium, 3.0.2, ocr

kerem commented

2026-02-25 21:31:58 +03:00

Author

Owner

@ciur commented on GitHub (Jan 23, 2024):

Thank you for reporting the issue!

kerem commented

2026-02-25 21:31:58 +03:00

Author

Owner

@ciur commented on GitHub (Jan 30, 2024):

@vikithakar

To make this work, I need to add the Gujarati, Hindi and Sanskrit codes to two lists: one in papermerge/core/schemas/tasks.py (https://github.com/papermerge/papermerge-core/blob/master/papermerge/core/schemas/tasks.py#L6) and one in ui/src/cconstants.ts (https://github.com/papermerge/papermerge-core/blob/master/ui/src/cconstants.ts#L10). For the second list, I need each language's name written in that language; for example, fra in French is "Français" and ell in Greek is "Ελληνικά".

Could you please provide the native spelling of the language name for Gujarati, Hindi and Sanskrit?

guj in Gujarati is "..." ?
hin in Hindi is "..." ?
san in Sanskrit is "..." ?

kerem commented

2026-02-25 21:31:58 +03:00

Author

Owner

@vikithakar commented on GitHub (Jan 30, 2024):

@ciur
Original Language Name

guj in Gujarati is ગુજરાતી
hin in Hindi is हिंदी
san in Sanskrit is संस्कृत
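
The three names above could be collected into a code-to-name mapping along these lines. This is only a sketch: `NATIVE_LANGUAGE_NAMES` and `language_label` are hypothetical names chosen for illustration, and the actual lists in papermerge-core may be structured differently.

```python
# Codes and native names exactly as given in the comment above.
# The mapping name and the helper are illustrative assumptions.
NATIVE_LANGUAGE_NAMES = {
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def language_label(code: str) -> str:
    """Return a label such as 'हिंदी (hin)'; fall back to the bare code."""
    name = NATIVE_LANGUAGE_NAMES.get(code)
    return f"{name} ({code})" if name else code
```

Falling back to the bare code keeps languages that are not yet in the mapping selectable instead of raising an error.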

kerem commented

2026-02-25 21:31:58 +03:00

Author

Owner

@ciur commented on GitHub (Jan 31, 2024):

@vikithakar

PR for adding the above-mentioned languages: https://github.com/papermerge/papermerge-core/pull/323

The change will be available in the 3.0.3 release.

Note that you will still need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable (https://docs.papermerge.io/3.0/settings/ocr/?h=papermerge__ocr__default_language#ocr__default_language) so that imported documents are OCRed in that default OCR language.

In the screenshot you uploaded with the ticket, you can see that the document was OCRed with the OCR language set to German (the deu code corresponds to German). That is why you see those strange characters.

Language codes screenshot: https://github.com/ciur/papermerge/assets/24827601/e2f1861b-953c-4671-90b3-2295ce945bc3
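
To illustrate why the OCR language matters, here is a rough sketch of how a language code taken from PAPERMERGE__OCR__DEFAULT_LANGUAGE could end up in a Tesseract invocation. The function names are hypothetical, the "deu" fallback is an assumption chosen only to mirror the screenshot, and the thread does not show how Papermerge actually reads the variable internally.

```python
import os

ENV_VAR = "PAPERMERGE__OCR__DEFAULT_LANGUAGE"


def resolve_ocr_language(environ=None, fallback="deu"):
    """Pick the OCR language code from the environment.

    The "deu" fallback is an assumption made for this sketch.
    """
    env = os.environ if environ is None else environ
    value = env.get(ENV_VAR, "").strip()
    return value or fallback


def tesseract_command(image_path, lang):
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    # `-l` selects the trained language model; a model for the wrong
    # script (for example deu for Devanagari text) yields garbage output.
    return ["tesseract", image_path, "stdout", "-l", lang]
```

With the variable unset, this sketch would run the German model against a Hindi scan, which matches the symptom in the ticket; with it set to hin, the command becomes `tesseract scan.png stdout -l hin`.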

kerem commented

2026-02-25 21:31:58 +03:00

Author

Owner

@ciur commented on GitHub (Jan 31, 2024):

@vikithakar

Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3): https://github.com/ciur/papermerge/assets/24827601/fe79438a-9223-45a4-8adf-7b88899869fc