[GH-ISSUE #624] Russian and Kazakh OCR #487

Open
opened 2026-02-25 21:32:02 +03:00 by kerem · 3 comments

Originally created by @Sergey-alm on GitHub (Aug 15, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/624

Originally assigned to: @ciur on GitHub.

Hello! I have installed the Russian and Kazakh OCR languages, but Papermerge does not work with them. After processing, a gray circle is shown, and search does not find Russian or Kazakh words.

Info:

  • Papermerge Version 3.2

@bl1nkker commented on GitHub (Apr 2, 2025):

I have implemented support for the Russian and Kazakh OCR languages in my own setup, and everything works fine. In practice, you need to do a little more than what is described in the documentation (https://docs.papermerge.io/3.4/setup/add-ocr-langs/), so here is a step-by-step guide to how I did it:

  1. Create a custom OCR worker Docker image:

First, build your own OCR worker image that includes the languages you need. Create a Dockerfile based on the existing papermerge/ocrworker:0.3.1 image and install the required OCR language packages:

FROM papermerge/ocrworker:0.3.1
# Add the required languages here
RUN apt update && apt install -y tesseract-ocr-kaz tesseract-ocr-rus
  2. Verify the languages in the OCR worker:

Once the Docker image is built and the OCR worker is running, verify that the languages are installed:
docker exec -it <ocr_worker_docker_container_id> tesseract --list-langs

  3. Add language support to the Papermerge codebase:
  • Update the OCR task schema:
    In the papermerge/core/features/tasks/schema.py file, add the new language codes to the LangCode type:
LangCode = Literal[
    "ces",
    "dan",
    "deu",
    "ell",
    "eng",
    "fas",
    "fin",
    "fra",
    "guj",
    "heb",
    "hin",
    "ita",
    "jpn",
    "kor",
    "lit",
    "nld",
    "nor",
    "pol",
    "por",
    "ron",
    "san",
    "spa",
    # add additional languages here
    "kaz",
    "rus",
]
  • Update the UI constants:
    In the ui2/src/cconstants.ts file, add the required language names:
export const OCR_LANG: OCRLangType = {
    ces: "Čeština",
    dan: "Dansk",
    deu: "Deutsch",
    ell: "Ελληνικά",
    eng: "English",
    fin: "Suomi",
    fra: "Français",
    guj: "ગુજરાતી",
    heb: "עברית",
    hin: "हिंदी",
    ita: "Italiano",
    jpn: "日本語",
    kor: "한국어",
    lit: "Lietuvių",
    nld: "Nederlands",
    nor: "Norsk",
    osd: "Osd",
    pol: "Polski",
    por: "Português",
    ron: "Română",
    san: "संस्कृत",
    spa: "Español",
    // Add additional languages here
    kaz: "Қазақша",
    rus: "Русский",
};
  • Update the OCRCode type:
    In the ui2/src/types.ts and ui2/src/types/ocr.ts files, extend the OCRCode type:
export type OCRCode = 
    | "ces" | "dan" | "deu" | "ell" | "eng" | "fin" | "fra" | "guj" | "heb"
    | "hin" | "ita" | "jpn" | "kor" | "lit" | "nld" | "nor" | "osd" | "pol"
    | "por" | "ron" | "san" | "spa"
    // Add additional languages here
    | "kaz" | "rus";
  4. Build the custom Papermerge image:
docker buildx build --platform linux/amd64 -t myimage:0.0.1 -f docker/standard/Dockerfile .
  5. Run Papermerge with the custom OCR worker.
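
Since the same language codes have to appear in several places (the Tesseract install in the OCR worker, the backend LangCode list, and the UI OCRCode / OCR_LANG lists), a small cross-check can catch a code that was forgotten in one of them. The following is only a rough sketch, not part of Papermerge: the function names and the sample data are hypothetical, and it assumes that `tesseract --list-langs` prints a header line ending in ":" followed by one language code per line.

```python
from typing import Iterable


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: the header line ends with ":" and every other
    non-empty line holds exactly one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return {line for line in lines if not line.endswith(":")}


def missing_codes(required: Iterable[str], available: Iterable[str]) -> list[str]:
    """Return the required codes that are absent from `available`, sorted."""
    return sorted(set(required) - set(available))


# Hypothetical sample data. In practice, paste the real command output and
# the code lists copied from schema.py and the UI files.
sample_output = """List of available languages (4):
eng
kaz
osd
rus
"""
required = ["kaz", "rus"]
backend_codes = ["deu", "eng", "kaz", "rus"]  # excerpt of LangCode
ui_codes = ["deu", "eng", "rus"]              # excerpt of OCRCode, "kaz" forgotten

installed = parse_list_langs(sample_output)
print("missing in OCR worker:", missing_codes(required, installed))
print("missing in backend:   ", missing_codes(required, backend_codes))
print("missing in UI:        ", missing_codes(required, ui_codes))
```

With the sample data above, only the UI list is reported as missing "kaz"; the other two checks come back empty.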

@bl1nkker commented on GitHub (Apr 2, 2025):

@ciur, I just wanted to point out that while the process for adding OCR languages in Papermerge is generally straightforward (which I really appreciate), it currently requires a few extra steps that aren't mentioned in the documentation.

It would be great if the documentation could be updated to include these steps.


@ciur commented on GitHub (Apr 2, 2025):

@bl1nkker thank you for the nicely organized guide. I've added it to the documentation (https://docs.papermerge.io/3.4/setup/add-ocr-langs/).
