[GH-ISSUE #149] Search for non-latin text is case sensitive. #117

Closed
opened 2026-02-25 21:31:14 +03:00 by kerem · 2 comments
Owner

Originally created by @ciur on GitHub (Oct 3, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/149

Originally assigned to: @ciur on GitHub.

The problem was reported for the Bulgarian language and version 1.4.2. I reproduced it with Russian on the master branch.

  1. Install Tesseract's Russian language pack (sudo apt-get install tesseract-ocr-rus)
  2. Configure the Russian language so that it is available in the UI, e.g.:
OCR_DEFAULT_LANGUAGE = "rus"

OCR_LANGUAGES = {
    "deu": "Deutsch",
    "eng": "English",
    "rus": "Russian"
}
  3. Upload the attached document:
    Марк Аврелий.pdf (https://github.com/ciur/papermerge/files/5322071/default.pdf)
  4. Search for "марк" (all lowercase Cyrillic characters, without the quotes).

Expected
The uploaded document should appear in the search results, because it contains the words "Марк Аврелий".
Searches in all languages (Cyrillic and Latin) must be case-insensitive by default, i.e.
searching for "Аврелий" or "авреЛИй" must return the same results.
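
As a rough illustration of the expected behaviour (not Papermerge's actual search code), a case-insensitive comparison of Unicode text in Python could look like the sketch below. The helper name `same_ignoring_case` is hypothetical and only serves the example.

```python
import unicodedata


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings while ignoring case, for any script.

    NFC normalization is applied first so that composed and decomposed
    forms of a letter such as "й" compare equal; str.casefold() then
    performs Unicode-aware case folding.
    """
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()

    return fold(a) == fold(b)


# Both spellings from the report should be treated as the same query.
cyrillic_equal = same_ignoring_case("Аврелий", "авреЛИй")
latin_equal = same_ignoring_case("Aurelius", "aureLIus")
```

Both comparisons evaluate to `True`, which is the behaviour the search is expected to have regardless of alphabet.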

Actual
The uploaded document is found only if the query matches the case exactly, i.e. "Марк Аврелий".

Desktop:

  • OS: Ubuntu 20.04 LTS (does not matter)
  • Browser: Firefox (does not matter)
  • Papermerge Version = 1.4.2 and master
kerem 2026-02-25 21:31:14 +03:00
  • closed this issue
  • added the bug label
Author
Owner

@ciur commented on GitHub (Oct 4, 2020):

I think the "problem" is in the database. There is a known issue with SQLite (https://docs.djangoproject.com/en/3.1/ref/databases/#sqlite-string-matching): in its default setup it performs case-sensitive matching for non-ASCII Unicode strings.

I still need to confirm that the behaviour is correct with PostgreSQL.
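
The SQLite behaviour described here can be checked outside of Papermerge with Python's standard `sqlite3` module. The following is a minimal sketch, assuming a SQLite build without the ICU extension (the usual case). The `like` helper and the `ICONTAINS` SQL function are hypothetical names introduced for this example and are not part of Papermerge or Django.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def like(value: str, pattern: str) -> bool:
    """Evaluate `value LIKE pattern` with SQLite's built-in LIKE operator."""
    row = conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()
    return bool(row[0])


# The built-in LIKE folds case for ASCII letters only.
ascii_match = like("Mark", "mark")
# For Cyrillic this is typically False unless SQLite was built with ICU.
cyrillic_match = like("Марк", "марк")


def icontains(haystack, needle):
    """Unicode-aware, case-insensitive substring test done in Python."""
    if haystack is None or needle is None:
        return None
    return int(needle.casefold() in haystack.casefold())


# Possible workaround: expose the Python function to SQL.
conn.create_function("ICONTAINS", 2, icontains)
row = conn.execute("SELECT ICONTAINS(?, ?)", ("Марк Аврелий", "марк")).fetchone()
unicode_match = bool(row[0])
```

Here `ascii_match` and `unicode_match` are both true, while `cyrillic_match` depends on how SQLite was compiled. A user-defined function like this cannot use an index, so it is only an illustration of the mismatch, not a recommendation for large document sets.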

Author
Owner

@ciur commented on GitHub (Oct 4, 2020):

Confirmed. The application works as expected with a PostgreSQL database, so the problem is caused by SQLite.
I won't fix this issue, as SQLite is not meant to be used in production environments.
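
For readers hitting the same problem, switching a Django project to PostgreSQL usually comes down to the `DATABASES` setting. The fragment below is a generic Django sketch. How Papermerge itself exposes database settings is not covered in this thread, and the database name, user, password, host and port are placeholder assumptions.

```python
# Generic Django settings fragment (all values are placeholders).
DATABASES = {
    "default": {
        # Django's built-in PostgreSQL backend; requires a driver such as psycopg2.
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "papermerge",      # assumed database name
        "USER": "papermerge",      # assumed database user
        "PASSWORD": "change-me",   # placeholder, never commit real credentials
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```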
