[GH-ISSUE #588] poor OCR detection, optimizations with 'jbig2' or 'pngquant' possible? #462

Open
opened 2026-02-25 21:31:58 +03:00 by kerem · 3 comments
Owner

Originally created by @Chavell3 on GitHub (Feb 1, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/588

Hi,

I uploaded a PDF to papermerge and searched for some text.
Curiously, after the OCR finished, the search couldn't find the string "Edelberg" in the following text:

![FoxitPDFReader_02012024_213038](https://github.com/ciur/papermerge/assets/56268217/af1a6012-0f9b-4923-8513-9cf6b816b3a6)

Although this comes from an original PDF document, not a scan, I was not able to find the text above.
So it can't be an issue of poor document quality.
Would it be possible to install 'jbig2' or 'pngquant' and additionally scan with them to improve text recognition?

Thanks
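
One way to narrow down a miss like this is to test the recognized page text directly, independent of the search index. The following is a minimal sketch, assuming the OCR text of the page is already available as a plain string (how to export it from papermerge is not shown here). The function names `strip_diacritics` and `contains_text` are hypothetical. The comparison ignores case and combining marks, so stray diacritics in the OCR output do not hide an otherwise correct match.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks (accents, umlauts, ...) removed."""
    # NFKD splits a character such as "É" into "E" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_text(page_text: str, needle: str) -> bool:
    """Case- and diacritic-insensitive substring test (hypothetical helper)."""
    haystack = strip_diacritics(page_text).casefold()
    return strip_diacritics(needle).casefold() in haystack
```

If `contains_text(page_text, "Edelberg")` is true while the search still returns nothing, the problem is more likely in indexing or search than in recognition itself.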

Author
Owner

@ciur commented on GitHub (Feb 2, 2024):

I don't understand your request. I've never heard of jbig2/pngquant. Could you please elaborate on jbig2/pngquant, specifically how I can enable them and why they might improve OCR output?

Author
Owner

@Chavell3 commented on GitHub (Feb 2, 2024):

Hi,

Well, when I manually started the scan process on the worker node, I got the following messages:

worker-1  | [2024-02-01 20:58:39,350: WARNING/ForkPoolWorker-2] [tesseract] lots of diacritics - possibly poor OCR
worker-1  | [2024-02-01 21:18:38,969: WARNING/ForkPoolWorker-2] The output file size is 10.71× larger than the input file.
worker-1  | Possible reasons for this include:
worker-1  | --force-ocr was issued, causing transcoding.
worker-1  | --deskew was issued, causing transcoding.
worker-1  | The optional dependency 'jbig2' was not found, so some image optimizations could not be attempted.
worker-1  | The optional dependency 'pngquant' was not found, so some image optimizations could not be attempted.
worker-1  | Plugins were used.
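
To pull the relevant part out of a longer worker log, the missing optional dependencies can be extracted with a small regular expression. This is a rough sketch that assumes the message wording stays exactly as shown in the log above. The function name `missing_optional_dependencies` is hypothetical.

```python
import re

# Matches the wording of the warning lines shown in the log above.
MISSING_DEP = re.compile(r"The optional dependency '([^']+)' was not found")


def missing_optional_dependencies(log_text: str) -> list:
    """Return dependency names reported as missing, in order, without repeats."""
    names = []
    for name in MISSING_DEP.findall(log_text):
        if name not in names:
            names.append(name)
    return names
```

For the log above this would report `jbig2` and `pngquant`.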

The OCRmyPDF documentation mentions both tools: https://buildmedia.readthedocs.org/media/pdf/ocrmypdf/latest/ocrmypdf.pdf
It states: "Optimization is improved when a JBIG2 encoder is available and when pngquant is installed. If either of these components are missing, then some types of images cannot be optimized."
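
To check whether the two tools are visible inside the worker container, a quick look at `PATH` may help. This is only a sketch: it assumes the executables carry the same names as in the log message (`jbig2` and `pngquant`), and the function name `find_optional_tools` is hypothetical. Note that both the quoted sentence and the log message describe these tools as image optimizations. Whether installing them changes text recognition is not established in this thread.

```python
import shutil

# Assumed executable names, taken from the wording of the log message.
OPTIONAL_TOOLS = ("jbig2", "pngquant")


def find_optional_tools(names=OPTIONAL_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Running this inside the worker container shows which of the two tools, if any, the OCR process would be able to find.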

Author
Owner

@ciur commented on GitHub (Feb 3, 2024):

Thank you for the references. I will investigate whether those libraries improve OCR quality. I'm not sure what else to say for now.
