mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 03:55:58 +03:00
[GH-ISSUE #588] poor OCR detection, optimizations with 'jbig2' or 'pngquant' possible? #462
Originally created by @Chavell3 on GitHub (Feb 1, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/588
Hi,
I uploaded a PDF to papermerge and searched for some text.
Curiously, after the OCR finished, it couldn't find the string "Edelberg" in the following text:
Although that is from an original PDF document (not a scan or anything like that), I was not able to find the text above.
So it can't be an issue of poor document quality.
Would it be possible to install 'jbig2' or 'pngquant' and additionally scan with them to improve text recognition?
Thanks
@ciur commented on GitHub (Feb 2, 2024):
I don't understand your request. I've never heard of jbig2/pngquant. Could you please give more detail on jbig2/pngquant, specifically how I can enable them and why that might improve OCR output?
@Chavell3 commented on GitHub (Feb 2, 2024):
Hi,
Well, within the worker node, when I manually started the scan process, I got the following message:
The OCRmyPDF documentation mentions both tools: https://buildmedia.readthedocs.org/media/pdf/ocrmypdf/latest/ocrmypdf.pdf
Statement from the OCRmyPDF documentation: "Optimization is improved when a JBIG2 encoder is available and when pngquant is installed. If either of these components are missing, then some types of images cannot be optimized."
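For anyone who wants to check whether the worker can actually see the two tools, here is a minimal sketch using only the Python standard library. The executable names `jbig2` and `pngquant` are assumptions (the JBIG2 encoder is commonly installed under the name `jbig2`), and `find_optional_tools` is a hypothetical helper, not part of papermerge or OCRmyPDF. Note that the quoted statement talks about image optimization, which as far as I can tell concerns the size of the output file rather than text recognition, so finding these tools may not change search results.

```python
import shutil

# Assumed executable names; adjust if your distribution installs them differently.
OPTIONAL_TOOLS = ("jbig2", "pngquant")


def find_optional_tools(names=OPTIONAL_TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Running this inside the worker container (rather than on the host) matters, since the PATH seen by the OCR process is the one inside that environment.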
@ciur commented on GitHub (Feb 3, 2024):
Thank you for the references. I will investigate whether those libs improve OCR quality. Not sure what else to say for now.