starred/papermerge

mirror of https://github.com/ciur/papermerge.git synced 2026-04-25 12:05:58 +03:00

[GH-ISSUE #151] [Feature] - Templates for OCR (Zonal OCR) using KULL #115

Closed

opened 2026-02-25 21:31:14 +03:00 by kerem · 4 comments

kerem commented

2026-02-25 21:31:14 +03:00

Owner

Originally created by @swissbyte on GitHub (Oct 8, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/151

Originally assigned to: @ciur on GitHub.

I already saw your pinned post about automation. Looks great! I need exactly this kind of solution, one that automatically sorts and tags my PDFs according to their content.

Now I have a feature request:
Many of the expensive solutions offer Zonal OCR. They implement something called OCR templates. A template is nothing more than a file that defines several boxes in which OCR searches for text. One way to select such zones is this project:

https://jsoma.github.io/kull
https://github.com/jsoma/kull
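
A minimal sketch of what such a template file could look like once loaded. The JSON layout and the field names (`zones`, `name`, `page`, `x1`..`y2`) are assumptions made for illustration; they are not KULL's actual output format and not an existing Papermerge API.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Zone:
    """One named rectangle on a page (hypothetical schema)."""
    name: str
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def load_template(text: str) -> List[Zone]:
    """Parse a JSON template of the assumed form {"zones": [{...}, ...]}."""
    data = json.loads(text)
    return [Zone(**entry) for entry in data["zones"]]


# Example template with two zones; the coordinates are made up.
TEMPLATE_JSON = """
{
  "zones": [
    {"name": "unamed0", "page": 1, "x1": 50, "y1": 40, "x2": 300, "y2": 90},
    {"name": "unamed1", "page": 1, "x1": 350, "y1": 40, "x2": 550, "y2": 90}
  ]
}
"""

zones = load_template(TEMPLATE_JSON)
```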

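I have also recorded a short GIF: https://user-images.githubusercontent.com/33572050/95497841-42cd4480-09a3-11eb-88dd-76be8b08a0dc.gif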

Now the trick:
There are multiple zones defined

unamed0
unamed1 and so on.

If we could run a RegEx or any other string-comparison function on each zone individually, we would have an extremely powerful detection engine.
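
The per-zone matching idea can be sketched roughly as follows. It assumes the OCR text of each zone is already available as a plain dict keyed by zone name; the rule set and the function name `matches_template` are hypothetical.

```python
import re
from typing import Dict

# Hypothetical rules: zone name -> regular expression that must match there.
INVOICE_RULES: Dict[str, str] = {
    "unamed0": r"(?i)\binvoice\b",          # zone must contain the word "invoice"
    "unamed1": r"\b\d{2}\.\d{2}\.\d{4}\b",  # zone must contain a dd.mm.yyyy date
}


def matches_template(zone_text: Dict[str, str], rules: Dict[str, str]) -> bool:
    """Return True only if every rule finds a match in its zone's OCR text."""
    return all(
        re.search(pattern, zone_text.get(zone, "")) is not None
        for zone, pattern in rules.items()
    )
```

A document would then be classified by the first template whose rules all match; a zone with no OCR text counts as a failed match.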

AND:

If we could use some of the content of these fields as metadata, we would have one of the most powerful intelligent classification engines out there...
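
Using zone content as metadata could be sketched with named regex groups. The patterns, the metadata keys (`invoice_number`, `date`) and the function name are assumptions for illustration only.

```python
import re
from typing import Dict

# Hypothetical extraction rules: zone name -> pattern with named groups.
EXTRACT_RULES: Dict[str, str] = {
    "unamed0": r"Invoice\s+No\.?\s*(?P<invoice_number>\d+)",
    "unamed1": r"(?P<date>\d{2}\.\d{2}\.\d{4})",
}


def extract_metadata(zone_text: Dict[str, str], rules: Dict[str, str]) -> Dict[str, str]:
    """Collect every named group that matches in its zone into one dict."""
    metadata: Dict[str, str] = {}
    for zone, pattern in rules.items():
        match = re.search(pattern, zone_text.get(zone, ""))
        if match:
            # groupdict() maps each named group to the text it captured.
            metadata.update(match.groupdict())
    return metadata
```

Zones that are missing or do not match simply contribute nothing, so a partially readable scan still yields whatever fields could be recognised.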

What do you think?
I have not looked through the code, but if possible, I would like to help implement this feature.
Maybe this feature could be implemented together with Milestone 1.5.0?

Thank you very much.

kerem

2026-02-25 21:31:14 +03:00

closed this issue
added the
enhancement

feature request
labels

kerem commented

2026-02-25 21:31:14 +03:00

Author

Owner

@ciur commented on GitHub (Oct 8, 2020):

Very interesting idea! Thanks for the tip!

kerem commented

2026-02-25 21:31:15 +03:00

Author

Owner

@swissbyte commented on GitHub (Oct 9, 2020):

But be careful: it looks like KULL outputs wrong coordinates. At least y1 and y2 are switched.
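
A defensive way to handle swapped coordinates is to normalise every box before use. This is only a sketch: the `(x1, y1, x2, y2)` tuple layout is an assumption, and the `flip_y` helper covers one possible cause (a top-left versus bottom-left origin) that has not been verified for KULL.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def normalize_box(box: Box) -> Box:
    """Reorder coordinates so that x1 <= x2 and y1 <= y2."""
    x1, y1, x2, y2 = box
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def flip_y(box: Box, page_height: float) -> Box:
    """Convert between top-left-origin and bottom-left-origin coordinates."""
    x1, y1, x2, y2 = box
    # Mirroring y reverses the order of y1 and y2, so normalise afterwards.
    return normalize_box((x1, page_height - y1, x2, page_height - y2))
```
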
I think the most efficient way would be to first OCR the document, overlay the text onto the PDF, and then search for text using pdfQuery. There is also a project that can search directly using OCR, but I think this would be much more CPU-heavy.
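
Once the document has a text layer, the zone lookup itself reduces to a bounding-box filter over the extracted words. The sketch below is library-independent and assumes the words and their boxes have already been read from the PDF (for example with a tool such as pdfQuery); the `Word` type and `text_in_zone` are hypothetical names, and a top-left origin is assumed.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """A word from the text layer with its bounding box (assumed layout)."""
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def text_in_zone(words: List[Word], zone: Tuple[float, float, float, float]) -> str:
    """Join all words that lie completely inside the zone."""
    zx1, zy1, zx2, zy2 = zone
    inside = [
        w for w in words
        if zx1 <= w.x1 and zy1 <= w.y1 and w.x2 <= zx2 and w.y2 <= zy2
    ]
    # Simplified reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.y1, w.x1))
    return " ".join(w.text for w in inside)
```

Because OCR runs once per document and each template only filters existing words, trying many templates against one document stays cheap.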

kerem commented

2026-02-25 21:31:15 +03:00

Author

Owner

@svenihoney commented on GitHub (Oct 16, 2020):

Maybe ocrmypdf (https://pypi.org/project/ocrmypdf/) is an option here: it embeds the OCR result into the PDF and also improves documents on import. Maybe #169 is relevant for this, too.
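
A rough sketch of how an import step could call the ocrmypdf command-line tool. It assumes `ocrmypdf` is installed and on the PATH; the wrapper function names are hypothetical and this is not how Papermerge currently invokes OCR.

```python
import subprocess
from typing import List


def build_ocrmypdf_command(src: str, dst: str, language: str = "eng") -> List[str]:
    """Build the argument list for one ocrmypdf run."""
    # --skip-text leaves pages that already have a text layer untouched;
    # --deskew straightens crooked scans before OCR.
    return ["ocrmypdf", "--skip-text", "--deskew", "-l", language, src, dst]


def add_text_layer(src: str, dst: str, language: str = "eng") -> None:
    """Run ocrmypdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)
```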

kerem commented

2026-02-25 21:31:15 +03:00

Author

Owner

@damirabal commented on GitHub (Nov 17, 2020):

Zonal OCR would be a HUGE feature to add to this great DMS.

kerem referenced this issue

2026-02-25 21:32:13 +03:00

[PR #116] [CLOSED] Stable/1.4.x #551

kerem referenced this issue

2026-02-25 21:32:13 +03:00