[GH-ISSUE #181] A way to disable the OCR if PDFs already OCRed #179

Closed

opened 2026-02-27 15:55:30 +03:00 by kerem · 0 comments

kerem commented

2026-02-27 15:55:30 +03:00

Owner

Originally created by @comstyle on GitHub (Aug 14, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/181

All our PDFs are already OCRed. However, most of the time the Ambar crawler seems to run OCR anyway. Is there a way to specify that the crawler should not use OCR at all, or should skip it for specific folders or filename patterns?
In this case it should just extract the text that is already in the PDF file.
If not, I guess this would be a suggestion.

I just found a way to disable OCR in the pipeline using an environment variable:

ocrPdfMaxPageCount=0

So that is fine for me; however, more granular control over this would be great ;-)
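
To make the request concrete, here is a rough sketch of what per-folder or per-filename control could look like. This is not Ambar code: the function `should_run_ocr`, the list `SKIP_OCR_PATTERNS`, and the example paths are all hypothetical, and the sketch assumes `ocrPdfMaxPageCount` acts as an upper bound on the page count for which OCR runs (which would explain why setting it to 0 turns OCR off).

```python
from fnmatch import fnmatchcase

# Hypothetical glob patterns for files that are already OCRed.
# These are NOT an existing Ambar setting, only an illustration.
SKIP_OCR_PATTERNS = ["/archive/ocred/*", "*.ocr.pdf"]


def should_run_ocr(path, page_count, max_page_count, skip_patterns=SKIP_OCR_PATTERNS):
    """Decide whether OCR should run for one PDF.

    Assumption: OCR runs only when page_count <= max_page_count,
    so max_page_count=0 disables OCR for every non-empty PDF.
    """
    # Granular rule: skip OCR for any path matching a pattern.
    if any(fnmatchcase(path, pattern) for pattern in skip_patterns):
        return False
    # Global rule, modelled on ocrPdfMaxPageCount.
    return page_count <= max_page_count


if __name__ == "__main__":
    print(should_run_ocr("/archive/ocred/2018/report.pdf", 3, 5000))
    print(should_run_ocr("/scans/raw/invoice.pdf", 3, 5000))
    print(should_run_ocr("/scans/raw/invoice.pdf", 3, 0))
```

With something like this, the global switch keeps working as it does today, while already-OCRed folders can be excluded without disabling OCR for everything else.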

Btw, thanks for this amazing product.

Thx
