[GH-ISSUE #181] A way to disable the OCR if PDFs already OCRed #179

Closed
opened 2026-02-27 15:55:30 +03:00 by kerem · 0 comments

Originally created by @comstyle on GitHub (Aug 14, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/181

All our PDFs are already OCRed. However, most of the time the Ambar crawler seems to run OCR anyway. Is there a way to tell the crawler not to use OCR at all, or to skip it for specific folders or filename patterns?
In that case it should just extract the text that is already in the PDF file.
If not, I would like to suggest this as a feature.

I just found a way to disable OCR in the pipeline using an environment variable:

  • ocrPdfMaxPageCount=0

So that is fine for me. However, being able to do this with more granularity would be great ;-)
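
To make the request concrete, here is a minimal sketch of the kind of per-file decision I have in mind. It is not Ambar's actual code: the function `should_ocr`, its parameters, and the `skip_patterns` option are all hypothetical, and the page-limit check only assumes that a limit of 0 means no PDF qualifies for OCR, which matches what I observed with ocrPdfMaxPageCount=0.

```python
from fnmatch import fnmatchcase


def should_ocr(path, page_count, max_page_count, skip_patterns=()):
    """Decide whether a PDF should be sent through OCR (hypothetical helper)."""
    # Assumed semantics of the page limit: PDFs with more pages than the
    # limit are never OCRed, so a limit of 0 disables OCR for any PDF
    # that has at least one page.
    if page_count > max_page_count:
        return False
    # Hypothetical extension: skip OCR when the path matches any glob
    # pattern. Note that "*" in fnmatch also matches "/" in paths.
    return not any(fnmatchcase(path, pattern) for pattern in skip_patterns)
```

With something like this, a pattern such as `/archive/*` or `*_ocr.pdf` could exclude only the folders or files that are known to carry a text layer already, while everything else would still go through OCR.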

Btw, thanks for this amazing product.

Thx

kerem closed this issue 2026-02-27 15:55:30 +03:00