[GH-ISSUE #258] Tesseract working out only the first five pages of documents #247

Closed
opened 2026-02-27 15:55:48 +03:00 by kerem · 1 comment

Originally created by @santangelo on GitHub (Aug 21, 2019).
Original GitHub issue: https://github.com/RD17/ambar/issues/258

Hi,
first of all, thank you for the great work you did.

I have set up Ambar to crawl a number of PDFs in an Archive folder whose subfolders also contain PDFs.
It seemed to be doing great, but I have just noticed in the crawler log that Tesseract is processing only the first five pages of each document (see the attached screenshot).
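Screenshot (segnalazione github): https://user-images.githubusercontent.com/2600500/63447263-00157980-c43c-11e9-883f-833f269c4b4f.PNG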

Of course, the words from the 6th page onward are not in the index.
Is there an option I do not know of?
Thanks a lot.
Michele

kerem closed this issue 2026-02-27 15:55:48 +03:00

@santangelo commented on GitHub (Aug 22, 2019):

In case it helps others: it was enough to set the env var "ocrPdfMaxPageCount" to the desired number of pages on the pipeline container. Thanks to the Ambar help.
Michele
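
For readers wondering why the cap shows up as "only the first five pages", here is a minimal Python sketch of how a page limit read from an environment variable typically behaves. It is not Ambar's actual pipeline code: the helper names `ocr_page_limit` and `pages_to_ocr` are hypothetical, and the default of 5 is only an assumption inferred from the behaviour reported in this issue. The only name taken from the thread is the variable `ocrPdfMaxPageCount`.

```python
import os
from typing import Mapping, Optional

# Assumption: a default of 5, inferred from the report that only the first
# five pages were OCR'd while the variable was unset. The real default is
# defined in Ambar's pipeline, not here.
ASSUMED_DEFAULT_MAX_PAGES = 5


def ocr_page_limit(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the page cap from the environment, or the assumed default."""
    env = os.environ if env is None else env
    raw = env.get("ocrPdfMaxPageCount", "")
    try:
        limit = int(raw)
    except ValueError:
        # Unset or non-numeric value: fall back to the assumed default.
        return ASSUMED_DEFAULT_MAX_PAGES
    return limit if limit > 0 else ASSUMED_DEFAULT_MAX_PAGES


def pages_to_ocr(total_pages: int, env: Optional[Mapping[str, str]] = None) -> range:
    """Return the 1-based page numbers that would be sent to OCR."""
    return range(1, min(total_pages, ocr_page_limit(env)) + 1)
```

Under these assumptions, a 12-page PDF with the variable unset yields pages 1 to 5, while setting `ocrPdfMaxPageCount` to 50 on the pipeline container yields all 12 pages.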
