[GH-ISSUE #258] Tesseract processes only the first five pages of documents #247

Closed

opened 2026-02-27 15:55:48 +03:00 by kerem · 1 comment

kerem commented

2026-02-27 15:55:48 +03:00

Owner

Originally created by @santangelo on GitHub (Aug 21, 2019).
Original GitHub issue: https://github.com/RD17/ambar/issues/258

Hi,
First of all, thank you for the great work you did.

I have set up Ambar to crawl a number of PDFs in an Archive folder whose subfolders also contain PDFs.
It seemed to be doing great, but I have just noticed in the crawler log that Tesseract is processing only the first five pages of each document (attached screenshot).

As a result, the words from the 6th page onward are not in the index.
Is there an option I am not aware of?
Thanks a lot.
Michele

![segnalazione github](https://user-images.githubusercontent.com/2600500/63447263-00157980-c43c-11e9-883f-833f269c4b4f.PNG)

kerem closed this issue

2026-02-27 15:55:48 +03:00

kerem commented

2026-02-27 15:55:49 +03:00

Author

Owner

@santangelo commented on GitHub (Aug 22, 2019):

In case it helps others: it was enough to set the env var "ocrPdfMaxPageCount" to the desired number of pages on the pipeline container. Thanks to the Ambar help.
Michele
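
For anyone landing here with the same problem, the fix above might look roughly like this in a Docker Compose file. This is only a sketch, assuming the pipeline runs as a Compose service: the service name `pipeline0` and the value `50` are placeholders and do not come from this thread. Only the variable name `ocrPdfMaxPageCount` is taken from the comment above.

```yaml
# docker-compose.yml (fragment)
# Assumptions: "pipeline0" is a placeholder service name and 50 is an
# arbitrary example value; use the names and limits from your own setup.
services:
  pipeline0:
    environment:
      # Maximum number of PDF pages passed to OCR per document
      - ocrPdfMaxPageCount=50
```

After changing the file, the container has to be recreated (for example with `docker-compose up -d`) for the new value to be picked up.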

kerem referenced this issue

2026-02-27 15:56:01 +03:00

[PR #248] [MERGED] add brief section to readme about building docker images #292

kerem referenced this issue

2026-02-27 15:56:02 +03:00

[PR #270] [CLOSED] Compile FrontEnd as part of building the Docker image #296