[GH-ISSUE #251] Skip OCR if the file to import already has embedded text (= a PDF with text layer) #202

Closed
opened 2026-02-25 21:31:25 +03:00 by kerem · 1 comment

Originally created by @wechsler42 on GitHub (Dec 8, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/251

Originally assigned to: @ciur on GitHub.

Is your feature request related to a problem? Please describe.
When a PDF file is uploaded, the OCR step with Tesseract always runs, regardless of whether the file already contains embedded text (a text layer).

Describe the solution you'd like
I suggest automatically skipping OCR for PDF files that already have an embedded text layer. This would speed up the whole import process, because the time-consuming Tesseract step would be omitted for such files.
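
For illustration only, here is a minimal sketch of how such a check might look. It is not Papermerge code: it assumes Poppler's `pdftotext` command-line tool is installed, and the function names and the `min_chars` threshold are hypothetical. The extractor is passed in as a parameter so the decision logic can be exercised without a real PDF.

```python
import subprocess


def extract_text_with_pdftotext(pdf_path: str) -> str:
    """Return the embedded text of a PDF using Poppler's pdftotext CLI."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, extract=extract_text_with_pdftotext, min_chars=20):
    """Guess whether a PDF already carries a usable text layer.

    min_chars is an arbitrary, hypothetical threshold: scanned PDFs can
    contain a few stray characters without having real embedded text.
    """
    text = extract(pdf_path)
    # Ignore whitespace and form feeds; count only visible characters.
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

A caller could then branch on `has_text_layer(path)` before queuing the OCR job. Note that this only answers whether text exists; it says nothing about how the document would be displayed afterwards.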


@ciur commented on GitHub (Dec 8, 2020):

@wechsler42, thank you for opening this ticket!

I don't think this is possible, due to the limitation described below.
Let's say we skip the OCR step because the PDF file already has a text layer. I think there are third-party tools that can extract that layer of text, so let's say Papermerge uses those tools, extracts the text from the PDF with a faster process, and saves the text into the database.

The question is: how is Papermerge supposed to render the PDF file verbatim in a web browser?
When you see a PDF file like this in Papermerge:

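![Screenshot from 2020-12-08 19-37-36](https://user-images.githubusercontent.com/24827601/101526467-f1363b80-398c-11eb-8c0b-d2654f32ceca.png)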

The "rendered PDFs" are actually extracted page images plus a layer of text (provided by Tesseract), which together "simulate" a real PDF viewer.
Rendering PDFs verbatim is possible only with a third-party JavaScript library (e.g. pdf.js, which powers the built-in PDF viewer in Firefox), but the problem is that such a library does not allow cutting, pasting, or moving document pages around!

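Here is a screenshot of the standard PDF viewer in the Firefox browser:
![Screenshot from 2020-12-08 19-42-55](https://user-images.githubusercontent.com/24827601/101526958-abc63e00-398d-11eb-9e83-1c3d55b36e70.png)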

Long story short: the OCR process performed by Papermerge is more than just `tesseract -l de input.pdf`. It also extracts an hOCR file and prepares the document for rendering in the Papermerge document viewer, which has fancy functions like cut/paste/reordering of pages. Thus, skipping the OCR process is not possible.
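
To make the hOCR point concrete, here is a small sketch (not Papermerge's actual code) of what an hOCR file provides beyond plain text: every word comes with a bounding box in page-image pixels, which is what allows text to be layered over an extracted page image. The sample markup is hand-written in the hOCR style, and `HocrWordParser` and `parse_hocr_words` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Hand-written hOCR-style fragment, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'>world</span>
</span></div>"""


class HocrWordParser(HTMLParser):
    """Collect (word, (x0, y0, x1, y1)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._chunks = []   # text pieces seen inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr: str):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling `parse_hocr_words(SAMPLE)` yields each word with its pixel coordinates. Text pulled from an existing PDF text layer by a plain extractor would not come with coordinates matched to Papermerge's page images, which is the gap described above.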
