[GH-ISSUE #251] Skip OCR if the file to import already has embedded text (= a PDF with text layer) #202


Closed

opened 2026-02-25 21:31:25 +03:00 by kerem · 1 comment

kerem commented

2026-02-25 21:31:25 +03:00

Owner

Originally created by @wechsler42 on GitHub (Dec 8, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/251

Originally assigned to: @ciur on GitHub.

Is your feature request related to a problem? Please describe.
If a PDF file is uploaded, the OCR step with tesseract is always performed, regardless of whether the file already has embedded text (a text layer).

Describe the solution you'd like
I suggest automatically skipping OCR for PDF files that already have an embedded text layer. This would speed up the whole import process, because the time-consuming tesseract step would be omitted for such PDF files.
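
To make the request concrete, here is a minimal sketch of how a text layer could be detected before deciding whether to run OCR. It is not Papermerge code: the function names and the `min_chars_per_page` threshold are hypothetical, and the extraction step assumes poppler's `pdftotext` command-line tool is installed. It covers detection only and says nothing about how the document is rendered afterwards.

```python
import subprocess


def extract_page_texts(pdf_path):
    """Return the embedded text of each page, using poppler's pdftotext CLI.

    Assumption: pdftotext is on PATH. Writing to "-" sends the text to
    stdout, with pages separated by form feed characters.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    pages = result.stdout.split("\f")
    # Drop a trailing empty piece left by a final form feed, if present.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def has_text_layer(page_texts, min_chars_per_page=20):
    """Return True only if every page has enough non-whitespace characters.

    The threshold is an arbitrary illustrative value; a scanned page with
    no text layer normally yields an empty string.
    """
    if not page_texts:
        return False
    return all(
        sum(1 for ch in text if not ch.isspace()) >= min_chars_per_page
        for text in page_texts
    )
```

Requiring text on every page is a deliberately conservative choice: a PDF that mixes text pages with scanned pages would still be sent to OCR.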


kerem

2026-02-25 21:31:25 +03:00

closed this issue
added the enhancement, feature request, and discussion labels

kerem commented

2026-02-25 21:31:26 +03:00

Author

Owner

@ciur commented on GitHub (Dec 8, 2020):

@wechsler42, thank you for opening this ticket!

I don't think this is possible, due to the limitation described below.
Let's say we skip the OCR step because the PDF file already has a text layer. I think there are 3rd party tools to extract that layer of text, so let's say that Papermerge uses those tools, extracts the text from the PDF using a faster process, and saves the text into the database.

The question is: how is Papermerge supposed to render the verbatim PDF file in the web browser?
When you see a PDF file like this in Papermerge:

![Screenshot from 2020-12-08 19-37-36](https://user-images.githubusercontent.com/24827601/101526467-f1363b80-398c-11eb-8c0b-d2654f32ceca.png)

the "rendered PDFs" are actually extracted images plus layered text (provided by tesseract), which together "simulate" a real PDF viewer.
Rendering verbatim PDFs is possible only with a 3rd party JavaScript library (e.g. pdf.js, which Firefox uses for its built-in viewer), but the problem is that such a library does not allow cutting, pasting, or moving the document pages around!

Here is a screenshot of the standard PDF viewer in the Firefox browser:

![Screenshot from 2020-12-08 19-42-55](https://user-images.githubusercontent.com/24827601/101526958-abc63e00-398d-11eb-9e83-1c3d55b36e70.png)

Long story short: the OCR process performed by Papermerge is more than just tesseract -l de input.pdf. It also extracts an hOCR file and prepares the document for rendering in the Papermerge document viewer, which has fancy functions like cut/paste/reordering of pages. Thus, skipping the OCR process is not possible.
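
As an illustration of why the hOCR output matters for the viewer, the sketch below reads word positions out of an hOCR document, which is the kind of data needed to lay text over a page image. It is not Papermerge's implementation: the class and function names are hypothetical, and it assumes the common hOCR convention of `ocrx_word` spans carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute.

```python
import re
from html.parser import HTMLParser

# Matches the bounding box stored in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._chunks = []   # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def words_from_hocr(hocr):
    """Return the words and pixel bounding boxes found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

A viewer can position each word at its bounding box over the page image, which is what makes text selectable on what is otherwise a picture. Text pulled from an existing PDF text layer would need equivalent position data to be usable in the same way.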


kerem referenced this issue

2026-02-25 21:32:15 +03:00

[PR #216] [MERGED] Fix automates ACL and other #564