[GH-ISSUE #78] Download document with OCRed layer of text. #60

Closed
opened 2026-02-25 21:31:07 +03:00 by kerem · 8 comments

Originally created by @ciur on GitHub (Aug 19, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/78

Originally assigned to: @ciur on GitHub.

One of the [features suggested in this discussion](https://www.reddit.com/r/selfhosted/comments/ibwuj2/papermerge_14_out/) is the ability to download a document with the OCRed layer of text.

When I think carefully about my workflow, I have needed to download documents many times, and even though the OCR text layer was available in the document viewer, the downloaded PDF version did not contain that text.

I think it makes sense to turn the download button into a dropdown with 2 versions:

  • download original
  • download original with text over (or embedded text)

Here is the [comment where a user mentioned this](https://www.reddit.com/r/selfhosted/comments/ibwuj2/papermerge_14_out/g1yqlsy?utm_source=share&utm_medium=web2x).

@ciur commented on GitHub (Aug 19, 2020):

Great feature request, Eugen!


@ciur commented on GitHub (Aug 19, 2020):

Sure, Eugen, you are welcome! 👍
I love speaking to myself... 😒


@Quotic commented on GitHub (Aug 19, 2020):

Eugen, I agree with Eugen. That will be great.


@ciur commented on GitHub (Oct 14, 2020):

[Feedback from reddit](https://www.reddit.com/r/selfhosted/comments/jb3jcl/papermerge_15_is_out_finally_with_tags/g8t5n5u?utm_source=share&utm_medium=web2x&context=3):

> Hi there. If the feature doesn't currently exist to 'embed' the OCR/plain-text into the document itself, please consider it as a feature request.
>
> Acrobat Pro and ocrmypdf both do this, but many tools do not, and instead just put the plain text data in a database file rather than inline within the document. This means if you ever take your pdf somewhere else, you have to OCR it again.

@ebertland commented on GitHub (Oct 21, 2020):

I would like this to work as part of a backup to cloud strategy as well. When I copy the PDF files to a cloud service, I would like them to be searchable there as well. (My current workaround is to run the scans through ocrmypdf before sending them to papermerge.)

When you implement this feature, please consider providing the same capability through a REST API also, so that I can make a custom backup utility. Thanks!


@CorneliousJD commented on GitHub (Oct 23, 2020):

I would also like this to be implemented.

Right now, in order to make this happen, I'm also running PDFs through OCRmyPDF-Auto (docker container) to automatically de-skew the pages and embed OCR directly into the PDF before importing into Papermerge. That way the raw PDF file on the filesystem retains embedded OCR, AND it's highlightable within the web UI of Papermerge as well. Best of both worlds.

Not needing to do this step first would make the process even faster for users like me and @ebertland.

If it's a toggle-able option within Papermerge that would be even better, as some people may not need it, but I think defaulting to ON would be the best.

May even be a good idea to have a button to go back and re-encode the PDFs that already exist within Papermerge so they get OCR data embedded directly into the PDF, that way our existing libraries are already covered as well.

Until then though I will continue to first run PDFs through OCRmyPDF-auto before ingesting into Papermerge.

Thanks for the consideration!
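
The pre-processing workaround described in this comment could be sketched roughly as follows. This is a minimal illustration, not the OCRmyPDF-Auto container itself: it assumes the `ocrmypdf` command-line tool is installed and on `PATH`, and the function names and folder layout are hypothetical. `--deskew` and `--skip-text` are existing `ocrmypdf` options.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def build_ocrmypdf_command(src: Path, dst: Path, deskew: bool = True) -> list[str]:
    """Build an ocrmypdf command line that writes a searchable copy of src to dst."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")  # straighten crooked scans before OCR
    cmd.append("--skip-text")  # leave pages that already contain text untouched
    cmd += [str(src), str(dst)]
    return cmd


def preprocess_folder(
    inbox: Path,
    outbox: Path,
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> list[Path]:
    """Run ocrmypdf on every PDF in inbox; return the paths written to outbox.

    `run` is injectable so the loop can be exercised without ocrmypdf installed.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        run(build_ocrmypdf_command(src, dst))
        written.append(dst)
    return written
```

Calling `preprocess_folder(Path("scans"), Path("papermerge-import"))` (both directory names are placeholders) would leave OCR-embedded copies in the second folder, ready to be imported into Papermerge.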


@ciur commented on GitHub (Jun 21, 2021):

The feature is implemented and is now part of the develop branch. It will be released as part of version 2.1 (~ December 2021).
With this feature in place, when a user uploads a new document, two document versions can be downloaded once the OCR process is complete: document version 0 (v0), the original (i.e. exactly the same as uploaded), and version 1 (v1), the document with a text overlay, i.e. a fully searchable PDF document.
The text overlay is created using [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF).
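
To make the v0/v1 distinction concrete, here is a small illustrative model. It is only a sketch of the idea described in this comment, not Papermerge's actual data model: the class and function names are hypothetical, and it assumes the uploaded original carries no text layer of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentVersion:
    number: int           # 0 = original upload, 1 = produced after OCR
    has_text_layer: bool  # True when the OCRed text is embedded in the PDF


def versions_after_ocr() -> list[DocumentVersion]:
    """Versions available once OCR has finished (assumes a scan without text)."""
    return [
        DocumentVersion(number=0, has_text_layer=False),  # exactly as uploaded
        DocumentVersion(number=1, has_text_layer=True),   # searchable PDF
    ]


def pick_download(versions: list[DocumentVersion], with_text_layer: bool) -> DocumentVersion:
    """Return the first version matching the requested download flavour."""
    for version in versions:
        if version.has_text_layer == with_text_layer:
            return version
    raise LookupError("no matching document version")
```

A download dropdown with the two entries proposed at the top of this issue would then map to `pick_download(versions, with_text_layer=False)` and `pick_download(versions, with_text_layer=True)`.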


@ciur commented on GitHub (Jul 31, 2022):

The text overlay feature is implemented and available in [2.1.0a35](https://github.com/papermerge/papermerge-core/pkgs/container/papermerge/30997992?tag=2.1.0a35).

Basically, after a document is uploaded and OCRed, you can now download the document with the text layer!
Besides the document with the text overlay, the original version is available to download as well.
Under the hood, Papermerge uses [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/) to produce a searchable PDF.

Here is a quick demo:
![text-overlay-demo](https://user-images.githubusercontent.com/24827601/182030285-cd4cf63b-4fe6-4d35-a6da-7d6593a5f422.gif)

[And documentation about the text layer](https://docs.papermerge.io/User%27s%20Manual/ocr.html#ocred-text-layer)