[GH-ISSUE #603] Feature Request: OCR support for digitally signed documents. #477

Open
opened 2026-02-25 21:32:00 +03:00 by kerem · 3 comments

Originally created by @ShakataGaNai on GitHub (Mar 1, 2024).
Original GitHub issue: https://github.com/ciur/papermerge/issues/603

Originally assigned to: @ciur on GitHub.

I'm running v3.1 in Docker containers for testing (per https://docs.papermerge.io/3.1/setup/docker-compose/ ). When you upload a digitally signed document and attempt to OCR it, the process fails silently. The worker logs, however, show a sensible error message:

[2024-03-02 00:20:56,933: ERROR/ForkPoolWorker-8] Task papermerge.core.tasks.ocr_document_task[77be2d59-9703-42df-a3cc-bf920a61eab4] raised unexpected: DigitalSignatureError()
Traceback (most recent call last):
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/core_app/.venv/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "/core_app/papermerge/core/tasks.py", line 79, in ocr_document_task
    ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 86, in ocr_document
    _ocr_document(
  File "/core_app/papermerge/core/ocr/document.py", line 54, in _ocr_document
    ocrmypdf.ocr(
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/api.py", line 337, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_sync.py", line 388, in run_pipeline
    validate_pdfinfo_options(context)
  File "/core_app/.venv/lib/python3.10/site-packages/ocrmypdf/_pipeline.py", line 204, in validate_pdfinfo_options
    raise DigitalSignatureError()
ocrmypdf.exceptions.DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document,
invalidating the signature.

I can't find any mention of this anywhere, but supporting OCR for digitally signed documents would be nice. Perhaps the version dropdown can indicate something like "Version X w/ OCRed and w/o digital signature". Honestly, I don't even care about accessing a version of the document with OCR'd text, so long as the text is there for full text search. Especially when dealing with a multiplicity of signed legal documents.
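To make the request concrete, here is a minimal sketch of the fallback described above: try OCR normally, and only if the signature check fails, retry while explicitly allowing the signature to be dropped from the OCR copy. It is not Papermerge's actual code. The helper name `ocr_keeping_signed_original` and its return value are hypothetical, and it assumes the installed OCRmyPDF release accepts an `invalidate_digital_signatures` keyword (I believe recent 14.x releases and later do, but I have not checked the version bundled with Papermerge 3.1). The OCR callable and exception type are passed in so the sketch runs without OCRmyPDF installed.

```python
from typing import Any, Callable, Type


def ocr_keeping_signed_original(
    ocr: Callable[..., Any],
    signature_error: Type[Exception],
    input_file: str,
    output_file: str,
    **ocr_kwargs: Any,
) -> bool:
    """Run OCR; if the input is digitally signed, retry with the signature dropped.

    Returns True when the output was produced by invalidating a signature,
    so the caller can label that version (e.g. "OCRed, w/o digital signature").
    The input file itself is never modified; only output_file is written.
    """
    try:
        ocr(input_file, output_file, **ocr_kwargs)
        return False
    except signature_error:
        # Assumption: the OCR function understands this keyword
        # (OCRmyPDF's `invalidate_digital_signatures` option).
        retry_kwargs = {**ocr_kwargs, "invalidate_digital_signatures": True}
        ocr(input_file, output_file, **retry_kwargs)
        return True


# Hypothetical usage with OCRmyPDF (not executed here):
#   import ocrmypdf
#   from ocrmypdf.exceptions import DigitalSignatureError
#   lost_signature = ocr_keeping_signed_original(
#       ocrmypdf.ocr, DigitalSignatureError, "signed.pdf", "signed-ocr.pdf"
#   )
```

Because the signed original is left untouched and the OCR output is a separate file, the signed version could remain the one users download while the OCR copy only feeds full-text search.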


@ciur commented on GitHub (Mar 3, 2024):

Thank you for opening this ticket.

Would you mind uploading a digitally signed document that I can experiment with? Of course, I mean a document without sensitive information. A one-page, digitally signed document with a couple of words would do the job just fine.

This will help me understand your request better and, of course, validate the feature while developing it.


@ShakataGaNai commented on GitHub (Mar 3, 2024):

Attaching three files. One is a digital document pushed straight through DocuSign. One is the same document printed, scanned, and then sent through DocuSign. The third is the same printed and scanned document signed with Adobe Acrobat (the one I'm least confident will work, because Adobe...)

[Lipsum scan - adobe signed.pdf](https://github.com/ciur/papermerge/files/14474943/Lipsum.scan.-.adobe.signed.pdf)
[Lipsum scan - docusign.pdf](https://github.com/ciur/papermerge/files/14474944/Lipsum.scan.-.docusign.pdf)
[lipsum - docusign.pdf](https://github.com/ciur/papermerge/files/14474945/lipsum.-.docusign.pdf)

@bluekitedreamer commented on GitHub (Apr 23, 2024):

This is possibly a simple issue to fix; see another recently filed issue that suggests a solution (https://github.com/ciur/papermerge/issues/614#issue-2255198217).
