[GH-ISSUE #127] French OCR fails when running after "Importer" directory #96

Closed
opened 2026-02-25 21:31:11 +03:00 by kerem · 5 comments

Originally created by @gaalcaras on GitHub (Sep 17, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/127

Hi there,

I've been playing around with papermerge lately, great work!

Unfortunately, I can't seem to run French OCR when using the "Importer" directory, although it works fine when uploading the file directly.

I'm using the linuxserver Docker image of papermerge.

Here are the relevant lines from papermerge.conf.py:

OCR_DEFAULT_LANGUAGE = "fra"

OCR_LANGUAGES = {
    "fra": "French",
    "eng": "English",
}

When I upload a file directly to the inbox, everything works fine. Here are the first lines of the log when grepping tesseract:

papermerge         | 2020-09-17T16:37:38.627164022Z [2020-09-17 16:37:38,626: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1
papermerge         | 2020-09-17T16:37:40.044259676Z [2020-09-17 16:37:40,043: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/125/page-1|hocr
papermerge         | 2020-09-17T16:37:41.431137020Z [2020-09-17 16:37:41,431: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/100/page-1|hocr
papermerge         | 2020-09-17T16:37:42.874047624Z [2020-09-17 16:37:42,873: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/75/page-1|hocr
papermerge         | 2020-09-17T16:37:43.865581275Z [2020-09-17 16:37:43,865: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/50/page-1|hocr
papermerge         | 2020-09-17T16:37:44.969770457Z [2020-09-17 16:37:44,969: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_2/100/page-2.jpg|/data/media/results/user_1/document_17/pages/page_2

But when I move a file to the Importer directory, this happens (complete log this time):

papermerge         | 2020-09-17T16:47:18.601873673Z [2020-09-17 16:47:18,601: INFO/ForkPoolWorker-2] Importing file /importer/e-RESA.pdf...
papermerge         | 2020-09-17T16:47:18.603650417Z [2020-09-17 16:47:18,603: INFO/ForkPoolWorker-2] Same as temp_file_name=/tmp/tmp5vrjfa8p/e-RESA.pdf...
papermerge         | 2020-09-17T16:47:18.607349962Z [2020-09-17 16:47:18,607: DEBUG/ForkPoolWorker-2] Importing file /tmp/tmp5vrjfa8p/e-RESA.pdf.
papermerge         | 2020-09-17T16:47:18.659031911Z [2020-09-17 16:47:18,658: DEBUG/ForkPoolWorker-2] Post save doc => normalize_pages
papermerge         | 2020-09-17T16:47:18.659268128Z [2020-09-17 16:47:18,659: DEBUG/ForkPoolWorker-2] Normalizing document 18
papermerge         | 2020-09-17T16:47:18.666408263Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] Uploading file /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.666610584Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] copy_doc: /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.667201448Z [2020-09-17 16:47:18,667: DEBUG/ForkPoolWorker-2] Document 18 has 1 pages
papermerge         | 2020-09-17T16:47:18.672169866Z [2020-09-17 16:47:18,671: DEBUG/ForkPoolWorker-2]  ocr_page user_id=1 doc_id=18 page_num=1
papermerge         | 2020-09-17T16:47:18.672289228Z [2020-09-17 16:47:18,672: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.677201972Z [2020-09-17 16:47:18,676: DEBUG/ForkPoolWorker-2] Mime Type = Mime(/data/media/docs/user_1/document_18/e-RESA.pdf, application/pdf)
papermerge         | 2020-09-17T16:47:18.677314130Z [2020-09-17 16:47:18,677: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.681960690Z [2020-09-17 16:47:18,681: DEBUG/ForkPoolWorker-2] OCR PDF document
papermerge         | 2020-09-17T16:47:18.690268186Z [2020-09-17 16:47:18,689: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/125/page-1.jpg
papermerge         | 2020-09-17T16:47:18.690354276Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/125 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.690635824Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1550|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/125/page
papermerge         | 2020-09-17T16:47:18.742333188Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/100/page-1.jpg
papermerge         | 2020-09-17T16:47:18.742464233Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/100 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.742940080Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1240|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/100/page
papermerge         | 2020-09-17T16:47:18.784122351Z [2020-09-17 16:47:18,783: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/75/page-1.jpg
papermerge         | 2020-09-17T16:47:18.784215354Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/75 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.784377875Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|930|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/75/page
papermerge         | 2020-09-17T16:47:18.816375434Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/50/page-1.jpg
papermerge         | 2020-09-17T16:47:18.816455236Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/50 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.816635157Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|620|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/50/page
papermerge         | 2020-09-17T16:47:18.841902783Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/10/page-1.jpg
papermerge         | 2020-09-17T16:47:18.841992456Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/10 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.842162119Z [2020-09-17 16:47:18,842: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|124|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/10/page
papermerge         | 2020-09-17T16:47:18.861551288Z [2020-09-17 16:47:18,861: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1
papermerge         | 2020-09-17T16:47:18.869939701Z [2020-09-17 16:47:18,869: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.869955678Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.869959187Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.869961665Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.869964273Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.869966753Z
papermerge         | 2020-09-17T16:47:18.870152882Z [2020-09-17 16:47:18,870: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/125/page-1|hocr
papermerge         | 2020-09-17T16:47:18.878382858Z [2020-09-17 16:47:18,878: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.878400899Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.878405734Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.878409389Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.878412582Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.878416071Z
papermerge         | 2020-09-17T16:47:18.878464860Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/125/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.878761200Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/125/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.878786629Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/100/page-1|hocr
papermerge         | 2020-09-17T16:47:18.886793257Z [2020-09-17 16:47:18,886: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.886810945Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.886814693Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.886817268Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.886819822Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.886822245Z
papermerge         | 2020-09-17T16:47:18.886874136Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/100/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.886954935Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/100/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.887103486Z [2020-09-17 16:47:18,887: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/75/page-1|hocr
papermerge         | 2020-09-17T16:47:18.895445974Z [2020-09-17 16:47:18,895: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.895466845Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.895473193Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.895477902Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.895481946Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.895485920Z
papermerge         | 2020-09-17T16:47:18.895522453Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/75/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.895572202Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/75/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.895736885Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/50/page-1|hocr
papermerge         | 2020-09-17T16:47:18.903811183Z [2020-09-17 16:47:18,903: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.903828233Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.903832541Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.903835551Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.903838084Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.903840557Z
papermerge         | 2020-09-17T16:47:18.903922423Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/50/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.904556111Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/50/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.904570069Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2]  user_id=1 doc_id=18 page_num=1 page_type=pdf total_exec_time=0.23
papermerge         | 2020-09-17T16:47:18.904586242Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] Page hocr ready: document_id=18 page_num=1
papermerge         | 2020-09-17T16:47:18.904591287Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] apply_automates: Begin.
papermerge         | 2020-09-17T16:47:18.916041990Z [2020-09-17 16:47:18,915: ERROR/ForkPoolWorker-2] Task papermerge.core.management.commands.worker.import_from_local_folder[4800dff1-ea3b-4730-9c05-0c24fc23ff10] raised unexpected: FileNotFoundError(2, 'No such file or directory')
papermerge         | 2020-09-17T16:47:18.916059667Z Traceback (most recent call last):
papermerge         | 2020-09-17T16:47:18.916064337Z   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 385, in trace_task
papermerge         | 2020-09-17T16:47:18.916068437Z     R = retval = fun(*args, **kwargs)
papermerge         | 2020-09-17T16:47:18.916071555Z   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 650, in __protected_call__
papermerge         | 2020-09-17T16:47:18.916075118Z     return self.run(*args, **kwargs)
papermerge         | 2020-09-17T16:47:18.916078443Z   File "/app/papermerge/papermerge/core/management/commands/worker.py", line 53, in import_from_local_folder
papermerge         | 2020-09-17T16:47:18.916082225Z     import_documents(settings.PAPERMERGE_IMPORTER_DIR)
papermerge         | 2020-09-17T16:47:18.916085667Z   File "/app/papermerge/papermerge/core/importers/local.py", line 45, in import_documents
papermerge         | 2020-09-17T16:47:18.916089370Z     imp.import_file()
papermerge         | 2020-09-17T16:47:18.916093094Z   File "/app/papermerge/papermerge/core/document_importer.py", line 106, in import_file
papermerge         | 2020-09-17T16:47:18.916097201Z     DocumentImporter.ocr_document(
papermerge         | 2020-09-17T16:47:18.916100930Z   File "/app/papermerge/papermerge/core/document_importer.py", line 156, in ocr_document
papermerge         | 2020-09-17T16:47:18.916104262Z     signals.page_ocr.send(
papermerge         | 2020-09-17T16:47:18.916107280Z   File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 173, in send
papermerge         | 2020-09-17T16:47:18.916111292Z     return [
papermerge         | 2020-09-17T16:47:18.916114621Z   File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 174, in <listcomp>
papermerge         | 2020-09-17T16:47:18.916118202Z     (receiver, receiver(signal=self, sender=sender, **named))
papermerge         | 2020-09-17T16:47:18.916121347Z   File "/app/papermerge/papermerge/core/signals.py", line 35, in apply_automates_handler
papermerge         | 2020-09-17T16:47:18.916124812Z     apply_automates(
papermerge         | 2020-09-17T16:47:18.916127941Z   File "/app/papermerge/papermerge/core/automate.py", line 45, in apply_automates
papermerge         | 2020-09-17T16:47:18.916131187Z     with open(text_path, "r") as f:
papermerge         | 2020-09-17T16:47:18.916134714Z FileNotFoundError: [Errno 2] No such file or directory: '/data/media/results/user_1/document_18/pages/page_1.txt'

Obviously, this does not work because the correct code for French is fra, not fre. But I can't figure out why fre is used instead of fra only when I go through the Importer directory rather than a direct upload. I have double-checked the config files, and they use the correct code.
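One way to catch this kind of mismatch early is to compare the configured codes against what Tesseract actually has installed. The sketch below is not part of Papermerge; `parse_list_langs` and `missing_languages` are hypothetical helper names, and the sample text assumes the usual layout of `tesseract --list-langs` output (one header line followed by one code per line).

```python
def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is a header such as "List of available languages (3):".
    return set(lines[1:])


def missing_languages(configured: dict[str, str], installed: set[str]) -> list[str]:
    """Return configured codes that have no installed traineddata, sorted."""
    return sorted(code for code in configured if code not in installed)


# Sample output, assumed for illustration; run `tesseract --list-langs` for real values.
sample = """List of available languages (3):
eng
fra
osd
"""

installed = parse_list_langs(sample)
print(missing_languages({"fra": "French", "eng": "English"}, installed))  # []
print(missing_languages({"fre": "French", "eng": "English"}, installed))  # ['fre']
```

With the real output of the command inside the container, an empty list would mean every configured code has a matching traineddata file.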

Any idea about how we could fix this?


@ciur commented on GitHub (Sep 17, 2020):

@gaalcaras
The difference between the two scenarios (manual upload and importer run) is that in the importer case, Papermerge takes its settings from the superuser's "default OCR language", which is read from the database. For a manual upload, the language code is read from the configuration file. This means that the value "fre" comes from the database. It might be that a previous typo was propagated to the database and is stored there.

I assume you are the only user = admin = superuser.
With your superuser/admin account, try this:

  1. Upper-right menu -> Preferences:

     ![01-preferences](https://user-images.githubusercontent.com/24827601/93511792-9fe55580-f923-11ea-9fa4-316be687404a.png)

  2. Choose the French language and save:

     ![01-change-lang](https://user-images.githubusercontent.com/24827601/93511866-b9869d00-f923-11ea-9928-37896f200878.png)

Saving as described above should overwrite the typo stored in your database (fre) with the correct value (fra).

Just in case, you [applied your configurations in both worker and main app containers](https://papermerge.readthedocs.io/en/latest/setup/docker.html#main-app-worker-or-both), right?


@gaalcaras commented on GitHub (Sep 17, 2020):

Indeed you're right, the problem was with the database. However, saving the OCR language in the settings menu did not work. It was already set to French anyway. I changed it to English, then back to French, to no effect. But I achieved the desired result by modifying the database directly. Thanks for your input!

> Just in case, you applied your configurations in both worker and main app containers, right ?

AFAIK, the [Linuxserver image](https://github.com/linuxserver/docker-papermerge) does not run separate containers. I assumed it runs more like the "bare metal" approach described in the docs, thus reading from the same configuration files. I could be wrong though.


@ciur commented on GitHub (Sep 18, 2020):

Hi @gaalcaras,

great that you figured it out!

I have just checked the linuxserver image 😮 ...
Those guys from Linuxserver did amazing work! 🌟
First of all, indeed, they managed to wrap everything in one single Docker image!
They use a different configuration (sqlite3 instead of postgresql and uwsgi instead of apache mod_wsgi).
And yes, they followed the "bare metal" approach, but again, as I mentioned, they managed to wrap the worker and main app in a single Docker image 🎉


@gaalcaras commented on GitHub (Sep 18, 2020):

I agree, Linuxserver is an amazing project, it has made self-hosting that much more enjoyable for me. I'm glad you appreciate what they did with Papermerge :)


@guim31 commented on GitHub (Oct 22, 2020):

Hi @gaalcaras !
Could you tell me how you managed to change the language in the database directly?
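The thread does not record which table or column was edited, so the following is only a hedged starting point, not the method @gaalcaras used. It assumes the SQLite backend mentioned above for the Linuxserver image and makes no assumption about Papermerge's schema: it scans every column of every table for a cell equal to the stale code. The function name `find_text_value` is hypothetical, and the database file path is something you have to look up in your own setup.

```python
import sqlite3


def find_text_value(db_path, needle):
    """Return (table, column, rowid) for every cell whose value equals `needle`.

    The database is opened read-only, so this search cannot modify anything.
    """
    hits = []
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
            for column in columns:
                query = f'SELECT rowid FROM "{table}" WHERE "{column}" = ?'
                try:
                    rows = conn.execute(query, (needle,)).fetchall()
                except sqlite3.OperationalError:
                    continue  # e.g. tables declared WITHOUT ROWID
                hits.extend((table, column, rowid) for (rowid,) in rows)
    finally:
        conn.close()
    return hits
```

Running `find_text_value("<path to your sqlite file>", "fre")` lists candidate locations. Review the hits before changing anything, since an unrelated column could hold the same string, and back up the database file before issuing any `UPDATE`.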
