[GH-ISSUE #99] OCR Working, But Automates Not; OCR not recognizing spaces in text; & Django Setting 'DEBUG = False' Breaks Login Page in Chrome v84 #78

Closed
opened 2026-02-25 21:31:09 +03:00 by kerem · 6 comments

Originally created by @dohlin on GitHub (Aug 24, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/99

I have the following Automate set up:

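![image](https://user-images.githubusercontent.com/5067574/91090515-d5e03400-e61a-11ea-881b-82c270ecff0e.png)
![image](https://user-images.githubusercontent.com/5067574/91090485-c52fbe00-e61a-11ea-8c83-2d6d3431f61a.png)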

And I know OCR seems to be working as I can search the inbox for the term the automate is looking for and it pulls a result:

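![image](https://user-images.githubusercontent.com/5067574/91090732-23f53780-e61b-11ea-8a93-c6d196af2a78.jpg)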

But the document stays in the inbox, and won't move to the Utilities folder (I've waited a good 15+ minutes). Am I doing something wrong?


OCR also appears not to recognize spaces in the text: if I open a document in the inbox, highlight a bunch of text, press Ctrl+C to copy, and then paste it into a text editor (e.g. Word), the words are all there but there are no spaces between them. Is there a setting to fix this?


Also, setting 'DEBUG = False' in settings.py breaks the login page in Chrome 84: it gives errors about the HTML MIME type and refused CSS. Not sure if this is known or expected.
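
Editorial note: the thread never confirms the cause of the `DEBUG = False` symptom. One common cause in Django projects generally is that the development server stops serving static files once `DEBUG` is off, so a stylesheet request gets an HTML error page and the browser refuses it because of its MIME type. The fragment below is a minimal sketch of the settings usually involved, assuming that is what happened here; the host name and the `STATIC_ROOT` path are hypothetical and are not taken from Papermerge.

```python
# settings.py fragment (sketch only; host name and paths are hypothetical)

DEBUG = False

# With DEBUG off, Django rejects requests whose Host header is not listed here.
ALLOWED_HOSTS = ["papermerge.example.com"]

# URL prefix under which static assets (CSS, JS, images) are requested.
STATIC_URL = "/static/"

# Directory that `python manage.py collectstatic` copies all static files into.
# A web server (or similar) must then serve this directory at STATIC_URL.
STATIC_ROOT = "/srv/papermerge/static/"
```

After running `python manage.py collectstatic`, the front-end web server would need to serve `STATIC_ROOT` under `STATIC_URL`. For a quick local check only, `python manage.py runserver --insecure` makes the development server serve static files even with `DEBUG = False`.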

kerem closed this issue and added the bug label on 2026-02-25 21:31:09 +03:00.

@ciur commented on GitHub (Aug 26, 2020):

Hi @dohlin, your Automates configuration looks correct. There is indeed an issue. At this point it is difficult for me to track down the problem. This is actually the real issue: there is currently no way to track why automates missed certain documents. There should be some sort of activity log in the UI for the user. There is a duplicate issue, #88.

The copy-paste behavior is a separate problem. This is how it works at this point.

And lastly, DEBUG=False giving errors about the HTML MIME type is very strange. I will give it a try and come back with details.


@mikkelnl commented on GitHub (Aug 26, 2020):

I'm also testing Papermerge as a possible upgrade from Paperless ;-) and also found that Automates don't seem to work. I watched the console as I uploaded a new PDF and found this error, which seems to point to automation:

[2020-08-26 11:43:27,118: DEBUG/ForkPoolWorker-2] Automate Belasting matched document=belastingdienst.pdf
[2020-08-26 11:43:27,123: ERROR/ForkPoolWorker-2] Task papermerge.core.tasks.ocr_page[c44a7e82-c2f3-4301-a829-cbcb94fb1087] raised unexpected: AttributeError("'NoneType' object has no attribute '__module__'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/papermerge/papermerge/core/tasks.py", line 34, in ocr_page
    page_hocr_ready.send(
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 173, in send
    return [
  File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 174, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/app/papermerge/papermerge/core/signals.py", line 32, in apply_automates_handler
    apply_automates(
  File "/app/papermerge/papermerge/core/automate.py", line 48, in apply_automates
    logger.debug(f"Found plugin module={plugin_klass.__module__}")
AttributeError: 'NoneType' object has no attribute '__module__'
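
Editorial note: the last frame of the traceback shows a debug log line reading an attribute of `plugin_klass` while it is `None`. The snippet below is a minimal sketch of a guard against that pattern. It is not the Papermerge code: `describe_plugin` is a hypothetical helper, and the assumption that `None` means "no plugin configured for this automate" is mine.

```python
import logging

logger = logging.getLogger(__name__)


def describe_plugin(plugin_klass):
    """Return a log-friendly description of a plugin class.

    Hypothetical helper: assumes `plugin_klass` is either a class or None
    (None taken to mean that the automate has no plugin attached).
    """
    if plugin_klass is None:
        return "no plugin configured"
    return f"plugin module={plugin_klass.__module__}"


# Logging through the helper cannot raise AttributeError on None.
logger.debug("Found %s", describe_plugin(None))
logger.debug("Found %s", describe_plugin(dict))
```

With a guard like this, a missing plugin would show up as an ordinary debug message instead of aborting the `ocr_page` task.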


@ciur commented on GitHub (Aug 29, 2020):

Hi guys, @mikkelnl, @dohlin, I am investigating the Automates-related issues.

To make it easier to track automates, I am adding so-called "user logs", so that you will be able to track the main events directly in the UI: an Automate ran, an Automate matched, or there was a mismatch.
I will come back next week with more details.


@mikkelnl commented on GitHub (Aug 29, 2020):

Great, if there's anything I can do to test etc, let me know.


@ciur commented on GitHub (Aug 31, 2020):

Ah, I found the issue! I made a stupid mistake! Automates were matched against the hocr text, not the text itself, which results in a low match rate. Anyway, I will fix this. As a bonus, you will get UI logs where you will be able to follow the whole matching/mismatching process.
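
Editorial note: to illustrate why matching against hOCR lowers the match rate, here is a minimal sketch. hOCR is HTML in which each recognized word typically sits in its own element, so a multi-word pattern is interrupted by markup. The sample hOCR string and the `hocr_to_text` and `automate_matches` helpers are hypothetical; this is not the matching code Papermerge uses.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML/hOCR fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def hocr_to_text(hocr):
    """Return the plain text of an hOCR fragment, words joined by spaces."""
    parser = _TextExtractor()
    parser.feed(hocr)
    parser.close()
    return " ".join(parser.chunks)


def automate_matches(pattern, text):
    """Hypothetical matcher: case-insensitive substring search."""
    return pattern.lower() in text.lower()


# Hypothetical two-word line in hOCR form: one span per recognized word.
SAMPLE_HOCR = (
    "<span class='ocr_line' title='bbox 10 10 300 40'>"
    "<span class='ocrx_word' title='bbox 10 10 120 40'>Water</span> "
    "<span class='ocrx_word' title='bbox 130 10 300 40'>Utility</span>"
    "</span>"
)

matched_on_hocr = automate_matches("water utility", SAMPLE_HOCR)
matched_on_text = automate_matches("water utility", hocr_to_text(SAMPLE_HOCR))
```

In this sketch `matched_on_hocr` is `False` because the tags sit between the two words, while `matched_on_text` is `True` once the markup is stripped. Single-word patterns can still match the raw hOCR, which is consistent with a low match rate instead of no matches at all.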


@ciur commented on GitHub (Sep 1, 2020):

I "fixed" the automates issues.

There are some important changes:

  1. "Extract Page" was removed. It was an experiment of mine that failed; it was simply too buggy to be of practical use.
  2. The Automates plugins part was also removed, because the way I had designed plugins was just wrong: for each and every plugin, the user was supposed to write a small Python app, which even I did not do.
    In the meantime I learned about another project (https://github.com/invoice-x/invoice2data), which inspired me to change the initial design of automate plugins.
    Instead, the user will be able to upload a yml file that will be a sort of template describing the data to extract. I will implement the invoice2data approach in a later version of Papermerge.

For now, I will leave automates as a simple "match and move to destination folder".
In the next version, 1.5, automates will enable the user to assign tags to the matched documents.
The invoice2data approach of automatically extracting data from documents will be introduced in Papermerge 1.6.

Automates are simpler now, but they work!
