[GH-ISSUE #217] OCR performance question #175
Originally created by @amo13 on GitHub (Nov 19, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/217
Originally assigned to: @ciur on GitHub.
I dropped a whole lot of documents into the papermerge inbox to initially fill my papermerge instance, maybe between 60 and 100. By reloading the inbox page I could see that the documents got digested in a matter of minutes. They all got added to the correct folders and got correctly tagged as set in the automations. All in all it went very well and very fast!
After some time, maybe a week or so, without adding documents, without really using papermerge, I dropped 16 more documents into the inbox. Those were of the same nature as the other ones before, same sender, same sort of content, same number of pages. But this time, it took ages for papermerge to actually digest them, move and tag them according to the automations.
On Monday at 06:00, there were 13 documents left to be processed inside the inbox.
On Tuesday at 10:30, there were still 6 left.
On Wednesday at 06:00, still 2 left to be processed.
During all that time, the CPU was running at 100% on each core, used up by tesseract.
Does this sound normal? What strikes me here is that my initial drop of (many, many more) documents was digested so fast, while the second (and much smaller) batch took ages to process. Thanks for sharing any thoughts on this!
Info:
@ciur commented on GitHub (Nov 20, 2020):
@amo13, can you please confirm that you use redis as message broker (I saw in description, but still..) ?
You can confirm that by checking if folder queue (in project directory) is empty.
Another way to double check that redis is doing the job - check that redis db is populated with jobs as documents are added.
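The folder check above can be scripted. This is only a rough sketch, not something taken from papermerge itself: the helper name pending_messages is made up for illustration, and it assumes that leftover messages are plain files ending in .msg (the suffix of the file reported later in this thread) inside a folder called queue.

```python
from pathlib import Path


def pending_messages(queue_dir):
    """Return the sorted names of *.msg files left in a filesystem queue folder.

    Hypothetical helper: the ".msg" suffix is an assumption based on the
    file name reported in this thread.
    """
    folder = Path(queue_dir)
    if not folder.is_dir():
        # A missing folder is treated the same as an empty one.
        return []
    return sorted(p.name for p in folder.glob("*.msg") if p.is_file())


if __name__ == "__main__":
    # "queue" is the folder name used in this thread; adjust the path as needed.
    leftovers = pending_messages("queue")
    print(f"{len(leftovers)} message file(s) still in the queue folder")
    for name in leftovers:
        print(" ", name)
```

If the list stays empty while documents are being added, that would be consistent with the filesystem queue no longer being used, though on its own it does not prove that redis is handling the jobs.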
@amo13 commented on GitHub (Nov 20, 2020):
Yes, I use redis as a message broker, just as described here.
There is only one file, 2788761550_9e0db55e-91f9-44dc-aa14-a6499cfd3375.d1474b39-b45f-3f37-8c33-95dc25a30f7a.msg, in my queue folder and it's from Nov. 11th, so I guess it's still from before the change to use redis.
With redis-cli monitor, I see a lot of these (around one per second, you can see the timestamps on the left):
After dropping a document into the inbox (using the web-ui), I was not able to see anything more than those heartbeats in the redis monitor. (Maybe I just missed it; I can't be sure, since there is a constant message flood, with nextcloud adding a lot more to it, and it might take ages again for the one document to finish processing. Maybe something would appear in the redis monitor upon OCR end?) The queue folder didn't change either, though.
redis-cli INFO | grep ^db shows me only one db, which should be my nextcloud instance. (But I don't really know how redis works and what that means.)
@amo13 commented on GitHub (Dec 11, 2020):
I am not quite sure why, but currently, without having changed anything, the OCR of newly dropped documents into the inbox is again fast as lightning. I am closing this for now, since it seems to be unclear why it slowed down so horribly in the first place as described above. I will come back to this if I notice another massive slowdown without apparent reason...
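For anyone hitting a similar slowdown, the redis-cli INFO | grep ^db check used earlier in this thread can be made a bit easier to read with a small parser. This is a minimal sketch under assumptions: parse_keyspace is a made-up helper name, and the sample text only imitates the usual dbN:keys=...,expires=...,avg_ttl=... layout of the Keyspace section with invented numbers. It is not output from the instance discussed here.

```python
def parse_keyspace(info_text):
    """Map each 'dbN:...' line of redis INFO output to its integer counters."""
    databases = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Keyspace lines look like "db0:keys=42,expires=3,avg_ttl=0".
        if not line.startswith("db") or ":" not in line:
            continue
        name, _, fields = line.partition(":")
        stats = {}
        for field in fields.split(","):
            key, _, value = field.partition("=")
            if value.isdigit():
                stats[key] = int(value)
        databases[name] = stats
    return databases


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = "# Keyspace\ndb0:keys=42,expires=3,avg_ttl=0\n"
    for db, stats in parse_keyspace(sample).items():
        print(db, stats)
```

Redis only lists databases that currently hold keys, so seeing a single db does not by itself show which application the keys belong to. Comparing the keys count before and after dropping a document into the inbox may give a better hint.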