mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 12:05:58 +03:00
[GH-ISSUE #30] Some questions regarding Metadata #24
Originally created by @charlie89 on GitHub (Jul 4, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/30
Hi,
As far as I understand the documentation and the section about metadata plugins in the file papermerge.conf.py.example, there should be Python plugins that extract metadata from the OCR'd content. (By the way, the documentation doesn't mention the metadata plugins.)
I found the metadata plugin lidl-recepts-de on pypi.org (by the way, the 'i' is missing compared to the linked GitHub repo papermerge/lidl-receipts-de and the comment in papermerge.conf.py.example). The code seems pretty easy, so I would like to implement my own metadata plugins for all my monthly/quarterly/yearly recurring documents and, of course, publish them for everyone to use.
So before I start implementing these plugins, I have the following questions:
1. Shouldn't the plugin names have a specific prefix such as 'papermerge-metadata-' (which would give 'papermerge-metadata-lidl-recepts-de')? That would make the plugins easier to find on pypi.org.
2. Should there really be a separate plugin for every single type of document (one for Lidl, one for Rewe, ...)? If someone implemented a metadata plugin only for each of the bigger grocery/electronics/household/furniture/etc. stores in the German-speaking territory, there would already be hundreds of individual plugins. Many of them would be maintained by other people and would become outdated whenever you change the interface between Papermerge and the metadata plugins. Wouldn't it be much better to have a single plugin that contains everything and to which everyone contributes through pull requests? Or maybe use a completely different concept, e.g. a frontend where the user defines a rule to match the document type (whether the document is from Lidl or Rewe) and then adds rules to extract the necessary metadata for that document type, perhaps combined with a GitHub repo of existing templates.
3. Is there a way to map a result of a metadata plugin to a configured metadata field with a different name?
Example: the lidl-recepts-de plugin returns metadata for 'shop', 'price' and 'date', but I would like them displayed in the frontend in German as 'Firma', 'Betrag' and 'Datum'. (This is based on the code; I haven't actually tested it with a Lidl receipt.)
Some background information:
My objective is to just scan a document (or receive it by mail) and then have the DMS automatically add metadata and move it into the right folder without any further manual intervention from me (at least for all monthly recurring bills, because I hate doing the same manual thing each month if I can automate it). If I scan a bill, e.g. from the grocery store Lidl, a custom script would fetch the picture from my scanner (because my scanner doesn't directly support remote storage) and upload it via the API to a custom inbox folder in Papermerge. That folder would have the metadata 'shop' defined, and another custom script would then move the document (via the API and a not-yet-existing move command) to the directory bills/lidl, which has the metadata date, price and store location defined. This way Papermerge would be pretty much exactly the DMS I'm looking for.
I'm running version 1.3.0. Oh, and I'm from Austria, so you could even write in German in case you PM me.
@ciur commented on GitHub (Jul 4, 2020):
First of all, this:
is exactly what I am trying to build :).
Another very important point: in version 1.3.0 the metadata backend code is there, but the user interface to use it is not! Yes, there is a dropdown menu which pops up a metadata menu, and I even documented plugins (too early). For release 1.3.0 I introduced metadata and experimented with all of the above concepts, but the implementation is still incomplete.
The objective, which you expressed very nicely, will be fulfilled in release 1.4.0, which will be out in August 2020.
The way I see it is - again as you said:
The rules that you mentioned are part of the user interface changes in release 1.4.0. To be honest, I am not sure yet how they will look, as I am experimenting with the concepts myself, so I am open to any suggestions.
The LIDL plugin was my very first experiment/proof of concept, so it is very difficult for me to debate at this moment "how it should be". I usually experiment and implement the most practical approach.
How can I PM you?
You can write me an email at eugen@papermerge.com; that way we could continue brainstorming.
@ciur commented on GitHub (Jul 21, 2020):
I re-read and reconsidered your feedback.
You have very, very good points. Let me go through one by one:
1. Absolutely! I will add the papermerge-meta-plugin- prefix to each plugin.
2. Working on the concept right now. They will be called "Automates". The UI will allow you to configure the text/regexp to match, which plugin to apply and, optionally, (a) which folder to move the document into and (b) maybe whether to extract the page as an individual document.
3. Good catch! I will add configuration for that. Next to METADATA_PLUGINS = [] there will be METADATA_PLUGIN_MAPS = {}, where you will be able to map metadata keys in plugins to names matching your document.
Example:
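The key mapping and the "Automates" idea described above could look roughly like the following. This is a hypothetical sketch, not Papermerge code: the flat shape of METADATA_PLUGIN_MAPS (plugin key to display name), the Automate class, and the helpers rename_keys and pick_automate are all assumptions made for illustration. Only the setting name, the keys 'shop'/'price'/'date' and the folder bills/lidl come from this thread.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape: plugin metadata key -> name shown in the frontend.
METADATA_PLUGIN_MAPS = {"shop": "Firma", "price": "Betrag", "date": "Datum"}


def rename_keys(metadata, key_map):
    """Return a copy of `metadata` with mapped keys renamed; other keys are kept."""
    return {key_map.get(key, key): value for key, value in metadata.items()}


@dataclass
class Automate:
    """Hypothetical rule: a regexp to match, a plugin to apply, an optional target folder."""
    pattern: str
    plugin: str
    folder: Optional[str] = None

    def matches(self, text):
        # Case-insensitive search anywhere in the OCR'd text.
        return re.search(self.pattern, text, re.IGNORECASE) is not None


def pick_automate(text, automates):
    """Return the first rule whose pattern matches `text`, or None."""
    for automate in automates:
        if automate.matches(text):
            return automate
    return None


automates = [Automate(r"\blidl\b", "lidl-recepts-de", "bills/lidl")]
hit = pick_automate("LIDL Austria GmbH, receipt no. 42", automates)
labels = rename_keys({"shop": "Lidl", "price": "12.34"}, METADATA_PLUGIN_MAPS)
```

With these definitions, hit is the Lidl rule (so the document could be moved to bills/lidl), and labels carries the German names 'Firma' and 'Betrag' in place of 'shop' and 'price'.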
@charlie89 commented on GitHub (Jul 21, 2020):
Sounds very promising.
I found the Python module invoice2data, which can already parse invoices based on small YAML configuration files and regexes.
It's pretty simple: you just run
invoice2data --debug my_invoice.pdf
to get the OCR'd text, paste the result into regexpal.com (or any other regex tester), write a regex of the form Customer-No:\s+(\d+) and put that into a config file. It is then able to read the customer number (e.g. 12345). Maybe you can use that module or use it for inspiration.
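To illustrate the regex step on its own, here is a minimal sketch using only Python's standard re module. It does not use invoice2data itself; the sample text and the function name extract_customer_no are made up for the example.

```python
import re

# The pattern from the comment above: the label, whitespace, then the digits to capture.
CUSTOMER_NO = re.compile(r"Customer-No:\s+(\d+)")


def extract_customer_no(ocr_text):
    """Return the customer number as a string, or None if the label is not found."""
    match = CUSTOMER_NO.search(ocr_text)
    return match.group(1) if match else None


# Made-up OCR output for illustration.
sample = "ACME Ltd.\nCustomer-No:   12345\nTotal: 99.00 EUR"
print(extract_customer_no(sample))  # prints 12345
```

The capture group (\d+) is what ends up as the extracted field; text without the Customer-No: label simply yields None.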
@ciur commented on GitHub (Jul 24, 2020):
I just checked the invoice2data project, and I love their templating system. Thank you for the suggestion, @charlie89.
In future versions of Papermerge I will definitely borrow ideas from the invoice2data project.