[GH-ISSUE #30] Some questions regarding Metadata #24

Closed
opened 2026-02-25 21:31:01 +03:00 by kerem · 4 comments
Owner

Originally created by @charlie89 on GitHub (Jul 4, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/30

Hi,
As far as I understand the documentation and the section about metadata plugins in the file papermerge.conf.py.example, there should be Python plugins which extract metadata from the OCR'd content. (By the way, the documentation doesn't mention the metadata plugins.)
I found the metadata plugin lidl-recepts-de on pypi.org (note that the 'i' is missing compared to the linked GitHub repo papermerge/lidl-receipts-de and the comment in papermerge.conf.py.example). The code seems pretty simple, so I would like to implement my own metadata plugins for all my monthly/quarterly/yearly recurring documents and of course publish them for everyone to use.

So before I start implementing these plugins, I have the following questions:

  • Shouldn't the plugin names have a specific prefix like 'papermerge-metadata-' or something like that (which would give 'papermerge-metadata-lidl-recepts-de')? That way it would be easier to find the plugins on pypi.org.

  • Should there really be a separate plugin for every single type of document (one for Lidl, one for Rewe, ...)? If someone implemented a metadata plugin just for each larger grocery/electronics/household/furniture/etc. store in the German-speaking territory, there would be hundreds of separate plugins. Many of them would be maintained by other people, and they would become outdated whenever you change something in the interface between papermerge and the metadata plugins. Wouldn't it be much better to have a single plugin that contains everything, with everyone contributing through pull requests? Or maybe use a completely different concept, e.g. a frontend where the user defines a rule to match the document type (whether the document is from Lidl or Rewe) and then adds rules to extract the necessary metadata for this document type, and maybe a GitHub repo with existing templates.

  • Is there a way to map a result of the metadata plugin to a configured metadata field with another name?
    Example: The lidl-recepts-de plugin returns metadata for 'shop', 'price' and 'date', but I would like to have them displayed in the frontend in German as 'Firma', 'Betrag' and 'Datum'. (This is based on the code; I didn't actually test it with a Lidl receipt.)

Some background information:
My objective is to just scan a document (or receive it via mail) and then have the DMS automatically add metadata and move it into the right folder without any further manual intervention from me (at least for all monthly recurring bills, because I hate doing the same manual thing each month if I can automate it). If I scan a bill, e.g. from the grocery store Lidl, a custom script would fetch the picture from my scanner (because my scanner doesn't directly support remote storage) and upload it via the API to a custom inbox folder in papermerge. That folder would have the metadata 'shop' defined, and another custom script would then move the document (via the API and a move command that does not exist yet) to the directory bills/lidl, which has the metadata date, price and location of the store defined. This way papermerge would be pretty much exactly the DMS I'm looking for.
I'm running version 1.3.0. Oh, and I'm from Austria, so you could even write in German in case you PM me.

kerem 2026-02-25 21:31:01 +03:00
Author
Owner

@ciur commented on GitHub (Jul 4, 2020):

First of all, this:

> My objective is to just scan a document (or get it via mail) and then have the DMS automatically add metadata and push it into the right folder without any further manual intervention from me (at least for all monthly reoccurring bills, cause i hate doing the same manual thing each month if i can automate it) [...] then there would be the metadata 'shop' on this folder and another custom script would then move the document (via api and a not yet existing move command) to the directory bills/lidl which has the metadata date, price and location of the store defined.

is exactly what I am trying to build :).

Another very important point: in version 1.3.0 the metadata backend code is there, but the user interface to use it is not! Yes, there is a dropdown menu which opens the metadata menu, and I even documented plugins (too early). For release 1.3.0 I introduced metadata and experimented with all of the above concepts, but the implementation is still incomplete.
The objective, which you expressed very nicely, will be fulfilled in release 1.4.0, which will be out in August 2020.

The way I see it is - again as you said:

> [...] i.e. a frontend where the user defines a rule to match the document type (if the document is from lidl or rewe) and then adds rules to extract the necessary metadata for this document [...]

The rules that you mentioned are part of the user interface changes in release 1.4.0. To be honest, I am not sure yet how they will look, as I am still experimenting with concepts myself, so I am open to any suggestions.
The LIDL plugin was my very first experiment/proof of concept, so it is very difficult for me to debate at this moment "how it should be". I usually experiment and then implement the most practical approach.

How can I PM you?
You can write me an email at eugen@papermerge.com, so that we can continue brainstorming.


@ciur commented on GitHub (Jul 21, 2020):

I re-read and reconsidered your feedback.
You have very, very good points. Let me go through them one by one:

> Shouldn't the plugin names have a specific prefix like 'papermerge-metadata-' or something like that (which would give 'papermerge-metadata-lidl-recepts-de')? That way it would be easier to find the plugin on pypi.org.

Absolutely! I will add the papermerge-meta-plugin- prefix to each of them.

> Or maybe use a completely other concept, i.e. a frontend where the user defines a rule to match the document type (if the document is from lidl or rewe) and then adds rules to extract the necessary metadata for this document type and maybe a github repo with existing templates.

I am working on the concept right now. They will be called "Automates". The UI will allow you to configure the text/regexp to match and which plugin to apply, and optionally (1) which folder to move the document into and (2) maybe whether to extract the page as an individual document.
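
The rule concept described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Automate` class, its field names and the `first_match` helper are hypothetical and do not come from Papermerge's code base; the sample text is invented.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Automate:
    """Hypothetical rule; names and fields are assumptions, not Papermerge's model."""
    name: str
    match: str                        # plain text or regular expression
    is_regex: bool = False
    plugin: Optional[str] = None      # e.g. "lidl_receipts_de.Lidl"
    dst_folder: Optional[str] = None  # optional folder to move the document into

    def matches(self, text: str) -> bool:
        """Case-insensitive check of the rule's pattern against OCR'd text."""
        if self.is_regex:
            return re.search(self.match, text, re.IGNORECASE) is not None
        return self.match.lower() in text.lower()


def first_match(automates, text):
    """Return the first rule whose pattern matches the text, else None."""
    for automate in automates:
        if automate.matches(text):
            return automate
    return None


rules = [
    Automate("lidl", "lidl", plugin="lidl_receipts_de.Lidl", dst_folder="bills/lidl"),
    Automate("rewe", r"\brewe\b", is_regex=True, dst_folder="bills/rewe"),
]
hit = first_match(rules, "LIDL Austria GmbH\nSumme 12,34 EUR")
print(hit.name, hit.dst_folder)
```

Returning the first matching rule keeps the behaviour predictable when several rules could apply; whether the real feature will work that way is not stated in this thread.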

> Is there a way to map a result of the metadata plugin to a configured metadata with another name?

Good catch! I will add a configuration option for that. Next to METADATA_PLUGINS = [] there will be METADATA_PLUGIN_MAPS = {}, where you will be able to map the metadata keys returned by a plugin to the names you use for your documents.
Example:

METADATA_PLUGINS = [
    # Note: even though the PyPI package name is papermerge-meta-plugin-lidl....,
    # the importable name is shorter: lidl_receipts_de
    "lidl_receipts_de.Lidl",
    "rewe_receipts_de.Rewe",
]
METADATA_PLUGIN_MAPS = {
    "lidl_receipts_de.Lidl": {
        "shop": "Firma",
        "price": "Betrag",
        "date": "Datum",
    },
    "rewe_receipts_de.Rewe": { ... },
}
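
A minimal sketch of how such a map might be applied to a plugin's result. The helper `apply_plugin_map` is hypothetical and not part of Papermerge; the sample values are made up, and the thread does not say what happens to unmapped keys, so keeping them under their original name is an assumption.

```python
# Hypothetical helper: neither this function nor its call site is Papermerge's API.
METADATA_PLUGIN_MAPS = {
    "lidl_receipts_de.Lidl": {
        "shop": "Firma",
        "price": "Betrag",
        "date": "Datum",
    },
}


def apply_plugin_map(plugin_path, extracted, maps=METADATA_PLUGIN_MAPS):
    """Rename keys of `extracted` using the mapping registered for `plugin_path`.

    Keys without a mapping entry are kept under their original name (assumption).
    """
    key_map = maps.get(plugin_path, {})
    return {key_map.get(key, key): value for key, value in extracted.items()}


# Example values are invented for illustration.
raw = {"shop": "Lidl", "price": "12.34", "date": "2020-07-04"}
renamed = apply_plugin_map("lidl_receipts_de.Lidl", raw)
print(renamed)
```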


@charlie89 commented on GitHub (Jul 21, 2020):

Sounds very promising.
I found the Python module invoice2data (https://github.com/invoice-x/invoice2data), which can already parse invoices based on small YAML configuration files and regexes.
It's pretty simple: you run invoice2data --debug my_invoice.pdf to get the OCR'd text, paste the result into regexpal.com (https://www.regexpal.com) or any other regex tester, write a regex of the form Customer-No:\s+(\d+) and put it into a config file. It is then able to read the customer number (e.g. 12345).
Maybe you can use that module, or use it for inspiration.
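
To illustrate only the regex step, here is a small sketch using Python's standard `re` module. It does not use invoice2data or its template format, and the sample text is invented; real input would be the OCR output of an invoice.

```python
import re

# Invented sample text standing in for the OCR'd content of an invoice.
ocr_text = "Invoice 2020-07\nCustomer-No:   12345\nTotal: 99.00 EUR"

# Regex from the comment above; the capture group holds the customer number.
pattern = re.compile(r"Customer-No:\s+(\d+)")

match = pattern.search(ocr_text)
customer_no = match.group(1) if match else None
print(customer_no)
```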

@ciur commented on GitHub (Jul 24, 2020):

I just checked the invoice2data project, and I love their templating system. Thank you for the suggestion, @charlie89.
In future versions of Papermerge I will definitely borrow ideas from the invoice2data project.
