[GH-ISSUE #205] Automatically determine creation date via OCR #166

Closed
opened 2026-02-25 21:31:21 +03:00 by kerem · 7 comments

Originally created by @guillaume-u on GitHub (Nov 11, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/205

Originally assigned to: @ciur on GitHub.

I would like to upload a lot of scans with different date formats. Ideally, as in Paperless, the _creation date_ (which is distinct from the _added date_ in Paperless) should be determined via OCR without having to set the date format.

Is there a way to do that in the current version? I did not find this capability. If not, could it be added in a new release?

Thanks.


@ciur commented on GitHub (Nov 11, 2020):

Hi @guillaume-u. In Papermerge there is only a `created_at` field (creation date). One way to approach your problem would be to add "added date" metadata. If you add metadata at the folder level, all documents added to that folder will inherit this metadata field. You will need to fill in the metadata value manually, though.


@guillaume-u commented on GitHub (Nov 11, 2020):

Thanks @ciur, just one more question: the `created_at` field depends only on the creation time (not on the content of the doc), correct?


@ciur commented on GitHub (Nov 11, 2020):

@guillaume-u, correct, the `created_at` field is set to the date your document was uploaded.
To be technically correct, `created_at` is of the so-called datetime type (date + time), so it stores the date and time when your document was uploaded to Papermerge, and has nothing to do with the content of the document itself.
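
To illustrate the date versus datetime distinction, here is a minimal Python sketch. It is not Papermerge code: `uploaded_at` and `upload_day` are hypothetical names, and the timestamp is an arbitrary example value standing in for what `created_at` might hold.

```python
from datetime import date, datetime, timezone

# Hypothetical stand-in for the value stored in created_at at upload time.
uploaded_at = datetime(2020, 11, 11, 14, 30, 0, tzinfo=timezone.utc)

# A datetime carries both date and time; .date() drops the time part.
upload_day = uploaded_at.date()

print(uploaded_at.isoformat())  # 2020-11-11T14:30:00+00:00
print(upload_day.isoformat())   # 2020-11-11
```

Either way, the value comes from the clock at upload time, not from anything printed on the scanned page.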


@guillaume-u commented on GitHub (Nov 11, 2020):

Thanks again @ciur.

Before closing this ticket, do you think this feature (adding a date determined by OCR) is a good one, and could it be added to the roadmap?


@ciur commented on GitHub (Nov 11, 2020):

@guillaume-u, it is definitely a good feature. Including an "added date" sounds very reasonable. The problem is that if someone else then requests a "sent date" (many physical letters include a field indicating when the letter was sent), or maybe a "paid date", and so on, it would not be reasonable to include all those fields.
One solution would be to use metadata.
Yet another solution, which will be possible starting with the [next release](https://github.com/ciur/papermerge/issues/203), is to add a very small app (I will document step by step how to do it) which extends the document model with whatever date/integer field you need for your particular case. That would be as simple as:

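from django.db import models

# AbstractDocument is provided by Papermerge; its import path is not given in this thread.
class Document(AbstractDocument):
    """
    This is the main document model extension, which adds an extra date field.
    """
    added_date = models.DateField(...)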

The latter approach has the disadvantage that you need to know a little bit of programming. On the other hand, if your added-date field is in popular demand, I or anyone else can write a reusable app so that those who need an _added date_ can include it in their Papermerge instance.


@guillaume-u commented on GitHub (Nov 11, 2020):

Thanks !

As an ugly hack (for my own use only, because it's not clean at all and it modifies `created_at` instead of creating a new attribute), I copy/pasted the Paperless regex into `core/automate.py`. It does the job for my import.

Thanks again for your project. Like Paperless, it's a very good one.
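
For readers wondering what such a hack might look like, below is a rough Python sketch of regex-based date extraction from OCR text. It is not the Paperless regex and not the code that was pasted into `core/automate.py`; the patterns, the `DATE_PATTERNS` list, and the `guess_document_date` function are all hypothetical, and the slash-separated format is assumed to be day-first.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical patterns, NOT the Paperless regex. Each entry pairs a regex
# with the strptime format used to parse whatever the regex matches.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),        # 2020-11-11
    (re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),  # 11.11.2020
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%d/%m/%Y"),    # assumed day-first
]


def guess_document_date(ocr_text: str) -> Optional[date]:
    """Return the valid date that appears earliest in the text, or None."""
    best = None  # (position in text, parsed date)
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.finditer(ocr_text):
            try:
                parsed = datetime.strptime(match.group(0), fmt).date()
            except ValueError:
                continue  # looks like a date but is not one, e.g. 31.02.2020
            if best is None or match.start() < best[0]:
                best = (match.start(), parsed)
            break  # matches come in text order, so the first valid one suffices
    return best[1] if best else None


print(guess_document_date("Invoice dated 2020-11-05, due soon"))
```

Picking the earliest date on the page is only a heuristic, and OCR errors can easily defeat it. Storing the result in a separate field, rather than overwriting `created_at`, would keep the upload timestamp intact.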


@amo13 commented on GitHub (Nov 13, 2020):

> As an ugly hack (for my own use only, because it's not clean at all and it modifies `created_at` instead of creating a new attribute), I copy/pasted the Paperless regex into `core/automate.py`. It does the job for my import.

I also really like the idea of automatic creation-date extraction. Would you mind sharing the regex from Paperless here?

(duplicate of #71)
