[GH-ISSUE #106] Replace pdftk with stapler #82

Closed
opened 2026-02-25 21:31:10 +03:00 by kerem · 24 comments

Originally created by @ciur on GitHub (Sep 2, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/106

Hey, [stapler](https://pypi.org/project/stapler/), where were you back in April 2020? I was looking for you!
The best I found for PDF page management was pdftk...
But pdftk license... license S... is Proprietary/GPL.
Today, thanks to @mtonnie, I found you, my dear [stapler](https://pypi.org/project/stapler/).

Pdftk - sorry, buddy, I will kick you out of Papermerge 1.5.0!
kerem closed this issue 2026-02-25 21:31:10 +03:00

@tf198 commented on GitHub (Sep 15, 2020):

It would be great if you could implement this as a generic plugin system which gets the document and a list of pages (potentially in a new order).

```python
# if it could be as simple a signature as
def pdf_plugin(doc, pages, **kwargs):
    pass

# then it opens the door to all sorts of operations
pdf_rearrange(doc, [2, 1, 4, 3, 5])
pdf_delete(doc, [3, 4])
pdf_rotate(doc, [1, 3, 5], angle=180)
```

There could be many more uses: crop, deskew, contrast, etc.

This ability to correct batch-scanned documents is a standout for Papermerge, so opening this to plugins would be a real bonus. If the interface could send extra kwargs as well, that would be even better, e.g. `threshold=60`.

Looking at the code, you are pretty invested in the clipboard workflow, so I'm not sure how this would fit in though.
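
For illustration, here is a minimal sketch of how a plugin interface with the `(doc, pages, **kwargs)` signature proposed above might be wired up. Everything in it is an assumption made for the example: the registry, the `register_plugin` decorator, `run_plugin`, and the modelling of a document as a list of `(page number, rotation)` tuples are hypothetical and are not part of Papermerge or mglib.

```python
from typing import Callable, Dict, List, Tuple

Page = Tuple[int, int]  # (1-based page number, rotation in degrees); a stand-in for a real PDF page

# Hypothetical registry mapping an operation name to a plugin callable.
PLUGINS: Dict[str, Callable[..., List[Page]]] = {}


def register_plugin(name: str):
    """Decorator that stores a plugin under `name` (illustrative only)."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@register_plugin("rearrange")
def pdf_rearrange(doc: List[Page], pages: List[int], **kwargs) -> List[Page]:
    # `pages` is the complete new order, expressed as page numbers.
    by_number = dict(doc)
    if sorted(pages) != sorted(by_number):
        raise ValueError("new order must list every page exactly once")
    return [(n, by_number[n]) for n in pages]


@register_plugin("delete")
def pdf_delete(doc: List[Page], pages: List[int], **kwargs) -> List[Page]:
    drop = set(pages)
    return [(n, r) for n, r in doc if n not in drop]


@register_plugin("rotate")
def pdf_rotate(doc: List[Page], pages: List[int], angle: int = 90, **kwargs) -> List[Page]:
    # Extra keyword arguments such as `angle` travel through **kwargs.
    turn = set(pages)
    return [(n, (r + angle) % 360 if n in turn else r) for n, r in doc]


def run_plugin(name: str, doc: List[Page], pages: List[int], **kwargs) -> List[Page]:
    """Look up a plugin by name and apply it to the selected pages."""
    return PLUGINS[name](doc, pages, **kwargs)
```

With this shape, a caller only needs the operation name, the page list, and optional keyword arguments, e.g. `run_plugin("rotate", doc, [1, 3, 5], angle=180)`.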

@ciur commented on GitHub (Sep 15, 2020):

Hi @tf198,

Thank you for your suggestion. Correction features for scanned documents are on my radar. When I was implementing the PDF management feature, I 'removed' rotation at the very last moment... Nowadays, when I batch scan long receipts, I feel the need for (an automatic) deskew! These scan corrections will definitely make their way into the feature set.

@georgkrause commented on GitHub (Oct 4, 2020):

Adding my voice here: pdftk is a pain to package, and this is actually a blocker for my planned deployment via an Alpine-based Docker image.

@ciur commented on GitHub (Oct 5, 2020):

Right! pdftk is on "my blacklist" :). But I will postpone its removal until 1.6. The reason for postponing is that even the stapler project is not all unicorns: it does not look 100% reliable to me (currently its tests are not passing, and that state has not changed in the last 4 months). But at least stapler is BSD licensed (plus it is Python and generally a rather simple project). So I will need more time to review the pdftk -> stapler transition.

@georgkrause commented on GitHub (Oct 5, 2020):

Where are the failing tests? Only the latest merge commit has failing pipelines, and only for Python 3.5, which is deprecated anyway, while 3.8 went well.

Anyway, I totally understand that switching basic libraries is a big thing, so I understand the postponing. Still, I would vote for a high priority :)

@georgkrause commented on GitHub (Nov 15, 2020):

Is there any news on this?

@georgkrause commented on GitHub (Nov 22, 2020):

I investigated a little and did not find any call of the pdftk binary. Actually, all the work seems to be done by poppler-utils, e.g. `pdfseparate`, `pdfunite` and `pdfinfo`.

So if I am getting this right, pdftk is not actually needed. I am currently testing this by removing the check; the Docker image is currently building.

@ciur commented on GitHub (Nov 22, 2020):

@georgkrause, the Papermerge application uses [this module in mglib for cut/paste/move pages operations](https://github.com/papermerge/mglib/blob/master/mglib/pdftk.py).

pdfunite and pdfseparate are legacy modules; I will remove them right away.

@ciur commented on GitHub (Nov 22, 2020):

I removed the [legacy pdfseparate and pdfunite modules](https://github.com/ciur/papermerge/commit/ae3b901b54520702c3113bec3f9239a376bd5617).
Their job is now performed by [mglib](https://github.com/papermerge/mglib).

@georgkrause commented on GitHub (Nov 22, 2020):

Well, the question is: why is pdftk declared as a dependency if it is never called? We could remove the check in this case, right?

Edit: Ah, I get it. This lib is a wrapper around pdftk. Thanks for the hint!

@georgkrause commented on GitHub (Nov 22, 2020):

@ciur It seems like `mglib.pdftk` is not used anywhere in the papermerge code, or I am not able to find the places.

@georgkrause commented on GitHub (Nov 22, 2020):

Alright, I found it. pdftk is called from the `storage` class. So this issue should be moved to mglib, right? No changes are needed in papermerge anymore to remove pdftk.

Now that I have found the places where I have to look at the code, I feel confident to file a PR to replace the current functions of pdftk with stapler. Are you open to such a patch? Are there any rules to follow?

@ciur commented on GitHub (Nov 22, 2020):

Hint: [start tracking pdftk down here](https://github.com/ciur/papermerge/blob/8877782fd48d3705e5d0488a180c2a6a9d0dccab/papermerge/core/models/document.py#L771).

The cut/paste operation uses the default storage, which is an abstraction of all operations performed on document files.
default storage -> mglib.storage -> mglib.pdftk -> pdftk
Why this 'complicated'?
Because the default storage module (which works on the local file system) can be replaced and/or supplemented with additional storages, e.g. SFTP, AWS etc.
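
The delegation chain described above (storage -> page backend -> tool) can be illustrated with a minimal sketch. The class names `PageBackend`, `InMemoryBackend` and `FileSystemStorage` are hypothetical and do not reflect mglib's actual classes; pages are modelled as plain integers so the example runs without any PDF tooling.

```python
from abc import ABC, abstractmethod
from typing import List


class PageBackend(ABC):
    """Hypothetical page-manipulation backend (the role pdftk or stapler plays)."""

    @abstractmethod
    def delete_pages(self, pages: List[int], to_delete: List[int]) -> List[int]:
        ...


class InMemoryBackend(PageBackend):
    """Stand-in backend that works on page numbers instead of real PDF files."""

    def delete_pages(self, pages: List[int], to_delete: List[int]) -> List[int]:
        drop = set(to_delete)
        return [p for p in pages if p not in drop]


class FileSystemStorage:
    """Hypothetical default storage: callers talk to the storage only,
    and the storage delegates page operations to whichever backend it holds."""

    def __init__(self, backend: PageBackend) -> None:
        self.backend = backend

    def delete_pages(self, pages: List[int], to_delete: List[int]) -> List[int]:
        return self.backend.delete_pages(pages, to_delete)
```

Because callers depend only on the storage object, swapping the backend (or the storage itself, for example for a remote one) would not require changes in the calling code, which is the point of the extra layers.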

@georgkrause commented on GitHub (Nov 22, 2020):

I think this high level of abstraction is quite nice, and I am far too little involved in the code base to criticize your decisions! Thanks for the explanation anyway :)

@ciur commented on GitHub (Nov 22, 2020):

> Now I found the places where I have to look at the code I feel confident to file a PR to replace the current functions of pdftk by stapler. Are you open for such a patch? Are there any rules to follow?

Yes, I would be very happy to see pdftk replaced with something that has more Apache 2.0 friendly licensing.

@ciur commented on GitHub (Nov 22, 2020):

@georgkrause, today I added a PR template to the papermerge repo. mglib still does not have a PR template. But in general, a very important rule is to follow PEP8 style. Plus, obviously, at least a couple of unit tests.

@georgkrause commented on GitHub (Nov 22, 2020):

As far as I can see, the new lib module needs to provide `paste_pages`, `reorder_pages` and `delete_pages` functionality as the interface for `mglib.storage`, right?

Is there any chat for developers of this project, btw? So I won't need to spam that much in the issue.
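
As a rough sketch of what two of these operations could look like on top of stapler, the helpers below only build the command line and do not run it. The function names and signatures are hypothetical (the thread does not show mglib's real ones), and the sketch assumes stapler's `sel` subcommand takes an input file, a list of page numbers, and an output file, as described in the stapler README.

```python
from typing import List


def reorder_pages_cmd(src: str, dst: str, new_order: List[int]) -> List[str]:
    """Build a `stapler sel` command writing the pages of `src` to `dst` in `new_order`.

    Assumed CLI shape: stapler sel <input> <page> [<page> ...] <output>
    """
    if not new_order:
        raise ValueError("new_order must contain at least one page")
    return ["stapler", "sel", src, *[str(p) for p in new_order], dst]


def delete_pages_cmd(src: str, dst: str, page_count: int, to_delete: List[int]) -> List[str]:
    """Express deletion as 'select every page that is not deleted'."""
    drop = set(to_delete)
    keep = [p for p in range(1, page_count + 1) if p not in drop]
    if not keep:
        raise ValueError("cannot delete every page of a document")
    return reorder_pages_cmd(src, dst, keep)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Pasting pages from another document would follow the same pattern with more than one input file, but is left out here.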

@ciur commented on GitHub (Nov 22, 2020):

> Is there any chat for developers of this project, btw? So I wont need to spam that much in the issue

I don't know. I didn't add a "papermerge special chat" :)

> As far as I can see the new lib module needs to provide paste_pages, reorder_pages and delete_pages functionality as interface for mglib.storage, right?

Yes, I would expect that you would make most of the changes in a new mglib module which would be used only by mglib.storage. Also, the PDFTK_BINARY option won't be needed (thus its refs in papermerge.core will be removed). But here is the catch:
because papermerge depends on mglib, I would merge your changes in mglib and test them, then **I will release a new version of mglib**. Only after the new version of mglib is published on PyPI can the papermerge deps be updated (and the PDFTK_BINARY option removed).

It is always more complex than one might initially think :)

@georgkrause commented on GitHub (Nov 22, 2020):

It's still less complicated than sorting papers :D

Alright, so the plan is: read the pdftk module, write unit tests for the pdftk module, create a new module, change the unit tests to test the functions of this new module, and implement until all tests succeed. I am going to start tomorrow and will file an early PR so you can follow and comment on my steps.
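
The plan above amounts to writing the tests once against the old module and then pointing them at the new one. A minimal sketch of that idea with `unittest` follows; the two `delete_pages_*` functions are hypothetical stand-ins that work on lists of page numbers, not the real mglib modules.

```python
import unittest


def delete_pages_old(pages, to_delete):
    """Stand-in for the existing (pdftk-backed) implementation."""
    return [p for p in pages if p not in to_delete]


def delete_pages_new(pages, to_delete):
    """Stand-in for the replacement (stapler-backed) implementation."""
    drop = set(to_delete)
    return list(filter(lambda p: p not in drop, pages))


class DeletePagesContract:
    """Shared expectations; subclasses only choose which implementation to test."""

    def test_removes_requested_pages(self):
        self.assertEqual(self.delete_pages([1, 2, 3, 4, 5], [3, 4]), [1, 2, 5])

    def test_keeps_remaining_order(self):
        self.assertEqual(self.delete_pages([2, 1, 3], [1]), [2, 3])


class OldBackendTest(DeletePagesContract, unittest.TestCase):
    delete_pages = staticmethod(delete_pages_old)


class NewBackendTest(DeletePagesContract, unittest.TestCase):
    delete_pages = staticmethod(delete_pages_new)
```

Once both test classes pass, the old implementation and its test class can be dropped without touching the shared expectations.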

@georgkrause commented on GitHub (Nov 29, 2020):

Since mglib 1.3 supports stapler now, it should be straightforward to switch, right?

@ciur Should I file a PR for this?

@ciur commented on GitHub (Dec 1, 2020):

> Should I file a PR for this?

@georgkrause, no.

I removed all pdftk references and replaced them with your changes, and it works fantastic!
Note also that I added other minor changes to the mglib library and bumped its version to 1.3.1.

Thank you for your great contribution!
Starting with Papermerge 2.0 there will be no more pdftk dependency :), yey!

@georgkrause commented on GitHub (Dec 13, 2020):

@ciur thanks for your work! Can we already close this?


@georgkrause commented on GitHub (Dec 13, 2020):

Is there any way to get this into a release before 2.0? Would be nice to build some alpine based docker images :)


@ciur commented on GitHub (Dec 16, 2020):

> Is there any way to get this into a release before 2.0? Would be nice to build some alpine based docker images :)

@georgkrause, no.

The release candidate for 2.0 will be out in the first week of January 2021.

> Can we already close this?

Yes, this ticket can be closed.