mirror of
https://github.com/ciur/papermerge.git
synced 2026-04-25 12:05:58 +03:00
[GH-ISSUE #106] Replace pdftk with stapler #82
Labels
No labels
2.1
3.0
3.0.1
3.0.2
3.0.3
3.0.3
3.1
3.2
3.2
3.3
3.5
3.x
Fixed. Waiting for feedback.
Fixed. Waiting for feedback.
UX
Version 2.1 - alpha
XSS
announcement
beta
blocker
bug
cannot reproduce
confirmed
confirmed
critical
demo
dependencies
deployment
detchnical debt
discussion
docker
documentation
donations
duplicate
enhancement
feature request
frontend
fundraising
good first issue
good issue
help wanted
high
implemented
important
improvement
incomplete
invalid
investigation
kubernetes
low
low impact
medium
medium
medium impact
migration from 2.0
migration from 2.1
missing-language
missing-ocr-language
no-activity
note
ocr
outofscope
packaging
performance
popular request
pull-request
pypi
question
raspberry pi
roadmap
search
security
setup
status
task
technical debt
updates
user xp
version 1.4.0 - demo
will be implemented
will not be implemented
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/papermerge#82
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @ciur on GitHub (Sep 2, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/106
Hey, stapler, where were you back April 2020? I was looking for you!
The best I found for pdf page management was pdftk...
But pdftk license... license S... is Proprietary/GPL.
Today, thanks to @mtonnie, I found you, my dear stapler.
Pdftk - sorry, buddy, I will kick you out from Papermerge 1.5.0!
@tf198 commented on GitHub (Sep 15, 2020):
It would be great if you could implement this as a generic plugin system which gets the document and a list of pages (potentially in a new order)
Could be many more uses - crop, deskew, contrast etc.
This ability to correct batch scanned documents is a standout for PaperMerge so opening this to plugins would be a real bonus. If the interface could send extra kwargs as well that would be even better e.g.
threshold=60Looking at the code you are pretty invested in the clipboard workflow so not sure how this would fit in though.
@ciur commented on GitHub (Sep 15, 2020):
Hi @tf198,
thank you for your suggestion. I do have on my radar the correction features of scanned documents. When I was implementing pdf management feature - I 'removed' rotation in very last moment... Nowadays when I batch scan long receipts I feel the need to have (an automatic) deskew! These scan correction will definitely make its way into the feature set.
@georgkrause commented on GitHub (Oct 4, 2020):
Adding my voice here, pdftk is a pain to package and this is actually an blocker for my planned deployment via alpine based docker image
@ciur commented on GitHub (Oct 5, 2020):
right! pdftk is on "my blacklist" :) . But I will postpone its removal until 1.6. The reason for postponing is that even stapler project is not all unicorns - it looks not 100% reliable to me (currently tests are not passing, and that state did not change since last 4 months). But at least stapler is BSD licensed (plus it is python and generally rather simple project). So I will need more time to review pdftk -> stapler transition.
@georgkrause commented on GitHub (Oct 5, 2020):
Where are the not passing tests? Only the latest merge commit does have failing pipelines, but only for Python 3.5 which is deprecated anyway while 3.8 went well.
Anyway, I totally understand switching basic libraries is a big thing, so I understand the postponing, still I would vote for a high priority :)
@georgkrause commented on GitHub (Nov 15, 2020):
Are there any news in this?
@georgkrause commented on GitHub (Nov 22, 2020):
I investigated a little and did not found any call of the pdftk binary. Actually all work seems to be done by the poppler-utils, eg
pdfseparateandpdfuniteandpdfinfo.So if I am getting this right, pdftk is not actually needed. I am currently testing by removing the check, the Docker Images is currently building.
@ciur commented on GitHub (Nov 22, 2020):
@georgkrause, Papermerge application uses this module in mglib for cut/paste/move pages operations
pdfunite, pdfseparate are legacy modules, I will remove them right away.
@ciur commented on GitHub (Nov 22, 2020):
I removed legacy pdfseparate and pdfunite modules .
They job is now performed by mglib.
@georgkrause commented on GitHub (Nov 22, 2020):
Well, the question is: Why is pdftk declared as dependencies if it is never called. We could remove the check in this case, right?
Edit: Ah, I get it. This lib is a wrapper around pdftk. Thanks for the hint!
@georgkrause commented on GitHub (Nov 22, 2020):
@ciur It seems like
mglib.pdftkis not used anywhere in the papermerge code, or I am not able to find the places@georgkrause commented on GitHub (Nov 22, 2020):
Alright, I found it. pdftk is called from the
storageclass. So this issue should be move to glib, right? There are no changes needed in papermerge anymore to remove pdftk.Now I found the places where I have to look at the code I feel confident to file a PR to replace the current functions of pdftk by stapler. Are you open for such a patch? Are there any rules to follow?
@ciur commented on GitHub (Nov 22, 2020):
hint: start tracking pdftk down here
Cut/Paste operation uses default storage which is an abstraction of all operations performed on document files.
default storage -> mglib.storage -> mglib.pdftk -> pdftk
Why this 'complicated' ?
Because default storage module (which works on local file system) can be replaced and/or suplimented with additional storages e.g SFTP, AWS etc.
@georgkrause commented on GitHub (Nov 22, 2020):
I think this huge level of abstraction is quite nice and I am far too less involved in the code base to criticize your decisions! Thanks for the explanation anyway :)
@ciur commented on GitHub (Nov 22, 2020):
Yes, I would be very happy to see pdftk replaced with something which has more Apache 2.0 friendly licensing.
@ciur commented on GitHub (Nov 22, 2020):
@georgkrause, I added today a PR template to papermerge repo. Mglib still does not have a PR template. But in general very important rule is to follow PEP8 style. Plus obviously at least couple of unit tests.
@georgkrause commented on GitHub (Nov 22, 2020):
As far as I can see the new lib module needs to provide
paste_pages,reorder_pagesanddelete_pagesfunctionality as interface formglib.storage, right?Is there any chat for developers of this project, btw? So I wont need to spam that much in the issue
@ciur commented on GitHub (Nov 22, 2020):
I don't know. I didn't add a "papermerge special chat" :)
Yes, I would expect that you would do most of the change in a new mglib module which would be used only be mglib.storage. Also, PDFTK_BINARY option won't be needed (thus its refs in papermerge.core will be removed). But here is catch:
because papermerge depends on mglib - I would merge your changes in mglib. Test them, then I will release new version of mglib. Only after new version of mglib will be published on pypi - papermerge deps can be updated (and PDFTK_BINARY option removed)
It is always more complex then one might initially think :)
@georgkrause commented on GitHub (Nov 22, 2020):
Its still less complicated than sorting papers :D
Alright, so the plan is: read the pdftk module, write unit tests for pdftk module, create a new module, change unit tests to test the functions of this module and implement until all tests succeed. I am going to start tomorrow, will file an early PR so you can follow and comment on my steps
@georgkrause commented on GitHub (Nov 29, 2020):
Since mglib 1.3 supports stapler now, it should be straight forward to switch, right?
@ciur Should I file a PR for this?
@ciur commented on GitHub (Dec 1, 2020):
@georgkrause, no.
I removed all pdftk references and replaced them with your changes - and it works fantastic!
Note also that I added other minor changes to mglib library and incremented it to 1.3.1.
Thank you for your great contribution!
Starting with Papermerge 2.0 there will be no more pdftk dependency :), yey!
@georgkrause commented on GitHub (Dec 13, 2020):
@ciur thanks for your work! Can we already close this?
@georgkrause commented on GitHub (Dec 13, 2020):
Is there any way to get this into a release before 2.0? Would be nice to build some alpine based docker images :)
@ciur commented on GitHub (Dec 16, 2020):
@georgkrause, no.
The release candidate for 2.0 will be out in first week of January 2021.
Yes, this ticket can be closed.