[GH-ISSUE #104] Verify / Modify Result Text files #84

Closed
opened 2026-02-25 21:31:10 +03:00 by kerem · 1 comment

Originally created by @goldcoders on GitHub (Sep 1, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/104

Sometimes the extracted text doesn't match the document. Is there a way to verify it in the UI and edit the output to correct it?
In my experience, converting an image or PDF to text with the tesseract command does not always give a 100% correct match.

It would be handy, since we already have the file and the output text saved in
./media/results/**/*.txt

Maybe we can add a way in the UI to verify the output text and change it if there's a discrepancy.
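
As a rough illustration of the starting point for such a review step, here is a minimal Python sketch that collects the text files matching the glob above and previews each one. The function names `list_ocr_text_files` and `preview` are hypothetical and are not part of Papermerge; the only thing taken from the issue is the `./media/results/**/*.txt` pattern.

```python
from pathlib import Path


def list_ocr_text_files(results_root):
    """Return every .txt file found recursively under results_root, sorted by path."""
    root = Path(results_root)
    return sorted(p for p in root.rglob("*.txt") if p.is_file())


def preview(path, max_chars=200):
    """Return the first max_chars characters of a text file for a quick manual check."""
    return path.read_text(encoding="utf-8", errors="replace")[:max_chars]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains ./media
    for txt_file in list_ocr_text_files("./media/results"):
        print(txt_file)
        print(preview(txt_file))
```

This only reads the files; writing corrections back is a separate problem.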

kerem closed this issue 2026-02-25 21:31:10 +03:00

@ciur commented on GitHub (Sep 1, 2020):

It is not as easy as it looks. The problem is that the .txt files (and the hOCR files as well) are moved around. Try uploading a PDF with a couple of pages and, after OCR is done, reorder the pages; you will understand what I mean.

At the file level, documents are versioned. Every time you reorder/cut/paste pages, a new version is created, and inside that version the .txt files are moved/copied.
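
To make the difficulty concrete, here is a minimal Python sketch of what a page reorder could look like on disk. The layout (`<doc_root>/v<version>/page_<n>.txt`) and the function `reorder_pages` are assumptions made up for illustration; they are not Papermerge's actual storage layout or API. The point is only that a hand-corrected text file ends up under a different path and page number in the new version, so an editing feature would have to follow it across versions.

```python
import shutil
from pathlib import Path


def reorder_pages(doc_root, old_version, new_order):
    """Copy per-page text files into a new version directory, in the new page order.

    Hypothetical layout: <doc_root>/v<version>/page_<n>.txt
    new_order[i] is the old page number that becomes page i + 1.
    """
    doc_root = Path(doc_root)
    src = doc_root / f"v{old_version}"
    dst = doc_root / f"v{old_version + 1}"
    dst.mkdir(parents=True)
    for new_number, old_number in enumerate(new_order, start=1):
        # The old version is left untouched; the text is copied under its new page number.
        shutil.copyfile(src / f"page_{old_number}.txt", dst / f"page_{new_number}.txt")
    return dst
```

For example, `reorder_pages(doc_root, 1, [2, 1])` would swap the two pages of a two-page document: the text that lived in `v1/page_2.txt` would now be found in `v2/page_1.txt`.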

Reference
starred/papermerge#84