[GH-ISSUE #104] Verify / Modify Result Text files #84

Closed

opened 2026-02-25 21:31:10 +03:00 by kerem · 1 comment

kerem commented

2026-02-25 21:31:10 +03:00

Owner

Originally created by @goldcoders on GitHub (Sep 1, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/104

Sometimes the extracted text doesn't match the document. Is there a way to verify it in the UI and edit the output to make it more accurate?
In my experience, converting an image or PDF to text with the tesseract command does not always give a 100% correct match.

It would be handy, since we already have the file and the output text saved in
./media/results/**/*.txt

Maybe we could add a way in the UI to verify the output text and change it if there's a discrepancy.

kerem closed this issue

2026-02-25 21:31:10 +03:00

kerem commented

2026-02-25 21:31:10 +03:00

Author

Owner

@ciur commented on GitHub (Sep 1, 2020):

It is not as easy as it looks. The problem is that the txt files (and the hocr files as well) are moved around. Try uploading a PDF with a couple of pages and, once OCR has finished, reorder the pages; you will see what I mean.

At the file level, documents are versioned. Every time you reorder/cut/paste pages, a new version is created, and inside that version the .txt files are moved/copied.
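
To illustrate why this complicates editing, here is a minimal Python sketch of a reorder that creates a new version and copies the per-page text files to their new positions. The `v<N>/page_<k>.txt` layout and the `reorder_pages` and `demo` names are assumptions made up for the example, not Papermerge's actual storage layout or API.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def reorder_pages(doc_root: Path, old_version: int, new_order: list[int]) -> Path:
    """Copy page text files into a new version directory in the given order.

    new_order[i] is the old page number that becomes page i + 1.
    The v<N>/page_<k>.txt layout is hypothetical, chosen only for this sketch.
    """
    src = doc_root / f"v{old_version}"
    dst = doc_root / f"v{old_version + 1}"
    dst.mkdir(parents=True)
    for new_num, old_num in enumerate(new_order, start=1):
        # The text follows the page, so its file name changes with the position.
        shutil.copy(src / f"page_{old_num}.txt", dst / f"page_{new_num}.txt")
    return dst


def demo() -> tuple[str, str]:
    """Swap two pages and return the text of pages 1 and 2 in the new version."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        v1 = root / "v1"
        v1.mkdir()
        (v1 / "page_1.txt").write_text("invoice (hand-corrected)")
        (v1 / "page_2.txt").write_text("receipt")
        v2 = reorder_pages(root, 1, [2, 1])
        return (v2 / "page_1.txt").read_text(), (v2 / "page_2.txt").read_text()
```

In this sketch a hand-corrected text only survives because the copy step carries it along to `page_2.txt` in the new version. Any editing feature would have to track the same moves, and would have to do so for the hocr files too.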