mirror of
https://github.com/karant-dev/AutoRedact.git
synced 2026-04-26 00:05:52 +03:00
[GH-ISSUE #17] [Bug]: Tesseract OCR fails to separate words, causing over-redaction #5
Originally created by @karant-dev on GitHub (Dec 12, 2025).
Original GitHub issue: https://github.com/karant-dev/AutoRedact/issues/17
Describe the bug
When scanning documents (e.g., IDs/Licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').
Because such a run is detected as a single massive 'word', a regex that matches only part of the string (e.g., the number '123') causes the entire string to be redacted.
Steps to reproduce
Expected behavior
Only the sensitive substring should be redacted, or words should be correctly segmented.
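One way to redact only the sensitive substring, even when OCR returns a merged word, is to estimate the match's box from its character offsets. The sketch below is an illustration, not AutoRedact's actual code: the `OcrWord` and `Box` shapes and the `estimateMatchBoxes` helper are hypothetical, and it assumes every character takes an equal share of the word's width, which is only approximate for proportional fonts.

```typescript
// Hypothetical shapes for illustration; not taken from the AutoRedact codebase.
interface Box { x0: number; y0: number; x1: number; y1: number; }
interface OcrWord { text: string; bbox: Box; }

// Estimate one box per regex match inside a single OCR word, assuming each
// character occupies an equal horizontal slice of the word's bounding box.
function estimateMatchBoxes(word: OcrWord, pattern: RegExp): Box[] {
  const len = word.text.length;
  if (len === 0) return [];

  // Work on a global copy so exec() walks through every match.
  const flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  const re = new RegExp(pattern.source, flags);

  const charWidth = (word.bbox.x1 - word.bbox.x0) / len;
  const boxes: Box[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(word.text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    boxes.push({
      x0: word.bbox.x0 + m.index * charWidth,
      y0: word.bbox.y0,
      x1: word.bbox.x0 + (m.index + m[0].length) * charWidth,
      y1: word.bbox.y1,
    });
  }
  return boxes;
}

// The merged "word" from the bug report: 17 characters across 340 px.
const merged: OcrWord = {
  text: "Address:123MainSt",
  bbox: { x0: 100, y0: 40, x1: 440, y1: 60 },
};
const boxes = estimateMatchBoxes(merged, /\d+/);
// boxes[0] covers x 260..320 instead of the full 100..440.
```

With this approach the label 'Address:' and the trailing 'MainSt' stay visible. A small horizontal padding on each box may be worth adding in practice, since the equal-width assumption can clip wide glyphs.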
Additional context
Code audit suggests Tesseract is grouping these into a single 'word' block. We might need to force char-level processing or better segmentation.
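The char-level option could look roughly like the following. Tesseract can report boxes at symbol (character) level, but how AutoRedact would obtain them is not covered in this issue, so the `OcrSymbol` shape and the `matchBoxesFromSymbols` helper here are assumptions made for illustration. The idea is to run the regex over the joined symbol text and redact the union of the boxes of only those symbols the match covers.

```typescript
// Hypothetical shapes for illustration; not taken from the AutoRedact codebase.
interface Box { x0: number; y0: number; x1: number; y1: number; }
interface OcrSymbol { text: string; bbox: Box; }

// Run the regex over the concatenated symbol text and return, for each match,
// the smallest box that encloses every symbol the match touches.
function matchBoxesFromSymbols(symbols: OcrSymbol[], pattern: RegExp): Box[] {
  // owner[i] is the index of the symbol that produced character i of `text`.
  const owner: number[] = [];
  let text = "";
  symbols.forEach((s, i) => {
    for (let k = 0; k < s.text.length; k++) owner.push(i);
    text += s.text;
  });

  // Work on a global copy so exec() walks through every match.
  const flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  const re = new RegExp(pattern.source, flags);

  const boxes: Box[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    const first = owner[m.index];
    const last = owner[m.index + m[0].length - 1];
    const union: Box = { ...symbols[first].bbox };
    for (let i = first + 1; i <= last; i++) {
      const b = symbols[i].bbox;
      union.x0 = Math.min(union.x0, b.x0);
      union.y0 = Math.min(union.y0, b.y0);
      union.x1 = Math.max(union.x1, b.x1);
      union.y1 = Math.max(union.y1, b.y1);
    }
    boxes.push(union);
  }
  return boxes;
}

// Made-up symbol boxes for the merged string "ID:42X".
const symbols: OcrSymbol[] = [
  { text: "I", bbox: { x0: 10, y0: 20, x1: 14, y1: 40 } },
  { text: "D", bbox: { x0: 15, y0: 20, x1: 24, y1: 40 } },
  { text: ":", bbox: { x0: 25, y0: 26, x1: 28, y1: 40 } },
  { text: "4", bbox: { x0: 30, y0: 20, x1: 40, y1: 40 } },
  { text: "2", bbox: { x0: 41, y0: 21, x1: 51, y1: 41 } },
  { text: "X", bbox: { x0: 52, y0: 20, x1: 63, y1: 40 } },
];
const digitBoxes = matchBoxesFromSymbols(symbols, /\d+/);
```

Compared with splitting the word box into equal-width slices, this relies on measured character positions, so it should hold up better with proportional fonts. The trade-off is that it depends on the OCR engine returning usable symbol-level boxes for the merged text.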