[GH-ISSUE #17] [Bug]: Tesseract OCR fails to separate words, causing over-redaction #5

Open

opened 2026-03-02 11:44:57 +03:00 by kerem · 0 comments

kerem commented

2026-03-02 11:44:57 +03:00

Owner

Originally created by @karant-dev on GitHub (Dec 12, 2025).
Original GitHub issue: https://github.com/karant-dev/AutoRedact/issues/17

Describe the bug

When scanning documents (e.g., IDs or licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').

Because such a run of text is detected as a single large 'word', a regex that matches only part of the string (e.g. the number '123') causes the entire string to be redacted.
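
To make the failure mode concrete, here is a minimal sketch of word-level redaction. It is not AutoRedact's actual code: the `Word` type, the `SENSITIVE` pattern, the `boxes_to_redact` helper, and all coordinates are hypothetical and only illustrate why a merged OCR word drags the label into the redacted area.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical layout


@dataclass
class Word:
    text: str
    box: Box


# Hypothetical rule: treat any run of digits as sensitive.
SENSITIVE = re.compile(r"\d+")


def boxes_to_redact(words: List[Word]) -> List[Box]:
    # Word-level redaction: if any part of the word matches,
    # the whole word's bounding box is redacted.
    return [w.box for w in words if SENSITIVE.search(w.text)]


# Correctly segmented OCR output: only the number is redacted.
segmented = [
    Word("Address:", (10, 10, 90, 30)),
    Word("123", (100, 10, 130, 30)),
    Word("Main", (140, 10, 180, 30)),
    Word("St", (190, 10, 210, 30)),
]

# Merged OCR output: one box covers the label and the value.
merged = [Word("Address:123MainSt", (10, 10, 210, 30))]

print(boxes_to_redact(segmented))  # [(100, 10, 130, 30)]
print(boxes_to_redact(merged))     # [(10, 10, 210, 30)]
```

With the merged input, the only box available is the one spanning the whole string, so 'Address:' is blacked out together with '123'.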

Steps to reproduce

1. Upload a document where text is close together (like a Driver's License).
2. Observe that 'Address' and other non-sensitive labels get redacted along with the sensitive data.

Expected behavior

Only the sensitive substring should be redacted, or words should be correctly segmented.
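
One possible way to redact only the sensitive substring, sketched under assumptions: estimate the horizontal extent of each regex match by splitting the word's box evenly across its characters. The `substring_boxes` helper and the coordinates are hypothetical, and the even-width assumption is only an approximation for proportional fonts, so a real fix would probably want some padding.

```python
import re
from typing import List, Pattern, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical layout


def substring_boxes(text: str, box: Box, pattern: Pattern[str]) -> List[Box]:
    """Approximate a box for each regex match inside one OCR word.

    Assumes every character occupies the same width, which is only
    roughly true for proportional fonts.
    """
    if not text:
        return []
    x0, y0, x1, y1 = box
    char_width = (x1 - x0) / len(text)
    result = []
    for m in pattern.finditer(text):
        left = round(x0 + m.start() * char_width)
        right = round(x0 + m.end() * char_width)
        result.append((left, y0, right, y1))
    return result


# 17 characters over a 170 px wide box gives 10 px per character.
print(substring_boxes("Address:123MainSt", (10, 10, 180, 30), re.compile(r"\d+")))
# [(90, 10, 120, 30)]
```

Here only the span estimated for '123' is returned, leaving 'Address:' and 'MainSt' visible.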

Additional context

A code audit suggests Tesseract is grouping these into a single 'word' block. We might need to force character-level processing or improve segmentation.
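
A rough sketch of what character-level processing could look like, assuming the OCR step can hand back one bounding box per recognized character (Tesseract has a symbol level in its result iterator, but whether and how AutoRedact can access it is not confirmed here). The `Symbol` type and `redact_from_symbols` helper are hypothetical, and the sketch assumes each symbol holds exactly one character so that regex offsets line up with list indices.

```python
import re
from typing import List, NamedTuple, Pattern, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical layout


class Symbol(NamedTuple):
    char: str  # assumed to be exactly one character
    box: Box


def redact_from_symbols(symbols: List[Symbol], pattern: Pattern[str]) -> List[Box]:
    """Return one box per regex match, built from per-character boxes."""
    text = "".join(s.char for s in symbols)
    result = []
    for m in pattern.finditer(text):
        hit = symbols[m.start():m.end()]
        if not hit:
            continue  # skip empty matches
        # Union of the matched characters' boxes.
        result.append((
            min(s.box[0] for s in hit),
            min(s.box[1] for s in hit),
            max(s.box[2] for s in hit),
            max(s.box[3] for s in hit),
        ))
    return result


symbols = [
    Symbol("A", (0, 0, 10, 20)),
    Symbol(":", (10, 0, 14, 20)),
    Symbol("1", (14, 2, 22, 20)),
    Symbol("2", (22, 0, 30, 18)),
]
print(redact_from_symbols(symbols, re.compile(r"\d+")))
# [(14, 0, 30, 20)]
```

Compared with splitting a word box evenly, this uses the positions the OCR engine actually reported, at the cost of requesting and handling more data per page.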