[GH-ISSUE #17] [Bug]: Tesseract OCR fails to separate words, causing over-redaction #5

Open
opened 2026-03-02 11:44:57 +03:00 by kerem · 0 comments

Originally created by @karant-dev on GitHub (Dec 12, 2025).
Original GitHub issue: https://github.com/karant-dev/AutoRedact/issues/17

Describe the bug

When scanning documents (e.g., IDs/Licenses), Tesseract sometimes fails to insert spaces between labels and values (e.g., 'Address:123MainSt' instead of 'Address: 123 Main St').

Because such a run is detected as a single oversized 'word', a regex match on any part of it (e.g. the number '123') causes the entire string to be redacted.
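The failure mode can be illustrated with a minimal sketch. It assumes redaction works per OCR word: if the pattern matches anywhere in a word's text, the word's whole bounding box is redacted. The `Word` type, `boxes_to_redact`, and the coordinates are hypothetical and are not taken from the AutoRedact code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR word: recognized text plus its bounding box."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


def boxes_to_redact(words, pattern):
    """Assumed word-level rule: redact the whole box of any word the pattern matches."""
    return [w.box for w in words if pattern.search(w.text)]


digits = re.compile(r"\d+")

# Correctly segmented output: only the number's box is selected.
segmented = [
    Word("Address:", (10, 10, 90, 30)),
    Word("123", (100, 10, 130, 30)),
    Word("Main", (140, 10, 180, 30)),
    Word("St", (190, 10, 210, 30)),
]

# Merged output: the label and street name share one box with the number.
merged = [Word("Address:123MainSt", (10, 10, 210, 30))]

print(boxes_to_redact(segmented, digits))  # [(100, 10, 130, 30)]
print(boxes_to_redact(merged, digits))     # [(10, 10, 210, 30)]
```

With the merged word, the selected box spans the label 'Address:' as well as the number, which matches the over-redaction described above.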

Steps to reproduce

  1. Upload a document where text is close together (like a Driver's License).
  2. Observe that 'Address' and other non-sensitive labels get redacted along with the sensitive data.

Expected behavior

Only the sensitive substring should be redacted, or words should be correctly segmented.
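One possible way to redact only the sensitive substring, sketched here under the assumption that only a word-level box is available: estimate each character's width as the box width divided by the text length, then cut the box down to the span of each regex match. The function name `narrow_boxes` and the sample coordinates are hypothetical, and the estimate is only approximate for proportional fonts.

```python
import re


def narrow_boxes(text, box, pattern):
    """Estimate a sub-box for each regex match inside one OCR word.

    Assumes every character is equally wide, which is an approximation.
    """
    left, top, right, bottom = box
    if not text:
        return []
    char_width = (right - left) / len(text)
    result = []
    for match in pattern.finditer(text):
        # Map character offsets of the match to horizontal pixel positions.
        sub_left = round(left + match.start() * char_width)
        sub_right = round(left + match.end() * char_width)
        result.append((sub_left, top, sub_right, bottom))
    return result


# 17 characters across 170 pixels gives an estimated 10 pixels per character.
print(narrow_boxes("Address:123MainSt", (0, 0, 170, 20), re.compile(r"\d+")))
# [(80, 0, 110, 20)]
```

Because the widths are estimated, a real implementation would probably pad the resulting box by a few pixels so that no part of the sensitive text stays visible.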

Additional context

A code audit suggests Tesseract is grouping these into a single 'word' block. We might need to force character-level processing or improve word segmentation.
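If per-character boxes can be obtained from the OCR engine, character-level processing could look roughly like the sketch below: run the regex over the joined characters and take the union of the boxes that the match covers. The input shape (a list of `(char, box)` pairs) and the name `match_boxes_from_chars` are assumptions made for illustration, not an existing API.

```python
import re


def match_boxes_from_chars(chars, pattern):
    """Return one box per regex match, built from per-character boxes.

    chars: list of (character, (left, top, right, bottom)) in reading order.
    """
    text = "".join(ch for ch, _ in chars)
    result = []
    for match in pattern.finditer(text):
        covered = [box for _, box in chars[match.start():match.end()]]
        if not covered:
            continue  # Skip empty matches, which cover no characters.
        # Union of the covered character boxes.
        result.append((
            min(b[0] for b in covered),
            min(b[1] for b in covered),
            max(b[2] for b in covered),
            max(b[3] for b in covered),
        ))
    return result


sample = [
    ("A", (0, 0, 10, 20)),
    (":", (10, 2, 14, 20)),
    ("1", (14, 1, 24, 19)),
    ("2", (24, 0, 34, 20)),
]
print(match_boxes_from_chars(sample, re.compile(r"\d+")))  # [(14, 0, 34, 20)]
```

Unlike a width estimate, this uses the positions the engine actually reported, so it does not depend on the font being monospaced.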
