[PR #452] Fix/chinese filename classification 404 #479

Closed
opened 2026-03-02 04:13:44 +03:00 by kerem · 0 comments

Original Pull Request: https://github.com/git-ai-project/git-ai/pull/452

State: closed
Merged: Yes


Fix Git path parsing to properly decode non-ASCII filenames, such as those
containing Chinese characters, emoji, or text in other Unicode scripts.

Problem

In diff headers, Git wraps non-ASCII filenames in double quotes and writes
each non-ASCII byte as an octal escape:
+++ "b/\344\270\255\346\226\207.txt"

The previous implementation didn't unescape these paths, causing authorship
tracking to fail for files with non-ASCII names.
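
To make the mismatch concrete: each `\NNN` escape in the header above is one byte of the filename's UTF-8 encoding, so the six escapes are the raw bytes of two Chinese characters. This quoting is controlled by Git's `core.quotePath` setting, which is on by default. The sketch below only illustrates the encoding; `ESCAPED` and `bytes_to_name` are hypothetical names chosen for this example, not code from the PR.

```rust
/// The six octal escapes from the quoted header, written as Rust octal literals.
/// (Illustrative constant, not part of the PR.)
const ESCAPED: [u8; 6] = [0o344, 0o270, 0o255, 0o346, 0o226, 0o207];

/// Hypothetical helper: interprets raw bytes as a UTF-8 file name.
/// Returns `None` when the bytes are not valid UTF-8.
fn bytes_to_name(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

Here `bytes_to_name(&ESCAPED)` yields `Some("中文")`, the stem of `中文.txt`. A parser that keeps the escaped form verbatim would instead look up a path that does not exist in the working tree.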

Solution

  • Add unescape_git_path() utility to decode octal escapes in git paths
  • Update parse_diff_added_lines() and parse_diff_added_lines_with_insertions()
    to handle both quoted and unquoted path formats
  • Properly extract and decode paths from git diff output
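
The steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the merged implementation: only the name `unescape_git_path` comes from the PR, while its signature, the lossy UTF-8 fallback, and the `path_from_plus_header` helper are hypothetical. The escape table follows Git's C-style path quoting (three-digit octal bytes plus the usual single-character escapes).

```rust
/// Sketch of a decoder for Git's C-style quoted paths (not the PR's actual code).
/// Input that is not wrapped in double quotes is returned unchanged.
fn unescape_git_path(path: &str) -> String {
    let inner = match path.strip_prefix('"').and_then(|p| p.strip_suffix('"')) {
        Some(inner) => inner,
        None => return path.to_string(),
    };
    let src = inner.as_bytes();
    let mut out: Vec<u8> = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        // Ordinary byte, or a trailing lone backslash: copy as-is.
        if src[i] != b'\\' || i + 1 >= src.len() {
            out.push(src[i]);
            i += 1;
            continue;
        }
        // Three octal digits (first digit 0..=3) encode one raw byte, e.g. \344 -> 0xE4.
        if let Some(digits) = src.get(i + 1..i + 4) {
            if (b'0'..=b'3').contains(&digits[0])
                && digits[1..].iter().all(|b| (b'0'..=b'7').contains(b))
            {
                let byte = digits.iter().fold(0u8, |acc, d| acc * 8 + (d - b'0'));
                out.push(byte);
                i += 4;
                continue;
            }
        }
        // Single-character escapes; any other escaped byte (such as \" or \\) maps to itself.
        let decoded = match src[i + 1] {
            b'n' => b'\n',
            b't' => b'\t',
            b'r' => b'\r',
            b'a' => 0x07,
            b'b' => 0x08,
            b'f' => 0x0C,
            b'v' => 0x0B,
            other => other,
        };
        out.push(decoded);
        i += 2;
    }
    // Assumption: invalid UTF-8 is replaced rather than reported as an error.
    String::from_utf8_lossy(&out).into_owned()
}

/// Hypothetical helper: extracts the new-side path from a `+++` diff header,
/// accepting both the quoted and the unquoted form.
fn path_from_plus_header(line: &str) -> Option<String> {
    let raw = line.strip_prefix("+++ ")?.trim_end();
    if raw == "/dev/null" {
        return None; // the file was deleted, so there is no new-side path
    }
    let path = unescape_git_path(raw);
    Some(path.strip_prefix("b/").unwrap_or(&path).to_string())
}
```

Decoding into a byte buffer first and converting to a string once at the end matters here: a multi-byte character is spread over several escapes, so converting each escape on its own would produce invalid UTF-8. The `trim_end` call is a simplification that would also strip legitimate trailing whitespace from an unquoted filename.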

Testing

Added comprehensive test coverage for:

  • CJK scripts (Chinese, Japanese, Korean)
  • RTL scripts (Arabic, Hebrew, Persian, Urdu)
  • Indic scripts (Hindi, Tamil, Bengali, Telugu, Gujarati)
  • Southeast Asian (Thai, Vietnamese, Khmer, Lao)
  • Cyrillic and Greek
  • Emoji (including ZWJ sequences, skin tones, flags)
  • Special Unicode (math symbols, currency, diacritics)
  • Unicode normalization (NFC/NFD)
  • Edge cases and stress tests

Fixes #404
