[GH-ISSUE #426] BUG Report: Git Commit Fails with Non-UTF-8 Files (e.g., GBK) #159

Closed
opened 2026-03-02 04:12:23 +03:00 by kerem · 6 comments

Originally created by @ShuboLiang on GitHub (Jan 30, 2026).
Original GitHub issue: https://github.com/git-ai-project/git-ai/issues/426

Originally assigned to: @svarlamov on GitHub.

If a code file uses an encoding other than UTF-8 (such as GBK, a character encoding for Simplified Chinese), running git commit results in an error.

<img width="881" height="786" alt="Image" src="https://github.com/user-attachments/assets/f84c444d-da0c-4c4e-b365-7f72648410f9" />

@ShuboLiang commented on GitHub (Feb 2, 2026):

@acunniffe Is this in progress?


@acunniffe commented on GitHub (Feb 2, 2026):

Yes, working on testing it cross-platform.


@svarlamov commented on GitHub (Feb 2, 2026):

Just catching up on this -- what's the expected behavior? I don't think Git, nor Git AI for that matter, can mix encodings rationally. File names for sure have to be UTF-8, and I'm not sure that Git treats GBK contents as text either (for the purposes of diffs, etc.).

I think we would have to detect non-UTF-8 files and treat them as binary files (essentially bypassing Git AI). That would fix the error (this shouldn't error), but you will probably lose AI attribution for any non-UTF-8 file contents.
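
For illustration, the detection step proposed above could be sketched in Rust roughly as follows. This is a minimal sketch, not Git AI's actual code: `ContentKind` and `classify_content` are hypothetical names, and it assumes that "non-UTF-8" simply means the bytes fail strict UTF-8 validation.

```rust
/// Hypothetical classification of file contents (not part of Git AI's API).
#[derive(Debug, PartialEq, Eq)]
pub enum ContentKind {
    /// The bytes are valid UTF-8, so line-level attribution can proceed.
    Utf8Text,
    /// The bytes are not valid UTF-8 (for example GBK); treat them as opaque.
    Opaque,
}

/// Classify a buffer by strict UTF-8 validation only.
/// Assumption: anything that fails validation is handled like a binary file.
pub fn classify_content(bytes: &[u8]) -> ContentKind {
    match std::str::from_utf8(bytes) {
        Ok(_) => ContentKind::Utf8Text,
        Err(_) => ContentKind::Opaque,
    }
}
```

With this approach a GBK-encoded file would be skipped instead of raising an error, at the cost of the lost attribution mentioned above.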


@ShuboLiang commented on GitHub (Feb 2, 2026):

@svarlamov I hope that git-ai can properly track contribution history for code files encoded in GBK. Our company maintains many legacy projects that use GBK encoding. Git itself can handle these files as plain text without issues. However, when I use git-ai, it fails to work correctly during git commit. Specifically, after committing, the expected git-ai statistics UI (showing how much was written by humans vs. AI) doesn't appear.


@svarlamov commented on GitHub (Feb 2, 2026):

I see, we'd like to get support for this then! Can you share the settings you use with Git to support GBK and how it handles mixed encoding within a repo? I'd also like to confirm that the filenames are UTF-8 (I was under the impression that UTF-8 is a hard requirement for file names).

We use the `byte` type internally for most of our logic, so technically this should be possible. I don't think Rust `String`s are GBK-compatible though, so this will be interesting to implement/might require a bit of refactoring in areas where we read files into `String` before converting to `byte` for processing
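
As a rough illustration of the refactoring described above, the sketch below reads a file as raw bytes and splits it into lines without ever building a `String`. It is an assumption-laden example, not Git AI's implementation: `read_raw` and `split_lines` are hypothetical helpers. It relies on the fact that GBK lead and trail bytes never take the value `0x0A`, so splitting on `b'\n'` does not cut a multi-byte character in half.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a file as raw bytes. Unlike `fs::read_to_string`, this does not
/// fail when the contents are not valid UTF-8.
pub fn read_raw(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Split a byte buffer into lines on b'\n' without decoding it.
/// A '\r' from CRLF line endings is left at the end of each line.
pub fn split_lines(bytes: &[u8]) -> Vec<&[u8]> {
    let mut lines: Vec<&[u8]> = bytes.split(|&b| b == b'\n').collect();
    // A trailing newline (or an empty buffer) yields one empty final slice; drop it.
    if lines.last().map_or(false, |line| line.is_empty()) {
        lines.pop();
    }
    lines
}
```

For example, the GBK bytes `D6 D0 0A CE C4 0A` are rejected by `String::from_utf8`, but `split_lines` still returns two lines that can be compared byte for byte.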


@ShuboLiang commented on GitHub (Feb 2, 2026):

Git works without any additional configuration. All our filenames are in English, so they are UTF-8 encoded. Thank you for your help! Since git-ai will be heavily used by our users, we’ll likely have many more related questions for you in the future. Thanks again!
