[GH-ISSUE #503] Stats failed: From UTF-8 error: invalid utf-8 sequence of 1 bytes from index 81909 #174

Open
opened 2026-03-02 04:12:35 +03:00 by kerem · 1 comment

Originally created by @ysjemmm on GitHub (Feb 11, 2026).
Original GitHub issue: https://github.com/git-ai-project/git-ai/issues/503

Originally assigned to: @svarlamov on GitHub.

I am not a Rust developer. I consulted an AI and it provided some solutions.

Problem Analysis

Error Message

Stats failed: From UTF-8 error: invalid utf-8 sequence of 1 bytes from index 81909

Root Cause

When executing the git-ai stats command, the code attempts to convert Git diff output to a UTF-8 string at the following location:

Location: src/git/repository.rs line 1773

let diff_output = String::from_utf8(output.stdout)?;

This error is triggered when the diff output contains bytes that are not valid UTF-8, which can happen when the repository contains:

  1. Binary files (images, compiled files, archives, etc.)
  2. Non-UTF-8 encoded text files (such as GBK, Latin-1, etc.)
  3. Files containing invalid UTF-8 sequences

In your case, commit e022db36e2d16b63d8477439451b05599c3da117 likely contains:

  • iOS project binary resource files (.png, .jpg, .xcassets, etc.)
  • Build artifacts or dependency libraries
  • Files with special encodings
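
To make the failure concrete, here is a minimal sketch of how strict decoding rejects a single stray byte. The helper names decode_strict and first_invalid_offset are hypothetical and are not part of git-ai; only String::from_utf8 and std::str::from_utf8 come from the Rust standard library.

```rust
/// Strictly decode raw bytes, returning the error text on failure.
/// (Hypothetical helper, for illustration only.)
fn decode_strict(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| format!("From UTF-8 error: {}", e))
}

/// Byte offset of the first invalid sequence, if there is one.
/// This corresponds to the "from index N" part of the error message.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

For example, passing the bytes of an added line that ends in a Latin-1 encoded "é" (the single byte 0xE9) to decode_strict returns an Err, and first_invalid_offset reports the position of that byte. The index 81909 in the reported error plays the same role: it is the byte offset into the diff output where decoding stopped.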

Why This Affects the Stats Command

The git-ai stats command execution flow:

  1. Get commit diff statistics
  2. Call diff_added_lines() to parse added line numbers
  3. Fails here: Attempts to convert Git diff output to UTF-8 string
  4. Run git-ai blame on each added line to determine AI attribution
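
As a rough illustration of step 2, the sketch below extracts added line numbers from a -U0 diff by reading only the file and hunk headers. It is not the project's parse_diff_added_lines; the names parse_range and parse_added_lines are hypothetical, and the sketch assumes unquoted paths with the default b/ prefix.

```rust
use std::collections::HashMap;

/// Parse "start[,count]" from one side of a hunk header; count defaults to 1.
fn parse_range(token: &str) -> Option<(u32, u32)> {
    let mut parts = token.splitn(2, ',');
    let start: u32 = parts.next()?.parse().ok()?;
    let count: u32 = match parts.next() {
        Some(c) => c.parse().ok()?,
        None => 1,
    };
    Some((start, count))
}

/// Map each file to the new-side line numbers added in a `-U0` diff.
/// Hypothetical sketch: assumes unquoted paths with the default "b/" prefix.
fn parse_added_lines(diff: &str) -> HashMap<String, Vec<u32>> {
    let mut result: HashMap<String, Vec<u32>> = HashMap::new();
    let mut current: Option<String> = None;
    // Body lines left in the current hunk, so that content such as
    // "+++ text" is not mistaken for a file header.
    let mut remaining: u32 = 0;
    for line in diff.lines() {
        if remaining > 0 && (line.starts_with('+') || line.starts_with('-')) {
            remaining -= 1;
            continue;
        }
        if let Some(path) = line.strip_prefix("+++ ") {
            // "+++ /dev/null" (deleted file) has no "b/" prefix and yields None.
            current = path.strip_prefix("b/").map(str::to_string);
        } else if let Some(rest) = line.strip_prefix("@@ -") {
            // Hunk header: "@@ -old[,count] +new[,count] @@ ..."
            let mut tokens = rest.split(' ');
            let old = tokens.next().and_then(parse_range);
            let new = tokens
                .next()
                .and_then(|t| t.strip_prefix('+'))
                .and_then(parse_range);
            if let (Some((_, old_count)), Some((start, count)), Some(file)) =
                (old, new, current.as_ref())
            {
                remaining = old_count + count;
                if count > 0 {
                    result
                        .entry(file.clone())
                        .or_default()
                        .extend(start..start + count);
                }
            }
        }
    }
    result
}
```

Only header lines are interpreted here, and those are plain ASCII apart from the path. The content lines, which are where invalid bytes would normally appear, are skipped without being inspected.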

Solutions

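Solution 1: Use UTF-8 Lossy Conversion (Recommended)

Change strict UTF-8 conversion to lossy conversion, which automatically replaces invalid byte sequences with the Unicode replacement character (U+FFFD):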

// Before (src/git/repository.rs:1773)
let diff_output = String::from_utf8(output.stdout)?;

// After
let diff_output = String::from_utf8_lossy(&output.stdout).into_owned();

Pros:

  • Won't fail due to non-UTF-8 content
  • For diff parsing, replacing invalid bytes should not affect results, because only line numbers and filenames are parsed, not line content
  • Simple and straightforward

Cons:

  • Invalid bytes are replaced, so the original content of affected lines is not preserved (minimal impact on the stats command)
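
A small sketch of what the lossy conversion does to a diff fragment, assuming a hypothetical helper named decode_lossy (not a git-ai function). It shows that the ASCII hunk header passes through unchanged while the invalid byte in the content line becomes U+FFFD.

```rust
/// Decode diff bytes, replacing each invalid sequence with U+FFFD.
/// (Hypothetical helper, for illustration only.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Count how many replacement characters appear in the decoded text.
/// Note: this also counts any U+FFFD that was already present in valid input.
fn replacement_count(text: &str) -> usize {
    text.chars().filter(|c| *c == '\u{FFFD}').count()
}
```

Given a hunk header followed by an added line containing the Latin-1 byte 0xE9, decode_lossy returns a string with the header intact and a single U+FFFD in place of that byte. A count like replacement_count could be logged if it is useful to know that the diff contained undecodable content.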

Solution 2: Add Binary File Filtering

Add the --no-binary option to the git diff command to skip binary files:

// Modify src/git/repository.rs:1745
args.push("--no-binary".to_string());  // Add this line
args.push("-U0".to_string());

Pros:

  • Avoids processing binary files at the source
  • Maintains UTF-8 strictness

Cons:

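  • Still can't handle non-UTF-8 encoded text files
  • May miss some files that need to be counted
  • --no-binary is not listed among the documented git diff options (the documentation lists only --binary), so confirm that Git accepts the flag before relying on it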

Solution 3: Combined Approach (Best)

Combine Solutions 1 and 2:

// src/git/repository.rs
pub fn diff_added_lines(
    &self,
    from_ref: &str,
    to_ref: &str,
    pathspecs: Option<&HashSet<String>>,
) -> Result<HashMap<String, Vec<u32>>, GitAiError> {
    let mut args = self.global_args_for_exec();
    args.push("diff".to_string());
    args.push("-U0".to_string());
    args.push("--no-color".to_string());
    args.push("--no-binary".to_string());  // Add: skip binary files
    args.push(from_ref.to_string());
    args.push(to_ref.to_string());

    // ... pathspecs handling ...

    let output = exec_git(&args)?;
    // Use lossy conversion instead of strict conversion
    let diff_output = String::from_utf8_lossy(&output.stdout).into_owned();

    let mut result = parse_diff_added_lines(&diff_output)?;

    // ... subsequent processing ...
}

@stuartsessions commented on GitHub (Feb 11, 2026):

#426 shows a similar error with utf-8 encoding.
