[GH-ISSUE #358] [CRITICAL] Panic when processing files with multi-byte UTF-8 characters (Chinese, Japanese, etc.) #132

Closed
opened 2026-03-02 04:12:04 +03:00 by kerem · 1 comment
Owner

Originally created by @harvest-L on GitHub (Jan 16, 2026).
Original GitHub issue: https://github.com/git-ai-project/git-ai/issues/358

Bug Severity

CRITICAL - Causes git-ai to crash with panic, making it unusable for any files containing multi-byte UTF-8 characters.

Error Message

thread 'blocking-2' panicked at src\authorship\attribution_tracker.rs:1754:42:
byte index 2630 is not a char boundary; it is inside '选' (bytes 2629..2632) of `<template>
    <div class="add-person-container">
      <!-- 添加按钮(仅添加模式显示) -->
      <div class="add-btn-wrapper" v-if="mode === 'add'">
        <es-button type="text" icon="el-icon-plus" @click="addRow"> 添加个人 </es-button>
      </d

Note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Affected Triggers

Both git-ai checkpoint hooks fail with the same panic:

  • PreToolUse:Edit [git-ai checkpoint claude --hook-input stdin] failed with non-blocking status code 101
  • PostToolUse:Edit [git-ai checkpoint claude --hook-input stdin] failed with non-blocking status code 101

Problem Description

When git-ai processes files containing multi-byte UTF-8 characters (Chinese, Japanese, Korean, emoji, etc.), it panics because it uses byte indices to slice strings directly, but the byte indices can fall in the middle of a multi-byte character.

Root Cause

Location: src/authorship/attribution_tracker.rs:1754

let content_slice = &full_content[std::cmp::max(line_start, attribution.start)
    ..std::cmp::min(line_end, attribution.end)];

The code uses byte indexing (line_start, line_end, attribution.start, attribution.end) to slice the string directly. In UTF-8:

  • ASCII characters = 1 byte
  • Most Chinese/Japanese/Korean characters = 3 bytes
  • Most emoji = 4 bytes per code point (some emoji are sequences of several code points)

When a byte index falls in the middle of a multi-byte character (e.g., byte 2630 inside the 3-byte character '选', which occupies bytes 2629..2632, end exclusive), Rust panics because string slicing must occur on valid character boundaries.
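The boundary rule can be checked in isolation. The following is a minimal standalone sketch, not git-ai code: the sample string and the `utf8_lens` helper are made up for illustration, and only standard-library calls are used.

```rust
// Hypothetical helper for illustration: UTF-8 encoded length of each character.
fn utf8_lens(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // Made-up sample: one ASCII letter, one CJK character, one emoji.
    let s = "a选😀";
    assert_eq!(utf8_lens(s), vec![1, 3, 4]);
    assert_eq!(s.len(), 8); // len() counts bytes, not characters

    // '选' occupies bytes 1..4 (end exclusive): 1 and 4 are boundaries, 2 and 3 are not.
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));
    assert!(!s.is_char_boundary(3));
    assert!(s.is_char_boundary(4));

    // `get` returns None for a range that is not on boundaries,
    // where `&s[2..4]` would panic.
    assert_eq!(s.get(1..4), Some("选"));
    assert_eq!(s.get(2..4), None);
}
```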

Reproduction Steps

  1. Create a file with Chinese, Japanese, or other multi-byte UTF-8 characters
  2. For example, a Vue file with Chinese comments:
    <template>
      <div class="add-person-container">
        <!-- 添加按钮(仅添加模式显示) -->
        <div class="add-btn-wrapper" v-if="mode === 'add'">
          <es-button type="text" icon="el-icon-plus" @click="addRow"> 添加个人 </es-button>
        </div>
      </div>
    </template>
    
  3. Use AI (Claude Code or Cursor) to edit the file
  4. git-ai checkpoint will panic when processing this file
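The same class of panic can be reproduced without git-ai or an AI editor. The sketch below mirrors only the shape of the slicing expression quoted in the report; the function names, the `Range` parameters, and the shortened sample line are assumptions for illustration, and how git-ai actually arrives at its offsets is not stated in the report.

```rust
use std::ops::Range;
use std::panic;

// Hypothetical stand-in for the reported expression: slice the overlap of a
// line range and an attribution range by direct byte indexing.
fn overlap_unchecked(content: &str, line: Range<usize>, attr: Range<usize>) -> &str {
    &content[line.start.max(attr.start)..line.end.min(attr.end)]
}

// Returns true if the unchecked slice panics for the given ranges.
fn panics_on(content: &'static str, line: Range<usize>, attr: Range<usize>) -> bool {
    panic::catch_unwind(move || overlap_unchecked(content, line, attr).len()).is_err()
}

fn main() {
    // "<!-- " is bytes 0..5, '添' is 5..8, '加' is 8..11, " -->" is 11..15.
    let content = "<!-- 添加 -->";

    // An attribution starting on a boundary slices cleanly.
    assert_eq!(overlap_unchecked(content, 0..15, 5..11), "添加");

    // An attribution starting at byte 6 lands inside '添' and panics.
    assert!(panics_on(content, 0..15, 6..11));
}
```

Running this prints a "byte index 6 is not a char boundary" message to stderr for the second case, which has the same form as the message in the report.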

Impact

  • All files with Chinese/Japanese/Korean text cause git-ai to crash
  • Vue, React, HTML files with non-English comments are unusable with git-ai
  • Any file containing emoji triggers the panic
  • Cannot use git-ai in CJK (China, Japan, Korea) markets
  • Affects all AI operations - CR, edits, refactoring on files with multi-byte characters

Expected Behavior

git-ai should handle multi-byte UTF-8 characters correctly:

  1. Byte indices should be validated or adjusted to character boundaries before slicing
  2. String operations should use character-safe methods
  3. No panic should occur when processing files with multi-byte characters

Affected Users

This affects any user who:

  • Works with Chinese, Japanese, Korean, or other CJK languages
  • Uses emoji in their code
  • Has non-ASCII characters in comments or strings
  • Works in international teams with multi-language codebases

Suggested Fix

Use Rust's character boundary checks to ensure safe string slicing:

// Option 1: Use is_char_boundary check
let start = std::cmp::max(line_start, attribution.start);
let end = std::cmp::min(line_end, attribution.end);

// Ensure indices are on character boundaries
if !full_content.is_char_boundary(start) || !full_content.is_char_boundary(end) {
    // Option A: Skip this attribution
    continue;
    // Option B: Adjust to character boundaries instead of skipping.
    // A `let start = ...` written inside this block would only shadow the
    // outer binding, so the adjustment has to replace the outer `let start`
    // (and `end` needs the same treatment, rounding down):
    // let start = full_content.char_indices()
    //     .map(|(idx, _)| idx)
    //     .find(|idx| *idx >= start)
    //     .unwrap_or(end);
}

let content_slice = &full_content[start..end];
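Option B can be written as a small standalone helper. This is a sketch under assumptions, not the fix that shipped: the name `clamp_to_char_boundaries` is hypothetical, and it shrinks the range inward (start rounds up, end rounds down), which drops any character the range only partially covers.

```rust
// Hypothetical helper: shrink `start..end` inward to the nearest character
// boundaries. Returns None when no valid range remains.
fn clamp_to_char_boundaries(s: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    let mut start = start.min(s.len());
    let mut end = end.min(s.len());
    // Move start forward; s.len() is always a boundary, so this terminates.
    while !s.is_char_boundary(start) {
        start += 1;
    }
    // Move end backward; 0 is always a boundary, so this terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    if start <= end { Some((start, end)) } else { None }
}

fn main() {
    // 'a' is byte 0, '选' is bytes 1..4, 'b' is byte 4.
    let s = "a选b";
    assert_eq!(clamp_to_char_boundaries(s, 1, 4), Some((1, 4)));
    // Start inside '选' rounds up to 4, leaving only "b".
    assert_eq!(clamp_to_char_boundaries(s, 2, 5), Some((4, 5)));
    // A range entirely inside one character collapses to nothing.
    assert_eq!(clamp_to_char_boundaries(s, 2, 3), None);
}
```

Whether shrinking or expanding the range is right for attribution tracking depends on how the offsets are meant to be interpreted, which the report does not say.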

Or use .get() for safer slicing:

// Option 2: Use get() method
let start = std::cmp::max(line_start, attribution.start);
let end = std::cmp::min(line_end, attribution.end);

let content_slice = match full_content.get(start..end) {
    Some(slice) => slice,
    None => continue, // Skip if not on character boundary
};
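To show Option 2 in a compilable context, the sketch below puts the `get` call inside a loop. The function name `slices_for_line` and the use of `Range<usize>` for attributions are assumptions; the real loop in `attribution_tracker.rs` is not shown in the report.

```rust
use std::ops::Range;

// Hypothetical stand-in for the loop around the slicing expression: collect
// the overlap of one line with each attribution range, skipping bad ranges.
fn slices_for_line<'a>(
    content: &'a str,
    line: Range<usize>,
    attrs: &[Range<usize>],
) -> Vec<&'a str> {
    let mut out = Vec::new();
    for attr in attrs {
        let start = line.start.max(attr.start);
        let end = line.end.min(attr.end);
        // `get` yields None for start > end, out-of-bounds, or mid-character indices.
        match content.get(start..end) {
            Some(slice) => out.push(slice),
            None => continue, // Skip instead of panicking
        }
    }
    out
}

fn main() {
    // "<!-- " is bytes 0..5, '添' is 5..8, '加' is 8..11, " -->" is 11..15.
    let content = "<!-- 添加 -->";
    let attrs = [0..5, 6..11, 8..15];
    // The middle range starts inside '添' and is skipped.
    assert_eq!(slices_for_line(content, 0..15, &attrs), vec!["<!-- ", "加 -->"]);
}
```

Skipping avoids the panic but silently drops the affected attribution, so it trades a crash for missing data on that range.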

Environment

  • git-ai version: 1.0.31
  • OS: Windows (but affects all platforms)
  • File types: Vue, HTML, JavaScript, TypeScript, any text file
  • Character encoding: UTF-8 with multi-byte characters

Additional Context

This is a blocker for using git-ai in many international markets and projects. The panic occurs during checkpoint creation, which means:

  • AI edits on files with multi-byte characters cannot be tracked
  • Users must remove all CJK text or emoji from files before using git-ai
  • Makes git-ai essentially unusable for entire regions (China, Japan, Korea, etc.)

The fix is straightforward and should be prioritized as critical.

kerem 2026-03-02 04:12:04 +03:00
Author
Owner

@svarlamov commented on GitHub (Jan 17, 2026):

@harvest-L Thanks for the report again! Fixed in the latest next release: https://github.com/acunniffe/git-ai/releases/tag/v1.0.36-next-8b2936f -- please try it by pulling the install.sh from that release to get it

Will be in the next stable release (day or two)
