[GH-ISSUE #463] Range stats performance issue and attribution semantics question #171

Closed
opened 2026-03-02 04:12:32 +03:00 by kerem · 1 comment
Owner

Originally created by @txf0096 on GitHub (Feb 5, 2026).
Original GitHub issue: https://github.com/git-ai-project/git-ai/issues/463

Problem

The git-ai stats <start>..<end> command has severe performance and memory issues on commit ranges:

Performance metrics (28 commits, 19 files):

  • Time: 3.5 minutes
  • Memory: 10.8 GB total allocated (20M allocations), 663 MB peak
  • Result: OOM on production servers

Root Cause

VirtualAttributions::new_for_base_commit is called with blame_start_commit=None, causing git blame to traverse the entire repository history for each file, even though we only need attributions for the commit range.

The Fix

Limit git blame scope to the range by passing start_sha as blame_start_commit:

// Before (slow)
VirtualAttributions::new_for_base_commit(
    repo_clone,
    end_sha.to_string(),
    &changed_files,
    None,  // blame entire history
)

// After (fast)
VirtualAttributions::new_for_base_commit(
    repo_clone,
    end_sha.to_string(),
    &changed_files,
    Some(start_sha.to_string()),  // limit to range
)
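
To illustrate why the bound matters, here is a toy model of a blame walk over a linear history. It is only a sketch and not git-ai's actual blame code: `Commit`, `BlameResult`, `toy_blame`, and `sample_history` are hypothetical names, and real git blame works on diffs rather than line ids. It shows that an unbounded walk visits every commit until all lines are resolved, while a bounded walk stops at the range start and leaves older lines at the boundary.

```rust
/// One commit in a toy linear history: the ids of the lines it introduced.
/// (Hypothetical type for illustration only.)
struct Commit {
    sha: &'static str,
    added_lines: Vec<u32>,
}

/// Result of a toy blame walk.
struct BlameResult {
    /// (line id, sha of the commit blamed for it)
    resolved: Vec<(u32, &'static str)>,
    /// Lines left unresolved because the walk stopped at the boundary.
    at_boundary: Vec<u32>,
    /// Number of commits the walk had to inspect.
    commits_visited: usize,
}

/// Walk `history` (oldest first) backwards from the tip.
/// With `start = Some(sha)` the walk stops before inspecting that commit,
/// mirroring a `start..end` range that excludes `start` itself.
fn toy_blame(history: &[Commit], lines: &[u32], start: Option<&str>) -> BlameResult {
    let mut pending: Vec<u32> = lines.to_vec();
    let mut resolved = Vec::new();
    let mut commits_visited = 0;

    for commit in history.iter().rev() {
        if pending.is_empty() || start == Some(commit.sha) {
            break;
        }
        commits_visited += 1;
        // Resolve every pending line this commit introduced; keep the rest.
        pending.retain(|line| {
            if commit.added_lines.contains(line) {
                resolved.push((*line, commit.sha));
                false
            } else {
                true
            }
        });
    }

    BlameResult { resolved, at_boundary: pending, commits_visited }
}

/// Four commits A..D, each adding one line (assumed sample data).
fn sample_history() -> Vec<Commit> {
    vec![
        Commit { sha: "A", added_lines: vec![1] },
        Commit { sha: "B", added_lines: vec![2] },
        Commit { sha: "C", added_lines: vec![3] },
        Commit { sha: "D", added_lines: vec![4] },
    ]
}
```

With `start = None` the walk over `sample_history()` inspects all four commits; with `start = Some("B")` it inspects only `D` and `C` and leaves lines 1 and 2 at the boundary. In a ~2000-commit repository with a 28-commit range, that difference is repeated for every changed file.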

Performance improvement:

  • Time: 212s → 6s (35x faster)
  • Memory: 10.8 GB → 442 MB (96% reduction)
  • Peak: 663 MB → 48 MB (93% reduction)
  • OOM issue completely resolved
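
For anyone who wants to re-derive the ratios above, a minimal sketch of the arithmetic follows. The helper names `speedup` and `reduction_pct` are made up for this illustration, and the memory figure assumes 1 GB = 1000 MB.

```rust
/// How many times faster `after` is than `before` (same unit for both).
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Percentage reduction going from `before` to `after` (same unit for both).
fn reduction_pct(before: f64, after: f64) -> f64 {
    (1.0 - after / before) * 100.0
}
```

Plugging in the measurements: `speedup(212.0, 6.0)` is about 35.3, `reduction_pct(10_800.0, 442.0)` is about 95.9, and `reduction_pct(663.0, 48.0)` is about 92.8, which round to the 35x, 96%, and 93% quoted above.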

Attribution Semantics Question

However, this fix changes the statistics output:

For git diff start..end showing +1606 -34 lines:

| Version | human_additions | ai_additions | Total |
|---------|-----------------|--------------|-------|
| Original (blame full history) | 1200 | 406 | 1606 |
| Fixed (blame range only) | 1487 | 119 | 1606 |

Both sum to 1606 (the git diff additions), but attribute differently.

Question: Which behavior is correct?

The code comment at lines 296-297 says:

// This ensures we only count AI contributions that happened during these commits,
// not AI contributions from before the range

This suggests the fixed version (119 AI lines) is correct - it only counts AI code added within the range.

The original version (406 AI lines) appears to count:

  • AI lines added within the range ✓
  • AI lines that existed before the range but whose prompts were also used within the range ✗

Example scenario:

Commit A (before range): file.txt has "line1" (AI, prompt_1)
Commit B (range start): file.txt adds "line2" (human)
Commit C (range end): file.txt adds "line3" (human)

git diff B..C shows: +1 line (line3)

Original behavior:
- Blames entire history to C
- Sees line1 was added by prompt_1
- If prompt_1 was also used in range B..C (e.g., modified other files)
- Counts line1 as AI attribution
- But line1 is NOT in the git diff!

Fixed behavior:
- Blames only B..C
- Only sees changes within the range
- line3 attributed correctly as human
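
The two readings in this scenario can be written down as a small model. This is only a sketch of the behavior as I understand it from the outside, not the real git-ai code: `Author`, `Line`, `ai_lines_range_only`, `ai_lines_full_history`, and `issue_scenario` are hypothetical names, and the "prompt was used in range" set is assumed to be given.

```rust
use std::collections::HashSet;

/// Who wrote a line (hypothetical type for illustration only).
#[derive(Clone, Copy)]
enum Author {
    Human,
    Ai { prompt: &'static str },
}

/// A line present at the end commit.
struct Line {
    /// True if the line was added by a commit inside start..end.
    in_range: bool,
    author: Author,
}

/// "Fixed" reading: count only AI lines that were added inside the range.
fn ai_lines_range_only(lines: &[Line]) -> usize {
    lines
        .iter()
        .filter(|l| l.in_range && matches!(l.author, Author::Ai { .. }))
        .count()
}

/// "Original" reading as described above: count any AI line present at the
/// end commit whose prompt was also used somewhere inside the range.
fn ai_lines_full_history(lines: &[Line], prompts_used_in_range: &HashSet<&str>) -> usize {
    lines
        .iter()
        .filter(|l| match l.author {
            Author::Ai { prompt } => prompts_used_in_range.contains(prompt),
            Author::Human => false,
        })
        .count()
}

/// file.txt at commit C in the scenario above, for the range B..C.
fn issue_scenario() -> Vec<Line> {
    vec![
        // line1: AI, prompt_1, added in commit A (before the range)
        Line { in_range: false, author: Author::Ai { prompt: "prompt_1" } },
        // line2: human, added in commit B (the range start, excluded from B..C)
        Line { in_range: false, author: Author::Human },
        // line3: human, added in commit C (inside the range)
        Line { in_range: true, author: Author::Human },
    ]
}
```

If `prompt_1` is in the set of prompts used within B..C, `ai_lines_full_history` counts line1 as one AI line even though it is not part of the diff, while `ai_lines_range_only` reports zero AI lines for the same input.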

Questions for maintainers:

  1. Is the original behavior (blame full history) intentional?
    • Should range stats include AI attributions from before the range?

  2. What does "range stats" mean semantically?
    • Attribution of the net changes shown in git diff start..end? (fixed version)
    • Attribution of all code existing at end but filtered by prompt usage? (original version)

  3. Is the performance issue acceptable?
    • Blaming full history for large repos is very expensive
    • Should we always limit blame scope for ranges?

Environment

  • Repository: Internal fork of git-ai with ~2000 commits
  • Range: 28 commits, 19 files changed
  • OS: Both macOS (development) and Linux (production servers)

Proposed Solution

If the fixed behavior (119 AI lines) is correct, the fix is simple and provides massive performance improvements. If the original behavior is intentional, we need to optimize it differently.

Please clarify the expected semantics so we can apply the appropriate fix.

kerem 2026-03-02 04:12:32 +03:00
Author
Owner

@svarlamov commented on GitHub (Feb 7, 2026):

This should be fixed in the next release: https://github.com/git-ai-project/git-ai/pull/471

Thanks for the report and let us know if this helps!
