[PR #515] Optimize rebase/squash rewrite performance and expand regression coverage #530

opened 2026-03-02 04:13:53 +03:00 by kerem · 0 comments

Original Pull Request: https://github.com/git-ai-project/git-ai/pull/515

State: open
Merged: No


Summary

This PR substantially reduces git-ai overhead for rewrite workflows (rebase, merge --squash, and related post-commit paths), while tightening correctness with targeted regression tests.

The work is focused on first-principles performance improvements:

  • reduce total work done,
  • reduce number of git subprocesses,
  • avoid content hydration unless strictly necessary,
  • preserve existing authorship semantics.

Problem

In large repos and rewrite-heavy flows (Graphite stacks, large generated diffs, long commit ranges), rewrite-related operations were dominated by:

  • many small sequential git calls,
  • broad-path scans instead of staged/changed subsets,
  • expensive attribution/materialization on paths that cannot affect AI output,
  • avoidable per-commit/per-file note/path parsing.

This led to multi-minute stalls in real rebase/squash scenarios.

Design/Approach

1) Shrink worksets to AI-relevant + staged/changed paths

  • Squash prep now starts from actual staged files after merge --squash instead of broad branch diffs.
  • Path sets are narrowed to files touched by AI notes in relevant commit ranges.
  • A small-staged-set optimization bypasses expensive source-side file extraction when it is safe to do so.
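
The narrowing described in this step can be pictured as a set operation. The following is a minimal sketch, assuming the workset is the intersection of the staged paths and the paths touched by AI notes; the function name `narrow_workset` and the use of plain string paths are illustrative assumptions, not the PR's actual code.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: keep only staged paths that also appear in AI notes
/// for the relevant commit range. Every other path is skipped because, under
/// this sketch's assumption, it cannot affect AI attribution output.
fn narrow_workset(
    staged: &BTreeSet<String>,
    ai_touched: &BTreeSet<String>,
) -> BTreeSet<String> {
    staged.intersection(ai_touched).cloned().collect()
}
```

With staged paths `a.rs`, `b.rs`, `gen.lock` and AI-touched paths `b.rs`, `c.rs`, only `b.rs` would need attribution work in this model.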

2) Batch git object reads

  • Added batched content readers using git cat-file --batch:
    • Repository::get_files_content_at_commit
    • optimized Repository::get_all_staged_files_content
  • Rebase/squash and traversal code now hydrate blobs in bulk rather than per-file calls.
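
To illustrate why batching helps: one `git cat-file --batch` process answers many object requests on a single stream, emitting `<oid> <type> <size>` followed by the contents and a newline for each object, or `<name> missing` when the object cannot be found. The parser below is a minimal sketch of reading that stream; the type `BatchEntry` and the function `parse_cat_file_batch` are hypothetical names, and the PR's real readers may be structured differently.

```rust
/// One record parsed from `git cat-file --batch` output (illustrative type).
#[derive(Debug, PartialEq)]
enum BatchEntry {
    Found { oid: String, kind: String, data: Vec<u8> },
    Missing { name: String },
}

/// Parse the stdout of `git cat-file --batch`.
/// Returns `None` if the payload is truncated or a header is malformed.
fn parse_cat_file_batch(mut buf: &[u8]) -> Option<Vec<BatchEntry>> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        // Each record starts with a header line terminated by LF.
        let nl = buf.iter().position(|&b| b == b'\n')?;
        let header = std::str::from_utf8(&buf[..nl]).ok()?;
        buf = &buf[nl + 1..];

        // Unknown objects are reported as "<name> missing" with no body.
        if let Some(name) = header.strip_suffix(" missing") {
            out.push(BatchEntry::Missing { name: name.to_string() });
            continue;
        }

        // Otherwise the header is "<oid> <type> <size>".
        let mut parts = header.split(' ');
        let oid = parts.next()?.to_string();
        let kind = parts.next()?.to_string();
        let size: usize = parts.next()?.parse().ok()?;

        // The body is exactly `size` bytes followed by one LF.
        if buf.len() <= size || buf[size] != b'\n' {
            return None; // truncated payload
        }
        out.push(BatchEntry::Found { oid, kind, data: buf[..size].to_vec() });
        buf = &buf[size + 1..];
    }
    Some(out)
}
```

Reading by declared size rather than splitting on newlines matters because blob contents may themselves contain newlines or arbitrary bytes.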

3) Fast-paths for no-op / metadata-only scenarios

  • Rebase note remap fast path for unchanged tracked blobs.
  • Squash prep direct-reuse path when target AI side is empty and staged content matches source AI state.
  • Metadata-only note remap retained where no AI-touched files need full rewrite.
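
The unchanged-blob fast path can be sketched as a comparison of blob OIDs before and after the rewrite. This is an assumption about the shape of the check, not the PR's code: the function name and the path-to-OID maps are hypothetical, and the sketch treats a path absent on both sides as unchanged.

```rust
use std::collections::HashMap;

/// Hypothetical check: if every tracked path points at the same blob OID in
/// the original and the rewritten commit, the note can be remapped to the new
/// commit without recomputing attribution.
fn can_remap_note_without_rewrite(
    tracked_paths: &[&str],
    old_blobs: &HashMap<String, String>,
    new_blobs: &HashMap<String, String>,
) -> bool {
    tracked_paths
        .iter()
        // A path missing from both maps compares equal (None == None).
        .all(|p| old_blobs.get(*p) == new_blobs.get(*p))
}
```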

4) Avoid unnecessary heavy attribution/materialization

  • VirtualAttributions now supports line-only/initial-only loaders for post-commit paths:
    • from_just_working_log_line_only
    • from_initial_only_line_only
  • This avoids eager file-content + char-range hydration where line-level data is sufficient.
  • The blame path adds a machine-mode fast path that skips presentation-only human-author hydration when it is not needed.
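
As a rough picture of what "line-level data is sufficient" means, the sketch below represents attribution purely as line ranges per author, with no file contents or character offsets. The `LineRange` type and `compress_lines` function are illustrative assumptions and do not reflect the actual `VirtualAttributions` layout.

```rust
/// Illustrative line-level attribution: an inclusive range of line numbers
/// attributed to a single author.
#[derive(Debug, Clone, PartialEq)]
struct LineRange {
    start: u32,
    end: u32,
    author: String,
}

/// Collapse sorted (line, author) pairs into ranges, merging a line into the
/// previous range only when it is adjacent and has the same author.
fn compress_lines(lines: &[(u32, &str)]) -> Vec<LineRange> {
    let mut out: Vec<LineRange> = Vec::new();
    for &(line, author) in lines {
        let extended = match out.last_mut() {
            Some(last) if last.author == author && last.end + 1 == line => {
                last.end = line;
                true
            }
            _ => false,
        };
        if !extended {
            out.push(LineRange { start: line, end: line, author: author.to_string() });
        }
    }
    out
}
```

Nothing here needs the file's bytes, which is the property that lets a line-only loader skip content hydration.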

5) Squash pre-commit checkpoint skip handshake

  • Squash prep stores a marker holding the tree OID of the staged index.
  • Pre-commit checkpoint fast path skips when tree is unchanged and INITIAL already contains needed AI attribution.
  • Post-commit consumes INITIAL directly in this path.
  • The marker is cleaned up and reset correctly across its lifecycle.
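
The handshake above amounts to a two-condition predicate. The following is a minimal sketch under stated assumptions: `SquashMarker`, its field, and `should_skip_pre_commit_checkpoint` are hypothetical names, and the caller is assumed to obtain the current index tree OID by some means (for example, `git write-tree` prints the tree OID for the current index).

```rust
/// Hypothetical marker written by squash prep.
struct SquashMarker {
    /// Tree OID of the staged index at the time squash prep ran.
    index_tree_oid: String,
}

/// Skip the pre-commit checkpoint only when a marker exists, the index tree
/// has not changed since squash prep, and INITIAL already carries the needed
/// AI attribution. Any mismatch falls back to the normal checkpoint.
fn should_skip_pre_commit_checkpoint(
    marker: Option<&SquashMarker>,
    current_index_tree_oid: &str,
    initial_has_ai_attribution: bool,
) -> bool {
    match marker {
        Some(m) => m.index_tree_oid == current_index_tree_oid && initial_has_ai_attribution,
        None => false,
    }
}
```

Keying the skip on the tree OID means that any change to the index after squash prep (an extra `git add`, a conflict fix) invalidates the fast path.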

Correctness & Semantics

Key invariant preserved:

  • We do not change authorship semantics to chase speed.

Notable fix included:

  • Restored virtual-attribution file tracking semantics so that files are not dropped merely because their blamed AI lines are empty (this previously caused a note-loss regression in certain rebase cases).

Test Coverage Added

New edge/regression tests cover:

  • squash pre-commit skip marker match/mismatch behavior,
  • squash marker persistence + cleanup lifecycle,
  • post-commit filtering predicate correctness for human/AI/override paths,
  • line-attribution compression merge boundaries,
  • VirtualAttributions initial-only and message-clear behavior,
  • authorship traversal parser robustness (quoted paths, truncated batch payload),
  • batched repository readers and index-tree OID behavior.

Representative test additions:

  • src/commands/checkpoint.rs
  • src/git/repo_storage.rs
  • src/authorship/post_commit.rs
  • src/authorship/rebase_authorship.rs
  • src/authorship/virtual_attribution.rs
  • src/git/authorship_traversal.rs
  • src/git/repository.rs

Bench/Diagnostics

  • Added benchmark harness for heavy squash scenarios:
    • scripts/benchmarks/git/benchmark_nasty_squashes.sh
  • Added/expanded perf logging around rewrite/post-commit hot paths to make bottlenecks explicit.

Validation

Ran full suite successfully:

  • cargo test -- --test-threads=1

No test failures.

Scope

Primary touched areas:

  • rewrite orchestration (rebase_authorship, squash prep/rewrite)
  • virtual attribution loading/conversion paths
  • blame/traversal batching optimizations
  • checkpoint/post-commit fast paths
  • repository/storage helpers for batch reads and squash markers
