[PR #515] Optimize rebase/squash rewrite performance and expand regression coverage #530

Open

opened 2026-03-02 04:13:53 +03:00 by kerem · 0 comments

kerem commented

2026-03-02 04:13:53 +03:00

Owner

Original Pull Request: https://github.com/git-ai-project/git-ai/pull/515

State: open
Merged: No

Summary

This PR substantially reduces git-ai overhead for rewrite workflows (rebase, merge --squash, and related post-commit paths), while tightening correctness with targeted regression tests.

The work is focused on first-principles performance improvements:

reduce total work done,
reduce number of git subprocesses,
avoid content hydration unless strictly necessary,
preserve existing authorship semantics.

Problem

In large repos and rewrite-heavy flows (Graphite stacks, large generated diffs, long commit ranges), rewrite-related operations were dominated by:

many small sequential git calls,
broad-path scans instead of staged/changed subsets,
expensive attribution/materialization on paths that cannot affect AI output,
avoidable per-commit/per-file note/path parsing.

This led to multi-minute stalls in real rebase/squash scenarios.

Design/Approach

1) Shrink worksets to AI-relevant + staged/changed paths

Squash prep now starts from actual staged files after merge --squash instead of broad branch diffs.
Path sets are narrowed to files touched by AI notes in relevant commit ranges.
Small staged-set optimization bypasses expensive source-side file extraction when safe.
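
A minimal sketch of the narrowing step described above, assuming the staged paths and the AI-touched paths are already available as plain strings. The function name `narrow_workset` and both inputs are hypothetical and are not taken from the PR:

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only the staged paths that AI notes also touched.
/// Order follows the staged list so downstream processing stays deterministic.
fn narrow_workset(staged: &[String], ai_touched: &HashSet<String>) -> Vec<String> {
    staged
        .iter()
        .filter(|path| ai_touched.contains(*path))
        .cloned()
        .collect()
}
```

Paths that are staged but never touched by AI notes drop out of the workset, as do AI-touched paths that are not part of the staged result, so later stages only hydrate files that can affect AI output.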

2) Batch git object reads

Added batched content readers using git cat-file --batch:
- Repository::get_files_content_at_commit
- optimized Repository::get_all_staged_files_content
Rebase/squash and traversal code now hydrate blobs in bulk rather than per-file calls.
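
For illustration, bulk hydration over `git cat-file --batch` can be sketched as below. This is not the PR's implementation of `Repository::get_files_content_at_commit`; `parse_batch` and `read_blobs` are hypothetical names. The sketch relies only on the documented batch output format: a `<oid> <type> <size>` header line, the raw contents, then a trailing newline, or `<name> missing` for unknown objects.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Parse `git cat-file --batch` output into (oid-or-name, contents) pairs.
/// Missing objects are reported as `None`.
fn parse_batch<R: BufRead>(mut out: R) -> io::Result<Vec<(String, Option<Vec<u8>>)>> {
    let mut records = Vec::new();
    let mut header = String::new();
    loop {
        header.clear();
        if out.read_line(&mut header)? == 0 {
            break; // end of stream
        }
        let line = header.trim_end_matches('\n');
        if let Some(name) = line.strip_suffix(" missing") {
            records.push((name.to_string(), None));
            continue;
        }
        // Header layout: "<oid> <type> <size>".
        let mut fields = line.split(' ');
        let oid = fields.next().unwrap_or_default().to_string();
        let size: usize = fields
            .nth(1)
            .and_then(|s| s.parse().ok())
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "bad batch header"))?;
        let mut body = vec![0u8; size];
        out.read_exact(&mut body)?; // fails on a truncated payload
        let mut newline = [0u8; 1];
        out.read_exact(&mut newline)?; // LF that follows the contents
        records.push((oid, Some(body)));
    }
    Ok(records)
}

/// Read many `<rev>:<path>` specs with a single git subprocess.
fn read_blobs(repo_dir: &str, specs: &[String]) -> io::Result<Vec<(String, Option<Vec<u8>>)>> {
    let mut child = Command::new("git")
        .args(["-C", repo_dir, "cat-file", "--batch"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdin = child.stdin.take().expect("stdin was piped");
    let input: String = specs.iter().map(|s| format!("{s}\n")).collect();
    // Write requests from a separate thread so a full stdout pipe cannot deadlock us.
    let writer = thread::spawn(move || stdin.write_all(input.as_bytes()));
    let stdout = child.stdout.take().expect("stdout was piped");
    let records = parse_batch(BufReader::new(stdout))?;
    writer.join().expect("writer thread panicked")?;
    child.wait()?;
    Ok(records)
}
```

Calling `read_blobs(".", &["HEAD:Cargo.toml".to_string(), "HEAD:src/main.rs".to_string()])` would spawn one process for both files, where a per-file approach would spawn one each.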

3) Fast-paths for no-op / metadata-only scenarios

Rebase note remap fast path for unchanged tracked blobs.
Squash prep direct-reuse path when target AI side is empty and staged content matches source AI state.
Metadata-only note remap retained where no AI-touched files need full rewrite.
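
The unchanged-blob fast path can be pictured as a comparison of blob OIDs before and after the rewrite. The sketch below is an assumption about the general shape of such a check, not the PR's code; `Remap`, `plan_remap`, and the path-to-OID maps are hypothetical:

```rust
use std::collections::HashMap;

/// Which strategy a rewritten commit needs (illustrative names only).
#[derive(Debug, PartialEq)]
enum Remap {
    /// Every AI-tracked blob is identical, so the note can be copied as-is.
    NoteOnly,
    /// These paths changed and need their attribution recomputed.
    Rewrite(Vec<String>),
}

/// `old` and `new` map path -> blob OID at the original and rewritten commit.
/// A path present on only one side counts as changed.
fn plan_remap(
    tracked: &[&str],
    old: &HashMap<String, String>,
    new: &HashMap<String, String>,
) -> Remap {
    let mut changed: Vec<String> = tracked
        .iter()
        .filter(|path| old.get(**path) != new.get(**path))
        .map(|path| path.to_string())
        .collect();
    changed.sort();
    if changed.is_empty() {
        Remap::NoteOnly
    } else {
        Remap::Rewrite(changed)
    }
}
```

Because blob OIDs are content hashes, equal OIDs mean equal file contents, so the check needs no blob reads at all.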

4) Avoid unnecessary heavy attribution/materialization

VirtualAttributions now supports line-only/initial-only loaders for post-commit paths:
- from_just_working_log_line_only
- from_initial_only_line_only
This avoids eager file-content + char-range hydration where line-level data is sufficient.
The blame path adds a machine-mode fast path that skips presentation-only human-author hydration when it is not needed.

5) Squash pre-commit checkpoint skip handshake

Squash prep stores the staged index tree OID as a marker.
Pre-commit checkpoint fast path skips when tree is unchanged and INITIAL already contains needed AI attribution.
Post-commit consumes INITIAL directly in this path.
Marker lifecycle is cleaned/reset correctly.
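
The handshake above reduces to a small predicate plus a cleanup step. The sketch below assumes the marker holds a single tree OID and that the caller already knows whether INITIAL covers the needed AI attribution; `SquashMarker`, `can_skip_checkpoint`, and `clear_marker` are hypothetical names. One way to obtain the current index tree OID is `git write-tree`, though the PR text does not say which mechanism is used.

```rust
/// Hypothetical marker persisted by squash prep.
#[derive(Debug, Clone, PartialEq)]
struct SquashMarker {
    /// Tree OID of the index right after `merge --squash` staged its result.
    staged_tree_oid: String,
}

/// Skip the pre-commit checkpoint only when a marker exists, the index tree
/// still matches it, and INITIAL already carries the needed AI attribution.
fn can_skip_checkpoint(
    marker: Option<&SquashMarker>,
    current_tree_oid: &str,
    initial_has_ai: bool,
) -> bool {
    match marker {
        Some(m) => initial_has_ai && m.staged_tree_oid == current_tree_oid,
        None => false,
    }
}

/// Consume the marker so it cannot leak into a later, unrelated commit.
fn clear_marker(slot: &mut Option<SquashMarker>) -> Option<SquashMarker> {
    slot.take()
}
```

Any edit to the index between squash prep and commit changes the tree OID, so a mismatch falls back to the regular checkpoint path.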

Correctness & Semantics

Key invariant preserved:

We do not change authorship semantics to chase speed.

Notable fix included:

Restored virtual-attribution file tracking semantics so files are not dropped just because blamed AI lines are empty (this previously caused a note-loss regression in certain rebase cases).

Test Coverage Added

New edge/regression tests cover:

squash pre-commit skip marker match/mismatch behavior,
squash marker persistence + cleanup lifecycle,
post-commit filtering predicate correctness for human/AI/override paths,
line-attribution compression merge boundaries,
VirtualAttributions initial-only and message-clear behavior,
authorship traversal parser robustness (quoted paths, truncated batch payload),
batched repository readers and index-tree OID behavior.
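
To make the "merge boundaries" case concrete, here is a hedged sketch of line-attribution compression. It assumes attributions are inclusive, 1-based line ranges tagged with an author and that ranges from different authors do not overlap; `LineRange` and `compress` are hypothetical and may not match the project's real types:

```rust
/// Hypothetical attribution record: inclusive, 1-based line range.
#[derive(Debug, Clone, PartialEq)]
struct LineRange {
    start: u32,
    end: u32,
    author: String,
}

/// Merge ranges that touch or overlap and share an author.
/// Ranges separated by a gap, or owned by different authors, stay distinct.
fn compress(mut ranges: Vec<LineRange>) -> Vec<LineRange> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<LineRange> = Vec::new();
    for r in ranges {
        if let Some(prev) = out.last_mut() {
            // Adjacent means the next range starts right after the previous ends.
            if prev.author == r.author && r.start <= prev.end + 1 {
                prev.end = prev.end.max(r.end);
                continue;
            }
        }
        out.push(r);
    }
    out
}
```

The boundary cases a test would pin down are adjacency (lines 1-3 followed by 4-6 merge), a one-line gap (8-9 followed by 11-12 do not merge), and an author change between touching ranges.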

Representative test additions:

src/commands/checkpoint.rs
src/git/repo_storage.rs
src/authorship/post_commit.rs
src/authorship/rebase_authorship.rs
src/authorship/virtual_attribution.rs
src/git/authorship_traversal.rs
src/git/repository.rs

Bench/Diagnostics

Added benchmark harness for heavy squash scenarios:
- scripts/benchmarks/git/benchmark_nasty_squashes.sh
Added/expanded perf logging around rewrite/post-commit hot paths to make bottlenecks explicit.

Validation

Ran full suite successfully:

cargo test -- --test-threads=1

No test failures.

Scope

Primary touched areas:

rewrite orchestration (rebase_authorship, squash prep/rewrite)
virtual attribution loading/conversion paths
blame/traversal batching optimizations
checkpoint/post-commit fast paths
repository/storage helpers for batch reads and squash markers
