[PR #115] Character-based tracking in checkpoints #263

Closed
opened 2026-03-02 04:13:07 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/git-ai-project/git-ai/pull/115

State: closed
Merged: Yes


We were running into some limitations of our previous line-based checkpointing approach, so we migrated to a new character-range-based approach using Google's diff-match-patch algorithm (used in many production collaborative text-editing systems).

Updating the way we store attribution in checkpoints gives us a much richer history of the edits made to a file, lets us filter out noise from formatters, and retains attribution for moved blocks of code.

  • [x] Rewrite checkpoint logic to use a new character-based range attribution tracking approach, built on Google's diff-match-patch algorithm (thanks to a robust Rust port)
  • [x] Formatters do not change attribution (at least with whitespace-only reformatting; more to come)
  • [x] Last non-whitespace edit to a line 'wins' the attribution for that line
  • [x] Initial support for tracking AI blocks through moves (within a single file for now) [see video]
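
To make the attribution rules above concrete, here is a minimal sketch in Python. It is not the git-ai implementation: Python's standard `difflib.SequenceMatcher` stands in for diff-match-patch, and the `(author, edit_no)` pair stored per character, along with the names `apply_edit` and `line_authors`, are hypothetical choices made for illustration only.

```python
from difflib import SequenceMatcher


def apply_edit(old_text, old_attrs, new_text, author, edit_no):
    """Carry per-character attribution from old_text over to new_text.

    old_attrs[i] is a hypothetical (author, edit_no) pair for old_text[i].
    Characters that survive the edit keep their pair; inserted or replaced
    characters are credited to `author` at `edit_no`.
    """
    # difflib is a stand-in here; the PR itself uses diff-match-patch.
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    new_attrs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            new_attrs.extend(old_attrs[i1:i2])
        elif tag in ("insert", "replace"):
            new_attrs.extend([(author, edit_no)] * (j2 - j1))
        # "delete": the removed characters simply drop out
    return new_attrs


def line_authors(text, attrs):
    """Per line, return the author of the most recent non-whitespace edit."""
    winners = []
    pos = 0
    for line in text.splitlines(keepends=True):
        best = None
        for offset, ch in enumerate(line):
            if ch.isspace():
                continue  # whitespace never decides who owns a line
            entry = attrs[pos + offset]
            if best is None or entry[1] > best[1]:
                best = entry
        winners.append(best[0] if best else None)
        pos += len(line)
    return winners


# Edit 0: the AI writes two lines.
v0 = "x=1\ny=2\n"
attrs0 = [("ai", 0)] * len(v0)

# Edit 1: a formatter only inserts spaces.
v1 = "x = 1\ny = 2\n"
attrs1 = apply_edit(v0, attrs0, v1, "formatter", 1)

# Edit 2: a human changes a value on the second line.
v2 = "x = 1\ny = 3\n"
attrs2 = apply_edit(v1, attrs1, v2, "human", 2)
```

In this sketch, `line_authors(v1, attrs1)` still credits both lines to `"ai"` because the formatter touched only whitespace, while `line_authors(v2, attrs2)` credits the second line to `"human"`.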

https://github.com/user-attachments/assets/27cf5f89-a357-4729-8924-b20f119ae924

Todo:

  • [ ] Human additions/deletions do not properly reflect moves (even in case of a move, the entire diff is counted as human for stats)
  • [x] Benchmark perf on large repos
  • [x] Document that authorship logs may reference prompt hashes from earlier authorship logs
  • [x] Test backwards compat with pre-existing working logs (how much do we need to guarantee?)

Performance impacts:

  • No substantial change to diff processing time. Move detection is currently O(n_attributions * n_new_diffs), so it is disabled in the merged version; it will be enabled after more testing and optimization.
  • Tracking cut/pastes (moves) requires computing an AI blame the first time a given file is checkpointed in the current working log. On small repos this is barely noticeable (tens of ms), but on the Chromium repo it adds nearly 2 seconds per file (sequential) with diffs.
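
As an illustration of where an O(n_attributions * n_new_diffs) cost can come from, the sketch below compares every newly inserted chunk against every previously attributed chunk that was removed. It is an assumption about the general shape of a naive pairwise scan, not code from git-ai; `detect_moves`, its arguments, and the `min_len` threshold are all hypothetical.

```python
def detect_moves(attributed_deletions, inserted_chunks, min_len=8):
    """Naive pairwise move detection (illustrative only).

    attributed_deletions: list of (author, text) pairs removed by the edit.
    inserted_chunks: list of text chunks added by the edit.
    Returns ({inserted_index: original_author}, number_of_comparisons).
    """
    def normalize(text):
        # Collapse whitespace so re-indented blocks still match.
        return " ".join(text.split())

    moves = {}
    comparisons = 0
    for idx, chunk in enumerate(inserted_chunks):
        key = normalize(chunk)
        for author, text in attributed_deletions:
            comparisons += 1  # one comparison per (insert, attribution) pair
            if idx not in moves and len(key) >= min_len and key == normalize(text):
                moves[idx] = author
    return moves, comparisons


deleted = [
    ("ai", "def helper():\n    return 42\n"),
    ("human", "import os\n"),
]
inserted = [
    "def helper():\n        return 42\n",  # same block, re-indented elsewhere
    "print('new code')\n",
    "# comment\n",
]
moves, comparisons = detect_moves(deleted, inserted)
```

Here `comparisons` equals `len(deleted) * len(inserted)`, which is the quadratic behavior described above. For exact matches, indexing the attributed chunks by their normalized text in a dictionary would bring this closer to linear time, though whether that fits the real matching rules is not stated in this PR.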
Reference
starred/git-ai#263