[PR #203] [MERGED] fix: detect and clean up zombie worktree loops (#202) #197

Closed
opened 2026-02-27 10:22:40 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/mikeyobrien/ralph-orchestrator/pull/203
Author: @CoderMageFox
Created: 2/26/2026
Status: Merged
Merged: 2/27/2026
Merged by: @mikeyobrien

Base: main ← Head: fix/zombie-worktree-loops


📝 Commits (9)

  • 9f58987 fix: detect and clean up zombie worktree loops (#202)
  • c0d2ee4 style: cargo fmt
  • fa5ff4b fix: keep zombie loops discoverable for cleanup
  • 26bb147 fix: resolve CI breakages in token accounting updates
  • 5574715 Stabilize web preflight version checks against transient exec errors
  • 0a77a9f Drain trailing PTY output after child exit to avoid race flakes
  • 882ea78 fix(loops): run fallback cleanup when worktree path is missing
  • 684493f fix(loops): keep orphan discoverable if stop signal fails
  • c9fcf6f fix(loops): harden orphan cleanup fallback paths

📊 Changes

14 files changed (+582 additions, -32 deletions)


📝 crates/ralph-adapters/src/acp_executor.rs (+9 -0)
📝 crates/ralph-adapters/src/pty_executor.rs (+97 -0)
📝 crates/ralph-api/src/loop_domain.rs (+2 -0)
📝 crates/ralph-bench/src/main.rs (+1 -0)
📝 crates/ralph-cli/src/display.rs (+1 -0)
📝 crates/ralph-cli/src/loop_runner.rs (+7 -0)
📝 crates/ralph-cli/src/loops.rs (+303 -7)
📝 crates/ralph-cli/src/main.rs (+3 -1)
📝 crates/ralph-cli/src/web.rs (+36 -16)
📝 crates/ralph-cli/tests/integration_clean.rs (+1 -0)
📝 crates/ralph-core/src/event_loop/mod.rs (+11 -1)
📝 crates/ralph-core/src/loop_registry.rs (+108 -5)
📝 crates/ralph-core/src/summary_writer.rs (+1 -0)
📝 crates/ralph-core/tests/smoke_runner.rs (+2 -2)

📄 Description

Problem

When a parallel loop's worktree directory is removed externally, the loop process keeps running and its registry entry stays marked as running. The result is a zombie loop that occupies a parallel slot and can't be stopped via ralph loops stop. The only workaround is to kill the process manually and edit loops.json by hand.

Solution

Adds zombie detection at multiple layers so that zombie loops are cleaned up automatically:

  • is_alive() now checks that the worktree directory exists in addition to checking the PID. This is the highest-leverage fix, since with_lock(), clean_stale(), and list() all call it
  • is_pid_alive() added for raw PID checks, for callers that need to send signals to orphan processes
  • WorkspaceGone termination reason added so that running loops self-exit when their workspace disappears
  • ralph loops list shows orphan status when the PID is alive but the worktree is gone
  • ralph loops stop falls back to the registry PID + signal when the worktree is missing
  • ralph loops discard gracefully handles already-removed worktrees (no more NotFound errors)
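The split between the two liveness checks can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the LoopEntry struct, its fields, the LoopStatus enum, and the classify() and pid_exists() helpers are hypothetical names, and the /proc lookup is a Linux-only stand-in for whatever PID probe the registry actually uses.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical registry entry; the real struct and field names may differ.
#[derive(Debug, Clone)]
pub struct LoopEntry {
    pub pid: u32,
    /// Assumed to be `None` for loops that run without a dedicated worktree.
    pub worktree_path: Option<PathBuf>,
}

/// Hypothetical status shown by a listing command.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoopStatus {
    Running,
    Orphan,
    Dead,
}

/// Linux-only stand-in for a PID probe (assumption: checks /proc/<pid>).
pub fn pid_exists(pid: u32) -> bool {
    Path::new("/proc").join(pid.to_string()).exists()
}

/// Pure classification: a live PID without its worktree is an orphan.
pub fn classify(pid_alive: bool, worktree_present: bool) -> LoopStatus {
    match (pid_alive, worktree_present) {
        (true, true) => LoopStatus::Running,
        (true, false) => LoopStatus::Orphan,
        (false, _) => LoopStatus::Dead,
    }
}

impl LoopEntry {
    /// Raw PID check, useful when a signal still has to be sent.
    pub fn is_pid_alive(&self) -> bool {
        pid_exists(self.pid)
    }

    /// Entries without a worktree path are treated as present.
    pub fn worktree_present(&self) -> bool {
        self.worktree_path.as_deref().map_or(true, |p| p.is_dir())
    }

    /// Alive only if the process exists and its worktree is still there.
    pub fn is_alive(&self) -> bool {
        self.is_pid_alive() && self.worktree_present()
    }

    pub fn status(&self) -> LoopStatus {
        classify(self.is_pid_alive(), self.worktree_present())
    }
}
```

Under these assumptions, an entry whose worktree has been deleted is reported as not alive by is_alive() even while its process keeps running, and is_pid_alive() remains available so that a stop command can still find and signal that process.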

Files Changed

| File | Change |
|------|--------|
| crates/ralph-core/src/loop_registry.rs | is_alive() worktree check + is_pid_alive() + 2 new tests |
| crates/ralph-core/src/event_loop/mod.rs | WorkspaceGone variant + workspace check in check_termination() |
| crates/ralph-core/src/summary_writer.rs | status_text() match arm |
| crates/ralph-cli/src/loops.rs | list, stop, discard fixes |
| crates/ralph-cli/src/display.rs | WorkspaceGone display match arm |
| crates/ralph-cli/src/loop_runner.rs | WorkspaceGone in history + merge queue match arms |
| crates/ralph-bench/src/main.rs | WorkspaceGone in bench format match arm |
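Several of the files above change by a single match arm, which is what one would expect if the existing match expressions on the termination reason are exhaustive: adding a variant makes each of them fail to compile until the new case is handled. A minimal sketch of that pattern follows. Only the names WorkspaceGone, check_termination(), and status_text() come from the PR description; the LoopState struct, the MaxIterations variant, the signatures, and the status strings are assumptions for illustration.

```rust
use std::path::PathBuf;

/// Sketch of a termination-reason enum. `MaxIterations` is a hypothetical
/// stand-in for whatever variants existed before this change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TerminationReason {
    MaxIterations,
    WorkspaceGone,
}

/// Hypothetical loop state; the real event loop tracks far more than this.
pub struct LoopState {
    pub iteration: u32,
    pub max_iterations: u32,
    pub workspace: PathBuf,
}

/// Checks the workspace first, so a loop whose directory has been removed
/// exits on its own instead of lingering.
pub fn check_termination(state: &LoopState) -> Option<TerminationReason> {
    if !state.workspace.is_dir() {
        return Some(TerminationReason::WorkspaceGone);
    }
    if state.iteration >= state.max_iterations {
        return Some(TerminationReason::MaxIterations);
    }
    None
}

/// Exhaustive match with no wildcard arm: a new variant forces an update
/// here. The strings are illustrative only.
pub fn status_text(reason: TerminationReason) -> &'static str {
    match reason {
        TerminationReason::MaxIterations => "max iterations reached",
        TerminationReason::WorkspaceGone => "workspace gone",
    }
}
```

In this sketch the workspace check runs before any other condition, so a missing directory wins over other reasons to stop. Whether the real check_termination() uses the same ordering is not stated in the description.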

Testing

  • 684 ralph-core lib tests pass
  • 245 ralph-cli tests pass (including updated existing test for worktree path handling)
  • 2 new tests: test_zombie_worktree_detected_as_dead, test_no_worktree_entry_unaffected

Closes #202


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
