[GH-ISSUE #202] Bug: Worktree loop becomes zombie when worktree directory is removed but registry entry persists #76

Closed
opened 2026-02-27 10:22:05 +03:00 by kerem · 0 comments
Owner

Originally created by @CoderMageFox on GitHub (Feb 26, 2026).
Original GitHub issue: https://github.com/mikeyobrien/ralph-orchestrator/issues/202

Summary

When a parallel loop's worktree directory is removed (by external process, git operation, or cleanup), the loop registry (loops.json) retains the entry and the ralph process continues running. This creates a "zombie loop" that appears as running in ralph loops list but cannot perform any work, blocking the parallel slot and confusing users into thinking parallel execution is active.

Environment

  • ralph version: latest (compiled from source, Rust binary)
  • OS: macOS (Darwin 24.3.0, arm64)
  • Git version: 2.x

Steps to Reproduce

  1. Start a primary loop: ralph run -b codex -p "Task A"
  2. Start a second loop (auto-spawns to worktree): ralph run -b claude -p "Task B"
  3. Observe worktree created at .worktrees/<loop-id>/
  4. The worktree directory gets removed (e.g., by another process, manual cleanup, or git worktree prune)
  5. Run ralph loops list

Expected Behavior

  • Ralph should detect that the worktree directory no longer exists
  • The loop should be marked as orphan or automatically cleaned up
  • ralph loops list should show accurate status

Actual Behavior

  • ralph loops list shows the loop as running
  • The ralph process (PID visible in loops.json) is still alive but operating on a non-existent directory
  • ralph loops stop <id> fails with "Cannot determine active loop - it may have already stopped"
  • The only way to clean up is manual kill <pid> + editing loops.json

Observed State

$ ralph loops list
Loops: running: 2

ID                   STATUS       MERGE    AGE      LOCATION             PROMPT
----------------------------------------------------------------------------------------
(primary)            running      -        -        (in-place)           Task A...
true-brook           running      -        -        true-brook           Task B...

$ ls -la .worktrees/true-brook
# Directory is empty (only . and ..)

$ git worktree list
/path/to/repo      56639be [fix/app-registry-review-fixes]
# No true-brook worktree listed

Root Cause Analysis

Four areas in the codebase lack worktree existence validation:

1. loop_registry.rs: LoopEntry::is_alive() (line ~129)

is_alive() currently only checks whether the PID is alive via kill(pid, 0), i.e. signal None in nix. It does not verify that the worktree directory still exists.

#[cfg(unix)]
pub fn is_alive(&self) -> bool {
    use nix::sys::signal::kill;
    use nix::unistd::Pid;
    kill(Pid::from_raw(self.pid as i32), None)
        .map(|_| true)
        .unwrap_or(false)
}

Suggested fix: Add worktree path existence check for entries that have worktree_path:

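#[cfg(unix)]
pub fn is_alive(&self) -> bool {
    use nix::sys::signal::kill;
    use nix::unistd::Pid;
    // Existing PID check: signal `None` (0) only probes for the process.
    let pid_alive = kill(Pid::from_raw(self.pid as i32), None).is_ok();
    if !pid_alive {
        return false;
    }
    // For worktree loops, also verify the worktree is still there.
    // A linked git worktree has a `.git` file at its root; checking for it
    // also catches the observed case where the directory exists but is empty.
    if let Some(ref wt_path) = self.worktree_path {
        return std::path::Path::new(wt_path).join(".git").exists();
    }
    true
}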

2. loop_runner.rs: No runtime worktree health check

The event loop in run_loop_impl does not verify the workspace directory exists between iterations. If the worktree is removed mid-execution, the loop continues running but all file operations silently fail or error.

Suggested fix: Add a workspace existence check at the start of each iteration:

// At the beginning of each iteration
if !loop_context.workspace().exists() {
    error!("Workspace directory no longer exists: {}", loop_context.workspace().display());
    // Deregister from registry and exit gracefully
    break TerminationReason::WorkspaceGone;
}
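
For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch of the per-iteration check. It is not the actual run_loop_impl: run_loop, the step closure, and the Completed variant are placeholders I made up for illustration, and only WorkspaceGone comes from the suggestion above. Note that a plain exists() check would still pass for the emptied directory shown under Observed State, so the real check may need to be stricter.

```rust
use std::path::Path;

/// Illustrative termination reasons. `WorkspaceGone` is the variant proposed
/// in this issue; `Completed` is a placeholder for the loop's normal exit.
#[derive(Debug, PartialEq, Eq)]
pub enum TerminationReason {
    Completed,
    WorkspaceGone,
}

/// Drive a loop over `workspace`, re-checking the directory before every
/// iteration. `step` stands in for one iteration of the real event loop and
/// returns `true` once the task is done.
pub fn run_loop<F>(workspace: &Path, mut step: F) -> TerminationReason
where
    F: FnMut() -> bool,
{
    loop {
        if !workspace.exists() {
            eprintln!(
                "Workspace directory no longer exists: {}",
                workspace.display()
            );
            // `break` with a value makes the reason the result of the `loop`.
            break TerminationReason::WorkspaceGone;
        }
        if step() {
            break TerminationReason::Completed;
        }
    }
}
```

The caller would then deregister the loop from the registry based on the returned reason, so the cleanup lives in one place instead of inside the loop body.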

3. loops.rs: list command doesn't detect orphans

The list subcommand displays status based solely on PID liveness. It should cross-reference with actual worktree existence and show orphan status when the directory is missing.
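
A rough sketch of how the cross-reference could look, assuming the list command already knows whether the PID is alive and has the worktree path at hand. DisplayStatus, classify, and worktree_is_usable are hypothetical names, not taken from loops.rs. The usability probe looks for the `.git` file that git places at the root of a linked worktree, so the emptied directory from Observed State is reported as orphan rather than running.

```rust
use std::path::Path;

/// Hypothetical display status for `ralph loops list`; the variant names are
/// illustrative and not taken from the ralph codebase.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DisplayStatus {
    Running,
    Orphan,
    Dead,
}

/// A linked git worktree keeps a `.git` file at its root, so a directory
/// that is missing or has been emptied is treated as unusable.
pub fn worktree_is_usable(dir: &Path) -> bool {
    dir.join(".git").exists()
}

/// Classify a registry entry from two independent observations: whether the
/// PID is alive and whether the worktree still looks usable.
/// `worktree` is `None` for the primary (in-place) loop.
pub fn classify(pid_alive: bool, worktree: Option<&Path>) -> DisplayStatus {
    if !pid_alive {
        return DisplayStatus::Dead;
    }
    match worktree {
        None => DisplayStatus::Running,
        Some(dir) if worktree_is_usable(dir) => DisplayStatus::Running,
        Some(_) => DisplayStatus::Orphan,
    }
}
```

Keeping the classification a pure function of (pid_alive, worktree) would make it easy to unit test without spawning real loops.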

4. loops.rs: stop command fails on zombie loops

ralph loops stop <id> fails because it can't determine the active loop context. It should still be able to kill the PID and clean up the registry entry even when the worktree is gone.
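
One possible shape for that fallback, as a sketch only: resolve the loop purely from the registry entry, signal the PID, then drop the entry, without touching the worktree at all. The in-memory Registry, StopError, stop_by_id, and the injected terminate callback are all assumptions for illustration; the real registry is loops.json and the real code would send the signal itself.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a registry entry. Only `pid` and `worktree_path`
/// are mentioned in this issue; the real struct likely has more fields.
#[derive(Debug, Clone)]
pub struct LoopEntry {
    pub pid: u32,
    pub worktree_path: Option<String>,
}

/// Hypothetical in-memory registry keyed by loop id.
pub type Registry = HashMap<String, LoopEntry>;

#[derive(Debug, PartialEq, Eq)]
pub enum StopError {
    UnknownLoop,
    SignalFailed(u32),
}

/// Stop a loop using only what the registry records. `terminate` is an
/// injected callback that signals the PID and returns `true` on success.
/// Nothing here reads the worktree, so a missing directory cannot block
/// the cleanup.
pub fn stop_by_id<K>(
    registry: &mut Registry,
    id: &str,
    mut terminate: K,
) -> Result<LoopEntry, StopError>
where
    K: FnMut(u32) -> bool,
{
    let pid = registry.get(id).ok_or(StopError::UnknownLoop)?.pid;
    // Keep the entry if the process could not be signalled, so the user
    // can retry instead of losing track of a live PID.
    if !terminate(pid) {
        return Err(StopError::SignalFailed(pid));
    }
    registry.remove(id).ok_or(StopError::UnknownLoop)
}
```

Injecting the terminate step keeps the sketch testable without killing real processes; in ralph itself it would presumably wrap the same nix kill call that is_alive() uses.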

Suggested Priority

Medium-High — This silently breaks parallel execution, which is a core v2 feature. Users see "2 loops running" but only 1 is actually working, with no indication of the problem.

Workaround

Manual cleanup:

kill <zombie_pid>
# Edit .ralph/loops.json to remove the stale entry
# Or: ralph loops prune (if the PID is already dead)
kerem closed this issue 2026-02-27 10:22:05 +03:00