[GH-ISSUE #187] [Bug]: default_publishes cascades to LOOP_COMPLETE when agent writes no events in worktree loops #71


Open

opened 2026-02-27 10:22:04 +03:00 by kerem · 0 comments


Originally created by @arjhun-personal on GitHub (Feb 25, 2026).
Original GitHub issue: https://github.com/mikeyobrien/ralph-orchestrator/issues/187

Bug Description

When parallel worktree loops are spawned, they complete in ~3 seconds (3 iterations) without the Claude agent performing any actual work. The default_publishes chain cascades through all hats to LOOP_COMPLETE because the agent writes zero events in each iteration.

Steps to Reproduce

1. Primary loop is running with a Coordinator hat
2. Coordinator spawns 3 parallel loops via ralph run -c preset.yml --autonomous -p "..."
3. Each spawned ralph run detects the lock, creates a worktree, and starts a fresh loop
4. The worktree loops complete in ~3 iterations (~3s each) with no code changes

Preset Configuration

event_loop:
  starting_event: "track.build.start"

hats:
  track_builder:
    triggers: ["track.build.start"]
    default_publishes: "track.build.done"

  security_reviewer:
    triggers: ["track.build.done"]
    default_publishes: "security_review.passed"

  track_reviewer:
    triggers: ["security_review.passed"]
    default_publishes: "LOOP_COMPLETE"

What Happens

Each iteration, the Claude agent runs but writes no events (never calls ralph emit). After each iteration, check_default_publishes in event_loop/mod.rs:1422 fires because process_events_from_jsonl returned Ok(false):

Iteration 1: track.build.start activates track_builder → agent writes nothing → default injects track.build.done
Iteration 2: track.build.done activates security_reviewer → agent writes nothing → default injects security_review.passed
Iteration 3: security_review.passed activates track_reviewer → agent writes nothing → default injects LOOP_COMPLETE → loop terminates

The loop completes having done zero work. The Coordinator then has to fall back to sequential execution.
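
To make the cascade concrete, here is a minimal standalone sketch of it. This is not the orchestrator's code: `Hat`, `run_silent_loop`, and `issue_preset` are hypothetical names invented for illustration, and the model assumes one hat per trigger and an agent that never emits anything.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified hat: a name plus its fallback event.
struct Hat {
    name: &'static str,
    default_publishes: &'static str,
}

/// Runs the loop with an agent that never emits an event and returns
/// the hats that were activated before LOOP_COMPLETE was reached.
fn run_silent_loop(
    hats: &HashMap<&'static str, Hat>,
    start: &'static str,
) -> Vec<&'static str> {
    let mut activated = Vec::new();
    let mut event = start;
    while event != "LOOP_COMPLETE" {
        // Find the hat triggered by the pending event; stop if none matches.
        let Some(hat) = hats.get(event) else { break };
        activated.push(hat.name);
        // The agent wrote nothing, so the default event is injected instead.
        event = hat.default_publishes;
    }
    activated
}

/// The preset from this issue, keyed by trigger event.
fn issue_preset() -> HashMap<&'static str, Hat> {
    HashMap::from([
        ("track.build.start", Hat { name: "track_builder", default_publishes: "track.build.done" }),
        ("track.build.done", Hat { name: "security_reviewer", default_publishes: "security_review.passed" }),
        ("security_review.passed", Hat { name: "track_reviewer", default_publishes: "LOOP_COMPLETE" }),
    ])
}
```

Calling `run_silent_loop(&issue_preset(), "track.build.start")` visits all three hats and terminates, even though the simulated agent did nothing in any of them.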

Root Cause Analysis

Event isolation is correct — worktree loops get fresh timestamped events files via run_loop_impl (loop_runner.rs:120-128). The events are NOT inherited from the parent. The ralph agent memory mem-1740470400-a1b2 claiming "event history inheritance" is an incorrect self-diagnosis.

The actual issue is the default_publishes fallback design. It treats agent silence as success. When the agent fails to emit events for ANY reason (malformed prompt, missing env, API error, confused context), default_publishes advances the state machine as if the work was done. Three silent iterations cascade the full hat chain to completion.

The code path in loop_runner.rs:1082-1089:

if !agent_wrote_events {
    let active_hats = event_loop.state().last_active_hat_ids.clone();
    for active_hat_id in &active_hats {
        event_loop.check_default_publishes(active_hat_id);
        if event_loop.has_pending_events() {
            break;
        }
    }
}

Why the Agent Wrote No Events

The exact cause of agent silence is unclear since worktree session logs were deleted during cleanup. Possible causes:

The -p inline prompt passed by the Coordinator was insufficient for the worktree context
The agent encountered errors (missing .env, can't build) and exited without emitting
The Claude API returned quickly (rate limit, error) and the executor treated it as success

Suggested Fixes

Option 1: Don't allow default_publishes on the completion promise hat. If a hat's default_publishes matches completion_promise, reject the configuration at startup. The final step should always require explicit agent action.
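
A rough sketch of what such a startup check could look like. `HatConfig` and `validate_defaults` are hypothetical names, not the orchestrator's actual config types, and the real validation hook may live somewhere else entirely.

```rust
/// Hypothetical, trimmed-down view of one hat's configuration.
struct HatConfig {
    name: String,
    default_publishes: Option<String>,
}

/// Rejects a configuration in which any hat's `default_publishes`
/// equals the completion promise; otherwise accepts it.
fn validate_defaults(hats: &[HatConfig], completion_promise: &str) -> Result<(), String> {
    for hat in hats {
        if hat.default_publishes.as_deref() == Some(completion_promise) {
            return Err(format!(
                "hat `{}`: default_publishes must not equal the completion promise `{}`",
                hat.name, completion_promise
            ));
        }
    }
    Ok(())
}
```

With the preset above, this would reject `track_reviewer` at startup, so the final step could only be reached by an explicit agent event.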

Option 2: Add minimum iteration duration guard. If an iteration completes in < N seconds (e.g., 10s), don't fire default_publishes. A real Claude API call takes much longer than 1-3 seconds.
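
As a sketch, the guard could be a single predicate evaluated before the fallback fires. `MIN_ITERATION` and `may_fire_default` are hypothetical names, and the 10-second value is only the example threshold from the option text, not a measured one.

```rust
use std::time::Duration;

/// Assumed threshold, taken from the "e.g., 10s" in Option 2.
const MIN_ITERATION: Duration = Duration::from_secs(10);

/// Returns true only when the agent was silent AND the iteration ran
/// long enough to plausibly contain real work.
fn may_fire_default(agent_wrote_events: bool, elapsed: Duration) -> bool {
    !agent_wrote_events && elapsed >= MIN_ITERATION
}
```

A silent ~3s iteration like the ones reported here would fail the check, so the caller would have to treat it as an error or retry rather than advance the hat chain.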

Option 3: Require agent tool usage before accepting defaults. Before firing default_publishes, verify the agent performed at least one meaningful tool call (file read/write, bash command). If the agent did nothing, treat it as a failure — not silent success.
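
One possible shape for this check, assuming the executor can report which tool calls the agent made during an iteration (whether it can is not confirmed here). `ToolCall`, `Outcome`, and `classify` are hypothetical names for illustration.

```rust
/// Hypothetical record of a tool call made by the agent in one iteration.
#[derive(Debug, PartialEq)]
enum ToolCall {
    ReadFile,
    WriteFile,
    Bash,
    Other,
}

/// What the loop should do after an iteration.
#[derive(Debug, PartialEq)]
enum Outcome {
    AgentEvents, // the agent emitted events itself
    FireDefault, // silent, but it did real work: accept the default
    Failed,      // silent and idle: surface an error
}

fn classify(agent_wrote_events: bool, calls: &[ToolCall]) -> Outcome {
    if agent_wrote_events {
        return Outcome::AgentEvents;
    }
    // Only file reads/writes and bash commands count as meaningful work.
    let did_work = calls
        .iter()
        .any(|c| matches!(c, ToolCall::ReadFile | ToolCall::WriteFile | ToolCall::Bash));
    if did_work { Outcome::FireDefault } else { Outcome::Failed }
}
```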

Option 4: Limit consecutive default_publishes cascading. If default_publishes fires N times in a row without any agent-written events, terminate with a diagnostic error rather than completing.
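
A minimal sketch of such a limit as a small counter that the loop runner would update once per iteration. `DefaultCascadeGuard` is a hypothetical type, and where it would be stored in the loop state is left open.

```rust
/// Counts consecutive iterations that were resolved by `default_publishes`.
struct DefaultCascadeGuard {
    limit: u32,
    consecutive: u32,
}

impl DefaultCascadeGuard {
    fn new(limit: u32) -> Self {
        Self { limit, consecutive: 0 }
    }

    /// Records one iteration. Returns Err once the fallback has fired
    /// `limit` times in a row with no agent-written events in between.
    fn record(&mut self, agent_wrote_events: bool) -> Result<(), String> {
        if agent_wrote_events {
            // Any real agent output resets the streak.
            self.consecutive = 0;
            return Ok(());
        }
        self.consecutive += 1;
        if self.consecutive >= self.limit {
            Err(format!(
                "default_publishes fired {} times in a row with no agent-written events",
                self.consecutive
            ))
        } else {
            Ok(())
        }
    }
}
```

With a limit of 3, the three silent iterations described in this issue would end in a diagnostic error on the third one instead of reaching LOOP_COMPLETE.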

Environment

ralph-orchestrator: built from arjhun-personal/ralph-orchestrator fork (up to date with upstream)
Backend: Claude (claude-code CLI)
OS: macOS Darwin 24.6.0
