[GH-ISSUE #187] [Bug]: default_publishes cascades to LOOP_COMPLETE when agent writes no events in worktree loops #71

Open
opened 2026-02-27 10:22:04 +03:00 by kerem · 0 comments

Originally created by @arjhun-personal on GitHub (Feb 25, 2026).
Original GitHub issue: https://github.com/mikeyobrien/ralph-orchestrator/issues/187

Bug Description

When parallel worktree loops are spawned, they complete in ~3 seconds (3 iterations) without the Claude agent performing any actual work. The default_publishes chain cascades through all hats to LOOP_COMPLETE because the agent writes zero events in each iteration.

Steps to Reproduce

  1. Primary loop is running with a Coordinator hat
  2. Coordinator spawns 3 parallel loops via ralph run -c preset.yml --autonomous -p "..."
  3. Each spawned ralph run detects the lock, creates a worktree, and starts a fresh loop
  4. The worktree loops complete in ~3 iterations (~3s each) with no code changes

Preset Configuration

event_loop:
  starting_event: "track.build.start"

hats:
  track_builder:
    triggers: ["track.build.start"]
    default_publishes: "track.build.done"

  security_reviewer:
    triggers: ["track.build.done"]
    default_publishes: "security_review.passed"

  track_reviewer:
    triggers: ["security_review.passed"]
    default_publishes: "LOOP_COMPLETE"

What Happens

Each iteration, the Claude agent runs but writes no events (never calls ralph emit). After each iteration, check_default_publishes in event_loop/mod.rs:1422 fires because process_events_from_jsonl returned Ok(false):

  • Iteration 1: track.build.start activates track_builder → agent writes nothing → default injects track.build.done
  • Iteration 2: track.build.done activates security_reviewer → agent writes nothing → default injects security_review.passed
  • Iteration 3: security_review.passed activates track_reviewer → agent writes nothing → default injects LOOP_COMPLETE → loop terminates

The loop completes having done zero work. The Coordinator then has to fall back to sequential execution.
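
The cascade above can be reproduced in isolation. The following is a minimal sketch, not the orchestrator's code: `Hat`, `issue_preset`, and `simulate_silent_agent` are hypothetical names invented for illustration, and the model assumes each hat has exactly one trigger and that the agent never emits anything.

```rust
/// Hypothetical, simplified hat: one trigger and one default event.
struct Hat {
    name: &'static str,
    trigger: &'static str,
    default_publishes: &'static str,
}

/// The three hats from the preset in this issue.
fn issue_preset() -> Vec<Hat> {
    vec![
        Hat { name: "track_builder", trigger: "track.build.start", default_publishes: "track.build.done" },
        Hat { name: "security_reviewer", trigger: "track.build.done", default_publishes: "security_review.passed" },
        Hat { name: "track_reviewer", trigger: "security_review.passed", default_publishes: "LOOP_COMPLETE" },
    ]
}

/// Models an agent that never emits: every iteration falls back to the
/// active hat's default event. Returns (hat, injected event) per iteration.
fn simulate_silent_agent(
    hats: &[Hat],
    starting_event: &str,
    completion: &str,
) -> Vec<(&'static str, &'static str)> {
    let mut trace = Vec::new();
    let mut pending = starting_event;
    // Cap iterations at the hat count so a cyclic preset cannot spin forever.
    for _ in 0..hats.len() {
        let Some(hat) = hats.iter().find(|h| h.trigger == pending) else {
            break; // no hat listens for this event
        };
        pending = hat.default_publishes;
        trace.push((hat.name, hat.default_publishes));
        if pending == completion {
            break; // the loop "completes" with zero agent work
        }
    }
    trace
}
```

Calling `simulate_silent_agent(&issue_preset(), "track.build.start", "LOOP_COMPLETE")` yields three entries, the last being `("track_reviewer", "LOOP_COMPLETE")`, which matches the three iterations listed above.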

Root Cause Analysis

Event isolation is correct — worktree loops get fresh timestamped events files via run_loop_impl (loop_runner.rs:120-128). The events are NOT inherited from the parent. The ralph agent memory mem-1740470400-a1b2 claiming "event history inheritance" is an incorrect self-diagnosis.

The actual issue is the default_publishes fallback design. It treats agent silence as success. When the agent fails to emit events for ANY reason (malformed prompt, missing env, API error, confused context), default_publishes advances the state machine as if the work was done. Three silent iterations cascade the full hat chain to completion.

The code path in loop_runner.rs:1082-1089:

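// Only fall back to defaults when the agent emitted nothing this iteration.
if !agent_wrote_events {
    let active_hats = event_loop.state().last_active_hat_ids.clone();
    for active_hat_id in &active_hats {
        // Inject this hat's default_publishes event, if it has one.
        event_loop.check_default_publishes(active_hat_id);
        // Stop at the first hat whose default produced a pending event.
        if event_loop.has_pending_events() {
            break;
        }
    }
}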

Why the Agent Wrote No Events

The exact cause of agent silence is unclear since worktree session logs were deleted during cleanup. Possible causes:

  • The -p inline prompt passed by the Coordinator was insufficient for the worktree context
  • The agent encountered errors (missing .env, can't build) and exited without emitting
  • The Claude API returned quickly (rate limit, error) and the executor treated the response as success

Suggested Fixes

Option 1: Don't allow default_publishes on the completion promise hat. If a hat's default_publishes matches completion_promise, reject the configuration at startup. The final step should always require explicit agent action.
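
A startup check along these lines might look like the sketch below. `HatConfig` and `hats_defaulting_to_completion` are hypothetical names, not the project's actual config types, and the sketch assumes the completion promise is a plain string such as "LOOP_COMPLETE".

```rust
use std::collections::BTreeMap;

/// Hypothetical, pared-down view of one hat's config for this sketch.
struct HatConfig {
    default_publishes: Option<String>,
}

/// Returns the names of hats whose default event equals the completion
/// promise. A non-empty result would mean "reject the preset at startup".
fn hats_defaulting_to_completion(
    hats: &BTreeMap<String, HatConfig>,
    completion_promise: &str,
) -> Vec<String> {
    hats.iter()
        .filter(|(_, hat)| hat.default_publishes.as_deref() == Some(completion_promise))
        .map(|(name, _)| name.clone())
        .collect()
}
```

Applied to the preset in this issue, the check would flag `track_reviewer` and leave the other two hats alone.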

Option 2: Add a minimum iteration duration guard. If an iteration completes in less than N seconds (e.g., 10s), don't fire default_publishes. A real Claude API call typically takes much longer than 1-3 seconds.
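
As a rough illustration of such a guard, the decision could be isolated in a small pure function. `DefaultDecision` and `decide_default` are hypothetical names, and the threshold is a parameter rather than a recommended value.

```rust
use std::time::Duration;

/// Outcome of the post-iteration check in this sketch.
#[derive(Debug, PartialEq)]
enum DefaultDecision {
    /// The agent emitted events, so no default is needed.
    AgentEmitted,
    /// Silent, but long enough to plausibly be real work: inject the default.
    FireDefault,
    /// Silent and suspiciously fast: withhold the default.
    TooFast,
}

/// Decides what to do after an iteration, given its wall-clock duration.
fn decide_default(
    agent_wrote_events: bool,
    elapsed: Duration,
    min_duration: Duration,
) -> DefaultDecision {
    if agent_wrote_events {
        DefaultDecision::AgentEmitted
    } else if elapsed < min_duration {
        DefaultDecision::TooFast
    } else {
        DefaultDecision::FireDefault
    }
}
```

With a 10-second threshold, a silent 3-second iteration like the ones reported here would return `TooFast` instead of advancing the hat chain.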

Option 3: Require agent tool usage before accepting defaults. Before firing default_publishes, verify the agent performed at least one meaningful tool call (file read/write, bash command). If the agent did nothing, treat it as a failure — not silent success.
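
One possible shape for this check is sketched below. `IterationActivity`, `Verdict`, and `judge_iteration` are hypothetical; the issue does not say how, or whether, the orchestrator counts tool calls today, so the counters here are assumed inputs.

```rust
/// Hypothetical per-iteration activity counters (assumed to be collected
/// elsewhere, e.g. from the agent's session output).
#[derive(Debug, Default)]
struct IterationActivity {
    events_written: u32,
    file_reads: u32,
    file_writes: u32,
    shell_commands: u32,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// The agent emitted events itself; defaults are irrelevant.
    UseAgentEvents,
    /// No events, but real tool activity: the default may fire.
    AcceptDefault,
    /// No events and no tool activity: treat the iteration as failed.
    FailIteration,
}

/// Classifies one iteration from its activity counters.
fn judge_iteration(activity: &IterationActivity) -> Verdict {
    let tool_calls = activity.file_reads + activity.file_writes + activity.shell_commands;
    if activity.events_written > 0 {
        Verdict::UseAgentEvents
    } else if tool_calls > 0 {
        Verdict::AcceptDefault
    } else {
        Verdict::FailIteration
    }
}
```

An iteration with all counters at zero, which is what the worktree loops in this report appear to have produced, maps to `FailIteration`.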

Option 4: Limit consecutive default_publishes cascading. If default_publishes fires N times in a row without any agent-written events, terminate with a diagnostic error rather than completing.
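
A streak counter is enough to sketch this option. `CascadeGuard` and its methods are hypothetical names; the sketch assumes the loop can call one method when a default fires and another when the agent writes events.

```rust
/// Tracks how many defaults fired back-to-back without agent events.
struct CascadeGuard {
    limit: u32,
    consecutive_defaults: u32,
}

impl CascadeGuard {
    fn new(limit: u32) -> Self {
        CascadeGuard { limit, consecutive_defaults: 0 }
    }

    /// Call when the agent wrote at least one event: the streak resets.
    fn record_agent_events(&mut self) {
        self.consecutive_defaults = 0;
    }

    /// Call when a default fired. Returns an error once the streak
    /// reaches `limit`, so the caller can stop with a diagnostic.
    fn record_default(&mut self, hat: &str) -> Result<(), String> {
        self.consecutive_defaults += 1;
        if self.consecutive_defaults >= self.limit {
            Err(format!(
                "{} consecutive default_publishes without agent events (last hat: {})",
                self.consecutive_defaults, hat
            ))
        } else {
            Ok(())
        }
    }
}
```

With `limit` set to 3, the third silent hat in this issue's chain (`track_reviewer`) would produce an error instead of LOOP_COMPLETE, while any iteration with agent-written events resets the streak.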

Environment

  • ralph-orchestrator: built from arjhun-personal/ralph-orchestrator fork (up to date with upstream)
  • Backend: Claude (claude-code CLI)
  • OS: macOS Darwin 24.6.0