[GH-ISSUE #86] Hat system event routing bugs cause stalled loops and incorrect behavior #31

Closed

opened 2026-02-27 10:21:51 +03:00 by kerem · 1 comment

Originally created by @memyselfandm on GitHub (Jan 21, 2026).
Original GitHub issue: https://github.com/mikeyobrien/ralph-orchestrator/issues/86

Summary

The multi-hat event routing system has three related bugs that cause loops to stall, terminate prematurely, or exhibit incorrect behavior. These issues were discovered while using the spec-driven preset.

Bugs

Bug 1: `starting_event` instruction shown in all iterations

Location: crates/ralph-core/src/hatless_ralph.rs:239-244

The starting_event instruction (e.g., "After coordination, publish spec.start to start the workflow") is unconditionally included in every prompt when configured, even when a different hat should be active.

This causes Claude to sometimes publish spec.start instead of the correct event for the active hat (e.g., spec.approved for Spec Critic), creating self-triggering loops.

Evidence from captured prompt:

## PENDING EVENTS
Event: spec.ready - Phase 4 Chat Service & UI spec ready...

## HATS
**After coordination, publish `spec.start` to start the workflow.**  ← CONFLICTING

### 🔎 Spec Critic Instructions
If solid: publish spec.approved  ← CORRECT INSTRUCTION

Bug 2: Orphan events are dropped, not routed to Ralph

Location: crates/ralph-core/src/event_loop.rs:845-862

When Claude publishes an event that no custom hat subscribes to, the code logs it as an orphan but does NOT publish it to the EventBus:

} else {
    // Orphaned event - Ralph will handle it
    debug!(topic = %event.topic, "Event has no subscriber...");
    has_orphans = true;  // But event is NOT published!
}

The comment says "Ralph will handle it" but the event is silently dropped.

Bug 3: No validation of published event topics

Location: none (no validation code exists)

Hats define a publishes list in their config, but there's no validation that Claude only publishes allowed topics. Claude can emit any event topic, causing:

- Events routing to wrong hats
- Self-triggering loops
- Invented topics that nobody handles

Event Log Evidence

{"topic":"spec.approved","ts":"2026-01-21T03:36:36.222014+00:00"}
{"topic":"spec.approved","ts":"2026-01-21T03:38:06.428883+00:00"}  // duplicate
{"topic":"spec.approved","ts":"2026-01-21T03:39:33.729276+00:00"}  // duplicate
{"topic":"spec.approved","ts":"2026-01-21T03:41:15.126800+00:00"}  // duplicate
{"topic":"implementation.start","ts":"2026-01-21T03:42:23.402977+00:00"}  // INVENTED
{"topic":"implementation.trigger","ts":"2026-01-21T03:46:52.626981+00:00"}  // INVENTED
{"topic":"loop.terminate","ts":"2026-01-21T03:47:21.025854+00:00"}  // PREMATURE

The Implementer should trigger on spec.approved and publish implementation.done. Instead:

1. It published invented events (implementation.start, implementation.trigger)
2. Those events were dropped as orphans (Bug 2)
3. The loop terminated because no valid events remained

Proposed Fixes

Bug 1

Include the starting_event instruction only when is_fresh_start() returns true or the pending event is task.start.
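A minimal sketch of that gating logic follows. It is not the code in hatless_ralph.rs: `PromptContext`, its fields, and `starting_event_instruction` are hypothetical names, and the `fresh_start` flag stands in for whatever `is_fresh_start()` actually checks.

```rust
/// Hypothetical, simplified view of the state a prompt builder would need.
struct PromptContext<'a> {
    /// Stands in for the result of `is_fresh_start()` (assumption).
    fresh_start: bool,
    /// Topics of the events pending for this iteration.
    pending_topics: &'a [&'a str],
    /// The configured `starting_event`, if any (e.g. "spec.start").
    starting_event: Option<&'a str>,
}

/// Returns the starting-event instruction only when the loop is actually
/// starting; returns `None` on iterations where another hat's event is pending.
fn starting_event_instruction(ctx: &PromptContext<'_>) -> Option<String> {
    // No instruction at all when no starting event is configured.
    let topic = ctx.starting_event?;
    let loop_is_starting =
        ctx.fresh_start || ctx.pending_topics.iter().any(|t| *t == "task.start");
    if loop_is_starting {
        Some(format!(
            "**After coordination, publish `{topic}` to start the workflow.**"
        ))
    } else {
        None
    }
}
```

With `spec.ready` pending and `fresh_start` false, as in the captured prompt above, this sketch returns `None`, so the Spec Critic instructions would be the only publish instruction in the prompt.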

Bug 2

Route orphan events to Ralph's queue instead of dropping them.
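One way to picture this fix is a bus with a fallback queue. The sketch below is illustrative only: `Bus`, `Event`, `hat_queues`, and `ralph_queue` are hypothetical names rather than the real EventBus API, and it assumes exact topic matching with one subscriber per topic, which the actual router may not.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, PartialEq)]
struct Event {
    topic: String,
    payload: String,
}

/// Hypothetical stand-in for the event bus: one queue per hat, plus a
/// fallback queue for Ralph that receives events nobody subscribes to.
#[derive(Default)]
struct Bus {
    /// topic -> hat id (exact match only; an assumption of this sketch).
    subscriptions: HashMap<String, String>,
    hat_queues: HashMap<String, VecDeque<Event>>,
    ralph_queue: VecDeque<Event>,
}

impl Bus {
    fn subscribe(&mut self, topic: &str, hat: &str) {
        self.subscriptions.insert(topic.to_string(), hat.to_string());
    }

    /// Delivers the event and returns `true` if it was an orphan.
    /// Orphans are queued for Ralph instead of being dropped.
    fn publish(&mut self, event: Event) -> bool {
        match self.subscriptions.get(&event.topic) {
            Some(hat) => {
                self.hat_queues
                    .entry(hat.clone())
                    .or_default()
                    .push_back(event);
                false
            }
            None => {
                self.ralph_queue.push_back(event);
                true
            }
        }
    }
}
```

The returned flag plays the role of `has_orphans` in the quoted snippet, so the caller can still log the orphan while the event itself stays available to Ralph.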

Bug 3

Validate that published events match the active hat's publishes list; reject invalid events with feedback.
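The check could look roughly like the sketch below. `Validation` and `validate_topic` are hypothetical names, the feedback wording is invented for illustration, and exact string matching against the `publishes` list is an assumption.

```rust
/// Outcome of checking one published topic against the active hat's
/// `publishes` list.
#[derive(Debug, PartialEq)]
enum Validation {
    Accepted,
    /// Carries a message that could be fed back into the next prompt.
    Rejected { feedback: String },
}

/// Accepts the topic only if the active hat declares it in `publishes`.
fn validate_topic(hat_name: &str, publishes: &[&str], topic: &str) -> Validation {
    if publishes.iter().any(|allowed| *allowed == topic) {
        Validation::Accepted
    } else {
        Validation::Rejected {
            feedback: format!(
                "Hat '{hat_name}' may not publish `{topic}`; allowed topics: {}",
                publishes.join(", ")
            ),
        }
    }
}
```

Applied to the event log above, an Implementer hat whose `publishes` list is `["implementation.done"]` would have implementation.start and implementation.trigger rejected with feedback naming the allowed topic, rather than accepted and later dropped.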

Environment

- Preset: spec-driven
- Backend: Claude
- Branch: main (verified bugs present as of 2026-01-20)
