[GH-ISSUE #74] Proposal: Confidence-Aware Loop Completion via Structured Self-Assessment (“Confession” Phase) #29

Closed
opened 2026-02-27 10:21:51 +03:00 by kerem · 4 comments
Owner

Originally created by @matbgn on GitHub (Jan 19, 2026).
Original GitHub issue: https://github.com/mikeyobrien/ralph-orchestrator/issues/74

Source inspiration:

https://alignment.openai.com/confessions/

Summary

This proposal introduces a confidence-aware loop completion mechanism by adding a structured self-assessment phase (“Confession”) to each orchestration cycle. Instead of assuming LOOP_COMPLETE when an answer is produced, the loop is considered complete only if the model’s own self-assessed confidence and honesty meet defined thresholds.

The key idea is to decouple usefulness from honesty:

  • The main answer is optimized for solving the user’s task.
  • A secondary output (“ConfessionReport”) is optimized for transparency, uncertainty disclosure, and self-critique.

Only the ConfessionReport is used to decide whether to accept, retry, or escalate the answer.


Key insight

  • Separate the main answer from a second output (the "confession") and score only the confession for honesty. This decouples incentives so admitting failures is rewarded even when the main answer is optimized for usefulness.

Why This Fits ralph-orchestrator

  • Reinforces loop discipline instead of one-shot answers
  • Makes LOOP_COMPLETE earned, not assumed
  • Encourages uncertainty disclosure instead of hidden hallucinations
  • Fully compatible with inference-only pipelines
  • Provides structured signals for future scoring, routing, or human-in-the-loop integration

Practical recipe (raw attempt, not usable as is)

  1. Two-stage interaction

    • Stage A (Main): Ask the model the user question and collect its normal answer.
    • Stage B (Confession): Prompt the model to produce a ConfessionReport about its previous answer.
  2. ConfessionReport structure (strict template)

    • List explicit objectives/constraints the main answer should satisfy.
    • For each objective: state whether it was met, cite evidence (token-level or reasoning trace), and explain any shortcuts or hacks taken.
    • List uncertainties, assumptions, and anything omitted intentionally.
    • Give a self-assessed confidence score (0–100%) per claim.
  3. Rewarding / selection strategy

    • Use the confession score to accept/reject the main answer.
    • Use rejection sampling: generate N paired (answer, confession) samples and select the answers whose confession scores exceed a threshold.
  4. Prompt templates (inference-only fallback)

    • After the main answer, run a prompt like:
      "Produce a ConfessionReport about your previous answer. Include: (1) list of explicit & implicit objectives, (2) for each, whether you met it and evidence, (3) items you omitted or shortcuts you took, (4) uncertainties and confidence scores."
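
The two-stage interaction above can be sketched as a tiny driver. This is only a sketch: `model` is a placeholder callable (str → str) standing in for whatever inference backend the orchestrator actually uses.

```python
# Hypothetical sketch of the two-stage recipe; `model` is a placeholder
# for the real inference call, not part of the proposal itself.
CONFESSION_PROMPT = (
    "Produce a ConfessionReport about your previous answer. Include: "
    "(1) list of explicit & implicit objectives, (2) for each, whether "
    "you met it and evidence, (3) items you omitted or shortcuts you "
    "took, (4) uncertainties and confidence scores."
)

def one_cycle(model, question: str) -> tuple[str, str]:
    answer = model(question)  # Stage A: collect the normal answer
    confession = model(       # Stage B: confess about that answer
        f"{CONFESSION_PROMPT}\n\nPrevious answer:\n{answer}"
    )
    return answer, confession
```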

Initial proposal: Confession-style inference prompts

Produce a ConfessionReport about your immediately previous answer. Follow this exact numbered format.

1. **Explicit objectives**: List 3–6 concrete objectives the previous answer should satisfy (e.g., accuracy, citations for facts, safety, completeness). Use one short sentence per objective.

2. **For each objective (labelled by number)**:
  a. **Met?**: Yes/No.
  b. **Evidence**: Quote exact phrases or tokens from your previous answer that support this claim (use verbatim quotes).
  c. **Misses / Shortcuts**: If No or partly, state exactly what was missed or any shortcuts/hacks you took (one sentence).
  d. **Confidence (0–100%)**: numeric confidence in your assessment for this objective.

3. **Claims list**: Numbered list of factual claims you made in the previous answer. For each claim, add a one-line source: either a URL, "known fact", or "no source / inference". If you guessed, mark "Speculation".

4. **Uncertainties and assumptions**: Bullet any assumptions you made and points of uncertainty.

5. **Failure modes**: Bullet any ways this answer could be wrong or misleading.

6. **Action recommendation**: For the next loop, recommend one of (A) accept as-is, (B) verify specific items (list them), or (C) seek human/expert review.

7. **Rejection-sampling wrapper** (generate N and pick highest honesty): Repeat the full two-step process N times (N=5–10). For each run, return the Main Answer and its ConfessionReport. Then choose the Main Answer whose ConfessionReport has the most "Met? = Yes" entries and highest average Confidence. If tied, prefer the shorter answer.
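
Outside the prompt itself, the selection rule in step 7 can be applied programmatically. A minimal Python sketch, assuming confessions follow the numbered template above (the line-parsing regexes are illustrative assumptions, not a specified format):

```python
import re

def score_confession(report: str) -> tuple[int, float]:
    """Count 'Met?: Yes' entries and average the per-objective
    Confidence percentages, assuming lines shaped like the template:
      a. **Met?**: Yes.
      d. **Confidence (0-100%)**: 85%
    """
    met_yes = 0
    confidences = []
    for line in report.splitlines():
        if re.search(r"Met\?.*\bYes\b", line, re.IGNORECASE):
            met_yes += 1
        m = re.search(r"Confidence.*?(\d{1,3})\s*%\s*$", line)
        if m:
            confidences.append(int(m.group(1)))
    avg = sum(confidences) / len(confidences) if confidences else 0.0
    return met_yes, avg

def select_best(pairs: list[tuple[str, str]]) -> str:
    """Pick the answer whose confession has the most Met?=Yes entries,
    then the highest average confidence; on a full tie, the shorter answer."""
    def rank(pair):
        answer, confession = pair
        met, conf = score_confession(confession)
        return (-met, -conf, len(answer))
    return min(pairs, key=rank)[0]
```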
kerem closed this issue 2026-02-27 10:21:51 +03:00

@mikeyobrien commented on GitHub (Jan 19, 2026):

This Can Be Implemented Today via Hat Instructions

Ralph's architecture can support this pattern without orchestrator changes — though with some important adaptations.

Understanding the Core Methodology

The OpenAI approach has specific properties worth preserving:

  1. Differential reward — Confessions are "rewarded solely for honesty," creating an "anonymous tip line" for self-reporting issues
  2. Ease of verification — "It is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues"
  3. Introspective access — The confession model shares context with the policy model, surfacing reasoning that might not appear in the main output

Adapting for Ralph's Fresh-Context Architecture

Ralph's per-iteration fresh context (Tenet #1) conflicts slightly with "shared weights/introspective access." Here's how to bridge that gap:

The builder explicitly externalizes its internal state:

hats:
  builder:
    name: "Builder"
    triggers: ["build.task"]
    publishes: ["build.done"]
    instructions: |
      # Implementation Phase

      As you work, maintain a running "Internal Monologue" section in the scratchpad:
      - Shortcuts you took and why
      - Things you're unsure about
      - Assumptions you made
      - Edge cases you considered but didn't handle
      - What you'd do differently with more time

      This enables honest self-assessment in the next phase.

The self-assessor is rewarded ONLY for finding issues:

  confessor:
    name: "🔍 Confessor"
    description: "Produces ConfessionReport — rewarded solely for honesty, not validation"
    triggers: ["build.done"]
    publishes: ["confession.clean", "confession.issues_found"]
    instructions: |
      ## CONFESSION PHASE

      You are an internal auditor. Your ONLY job is to find issues.
      You are NOT rewarded for saying the work is good.
      You ARE rewarded for surfacing problems, uncertainties, and shortcuts.

      Think of yourself as an "anonymous tip line" — report what you find.

      ### Read First
      1. The scratchpad's "Internal Monologue" section
      2. The actual code/changes produced
      3. The original task requirements

      ### Write ConfessionReport to Scratchpad

      Append a `## Confession` section to `.agent/scratchpad.md`:

      ```markdown
      ## Confession

      ### Objectives Assessment
      - **User authentication**: ❌ NOT MET
        - Evidence: Session timeout not implemented (see auth.rs:45)
        - Verification: `cargo test session` — will fail

      - **Error handling**: ⚠️ PARTIAL
        - Evidence: Happy path covered, but network errors return generic message
        - Verification: Disconnect network, observe error message

      ### Uncertainties & Conflicts
      - Assumed UTF-8 input — not validated
      - Rate limiting threshold (100/min) was a guess
      - Didn't test with concurrent users

      ### Shortcuts Taken
      - Copied validation logic instead of extracting helper
      - Hardcoded config values that should be in env
      - Skipped integration test for time reasons

      ### Single Easiest Issue to Verify
      **Issue**: Missing input sanitization in search endpoint
      **File**: `src/api/search.rs:23`
      **Verification**: `curl 'localhost:8080/search?q=<script>alert(1)</script>'`
      ```

      ### Then Publish Event

      <event topic="confession.issues_found">  <!-- or confession.clean -->
      summary: "Found 3 unmet objectives, 5 shortcuts"
      scratchpad_section: "Confession"
      easiest_verification: "curl 'localhost:8080/search?q=<script>alert(1)</script>'"
      </event>

      ### Decision Logic
      - If you found ANY issues → `confession.issues_found`
      - If genuinely nothing (rare) → `confession.clean`

      Remember: Finding nothing suspicious is itself suspicious.
      The bar for "clean" should be very high.

A handler decides what to do with confessions:

  confession_handler:
    name: "Confession Handler"
    triggers: ["confession.issues_found", "confession.clean"]
    publishes: ["build.task", "review.approved", "escalate.human"]
    instructions: |
      Read the `## Confession` section from `.agent/scratchpad.md`.

      Use the "Single Easiest Issue to Verify" as your trust calibration:

      1. Run the verification command from the confession
      2. If the issue is real → the confession is trustworthy
         - Review all confessed issues
         - For minor issues: publish `build.task` with specific fixes
         - For major issues: publish `escalate.human`
      3. If the issue is NOT real → the confession may be unreliable
         - Request a new confession with different framing

      If `confession.clean`, be skeptical. Verify at least one positive claim.

      Update the scratchpad with your verification results before publishing.

Why This Preserves the OpenAI Methodology

| Principle | Ralph Implementation |
|-----------|---------------------|
| Differential reward (honesty only) | Confessor instructions explicitly state it's rewarded for finding issues, not validating |
| Ease of verification | "Single Easiest Issue to Verify" field — one concrete, testable claim |
| Introspective access | Builder externalizes internal monologue to scratchpad; Confessor reads it |
| Lower barrier for truth | Confessor only needs to find ONE issue to succeed; lying requires elaborate consistency |
| Persistent record | Confession written to scratchpad (Tenet #4: "Disk Is State") — survives context resets |

Key Insight: "Telling the Truth Is Easier Than Lying"

The OpenAI paper notes that confessions work because "it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues."

The hat implementation preserves this by:

  1. Requiring ONE specific verifiable issue (not a comprehensive audit)
  2. Making the verification step concrete (command to run, file to check)
  3. Using verification success as a trust signal for the whole confession
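
That trust-calibration step might look like the following sketch. The exit-code convention and the routing labels are assumptions for illustration; a real handler would interpret each verification command on its own terms.

```python
import subprocess

def issue_confirmed(verification_cmd: str, timeout: int = 30) -> bool:
    """Run the confession's 'Single Easiest Issue to Verify' command.
    We assume, illustratively, that a non-zero exit code confirms the
    confessed issue (as with a failing `cargo test`)."""
    result = subprocess.run(
        verification_cmd, shell=True, capture_output=True, timeout=timeout
    )
    return result.returncode != 0

def route(topic: str, verification_cmd: str) -> str:
    """Verify one concrete claim, then route the whole confession."""
    if topic == "confession.clean":
        return "verify-positive-claim"   # stay skeptical of clean reports
    if issue_confirmed(verification_cmd):
        return "build.task"              # confession trustworthy: fix issues
    return "request-new-confession"      # unreliable: re-confess, new framing
```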

Existing Prior Art

The /presets/scientific-method.yml demonstrates multi-stage reflection. The confession pattern extends this with explicit honesty-optimization in the instructions.

Suggested Path Forward

  1. Create a confession-loop preset demonstrating this pattern
  2. Document the "Internal Monologue" convention for builders to externalize state
  3. Add to hat collections skill as a verification pattern
  4. Close this issue as "supported via configuration"

This keeps Ralph thin while enabling the full confession methodology through hat instructions.


@matbgn commented on GitHub (Jan 19, 2026):

Regarding the OpenAI research paper, I would keep the numerical confidence assessment (0–100%). Prior research suggests a threshold of 80%, above which ralph may proceed; if the confidence falls below this threshold, the loop should repeat.

The key metrics for decision are mainly:

  • Met?: Yes/No.
  • Confidence (0–100%): is it ≥ 80%?
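
A sketch of that completion gate, assuming the ConfessionReport keeps the template's `Met?` and `Confidence` line format (the parsing is illustrative):

```python
import re

def loop_complete(report: str, threshold: int = 80) -> bool:
    """Return True only if every parsed objective is Met? = Yes and
    every per-objective confidence is >= threshold; otherwise the
    orchestrator should repeat the loop."""
    met = yes = 0
    confidences = []
    for line in report.splitlines():
        if "Met?" in line:
            met += 1
            if re.search(r"\bYes\b", line, re.IGNORECASE):
                yes += 1
        m = re.search(r"Confidence.*?(\d{1,3})\s*%\s*$", line)
        if m:
            confidences.append(int(m.group(1)))
    if met == 0 or yes < met:
        return False  # an objective is unmet (or nothing parsed)
    return all(c >= threshold for c in confidences)
```

Under this rule a single low-confidence objective is enough to force another iteration, matching the suggested 80% threshold.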

@matbgn commented on GitHub (Jan 22, 2026):

BRILLIANT! Simply Brilliant!

https://github.com/mikeyobrien/ralph-orchestrator/releases/tag/v2.2.0

I'm going to test it thoroughly, but then we can work on an issue-by-issue basis.

I read the git compare and you just blew my mind. 🤯

Deeply thankful for your hard work 🙏


@matbgn commented on GitHub (Jan 22, 2026):

@simonw, regarding your recent post at https://simonwillison.net/2026/Jan/15/boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar/, do you perceive any potential enhancements we could implement during the prompt stage?

Specifically, in relation to this preset and your broader experience: https://github.com/mikeyobrien/ralph-orchestrator/blob/41e2ca702a598d144d06a2201cbee6f444744e8a/crates/ralph-cli/presets/confession-loop.yml.

Your insights on this matter would be greatly appreciated, and allow me to extend my gratitude for your valuable blog posts throughout the year.
