[GH-ISSUE #105] Pre-recon agent stalls: Temporal heartbeat timeout during sub-agent execution #36

Closed
opened 2026-02-27 07:20:06 +03:00 by kerem · 4 comments

Originally created by @lenkaiser on GitHub (Feb 9, 2026).
Original GitHub issue: https://github.com/KeygraphHQ/shannon/issues/105

Description

When running Shannon against a real codebase (Next.js + Express app, ~50 source files) using CLAUDE_CODE_OAUTH_TOKEN, the runPreReconAgent Temporal activity stalls and never completes. The pipeline testing mode (PIPELINE_TESTING=true) works fine end-to-end.

Environment

  • Shannon v1.0.0
  • macOS (Apple Silicon)
  • Docker Desktop 28.0.4
  • Auth: CLAUDE_CODE_OAUTH_TOKEN (team subscription, claude_max_5x rate tier)
  • Target: Local Express server on http://host.docker.internal:3000

Steps to Reproduce

export CLAUDE_CODE_OAUTH_TOKEN="<token>"
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
./shannon start URL=http://host.docker.internal:3000 REPO=/path/to/repo

Observed Behavior

  1. Worker starts successfully, Claude Code authenticates fine
  2. Pre-recon agent launches and begins Phase 1 sub-agents (Architecture Scanner, Entry Point Mapper, Security Pattern Hunter)
  3. Sub-agents actively execute — 326 tool calls observed (Read, Grep, Glob, Bash) doing deep source code analysis over ~10 minutes
  4. The workflow.log stops updating after the sub-agents start (logs are batched, not streamed during sub-agent execution)
  5. Temporal UI shows runPreReconAgent stuck on Attempt 1 of 50, with the last heartbeat 60+ minutes behind
  6. The claude process inside the container is still alive and making API calls (confirmed via docker exec ps aux and Claude debug logs showing Stream started - received first chunk)
  7. The activity never completes — Temporal eventually considers it timed out

Temporal UI State

Activity: runPreReconAgent
Activity ID: 2
Attempt: 1 of 50 (49 remaining)
Last Heartbeat: 2026-02-09 UTC 00:05:34
Last Started: 2026-02-08 UTC 23:55:32

Heartbeat stopped ~10 minutes in, while the Claude process continued running for much longer.
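
The "~10 minutes" figure can be checked directly from the two timestamps above. A minimal sketch using only the standard library; the format string is an assumption about how the Temporal UI prints times, and `parse_ui_time` is a hypothetical helper:

```python
from datetime import datetime, timezone

# Assumed layout of the timestamps shown in the Temporal UI state above.
UI_FORMAT = "%Y-%m-%d UTC %H:%M:%S"


def parse_ui_time(text: str) -> datetime:
    """Parse a Temporal UI timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, UI_FORMAT).replace(tzinfo=timezone.utc)


last_started = parse_ui_time("2026-02-08 UTC 23:55:32")
last_heartbeat = parse_ui_time("2026-02-09 UTC 00:05:34")

# Time between the activity starting and its final recorded heartbeat.
heartbeat_window = last_heartbeat - last_started
print(heartbeat_window)  # 0:10:02
```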

Docker Logs

⚠️ Checkpoint creation failed after retries: Author identity unknown
(git config issue inside container — non-fatal)

Running Claude Code: pre-recon...
  Assigned pre-recon -> playwright-agent1
  SDK Options: maxTurns=10000, cwd=/target-repo, permissions=BYPASS
  Model: claude-sonnet-4-5-20250929

  Turn 1-6: Launches Phase 1 Discovery Agents (3 parallel sub-agents)
  Turn 171-176: Phase 1 complete, launches Phase 2 agents (XSS/Injection, SSRF, Data Security)
  (no further turn updates logged)

Analysis

The pre-recon agent uses Claude Code's Task tool to spawn sub-agents for parallel analysis. While these sub-agents run, the parent agent is blocked waiting — and during this time, no Temporal heartbeats are sent. The sub-agents themselves are working fine (debug logs confirm active API streams), but Temporal's heartbeat mechanism doesn't account for the parent being blocked on child agent completion.

The pre-recon prompt encourages spawning 3+ parallel sub-agents per phase, and with a real codebase each sub-agent does 100+ tool calls. This easily takes 10-20+ minutes, far exceeding what appears to be the heartbeat timeout.
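
To make the failure mode concrete, here is a simplified model of a heartbeat timeout. It is only a sketch: the function name and the numbers (a 600-second timeout, heartbeats every 2 seconds, a 30-minute run) are illustrative assumptions, not values read from Shannon's code. The timer restarts at activity start and at every heartbeat, and fires once the timeout passes without one.

```python
from typing import Optional, Sequence


def first_heartbeat_timeout(
    heartbeats: Sequence[float],
    timeout: float,
    finished_at: Optional[float] = None,
) -> Optional[float]:
    """Return when (seconds since activity start) a heartbeat timeout fires.

    Returns None if the activity finishes before any timeout.
    Simplified model: the timer restarts at t=0 and at every heartbeat.
    """
    last_seen = 0.0
    for beat in sorted(heartbeats):
        if finished_at is not None and beat > finished_at:
            break  # heartbeats after completion are irrelevant
        if beat - last_seen > timeout:
            return last_seen + timeout  # gap too long: timer fired here
        last_seen = beat
    deadline = last_seen + timeout
    if finished_at is not None and finished_at <= deadline:
        return None
    return deadline


# Hypothetical run: heartbeats for the first 10 minutes, then silence while
# the parent is blocked on sub-agents; the work would need 30 minutes.
stalled = first_heartbeat_timeout(range(0, 601, 2), timeout=600, finished_at=1800)
print(stalled)  # 1200

# Same run, but heartbeats continue for the whole 30 minutes.
healthy = first_heartbeat_timeout(range(0, 1801, 2), timeout=600, finished_at=1800)
print(healthy)  # None
```

In this model the stalled run is timed out at the 20-minute mark even though the underlying work is still making progress, which matches the symptom described above.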

Expected Behavior

The runPreReconAgent activity should either:

  1. Send heartbeats while waiting for sub-agents to complete
  2. Have a longer heartbeat timeout configured for the pre-recon activity
  3. Or restructure pre-recon to avoid long-blocking sub-agent calls
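
Option 1 could be sketched roughly as follows. This is not Shannon's implementation: `run_with_heartbeat` and `send_heartbeat` are hypothetical names, and in a real Temporal activity the callback would wrap the SDK's heartbeat call. The idea is to run the heartbeat on its own timer so it does not depend on the parent agent producing messages.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def run_with_heartbeat(
    work: Awaitable[T],
    send_heartbeat: Callable[[], None],
    interval: float = 30.0,
) -> T:
    """Await `work` while calling `send_heartbeat` every `interval` seconds."""

    async def beat_forever() -> None:
        while True:
            send_heartbeat()
            await asyncio.sleep(interval)

    beater = asyncio.create_task(beat_forever())
    try:
        return await work
    finally:
        # Stop heartbeating as soon as the work finishes or fails.
        beater.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await beater


async def demo() -> int:
    """Simulate a blocked parent: the 'sub-agents' yield nothing for a while."""
    beats: List[int] = []

    async def slow_sub_agents() -> str:
        await asyncio.sleep(0.05)  # stands in for a long sub-agent phase
        return "done"

    await run_with_heartbeat(slow_sub_agents(), lambda: beats.append(1), interval=0.01)
    return len(beats)


if __name__ == "__main__":
    print("heartbeats sent while waiting:", asyncio.run(demo()))
```

Because the timer is independent of the work, a truly hung process would keep heartbeating too; an overall activity timeout is still needed as a backstop.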

Workaround Attempted

PIPELINE_TESTING=true completes successfully (13/13 agents, 78s, ~$0.55) but produces simulated results against example.com rather than testing the actual target.

kerem closed this issue 2026-02-27 07:20:06 +03:00

@ppamorim commented on GitHub (Feb 9, 2026):

@lenkaiser I have the same issue. In addition, Shannon does not seem to detect the repository path correctly: it keeps printing "repoPath": "/target-repo". The documentation is not satisfactory because it's not clear whether the path should point to the repository directory under $HOME or to its .git directory. I am currently starting Shannon with the command: ./shannon start URL=https://foo.com REPO=$HOME/Repository/foo.


@lenkaiser commented on GitHub (Feb 9, 2026):

Not sure if our problems are related; however, with Claude I was able to make the following change, after which it continues with the pre-recon:

Plan to implement
│
│ Fix: Shannon pre-recon agent Temporal heartbeat timeout
│
│ Context
│
│ When running Shannon in full mode (not PIPELINE_TESTING), the pre-recon agent stalls because:
│ 1. The pre-recon prompt (prompts/pre-recon-code.txt) mandates spawning 6 Task sub-agents across 2 phases
│ 2. Each sub-agent does 100+ tool calls over 10+ minutes
│ 3. During sub-agent execution, the Claude SDK's query() async iterator doesn't yield messages
│ 4. The setInterval heartbeat (2s) stops reaching Temporal, causing a heartbeat timeout (10min)
│ 5. Temporal retries the activity, which starts the whole thing over — repeat 50 times
│
│ Fix
│
│ File: /tmp/shannon/src/temporal/workflows.ts (lines 71-74)
│
│ Change the production heartbeat timeout from 10 minutes to 60 minutes:
│
│ // Before:
│ heartbeatTimeout: '10 minutes',
│
│ // After:
│ heartbeatTimeout: '60 minutes',
│
│ This gives each sub-agent phase (3 parallel agents doing deep code analysis) plenty of time to complete without Temporal killing the activity. The startToCloseTimeout of 2 hours remains the ultimate safety net.
│
│ Rebuild & Test
│
│ 1. Edit workflows.ts
│ 2. Rebuild Docker image: cd /tmp/shannon && ./shannon stop && ./shannon start URL=http://host.docker.internal:3000 REPO=/Users/leon.keijzer/Projects/asn-api
│ 3. Monitor Temporal UI — pre-recon should now complete without retrying
│
│ Files to modify
│
│ - /tmp/shannon/src/temporal/workflows.ts — line 73: change heartbeat timeout
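
A rough back-of-the-envelope check of the plan's numbers. This sketch assumes each attempt heartbeats for about 10 minutes before going silent (as in the Temporal UI state in the issue), ignores retry backoff, and uses hypothetical function names; the 20-minute silent phase is taken from the "10-20+ minutes" estimate in the issue.

```python
from datetime import timedelta


def time_until_attempt_fails(active: timedelta, heartbeat_timeout: timedelta) -> timedelta:
    """An attempt is failed `heartbeat_timeout` after its last heartbeat."""
    return active + heartbeat_timeout


def worst_case_retry_time(
    attempts: int, active: timedelta, heartbeat_timeout: timedelta
) -> timedelta:
    """Total time spent if every attempt stalls the same way (no backoff)."""
    return attempts * time_until_attempt_fails(active, heartbeat_timeout)


def survives(silent_gap: timedelta, heartbeat_timeout: timedelta) -> bool:
    """True if a period without heartbeats is shorter than the timeout."""
    return silent_gap < heartbeat_timeout


# 50 attempts, each heartbeating ~10 min and then timing out 10 min later.
before = worst_case_retry_time(50, timedelta(minutes=10), timedelta(minutes=10))
print(before)  # 16:40:00

# A 20-minute silent sub-agent phase against the old and new timeouts.
print(survives(timedelta(minutes=20), timedelta(minutes=10)))  # False
print(survives(timedelta(minutes=20), timedelta(minutes=60)))  # True
```

The trade-off is detection latency: with a 60-minute heartbeat timeout, a genuinely hung activity can go unnoticed for up to an hour, bounded only by the 2-hour startToCloseTimeout.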

Now I checked the logs with cd /tmp/shannon && ./shannon logs ID=host-docker-internal_shannon-1770640616220 and I see the process updating nicely (that is, executing the pre-recon). What is somewhat misleading is that the web UI isn't showing any progress. It just says the following:

 11	2026-02-09 UTC 12:36:56.30 Pending Activity | Attempt 1 / 50 | Activity Type runPreReconAgent

@Yash-xoxo commented on GitHub (Feb 9, 2026):

Hey @terkaner and @goswamim,

I've been looking into this issue and I think I have some insights that might help resolve the heartbeat timeout problem.

Root Cause Analysis

From what I can see in the Temporal UI, the issue is that the pre-recon agent's heartbeat is timing out after 10 minutes while waiting for sub-agents to complete. The key indicator is the ~10 minute gap in the activity state, after which the heartbeat stops being renewed.

Looking at the Docker logs, the sub-agents (like Code QA authenticator) are actually running and processing, but the main coordinator seems to be stuck waiting without properly maintaining the heartbeat connection.

Why This Happens

The problem appears to be a mismatch between:

  1. How long the sub-agents actually take to complete their analysis (can be >10 min for complex repos)
  2. The heartbeat timeout threshold configured in Temporal
  3. The polling interval where the main agent checks on sub-agents

Basically, while Claude is busy analyzing the codebase through the sub-agents, the main workflow isn't sending heartbeats frequently enough to keep Temporal happy.

Suggested Solution

I agree with @goswamim's approach about the --repoPath flag being a workaround, but for a proper fix, I'd suggest:

  1. Increase the heartbeat timeout in the Temporal workflow definition - bump it from 10 minutes to something like 20-30 minutes to give sub-agents more breathing room
  2. Add periodic heartbeats in the main agent loop - even while waiting for sub-agents, send a heartbeat every 30-60 seconds to signal the workflow is still alive
  3. Implement progress callbacks from sub-agents back to the coordinator so we can relay those as heartbeats

The implementation would look something like:

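# In the main pre-recon agent.
# Illustrative Python only (Temporal Python SDK); the fix plan earlier in
# this thread edits a TypeScript file, so treat this as pseudocode for the idea.
import asyncio

from temporalio import activity


async def execute_with_heartbeat(self, sub_agents):
    """Run each sub-agent in turn, heartbeating while it works."""
    results = []

    for agent in sub_agents:
        # Start agent execution
        task = asyncio.create_task(agent.run())

        # Keep heartbeat alive while waiting. asyncio.wait returns as soon
        # as the task finishes, so we never sleep a full 30 s after it is done.
        while not task.done():
            activity.heartbeat()  # Tell Temporal we're still working
            await asyncio.wait({task}, timeout=30)

        results.append(await task)

    return results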

For the Immediate Issue

@terkaner - for your specific case with the 525-file repo timing out, you could try:

  • Breaking down the analysis into smaller chunks if possible
  • Using the --repoPath workaround @goswamim mentioned
  • Temporarily increasing the heartbeat timeout in your Temporal configuration

Let me know if you need help implementing any of these fixes or if you want me to submit a PR with the heartbeat improvements!


@ajmallesh commented on GitHub (Feb 9, 2026):

Thanks for the report! Fixed in #108.
