mirror of
https://github.com/NikkeTryHard/zerogravity.git
synced 2026-04-25 15:15:59 +03:00
[GH-ISSUE #24] [BUG] maxOutputTokens=1 injection causes cascade loop, exhausts Flash quota in seconds #21
Originally created by @yulin0629 on GitHub (Feb 19, 2026).
Original GitHub issue: https://github.com/NikkeTryHard/zerogravity/issues/24
Summary
When Claude Code connects via
`/v1/messages` (Anthropic Messages API), the MITM `maxOutputTokens=1` injection on Turn 0 creates an infinite loop with the LS binary's `MAX_TOKENS` auto-retry behavior. Combined with RESOURCE_EXHAUSTED retry logic, this amplifies 75 user API requests into 1,289 Google API calls, exhausting Gemini 3 Flash quota in under 3 seconds.

Environment

`ANTHROPIC_BASE_URL=http://localhost:8741`

Reproduction

- `POST /v1/token`
- `/v1/messages` requests to `gemini-3-flash`

Root Cause: Triple Amplification
1. Claude Code parallel requests (5x multiplier)

Claude Code sends 5 concurrent `/v1/messages` requests on startup. Each creates an independent cascade.

2. `maxOutputTokens=1` + LS auto-retry = infinite loop (~Nx multiplier)

Every MITM-modified request injects `maxOutputTokens=1`. The LS receives `finish_reason: MAX_TOKENS` and automatically sends the next turn on the same cascade (content messages grow: 5 → 6 → 7 → ...). The next turn is also injected with `maxOutputTokens=1`, creating an endless loop.

3. RESOURCE_EXHAUSTED retry creates new cascades (2x multiplier)

When Google returns RESOURCE_EXHAUSTED, the proxy retries up to 2 times, each time creating a new cascade that goes through the same `maxOutputTokens=1` loop.

Impact Numbers
- `maxOutputTokens=1` injections
- `MAX_TOKENS` finish reasons
- `RESOURCE_EXHAUSTED` errors

Timeline
- 10:49:21: `/v1/chat/completions` request, succeeded normally
- 11:25:50: `/v1/messages`
- 11:25:50: `maxOutputTokens=1`
- 11:25:52: `MAX_TOKENS` → LS retries → loop begins
- 11:25:53
- 11:25:53~55: `maxOutputTokens=1`
- 12:06:55

Estimated Token Consumption
Each of the 1,041
`maxOutputTokens=1` calls sends ~5,000 input tokens (22 KB modified body) but receives only 1 output token. Google's quota counts input+output: ~5.2 million input tokens consumed, enough to exhaust Flash quota instantly.
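A back-of-envelope check of those figures (the counts are taken from this issue; only the arithmetic is new):

```python
# Quota arithmetic from the counts reported above.
calls = 1_041                  # maxOutputTokens=1 calls observed in the logs
input_tokens_per_call = 5_000  # ~22 KB modified request body
output_tokens_per_call = 1     # maxOutputTokens=1 allows a single output token

total_tokens = calls * (input_tokens_per_call + output_tokens_per_call)
print(f"{total_tokens:,} tokens")  # 5,206,041 -> the ~5.2M figure above
```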
Suggested Fix
The
`maxOutputTokens=1` injection on Turn 0 should not persist across subsequent turns of the same cascade. Possible approaches:
- Inject `maxOutputTokens=1` only on the first turn of a cascade, then use the client's requested `max_tokens` for subsequent turns
- Detect a `MAX_TOKENS` finish reason on a `maxOutputTokens=1` turn and stop the cascade instead of letting the LS auto-retry
- Cap the number of `maxOutputTokens=1` rounds

Workaround
For users hitting this: avoid using
`gemini-3-flash` as the initial model when connecting Claude Code. Use `gemini-3-pro` or `opus-4.6` for the first session; these appear less affected (possibly a different quota bucket or rate limits).

@yulin0629 commented on GitHub (Feb 19, 2026):
Source-Level Root Cause Analysis
After tracing the MITM request modification flow using docker logs and the built-in trace files (`~/.config/zerogravity/traces/`), I can pinpoint the exact bug mechanism.

Key Finding: Claude Code sends `max_tokens=1` in preflight requests

Claude Code's startup sends "count" preflight pings with `max_tokens: 1` via the Anthropic Messages API. This is legitimate behavior: a token-counting probe, not a real generation request.

Evidence from MITM trace files (`modified_request.json`):
- `f1be9ae5` (11:25:52): "count" (5 chars)
- `3f2214fc` (11:25:59)
- `ab72950b` (11:33:14): "say hi in 5 words"
- `b41f1d3d` (10:49:21): "Say hello in one word"

The proxy faithfully forwards `max_tokens=1` → `generationConfig.maxOutputTokens=1`. This is correct for the first turn. The bug is what happens next.

Bug: `CascadeCache` replays `maxOutputTokens=1` on every subsequent turn

On Turn 0, the MITM proxy caches
`generation_params` (including `max_output_tokens=1`) in the per-cascade `CascadeCache`. On Turn 1+, it rebuilds `ToolContext` from the cache and replays `max_output_tokens=1` verbatim, injecting `generationConfig.maxOutputTokens=1` into every subsequent Google API request on that cascade.

The loop mechanism

1. Turn 0: `maxOutputTokens=1` → Google returns `finish_reason: MAX_TOKENS` (only 1 token allowed)
2. The LS sees `MAX_TOKENS` → automatically sends Turn 1 on the same cascade (trying to "continue")
3. The proxy rebuilds `ToolContext` from `CascadeCache` → injects `maxOutputTokens=1` again
4. `MAX_TOKENS` again → LS sends Turn 2 → infinite loop

Log evidence shows the content message count growing each turn (5 → 6 → 7 → ...).
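The loop described above can be sketched as a toy simulation (hypothetical names; the real proxy and LS internals are not shown in this thread), illustrating why the cascade only terminates when the cached value allows a normal finish:

```python
def simulate_cascade(cached_max_output_tokens: int, max_turns: int = 10) -> int:
    """Count turns until the cascade ends, or give up after max_turns."""
    content_messages = 5  # initial message count observed in the logs
    for turn in range(max_turns):
        # The cached value is replayed verbatim on every turn (the bug).
        if cached_max_output_tokens <= 1:
            finish_reason = "MAX_TOKENS"  # Google truncates after 1 token
        else:
            finish_reason = "end_turn"
        if finish_reason != "MAX_TOKENS":
            return turn + 1
        content_messages += 1  # LS appends a message and auto-retries
    return max_turns  # never terminated on its own

print(simulate_cascade(1))     # runs until the turn cap: 10
print(simulate_cascade(4096))  # finishes on the first turn: 1
```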
Suggested fix
The
`CascadeCache` should not replay `max_output_tokens` from a preflight request on subsequent turns. Options:
1. Don't cache `max_output_tokens` when the value is ≤ 1: treat it as a preflight and set it to `None` for subsequent turns
2. Clear `max_output_tokens` in the cache after Turn 0: subsequent turns use the LS default (16384)
3. On `finish_reason=MAX_TOKENS` with `maxOutputTokens=1`, stop the cascade instead of letting the LS auto-retry

Option 1 is the simplest and least invasive.
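Option 1 could look roughly like this (a sketch under stated assumptions: `CascadeCache` and `generation_params` are names from the analysis above, but the field layout here is invented):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class GenerationParams:
    max_output_tokens: Optional[int] = None
    temperature: Optional[float] = None

def sanitize_for_cache(params: GenerationParams) -> GenerationParams:
    """Drop max_output_tokens <= 1 before caching: it is a token-counting
    preflight, and replaying it on later turns causes the MAX_TOKENS loop."""
    if params.max_output_tokens is not None and params.max_output_tokens <= 1:
        return replace(params, max_output_tokens=None)  # LS default applies later
    return params
```

With `max_output_tokens=None` cached, subsequent turns would fall back to the LS default (16384) instead of replaying the preflight value.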
@NikkeTryHard commented on GitHub (Feb 19, 2026):
v1.1.6-beta.1 includes several fixes to the MITM pipeline that improve stability with multi-step tool calling and cascade handling. While this specific
`maxOutputTokens=1` loop issue requires a dedicated fix (tracking the turn counter), the improved `thought_signature` handling and cascade logic may reduce some of the amplification effects.

Please test with the beta and let us know if you still see the same quota exhaustion behavior:
Binary:
Docker:
A targeted fix for the
`maxOutputTokens=1` cascade loop is still planned.

@yulin0629 commented on GitHub (Feb 19, 2026):
v1.1.6-beta.1 Test Result: Bug Still Present
Tested with
`ghcr.io/nikketryhard/zerogravity:v1.1.6-beta.1`; the `maxOutputTokens=1` cascade loop is unchanged.

Test
Single
`max_tokens=1` request (simulating Claude Code preflight).

Result: Request timed out (loop ran until timeout).
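The exact test command was not preserved in this mirror; a hedged sketch of the payload shape (Anthropic Messages API, values taken from the trace evidence; the model name is illustrative):

```python
import json

# Hypothetical reproduction payload: a single preflight-style request
# with max_tokens=1, sent to the proxy at ANTHROPIC_BASE_URL
# (http://localhost:8741, path /v1/messages).
payload = {
    "model": "gemini-3-flash",  # illustrative; model routing is proxy-side
    "max_tokens": 1,            # simulates the Claude Code preflight probe
    "messages": [{"role": "user", "content": "count"}],
}
body = json.dumps(payload)
print(body)
```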
Log Evidence
10 turns in 9 seconds, content messages growing from 5 → 14,
`maxOutputTokens=1` injected on every turn. Identical behavior to v1.1.5.

Note
Normal requests (
`max_tokens=100`) work fine on the same build; the bug is specifically the cached `maxOutputTokens=1` replaying on subsequent cascade turns.

Stopped the container to preserve quota. Will wait for the dedicated fix.
@NikkeTryHard commented on GitHub (Feb 19, 2026):
yeah
`maxOutputTokens=1` problem still unfixed. if you have access to src please PR there if you have any ideas

@NikkeTryHard commented on GitHub (Feb 19, 2026):
i invited you!
@NikkeTryHard commented on GitHub (Feb 20, 2026):
Fixed in v1.1.8.
The
`maxOutputTokens` injection now enforces a minimum of 4096 tokens (`m.max(4096)`), preventing the `maxOutputTokens=1` loop. Additionally, thinking models skip `maxOutputTokens` entirely and use the proxy default (32000).

Verified with all three API paths:
- `/v1/messages`: `end_turn`, 189 chars
- `/v1/chat/completions` (`max_tokens`): `stop`, 143 chars
- `/v1/chat/completions` (`max_completion_tokens`): `stop`, 97 chars

No `MAX_TOKENS` finish reason, no cascade loop, no quota drain. Please upgrade and test.

@yulin0629 commented on GitHub (Feb 20, 2026):
v1.1.8 Test Results
Just tested the newly released
`v1.1.8` Docker image:

Test 1: Normal request (`max_tokens=100`): works correctly, response received.

Test 2: Issue #24 reproduction (`max_tokens=1`): still times out with cascade loop behavior.

Logs show the bug persists
The
`maxOutputTokens=1` is still being injected on every cascade turn, triggering the infinite loop.

Looking at the commit history between tags, it seems the fix (capping `maxOutputTokens` to a minimum of 4096) was merged into main after the v1.1.8 tag was created. Looking forward to the next release!

@NikkeTryHard commented on GitHub (Feb 20, 2026):
Give me ur
`zg report` for test 2 if possible

@NikkeTryHard commented on GitHub (Feb 20, 2026):
see if fixed with v1.1.9
@yulin0629 commented on GitHub (Feb 20, 2026):
v1.1.9 Test Results: Bug Fixed
Tested with
`ghcr.io/nikketryhard/zerogravity:v1.1.9` on macOS Docker (arm64).

Test 1: Baseline (`max_tokens=100`)
Result: `"text":"hello world"`, `stop_reason: end_turn`. Normal.

Test 2: Issue #24 reproduction (`max_tokens=1`)
Result: `"text":"hello world"`, `stop_reason: end_turn`. No timeout, no cascade loop.

Log Evidence
Every request shows
`maxOutputTokens=4096` (capped from 1). No `maxOutputTokens=1` injections, no `MAX_TOKENS` finish reason, no growing content message counts. The fix works.

Note
(gemini-3-flash quota was exhausted; tested with gemini-3-pro instead, same MITM pipeline)
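The v1.1.9 behavior reported above amounts to a floor clamp; a minimal sketch (the function name is illustrative, while the 4096 floor and 32000 thinking-model default are from this thread):

```python
def effective_max_output_tokens(requested: int, thinking_model: bool = False) -> int:
    """Clamp the injected maxOutputTokens so a max_tokens=1 preflight
    can no longer produce a MAX_TOKENS finish reason."""
    if thinking_model:
        return 32_000             # thinking models skip the cap, use proxy default
    return max(requested, 4_096)  # m.max(4096) per the v1.1.9 fix

print(effective_max_output_tokens(1))     # 4096, matches the log evidence
print(effective_max_output_tokens(8192))  # larger requests pass through: 8192
```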
@NikkeTryHard commented on GitHub (Feb 20, 2026):
thank you for this issue!