mirror of
https://github.com/NikkeTryHard/zerogravity.git
synced 2026-04-25 15:15:59 +03:00
[GH-ISSUE #24] [BUG] maxOutputTokens=1 injection causes cascade loop, exhausts Flash quota in seconds #21
Originally created by @yulin0629 on GitHub (Feb 19, 2026).
Original GitHub issue: https://github.com/NikkeTryHard/zerogravity/issues/24
Summary
When Claude Code connects via
`/v1/messages` (Anthropic Messages API), the MITM `maxOutputTokens=1` injection on Turn 0 creates an infinite loop with the LS binary's `MAX_TOKENS` auto-retry behavior. Combined with RESOURCE_EXHAUSTED retry logic, this amplifies 75 user API requests into 1,289 Google API calls, exhausting Gemini 3 Flash quota in under 3 seconds.

Environment

`ANTHROPIC_BASE_URL=http://localhost:8741`

Reproduction

- `POST /v1/token`
- `/v1/messages` requests to `gemini-3-flash`

Root Cause: Triple Amplification
1. Claude Code parallel requests (5x multiplier)

Claude Code sends 5 concurrent `/v1/messages` requests on startup. Each creates an independent cascade.

2. `maxOutputTokens=1` + LS auto-retry = infinite loop (~Nx multiplier)

Every MITM-modified request injects `maxOutputTokens=1`. The LS receives `finish_reason: MAX_TOKENS` and automatically sends the next turn on the same cascade (content messages grow: 5 → 6 → 7 → ...). The next turn is also injected with `maxOutputTokens=1`, creating an endless loop.

3. RESOURCE_EXHAUSTED retry creates new cascades (2x multiplier)

When Google returns RESOURCE_EXHAUSTED, the proxy retries up to 2 times, each time creating a new cascade that goes through the same `maxOutputTokens=1` loop.

Impact Numbers
- `maxOutputTokens=1` injections
- `MAX_TOKENS` finish reasons
- `RESOURCE_EXHAUSTED` errors

Timeline
- 10:49:21: `/v1/chat/completions` request, succeeded normally
- 11:25:50: `/v1/messages`
- 11:25:50: `maxOutputTokens=1`
- 11:25:52: `MAX_TOKENS` → LS retries → loop begins
- 11:25:53
- 11:25:53~55: `maxOutputTokens=1`
- 12:06:55

Estimated Token Consumption
Each of the 1,041
`maxOutputTokens=1` calls sends ~5,000 input tokens (22 KB modified body) but receives only 1 output token. Google's quota counts input+output: ~5.2 million input tokens consumed, enough to exhaust Flash quota instantly.
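A back-of-envelope check of those figures (the counts are taken from this issue; only the arithmetic is new):

```python
# Quota arithmetic from the counts reported above.
calls = 1_041                  # maxOutputTokens=1 calls observed in the logs
input_tokens_per_call = 5_000  # ~22 KB modified request body
output_tokens_per_call = 1     # maxOutputTokens=1 allows a single output token

total_tokens = calls * (input_tokens_per_call + output_tokens_per_call)
print(f"{total_tokens:,} tokens")  # 5,206,041 -> the ~5.2M figure above
```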
Suggested Fix
The
`maxOutputTokens=1` injection on Turn 0 should not persist across subsequent turns of the same cascade. Possible approaches:
- Inject `maxOutputTokens=1` only on the first turn of a cascade, then use the client's requested `max_tokens` for subsequent turns
- Detect a `MAX_TOKENS` finish reason on a `maxOutputTokens=1` turn and stop the cascade instead of letting the LS auto-retry
- Cap the number of `maxOutputTokens=1` rounds

Workaround
For users hitting this: avoid using
`gemini-3-flash` as the initial model when connecting Claude Code. Use `gemini-3-pro` or `opus-4.6` for the first session; these appear less affected (possibly a different quota bucket or rate limits).

@yulin0629 commented on GitHub (Feb 19, 2026):
Source-Level Root Cause Analysis
After tracing the MITM request modification flow using docker logs and the built-in trace files (`~/.config/zerogravity/traces/`), I can pinpoint the exact bug mechanism.

Key Finding: Claude Code sends `max_tokens=1` in preflight requests

Claude Code's startup sends "count" preflight pings with `max_tokens: 1` via the Anthropic Messages API. This is legitimate behavior: a token-counting probe, not a real generation request.

Evidence from MITM trace files (`modified_request.json`):
- `f1be9ae5` (11:25:52): "count" (5 chars)
- `3f2214fc` (11:25:59)
- `ab72950b` (11:33:14): "say hi in 5 words"
- `b41f1d3d` (10:49:21): "Say hello in one word"

The proxy faithfully forwards `max_tokens=1` → `generationConfig.maxOutputTokens=1`. This is correct for the first turn. The bug is what happens next.

Bug: `CascadeCache` replays `maxOutputTokens=1` on every subsequent turn

On Turn 0, the MITM proxy caches
`generation_params` (including `max_output_tokens=1`) in the per-cascade `CascadeCache`. On Turn 1+, it rebuilds `ToolContext` from the cache and replays `max_output_tokens=1` verbatim, injecting `generationConfig.maxOutputTokens=1` into every subsequent Google API request on that cascade.

The loop mechanism

1. Turn 0: `maxOutputTokens=1` → Google returns `finish_reason: MAX_TOKENS` (only 1 token allowed)
2. The LS sees `MAX_TOKENS` → automatically sends Turn 1 on the same cascade (trying to "continue")
3. The proxy rebuilds `ToolContext` from `CascadeCache` → injects `maxOutputTokens=1` again
4. `MAX_TOKENS` again → LS sends Turn 2 → infinite loop

Log evidence shows the content message count growing each turn (5 → 6 → 7 → ...).
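The loop described above can be sketched as a toy simulation (hypothetical names; the real proxy and LS internals are not shown in this thread), illustrating why the cascade only terminates when the cached value allows a normal finish:

```python
def simulate_cascade(cached_max_output_tokens: int, max_turns: int = 10) -> int:
    """Count turns until the cascade ends, or give up after max_turns."""
    content_messages = 5  # initial message count observed in the logs
    for turn in range(max_turns):
        # The cached value is replayed verbatim on every turn (the bug).
        if cached_max_output_tokens <= 1:
            finish_reason = "MAX_TOKENS"  # Google truncates after 1 token
        else:
            finish_reason = "end_turn"
        if finish_reason != "MAX_TOKENS":
            return turn + 1
        content_messages += 1  # LS appends a message and auto-retries
    return max_turns  # never terminated on its own

print(simulate_cascade(1))     # runs until the turn cap: 10
print(simulate_cascade(4096))  # finishes on the first turn: 1
```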
Suggested fix
The
`CascadeCache` should not replay `max_output_tokens` from a preflight request on subsequent turns. Options:
1. Don't cache `max_output_tokens` when the value is ≤ 1: treat it as a preflight and set it to `None` for subsequent turns
2. Clear `max_output_tokens` in the cache after Turn 0: subsequent turns use the LS default (16384)
3. On `finish_reason=MAX_TOKENS` with `maxOutputTokens=1`, stop the cascade instead of letting the LS auto-retry

Option 1 is the simplest and least invasive.
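Option 1 could look roughly like this (a sketch under stated assumptions: `CascadeCache` and `generation_params` are names from the analysis above, but the field layout here is invented):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class GenerationParams:
    max_output_tokens: Optional[int] = None
    temperature: Optional[float] = None

def sanitize_for_cache(params: GenerationParams) -> GenerationParams:
    """Drop max_output_tokens <= 1 before caching: it is a token-counting
    preflight, and replaying it on later turns causes the MAX_TOKENS loop."""
    if params.max_output_tokens is not None and params.max_output_tokens <= 1:
        return replace(params, max_output_tokens=None)  # LS default applies later
    return params
```

With `max_output_tokens=None` cached, subsequent turns would fall back to the LS default (16384) instead of replaying the preflight value.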
@NikkeTryHard commented on GitHub (Feb 19, 2026):
v1.1.6-beta.1 includes several fixes to the MITM pipeline that improve stability with multi-step tool calling and cascade handling. While this specific
`maxOutputTokens=1` loop issue requires a dedicated fix (tracking the turn counter), the improved `thought_signature` handling and cascade logic may reduce some of the amplification effects.

Please test with the beta and let us know if you still see the same quota exhaustion behavior:
Binary:
Docker:
A targeted fix for the
`maxOutputTokens=1` cascade loop is still planned.

@yulin0629 commented on GitHub (Feb 19, 2026):
v1.1.6-beta.1 Test Result: Bug Still Present
Tested with
`ghcr.io/nikketryhard/zerogravity:v1.1.6-beta.1`; the `maxOutputTokens=1` cascade loop is unchanged.

Test
Single
`max_tokens=1` request (simulating Claude Code preflight).

Result: Request timed out (loop ran until timeout).
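The exact test command was not preserved in this mirror; a hedged sketch of the payload shape (Anthropic Messages API, values taken from the trace evidence; the model name is illustrative):

```python
import json

# Hypothetical reproduction payload: a single preflight-style request
# with max_tokens=1, sent to the proxy at ANTHROPIC_BASE_URL
# (http://localhost:8741, path /v1/messages).
payload = {
    "model": "gemini-3-flash",  # illustrative; model routing is proxy-side
    "max_tokens": 1,            # simulates the Claude Code preflight probe
    "messages": [{"role": "user", "content": "count"}],
}
body = json.dumps(payload)
print(body)
```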
Log Evidence
10 turns in 9 seconds, content messages growing from 5 → 14,
`maxOutputTokens=1` injected on every turn. Identical behavior to v1.1.5.

Note
Normal requests (
`max_tokens=100`) work fine on the same build; the bug is specifically the cached `maxOutputTokens=1` replaying on subsequent cascade turns.

Stopped the container to preserve quota. Will wait for the dedicated fix.
@NikkeTryHard commented on GitHub (Feb 19, 2026):
yeah
`maxOutputTokens=1` problem still unfixed. if you have access to src please PR there if you have any ideas

@NikkeTryHard commented on GitHub (Feb 19, 2026):
i invited you!
@NikkeTryHard commented on GitHub (Feb 20, 2026):
Fixed in v1.1.8.
The
`maxOutputTokens` injection now enforces a minimum of 4096 tokens (`m.max(4096)`), preventing the `maxOutputTokens=1` loop. Additionally, thinking models skip `maxOutputTokens` entirely and use the proxy default (32000).

Verified with all three API paths:
- `/v1/messages`: `end_turn`, 189 chars
- `/v1/chat/completions` (`max_tokens`): `stop`, 143 chars
- `/v1/chat/completions` (`max_completion_tokens`): `stop`, 97 chars

No `MAX_TOKENS` finish reason, no cascade loop, no quota drain. Please upgrade and test.

@yulin0629 commented on GitHub (Feb 20, 2026):
v1.1.8 Test Results
Just tested the newly released
`v1.1.8` Docker image:

Test 1: Normal request (`max_tokens=100`): works correctly, response received.

Test 2: Issue #24 reproduction (`max_tokens=1`): still times out with cascade loop behavior.

Logs show the bug persists
The
`maxOutputTokens=1` is still being injected on every cascade turn, triggering the infinite loop.

Looking at the commit history between tags, it seems the fix (capping `maxOutputTokens` to a minimum of 4096) was merged into main after the v1.1.8 tag was created. Looking forward to the next release!

@NikkeTryHard commented on GitHub (Feb 20, 2026):
Give me ur
`zg report` for test 2 if possible

@NikkeTryHard commented on GitHub (Feb 20, 2026):
see if fixed with v1.1.9
@yulin0629 commented on GitHub (Feb 20, 2026):
v1.1.9 Test Results: Bug Fixed
Tested with
`ghcr.io/nikketryhard/zerogravity:v1.1.9` on macOS Docker (arm64).

Test 1: Baseline (`max_tokens=100`)
Result: `"text":"hello world"`, `stop_reason: end_turn`. Normal.

Test 2: Issue #24 reproduction (`max_tokens=1`)
Result: `"text":"hello world"`, `stop_reason: end_turn`. No timeout, no cascade loop.

Log Evidence
Every request shows
`maxOutputTokens=4096` (capped from 1). No `maxOutputTokens=1` injections, no `MAX_TOKENS` finish reason, no growing content message counts. The fix works.

Note
(gemini-3-flash quota was exhausted; tested with gemini-3-pro instead, same MITM pipeline)
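The v1.1.9 behavior reported above amounts to a floor clamp; a minimal sketch (the function name is illustrative, while the 4096 floor and 32000 thinking-model default are from this thread):

```python
def effective_max_output_tokens(requested: int, thinking_model: bool = False) -> int:
    """Clamp the injected maxOutputTokens so a max_tokens=1 preflight
    can no longer produce a MAX_TOKENS finish reason."""
    if thinking_model:
        return 32_000             # thinking models skip the cap, use proxy default
    return max(requested, 4_096)  # m.max(4096) per the v1.1.9 fix

print(effective_max_output_tokens(1))     # 4096, matches the log evidence
print(effective_max_output_tokens(8192))  # larger requests pass through: 8192
```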
@NikkeTryHard commented on GitHub (Feb 20, 2026):
thank you for this issue!