[GH-ISSUE #11] FR: enable thinking/reasoning mode via thinking tags? #12

New issue

Closed

opened 2026-02-27 07:17:23 +03:00 by kerem · 2 comments

kerem commented

2026-02-27 07:17:23 +03:00

Owner

Originally created by @JoeGrimes123 on GitHub (Dec 28, 2025).
Original GitHub issue: https://github.com/jwadow/kiro-gateway/issues/11

Apparently, if you add thinking_mode tag set to 'enabled', you will get response (still within content key) that contains similar to llm reasoning.

w/out thinking_mode tag:

req:
"messages": [ {"role": "user", "content": "Whats 2+2"} ]

res:
"message": {"role": "assistant", "content": "2 + 2 = **4**"}

w/ thinking_mode tag

req:
"messages": [ {"role": "user", "content": "<thinking_mode>enabled</thinking_mode>\n<max_thinking_length>32000</max_thinking_length>\n\nWhats 2+2"} ]

res:
"message": {"role": "assistant", "content": "<thinking>\nThe user is asking a simple arithmetic question: 2+2.\n\nThe answer is 4.\n</thinking>\n\n2 + 2 = **4**"}

If it is indeed how Kiro API handles thinking, I guess you can translate it to openai format so any client can pass the reasoning parameter though I think this is gonna be hard to implement since it'll be more prone to tool calling failures and errors like what happens in Antigravity models w/ thoughtsignature and what not.
Also I'm not sure if it's possible to create chain of thought/interleaved reasoning from this but it'd be fantastic if you can get it to work

Originally created by @JoeGrimes123 on GitHub (Dec 28, 2025). Original GitHub issue: https://github.com/jwadow/kiro-gateway/issues/11 Apparently, if you add thinking_mode tag set to 'enabled', you will get response (still within content key) that contains similar to llm reasoning. w/out thinking_mode tag: req: `"messages": [ {"role": "user", "content": "Whats 2+2"} ]` res: `"message": {"role": "assistant", "content": "2 + 2 = **4**"}` w/ thinking_mode tag req: `"messages": [ {"role": "user", "content": "<thinking_mode>enabled</thinking_mode>\n<max_thinking_length>32000</max_thinking_length>\n\nWhats 2+2"} ]` res: `"message": {"role": "assistant", "content": "<thinking>\nThe user is asking a simple arithmetic question: 2+2.\n\nThe answer is 4.\n</thinking>\n\n2 + 2 = **4**"}` If it is indeed how Kiro API handles thinking, I guess you can translate it to openai format so any client can pass the reasoning parameter though I think this is gonna be hard to implement since it'll be more prone to tool calling failures and errors like what happens in Antigravity models w/ thoughtsignature and what not. Also I'm not sure if it's possible to create chain of thought/interleaved reasoning from this but it'd be fantastic if you can get it to work

kerem

2026-02-27 07:17:23 +03:00

closed this issue
added the
fixed

enhancement
labels

kerem commented

2026-02-27 07:17:24 +03:00

Author

Owner

@jwadow commented on GitHub (Dec 28, 2025):

Hi, I hadn't thought about this method, it's actually a brilliant solution. When I have time, I'll definitely try it out. I'm really excited about it now. Thanks for the tip.

This is also cool because in some situations, the model breaks down and starts reasoning out loud, clogging up the context with tokens and poisons itself. Otherwise, it will stuff all the junk into the reasoning tags.

On the other hand, Kiro has a limitation, probably 8192 output tokens per request (700-800 lines in VS Code), which is impossible to bypass. Consequently, some responses may be short, since your reasoning hack is essentially "content" and not "reasoning."

@jwadow commented on GitHub (Dec 28, 2025): Hi, I hadn't thought about this method, it's actually a brilliant solution. When I have time, I'll definitely try it out. I'm really excited about it now. Thanks for the tip. This is also cool because in some situations, the model breaks down and starts reasoning out loud, clogging up the context with tokens and poisons itself. Otherwise, it will stuff all the junk into the reasoning tags. On the other hand, Kiro has a limitation, probably 8192 output tokens per request (700-800 lines in VS Code), which is impossible to bypass. Consequently, some responses may be short, since your reasoning hack is essentially "content" and not "reasoning."

kerem commented

2026-02-27 07:17:25 +03:00

Author

Owner

@jwadow commented on GitHub (Jan 3, 2026):

@JoeGrimes123

Done! Merged in the latest commit (git clone or in the future v1.0.8)

Added tag injection with FSM-based streaming parser that handles chunks properly. Converts to OpenAI reasoning_content format. Enabled by default.

Config: FAKE_REASONING_ENABLED, FAKE_REASONING_MAX_TOKENS (4000), FAKE_REASONING_HANDLING.

Yeah the 8k output limit is a real constraint - thinking eats into that budget. But for most cases it's fine, and you can always turn it off.

Thanks for finding this, was a fun one to implement.

@jwadow commented on GitHub (Jan 3, 2026): @JoeGrimes123 Done! Merged in the latest commit (git clone or in the future v1.0.8) Added tag injection with FSM-based streaming parser that handles chunks properly. Converts to OpenAI reasoning_content format. Enabled by default. Config: FAKE_REASONING_ENABLED, FAKE_REASONING_MAX_TOKENS (4000), FAKE_REASONING_HANDLING. Yeah the 8k output limit is a real constraint - thinking eats into that budget. But for most cases it's fine, and you can always turn it off. Thanks for finding this, was a fun one to implement.

kerem referenced this issue

2026-02-27 07:17:25 +03:00

[GH-ISSUE #16] [Bug]: AWS SSO OIDC - API host incorrectly uses SSO region instead of us-east-1 #13