mirror of
https://github.com/jwadow/kiro-gateway.git
synced 2026-04-25 01:15:57 +03:00
[GH-ISSUE #38] BUG: Connection pool exhaustion after 25+ hours - CLOSE_WAIT leak #27
Originally created by @bhaskoro-muthohar on GitHub (Jan 14, 2026).
Original GitHub issue: https://github.com/jwadow/kiro-gateway/issues/38
Kiro Gateway Version
v2.0.0-rc.1
What happened?
Summary
Gateway becomes unresponsive after running for 25+ hours. All requests hang for exactly 5 minutes before timing out. Restarting the gateway immediately fixes the issue.
Root Cause

TCP connections to the Kiro API accumulate in the `CLOSE_WAIT` state (the remote has closed its side of the connection, but the local side hasn't). Once enough of these half-closed connections pile up, the pool is exhausted and new requests hang.

Evidence
Running `lsof` on the gateway process shows many connections stuck in `CLOSE_WAIT`. Sample monitoring data:
Why `response.aclose()` doesn't fix it

The code in `streaming_openai.py:309-314` correctly calls `response.aclose()`. However, `response.aclose()` closes the response, not the connection: httpx returns the connection to the pool for reuse. When the remote has already closed its side, the pooled connection becomes a zombie in `CLOSE_WAIT`.

Reproduction
watch -n 60 'lsof -p $(pgrep -f "python main.py") | grep CLOSE_WAIT | wc -l'

Environment
Gateway run with `python main.py`.

Suggested Fixes
1. Reduce `keepalive_expiry` in `main.py` from 30s to 5-10s.
2. Add a `Connection: close` header for streaming requests to prevent pooling.
3. Use per-request clients for streaming instead of the shared pool.
Debug Logs
No debug logs available for this issue - it's a gradual resource leak, not an immediate error.
However, here's a monitoring script that tracks connection states over time:
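A minimal sketch of what such a monitor might look like (the helper names and `lsof` parsing here are assumptions, not the original script):

```python
# Hypothetical connection-state monitor: tallies TCP states reported by
# `lsof -p <pid>`. count_states() is pure, so it can be tested offline.
import subprocess
from collections import Counter

def count_states(lsof_lines):
    """Count TCP states such as CLOSE_WAIT or ESTABLISHED."""
    states = Counter()
    for line in lsof_lines:
        line = line.rstrip()
        # lsof prints the state in parentheses at the end of TCP lines,
        # e.g. "... TCP 10.0.0.1:5000->1.2.3.4:443 (CLOSE_WAIT)"
        if line.endswith(")") and "(" in line:
            states[line.rsplit("(", 1)[1].rstrip(")")] += 1
    return states

def snapshot(pid):
    """Run lsof against a live process and tally its TCP states."""
    out = subprocess.run(
        ["lsof", "-p", str(pid), "-iTCP"],
        capture_output=True, text=True,
    ).stdout
    return count_states(out.splitlines())
```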
Additional Notes
The 5-minute hang matches `read_timeout=300.0s` in the config.

@jwadow commented on GitHub (Jan 14, 2026):
Hi, good to see you again. I've added a `Connection: close` header for streaming requests. The issue is that AWS closes the connection after streaming, but httpx returns it to the pool as if it were still alive. Now we explicitly tell httpx not to reuse it.
You can verify on your side: run it for a day and check `lsof`.
@bhaskoro-muthohar commented on GitHub (Jan 15, 2026):
@jwadow thanks for the quick fix! 🙏
Just updated to the latest commit and restarted the gateway. Initial results look promising:
CLOSE_WAIT dropped from 30 to 0 immediately after restart.
Will keep monitoring for the next 24 hours to confirm it stays low under normal usage. Will report back with the results.
@bhaskoro-muthohar commented on GitHub (Jan 16, 2026):
24+ Hour Monitoring Results ✅
The fix is confirmed working. Here's the data after running for over 24 hours:
CLOSE_WAIT Distribution
Current State
Summary
The `Connection: close` header fix completely resolved the issue. CLOSE_WAIT no longer accumulates over time.

Thanks again for the quick turnaround! 🎉
@jwadow commented on GitHub (Jan 16, 2026):
Really appreciate you running it for the full 24 hours and putting together those tables.
Didn't expect the memory to drop that much too, that's a welcome side effect.
Thanks for the detailed report, closing this one.