[GH-ISSUE #38] BUG: Connection pool exhaustion after 25+ hours - CLOSE_WAIT leak #27

Closed
opened 2026-02-27 07:17:31 +03:00 by kerem · 4 comments

Originally created by @bhaskoro-muthohar on GitHub (Jan 14, 2026).
Original GitHub issue: https://github.com/jwadow/kiro-gateway/issues/38

Kiro Gateway Version

v2.0.0-rc.1

What happened?

Summary

Gateway becomes unresponsive after running for 25+ hours. All requests hang for exactly 5 minutes before timing out. Restarting the gateway immediately fixes the issue.

Root Cause

TCP connections to the Kiro API accumulate in CLOSE_WAIT state. This happens because:

  1. The Kiro API (AWS) closes connections after responses complete (sends FIN)
  2. httpx receives the close but keeps the connection in the pool for reuse
  3. The connection enters CLOSE_WAIT state (remote closed, local hasn't)
  4. These zombie connections accumulate over time
  5. Eventually the pool is full of unusable connections → requests hang
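The accumulation in steps 1-5 can be sketched with a toy pool (illustration only; the class names here are invented for this example and are not httpx internals). The key point is a pool that hands connections back out without checking whether the remote peer has already closed its side:

```python
# Toy model of the failure mode: a naive pool reuses connections
# without checking whether the peer already sent FIN.

class Connection:
    def __init__(self):
        self.remote_closed = False  # True once the peer has sent FIN (CLOSE_WAIT)

    def usable(self):
        return not self.remote_closed


class NaivePool:
    def __init__(self, size):
        self.conns = [Connection() for _ in range(size)]

    def acquire(self):
        # Hands back the first pooled connection, dead or alive.
        return self.conns[0]


pool = NaivePool(size=3)

# Simulate the remote (e.g. an idle timeout on the AWS side) closing
# every pooled connection.
for c in pool.conns:
    c.remote_closed = True

conn = pool.acquire()
print(conn.usable())  # False: the pool hands out a zombie connection
```

Once every slot holds a zombie, each request draws a dead connection, waits out the read timeout, and hangs, which matches the 5-minute stalls reported above.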

Evidence

Running lsof on the gateway process shows many connections stuck in CLOSE_WAIT:

$ lsof -p <PID> | grep -E "TCP|ESTABLISHED"
python3.1 79422 user   13u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
python3.1 79422 user   14u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
python3.1 79422 user   15u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
... (25+ connections in CLOSE_WAIT)

Sample monitoring data:

Timestamp            ESTABLISHED  CLOSE_WAIT  Total_FDs  Memory_KB
2026-01-14 18:18:25  12           25          104        43680

Why response.aclose() doesn't fix it

The code in streaming_openai.py:309-314 correctly calls response.aclose():

finally:
    try:
        await response.aclose()
    except Exception as close_error:
        logger.debug(f"Error closing response: {close_error}")

However, response.aclose() closes the response, not the connection. httpx returns the connection to the pool for reuse. When the remote has already closed its side, the connection becomes a zombie in CLOSE_WAIT.
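The half-closed state itself can be demonstrated with stdlib sockets (a minimal sketch, independent of httpx): the peer closes its side, the local socket stays open in CLOSE_WAIT, and a zero-byte recv() is the signal by which the local side could detect the dead connection.

```python
# Demonstrates the half-closed state behind CLOSE_WAIT:
# the server closes (sends FIN), the client keeps its socket open.
import socket
import threading

server = socket.socket()
server.bind(("127.0.0.1", 0))  # ephemeral port on loopback
server.listen(1)
port = server.getsockname()[1]

def serve_once():
    conn, _ = server.accept()
    conn.close()  # server sends FIN immediately

t = threading.Thread(target=serve_once)
t.start()

client = socket.socket()
client.connect(("127.0.0.1", port))
t.join()  # FIN has been sent; client socket is now half-closed

# The peer has closed: recv() returns b"" (EOF). A pool that reuses
# this socket without such a check is reusing a CLOSE_WAIT zombie.
data = client.recv(1)
print(data == b"")  # True

client.close()
server.close()
```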

Reproduction

  1. Start the gateway
  2. Use it normally for 24+ hours
  3. Monitor connections: watch -n 60 'lsof -p $(pgrep -f "python main.py") | grep CLOSE_WAIT | wc -l'
  4. Observe CLOSE_WAIT count increasing over time
  5. Eventually requests start timing out (5 minute hangs)

Environment

  • OS: macOS (Darwin 24.5.0)
  • Python: 3.12
  • Running as: python main.py

Suggested Fixes

  1. Reduce keepalive_expiry in main.py from 30s to 5-10s:

    limits = httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20,
        keepalive_expiry=5.0  # Was 30.0
    )
    
  2. Add Connection: close header for streaming requests to prevent pooling:

    headers["Connection"] = "close"
    
  3. Use per-request clients for streaming instead of the shared pool
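Suggested fix 3 could look roughly like the sketch below (the function name, URL, and payload are hypothetical, not the gateway's real API). Closing the per-request client tears the connection down outright instead of returning it to a shared pool:

```python
# Sketch of fix 3: one short-lived httpx client per streaming request,
# so the connection is closed when the stream ends rather than pooled.
async def stream_without_pooling(url, payload):
    import httpx  # imported lazily so the sketch stays self-contained

    # One client per request: closing the client closes its connections.
    async with httpx.AsyncClient(timeout=httpx.Timeout(300.0)) as client:
        async with client.stream("POST", url, json=payload) as response:
            async for chunk in response.aiter_bytes():
                yield chunk
    # Leaving the `async with` blocks closes the connection; nothing
    # is left behind in CLOSE_WAIT.
```

The trade-off is losing connection reuse, so each streaming request pays a fresh TCP/TLS handshake; for long-lived streams that cost is usually negligible.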

Debug Logs

No debug logs are available for this issue; it's a gradual resource leak, not an immediate error.

However, here's a monitoring script that tracks connection states over time:

#!/bin/bash
# monitor.sh - tracks connection states
while true; do
  PID=$(pgrep -f "python main.py" | head -1)
  if [ -n "$PID" ]; then
    LSOF_OUTPUT=$(lsof -p "$PID" 2>/dev/null)
    ESTABLISHED=$(echo "$LSOF_OUTPUT" | grep -c ESTABLISHED)
    CLOSE_WAIT=$(echo "$LSOF_OUTPUT" | grep -c CLOSE_WAIT)
    TOTAL_FDS=$(echo "$LSOF_OUTPUT" | tail -n +2 | wc -l)  # skip the lsof header line
    MEMORY=$(ps -o rss= -p "$PID" 2>/dev/null | tr -d ' ')
    echo "$(date '+%Y-%m-%d %H:%M:%S') $ESTABLISHED $CLOSE_WAIT $TOTAL_FDS $MEMORY"
  fi
  sleep 60
done

Additional Notes

  • The 5-minute timeout matches read_timeout=300.0s in the config
  • This is likely related to how AWS/Kiro API handles keep-alive connections
  • The issue is more pronounced with frequent streaming requests
kerem closed this issue and added the bug and fixed labels (2026-02-27 07:17:31 +03:00).

@jwadow commented on GitHub (Jan 14, 2026):

Hi, good to see you again. I've added Connection: close for streaming requests. The issue is that AWS closes the connection after streaming, but httpx returns it to the pool as if it were still alive; now we explicitly tell it not to reuse the connection.

You can verify on your side: run it for a day and check lsof.


@bhaskoro-muthohar commented on GitHub (Jan 15, 2026):

@jwadow thanks for the quick fix! 🙏

Just updated to the latest commit and restarted the gateway. Initial results look promising:

# Before restart (old process)
2026-01-15 09:57:53 75 30      171 44944   # 30 CLOSE_WAIT

# After restart with fix
$ lsof -p <PID> | grep -c CLOSE_WAIT
0

CLOSE_WAIT dropped from 30 to 0 immediately after restart.

Will keep monitoring for the next 24 hours to confirm it stays low under normal usage. Will report back with the results.


@bhaskoro-muthohar commented on GitHub (Jan 16, 2026):

24+ Hour Monitoring Results

The fix is confirmed working. Here's the data after running for over 24 hours:

CLOSE_WAIT Distribution

CLOSE_WAIT Count   Occurrences   Phase
0-1                646           After fix (stable)
25-30              181           Before fix (leaking)
116-125            4             Before fix (peak)

Current State

Timestamp            ESTABLISHED  CLOSE_WAIT  Total_FDs  Memory_KB
2026-01-16 16:43:29  1            1           68         10112

Summary

Metric       Before Fix             After Fix (24h)
CLOSE_WAIT   25-30 (accumulating)   1 (stable)
Total FDs    100-180                67-72
Memory       ~55 MB                 ~30 MB

The Connection: close header fix completely resolved the issue. CLOSE_WAIT no longer accumulates over time.

Thanks again for the quick turnaround! 🎉


@jwadow commented on GitHub (Jan 16, 2026):

Really appreciate you running it for the full 24 hours and putting together those tables.
Didn't expect the memory to drop that much too, that's a welcome side effect.

Thanks for the detailed report, closing this one.
