[GH-ISSUE #38] BUG: Connection pool exhaustion after 25+ hours - CLOSE_WAIT leak #27

Closed
opened 2026-02-27 07:17:31 +03:00 by kerem · 4 comments

Originally created by @bhaskoro-muthohar on GitHub (Jan 14, 2026).
Original GitHub issue: https://github.com/jwadow/kiro-gateway/issues/38

Kiro Gateway Version

v2.0.0-rc.1

What happened?

Summary

Gateway becomes unresponsive after running for 25+ hours. All requests hang for exactly 5 minutes before timing out. Restarting the gateway immediately fixes the issue.

Root Cause

TCP connections to the Kiro API accumulate in CLOSE_WAIT state. This happens because:

  1. The Kiro API (AWS) closes connections after responses complete (sends FIN)
  2. httpx receives the close but keeps the connection in the pool for reuse
  3. The connection enters CLOSE_WAIT state (remote closed, local hasn't)
  4. These zombie connections accumulate over time
  5. Eventually the pool is full of unusable connections → requests hang
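The accumulation in steps 1-5 can be sketched with a toy pool (illustration only; the class names here are invented for this example and are not httpx internals). The key point is a pool that hands connections back out without checking whether the remote peer has already closed its side:

```python
# Toy model of the failure mode: a naive pool reuses connections
# without checking whether the peer already sent FIN.

class Connection:
    def __init__(self):
        self.remote_closed = False  # True once the peer has sent FIN (CLOSE_WAIT)

    def usable(self):
        return not self.remote_closed


class NaivePool:
    def __init__(self, size):
        self.conns = [Connection() for _ in range(size)]

    def acquire(self):
        # Hands back the first pooled connection, dead or alive.
        return self.conns[0]


pool = NaivePool(size=3)

# Simulate the remote (e.g. an idle timeout on the AWS side) closing
# every pooled connection.
for c in pool.conns:
    c.remote_closed = True

conn = pool.acquire()
print(conn.usable())  # False: the pool hands out a zombie connection
```

Once every slot holds a zombie, each request draws a dead connection, waits out the read timeout, and hangs, which matches the 5-minute stalls reported above.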

Evidence

Running lsof on the gateway process shows many connections stuck in CLOSE_WAIT:

$ lsof -p <PID> | grep -E "TCP|ESTABLISHED"
python3.1 79422 user   13u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
python3.1 79422 user   14u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
python3.1 79422 user   15u  IPv4  TCP ...->ec2-54-175-190-121.compute-1.amazonaws.com:https (CLOSE_WAIT)
... (25+ connections in CLOSE_WAIT)

Sample monitoring data:

Timestamp            ESTABLISHED  CLOSE_WAIT  Total_FDs  Memory_KB
2026-01-14 18:18:25  12           25          104        43680

Why response.aclose() doesn't fix it

The code in streaming_openai.py:309-314 correctly calls response.aclose():

finally:
    try:
        await response.aclose()
    except Exception as close_error:
        logger.debug(f"Error closing response: {close_error}")

However, response.aclose() closes the response, not the connection. httpx returns the connection to the pool for reuse. When the remote has already closed its side, the connection becomes a zombie in CLOSE_WAIT.
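The half-closed state itself can be demonstrated with stdlib sockets (a minimal sketch, independent of httpx): the peer closes its side, the local socket stays open in CLOSE_WAIT, and a zero-byte recv() is the signal by which the local side could detect the dead connection.

```python
# Demonstrates the half-closed state behind CLOSE_WAIT:
# the server closes (sends FIN), the client keeps its socket open.
import socket
import threading

server = socket.socket()
server.bind(("127.0.0.1", 0))  # ephemeral port on loopback
server.listen(1)
port = server.getsockname()[1]

def serve_once():
    conn, _ = server.accept()
    conn.close()  # server sends FIN immediately

t = threading.Thread(target=serve_once)
t.start()

client = socket.socket()
client.connect(("127.0.0.1", port))
t.join()  # FIN has been sent; client socket is now half-closed

# The peer has closed: recv() returns b"" (EOF). A pool that reuses
# this socket without such a check is reusing a CLOSE_WAIT zombie.
data = client.recv(1)
print(data == b"")  # True

client.close()
server.close()
```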

Reproduction

  1. Start the gateway
  2. Use it normally for 24+ hours
  3. Monitor connections: watch -n 60 'lsof -p $(pgrep -f "python main.py") | grep CLOSE_WAIT | wc -l'
  4. Observe CLOSE_WAIT count increasing over time
  5. Eventually requests start timing out (5 minute hangs)

Environment

  • OS: macOS (Darwin 24.5.0)
  • Python: 3.12
  • Running as: python main.py

Suggested Fixes

  1. Reduce keepalive_expiry in main.py from 30s to 5-10s:

    limits = httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20,
        keepalive_expiry=5.0  # Was 30.0
    )
    
  2. Add Connection: close header for streaming requests to prevent pooling:

    headers["Connection"] = "close"
    
  3. Use per-request clients for streaming instead of the shared pool
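Suggested fix 3 could look roughly like the sketch below (the function name, URL, and payload are hypothetical, not the gateway's real API). Closing the per-request client tears the connection down outright instead of returning it to a shared pool:

```python
# Sketch of fix 3: one short-lived httpx client per streaming request,
# so the connection is closed when the stream ends rather than pooled.
async def stream_without_pooling(url, payload):
    import httpx  # imported lazily so the sketch stays self-contained

    # One client per request: closing the client closes its connections.
    async with httpx.AsyncClient(timeout=httpx.Timeout(300.0)) as client:
        async with client.stream("POST", url, json=payload) as response:
            async for chunk in response.aiter_bytes():
                yield chunk
    # Leaving the `async with` blocks closes the connection; nothing
    # is left behind in CLOSE_WAIT.
```

The trade-off is losing connection reuse, so each streaming request pays a fresh TCP/TLS handshake; for long-lived streams that cost is usually negligible.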

Debug Logs

No debug logs are available for this issue; it's a gradual resource leak, not an immediate error.

However, here's a monitoring script that tracks connection states over time:

#!/bin/bash
# monitor.sh - tracks connection states
while true; do
  PID=$(pgrep -f "python main.py" | head -1)
  if [ -n "$PID" ]; then
    LSOF_OUTPUT=$(lsof -p "$PID" 2>/dev/null)
    ESTABLISHED=$(echo "$LSOF_OUTPUT" | grep -c ESTABLISHED)
    CLOSE_WAIT=$(echo "$LSOF_OUTPUT" | grep -c CLOSE_WAIT)
    TOTAL_FDS=$(echo "$LSOF_OUTPUT" | tail -n +2 | wc -l)  # skip the lsof header line
    MEMORY=$(ps -o rss= -p "$PID" 2>/dev/null | tr -d ' ')
    echo "$(date '+%Y-%m-%d %H:%M:%S') $ESTABLISHED $CLOSE_WAIT $TOTAL_FDS $MEMORY"
  fi
  sleep 60
done

Additional Notes

  • The 5-minute timeout matches read_timeout=300.0s in the config
  • This is likely related to how AWS/Kiro API handles keep-alive connections
  • The issue is more pronounced with frequent streaming requests
kerem closed this issue and added the bug and fixed labels (2026-02-27 07:17:31 +03:00).

@jwadow commented on GitHub (Jan 14, 2026):

Hi, good to see you again. I've added Connection: close for streaming requests. The issue is that AWS closes the connection after streaming, but httpx returns it to the pool as if it were still alive; now we explicitly tell it not to reuse the connection.

You can verify on your side: run it for a day and check lsof.


@bhaskoro-muthohar commented on GitHub (Jan 15, 2026):

@jwadow thanks for the quick fix! 🙏

Just updated to the latest commit and restarted the gateway. Initial results look promising:

# Before restart (old process)
2026-01-15 09:57:53 75 30      171 44944   # 30 CLOSE_WAIT

# After restart with fix
$ lsof -p <PID> | grep -c CLOSE_WAIT
0

CLOSE_WAIT dropped from 30 to 0 immediately after restart.

Will keep monitoring for the next 24 hours to confirm it stays low under normal usage. Will report back with the results.


@bhaskoro-muthohar commented on GitHub (Jan 16, 2026):

24+ Hour Monitoring Results

The fix is confirmed working. Here's the data after running for over 24 hours:

CLOSE_WAIT Distribution

CLOSE_WAIT Count   Occurrences   Phase
0-1                646           After fix (stable)
25-30              181           Before fix (leaking)
116-125            4             Before fix (peak)

Current State

Timestamp            ESTABLISHED  CLOSE_WAIT  Total_FDs  Memory_KB
2026-01-16 16:43:29  1            1           68         10112

Summary

Metric       Before Fix             After Fix (24h)
CLOSE_WAIT   25-30 (accumulating)   1 (stable)
Total FDs    100-180                67-72
Memory       ~55 MB                 ~30 MB

The Connection: close header fix completely resolved the issue. CLOSE_WAIT no longer accumulates over time.

Thanks again for the quick turnaround! 🎉


@jwadow commented on GitHub (Jan 16, 2026):

Really appreciate you running it for the full 24 hours and putting together those tables.
Didn't expect the memory to drop that much too, that's a welcome side effect.

Thanks for the detailed report, closing this one.
