[GH-ISSUE #1239] HARAKIRI Timeout Causing 5xx Errors in Healthchecks #836

Closed
opened 2026-02-25 23:43:45 +03:00 by kerem · 1 comment
Owner

Originally created by @cw-sarvesh on GitHub (Dec 5, 2025).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/1239

Summary

The Healthchecks application is experiencing HARAKIRI timeout events that result in 5xx HTTP errors, particularly on the /ping/ endpoint. uWSGI workers are killed when requests exceed the timeout threshold, causing intermittent service disruptions.

Environment

  • Application: Healthchecks (healthchecks/healthchecks Docker image)
  • Deployment: Kubernetes (EKS)
  • Container Image: docker.io/healthchecks/healthchecks:latest

Problem Description

The application experiences HARAKIRI timeout events approximately every hour, where uWSGI workers are killed due to requests exceeding the configured timeout threshold. This results in:

  1. 5xx HTTP errors (likely 502/504) being returned to clients
  2. Worker respawns after HARAKIRI events
  3. Service disruption for affected requests, particularly on the /ping/ endpoint

Observed Behavior

HARAKIRI Timeout Event

Logs:

Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***
Fri Dec  5 03:31:20 2025 - HARAKIRI !!! worker 4 status !!!
Fri Dec  5 03:31:20 2025 - HARAKIRI [core 0] 10.10.230.46 - POST /ping/xx
Fri Dec  5 03:31:20 2025 - HARAKIRI triggered by worker 4 core 0 !!!
Fri Dec  5 03:31:20 2025 - HARAKIRI !!! end of worker 4 status !!!

Worker Respawn:

DAMN ! worker 3 (pid: 185) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 190)

Performance Metrics

  • CPU Usage: ~0.002-0.006 CPU seconds/second (low, not a resource constraint)
  • Memory Usage: ~250-285 MB working set (stable)
  • Request Latencies: Most requests complete in <100ms, but some /ping/ requests show latencies up to 1899ms
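The latency tail above can be sanity-checked against the harakiri budget. A minimal sketch, assuming an illustrative threshold and warn ratio (neither the actual harakiri setting nor the full latency distribution is given in this report; `at_risk` is a hypothetical helper):

```python
# Hypothetical helper: flag request latencies that approach a harakiri limit.
# HARAKIRI_S and WARN_RATIO are illustrative, not taken from the deployment.
HARAKIRI_S = 30          # example uWSGI harakiri setting, in seconds
WARN_RATIO = 0.5         # warn when a request uses half the kill budget

def at_risk(latencies_ms, harakiri_s=HARAKIRI_S, warn_ratio=WARN_RATIO):
    """Return the latencies (in ms) that exceed warn_ratio of the harakiri budget."""
    budget_ms = harakiri_s * 1000 * warn_ratio
    return [ms for ms in latencies_ms if ms >= budget_ms]

# Sample values echoing the report: most requests <100 ms, one outlier at 1899 ms.
samples = [42, 67, 88, 1899]
print(at_risk(samples, harakiri_s=2))  # with a 2 s harakiri, only 1899 ms is flagged
```

If the real harakiri setting is much larger than the observed 1899 ms tail, the kills are more likely caused by requests stuck far longer than the metrics capture (e.g. blocked on the database), not by this tail latency alone.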

Root Cause Analysis

HARAKIRI is a uWSGI feature that kills workers when requests take longer than the configured timeout. The observed behavior suggests:

  1. Timeout Configuration: The uWSGI harakiri timeout is likely set too low for certain operations
  2. Slow Operations: The /ping/ endpoint may be performing operations (database queries, external API calls, etc.) that occasionally exceed the timeout
  3. Resource Contention: While CPU/memory metrics appear normal, there may be database connection pool exhaustion or lock contention during peak times
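If point 1 is confirmed, one mitigation is raising the harakiri threshold in the uWSGI configuration. A hedged sketch (the option names `harakiri` and `harakiri-verbose` are standard uWSGI options, but the values are illustrative and the uwsgi.ini shipped in the healthchecks image may differ):

```ini
[uwsgi]
; Illustrative values - tune to the observed tail latency, do not copy blindly.
harakiri = 60           ; kill a worker only after 60 s instead of a lower default
harakiri-verbose = true ; log extra worker state when harakiri fires, to aid diagnosis
```

Raising the timeout only hides slow requests; if points 2 or 3 apply, the underlying slow database queries or pool exhaustion should be fixed as well.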

Expected Behavior

  • All /ping/ requests should complete successfully without triggering HARAKIRI timeouts
  • Workers should not be killed due to timeout violations
  • No 5xx errors should be returned to clients for normal operations

Steps to Reproduce

  1. Monitor healthchecks application logs for HARAKIRI events
  2. Observe the pattern; events appear to occur approximately every hour
  3. Check for 5xx errors in application metrics/logs around the same time
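The log monitoring in step 1 can be sketched as a small filter over uWSGI log lines. This is a hypothetical helper (the regex is derived from the log excerpt above; the function name is ours):

```python
import re
from collections import Counter

# Matches the trigger line from the uWSGI log excerpt above, e.g.:
# "Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***"
HARAKIRI_RE = re.compile(r"\*\*\* HARAKIRI ON WORKER (\d+) \(pid: (\d+)")

def count_harakiri(lines):
    """Count HARAKIRI trigger events per worker number from uWSGI log lines."""
    per_worker = Counter()
    for line in lines:
        m = HARAKIRI_RE.search(line)
        if m:
            per_worker[int(m.group(1))] += 1
    return per_worker

log = [
    "Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***",
    "Fri Dec  5 03:31:20 2025 - HARAKIRI !!! worker 4 status !!!",
]
print(count_harakiri(log))  # Counter({4: 1})
```

Feeding a full day of logs (e.g. via `kubectl logs`) through this filter and bucketing by timestamp would confirm or refute the roughly hourly pattern.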

Labels: bug, performance, uwsgi, timeout, 5xx-errors

kerem closed this issue 2026-02-25 23:43:45 +03:00

@cuu508 commented on GitHub (Dec 5, 2025):

AI slop
