[GH-ISSUE #1239] HARAKIRI Timeout Causing 5xx Errors in Healthchecks #836

Closed
opened 2026-02-25 23:43:45 +03:00 by kerem · 1 comment
Owner

Originally created by @cw-sarvesh on GitHub (Dec 5, 2025).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/1239

Summary

The Healthchecks application is experiencing HARAKIRI timeout events that result in 5xx HTTP errors, particularly on the /ping/ endpoint. uWSGI workers are killed when requests exceed the timeout threshold, causing intermittent service disruptions.

Environment

  • Application: Healthchecks (healthchecks/healthchecks Docker image)
  • Deployment: Kubernetes (EKS)
  • Container Image: docker.io/healthchecks/healthchecks:latest

Problem Description

The application experiences HARAKIRI timeout events approximately every hour, where uWSGI workers are killed due to requests exceeding the configured timeout threshold. This results in:

  1. 5xx HTTP errors (likely 502/504) being returned to clients
  2. Worker respawns after HARAKIRI events
  3. Service disruption for affected requests, particularly on the /ping/ endpoint

Observed Behavior

HARAKIRI Timeout Event

Logs:

Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***
Fri Dec  5 03:31:20 2025 - HARAKIRI !!! worker 4 status !!!
Fri Dec  5 03:31:20 2025 - HARAKIRI [core 0] 10.10.230.46 - POST /ping/xx
Fri Dec  5 03:31:20 2025 - HARAKIRI triggered by worker 4 core 0 !!!
Fri Dec  5 03:31:20 2025 - HARAKIRI !!! end of worker 4 status !!!

Worker Respawn:

DAMN ! worker 3 (pid: 185) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 190)

Performance Metrics

  • CPU Usage: ~0.002-0.006 CPU seconds/second (low, not a resource constraint)
  • Memory Usage: ~250-285 MB working set (stable)
  • Request Latencies: Most requests complete in <100ms, but some /ping/ requests show latencies up to 1899ms
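The latency tail above can be sanity-checked against the harakiri budget. A minimal sketch, assuming an illustrative threshold and warn ratio (neither the actual harakiri setting nor the full latency distribution is given in this report; `at_risk` is a hypothetical helper):

```python
# Hypothetical helper: flag request latencies that approach a harakiri limit.
# HARAKIRI_S and WARN_RATIO are illustrative, not taken from the deployment.
HARAKIRI_S = 30          # example uWSGI harakiri setting, in seconds
WARN_RATIO = 0.5         # warn when a request uses half the kill budget

def at_risk(latencies_ms, harakiri_s=HARAKIRI_S, warn_ratio=WARN_RATIO):
    """Return the latencies (in ms) that exceed warn_ratio of the harakiri budget."""
    budget_ms = harakiri_s * 1000 * warn_ratio
    return [ms for ms in latencies_ms if ms >= budget_ms]

# Sample values echoing the report: most requests <100 ms, one outlier at 1899 ms.
samples = [42, 67, 88, 1899]
print(at_risk(samples, harakiri_s=2))  # with a 2 s harakiri, only 1899 ms is flagged
```

If the real harakiri setting is much larger than the observed 1899 ms tail, the kills are more likely caused by requests stuck far longer than the metrics capture (e.g. blocked on the database), not by this tail latency alone.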

Root Cause Analysis

HARAKIRI is a uWSGI feature that kills workers when requests take longer than the configured timeout. The observed behavior suggests:

  1. Timeout Configuration: The uWSGI harakiri timeout is likely set too low for certain operations
  2. Slow Operations: The /ping/ endpoint may be performing operations (database queries, external API calls, etc.) that occasionally exceed the timeout
  3. Resource Contention: While CPU/memory metrics appear normal, there may be database connection pool exhaustion or lock contention during peak times
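If point 1 is confirmed, one mitigation is raising the harakiri threshold in the uWSGI configuration. A hedged sketch (the option names `harakiri` and `harakiri-verbose` are standard uWSGI options, but the values are illustrative and the uwsgi.ini shipped in the healthchecks image may differ):

```ini
[uwsgi]
; Illustrative values - tune to the observed tail latency, do not copy blindly.
harakiri = 60           ; kill a worker only after 60 s instead of a lower default
harakiri-verbose = true ; log extra worker state when harakiri fires, to aid diagnosis
```

Raising the timeout only hides slow requests; if points 2 or 3 apply, the underlying slow database queries or pool exhaustion should be fixed as well.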

Expected Behavior

  • All /ping/ requests should complete successfully without triggering HARAKIRI timeouts
  • Workers should not be killed due to timeout violations
  • No 5xx errors should be returned to clients for normal operations

Steps to Reproduce

  1. Monitor healthchecks application logs for HARAKIRI events
  2. Observe the pattern; events appear to occur approximately every hour
  3. Check for 5xx errors in application metrics/logs around the same time
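The log monitoring in step 1 can be sketched as a small filter over uWSGI log lines. This is a hypothetical helper (the regex is derived from the log excerpt above; the function name is ours):

```python
import re
from collections import Counter

# Matches the trigger line from the uWSGI log excerpt above, e.g.:
# "Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***"
HARAKIRI_RE = re.compile(r"\*\*\* HARAKIRI ON WORKER (\d+) \(pid: (\d+)")

def count_harakiri(lines):
    """Count HARAKIRI trigger events per worker number from uWSGI log lines."""
    per_worker = Counter()
    for line in lines:
        m = HARAKIRI_RE.search(line)
        if m:
            per_worker[int(m.group(1))] += 1
    return per_worker

log = [
    "Fri Dec  5 03:31:20 2025 - *** HARAKIRI ON WORKER 4 (pid: 186, try: 1, graceful: yes) ***",
    "Fri Dec  5 03:31:20 2025 - HARAKIRI !!! worker 4 status !!!",
]
print(count_harakiri(log))  # Counter({4: 1})
```

Feeding a full day of logs (e.g. via `kubectl logs`) through this filter and bucketing by timestamp would confirm or refute the roughly hourly pattern.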

Labels: bug, performance, uwsgi, timeout, 5xx-errors

kerem closed this issue 2026-02-25 23:43:45 +03:00

@cuu508 commented on GitHub (Dec 5, 2025):

AI slop
