mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 23:15:49 +03:00
[GH-ISSUE #268] Send notifications for individual job failures when measuring job execution time #201
Originally created by @mrmachine on GitHub (Jul 16, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/268
I'm loving that healthchecks.io will send a notification when a running job (triggered by the `/start` URL, see #23) does not finish within the grace period. But we've found the current implementation only notifies sometimes: only when no OTHER overlapping task completes successfully within the grace period.
We have an on-demand Celery task that typically takes a few seconds to run, but sometimes gets "stuck". We want to be notified when it is stuck, so we send `/start` before a job starts and a regular ping after it ends, and set the grace period to 1 minute.

This works great unless another user triggers the same task, say, 30 seconds after the first task got "stuck". The second task completes successfully, and healthchecks.io says nothing about the fact that it received two `/start` requests and only one regular ping.

I'd like to be able to append a nonce to the start and finish requests and have healthchecks.io still send an alert when a task fails to finish, even if the overall check status remains UP because it is still successfully processing other tasks.
For example:
If I had a check configured with a 5 minute period and a 1 minute grace period, the check should be DOWN (and a notification sent) if NO jobs complete successfully within 5+1 minutes, AND a separate notification should be sent if ANY job starts and fails to finish within 1 minute, regardless of the overall check status.
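The nonce-suffixed pings proposed above might look like the sketch below. This is purely illustrative: the `run` query parameter is hypothetical (it is the feature being requested here, not part of the existing ping API), and the ping URL is a placeholder.

```python
import uuid
from urllib.parse import urlencode

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder check URL


def start_url(base, nonce):
    """Build a /start ping URL carrying a per-run nonce.

    The 'run' parameter is hypothetical -- it is what this issue proposes.
    """
    return f"{base}/start?{urlencode({'run': nonce})}"


def finish_url(base, nonce):
    """Build the matching success ping URL for the same run."""
    return f"{base}?{urlencode({'run': nonce})}"


# Each job run would generate its own nonce, so overlapping runs stay distinct.
nonce = uuid.uuid4().hex
print(start_url(PING_URL, nonce))
print(finish_url(PING_URL, nonce))
```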
@cuu508 commented on GitHub (Jul 19, 2019):
Thanks for the suggestion!
I understand the problem, and the idea of adding nonces to distinguish overlapping start/finish pairs.
Two problems with this would be:
Generally I'd like Healthchecks to do one thing (OK, a couple things) and do it well. Use it to notify when an expected event didn't happen, don't use it for generic logging or metrics collection. In the Celery scenario, I would look into:
Another idea to consider: I know a few Healthchecks.io customers are using API to create short-lived checks for each build or deploy: when the job starts, they create a new check via API, and clean it up when the job finishes.
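The short-lived-check pattern mentioned above can be sketched as follows. This only builds the request payload and URLs, without sending anything; it assumes the Management API's create and delete endpoints (`POST /api/v1/checks/`, `DELETE /api/v1/checks/<uuid>`, authenticated with an `X-Api-Key` header) and that the check's UUID is the last path segment of its ping URL.

```python
import json

API_BASE = "https://healthchecks.io/api/v1"  # Management API root


def create_check_payload(name, timeout_s, grace_s):
    """JSON body for POST /api/v1/checks/ (sent with an X-Api-Key header)."""
    return json.dumps({"name": name, "timeout": timeout_s, "grace": grace_s})


def delete_check_url(ping_url):
    """DELETE endpoint for a check, assuming the ping URL ends in its UUID."""
    check_uuid = ping_url.rstrip("/").rsplit("/", 1)[-1]
    return f"{API_BASE}/checks/{check_uuid}"


# Per-job lifecycle: create a check when the job starts, ping it while the
# job runs, then delete it when the job finishes.
print(create_check_payload("deploy-run", 300, 60))
print(delete_check_url("https://hc-ping.com/0c8983c9-9d73-4c23-a"))
```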
@mrmachine commented on GitHub (Jul 19, 2019):
Thanks for your feedback.
I think this feature could be implemented with very few changes to the UX. Users would just notice that their job execution time data is more accurate, and they would no longer miss notifications when tasks start and then fail to finish within the grace period as expected.
I'd like to have a go at implementing it efficiently, if you could point me to any relevant areas of the code to start and any tips on why you think the implementation wouldn't be efficient, unless you're dead set against the idea regardless.
I don't think it should be that hard or inefficient to store multiple `last_started` timestamps keyed by nonce in a `JSONField` (on `Check`) and check/update that field when we get a ping with a nonce. The field would only need to store a number of timestamps equal to the number of concurrent jobs.

We do also use several other monitoring and logging/alerting systems, but the trouble with all of them (and the great thing about healthchecks.io) is that those other systems are complicated to set up and manage, require direct access into our system, need monitoring themselves (even hosted solutions require that we run a local agent), fail to work under certain conditions (e.g. a hung process), etc.
A dead man's switch like healthchecks.io, where we are notified if a job fails to report in for any reason, either for a single job (within its grace period) or the queue as a whole, is what we need.
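The `JSONField` idea above can be sketched with a plain dict. This is a minimal model of the proposed per-nonce bookkeeping, not existing Healthchecks code: record a timestamp on `/start`, clear it on success, and flag any run whose start timestamp has outlived the grace period.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(minutes=1)  # per-run grace period from the example above

# Hypothetical contents of a JSONField on Check: nonce -> start timestamp
last_started = {}


def on_start(nonce, now):
    """Record a /start ping for one run, keyed by its nonce."""
    last_started[nonce] = now


def on_success(nonce, now):
    """Clear the run's start timestamp when its success ping arrives."""
    last_started.pop(nonce, None)


def overdue_runs(now):
    """Runs that started but did not finish within the grace period."""
    return [n for n, t in last_started.items() if now - t > GRACE]


now = datetime(2019, 7, 19, tzinfo=timezone.utc)
on_start("a", now)                                  # run "a" starts, then hangs
on_start("b", now + timedelta(seconds=30))          # overlapping run "b" starts
on_success("b", now + timedelta(seconds=40))        # run "b" finishes in time
print(overdue_runs(now + timedelta(seconds=90)))    # run "a" is now overdue
```

The check as a whole stays UP (run "b" succeeded), yet run "a" is individually flagged, which is exactly the notification this issue asks for.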
@cuu508 commented on GitHub (Jul 19, 2019):
Implementation-wise, I would start by looking at how the check's state is tracked and updated in the `hc.accounts.models.Check` class. Some relevant fields in that class:

- `timeout`, `grace`, `schedule`: store the expected schedule, as specified by the user
- `last_ping`: timestamp of the last "success" ping
- `last_start`: timestamp of the last `/start` ping
- `alert_after`: a computed field of when we expect the check to go down (unless it receives another ping – then `alert_after` is computed again)

Relevant methods:

- `get_grace_start()`: calculates when the check is expected to go from "up" to "late"
- `going_down_after()`: calculates when the check is expected to go down. This method is used to update the `alert_after` field.

I think the main difficulty would be around triggering notifications at the right times. Notifications are sent when checks change state (up -> down, down -> up). With concurrently running jobs, we would need to track the `status` and `alert_after` values separately for each run.

Maybe a less brain-bending way to implement this would be to allow a parent-child relationship between checks. When a `/start` with a nonce is received for "Check A", create a new check "Check A child Nr1", set "Check A" as its parent, and copy the `timeout`, `grace`, and `schedule` fields from "Check A". There would need to be a mechanism to clean up these child checks. When "Check A child Nr1" changes state, the notification should actually be sent for "Check A". The child checks should be an implementation detail, invisible to the user.

Either way, my gut feeling is this would add a bunch of complexity and require code changes in many places. There might be a clever solution that I haven't thought of, of course :-)
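The timing logic referenced above (when a check goes from "up" to "late" to "down") can be sketched with simplified stand-ins for `get_grace_start()` and `going_down_after()`. These assume a simple timeout-based check (no cron schedule) and are not the actual model methods:

```python
from datetime import datetime, timedelta, timezone


def grace_start(last_ping, timeout):
    """Simplified get_grace_start(): when the check turns 'late'."""
    return last_ping + timeout


def going_down_after(last_ping, timeout, grace):
    """Simplified going_down_after(): when the check turns 'down'.

    The real model caches this value in the alert_after field and
    recomputes it whenever a new ping arrives.
    """
    return last_ping + timeout + grace


# With the 5 minute period / 1 minute grace example from this issue:
last_ping = datetime(2019, 7, 19, 12, 0, tzinfo=timezone.utc)
print(grace_start(last_ping, timedelta(minutes=5)))
print(going_down_after(last_ping, timedelta(minutes=5), timedelta(minutes=1)))
```

Tracking overlapping runs would mean keeping one such `alert_after` value per run rather than a single value per check, which is the complexity discussed above.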
@cuu508 commented on GitHub (Sep 11, 2019):
I'm not planning support for multiple, overlapping start/finish cycles for a single check. For tracking concurrent job execution I would recommend one of the: