[GH-ISSUE #858] Notify on failure rate #603

Closed
opened 2026-02-25 23:42:59 +03:00 by kerem · 8 comments

Originally created by @Crocmagnon on GitHub (Jul 5, 2023).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/858

I have a case of a check that is a bit flaky these days, but all I care about is that it runs at least sometimes.

Would it be possible to implement something like a threshold of failures for notifications?
Instead of notifying immediately on the first failure, we would compute the percentage of failed jobs over a period of time and alert the user only when this threshold is crossed.

Individual checks would have different thresholds and time periods.
Users may also like to configure the threshold as a percentage or as an absolute number (X failures over Y minutes, for example).

To keep the current behavior, the default would be alerting if the percentage of failed jobs exceeds 0% (or 0 failures) over a minute.
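As a rough sketch of the proposed rule (illustrative only; none of these names exist in Healthchecks today): keep ping results in a sliding time window and alert only once the failure rate crosses the configured threshold. A 0% threshold over one minute reproduces the current alert-on-first-failure behavior.

```python
# Hypothetical sketch of the proposed alerting rule. All names are
# illustrative; this is not part of Healthchecks.
from collections import deque
from datetime import datetime, timedelta

class FailureRateGate:
    def __init__(self, threshold_pct: float, window: timedelta):
        self.threshold_pct = threshold_pct  # 0.0 keeps today's behavior
        self.window = window                # e.g. timedelta(minutes=1)
        self.events = deque()               # (timestamp, ok) pairs

    def record(self, ok: bool, now: datetime) -> bool:
        """Record one ping result; return True if an alert should fire."""
        self.events.append((now, ok))
        # Drop results that fell out of the window.
        cutoff = now - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        failures = sum(1 for _, result in self.events if not result)
        rate = 100.0 * failures / len(self.events)
        return rate > self.threshold_pct
```

With `threshold_pct=0.0` a single failure in the window fires immediately; with, say, 50% over 10 minutes, isolated failures are absorbed.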

If this is implemented, we could also improve it later by adding thresholds for integrations. For example, I may want to be alerted by email for every failed job but only by phone call if the failure rate exceeds some higher threshold.

kerem 2026-02-25 23:42:59 +03:00
  • closed this issue
  • added the "feature" label

@mtesch-um commented on GitHub (Jul 14, 2023):

Just to play devil's advocate here: if you know the frequency and the acceptable number of failures, could you calculate the desired grace period and just use that?


@cuu508 commented on GitHub (Jul 14, 2023):

Yeah, I think that should work at least in some cases.

> I have a case of a check that is a bit flaky these days, but all I care about is that it runs at least sometimes.

Let's say a job is expected to run every 5 minutes, but you would like to know when it has not run even once for a full 24 hours. Set grace time to 24 hours and done.
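The arithmetic behind this workaround can be written down (a sketch in Python for illustration; `grace_for_misses` is not a Healthchecks function). It only covers *missing* pings, not explicit failure signals, which is the limitation discussed below.

```python
# If a job should run every `period` and up to `acceptable_misses`
# consecutive misses are tolerable, a grace time of roughly
# period * acceptable_misses delays the "late" alert until the last
# tolerable miss has passed. Illustrative helper, not Healthchecks code.
from datetime import timedelta

def grace_for_misses(period: timedelta, acceptable_misses: int) -> timedelta:
    return period * acceptable_misses

# A 5-minute job where only a full day of silence is the real problem:
grace = grace_for_misses(timedelta(minutes=5), 288)  # 288 runs per day
```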

The other half-serious thing I'd like to suggest: **fix your checks to not be flaky!** If you get notifications about them flapping up and down, make the notifications loud, obnoxious, hard to ignore, as a reminder you have stuff to fix, people to kick, suppliers to fire, etc.


@Crocmagnon commented on GitHub (Jul 16, 2023):

> Let's say a job is expected to run every 5 minutes, but you would like to know when it has not run even once for full 24 hours. Set grace time to 24 hours and done.

That's not how grace time is documented to work. From [the docs](https://healthchecks.io/docs/configuring_checks/):

> **Grace Time** is the additional time to wait before sending an alert when a check is late. Use this parameter to account for small, expected deviations in job execution times.

From what I understand, it does not account for checks reporting failures.

> fix your checks to not be flaky

Easier said than done 🙂

I'm using healthchecks.io as low-effort monitoring for some of the cronjobs on my personal home server. I don't have much time to invest in it, so I need to prioritize. My backups currently run once an hour, and if some of them fail, it's not mission-critical; I can live with 24-hour-old backups if needed. But if there aren't any backups for more than a day, then I really need to start worrying.

Same for a Nextcloud cron: it's scheduled to run every 5 minutes. A single failed run is not really an issue, as long as not too many fail in a row.


@cuu508 commented on GitHub (Jul 16, 2023):

Right, the "set grace time to 24 hours" idea will not work if the job also sends failure signals. When Healthchecks receives a failure signal, it sends notifications immediately, regardless of grace time settings.


@kiler129 commented on GitHub (Aug 30, 2023):

I was just about to create the exact same issue, but even simpler. My idea was to allow a "threshold" number of failures before marking the job as failed. Until the threshold is crossed, mark the job as "warning", like with the grace period.

### Why not add redundancy in the check call?

That model works when the job retries internally and sends success only when it finishes. However, in our case we try to make sure failures are sent explicitly any time something fails.

In addition, even if it *can* be added in a different layer, that doesn't mean it's a good idea. A missing check and an explicit failure are very different signals IMO.

### So, why not fix the system to not be flaky?

That's an obvious question, even posed here. However, not everything requires a fix. One of our uses is a [middleware to monitor internal HTTP services](https://github.com/kiler129/server-healthchecks) that cannot ping HC themselves but expose a status endpoint.
In such a case, ~~we can add the threshold in the middleware~~ *(edit: we did in https://github.com/kiler129/server-healthchecks/commit/507b49267bd5475e123ad3bc046707fded005765 to avoid notification spam)*, but we lose the distinction between the job failing and the check failing.

### Real-world scenario

One of the systems I am monitoring is an NVR. We push a status for the NVR itself but also for each camera snapshot being taken. Being notified of camera snapshots failing is crucial for obvious reasons... except when cameras are being rebooted for an update or as preventive maintenance (don't shoot the messenger; that's what many IP cameras need).

In this scenario the reboot is quick (~30s), but if the snapshot happens within that reboot window, the failure is reported and people are notified. This has already started leading to failures being ignored "because it's rebooting, so I will wait for another failure".
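A client-side consecutive-failure threshold, the kind of logic the middleware edit above describes, can be sketched as a small pure function (illustrative names; not taken from the linked project): swallow failures until N occur in a row, then forward `/fail`; any success resets the counter.

```python
from typing import Optional, Tuple

def next_signal(ok: bool, consecutive_failures: int,
                threshold: int) -> Tuple[Optional[str], int]:
    """Decide which ping to send ("success", "fail", or None to stay
    silent) and return the updated consecutive-failure count.
    Illustrative sketch; not the linked project's actual code."""
    if ok:
        return "success", 0          # success always resets the counter
    consecutive_failures += 1
    if consecutive_failures >= threshold:
        return "fail", consecutive_failures
    return None, consecutive_failures  # swallow this failure for now
```

The caller would translate "success" into a plain ping and "fail" into a request to the check's `/fail` endpoint.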


@kiler129 commented on GitHub (Nov 21, 2023):

@cuu508 Is there any chance on this feature making the cut? :)


@cuu508 commented on GitHub (Nov 22, 2023):

I do not have near-term plans to work on this. My vision for Healthchecks is to stay simple and focused on its primary task – notify the user when something does not happen on time. I'd like to avoid or at least delay the common scenario of a tool starting with a simple scope, then slowly adding semi-related functionality until it has everything but the kitchen sink, has complex UI, and lacks cohesion.

However, nothing is set in stone: I said "no" to check auto-provisioning, to a first-party Dockerfile, and to other features for a long time, but eventually added them.

> In this scenario the reboot is quick (~30s), but if the snapshot happens within that reboot time the failure will be reported and people will be notified. This already started leading to failures being ignored "because it's rebooting so I will wait for another failure".

Two comments on this:

  • if a service exposes a status endpoint, you may be better off with active monitoring (the monitoring system sending HTTP probes and alerting if configured conditions are not met). There are many, many options in this space, including self-hostable options.
  • I'm sure you've considered this, but if the reboots happen at predictable times, you could suppress the `/fail` signals during the reboot windows.

Side note: every time I push code to Healthchecks.io, I receive a wave of notifications from load balancer servers reporting that some backend nodes have gone down and then recovered. On one hand, it's annoying to deal with these notifications on every deploy; on the other hand, it's good to see regular proof that 1) the monitoring system works as expected and 2) during each deploy, the backend nodes do indeed recover (don't stay down). I've considered adding notification suppression during code deploys, but decided against it for those two reasons.


@cuu508 commented on GitHub (Nov 20, 2025):

I'm not planning to work on this. Closing this issue so there's no false expectation.
