[GH-ISSUE #587] Notification threshold #430

Closed
opened 2026-02-25 23:42:26 +03:00 by kerem · 9 comments

Originally created by @singpolyma on GitHub (Dec 7, 2021).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/587

It would be great to be able to set a check to only notify after 2 or 3 failures. I like being able to have an accurate log of failure for each failed run, but some tasks recover on their own after a few failures and I really don't need to get an email about it right away.

kerem 2026-02-25 23:42:26 +03:00 · closed this issue · added the `feature` label

@cuu508 commented on GitHub (Dec 10, 2021):

Hi @singpolyma, thanks for the suggestion.

If the check has a regular schedule (e.g., every 10 minutes), you can use the Grace Time parameter for this. For example, if you set Grace Time to 30 minutes, you will get an alert only after three 10-minute periods of no ping.

What are your thoughts on using Grace Time?
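
To make the arithmetic concrete, here is a rough sketch (simplified; the real down-detection logic in healthchecks has more nuance, and the values below are only examples):

```python
# Simplified model of the timing described above: a check is treated as
# "down" once the time since its last ping exceeds Period + Grace Time.
from datetime import datetime, timedelta

period = timedelta(minutes=10)  # the job is expected to ping every 10 minutes
grace = timedelta(minutes=30)   # the Grace Time setting

last_ping = datetime(2022, 1, 1, 12, 0)
alerts_at = last_ping + period + grace

# 12:40 -- by then the 12:10, 12:20 and 12:30 runs have all failed to ping.
print(alerts_at)
```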


@t-schuster commented on GitHub (Feb 7, 2022):

I would love to have an option to make N failures acceptable, or to allow a secondary "Grace Time" that tracks the time since the last successful (exit code 0) ping. Currently I run a large selection of scripts that signal runtime and exit code to Healthchecks.io, but for the vast majority of them it is entirely acceptable if they fail once (even if they explicitly signal an exit code of >=1). Grace Time isn't quite the option I need, since the script does run and does signal completion or failure.

But at the same time, we want to be **immediately** alerted if the script does not run or complete at all, because that is much more urgent. I.e., we want to know both that a script ran on its every-2-days schedule and that it completed **successfully** in its past 2 attempts.


@singpolyma commented on GitHub (Feb 7, 2022):

Yes, exactly. I use runitor and report failures when my job fails, which means Grace Time no longer works for this case, but I don't want to stop reporting failures because they can be useful to look back at.


@singpolyma commented on GitHub (Mar 30, 2022):

I'm looking into implementing this and have some questions. Is a `Flip` just for the notification job, or is it used for something else? It looks like it might be used for downtime reporting. So should a `Flip` get created when we know the check is down even if we aren't going to notify for it yet (and then add some code to delay notifying), or is that silly, and the `Ping` is enough, so we should just delay creating the `Flip` altogether?


@cuu508 commented on GitHub (Apr 1, 2022):

@t-schuster you could track the successive failures on the client side. Let's say the requirements are:

  • you have a script (let's call it "payload") that runs every 10 minutes and returns an exit code 0 on success, 1-255 on failure.
  • you want to be notified if payload doesn't run for 2 days
  • you want to be notified if payload finishes with a non-zero exit code two times in a row

I would look into writing a wrapper script which:

  • reads the exit code of the previous run from a file
  • runs the payload and captures its exit code
  • if the current and the previous exit code are both non-zero, report failure. Otherwise, report success
  • write the current exit code to a file

Alternatively, it could be time-based: write a timestamp of the last successful run to a file, and then signal failure if the time since the last successful run goes above some threshold.
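
A minimal sketch of such a wrapper, assuming a hypothetical payload command, state-file path, and check UUID (all placeholders), and reporting through the standard ping endpoints (appending /fail to the ping URL signals a failure):

```python
#!/usr/bin/env python3
# Hypothetical wrapper following the steps above; the state-file path,
# the check UUID, and the payload command are placeholders.
import subprocess
import urllib.request
from pathlib import Path

STATE_FILE = Path("/var/tmp/payload-last-exit-code")  # previous run's exit code
PING_URL = "https://hc-ping.com/your-check-uuid"      # placeholder check UUID

# 1. Read the exit code of the previous run (treat a missing file as success).
try:
    previous = int(STATE_FILE.read_text().strip())
except (FileNotFoundError, ValueError):
    previous = 0

# 2. Run the payload and capture its exit code.
current = subprocess.run(["/usr/local/bin/payload"]).returncode

# 3. Report failure only if both this run and the previous one failed.
url = PING_URL + "/fail" if current != 0 and previous != 0 else PING_URL
try:
    urllib.request.urlopen(url, timeout=10)
except OSError:
    pass  # a ping failure should not crash the wrapper

# 4. Write the current exit code to the file for the next run.
STATE_FILE.write_text(str(current))
```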


@cuu508 commented on GitHub (Apr 1, 2022):

@singpolyma the `Flip` object records the fact that a check has changed state (from "up" to "down" or vice versa), and the timestamp when it happened.

It is used:

  • to send notifications asynchronously (we can create a `Flip` object in a request-response cycle, and process it in a long-running management command, potentially on a different machine)
  • to calculate downtime stats (the number of downtimes, and the total downtime duration per month)
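
For intuition, a heavily simplified, non-Django sketch of that pattern (the names and fields below are made up for illustration and are not the actual healthchecks schema or code):

```python
# Illustrative only: an in-memory version of the flip-based pattern above.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Flip:
    check_name: str
    old_status: str          # "up" or "down"
    new_status: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    processed: bool = False  # has a notification been sent for this flip yet?


PENDING: list[Flip] = []


def record_flip(check_name: str, old: str, new: str) -> None:
    """Called from the request-response cycle when a check changes state."""
    PENDING.append(Flip(check_name, old, new))


def process_flips() -> None:
    """Run separately (e.g. in a long-running worker) to send notifications."""
    for flip in PENDING:
        if not flip.processed:
            print(f"notify: {flip.check_name} went {flip.old_status} -> {flip.new_status}")
            flip.processed = True


# Downtime stats can be derived from the same records: consecutive
# down -> up transitions bound each downtime interval.
```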

@t-schuster commented on GitHub (Apr 1, 2022):

> I would look into writing a wrapper script which:

I already have a wrapper script, and it's terribly useful, but I think it could also be useful to have this functionality in healthchecks.io. Wrapper scripts, IMO, should be minimal to avoid failures. For my use case, a failing or buggy wrapper script is a very dangerous thing, since the job may then not be able to run (for example, if the timestamp file becomes read-only due to filesystem issues). Whether HC.io is working is much easier for me to monitor.

My thought is that it's better to minimize complexity in the machinery that notifies HC.io, to avoid any potential issues. But I also want to minimize the number of alerts caused by spurious failures that occur naturally and fix themselves by the next time the job runs.


@cuu508 commented on GitHub (Apr 1, 2022):

Certainly, in the wrapper script, you have to be very careful with any tasks that come before the "payload" part. You don't want a monitoring failure to prevent the main job from running.

In this particular case, though, you can rearrange the steps so that the job runs first and the monitoring logic comes after (sketched below):

  • run the payload and capture its exit code
  • read the exit code of the previous run from a file
  • if the current and the previous exit code are both non-zero, report failure. Otherwise, report success
  • write the current exit code to a file

And, worst case, if the script is not able to run at all, Healthchecks will notify you after the grace time runs out.
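
Reusing the same hypothetical placeholders as the earlier sketch, the rearranged wrapper simply moves the payload to the front, so a problem in the monitoring steps (an unreadable state file, a network error while pinging) can no longer keep the main job from running:

```python
#!/usr/bin/env python3
# Same hypothetical wrapper, with the payload moved to the front.
import subprocess
import urllib.request
from pathlib import Path

STATE_FILE = Path("/var/tmp/payload-last-exit-code")  # placeholder path
PING_URL = "https://hc-ping.com/your-check-uuid"      # placeholder check UUID

# Run the payload first, so monitoring problems cannot block it.
current = subprocess.run(["/usr/local/bin/payload"]).returncode

try:
    previous = int(STATE_FILE.read_text().strip())
except (OSError, ValueError):
    previous = 0  # an unreadable state file counts as a previous success

url = PING_URL + "/fail" if current != 0 and previous != 0 else PING_URL
try:
    urllib.request.urlopen(url, timeout=10)
except OSError:
    pass  # never let a ping failure raise out of the wrapper

STATE_FILE.write_text(str(current))
```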


@cuu508 commented on GitHub (May 27, 2022):

I just realized there's another, older issue with the same feature request: #525
I'll close this one, so future discussion is in a single place.
