[GH-ISSUE #587] Notification threshold #430
Originally created by @singpolyma on GitHub (Dec 7, 2021).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/587
It would be great to be able to set a check to only notify after 2 or 3 failures. I like having an accurate log entry for each failed run, but some tasks recover on their own after a few failures, and I really don't need an email about it right away.
@cuu508 commented on GitHub (Dec 10, 2021):
Hi @singpolyma, thanks for the suggestion.
If the check has a regular schedule (e.g., a ping every 10 minutes), you can use the Grace Time parameter for this. For example, if you set Grace Time to 30 minutes, you will get an alert only after three 10-minute periods with no ping.
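For reference, the same setup can also be scripted through the Management API; a minimal sketch in Python, assuming the hosted healthchecks.io instance and a project API key (timeout and grace are both in seconds):

```python
import requests

API_KEY = "your-project-api-key"  # placeholder: a project API key from the Settings page

# A check that expects a ping every 10 minutes (timeout=600) and alerts
# only after a further 30 minutes of silence (grace=1800), i.e. roughly
# three missed 10-minute periods.
resp = requests.post(
    "https://healthchecks.io/api/v1/checks/",
    headers={"X-Api-Key": API_KEY},
    json={"name": "nightly-job", "timeout": 600, "grace": 1800},
)
resp.raise_for_status()
print(resp.json()["ping_url"])  # the URL the job should ping
```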
What are your thoughts on using Grace Time?
@t-schuster commented on GitHub (Feb 7, 2022):
I would love to have an option to make N failures acceptable, or a secondary "Grace Time" that tracks the time since the last successful (exit code 0) ping. Currently, I run a large selection of scripts that signal runtime and exit code to Healthchecks.io, but for the vast majority of them it is entirely acceptable if they fail once (even if they explicitly signal an exit code >= 1). Grace Time isn't quite the option I need, since the script does run and does signal completion or failure.
But at the same time, we want to be alerted immediately if the script does not run or complete at all, because that is much more urgent. I.e., we want to know both whether a script ran every 2 days and whether it ran successfully in the past 2 attempts.
@singpolyma commented on GitHub (Feb 7, 2022):
Yes, exactly. I use runitor and report a failure when my job fails, which means that grace time no longer works for this, but I don't want to stop reporting failures, because they can be useful to look back at.
@singpolyma commented on GitHub (Mar 30, 2022):
I'm looking into implementing this and have some questions. Is a `Flip` just for the notification job, or is it used for something else? It looks like it might be used for downtime reporting. So should a `Flip` get created when we know the check is down, even if we aren't going to notify for it yet (and then add some code to delay notifying)? Or is that silly, is the `Ping` enough, and should we just delay creating the `Flip`?

@cuu508 commented on GitHub (Apr 1, 2022):
@t-schuster you could track the successive failures on the client side. Let's say the requirements are that a single failure is acceptable, but several failures in a row should raise an alert.
I would look into writing a wrapper script which runs the job, keeps a count of successive failures in a file, and signals failure to Healthchecks only once that count crosses the threshold (a sketch follows below).
Alternatively, it could be time-based: write a timestamp of the last successful run to a file, and then signal failure if the time since the last successful run goes above some threshold.
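For illustration, the counter-based variant might look roughly like this in Python (the ping URL, state file, threshold, and job command are placeholders):

```python
import subprocess
import sys
import urllib.request
from pathlib import Path

PING_URL = "https://hc-ping.com/your-uuid-here"  # placeholder: your check's ping URL
COUNTER_FILE = Path("/var/tmp/myjob.failures")   # placeholder: writable state file
THRESHOLD = 3                                    # placeholder: alert after 3 successive failures

def ping(url: str) -> None:
    urllib.request.urlopen(url, timeout=10)

# Run the actual job first, so a monitoring bug cannot prevent it from running.
result = subprocess.run(["/usr/local/bin/myjob.sh"])  # placeholder: the wrapped job

if result.returncode == 0:
    COUNTER_FILE.write_text("0")  # success: reset the failure counter
    ping(PING_URL)
else:
    try:
        failures = int(COUNTER_FILE.read_text()) + 1
    except (FileNotFoundError, ValueError):
        failures = 1
    COUNTER_FILE.write_text(str(failures))
    if failures >= THRESHOLD:
        ping(PING_URL + "/fail")  # threshold crossed: report the outage
    else:
        ping(PING_URL)  # below threshold: ping success so no alert goes out

sys.exit(result.returncode)
```

One trade-off to note: pinging the success URL for a below-threshold failure suppresses the alert, but those runs then show up as successes in the ping log.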
@cuu508 commented on GitHub (Apr 1, 2022):
@singpolyma the `Flip` object records the fact that the check has changed state (from "up" to "down" or vice versa), and the timestamp when it happened. It is used:

- for downtime reporting (working out when, and for how long, the check was down)
- for queuing up notifications (this allows Healthchecks to create the `Flip` object in a request-response cycle, and process it in a long-running management command, potentially on a different machine)
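For illustration, a simplified sketch of such a state-change record as a Django model; this is an approximation for the discussion, not the project's actual definition:

```python
from django.db import models

class Flip(models.Model):
    # Which check changed state, and when it happened.
    owner = models.ForeignKey("Check", models.CASCADE)
    created = models.DateTimeField()

    # The transition itself, e.g. "up" -> "down".
    old_status = models.CharField(max_length=8)
    new_status = models.CharField(max_length=8)

    # Set once notifications for this flip have been sent out;
    # unprocessed flips form the queue for the alert-sending command.
    processed = models.DateTimeField(null=True, blank=True)
```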
@t-schuster commented on GitHub (Apr 1, 2022):

I already have a wrapper script; it's terribly useful, but on the other hand I think it could be useful to have this functionality in healthchecks.io. Wrapper scripts, IMO, should be minimal to avoid failures. A failing or buggy wrapper script is, for my use case, a very dangerous thing, since the job may then not be able to run (for example, if the timestamp file becomes read-only due to filesystem issues). Whether HC.io is working or not is something I can monitor more easily.
My thought is that it's better to minimize complexity in the machinery that notifies HC.io, to avoid any potential issues, but also to minimize the number of alerts sent for spurious failures that occur naturally and fix themselves by the next time the job runs.
@cuu508 commented on GitHub (Apr 1, 2022):
Certainly, in the wrapper script, you have to be very careful with any tasks that come before the "payload" part. You don't want a monitoring failure to prevent the main job from running.
In this particular case, though, you can rearrange the steps so that the job runs first and the monitoring logic comes after.
And, worst case, if the script is not able to run at all, Healthchecks will notify you after the grace time runs out.
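For illustration, the time-based variant with this job-first ordering might look roughly like this (the paths, threshold, and ping URL are again placeholders):

```python
import subprocess
import sys
import time
import urllib.request
from pathlib import Path

PING_URL = "https://hc-ping.com/your-uuid-here"   # placeholder: your check's ping URL
STAMP_FILE = Path("/var/tmp/myjob.last-success")  # placeholder: writable state file
MAX_AGE = 3 * 3600                                # placeholder: tolerate 3 hours without a success

# The job runs first, so the monitoring bookkeeping below cannot prevent it.
result = subprocess.run(["/usr/local/bin/myjob.sh"])  # placeholder: the wrapped job

if result.returncode == 0:
    STAMP_FILE.write_text(str(time.time()))  # record the time of the last success

# Signal failure only once the last success is older than the threshold.
try:
    age = time.time() - float(STAMP_FILE.read_text())
except (FileNotFoundError, ValueError):
    age = float("inf")

suffix = "/fail" if age > MAX_AGE else ""
urllib.request.urlopen(PING_URL + suffix, timeout=10)
sys.exit(result.returncode)
```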
@cuu508 commented on GitHub (May 27, 2022):
I just realized there's another, older, issue with the same feature request: #525
I'll close this one, so future discussion is in a single place.