mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 23:15:49 +03:00
[GH-ISSUE #268] Send notifications for individual job failures when measuring job execution time #201
Originally created by @mrmachine on GitHub (Jul 16, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/268
I'm loving that healthchecks.io will send a notification when a running job (triggered by the `/start` URL, see #23) does not finish within the grace period. But we've found the current implementation only notifies sometimes: only when no OTHER overlapping task completes successfully within the grace period.
We have an on-demand Celery task that typically takes a few seconds to run, but sometimes gets "stuck". We want to be notified when it is stuck, so we send `/start` before a job starts and a regular ping after it ends, and set the grace period to 1 minute.

This works great unless another user triggers the same task, say, 30 seconds after the first task got "stuck". The second task completes successfully, and healthchecks.io says nothing about the fact that it received two `/start` requests and only one regular ping.

I'd like to be able to append a nonce to the start and finish requests and have healthchecks.io still send an alert when a task fails to finish, even if the overall check status remains UP because it is still successfully processing other tasks.
For example:
If I had a check configured with a 5 minute period and a 1 minute grace period, the check should be DOWN (and a notification sent) if NO jobs complete successfully within 5+1 minutes, AND a separate notification should be sent if ANY job starts and fails to finish within 1 minute, regardless of the overall check status.
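The nonce-suffixed pings proposed above might look like the sketch below. This is purely illustrative: the `run` query parameter is hypothetical (it is the feature being requested here, not part of the existing ping API), and the ping URL is a placeholder.

```python
import uuid
from urllib.parse import urlencode

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder check URL


def start_url(base, nonce):
    """Build a /start ping URL carrying a per-run nonce.

    The 'run' parameter is hypothetical -- it is what this issue proposes.
    """
    return f"{base}/start?{urlencode({'run': nonce})}"


def finish_url(base, nonce):
    """Build the matching success ping URL for the same run."""
    return f"{base}?{urlencode({'run': nonce})}"


# Each job run would generate its own nonce, so overlapping runs stay distinct.
nonce = uuid.uuid4().hex
print(start_url(PING_URL, nonce))
print(finish_url(PING_URL, nonce))
```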
@cuu508 commented on GitHub (Jul 19, 2019):
Thanks for the suggestion!
I understand the problem, and the idea of adding nonces to distinguish overlapping start/finish pairs.
Two problems with this would be:
Generally I'd like Healthchecks to do one thing (OK, a couple things) and do it well. Use it to notify when an expected event didn't happen, don't use it for generic logging or metrics collection. In the Celery scenario, I would look into:
Another idea to consider: I know a few Healthchecks.io customers are using API to create short-lived checks for each build or deploy: when the job starts, they create a new check via API, and clean it up when the job finishes.
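The short-lived-check pattern mentioned above can be sketched as follows. This only builds the request payload and URLs, without sending anything; it assumes the Management API's create and delete endpoints (`POST /api/v1/checks/`, `DELETE /api/v1/checks/<uuid>`, authenticated with an `X-Api-Key` header) and that the check's UUID is the last path segment of its ping URL.

```python
import json

API_BASE = "https://healthchecks.io/api/v1"  # Management API root


def create_check_payload(name, timeout_s, grace_s):
    """JSON body for POST /api/v1/checks/ (sent with an X-Api-Key header)."""
    return json.dumps({"name": name, "timeout": timeout_s, "grace": grace_s})


def delete_check_url(ping_url):
    """DELETE endpoint for a check, assuming the ping URL ends in its UUID."""
    check_uuid = ping_url.rstrip("/").rsplit("/", 1)[-1]
    return f"{API_BASE}/checks/{check_uuid}"


# Per-job lifecycle: create a check when the job starts, ping it while the
# job runs, then delete it when the job finishes.
print(create_check_payload("deploy-run", 300, 60))
print(delete_check_url("https://hc-ping.com/0c8983c9-9d73-4c23-a"))
```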
@mrmachine commented on GitHub (Jul 19, 2019):
Thanks for your feedback.
I think this feature could be implemented with very few changes to the UX. Users would just notice that their job execution time data is more accurate, and they would no longer miss notifications when tasks start and then fail to finish within the grace period as expected.
I'd like to have a go at implementing it efficiently, if you could point me to any relevant areas of the code to start and any tips on why you think the implementation wouldn't be efficient, unless you're dead set against the idea regardless.
I don't think it should be that hard or inefficient to store multiple `last_started` timestamps keyed by nonce in a `JSONField` (on `Check`) and check/update that field when we get a ping with a nonce. The field would only need to store a number of timestamps equal to the number of concurrent jobs.

We do also use several other monitoring and logging/alerting systems, but the trouble with all of them (and the great thing about healthchecks.io) is that those other systems are complicated to set up and manage, require direct access into our system, need monitoring themselves (even hosted solutions require that we run a local agent), fail to work under certain conditions (e.g. a hung process), etc.
A dead man's switch like healthchecks.io, where we are notified if a job fails to report in for any reason, either for a single job (within its grace period) or the queue as a whole, is what we need.
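The `JSONField` idea above can be sketched with a plain dict. This is a minimal model of the proposed per-nonce bookkeeping, not existing Healthchecks code: record a timestamp on `/start`, clear it on success, and flag any run whose start timestamp has outlived the grace period.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(minutes=1)  # per-run grace period from the example above

# Hypothetical contents of a JSONField on Check: nonce -> start timestamp
last_started = {}


def on_start(nonce, now):
    """Record a /start ping for one run, keyed by its nonce."""
    last_started[nonce] = now


def on_success(nonce, now):
    """Clear the run's start timestamp when its success ping arrives."""
    last_started.pop(nonce, None)


def overdue_runs(now):
    """Runs that started but did not finish within the grace period."""
    return [n for n, t in last_started.items() if now - t > GRACE]


now = datetime(2019, 7, 19, tzinfo=timezone.utc)
on_start("a", now)                                  # run "a" starts, then hangs
on_start("b", now + timedelta(seconds=30))          # overlapping run "b" starts
on_success("b", now + timedelta(seconds=40))        # run "b" finishes in time
print(overdue_runs(now + timedelta(seconds=90)))    # run "a" is now overdue
```

The check as a whole stays UP (run "b" succeeded), yet run "a" is individually flagged, which is exactly the notification this issue asks for.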
@cuu508 commented on GitHub (Jul 19, 2019):
Implementation-wise, I would start by looking at how the check's state is tracked and updated in the `hc.accounts.models.Check` class. Some relevant fields in that class:

- `timeout`, `grace`, `schedule`: store the expected schedule, as specified by the user
- `last_ping`: timestamp of the last "success" ping
- `last_start`: timestamp of the last `/start` ping
- `alert_after`: a computed field of when we expect the check to go down (unless it receives another ping – then `alert_after` is computed again)

Relevant methods:

- `get_grace_start()`: calculates when the check is expected to go from "up" to "late"
- `going_down_after()`: calculates when the check is expected to go down. This method is used to update the `alert_after` field.

I think the main difficulty would be around triggering notifications at the right times. Notifications are sent when checks change state (up -> down, down -> up). With concurrently running jobs, we would need to track the `status` and `alert_after` values separately for each run.

Maybe a less brain-bending way to implement this would be to allow a parent-child relationship between checks. When a `/start` with a nonce is received for "Check A", create a new check "Check A child Nr1", set "Check A" as its parent, and copy the `timeout`, `grace`, and `schedule` fields from "Check A". There would need to be a mechanism to clean up these child checks. When "Check A child Nr1" changes state, the notification should actually be sent for "Check A". The child checks should be an implementation detail, invisible to the user.

Either way, my gut feeling is this would add a bunch of complexity and require code changes in many places. There might be a clever solution that I haven't thought of, of course :-)
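The timing logic referenced above (when a check goes from "up" to "late" to "down") can be sketched with simplified stand-ins for `get_grace_start()` and `going_down_after()`. These assume a simple timeout-based check (no cron schedule) and are not the actual model methods:

```python
from datetime import datetime, timedelta, timezone


def grace_start(last_ping, timeout):
    """Simplified get_grace_start(): when the check turns 'late'."""
    return last_ping + timeout


def going_down_after(last_ping, timeout, grace):
    """Simplified going_down_after(): when the check turns 'down'.

    The real model caches this value in the alert_after field and
    recomputes it whenever a new ping arrives.
    """
    return last_ping + timeout + grace


# With the 5 minute period / 1 minute grace example from this issue:
last_ping = datetime(2019, 7, 19, 12, 0, tzinfo=timezone.utc)
print(grace_start(last_ping, timedelta(minutes=5)))
print(going_down_after(last_ping, timedelta(minutes=5), timedelta(minutes=1)))
```

Tracking overlapping runs would mean keeping one such `alert_after` value per run rather than a single value per check, which is the complexity discussed above.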
@cuu508 commented on GitHub (Sep 11, 2019):
I'm not planning support for multiple, overlapping start/finish cycles for a single check. For tracking concurrent job execution I would recommend one of the: