mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 06:55:53 +03:00
[GH-ISSUE #219] Job execution calculation not displayed if an integration alert delivered (or a job runs long) #159
Originally created by @danemacmillan on GitHub (Feb 12, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/219
In the log for job executions, if one uses the /start endpoint, the next request will display the job execution time. However, if a notification or alert is sent between the "Started" and "OK" requests, the job's execution time is not displayed.
Edit:
Now I'm thinking it just has to do with a job that runs long. I've been adjusting some jobs that have moved to slower disks; they now take about three hours to run, so the grace time was increased accordingly and alerts are no longer being triggered.
Have a look at the last few dozen logs:
@cuu508 commented on GitHub (Feb 21, 2019):
Yep, the run times are not shown in the UI if the job takes longer than 1 hour to run: https://github.com/healthchecks/healthchecks/blob/master/hc/front/views.py#L415
A couple of reasons for this:

… /start. Take this example: a worker process pings the /start endpoint and promptly crashes. Three days later somebody notices and, while investigating, manually pings the ping URL. The runtime now shows "3 days X hours", which is not useful.

I'm thinking raising the limit from 1 hour to 24 hours (so for long jobs the UI would show "X h Y min Z sec") would probably be enough, right?
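The cutoff described above can be sketched as follows. This is a minimal illustration with assumed names (`MAX_SHOWN_DURATION`, `shown_duration`), not the actual code in hc/front/views.py:

```python
from datetime import datetime, timedelta

# Assumed name -- stands in for the hard-coded 1-hour limit discussed above.
MAX_SHOWN_DURATION = timedelta(hours=1)

def shown_duration(start, end):
    """Return the run time to display, or None if it exceeds the cutoff."""
    delta = end - start
    if delta > MAX_SHOWN_DURATION:
        # e.g. a worker pinged /start, crashed, and someone manually pinged
        # the URL three days later: "3 days X hours" is not a useful runtime.
        return None
    return delta

start = datetime(2019, 2, 21, 10, 0)
assert shown_duration(start, start + timedelta(minutes=45)) == timedelta(minutes=45)
assert shown_duration(start, start + timedelta(days=3)) is None
```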
@danemacmillan commented on GitHub (Feb 21, 2019):
Point number 1 makes sense (as does number 2 😉). I know at least personally that 24h would be more than enough. Perhaps use the Grace Time setting in determining whether a ping is (a) the end of a currently executing job or (b) a newly started job? This way you avoid arbitrarily opening the window so wide, while using concrete settings that play well with this functionality. For example, with these long-running jobs I've set their Grace Time to 200 minutes (3h20m). I think it's a good indicator, while minimizing the likelihood of point number 1. Perhaps make it Grace Time + 1 hour padding? Doing this would mean that even if an alert is sent out (exceeding Grace Time), the extra padding would still allow the calculation to be done when the job eventually pings, so the original description in this ticket wouldn't actually come true. If the window is exceeded even then, perhaps add an explicit note in the log that the previous job likely failed, because by that point I think the software has done its due diligence.

@cuu508 commented on GitHub (Mar 1, 2019):
Thanks, these are good ideas. I went with your suggested Grace Time + 1 hour formula. I also clamped it to 12 hours max.

@danemacmillan commented on GitHub (Mar 1, 2019):
Sweet! I already retroactively see the change in logic:
Something I just considered, though I don't expect to need it: should a job ever become radically faster, and I decrease the grace time, the run times of older executions with longer grace times may disappear if the grace time is not persisted with each particular execution. That would be the case if it's stored as a high-level property of the job that applies to every run, past, present, and future; I don't know whether that's so, as I haven't read through the code very thoroughly. Storing the grace time value in effect at the time of execution, along with the job execution results, would address this if it isn't already being done. I mention it only as a consideration, since I know it would increase complexity and I don't see any urgency. It's more one of those back-of-the-mind quirks that you know about.
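The per-execution snapshot idea above could look roughly like this. All names here (`PingRecord`, `grace_at_ping`, `record_ping`) are hypothetical and not taken from the healthchecks schema:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class PingRecord:
    remote_addr: str
    grace_at_ping: timedelta  # snapshot of the check's grace when the ping arrived

@dataclass
class Check:
    grace: timedelta

    def record_ping(self, remote_addr: str) -> PingRecord:
        # Copy the current grace onto the record, so later changes to the
        # check's grace setting don't retroactively hide historic run times.
        return PingRecord(remote_addr, grace_at_ping=self.grace)

check = Check(grace=timedelta(minutes=200))
ping = check.record_ping("10.0.0.1")
check.grace = timedelta(minutes=30)  # later, the job becomes much faster
assert ping.grace_at_ping == timedelta(minutes=200)  # historic record unchanged
```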
@cuu508 commented on GitHub (Mar 6, 2019):
That's a good point – changing the grace time can make the historic execution times flip on and off.
Will keep this in mind. In future we might build new features that require more data to be stored with each ping. At that time it would make sense to improve this area as well.