[GH-ISSUE #219] Job execution calculation not displayed if an integration alert is delivered (or a job runs long) #159

Closed
opened 2026-02-25 23:41:23 +03:00 by kerem · 5 comments
Owner

Originally created by @danemacmillan on GitHub (Feb 12, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/219

In the log for job executions, if one uses the `/start` endpoint, the next request will display the job execution time:

![](https://i.imgur.com/VRGe2PS.png)

However, if a notification or alert is sent between the "Started" and "OK" requests, the job's execution time is not displayed.

Edit:

Now I'm thinking it just has to do with a job that runs long. I've been adjusting some jobs that have moved to slower disks, and they now take about three hours to run; the grace time was adjusted accordingly, so alerts are no longer being triggered.

Have a look at the last few dozen logs:

![](https://i.imgur.com/WYAK3V8.png)
kerem closed this issue 2026-02-25 23:41:24 +03:00

@cuu508 commented on GitHub (Feb 21, 2019):

Yep, the run times are not shown in the UI if the job takes longer than 1 hour to run: https://github.com/healthchecks/healthchecks/blob/master/hc/front/views.py#L415

A couple of reasons for this:

  • I assumed the run times would typically be in seconds and minutes, not in hours. If a ping arrives very late, it might be unrelated to the last `/start`. Take this example: a worker process pings the `/start` endpoint and promptly crashes. Three days later somebody notices and, while investigating, manually pings the ping URL. The runtime now shows "3 days X hours", which is not useful.
  • Fewer worries about formatting and displaying the run time in the UI. The run time will always be in "X min Y sec" or "Y sec" form, so the UI string cannot get too long and cause problems. IOW, me being lazy ;-)

I'm thinking, raising the limit from 1 hour to 24 hours (so for long jobs UI would show "X h Y min Z sec") would probably be enough, right?
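For context, the cutoff behavior described above can be sketched roughly like this (hypothetical names, not the actual healthchecks code; the real check lives in hc/front/views.py):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: a run time is shown only if the "OK" ping
# arrives within this window after the matching /start ping.
MAX_DURATION = timedelta(hours=1)

def display_duration(start_time, ping_time):
    """Return the run time to show in the log, or None to hide it."""
    duration = ping_time - start_time
    if duration > MAX_DURATION:
        # The ping arrived too late -- it may be unrelated to the last
        # /start (e.g. a manual ping days after a worker crashed).
        return None
    return duration
```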


@danemacmillan commented on GitHub (Feb 21, 2019):

Point number 1 makes sense (as does number 2 😉 ). I know that, at least for me, 24h would be more than enough. Perhaps use the Grace Time setting to determine whether a ping is (a) the end of a currently executing job or (b) the start of a new one? This way you avoid arbitrarily opening up the window so wide, while using concrete settings that play well with this functionality. For example, I've set the Grace Time of these long-running jobs to 200 minutes (3h20m). I think it's a good indicator, while minimizing the likelihood of point number 1. Perhaps make it `Grace Time + 1 hour` of padding? That way, even if an alert is sent out (because Grace Time was exceeded), the extra padding would still allow the calculation to be done when the job eventually pings--so the original description in this ticket wouldn't actually come true. Beyond that, perhaps add an explicit note in the log that the previous job likely failed, because by that point I think the software has done its due diligence.


@cuu508 commented on GitHub (Mar 1, 2019):

Thanks, these are good ideas. I went with your suggested `Grace Time + 1 hour` formula. I also clamped it to 12 hours max.
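Expressed as a small sketch (hypothetical function name, not the actual healthchecks code), the new cutoff is:

```python
from datetime import timedelta

# Upper bound on the window, regardless of how large Grace Time is set.
MAX_CUTOFF = timedelta(hours=12)

def duration_cutoff(grace: timedelta) -> timedelta:
    """Show a run time only if the ping arrives within this window
    after /start: grace time + 1 hour, clamped to 12 hours."""
    return min(grace + timedelta(hours=1), MAX_CUTOFF)
```

With the 200-minute grace time from the comment above, the window works out to 4h20m, comfortably covering the three-hour jobs.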


@danemacmillan commented on GitHub (Mar 1, 2019):

Sweet! I already retroactively see the change in logic:

![screen shot 2019-03-01 at 10 15 55 am](https://user-images.githubusercontent.com/4390491/53647287-36033f00-3c0b-11e9-82c3-754651145d58.png)

Something I just considered, and I don't expect to need it, but should a job ever become radically faster--and I decrease the grace time--the older entries with longer grace times may disappear if the grace time is not persisted with the particular execution of the job. That would be the case if it's stored as a high-level property of the job that applies to every run, past, present, and future. I don't know whether that's so, as I haven't read through the code very thoroughly. Storing the grace time value in effect at the time of the execution, if that's not already being done, along with the job execution results would address this. I mention it only as a consideration: I know it would increase complexity, and I don't see any urgency. It's more one of those back-of-the-mind quirks that you know about.
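The snapshot idea above could look roughly like this (a sketch with hypothetical names, assuming each ping record could carry one extra field; not the actual healthchecks schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PingRecord:
    created: datetime
    # Grace time in effect when this ping arrived. Later changes to the
    # check's grace setting would not affect this historic entry.
    grace_at_ping: timedelta

def record_ping(current_grace: timedelta, now: datetime) -> PingRecord:
    """Snapshot the check's current grace time alongside the ping."""
    return PingRecord(created=now, grace_at_ping=current_grace)
```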


@cuu508 commented on GitHub (Mar 6, 2019):

That's a good point – changing the grace time can make the historic execution times flip on and off.

> Storing the current grace time value at the time of the execution, if not already being done, along with the job execution results would address this.

Will keep this in mind. In future we might build new features that require more data to be stored with each ping. At that time it would make sense to improve this area as well.

Reference: starred/healthchecks#159