[GH-ISSUE #218] Suggest a Grace Time based on average job execution time #157

Closed
opened 2026-02-25 23:41:23 +03:00 by kerem · 3 comments

Originally created by @danemacmillan on GitHub (Feb 8, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/218

I've begun [measuring job execution time](https://healthchecks.io/docs/#start-event) of all of the health checks that were deployed before the feature was available. One thing that occurred to me is that, given a large enough dataset of measured execution times, the Grace Time feature could leverage the data being collected to help users set more informed values.

I wouldn't want the Grace Time to be modified automatically based on this data, on the off chance that some jobs are severely skewed, but providing the average execution time for a job somewhere in the interface, and offering it as a suggested value based on `n` measured runs, would be really insightful. I would even say this data should only appear in the UI once a certain threshold of data has been collected--say, 100 successfully measured runs per job. Other values could be provided too, such as the longest and shortest run times; execution times could even be broken down by period: all-time average, last-week average, last-month average, and so on.

The reason I bring this up is that I'm currently setting Grace Times manually, based on my anecdotal knowledge of each job's average run time; this is usually sufficient, since most of us using this service are intimately aware of our own system operations. However, some jobs take progressively longer for the simple reason that the job has stayed the same while the amount of data it processes has grown. I don't obsess over how long these particular jobs run and tweak their grace times accordingly; typically the moment I decide to increase a Grace Time is when, after a period of, say, four months, I notice I'm getting too many alerts for a particular job simply because it needs more time to process more data. At that point I check how long the job now takes and adjust its Grace Time accordingly.

To summarize, then: displaying the average execution time for each job somewhere in the UI would help users set more informed Grace Time values.
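The suggestion mechanism proposed above could be sketched roughly as follows. This is an illustrative sketch only, not healthchecks code: `suggest_grace`, `MIN_SAMPLES`, and the 95th-percentile-plus-padding heuristic are all assumptions; only the 100-run threshold comes from the issue.

```python
# Sketch: suggest a Grace Time from a job's measured run durations.
# All names here are hypothetical; not part of the healthchecks codebase.
from statistics import quantiles

MIN_SAMPLES = 100  # threshold of measured runs proposed in the issue


def suggest_grace(durations_s):
    """Return a suggested Grace Time in seconds, or None if too little data."""
    if len(durations_s) < MIN_SAMPLES:
        return None
    # 95th percentile of observed durations (last of 19 cut points).
    p95 = quantiles(durations_s, n=20)[-1]
    # Pad by 50% so gradual slowdowns don't immediately trigger alerts.
    return round(p95 * 1.5)
```

A percentile is used rather than the plain average so that a few unusually fast runs don't drag the suggestion below what the job typically needs.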

kerem 2026-02-25 23:41:23 +03:00
  • closed this issue
  • added the feature label

@skorokithakis commented on GitHub (May 19, 2019):

That's a pretty cool feature, would you accept a PR if I had time to work on it?


@cuu508 commented on GitHub (May 20, 2019):

Yes, I think this would be a neat and useful feature. I'm open to reviewing and accepting PR(s).


@cuu508 commented on GitHub (Nov 19, 2025):

I am not planning to work on this. The idea is neat and easy to understand, but quite tricky to implement. We do not store execution times in the database; we calculate them at display time. It is possible to iterate through all the pings a check has received, calculate the execution times, and then take the average or median, but that is a DB-heavy operation and messy Python code. The cost-benefit ratio isn't there for me, so I'm not planning to add this.

What I've been doing instead is –

  • when creating a check, set the grace time to an amount that feels "reasonable" (since I have some idea what the job is doing and how long it should take)
  • when the check overshoots its execution time then either fix the slowdown or increase the grace time
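The "iterate through all pings" approach described in the comment above could look roughly like this. It is a minimal sketch, assuming a simplified `Ping` record with a `created` timestamp and a `kind` field ("start" for a start event, empty for a success ping); the real healthchecks models and queries are not shown, and the DB-heavy part (fetching every ping) is exactly what makes this unattractive in practice.

```python
# Sketch: derive execution times by pairing "start" events with the
# next success ping, then take the median. Ping is a stand-in record,
# not the actual healthchecks model.
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Ping:
    created: datetime
    kind: str  # "start" for a start event, "" for a regular success ping


def execution_times(pings):
    """Yield run durations (seconds) for each start/success pair."""
    start = None
    for ping in sorted(pings, key=lambda p: p.created):
        if ping.kind == "start":
            start = ping.created
        elif start is not None:
            yield (ping.created - start).total_seconds()
            start = None


def median_execution_time(pings):
    times = list(execution_times(pings))
    return median(times) if times else None
```

The median is a reasonable choice here because, as noted in the issue, some runs may be severely skewed; a single pathological run would distort an average but barely moves the median.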