[GH-ISSUE #282] Late vs Started #212

Closed
opened 2026-02-25 23:41:37 +03:00 by kerem · 13 comments

Originally created by @ScottBeeson on GitHub (Aug 28, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/282

If I use /start to signal the start of a job my badge shows up as "Late" even though it started well within the allowed time frame.

At the start of my long-running PS script I am calling this:
Invoke-RestMethod https://hc-ping.com/<id>/start

If it succeeds, I'm calling this:
Invoke-RestMethod https://hc-ping.com/<id>

If it fails, I'm adding /fail.

Can the badge show "Running"? Am I doing something wrong?
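For readers following along outside PowerShell, here is a minimal Python sketch of the ping pattern described above: hit `/start`, run the job, then hit the plain URL on success or `/fail` on failure. The helper names (`ping_url`, `http_get`, `run_monitored`) are hypothetical and not part of Healthchecks; only the URL shapes come from the post. The `send` parameter is injectable so the logic can be exercised without network access.

```python
from typing import Callable
import urllib.request

PING_BASE = "https://hc-ping.com"  # base URL as used in the post above


def ping_url(check_id: str, suffix: str = "") -> str:
    """Build a ping URL; suffix is "", "/start" or "/fail"."""
    return f"{PING_BASE}/{check_id}{suffix}"


def http_get(url: str) -> None:
    """Default sender (an assumption): a plain GET with a short timeout."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_monitored(check_id: str, job: Callable[[], None],
                  send: Callable[[str], None] = http_get) -> bool:
    """Signal start, run the job, then signal success or failure."""
    send(ping_url(check_id, "/start"))
    try:
        job()
    except Exception:
        # Any exception from the job is reported as a failure.
        send(ping_url(check_id, "/fail"))
        return False
    send(ping_url(check_id))
    return True
```

Calling `run_monitored("<id>", my_job)` would then send the same three kinds of requests as the `Invoke-RestMethod` calls above.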

kerem closed this issue 2026-02-25 23:41:37 +03:00

@cuu508 commented on GitHub (Aug 29, 2019):

Hi Scott,

The badges currently can only show "up", "late" or "down". My main reservation about adding a fourth "started" or "running" state is this:

Imagine a scenario where a check is initially down and then receives a "/start" event. In this case, if we show a "started" or "running" in the badge, it would give a false impression that things are OK. In reality, the check is ultimately still "not OK" until it completes successfully.

There could be a case where a job is looping in an infinite start-crash cycle. The badge would show "started" most of the time.

There's also a minor concern about backwards compatibility. If an existing client is using the JSON format, or looking for specific strings in SVG, an unexpected status could break it.

@ScottBeeson commented on GitHub (Aug 29, 2019):

That makes sense. I think I'm honestly not using /start as intended. I'm using it for a long-running batch job. Up doesn't make sense anyway. I'll adapt.

Thanks for the response.

@m42e commented on GitHub (Sep 30, 2019):

Hi,

I recently had a similar thought. I have some jobs that run for a few seconds and then report the result.

I'd like to have the measurements available in Healthchecks, but then the status will change very often to the mentioned “late” state.

I would propose two possible solutions:

  1. First:
     • let started not alter the “ok” status
     • let a start after a previous one not reset the timer
     • so if no ping is received within the first start + length of period, report it as late
  2. Other solution:
     • add an additional timeout criterion for running jobs
     • the period would then be the timespan between two consecutive starts, and the ping would be expected to arrive within the timeout
     • late would be set for a missed start or an exceeded run timeout

What do you think?

Thanks a lot for the effort you put into this :)

@m42e commented on GitHub (Oct 13, 2019):

@cuu508 Any thoughts?

@cuu508 commented on GitHub (Oct 13, 2019):

@m42e sorry, saw your comment and then forgot about it since it's a closed ticket.

This is a tricky problem. Initially Healthchecks only had "simple" checks and no /start endpoint, and the "late" status made perfect sense. Now with cron schedules, and with "start" events, the meaning of "late" is a lot murkier.

I now sort of wish I had implemented /start in a very simple and limited way: store a "last_start" timestamp, and use it to calculate execution time on finish. And nothing more – don't display a "started" status in dashboard, don't interfere with badges, don't send an alert if /start isn't followed by a regular ping on time. Of course it's too late now to change that, and drop features. Also, I suspect, if Healthchecks had just this simple version, there would be feature requests to show the "started" status in dashboard, and to send alerts when a started check is hanging for too long.

I do agree that an orange "late" badge for started checks is not ideal: it is more likely to be interpreted as "we have a small problem" instead of "all is well for now, but one or more checks may or may not go down soon".

> let started not alter the “ok” status

This is an option. There's a related problem with checks with cron expressions. Let's say the cron expression is "0 * * * *" (at minute 0 of every hour). In reality the job takes some time to run and so it reports in at minute 2-3 of every hour. As a result, for a few minutes at the start of every hour a badge would show "late" although everything is working as expected.
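The hourly example can be made concrete with a simplified model of how a scheduled check's status changes over time. This is only a sketch under stated assumptions, not Healthchecks' actual code: the function name `simple_status` is hypothetical, and it assumes no ping has arrived yet for the current run.

```python
from datetime import datetime, timedelta


def simple_status(now: datetime, expected: datetime, grace: timedelta) -> str:
    """Simplified model (an assumption, not the real implementation):
    'up' before the expected time, 'late' during the grace window,
    'down' once the grace window has passed without a ping."""
    if now < expected:
        return "up"
    if now < expected + grace:
        return "late"
    return "down"


# Cron expression "0 * * * *": a ping is expected at the top of the hour.
expected = datetime(2019, 10, 13, 12, 0)
grace = timedelta(minutes=10)

before = simple_status(datetime(2019, 10, 13, 11, 59), expected, grace)
running = simple_status(datetime(2019, 10, 13, 12, 2), expected, grace)
overdue = simple_status(datetime(2019, 10, 13, 12, 11), expected, grace)
```

In this model the job that reports in at minute 2-3 is still running at 12:02, so `running` comes out as "late" even though nothing is wrong.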

Maybe we should just get rid of the "late" status for badges – make it so that a badge can only ever be "up" or "down" ("late" would count as "up").

@jesse-holden commented on GitHub (Nov 4, 2019):

I think it should be up to the user to ensure the /fail endpoint is called if the process crashes after calling /start. You could also include a configurable timeout/grace period for how long between start and end the check should take.

Also, a log warning in case /start is called back-to-back might be useful; ideally you should not be calling /start again until the regular endpoint or the /fail endpoint has been called.

@wlupton commented on GitHub (Nov 24, 2020):

I started using this tool today and had a very enjoyable time getting things set up. This is the one slight annoyance that I hit, but it's enough for me not to want to use the status badges.

I like the /start URL because I think it's positively useful to be able to time the jobs, but immediately showing 'late' just seems wrong. I have set my grace period to rather longer than I expect the job to run for, so might it be possible not to show 'late' until some fraction of the grace period has expired (or until N minutes before the grace period expires)? Of course this knob could be configurable.

(My setup is that I protect the cron jobs with lockrun, and I simply call /start at the top of the script and /$status at the bottom. So I know that both will be called, unless the script itself crashes, and I know that the jobs will never overlap.)

@wlupton commented on GitHub (Nov 25, 2020):

Here are some more thoughts on this:

  1. If there were a new duration field (expected duration in seconds; default 0) then there could be a new active state triggered by a /start message and automatically transitioning to late on duration expiry. This could be backwards compatible

  2. With the current settings, if the schedule was set to the expected completion times and if the /start message was used only for duration measurement then the late state wouldn't be entered until the expected completion time. I tried this but it doesn't currently work. This isn't backwards-compatible, and is messy in that schedule no longer matches the crontab entry, but it avoids the need for an additional configuration variable or for a change to the state machine

  3. Finally, re my comment on the status badges, I guess I could use the JSON URLs, in which case I could choose how to deal with the grace field and could treat {"status":"up","grace":>0} as {"status":"active"}

@wlupton commented on GitHub (Nov 25, 2020):

Sorry, one more thing:

  1. A variant of bullet 2 above: The duration could be learned (on the server side). This would have to assume that the duration doesn't vary much. More complicated...

@cuu508 commented on GitHub (Nov 25, 2020):

Hi @wlupton thanks for your feedback and ideas, appreciate it!

> 2. With the current settings, if the schedule was set to the expected completion times and if the /start message was used only for duration measurement then the late state wouldn't be entered until the expected completion time. I tried this but it doesn't currently work. This isn't backwards-compatible, and is messy in that schedule no longer matches the crontab entry, but it avoids the need for an additional configuration variable or for a change to the state machine

This sounds messy indeed, and would not work correctly whenever the job completes slightly early. For example, let's say the schedule expects the job to check in at 3PM. If the job checks in at 2:59 PM, Healthchecks does record the early ping, but will still expect a ping at or after 3PM.
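The early-ping case can be illustrated with a small sketch. It assumes a daily 3 PM schedule (the comment does not say how often the job runs) and models the next expected time as the first scheduled occurrence strictly after the last ping. The function name `next_expected_daily` is hypothetical and this is not how Healthchecks itself is implemented.

```python
from datetime import datetime, time, timedelta


def next_expected_daily(last_ping: datetime, at: time) -> datetime:
    """First occurrence of the daily time `at` strictly after last_ping."""
    candidate = datetime.combine(last_ping.date(), at)
    if candidate <= last_ping:
        candidate += timedelta(days=1)
    return candidate


three_pm = time(15, 0)

# Job checks in one minute early: a ping is still expected at 3 PM today.
early = next_expected_daily(datetime(2020, 11, 25, 14, 59), three_pm)

# Job checks in one minute after 3 PM: the next ping is expected tomorrow.
on_time = next_expected_daily(datetime(2020, 11, 25, 15, 1), three_pm)
```

Under this model the 2:59 PM ping does not satisfy the 3 PM expectation, which is why setting the schedule to the expected completion time breaks whenever the job finishes slightly early.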

I'm leaning towards doing the simple thing and just excluding "grace" from the possible badge statuses. The badges would only ever show "up" or "down". If any check is in the grace period, the badge would consider it "up". The only worry with that is breaking backwards compatibility. I'm currently experimenting with a UI that would allow toggling between "Up / Down" (proposed) and "Up / Grace / Down" (current) modes on the "Badges" page.

@wlupton commented on GitHub (Nov 25, 2020):

Thanks! Would you still include grace in the JSON? I think there is some value in doing this.

What about the main GUI? Currently an active check shows an exclamation mark ('!') suggesting a warning state, but I think that 'active' is really more INFO than WARNING.

Edit: In fact maybe just the existing flashing `...` is sufficient indication?

@cuu508 commented on GitHub (Nov 25, 2020):

My current attempt at adding a toggle between "Up / Down" and "Up / Late / Down" modes:

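![image](https://user-images.githubusercontent.com/661859/100225259-9af5e100-2f26-11eb-9f46-3deae46d0101.png)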

> Would you still include grace in the JSON? I think there is some value in doing this.

Yes, I think it makes sense for the "grace" key to stay. You would sometimes get responses where status is "up" but "grace" is non-zero, e.g.:

{"status":"up","total":2,"grace":2,"down":0}
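A client that wants its own "active" state could derive it from such a response, as suggested earlier in the thread. The sketch below is one possible client-side interpretation, not something Healthchecks provides: the function name `effective_status` and the "active" label are assumptions, and only the JSON keys shown above are relied on.

```python
import json


def effective_status(badge_json: str) -> str:
    """Map a badge JSON response to a client-side status.

    Treats "up" with a non-zero "grace" count as "active"; every other
    status is passed through unchanged.
    """
    data = json.loads(badge_json)
    status = data["status"]
    if status == "up" and data.get("grace", 0) > 0:
        return "active"
    return status


sample = '{"status":"up","total":2,"grace":2,"down":0}'
result = effective_status(sample)
```

With the sample response above, `result` is "active"; a response with `"grace":0` stays "up".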

> What about the main GUI?

I've previously received comments about the orange "!" looking too alarming in the list of checks as well, so may need to rethink that as well.

@wlupton commented on GitHub (Nov 25, 2020):

Thanks. The thing that still seems a little odd here is that surely it's normal for a job to take a significant time to execute (and there's even a feature for measuring the duration)? So it seems a bit surprising that this isn't reflected in the state definitions. But I'm repeating myself! I'm happy.
