mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 23:15:49 +03:00
[GH-ISSUE #282] Late vs Started #212
Originally created by @ScottBeeson on GitHub (Aug 28, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/282
If I use /start to signal the start of a job my badge shows up as "Late" even though it started well within the allowed time frame.
At the start of my long-running PS script I am calling this:
Invoke-RestMethod https://hc-ping.com/<id>/start
If it succeeds, I'm calling this:
Invoke-RestMethod https://hc-ping.com/<id>
If it fails, I'm adding /fail.
Can the badge show "Running"? Am I doing something wrong?
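For readers following along outside PowerShell, the start / success / fail sequence described above can be sketched in Python. This is only an illustration: the function names, the 10-second timeout, and the injectable `ping` parameter are assumptions made for the example, and `check_id` stands in for the `<id>` placeholder from the post.

```python
import urllib.request
from typing import Callable

BASE_URL = "https://hc-ping.com"  # ping host as given in the post above


def http_ping(url: str) -> None:
    """Send a GET request to a ping URL (the 10 s timeout is an arbitrary choice)."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_pings(check_id: str, job: Callable[[], None],
                   ping: Callable[[str], None] = http_ping) -> bool:
    """Signal the start, run the job, then signal success or failure."""
    url = f"{BASE_URL}/{check_id}"
    ping(url + "/start")      # job is starting
    try:
        job()
    except Exception:
        ping(url + "/fail")   # job raised: report failure
        return False
    ping(url)                 # job finished: report success
    return True
```

Passing a different `ping` callable (for example `print`) shows which URLs would be requested without touching the network.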
@cuu508 commented on GitHub (Aug 29, 2019):
Hi Scott,
The badges currently can only show "up", "late" or "down". My main reservation about adding a fourth "started" or "running" state is this:
Imagine a scenario where a check is initially down and then receives a "/start" event. In this case, if we show a "started" or "running" in the badge, it would give a false impression that things are OK. In reality, the check is ultimately still "not OK" until it completes successfully.
There could be a case where a job is looping in an infinite start-crash cycle. The badge would show "started" most of the time.
There's also a minor concern about backwards compatibility. If an existing client is using the JSON format, or looking for specific strings in SVG, an unexpected status could break it.
@ScottBeeson commented on GitHub (Aug 29, 2019):
That makes sense. I think I'm honestly not using /start as intended. I'm using it for a long-running batch job. Up doesn't make sense anyway. I'll adapt.
Thanks for the response.
@m42e commented on GitHub (Sep 30, 2019):
Hi,
I recently had a similar thought. I have some jobs that run for a few seconds and then report the result.
I'd like to have the measurements available in healthchecks, but then the status will change very often to the mentioned "late" state.
I would propose two possible solutions:
What do you think?
Thanks a lot for the effort you put into this :)
@m42e commented on GitHub (Oct 13, 2019):
@cuu508 Any thoughts?
@cuu508 commented on GitHub (Oct 13, 2019):
@m42e sorry, saw your comment and then forgot about it since it's a closed ticket.
This is a tricky problem. Initially Healthchecks only had "simple" checks and no /start endpoint, and the "late" status made perfect sense. Now with cron schedules, and with "start" events, the meaning of "late" is a lot murkier.
I now sort of wish I had implemented /start in a very simple and limited way: store a "last_start" timestamp, and use it to calculate execution time on finish. And nothing more: don't display a "started" status in dashboard, don't interfere with badges, don't send an alert if /start isn't followed by a regular ping on time. Of course it's too late now to change that, and drop features. Also, I suspect, if Healthchecks had just this simple version, there would be feature requests to show the "started" status in dashboard, and to send alerts when a started check is hanging for too long.
I do agree that an orange "late" badge for started checks is not ideal: it is more likely to be interpreted as "we have a small problem" instead of "all is well for now, but one or more checks may or may not go down soon".
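The "very simple and limited" variant described above (store a start timestamp, use it to compute execution time on finish, and nothing more) might look roughly like the sketch below. The class and its field names are hypothetical and are not taken from the Healthchecks code base.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SimpleCheck:
    """Hypothetical model for illustration, not the Healthchecks schema."""
    last_start: Optional[datetime] = None
    last_duration: Optional[timedelta] = None

    def on_start(self, now: datetime) -> None:
        # Only remember when the job started; status and badges are untouched.
        self.last_start = now

    def on_success(self, now: datetime) -> None:
        # Execution time is known only if a start was recorded first.
        if self.last_start is not None:
            self.last_duration = now - self.last_start
            self.last_start = None
```

In this sketch a start event has no effect on status or alerting, which is exactly the trade-off discussed above.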
This is an option. There's a related problem with checks with cron expressions. Let's say the cron expression is "0 * * * *" (at minute 0 of every hour). In reality the job takes some time to run and so it reports in at minute 2-3 of every hour. As a result, for a few minutes at the start of every hour a badge would show "late" although everything is working as expected.
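The hourly cron example can be made concrete with a deliberately simplified status model. The functions below are assumptions for illustration only: they hard-code the "0 * * * *" schedule instead of parsing cron expressions and do not reproduce the real Healthchecks logic.

```python
from datetime import datetime, timedelta


def next_hourly_due(last_ping: datetime) -> datetime:
    """Next run of the cron expression "0 * * * *" strictly after last_ping."""
    top_of_hour = last_ping.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def status_at(now: datetime, last_ping: datetime, grace: timedelta) -> str:
    """Simplified model: up until the due time, late during grace, then down."""
    due = next_hourly_due(last_ping)
    if now < due:
        return "up"
    if now < due + grace:
        return "late"
    return "down"
```

With a last ping at 14:02:30, this model reports "late" from 15:00 until the job checks in again a couple of minutes later, even though nothing is wrong.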
Maybe we should just get rid of the "late" status for badges – make it so that a badge can only ever be "up" or "down" ("late" would count as "up").
@jesse-holden commented on GitHub (Nov 4, 2019):
I think it should be up to the user to ensure the /fail endpoint is called if the process crashes after calling /start. You could also include a configurable timeout/grace period for how long between start and end the check should take.
Also, a log warning in case /start is called back-to-back might be useful. Ideally you should not be calling /start until the regular endpoint or /fail endpoint has been called.
@wlupton commented on GitHub (Nov 24, 2020):
I started using this tool today and had a very enjoyable time getting things set up. This is the one slight annoyance that I hit, but it's enough for me not to want to use the status badges.
I like the /start URL because I think it's positively useful to be able to time the jobs, but immediately showing 'late' just seems wrong. I have set my grace period to rather longer than I expect the job to run for, so might it be possible not to show 'late' until some fraction of the grace period has expired (or until N minutes before the grace period expires)? Of course this knob could be configurable.
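The "fraction of the grace period" idea could be sketched as follows. Everything here is hypothetical: the function name, the `late_fraction` knob, and the assumption that the grace period is counted from the /start ping are made up for the example and do not describe existing Healthchecks behaviour.

```python
from datetime import datetime, timedelta


def badge_status_after_start(now: datetime, started_at: datetime,
                             grace: timedelta,
                             late_fraction: float = 0.5) -> str:
    """Show 'late' only once late_fraction of the grace period has elapsed."""
    elapsed = now - started_at
    if elapsed >= grace:
        return "down"                      # grace fully used up
    if elapsed >= grace * late_fraction:
        return "late"                      # warning window before expiry
    return "up"                            # recently started, nothing to report
```

Setting `late_fraction` to 0 would show "late" immediately after /start, as described earlier in the thread, so the knob could default to that for backwards compatibility.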
(My setup is that I protect the cron jobs with lockrun, and I simply call /start at the top of the script and /$status at the bottom. So I know that both will be called, unless the script itself crashes, and I know that the jobs will never overlap.)
@wlupton commented on GitHub (Nov 25, 2020):
Here are some more thoughts on this:
If there were a new duration field (expected duration in seconds; default 0) then there could be a new active state triggered by a /start message and automatically transitioning to late on duration expiry. This could be backwards compatible.
With the current settings, if the schedule was set to the expected completion times and if the /start message was used only for duration measurement then the late state wouldn't be entered until the expected completion time. I tried this but it doesn't currently work. This isn't backwards-compatible, and is messy in that schedule no longer matches the crontab entry, but it avoids the need for an additional configuration variable or for a change to the state machine.
Finally, re my comment on the status badges, I guess I could use the JSON URLs, in which case I could choose how to deal with the grace field and could treat {"status":"up","grace":>0} as {"status":"active"}
@wlupton commented on GitHub (Nov 25, 2020):
Sorry, one more thing:
@cuu508 commented on GitHub (Nov 25, 2020):
Hi @wlupton thanks for your feedback and ideas, appreciate it!
This sounds messy indeed, and would not work correctly whenever the job completes slightly early. For example, let's say the schedule expects the job to check in at 3PM. If the job checks in at 2:59 PM, Healthchecks does record the early ping, but will still expect a ping at or after 3PM.
I'm leaning towards doing the simple thing and just excluding "grace" from the possible badge statuses. The badges would only ever show "up" or "down". If any check is in the grace period, the badge would consider it "up". The only worry with that is breaking backwards compatibility. I'm currently experimenting with a UI that would allow toggling between "Up / Down" (proposed) and "Up / Grace / Down" (current) modes on the "Badges" page.
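The difference between the two badge modes can be illustrated with a small sketch. The function, its `include_late` flag, and the per-check status strings ("up", "grace", "down") are assumptions for the example rather than the actual implementation.

```python
from typing import Iterable


def badge_label(check_statuses: Iterable[str], include_late: bool = True) -> str:
    """Collapse per-check statuses into a single badge label.

    include_late=True  models the current "Up / Grace / Down" mode.
    include_late=False models the proposed "Up / Down" mode.
    """
    statuses = set(check_statuses)
    if "down" in statuses:
        return "down"
    if include_late and "grace" in statuses:
        return "late"
    return "up"   # in "Up / Down" mode a check in its grace period counts as up
```

For example, `badge_label(["up", "grace"], include_late=False)` yields "up", while the same input in the current mode yields "late".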
@wlupton commented on GitHub (Nov 25, 2020):
Thanks! Would you still include grace in the JSON? I think there is some value in doing this.
What about the main GUI? Currently an active check shows an exclamation mark ('!') suggesting a warning state, but I think that 'active' is really more INFO than WARNING.
Edit: In fact maybe just the existing flashing ... is sufficient indication?
@cuu508 commented on GitHub (Nov 25, 2020):
My current attempt at adding a toggle between "Up / Down" and "Up / Late / Down" modes:
Yes, I think it makes sense for the "grace" key to stay. You would sometimes get responses where status is "up" but "grace" is non-zero, e.g.:
{"status":"up","total":2,"grace":2,"down":0}
I've previously received comments about the orange "!" looking too alarming in the list of checks as well, so may need to rethink that as well.
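A client that wants its own "active" label, as suggested earlier in the thread, could derive it from a JSON response shaped like the example above. This is a sketch only: the function name is hypothetical, and "active" is a client-side label, not a status that Healthchecks returns.

```python
import json


def interpret_badge(payload: str) -> str:
    """Map a badge JSON payload to a client-side label."""
    data = json.loads(payload)
    # Status "up" with a non-zero "grace" count: at least one check is in
    # its grace period, which this client chooses to call "active".
    if data["status"] == "up" and data.get("grace", 0) > 0:
        return "active"
    return data["status"]
```

Called with the example payload above, the function returns "active"; a payload with `"grace":0` passes the reported status through unchanged.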
@wlupton commented on GitHub (Nov 25, 2020):
Thanks. The thing that still seems a little odd here is that surely it's normal for a job to take a significant time to execute (and there's even a feature for measuring the duration)? So it seems a bit surprising that this isn't reflected in the state definitions. But I'm repeating myself! I'm happy.