[GH-ISSUE #392] Expose job's last duration in Prometheus metrics? #296
Originally created by @cuu508 on GitHub (Jul 2, 2020).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/392
Prometheus users – would it be useful to have access to a check's last duration in Prometheus?
Currently, for every check we report its up/down state:
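For illustration (the exact help text and label set are assumptions, not copied from the issue), the existing up/down metric is exposed roughly like this:

```
# HELP hc_check_up Value is 1 if the check is currently up, 0 if it is down.
# TYPE hc_check_up gauge
hc_check_up{name="Database Backup", tags="prod db", unique_key="abc123"} 1
```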
To start a discussion, we could for example report last duration like so:
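A minimal sketch of what the proposed metric could look like (the label set, units, and value are illustrative assumptions):

```
# HELP hc_check_last_duration Duration of the check's most recent completed run, in seconds.
# TYPE hc_check_last_duration gauge
hc_check_last_duration{name="Database Backup", tags="prod db", unique_key="abc123"} 4.21
```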
Is hc_check_last_duration a good name? In the above example I went with the gauge type because:
@SuperSandro2000 commented on GitHub (Jul 2, 2020):
You could make a time-series graph out of this in Grafana, monitor whether a job took unusually long, and act based on that. Maybe set up a little alert if it is higher than x.
Or could you? I am not sure right now. If I only get the latest duration, I don't think it is useful, because I would want to graph it over time to notice spikes, maybe correlated to some other change in the system.
@fcharlier commented on GitHub (Aug 4, 2020):
I was looking at getting this information to keep a longer history of run times for my tasks, so:
@mbaldini1 commented on GitHub (Aug 26, 2020):
In general I think that every piece of useful information should be shown; some information is useful for some users and not for others, but it's better to have more information than less.
@xLogiiCx commented on GitHub (Oct 4, 2022):
Any updates on this? I would love to have a job-duration Grafana dashboard...
@cuu508 commented on GitHub (Dec 19, 2022):
I did some experimenting. I added a hc_check_last_duration metric and tried to make visualizations in Grafana for it. Here's a time series:
[screenshot: time-series panel of hc_check_last_duration]
My problem with this is: if the job runs (say) once per hour, but the time series shows a data point for (say) every minute, there will be 59 duplicated data points showing the same duration. In the screenshot above, see the horizontal yellow line: it shows a single measurement that did not change over the displayed time period, because the job did not run in that period. For the orange line there are also a few horizontal sections where the same value is displayed 2-3 times in a row.
Here's a different visualization, showing just the most recent values:
[screenshot: most recent duration values]
And another idea I had was to use the already existing hc_check_started metric to make a visualization similar to what's suggested in #443:
[screenshot: hc_check_started visualization]
But I feel like I'm shooting in the dark at random. Let's approach this from the other direction:
Ideally, since I know very little about Grafana, I'd like to see mockups in the form of dashboard screenshots, and the dashboard configuration exported as JSON.
@tekert commented on GitHub (Jul 31, 2023):
Maybe using a push method from the server would be better: less strain on the server and clients, and clients being able to get short-lived durations reliably.
https://github.com/prometheus/client_python#exporting-to-a-pushgateway
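A minimal sketch of that approach using the client_python library linked above, assuming a Pushgateway running at localhost:9091; the metric name, job name, and the sleep stand-in are made up for illustration and are not existing healthchecks functionality:

```
# Push the duration of one job run to a Prometheus Pushgateway.
# Requires `pip install prometheus-client`.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge(
    "hc_check_last_duration_seconds",
    "Duration of the last run of this job, in seconds",
    registry=registry,
)

start = time.monotonic()
time.sleep(2)  # stand-in for the actual cron job / task
duration.set(time.monotonic() - start)

# One push per run; the gateway keeps the last value for Prometheus to scrape,
# so even very short-lived jobs show up reliably.
push_to_gateway("localhost:9091", job="backup_job", registry=registry)
```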
@tekert commented on GitHub (Jul 31, 2023):
I've made a proposal at https://github.com/healthchecks/healthchecks/issues/870 that might solve this problem: pushing metrics to our Prometheus servers using a webhook POST request.
For example:
curl -d 'hc_check_started{name="TESTNAME", tags="", unique_key="xxxxxxxxxxxxxxxxxx"} 1' -X POST 'http://IP:PORT/api/v1/import/prometheus/metrics/job/<jobname>/instance/<instance_name>'
...or whatever format the metric server's ingestion endpoint accepts.
I've just tested this locally, and we can get short-lived jobs to show up reliably in a time-series dashboard.
There is no need to stop scraping: we can scrape at 1-minute intervals and also get pushes for short-lived sequences on top.
@tekert commented on GitHub (Aug 2, 2023):
OK, after a lot of testing I've come to the conclusion that scraping the last runtime of the last job is the best way to spot trends, diagnose problems, set alerts, etc.
Scraping the started and stopped metrics works OK, until you want to see 9-second jobs over the past 24 hours (the lines just disappear), which makes trend analysis useless. I've set up multiple panels, each showing segments of up to 6 hours (so the pixels can actually be rendered), but if a job lasts 5s or 2s, the lines disappear (not enough pixel density to render the jobs at large timescales). Also, the lines become so thick that mousing over them to get durations is a pain. I haven't found a way to force short spikes of data to show in Grafana at large time ranges.

A better approach is to scrape a metric like hc_check_last_duration and form a time series from it (it doesn't matter if it repeats data); at least we can then measure past runtime data.
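Purely as a sketch of how that could then be used (the metric does not currently exist; the check name and threshold below are assumptions), a scraped hc_check_last_duration gauge would make both a trend panel and the simple alert mentioned earlier a one-line PromQL expression:

```
# Grafana panel query: last run duration of one check, as a time series
hc_check_last_duration{name="backup_job"}

# Alert expression: fire when the last run took longer than 5 minutes
hc_check_last_duration{name="backup_job"} > 300
```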