[GH-ISSUE #392] Expose job's last duration in Prometheus metrics? #296

Open
opened 2026-02-25 23:41:56 +03:00 by kerem · 8 comments

Originally created by @cuu508 on GitHub (Jul 2, 2020).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/392

Prometheus users – would it be useful to have access to a check's last duration in Prometheus?

Currently, for every check we report its up/down state:

```
# HELP hc_check_up Whether the check is currently up (1 for yes, 0 for no).
# TYPE hc_check_up gauge
hc_check_up{name="Daily Backup", tags="production", unique_key="c00108e230add54da12bec48d008476dd6efc8"} 0
hc_check_up{name="Playground Backup", tags="testing", unique_key="be4801d247858d69565fcd7c2568a1724e000a"} 1
```

To start a discussion, we could for example report last duration like so:

```
# HELP hc_check_last_duration Last duration, number of seconds.
# TYPE hc_check_last_duration gauge
hc_check_last_duration{name="Daily Backup", tags="production", unique_key="c00108e230add54da12bec48d008476dd6efc8"} 14.5
hc_check_last_duration{name="Playground Backup", tags="testing", unique_key="be4801d247858d69565fcd7c2568a1724e000a"} 2.1
```
  1. First and foremost – would this be useful?
  2. If yes, how would you use/visualize it (show the last value, show a graph)?
  3. If yes, should the last ping date also be included? So the dashboard can show information like "last ran X hours ago and ran for Y seconds".
  4. Any thoughts about the gauge datatype and the `hc_check_last_duration` naming? In the above example I went with the gauge type because:

> When implementing a non-trivial custom metrics collector, it is advised to export a gauge for how long the collection took in seconds and another for the number of errors encountered.
>
> **This is one of the two cases when it is okay to export a duration as a gauge rather than a summary or a histogram, the other being batch job durations.** This is because both represent information about that particular push/scrape, rather than tracking multiple durations over time.
> (https://prometheus.io/docs/practices/instrumentation/#collectors)
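As a sketch of what the proposed metric would look like on the wire, the lines above can be rendered in the Prometheus text exposition format. The check data below is the sample data from this issue; the `render_last_duration` helper is hypothetical, not the actual exporter code in healthchecks:

```python
# Sketch: rendering the proposed hc_check_last_duration gauge in the
# Prometheus text exposition format. The helper and the sample data are
# illustrative; real values would come from each check's last ping.

def render_last_duration(checks):
    """Return exposition-format lines for the proposed gauge."""
    lines = [
        "# HELP hc_check_last_duration Last duration, number of seconds.",
        "# TYPE hc_check_last_duration gauge",
    ]
    for c in checks:
        labels = 'name="%s", tags="%s", unique_key="%s"' % (
            c["name"], c["tags"], c["unique_key"])
        lines.append("hc_check_last_duration{%s} %s" % (labels, c["duration"]))
    return "\n".join(lines)

checks = [
    {"name": "Daily Backup", "tags": "production",
     "unique_key": "c00108e230add54da12bec48d008476dd6efc8", "duration": 14.5},
    {"name": "Playground Backup", "tags": "testing",
     "unique_key": "be4801d247858d69565fcd7c2568a1724e000a", "duration": 2.1},
]
print(render_last_duration(checks))
```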


@SuperSandro2000 commented on GitHub (Jul 2, 2020):

> First and foremost – would this be useful?

You could make a time-series graph out of this in Grafana, monitor whether a job took unusually long, and act based on that. Maybe set up a little alert if it is higher than x.

Or could you? I'm not sure right now. If I only get the latest duration, I don't think it is useful, because I would want to graph it over time to notice spikes, maybe correlated with some other change in the system.
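The "alert if it is higher than x" idea could be written as a Prometheus alerting rule, assuming the proposed `hc_check_last_duration` metric exists. A minimal sketch; the rule name, 300-second threshold, and labels are all made up for illustration:

```yaml
groups:
  - name: healthchecks
    rules:
      - alert: JobRanTooLong
        # Hypothetical: fires when a check's last reported duration
        # exceeds 5 minutes. The metric name is the one proposed above.
        expr: hc_check_last_duration > 300
        labels:
          severity: warning
        annotations:
          summary: 'Check {{ $labels.name }} ran for {{ $value }}s'
```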


@fcharlier commented on GitHub (Aug 4, 2020):

I was looking at getting this information to keep a longer history of run times for my tasks, so:

  1. yes it would be useful
  2. I'd show it both as a graph or the last value, both are valid uses
  3. the corresponding ping timestamp should be included
  4. gauge looks fine to me for this metric.

@mbaldini1 commented on GitHub (Aug 26, 2020):

  1. yes, it would be useful, because a backup that is taking longer than usual often means there is some problem
  2. I'd use it in a graph from a time series
  3. the last ping time could be useful to have in Prometheus, so from the Grafana dashboard one can see when the last correct backup was made.
  4. I think gauge is ok

In general I think that every piece of useful information should be shown; some info will be useful to some users and not to others, but it's better to have more information than less.


@xLogiiCx commented on GitHub (Oct 4, 2022):

Any Updates on this? I would love to have a Job duration Grafana Dashboard...


@cuu508 commented on GitHub (Dec 19, 2022):

I did some experimenting. I added a `hc_check_last_duration` metric and tried to make visualizations in Grafana for it. Here's a time series:

![image](https://user-images.githubusercontent.com/661859/208414561-23c3110f-7283-44e7-9a5e-b807bb7c0ac5.png)

My problem with this: if the job runs (say) once per hour, but the time series shows a data point (say) every minute, there will be 59 duplicated data points showing the same duration. In the above screenshot, see the horizontal yellow line. It shows a single measurement that did not change over the displayed time period, because the job did not run in that period. For the orange line there are also a few horizontal sections where the same value is displayed 2-3 times in a row.

Here's a different visualization showing just the most recent values:

![image](https://user-images.githubusercontent.com/661859/208415520-486281ac-d63c-48f2-a3b5-c279e82063aa.png)

And another idea I had was to use the already existing `hc_check_started` metric to make a visualization similar to what's suggested in #443:

![image](https://user-images.githubusercontent.com/661859/208416081-12f1cbd0-6c7c-43fc-bd95-9f10e72fb419.png)

But I feel like I'm shooting in the dark. Let's approach this from the other direction:

  • What visualizations do we want?
  • Consequently, what metrics do we need to produce to make those visualizations possible?

Ideally, since I know very little Grafana, I'd like to see mockups in the form of dashboard screenshots and dashboard configuration exported as JSON.


@tekert commented on GitHub (Jul 31, 2023):

Maybe using a push method from the server would be better: less strain on server and clients, and clients would be able to report short-lived durations reliably.
https://github.com/prometheus/client_python#exporting-to-a-pushgateway
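The Pushgateway linked above accepts the plain text exposition format over HTTP, so the push idea can be sketched with just the standard library. A minimal, untested sketch: the gateway URL and job name are placeholders, and in practice `client_python`'s `push_to_gateway()` would do this for you:

```python
# Sketch: pushing a duration sample to a Prometheus Pushgateway using
# only the standard library. Gateway URL and job name are placeholders.
import urllib.request

def build_payload(duration_seconds):
    # The Pushgateway expects the text exposition format,
    # newline-terminated.
    return (
        "# TYPE hc_check_last_duration gauge\n"
        "hc_check_last_duration %s\n" % duration_seconds
    )

def push(gateway_url, job, payload):
    # PUT to /metrics/job/<job>; grouping labels go in the URL path.
    req = urllib.request.Request(
        "%s/metrics/job/%s" % (gateway_url, job),
        data=payload.encode("utf-8"),
        method="PUT",
    )
    return urllib.request.urlopen(req)

payload = build_payload(14.5)
print(payload)
# push("http://localhost:9091", "daily_backup", payload)  # needs a running gateway
```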


@tekert commented on GitHub (Jul 31, 2023):

I've made a proposal at https://github.com/healthchecks/healthchecks/issues/870 that might solve this problem: push metrics to our Prometheus servers using a webhook POST request. For example:

`curl -d 'hc_check_started{name="TESTNAME", tags="", unique_key="xxxxxxxxxxxxxxxxxx"} 1' -X POST 'http://IP:PORT/api/v1/import/prometheus/metrics/job/<jobname>/instance/<instance_name>'`

or whatever format the metric server's ingestion accepts. I've just tested this locally, and we can get short-lived jobs to show reliably in a time-series dashboard.

There is no need to stop scraping; we can scrape at 1-minute intervals and also get pushes for short-lived sequences on top.


@tekert commented on GitHub (Aug 2, 2023):

Ok, after a lot of testing, I've come to the conclusion that scraping the runtime of the last job is the best way to spot trends, diagnose problems, set alerts, etc.

Scraping start and stop metrics works ok, until you want to see 9-second jobs over the past 24 hours (the lines just disappear), making trend analysis useless. I've set up multiple panels, each showing a segment of up to 6 hours (so the pixels can render), but if a job lasts 5 s or 2 s, the lines still disappear (not enough pixel density to render the jobs at large timescales); also, the lines become so thick that mousing over them to get durations is a pain. I haven't found a way to force short spikes of data to show in Grafana at large time ranges.

![healthchecks-io-tek](https://github.com/healthchecks/healthchecks/assets/9638444/1f5950e5-7e0d-4b27-a973-5a949b94efc5)

A better approach is to scrape a metric like `hc_check_last_duration` and form a time series from it (it doesn't matter if data repeats); at least we can then measure past runtime data.