[GH-ISSUE #232] Cron monitoring via healthchecks on 5000+ servers #167

Closed
opened 2026-02-25 23:41:26 +03:00 by kerem · 6 comments

Originally created by @gganeshan on GitHub (Mar 20, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/232

@cuu508 thanks a lot for this amazing tool.

We have a use case where we want to run GOSS checks on 5000+ servers on a nightly basis via cron. The server list is not static and can change frequently when servers are decommissioned or provisioned.

I had a few questions for you to evaluate if healthchecks is the right option for us:

  • Do you think your tool can handle this many servers??
  • Server Registration: I think I should have a separate process to register all servers in the tool (via API) before the cron job runs - thoughts??
  • Server Discovery: In my cron I should first fetch the server specific URL (via API) and then send the ping to it - thoughts??
  • Does your tool support email distribution lists??
  • Is there a way to auto-provision email config per project in the tool??

Do any of your existing customers have a similar use case??
It would be interesting to know how they solve these issues.

kerem closed this issue 2026-02-25 23:41:26 +03:00

@abitlegacy commented on GitHub (Mar 21, 2019):

I'm not the developer - but I do host my own instance.

  • How many servers it can handle depends on the supporting infrastructure. It's just a website. 5000+ checks is easily done on a production-level web server. If you're planning on using the Django development server - you're going to have a bad time. I currently have over 7k checks - clustered PostgreSQL (Master/Slave) and 2 IIS Servers with wfastcgi - mostly for HA, not for load. No problems so far.
  • With that many servers - definitely utilize the API. I have most of my checks self-register and delete.
  • I save the CheckID to an environment variable within the server once it's registered. The API doesn't support retrieving a single check via Name - so you'll have to pull all the checks in the project and then parse them for the one you want. With 5000 checks - that would be a large strain on the database.
  • It supports multiple E-Mails as well as lists. It's just standard SMTP. If your Distribution list has an address - that would likely be easier to manage.
  • I don't fully understand this last question. Provisioning E-Mail configuration is done via local_settings.py. There's no other way to do it once the web application is up.

I have a similar use case - but my company is largely Windows based. I have two primary projects that most checks belong to - one is for data pulls and pushes. The other is for Windows DSC - whenever a server updates its configuration, it sends a POST with information regarding status and any changes it made.


@gganeshan commented on GitHub (Mar 21, 2019):

@djreynolds922 thanks a lot for your response.

I have a few follow-up questions 😄.

> I currently have over 7k checks - clustered PostgreSQL (Master/Slave) and 2 IIS Servers with wfastcgi - mostly for HA, not for load. No problems so far.

I see documentation on using MySQL or PostgreSQL as the backend, but how do you integrate healthchecks with a webserver of your choice?? I would love to use Caddy (https://caddyserver.com/) for my self-hosted instance.

> I have most of my checks self-register and delete.

The self-registration happens as a separate cron job, right??
If it is part of the same step as the healthcheck itself, then a cron job that never runs also never registers the server, which defeats the purpose.

> I save the CheckID to an environment variable within the server once it's registered.

By CheckID do you mean the unique ping address? Do you use puppet (or something similar) to set these environment variables??

> The API doesn't support retrieving a single check via Name

If I tag the checks with the server name, then I can retrieve the server-specific unique ping address using the API.
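
A minimal sketch of that tag lookup, assuming the "list checks" endpoint accepts a `tag` query parameter and returns a `checks` array whose entries carry a space-separated `tags` string and a `ping_url`. The base URL, the helper names, and the sample data are illustrative, not taken from this thread:

```python
import urllib.parse

# Assumed base URL of a self-hosted instance; replace with your own.
API_URL = "https://healthchecks.example.com/api/v1/checks/"


def checks_url_for_tag(tag):
    """Build the list-checks URL filtered by a single tag (hypothetical helper)."""
    return API_URL + "?" + urllib.parse.urlencode({"tag": tag})


def ping_url_from_listing(listing, hostname):
    """Return the ping URL of the first check tagged with hostname, else None."""
    for check in listing.get("checks", []):
        # Tags are assumed to come back as one space-separated string.
        if hostname in check.get("tags", "").split():
            return check.get("ping_url")
    return None
```

The request itself would be a GET on `checks_url_for_tag(hostname)` with the project's API key; the parsed JSON response is what `ping_url_from_listing` expects.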

> It supports multiple E-Mails as well as lists. It's just standard SMTP.

Cool. Multiple emails / DLs can only be added by sending an invite, right?? I wish we could add trusted email addresses somehow via the backend, especially on self-hosted instances.

> I don't fully understand this last question. Provisioning E-Mail configuration is done via local_settings.py.

I didn't mean SMTP config. I meant configuring trusted email/DL recipients for a specific project (similar to my point above). I don't like the current workflow of adding recipients only via invitation.


@gganeshan commented on GitHub (Mar 21, 2019):

> how do you integrate healthchecks with a webserver of your choice?? I would love to use caddy for my self-hosted instance.

Maybe I could just run healthchecks with uwsgi and have Caddy proxy to the uwsgi endpoint.
Reference: https://github.com/mholt/caddy/issues/176


@abitlegacy commented on GitHub (Mar 22, 2019):

Healthchecks is just a Django application - for Caddy it seems like the community favorite deployment is Caddy as a Reverse Proxy => Gunicorn. Here's an example on the Caddy GitHub repo: https://github.com/caddyserver/examples/tree/master/django - it took a bit for me to get everything working with IIS - but getting a *nix server in my environment would've been more difficult.

Self-Registration happens on initial provisioning - not within the scheduled jobs I'm monitoring. If I had a build server - I probably would have the build server register the expected servers.

CheckID is the unique ping address. Sorry - internally I call it a CheckID in my provisioning scripts.

That's true - but I would recommend looking at the UI - the checks page lists all the tags as filterable buttons. If you have 5k+ in there - I can't imagine it'll look good. You could always modify the templates yourself to work around it though.

For adding E-mail addresses - the only way I know how to do it is through the invite - although you could probably add them directly into the database as well. I just set up two distribution groups in our Exchange server and my E-Mails go to those (one for dev, one for prod). People can add and remove themselves from the distribution group, and I modified the E-Mail template to ensure no one accidentally clicks "No longer receive these".

As far as adding users to a project - it's definitely a pain. I'm pretty sure you can use any django backend you'd like though - so if you want to do ldap or similar I'm sure you could. It'd just require modifying some files yourself.


@cuu508 commented on GitHub (Mar 26, 2019):

On the hosted service, healthchecks.io, I'm currently using nginx as the reverse proxy (+ TLS termination, serving static files, rate limiting, etc.) and uwsgi for running the Django application.
I have used Caddy instead of nginx in the past as well and it worked well.
gunicorn would work as well, but I personally like uwsgi for its "Swiss army knife of web serving" aspect. It has chaotic documentation but tons of configuration options and features.

For provisioning, I recommend the same approach as @djreynolds922 is suggesting: create the check during provisioning, and probably ping it as well to kick off the timer. On each server, cache the ping URL in an environment variable (or a configuration file, or wherever is convenient). On the first run, when the ping URL is not yet set, retrieve it using the "Create a Check" API call and specifying the unique parameter. When using the unique parameter, that operation effectively becomes "get or create".
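
A rough sketch of that "get or create" call, assuming the Management API accepts a JSON POST at `/api/v1/checks/` authenticated with an `X-Api-Key` header and returns the check's `ping_url`. The instance URL, tags, timeout values, and function names below are illustrative assumptions, not values from this thread:

```python
import json
import urllib.request

# Assumed base URL of a self-hosted instance; replace with your own.
API_URL = "https://healthchecks.example.com/api/v1/checks/"


def build_create_request(api_key, hostname, timeout=86400, grace=3600):
    """Build a 'Create a Check' request that behaves as get-or-create on the name."""
    payload = {
        "name": hostname,
        "tags": "goss nightly",  # illustrative tags
        "timeout": timeout,      # expected period between pings, in seconds
        "grace": grace,          # grace time before alerting, in seconds
        "unique": ["name"],      # reuse an existing check with the same name
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def get_or_create_ping_url(api_key, hostname):
    """Send the request and return the ping URL of the new or existing check."""
    request = build_create_request(api_key, hostname)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["ping_url"]
```

A provisioning script could call `get_or_create_ping_url(...)` once and write the result to an environment file, so the nightly cron job only ever needs the cached URL.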

Monitoring 5000 servers (checks) should be no issue. One thing to watch out for is if all the cron jobs run at exactly the same time (and have nice, synchronized clocks). When the monitoring host gets 5000 requests at the exact same millisecond, depending on your kernel and web server configuration, you could see delays because of dropped and retried packets. If the pings are spaced out even a little bit, then this would not be a worry.
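
One way to space the pings out, sketched under the assumption that each server can sleep for a stable, hostname-derived delay before pinging. The 600-second window and the function name are arbitrary choices for illustration:

```python
import hashlib


def ping_delay_seconds(hostname, window=600):
    """Map a hostname to a stable delay in the range [0, window) seconds."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The first 8 bytes of the hash give a well-spread integer per host.
    return int.from_bytes(digest[:8], "big") % window
```

Each cron job would then call something like `time.sleep(ping_delay_seconds(socket.gethostname()))` before sending its ping. Because the delay is derived from the hostname rather than drawn at random, a given server pings at roughly the same offset every night, which keeps its period steady.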

On adding email addresses without the verification step: I guess in a self-hosted setting the verification is not much use and only gets in the way. I'll look into adding a USE_EMAIL_VERIFICATION=True/False configuration setting.


@gganeshan commented on GitHub (Apr 5, 2019):

Thank you @cuu508 and @djreynolds922 for your responses.
I really appreciate it.
