[GH-ISSUE #343] Managing healthchecks at scale #264

Closed
opened 2026-02-25 23:41:49 +03:00 by kerem · 8 comments

Originally created by @caleb15 on GitHub (Mar 15, 2020).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/343

The Problem

Currently we hardcode the ping URLs into the application code. This is fine for a single project. However, we are planning to create new environments, which means new projects for those environments, which means dozens of new ping URLs (89 and counting) that need to be hardcoded in. That would require manually editing dozens of separate files scattered throughout our project.

Solution A

My boss suggested scraping the URLs from the project and adding logic to conditionally check which one to use. I imagine the implementation would be a script that calls the API to get the healthchecks and formats them into a JSON dictionary indexed by project and name, plus logic in our project to grab the healthcheck with the matching name for the current environment.

Pseudocode:

blorp.py

class BlorpUsersEveryMonday(HealthCheckMixin):
  healthcheck_name = 'blorp'
  # job logic

healthcheck_mixin.py

import requests

healthcheck_by_project_name = {
  "production": {
    "blorp": "url"
  }
}

def ping(name):
  # `project` comes from the current environment's settings
  requests.get(healthcheck_by_project_name[project][name])

When adding a new environment, someone would run the copy script to duplicate the healthchecks into the new environment, run the format script to format the new checks into a JSON dict like "staging": {"blorp": "url"}, and finally add it to the variable. When adding a new healthcheck, the user could simply create the healthcheck manually and add it to the hardcoded dictionary by hand. The benefit of this approach is that it is relatively simple. The drawback is that it requires two custom scripts and still involves manual work, although less of it.*
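The scrape-and-format script in this workflow could look roughly like the sketch below. It assumes the Management API's documented `GET /api/v1/checks/` endpoint and `X-Api-Key` header; the helper names (`fetch_checks`, `index_by_name`) are illustrative, not part of any API.

```python
import json
import urllib.request

API_URL = "https://healthchecks.io/api/v1/checks/"  # or your self-hosted instance

def fetch_checks(api_key):
    # List all checks visible to this API key (one key per project)
    req = urllib.request.Request(API_URL, headers={"X-Api-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["checks"]

def index_by_name(checks):
    # Format the API response into a {name: ping_url} dictionary,
    # ready to be nested under a project/environment key.
    return {check["name"]: check["ping_url"] for check in checks}
```

Running this once per project and nesting each result under the project name would produce the hardcoded dictionary from the pseudocode.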

Solution B

Then I had a second idea: why not determine the healthcheck URL dynamically? Assuming names are unique (they are in our case), you could make an API call and easily get the ping URL for the name. For performance you could cache the API result, so in the vast majority of cases it would be almost as good as referencing a hardcoded URL. As an added benefit, you could automatically create the healthcheck if it doesn't exist, eliminating the need to manually create healthchecks and copy them over from project to project. The other benefit is that we wouldn't have to worry about access permissions to healthchecks.io - with checks being created automatically, engineering wouldn't need to ask devops to create them. The downside is that it relies on the API not having any breaking changes.

Pseudocode:

cache = {} # I could reuse our existing redis cache
endpoint = cache.get(project, {}).get(name)
if not endpoint:
  cache = get_healthchecks_from_api()
  endpoint = cache.get(project, {}).get(name)
  if not endpoint:
    check = create_healthcheck(name)
    endpoint = check.ping_url
    cache.setdefault(project, {})[name] = endpoint
ping(endpoint)

A rough working prototype can be found here: https://gist.github.com/caleb15/38eaa8e57bf6a1e567fe6e5348ab440d
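A testable version of that lookup-or-create flow can keep the API calls behind injected callables, so the caching logic itself is easy to unit-test. This is only a sketch; the parameterization and function names are mine, not from the prototype.

```python
def get_ping_url(project, name, cache, fetch_all, create_check):
    """Resolve a ping URL: cache first, then API refresh, then auto-create.

    cache        -- nested dict of {project: {name: ping_url}}
    fetch_all    -- callable returning the same nested dict from the API
    create_check -- callable(name) returning the new check's ping URL
    """
    url = cache.get(project, {}).get(name)
    if url is None:
        cache.update(fetch_all())  # refresh from the Management API
        url = cache.get(project, {}).get(name)
    if url is None:
        url = create_check(name)   # auto-provision a missing check
        cache.setdefault(project, {})[name] = url
    return url
```

With a real backend, `fetch_all` would hit the list-checks endpoint and `create_check` the create-check endpoint; here they are swappable for tests.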

Conclusion

Either way will take a lot of work to implement. I thought I'd check with you first to make sure I'm not missing something obvious. If I'm not, what do you think of the two approaches outlined above? Also, how reliable are the other API endpoints compared to the ping endpoint?


* Although now that I think of it, I could use Terraform to automate a good chunk of that: I could call a Terraform module for each environment to create the healthchecks, eliminating the need for a copy script, and then use local-exec (https://www.terraform.io/docs/provisioners/local-exec.html) to automatically scrape the healthchecks into a dictionary. The only manual work left would be creating the healthcheck in Terraform and editing the hardcoded dictionary.

kerem closed this issue 2026-02-25 23:41:49 +03:00

@cuu508 commented on GitHub (Mar 15, 2020):

Hi @caleb15, I've used / am using two different approaches myself.

The first one is similar to your solution A. I have a twelve-factor webapp that runs in a Heroku-like environment. There are a few separate deployments, for staging / production, and for different geographic regions. The app has 10 or so batch jobs that run regularly and use Healthchecks monitoring. I'm not hardcoding the ping URLs; instead I put them in environment variables. Each batch job knows which environment variable to look up to find its ping URL. Each deployment has its own set of environment variables. There is manual work when adding a new batch job or a new environment, but these are relatively rare events, so I'm OK with that.
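That environment-variable approach fits in a few lines. The variable naming convention below is just an illustration, not something Healthchecks prescribes:

```python
import os
import urllib.request

def ping_url_for(job_name):
    # Each batch job looks up its own variable, e.g. BLORP_PING_URL;
    # each deployment (staging, production, ...) sets its own values.
    return os.environ[f"{job_name.upper()}_PING_URL"]

def ping(job_name):
    # Signal success to Healthchecks after the job completes
    urllib.request.urlopen(ping_url_for(job_name), timeout=10)
```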

The other approach is similar to your solution B. I have a batch job that I want to be able to copy to any random host, add to the host's cron, and run without modification. The batch job auto-registers with Healthchecks.io -- it uses the host's hostname as the check name and creates the check if it does not exist yet. I recently added sample code for this in the docs: https://healthchecks.io/docs/bash/ ("Auto-provisioning New Checks").

Similarly, you could have a helper Python function that retrieves ping URLs and creates checks as needed. You could use a different API key per environment so the checks don't clash. A downside to this approach is that a read-write API key needs to be distributed with your code or configuration.
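Such a helper might use the documented "unique" parameter of the Create Check API call, which makes provisioning idempotent (an existing check with the same name is updated rather than duplicated). A sketch, with the payload-building split out so it can be tested without network access; the schedule and timezone values are placeholders:

```python
import json
import urllib.request

API_URL = "https://healthchecks.io/api/v1/checks/"

def provision_payload(name, schedule="*/5 * * * *", tz="UTC"):
    # "unique": ["name"] tells the API to match on the name field and
    # update an existing check instead of creating a duplicate.
    return {"name": name, "schedule": schedule, "tz": tz, "unique": ["name"]}

def get_or_create_ping_url(name, api_key):
    data = json.dumps(provision_payload(name)).encode()
    req = urllib.request.Request(  # providing data makes this a POST
        API_URL,
        data=data,
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["ping_url"]
```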

> Also, how reliable are the other API endpoints compared to the ping API endpoint?

In terms of backwards compatibility – the plan is to:

  • make sure the changes are backwards compatible (i.e., adding new request parameters, and adding new response fields, but not changing or removing the existing ones)
  • or, if that is not possible, increase the API version
  • or, if that is not possible (for example, if I need to hide information from API responses, then removing it from /api/v2/ responses is no good if users can still use /api/v1/ ...), inform the API users in advance

@caleb15 commented on GitHub (Apr 22, 2020):

I've adapted the sample you provided to work with any check and to use the /start and /fail endpoints.
It's been working great in lower environments - all of our healthchecks are created dynamically through the API without any manual action on our part. 😄 Thanks!

https://gist.github.com/caleb15/1a817ef5e58e8a8caf65190cff33806e
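The /start and /fail adaptation boils down to a small wrapper around each job. In the sketch below the actual HTTP call is injected as send(suffix) so the control flow is testable; that parameterization is mine, not the gist's. With a real check, send could be `lambda s: urllib.request.urlopen(ping_url + s, timeout=10)`.

```python
def run_with_healthcheck(job, send):
    """Run a job, signaling the check's /start, /fail, and success endpoints.

    send -- callable taking a URL suffix: "/start", "/fail",
            or "" for a plain success ping.
    """
    send("/start")     # lets Healthchecks measure run time and catch hangs
    try:
        result = job()
    except Exception:
        send("/fail")  # explicit failure signal flips the check to "down"
        raise
    send("")           # plain ping = success
    return result
```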

I did have one small problem I wanted your input on - we had a couple of every-minute checks that were spamming us with alerts. Ordinarily we would just pause the checks while we work on a fix. However, the API overwrites any manual changes to the check, including the pause, making the check active again and continuing the alert spam.

I can understand why a success ping of an endpoint would resume the check; theoretically that means the check is good again. But I'm a bit surprised that /fail and check updates (via the "Create a Check" unique field) also resume the check. The documentation for the check update API endpoint states that "If any parameter is omitted, its value is left unchanged."

As for /fail, it makes sense that it updates the status to "down", but at the same time this is undesirable behavior if you already know the check is down and want to stop receiving alerts while you fix the job.

What would make sense to me is for healthchecks to have a separate boolean field, "paused". Pausing is probably an intentional user action and as such should be ended intentionally, whereas healthcheck ping/start/stop is probably automated. They are two different categories.

If that idea is undesirable for whatever reason, is it possible to fix the create check API endpoint so that it doesn't update the status? I could then skip the /fail ping if the check is paused. If you're willing to change the behavior of /fail too, that would be great, but I can understand keeping it the same for backwards compatibility.


@cuu508 commented on GitHub (Apr 23, 2020):

Great! I'll think about this some more, but want to comment on one detail for now:

> However, the API overwrites any manual changes to the check, including overwriting the pause, making the check active again, and continuing the alert spam.

That's unexpected. Updating a check via API should not touch its status. Looking at h.sh, after the API call it immediately sends a signal to the /start endpoint. Perhaps that is what is affecting the status?


@caleb15 commented on GitHub (Apr 24, 2020):

My bad, sorry. You're right, updating the check does not touch the status. I tested it by executing a modified h.sh, but I was accidentally executing /usr/local/bin/h.sh instead of the modified h.sh in my local directory *facepalm*

In that case I could skip the /start, ping, and /fail calls if the check is paused. Preferably they would not change the status, but skipping them would suffice as a workaround. Let me know what you decide :)
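The skip-when-paused workaround could key off the status field that the list-checks API returns for each check (where "paused" is one of the documented status values). A minimal sketch:

```python
def should_signal(check):
    # A manually paused check should stay paused, so suppress
    # /start, /fail, and success pings while its status is "paused".
    return check.get("status") != "paused"
```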


@cuu508 commented on GitHub (Apr 27, 2020):

When I added the "paused" status, the use case I had in mind was: sometimes a check will be "in a prolonged maintenance". Technically down, but acknowledged and understood, so to speak. In these cases, we don't want to show the red down icons in dashboards and in email reports. We don't want users to develop notification blindness ("something's red? that's fine, something's always red").

I had not thought about the case where the check is flapping between up and down. In this case indeed it would make sense to have a separate, overriding "is_paused" flag. But I think it's too late to change it now. By now there are users who rely on the "pinging un-pauses the check" behavior.

> In that case I could skip the /start, ping, and /fail checks if the check is paused.

Another thing you can do is toggle off the notifications. This way, if a check is flapping between up and down states, at least you will not be spammed with notifications.


@caleb15 commented on GitHub (Apr 27, 2020):

> But I think it's too late to change it now. By now there are users who rely on the "pinging un-pauses the check" behavior.

That's too bad. I suppose I'll update the script then.

> Another thing you can do is toggle off the notifications.

The integrations are automatically configured in the api/v1/checks/ call - if one is toggled off manually, it would just be enabled again on the next API call. So unfortunately it's not possible on a per-check basis with the automatic setup.


@cuu508 commented on GitHub (May 8, 2020):

I filed a separate issue (#369) with a potential solution to paused checks getting unpaused -- feedback welcome!

Also, I'd like each issue to be focused on one specific thing. @caleb15 would you be OK with closing this one, and filing any remaining questions / feature requests as separate tickets?


@caleb15 commented on GitHub (May 8, 2020):

Sure
