[GH-ISSUE #343] Managing healthchecks at scale #264
Originally created by @caleb15 on GitHub (Mar 15, 2020).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/343
The Problem
Currently we hardcode the ping URLs into the application code. This is fine for a single project. However, we are planning on creating new environments, which means new projects for those environments, which means dozens of new ping URLs (89 and counting) that would need to be hardcoded. Hardcoding them would require manually editing dozens of separate files scattered throughout our project.
Solution A
My boss suggested scraping the URLs off the project and adding logic to conditionally check which one to use. I imagine implementing this would mean writing a script that calls the API to get the healthchecks, formats them into a JSON dictionary indexed by project and name, and then adding logic to our project to grab the healthcheck with the associated name for the current environment.
Pseudocode:
blorp.py
healthcheck_mixin.py
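A minimal, illustrative sketch of what healthcheck_mixin.py might contain under this approach; the environment names, check names, and URLs below are placeholders, not the original pseudocode:

```python
# healthcheck_mixin.py (illustrative sketch)
import os

import requests

# Output of the hypothetical format script: ping URLs scraped from the API,
# indexed by environment and then by check name. Values are placeholders.
HEALTHCHECK_URLS = {
    "staging": {"blorp": "https://hc-ping.com/<staging-blorp-uuid>"},
    "production": {"blorp": "https://hc-ping.com/<production-blorp-uuid>"},
}


class HealthcheckMixin:
    """Jobs set healthcheck_name and call ping_healthcheck() when they finish."""

    healthcheck_name = None

    def ping_healthcheck(self):
        env = os.environ.get("APP_ENV", "staging")  # current environment
        url = HEALTHCHECK_URLS[env][self.healthcheck_name]
        requests.get(url, timeout=10)
```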
When adding a new environment, someone would run the copy script to duplicate the healthchecks to the new environment, run the format script to turn the new checks into a JSON dict like "staging": {"blorp": "url"}, and finally add it to the variable. When adding a new healthcheck, the user could simply create the healthcheck manually and add it to the hardcoded dictionary. The benefit of this approach is that it is somewhat simple. The drawback is that it requires two custom scripts and still requires manual work, although less of it.*
Solution B
Then I had a second idea: why not determine the healthcheck URL dynamically? Assuming names are unique (they are in our case), you could do an API call and easily get the ping URL for the name. For performance you could cache the API result, so in the vast majority of cases it would be almost as good as referencing a hardcoded URL. As an added benefit, you could automatically create the healthcheck if it doesn't exist, eliminating the need to manually create healthchecks and copy them over from project to project. The other benefit is that we wouldn't have to worry about access permissions to healthchecks.io - with the checks being created automatically there wouldn't be a need for engineering to ask devops to create the check. The downside is that it relies on the API not having any breaking changes.
Pseudocode:
A rough working prototype can be found here.
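For illustration, a rough sketch of the dynamic approach, assuming the Management API's "Create a Check" endpoint with the unique field and a read-write API key in an environment variable (the function and variable names are made up):

```python
import functools
import os

import requests

API_URL = "https://healthchecks.io/api/v1/checks/"


@functools.lru_cache(maxsize=None)
def ping_url_for(check_name):
    """Return the ping URL for check_name, creating the check if needed.

    The result is cached per process, so most pings cost no extra API call.
    """
    resp = requests.post(
        API_URL,
        headers={"X-Api-Key": os.environ["HEALTHCHECKS_API_KEY"]},
        # "unique": ["name"] makes the call idempotent: an existing check
        # with this name is returned instead of a duplicate being created.
        json={"name": check_name, "unique": ["name"]},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["ping_url"]


def ping(check_name):
    requests.get(ping_url_for(check_name), timeout=10)
```

Using a different API key per environment would keep staging and production checks in separate projects, so names can't clash.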
Conclusion
Either way will take a lot of work to implement. I thought I'd check with you first to make sure I'm not missing something obvious. If I'm not, what do you think of the two approaches I outlined above? Also, how reliable are the other API endpoints compared to the ping API endpoint?
* Although now that I think of it, I could use Terraform to automate a good chunk of that - I could have a Terraform module I call for each environment to create the healthchecks, eliminating the need for a copy script, and then use local-exec to automatically scrape the healthchecks into a dictionary. The only manual work left would be creating the healthcheck in Terraform and editing the hardcoded dictionary.
@cuu508 commented on GitHub (Mar 15, 2020):
Hi @caleb15, I've used / am using two different approaches myself.
The first one is similar to your solution A. I have a twelve-factor webapp that runs in a Heroku-like environment. There are a few separate deployments, for staging / production, and for different geographic regions. The app has 10 or so batch jobs that run regularly and use Healthchecks monitoring. I'm not hardcoding the ping URLs, instead I'm putting them in the environment variables. Each batch job knows which environment variable it needs to look up to find the ping URL. Each deployment has its own set of environment variables. There is manual work when adding a new batch job, or when adding a new environment. But these are relatively rare events and so I'm OK with that.
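Roughly, each batch job then needs only something like this (the variable name below is illustrative):

```python
import os

import requests

# Each deployment (staging, production, per region) sets its own value for
# this variable; the batch job only knows which variable to read.
ping_url = os.environ["CLEANUP_JOB_PING_URL"]
requests.get(ping_url, timeout=10)  # signal success after the job completes
```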
The other approach is similar to your solution B. I have a batch job that I want to be able to copy to any random host, add to the host's cron, and run without modification. The batch job auto-registers with Healthchecks.io -- it uses the host's hostname as the check name, and it creates the check if it does not exist yet. I recently added sample code for this in the docs: https://healthchecks.io/docs/bash/ ("Auto-provisioning New Checks").
Similarly, you could have a helper python function which retrieves ping URLs, and creates checks as needed. And you could just use a different API key per environment so the checks don't clash. A downside to this approach is that a read-write API key needs to be distributed with your code or configuration.
In terms of backwards compatibility – the plan is to:
@caleb15 commented on GitHub (Apr 22, 2020):
I've adapted the sample you provided to work with any check and to use the /start and /fail endpoints.
It's been working great in lower environments - all of our healthchecks created dynamically through API without any manual actions on our parts. 😄 Thanks!
https://gist.github.com/caleb15/1a817ef5e58e8a8caf65190cff33806e
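The shape of that adaptation is roughly the following (a sketch, not the gist itself): signal /start before the job runs, ping on success, and ping /fail on an exception.

```python
import requests


def run_with_healthcheck(ping_url, job):
    """Wrap an arbitrary job with /start, success, and /fail signals."""
    requests.get(ping_url + "/start", timeout=10)
    try:
        job()
    except Exception:
        requests.get(ping_url + "/fail", timeout=10)
        raise
    requests.get(ping_url, timeout=10)
```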
I did have one small problem I wanted to get your input on - we had a couple of every-minute checks that were spamming us with alerts. Ordinarily we would just pause the checks while we work on a fix. However, the API overwrites any manual changes to the check, including the pause, making the check active again and continuing the alert spam.
I can understand why a success or a ping of an endpoint would resume the check. Theoretically speaking, that means the check is good again. But I'm a bit surprised that /fail and check updates (via the "Create a Check" unique field) resume the check. The documentation for the check update API endpoint states that "If any parameter is omitted, its value is left unchanged."
As for /fail, it makes sense that it updates the status to "down" but at the same time this is undesirable behavior if you already know the check is down and want to stop receiving alerts from it while you fix the job.
What would make sense to me is for healthchecks to have a separate boolean field, "paused". Pausing is probably an intentional user action and as such should be ended intentionally, while healthcheck ping/start/stop is probably automated. They are two different categories.
If that idea is undesirable for whatever reason, is it possible for the create check API endpoint to be fixed such that it doesn't update the status? I could then skip the /fail ping if the check is paused. If you're willing to change the behavior of /fail too, that would be great, but I can understand keeping it the same for backwards compatibility reasons.
@cuu508 commented on GitHub (Apr 23, 2020):
Great! I'll think about this some more, but want to comment on one detail for now:
That's unexpected. Updating a check via API should not touch its status. Looking at h.sh, after the API call it immediately sends a signal to the /start endpoint. Perhaps that is what is affecting the status?
@caleb15 commented on GitHub (Apr 24, 2020):
My bad, sorry. You're right, updating the check does not touch the status. I tested it by executing a modified h.sh but I was accidentally executing /usr/local/bin/h.sh instead of the modified h.sh in my local directory *facepalm*
In that case I could skip the /start, ping, and /fail signals if the check is paused. Preferably they would not change the status, but skipping them would suffice as a workaround. Let me know what you decide :)
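A sketch of that workaround, assuming the check's current status is read back through the Management API before sending any signal (function and variable names are illustrative):

```python
import os

import requests

API_URL = "https://healthchecks.io/api/v1/checks/"


def send_signal(check_name, signal=""):
    """Ping a check ("" for success, "/start" or "/fail"), unless it is paused."""
    headers = {"X-Api-Key": os.environ["HEALTHCHECKS_API_KEY"]}
    checks = requests.get(API_URL, headers=headers, timeout=10).json()["checks"]
    check = next(c for c in checks if c["name"] == check_name)
    if check["status"] == "paused":
        return  # leave a manually paused check alone
    requests.get(check["ping_url"] + signal, timeout=10)
```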
@cuu508 commented on GitHub (Apr 27, 2020):
When I added the "paused" status, the use case I had in mind was: sometimes a check will be "in a prolonged maintenance". Technically down, but acknowledged and understood, so to speak. In these cases, we don't want to show the red down icons in dashboards and in email reports. We don't want users to develop notification blindness ("something's red? that's fine, something's always red").
I had not thought about the case where the check is flapping between up and down. In this case indeed it would make sense to have a separate, overriding "is_paused" flag. But I think it's too late to change it now. By now there are users who rely on the "pinging un-pauses the check" behavior.
Another thing you can do is toggle off the notifications. This way, if a check is flapping between up and down states, at least you will not be spammed with notifications.
@caleb15 commented on GitHub (Apr 27, 2020):
That's too bad. I suppose I'll update the script then.
The integrations are automatically configured in the api/v1/checks/ call - if it's toggled off manually it would just be enabled again on the next API call. So it's not possible on a per-check basis with the automatic setup, unfortunately.
@cuu508 commented on GitHub (May 8, 2020):
I filed a separate issue (#369) with a potential solution to paused checks getting unpaused -- feedback welcome!
Also, I'd like each issue to be focused on one specific thing. @caleb15 would you be OK with closing this one, and filing any remaining questions / feature requests as separate tickets?
@caleb15 commented on GitHub (May 8, 2020):
Sure