[GH-ISSUE #212] Robustness of Alerts #152

Closed
opened 2026-02-25 23:41:22 +03:00 by kerem · 5 comments

Originally created by @muff1nman on GitHub (Jan 16, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/212

Would you be open to adding logic to ensure alerts are processed correctly? I.e., say a check goes down and an alert send is attempted, but the alert fails because of external factors. Currently the code seems to count that alert as processed, and no more alerts are sent until the check comes back up and goes down again. I am considering creating a PR to change this behavior for the subset of integration types that can handle duplicate alerts (e.g. PagerDuty). The idea would be to continue sending alerts until the check comes back up. Any concerns about such a change?

kerem closed this issue 2026-02-25 23:41:22 +03:00

@cuu508 commented on GitHub (Jan 17, 2019):

@muff1nman what specific scenarios and external factors are you thinking of?

I'm thinking of three classes of problems:

  1. `manage.py sendalerts` gets an unexpected HTTP status code or a timeout back when sending an alert. When this happens, it retries 3 times with no delay. If still no success, it logs an error.

  2. `manage.py sendalerts` gets interrupted while sending an alert. It could be a crash due to a code bug, the system killing the process, or the whole system freezing or crashing. When this happens, *no retries are made, and no error is logged*.

  3. `manage.py sendalerts` delivers the alert payload to the remote system (e.g., it gets a 200 OK back from the PagerDuty API), but the alert is lost further down the chain.

I'd like Healthchecks to handle the second case better. It's been at the back of my mind, but I have not yet tackled it.

The first case I'm not sure about. It would be nice to have exponential backoff and more than 3 retries for webhook integrations, but we also don't want 1) races between several notifications about the same check, 2) too many spawned threads, 3) the sendalerts process clogging up.

The third case I'd like to leave alone. Once the alert is handed over and the remote system has acknowledged it, our job is done.
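The "retries 3 times with no delay" behavior from case 1 can be sketched as follows. This is an illustrative simplification, not the actual healthchecks code; the `deliver_with_retries` name and `send` callable are hypothetical:

```python
import logging

def deliver_with_retries(send, tries=3):
    """Naive retry: call send() up to `tries` times with no delay in
    between; log an error and give up if every attempt fails."""
    for _ in range(tries):
        try:
            if send():
                return True
        except Exception:
            pass  # swallow the failure and retry immediately
    logging.error("alert delivery failed after %d tries", tries)
    return False
```

Note there is no sleep between attempts, so a network outage lasting more than a few seconds exhausts all retries at once.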


@muff1nman commented on GitHub (Jan 17, 2019):

I'm mainly concerned about scenario 1. Basically, because of external network issues, I see many alerts to PagerDuty being missed. Could you point me to the 3-retry logic? I missed it when I read through the sendalerts logic and the `notify()` functions in `transports.py`.

But yes I think adding some delay between retries (potentially with backoff) would immensely help for the scenario I'm concerned about.

For scenario number 2, I think this would be best handled via database transactions and row locking. Both MySQL and PostgreSQL, the two databases Healthchecks targets, support it. I believe this is the relevant Django doc: https://docs.djangoproject.com/en/1.11/ref/models/querysets/#select-for-update
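The claim-a-row idea behind `select_for_update()` can be shown with plain SQL. The sketch below uses a compare-and-set `UPDATE` instead of an explicit row lock so it runs on SQLite (which lacks `SELECT ... FOR UPDATE`); the `flips` table and `claim_pending_flip` name are hypothetical, not the healthchecks schema:

```python
import sqlite3

def claim_pending_flip(conn):
    """Atomically claim one pending notification so two sendalerts
    workers never both deliver it. The conditional UPDATE is the
    compare-and-set; Django's select_for_update() achieves the same
    effect with an explicit row lock inside a transaction."""
    with conn:  # one transaction per claim attempt
        row = conn.execute(
            "SELECT id FROM flips WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing to send
        cur = conn.execute(
            "UPDATE flips SET status = 'sending' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        # rowcount == 0 means another worker claimed it first
        return row[0] if cur.rowcount == 1 else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flips (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO flips (status) VALUES ('pending')")
```

A second call to `claim_pending_flip` on the same row returns `None`, which is exactly the duplicate-delivery protection being discussed.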


@muff1nman commented on GitHub (Jan 17, 2019):

Some thoughts on your concerns around addressing scenario 1:

> 1) races between several notifications about the same check

This would be directly solved by transactions and row locking.

> 2) too many spawned threads

A thread pool would probably help in this regard.

> 3) the sendalerts process clogging up

I think a generous backoff would help here, i.e. wait 1 min after the first failure, then 5 min, then 30 min, then 6 hours, etc.
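The proposed schedule is a table lookup rather than a formula. A minimal sketch, assuming the delays suggested above (the `SCHEDULE` constant and `retry_delay` helper are hypothetical names, and the jitter is an extra assumption to avoid synchronized retries):

```python
import random

# "Generous backoff" schedule from the comment above:
# 1 min, 5 min, 30 min, then 6 hours for every later attempt.
SCHEDULE = [60, 5 * 60, 30 * 60, 6 * 60 * 60]

def retry_delay(attempt, jitter=0.1):
    """Seconds to sleep before retry number `attempt` (1-based),
    capped at the last schedule entry, with a little random jitter
    so many failed alerts don't all retry at the same instant."""
    base = SCHEDULE[min(attempt, len(SCHEDULE)) - 1]
    return base * (1 + random.uniform(-jitter, jitter))
```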


@cuu508 commented on GitHub (Jan 22, 2019):

The retry logic is in `transports.py`, in the `HttpTransport` class. See the `get`, `post` and `put` methods.

I was thinking about races where two notifications for the same check are delivered in the wrong order. This is more likely to happen when the retry delay is minutes or hours. It can be worked around (a later notification could cancel any earlier unsent notifications for the same check), just needs careful planning.

> I'm mainly concerned about scenario 1. Basically, because of external network issues, I see many alerts to PagerDuty being missed.

To make sure I understand the issue: the self-hosted Healthchecks instance sometimes has network connectivity problems and cannot reach the PagerDuty API?


@cuu508 commented on GitHub (Jul 20, 2019):

In practice, the naive "try 3 times, then give up" approach seems to work reasonably well.

Exponential backoff could be problematic, as the notifications need to be timely. A "down" notification that arrives an hour late might just cause extra confusion.

I might reconsider this in the future, but for now I'm inclined to keep the notification-sending logic as-is.
