mirror of
https://github.com/healthchecks/healthchecks.git
synced 2026-04-25 23:15:49 +03:00
[GH-ISSUE #212] Robustness of Alerts #152
Originally created by @muff1nman on GitHub (Jan 16, 2019).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/212
Would you be open to adding logic to ensure alerts are processed reliably? I.e. say a check goes down and an alert send is attempted, but the send fails because of external factors. Currently the code seems to count that alert as processed, and no more alerts are sent until the check comes back up and goes down again. I am considering creating a PR to change this behavior for the subset of integration types that can handle duplicated alerts (e.g. PagerDuty). The idea would be to keep sending alerts until the check comes back up. Any concerns about such a change?
@cuu508 commented on GitHub (Jan 17, 2019):
@muff1nman what specific scenarios and external factors are you thinking of?
I'm thinking of three classes of problems:
1. `manage.py sendalerts` gets an unexpected HTTP status code or a timeout when sending an alert. When this happens, it retries 3 times with no delay. If there is still no success, it logs an error.
2. `manage.py sendalerts` gets interrupted while sending an alert. This could be a crash due to a code bug, the system killing the process, or the whole system freezing or crashing. When this happens, no retries are made and no error is logged.
3. `manage.py sendalerts` delivers the alert payload to the remote system (e.g., it gets a 200 OK back from the PagerDuty API), but the alert is lost further down the chain.

I'd like Healthchecks to have better handling of the second case. It's been at the back of my mind, but I have not tackled it yet.
The first case I'm not sure about. It would be nice to have things like exponential backoff and more than 3 retries for webhook integrations, but we also don't want 1) races between several notifications about the same check, 2) too many spawned threads, or 3) the sendalerts process clogging up.
The third case I'd like to leave alone. Once the alert is handed over and the remote system has acknowledged it, our job is done.
@muff1nman commented on GitHub (Jan 17, 2019):
I'm mainly concerned about scenario 1. Basically, because of external network issues, I see many alerts to PagerDuty going missing. Could you point me to the 3-retry logic? I missed that when I read through the sendalerts logic and the `notify()` functions in `transports.py`.

But yes, I think adding some delay between retries (potentially with backoff) would immensely help with the scenario I'm concerned about.
For scenario number 2 - I think this would be best handled via database transactions and row locking. Both are supported by MySQL and Postgres, the two databases that Healthchecks targets. I believe this is the relevant Django doc: https://docs.djangoproject.com/en/1.11/ref/models/querysets/#select-for-update
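The claim-then-send idea behind transactions and row locking can be sketched without a full Django project. The snippet below is purely illustrative, not Healthchecks code: it uses plain sqlite3 with a made-up `alerts` table, where `BEGIN IMMEDIATE` stands in for the row locks that `select_for_update()` would take on Postgres/MySQL.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so we control
# transactions explicitly with BEGIN IMMEDIATE / COMMIT / ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE alerts (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO alerts (status) VALUES ('pending'), ('pending')")

def claim_one(conn):
    """Atomically claim one pending alert; return its id, or None if none left."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    row = conn.execute(
        "SELECT id FROM alerts WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        conn.execute("ROLLBACK")
        return None
    conn.execute("UPDATE alerts SET status = 'sending' WHERE id = ?", (row[0],))
    conn.execute("COMMIT")  # the claim only becomes visible to other workers here
    return row[0]

print(claim_one(conn))
print(claim_one(conn))
print(claim_one(conn))
```

Because the status update happens inside the same transaction as the read, a second sendalerts worker cannot claim the same row; a crash before COMMIT rolls the claim back, so the alert stays pending and gets retried.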
@muff1nman commented on GitHub (Jan 17, 2019):
Some thoughts on your concerns around addressing scenario 1:
1. This would be directly solved by transactions and row locking.
2. A thread pool would probably help in this regard.
3. I think doing a generous backoff would help here, i.e. wait 1 minute after the first failure, then 5 minutes, then 30 minutes, then 6 hours, etc.
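The proposed schedule could be expressed as a simple capped lookup. This is only a sketch of the suggestion above; the numbers come from the comment, and the function name is made up:

```python
# Hypothetical backoff schedule from the comment above (not Healthchecks code):
# wait 1 minute after the first failure, then 5 minutes, 30 minutes, 6 hours.
SCHEDULE = [60, 300, 1800, 21600]  # delays in seconds

def retry_delay(attempt: int) -> int:
    """Delay before retry `attempt` (1-based), capped at the schedule's last step."""
    return SCHEDULE[min(attempt, len(SCHEDULE)) - 1]

for attempt in range(1, 7):
    print(attempt, retry_delay(attempt))
```

Capping at the last step means a check that stays down keeps alerting every 6 hours rather than backing off indefinitely.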
@cuu508 commented on GitHub (Jan 22, 2019):
The retry logic is in `transports.py`, in the `HttpTransport` class; see the `get`, `post` and `put` methods.

I was thinking about races where two notifications for the same check are delivered in the wrong order. This is more likely to happen when the retry delay is minutes or hours. It can be worked around (a later notification could cancel any earlier unsent notifications for the same check); it just needs careful planning.
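For orientation, the "try a few times with no delay, then report an error" shape described in this thread looks roughly like the sketch below. This is an illustration using urllib, not the actual `HttpTransport` code; the function name and error strings are invented.

```python
from urllib import error, request

def post_with_retries(url: str, data: bytes, tries: int = 3):
    """Try to POST up to `tries` times with no delay between attempts.
    Return None on success, or an error string if every attempt failed."""
    last_error = "Connection failed"
    for _ in range(tries):
        try:
            # urlopen raises HTTPError for 4xx/5xx responses,
            # so reaching the `with` body means the request succeeded
            with request.urlopen(request.Request(url, data=data), timeout=5):
                return None
        except error.HTTPError as e:
            last_error = "Received status code %d" % e.code
        except error.URLError:
            last_error = "Connection failed"
    return last_error
```

Only the final failure is reported, which matches the behavior described earlier: intermediate failed attempts leave no trace, and there is no pause between them.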
To make sure I understand the issue: the self-hosted Healthchecks instance sometimes has network connectivity problems and cannot reach the PagerDuty API?
@cuu508 commented on GitHub (Jul 20, 2019):
In practice, the naive "try 3 times, then give up" approach seems to work reasonably well.
Exponential backoff could be problematic, as the notifications need to be timely. A "down" notification that arrives an hour late might just cause extra confusion.
I might reconsider this in the future, but for now I'm inclined to keep the notification sending logic as-is.