[GH-ISSUE #39] Concurrent sendalerts #19

Closed
opened 2026-02-25 23:40:48 +03:00 by kerem · 2 comments

Originally created by @cuu508 on GitHub (Jan 31, 2016).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/39

I've been thinking about and experimenting with HA stuff quite a bit recently. One of the pieces of the puzzle is being able to run two "sendalerts" worker processes concurrently, so if one goes away, the other one picks up the slack. Obviously they must not send notifications twice, so there needs to be some kind of locking mechanism.

Assume the database (PostgreSQL) is beefy, can take the load, has automatic failover, etc. Assume there's no separate message broker. Here's what I'm considering:

  1. Assign a unique name to each worker process running the "sendalerts" command.
  2. Add two new fields to the `api_check` table: `alert_worker` (varchar) and `alert_date` (date).
  3. Each worker polls the `api_check` table and looks for alerts that need to be sent, same as now. But when it finds one:
  • it updates the `alert_worker` and `alert_date` columns with the worker's name and the current date, and commits
  • it reads both values back
  • if either value has changed, it does nothing and goes back to polling for more work
  • if both values are unchanged, it goes ahead and sends the alert
  • after the alert is sent, it clears out the `alert_worker` and `alert_date` fields
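
The handshake in steps 2–3 could be sketched roughly like this. This is an illustration only: sqlite3 stands in for PostgreSQL/MySQL, the schema is stripped down to the two proposed columns, and the function names are made up for the example.

```python
import datetime
import sqlite3


def claim(conn, check_id, worker, today):
    # Step one of the handshake: write our worker name and the current
    # date, then commit so other workers can see the values.
    conn.execute(
        "UPDATE api_check SET alert_worker = ?, alert_date = ? WHERE id = ?",
        (worker, today, check_id),
    )
    conn.commit()


def holds_lock(conn, check_id, worker, today):
    # Step two: read both values back. If another worker committed after
    # us, the row no longer carries our values and we must back off.
    row = conn.execute(
        "SELECT alert_worker, alert_date FROM api_check WHERE id = ?",
        (check_id,),
    ).fetchone()
    return row == (worker, today)


def release(conn, check_id):
    # After the alert is sent, blank out both fields again.
    conn.execute(
        "UPDATE api_check SET alert_worker = NULL, alert_date = NULL WHERE id = ?",
        (check_id,),
    )
    conn.commit()
```

A worker would call `claim()`, send the alert only if `holds_lock()` still returns True, and finally `release()`; whichever worker committed last between claim and re-read wins, and the loser goes back to polling.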

My thinking is, if the database supports at least the "read committed" isolation level, this should prevent multiple workers from sending duplicate alerts. The `alert_worker` field is effectively an application-level lock, and `alert_date` can be used for calculating its expiry (in case a worker dies before it blanks out the two fields).

Alternatively, with PostgreSQL, advisory locks could be used, and would likely perform better. But MySQL would still need a separate mechanism like the one above.
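
As a rough sketch of the advisory-lock route: `pg_try_advisory_lock()` takes a signed 64-bit key, so each check needs a stable mapping to one. The hashing scheme and the psycopg2 call below are assumptions for illustration, not anything from the issue.

```python
import hashlib


def advisory_key(check_code: str) -> int:
    # Map a check's identifier (e.g. its UUID) to a signed 64-bit integer,
    # which is what PostgreSQL advisory lock functions expect.
    digest = hashlib.sha1(check_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# With a psycopg2 cursor, a worker would then attempt (sketch):
#   cur.execute("SELECT pg_try_advisory_lock(%s)", (advisory_key(code),))
#   (got_lock,) = cur.fetchone()  # True only for the first worker to ask
```

The lock is held by the session, so a worker that dies releases it automatically when its connection drops, which sidesteps the expiry-date bookkeeping above.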

kerem closed this issue 2026-02-25 23:40:48 +03:00

@diwu1989 commented on GitHub (Feb 19, 2016):

this is too complicated, please just use Celery with RabbitMQ, spin up multiple workers, and configure RabbitMQ for persistent HA of the message queue

the bottleneck is the interaction with the external services, not the actual DB alert-checking query, so have one scheduled task in Celery to check and then queue up N alert tasks


@cuu508 commented on GitHub (Sep 15, 2016):

Now it should be safe to run multiple `manage.py sendalerts` processes at the same time.
The important part is this:

```
# One UPDATE ... WHERE id = ... AND status = <old status>; only the
# process whose UPDATE actually matched a row "wins" the flip.
q = Check.objects.filter(id=check.id, status=check.status)
num_updated = q.update(status=flipped)
```

In a race condition where two processes try to update the check's status to the same value, only one of them will get back "number of rows changed = 1". The race winner then sends out notifications.

Notifications are sent on a thread so long-running HTTP requests don't stall the loop.
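
A rough, self-contained re-creation of that pattern, with sqlite3 standing in for the real database and a plain callable standing in for the notification code:

```python
import sqlite3
import threading


def flip_and_notify(conn, check_id, old_status, new_status, notify):
    # Compare-and-swap: the WHERE clause matches only if the status is
    # still what we read earlier, so of two racing processes exactly one
    # sees rowcount == 1.
    cur = conn.execute(
        "UPDATE api_check SET status = ? WHERE id = ? AND status = ?",
        (new_status, check_id, old_status),
    )
    conn.commit()
    if cur.rowcount != 1:
        return None  # lost the race; another process already flipped it
    # The winner sends notifications on a thread so long-running HTTP
    # requests don't stall the polling loop.
    t = threading.Thread(target=notify, args=(check_id,))
    t.start()
    return t
```

The thread is returned only so a caller (or test) can join it; the polling loop itself would just fire and move on.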
