[GH-ISSUE #809] Exit code whitelisting #569

Closed
opened 2026-02-25 23:42:53 +03:00 by kerem · 8 comments
Originally created by @quentinus95 on GitHub (Mar 24, 2023).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/809

Hello, would it be possible to have a feature that allows some non-0 exit codes to be whitelisted and considered as a success (or a warning for instance)?

I have some scripts that can end with a non-0 exit code that is not critical. It would be nice to be able to allow them and still consider execution as successful.

kerem closed this issue 2026-02-25 23:42:53 +03:00

@cuu508 commented on GitHub (Jul 14, 2023):

Thanks for the suggestion. Technically possible of course, but I'm not sure how widely applicable this would be – is it common to have scripts that return non-zero exit codes in success scenarios, with no way to influence this either by passing parameters, by editing the scripts, or by using wrapper scripts with additional conditional logic?


@quentinus95 commented on GitHub (Jul 20, 2023):

Hello @cuu508, here is one example I have in mind: when `rsync` performs a copy of a folder (e.g., a backup) and some files are deleted before they are copied (`rsync` performs a scan of the files, then runs the backup), it may return a non-zero exit code. In some (most?) scenarios, it is fine to ignore that specific error code because it can be related to some logs that were rotated, or some lock files that were removed (which is fine when performing a snapshot).

In such situations, it would be nice to have a warning state, rather than a failure. It would allow in the previous example to say “maybe you're backing up some folders or files that should be ignored”. Those situations are fine and can be investigated later (very different from a backup that failed to execute and might require immediate action).
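For this `rsync` case specifically, a small wrapper can already remap the allow-listed code before pinging: rsync's man page documents exit code 24 as "partial transfer due to vanished source files". A minimal sketch (the paths and ping UUID in the usage comment are placeholders):

```shell
#!/bin/sh
# Remap allow-listed exit codes to success before reporting to Healthchecks.
# rsync exit 24 ("partial transfer due to vanished source files") is treated as OK here.
remap() {
  case "$1" in
    0|24) echo 0 ;;    # allow-listed: report as success
    *)    echo "$1" ;; # anything else passes through as failure
  esac
}

# Usage sketch (paths and UUID are placeholders):
#   rsync -a /data/ /backup/
#   curl -fsS -m 10 --retry 5 "https://hc-ping.com/your-uuid-here/$(remap $?)"
```

This relies on the documented ping convention of appending the exit status to the ping URL, where `0` records a success and any other value records a failure.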


@davidtorosyan commented on GitHub (Aug 27, 2023):

I have a similar use case, also backup related.

I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If I want to retain "start", then I'd need a way to signal that a run is canceled. Using an allow-listed non-zero status code could work for that.


@cuu508 commented on GitHub (Aug 28, 2023):

@davidtorosyan a couple of questions, so I understand your use case:

> I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

Why does it bail out on manual triggers? Do manual and automatic triggers launch the job differently? Or does the backup job somehow recognize that "it's not the right time for me to run"?

> I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If the job does what it is supposed to do (which may be "nothing" in some cases), why not count it as success?

> If I want to retain "start", then I'd need a way to signal that a run is canceled.

At the time when you send the "start" signal, you do not yet know if the job will be cancelled / bail out, correct? Like, the script starts up, then recognizes that some condition is not met, and bails out? What is that condition?

If you could detect the bail out condition near the start of the script, perhaps you could send the "start" signal only after it is clear the script will [attempt to] run fully?


@davidtorosyan commented on GitHub (Aug 28, 2023):

@cuu508 good questions! Let me try and answer with pseudocode:

```kt
/* backup script, to be run daily */

// start for timing
http.post("hc.com/backup/start")

// expensive call, ideally happens after start
data = readData()

// the data only changes every 6 hours, so this will bail out if we run more frequently
// this is neither success nor failure, but a no-op.
// if we count this as success, then we won't be alerted if the data stops changing (which is unexpected)
if ! data.changedSinceLastBackup {
  exit
}

try {
  data.backup()
  http.post("hc.com/backup/success")
} catch {
  http.post("hc.com/backup/fail")
}
```

I see an additional solution I didn't consider before: solving this with two health checks, one for the backup script and one for the successful backup itself. That way I'd have a signal for the backup script running (and succeeding even in the bail-out case) and for an actual backup being done with a daily frequency.


@davidtorosyan commented on GitHub (Aug 28, 2023):

After thinking about it more, I think I might be doing too much with healthchecks.

From what I can tell, healthchecks is best at making sure that a job is running with a given schedule (i.e. the backup job runs daily), not validating arbitrary conditions (i.e. the data that's backed up is the data I want).

That said I still do have a need for the latter, so maybe what I'll do is something like this:

```kt
/* append this to backup script described in previous comment */

backups = getBackups()
if backups.latest > ago(1d) {
  http.post("hc.com/backups_healthy/success")
} else {
  http.post("hc.com/backups_healthy/fail")
}
```

@quentinus95 commented on GitHub (Jul 29, 2024):

@cuu508 which alternative would you suggest?


@cuu508 commented on GitHub (Jul 29, 2024):

@quentinus95 if you want to treat some non-zero exit codes as success, use a wrapper script which inspects the exit code and decides whether to report success or failure to Healthchecks.

If you want additional warning state, use monitoring software that supports metric collection, and configurable alerting rules based on collected metric values.

If you want to use something sort-of similar to Healthchecks, look into [sensorpad](https://sensorpad.io/).
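The wrapper-script approach suggested here could be sketched roughly as follows; the job command, the allow-list, and the ping UUID are all placeholders:

```shell
#!/bin/sh
# Generic wrapper: run a job and treat a configurable set of exit codes as success.
ALLOWED="0 24"  # placeholder allow-list of acceptable exit codes

# is_allowed CODE: succeed (return 0) if CODE is in the allow-list
is_allowed() {
  for code in $ALLOWED; do
    if [ "$1" = "$code" ]; then
      return 0
    fi
  done
  return 1
}

# Usage sketch (job and UUID are placeholders):
#   /usr/local/bin/backup.sh; status=$?
#   if is_allowed "$status"; then
#     curl -fsS -m 10 --retry 5 "https://hc-ping.com/your-uuid-here"
#   else
#     curl -fsS -m 10 --retry 5 "https://hc-ping.com/your-uuid-here/fail"
#   fi
```

The `/fail` endpoint records an explicit failure; a plain ping records success, so the wrapper fully controls how each exit code is reported.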
