[GH-ISSUE #609] Implement Object Storage for body

kerem commented

2026-02-25 23:42:28 +03:00

Owner

Originally created by @Wouter0100 on GitHub (Feb 12, 2022).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/609

Personally I'm thinking of setting up a Healthchecks.io instance and one of the features I'd love to see is the ability to store all the body information from a success ping. As far as I know this is not ideal to store in a database, and thus integration with an Object Storage (S3) would be ideal. This way we should be able to store hundreds, or thousands, of log entries - with multiple MB's of body.

Is there any interest in this for Healthchecks.io, and so that I should put in time to develop it and open a PR? :)

I would like to look into the ability to stream the body directly to S3 (not store it in memory), but I don't have any experience with Django, nor Python, so it would be a challenge. Thus far I looked into the code and it seems reasonable to do.

Originally created by @Wouter0100 on GitHub (Feb 12, 2022). Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/609 Personally I'm thinking of setting up a Healthchecks.io instance and one of the features I'd love to see is the ability to store all the body information from a success ping. As far as I know this is not ideal to store in a database, and thus integration with an Object Storage (S3) would be ideal. This way we should be able to store hundreds, or thousands, of log entries - with multiple MB's of body. Is there any interest in this for Healthchecks.io, and so that I should put in time to develop it and open a PR? :) I would like to look into the ability to stream the body directly to S3 (not store it in memory), but I don't have any experience with Django, nor Python, so it would be a challenge. Thus far I looked into the code and it seems reasonable to do.

kerem closed this issue

2026-02-25 23:42:28 +03:00

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@cuu508 commented on GitHub (Feb 14, 2022):

Thanks for the suggestion, this is an interesting idea! I've been thinking about it over the weekend.

My first gut reaction was – no, it would be too slow, too complex and potentially too expensive:

Slow: the ping endpoint is the busiest part of the system. It needs to be as quick as possible. Currently, a single process can (and sometimes must) process 1000+ ping requests per second. If we add S3 operations in the mix, the requests/second would take a big hit.
Complex: additional code to upload objects, to clean up old data in the bucket, and to retrieve objects when user wants to view them. Also, I maintain two implementations of the ping handler. One is the Python one in this repository, and the other one is Go + PL/pgSQL (closed source). Whenever there's a proposed change to the ping API, I must think how feasible it would be to implement in both codebases.
Expensive: would have to be careful with S3 per-request charges.

But thinking more about it...

There's a number of alternative object storage providers with S3-compatible API. Some of them don't have per-request charges, and have better bandwidth and storage prices.
What if there was an async process that periodically scans the api_ping table and migrates big ping bodies to S3? The ping handler would need no changes – it would still insert ping bodies in the database as it does currently.

Fleshing out this idea, a list of random thoughts. From now on, when I say "S3" I mean any S3-compatible provider.

There would still need to be a configurable max ping body limit, but it could be bumped up. No limit would be too easy to abuse. The limit would be conservative on Healthchecks.io but could be set higher on self-hosted installations.
It probably makes no sense to upload very small objects to S3. There would need to be some lower threshold, say 100 bytes. If the body is below 100 bytes, keep it in the database. If it's above, upload it to S3.
Object naming: use a deterministic naming convention like "/{uuid}/{n}" where uuid is check's code, and n is the "n" field in the api_ping table. This way, we don't need to store object paths on our side, we can generate them.
How to clean up old objects in the bucket? One way to do it would be in the same worker process that uploads new objects. Whenever the worker uploads a new object for a given uuid, it can also check if any objects can be removed from S3 for that uuid.

The api_ping.body would be effectively be used as temporary storage before the data is offloaded to S3. Moving bulk data in and out of the database is not ideal. DB writes are expensive, and moving bulk data in and out at a fast rate will cause fragmentation. Instead of saving large ping bodies to the database, the ping handler could instead save them to files in a designated location on the local filesystem. The uploader process would look for any new files in that area and upload them. There are pros and cons:

👍 we avoid a bunch of database writes, and network traffic to and from the database
👍 writing to local filesystem is also quicker, especially since app servers currently do almost no disk IO
👍 it is easy for the uploader script to find work -- just look for any new files in a designated directory
😟 if app server dies, the recent and not yet uploaded ping bodies could be lost
😟 there is a time window from ping API accepting the ping to it being viewable in web UI. This is when the ping body has been saved to the local filesystem, but has not yet been uploaded to S3.
😟 would have to monitor upload activity and backlog very carefully.

This is where I am in terms of brainstorming at the moment. This would be a bunch of work to implement. But it would also materially improve the service: bigger ping bodies and/or more log entries per ping. So it's worth exploring!

@cuu508 commented on GitHub (Feb 14, 2022): Thanks for the suggestion, this is an interesting idea! I've been thinking about it over the weekend. My first gut reaction was – no, it would be too slow, too complex and potentially too expensive: * Slow: the ping endpoint is the busiest part of the system. It needs to be as quick as possible. Currently, a single process can (and sometimes must) process 1000+ ping requests per second. If we add S3 operations in the mix, the requests/second would take a big hit. * Complex: additional code to upload objects, to clean up old data in the bucket, and to retrieve objects when user wants to view them. Also, I maintain two implementations of the ping handler. One is the Python one in this repository, and the other one is Go + PL/pgSQL (closed source). Whenever there's a proposed change to the ping API, I must think how feasible it would be to implement in both codebases. * Expensive: would have to be careful with S3 per-request charges. But thinking more about it... * There's a number of alternative object storage providers with S3-compatible API. Some of them don't have per-request charges, and have better bandwidth and storage prices. * What if there was an async process that periodically scans the api_ping table and migrates big ping bodies to S3? The ping handler would need no changes – it would still insert ping bodies in the database as it does currently. Fleshing out this idea, a list of random thoughts. From now on, when I say "S3" I mean any S3-compatible provider. * There would still need to be a configurable max ping body limit, but it could be bumped up. No limit would be too easy to abuse. The limit would be conservative on Healthchecks.io but could be set higher on self-hosted installations. * It probably makes no sense to upload very small objects to S3. There would need to be some lower threshold, say 100 bytes. If the body is below 100 bytes, keep it in the database. If it's above, upload it to S3. * Object naming: use a deterministic naming convention like "/{uuid}/{n}" where uuid is check's code, and n is the "n" field in the api_ping table. This way, we don't need to store object paths on our side, we can generate them. * How to clean up old objects in the bucket? One way to do it would be in the same worker process that uploads new objects. Whenever the worker uploads a new object for a given uuid, it can also check if any objects can be removed from S3 for that uuid. The api_ping.body would be effectively be used as temporary storage before the data is offloaded to S3. Moving bulk data in and out of the database is not ideal. DB writes are expensive, and moving bulk data in and out at a fast rate will cause fragmentation. Instead of saving large ping bodies to the database, the ping handler could instead save them to files in a designated location on the local filesystem. The uploader process would look for any new files in that area and upload them. There are pros and cons: * 👍 we avoid a bunch of database writes, and network traffic to and from the database * 👍 writing to local filesystem is also quicker, especially since app servers currently do almost no disk IO * 👍 it is easy for the uploader script to find work -- just look for any new files in a designated directory * 😟 if app server dies, the recent and not yet uploaded ping bodies could be lost * 😟 there is a time window from ping API accepting the ping to it being viewable in web UI. This is when the ping body has been saved to the local filesystem, but has not yet been uploaded to S3. * 😟 would have to monitor upload activity and backlog very carefully. This is where I am in terms of brainstorming at the moment. This would be a bunch of work to implement. But it would also materially improve the service: bigger ping bodies and/or more log entries per ping. So it's worth exploring!

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@Wouter0100 commented on GitHub (Feb 16, 2022):

Wow. Thanks for the consideration and the detailed answer. Love to see that you're potentially interested into this as well. At first, unfortunately to hear that the Go implementation is closed source. I get it, but that's the language I do know and would be able to better implement it.

Personally I did not mean the S3 service of AWS, indeed. But I really meant the S3 API that more providers implement. Which, indeed, could be a cost saving. Personally I'm a fan of the Scaleway Object Storage service, where large buckets are supported and there's no API per-request fee, nor a bandwidth fee (with a specific setup).

Besides the benefits, I would not recommend to use the local storage nor the database as a temporary storage. Databases are not designed for it and you put additional read/write load on the DB. As an example, MySQL won't reduce the size of the DB if using InnoDB. This would result in very large table files and potentially large backups as well when backups are taken when there are large amounts of bodies stored. I don't have any experience with PostgreSQL, however - but I could imagine it has the same constraints. Throughout my work I've clustered various large clusters, and we had various issues with properly storing blobs resulting in downtime or unexpected behaviour.

Local file storage is difficult when using a containerized environment (e.g. Kubernetes) and normally I'd say to not use this as well for this reason. As containers may be swapped out at any time, and thus storage would be lost. Where "static" servers are used, this problem is kinda mitigated - as when you restart the system it could continue to process the locally stored files. Personally, when designing an app I always try to follow The Twelve-Factor App principles, which also recommends against it. This would make the application more suitable for High Available environments, possibly using a containerized environment.

I've done some testing: on average it takes about 500ms to upload 1MB and 100KB to the current iteration of the Scaleway Object Storage. Not that I would expect 1MB of logs, but it gives a good indication of how fast it'll be. 100KB takes also about 500ms, so it seems that it does not really differs that much depending on the upload size. I'd definitely agree that 500ms is way, way to much for the ping endpoint - and so an alternative solution should be found. Especially for the scale on Healthchecks.io. To ease the maintenance burden, I'd say this would be a possible solution for the open-source available ping handler.

A thing I see often is a queue, such as RabbitMQ (or Redis). This would introduce an additional component in the infrastructure, but would be suitable for processing the bodies. The only downside on this however, is that there is still a window between the ping is received and when the body is available in the UI. Redundancy in this case is handled by the message broker. On the other hand, if you'd decide on introducing the option to directly upload it to S3 in the open source ping handler, you could rely on the local files for Healthchecks.io - if that's a suitable solution for it.

Regardless, these are just my 2 cents - and would love to talk further about it.

@Wouter0100 commented on GitHub (Feb 16, 2022): Wow. Thanks for the consideration and the detailed answer. Love to see that you're potentially interested into this as well. At first, unfortunately to hear that the Go implementation is closed source. I get it, but that's the language I do know and would be able to better implement it. Personally I did not mean the S3 service of AWS, indeed. But I really meant the S3 API that more providers implement. Which, indeed, could be a cost saving. Personally I'm a fan of the Scaleway Object Storage service, where large buckets are supported and there's no API per-request fee, nor a bandwidth fee (with a specific setup). Besides the benefits, I would not recommend to use the local storage nor the database as a temporary storage. Databases are not designed for it and you put additional read/write load on the DB. As an example, MySQL won't reduce the size of the DB if using InnoDB. This would result in very large table files and potentially large backups as well when backups are taken when there are large amounts of bodies stored. I don't have any experience with PostgreSQL, however - but I could imagine it has the same constraints. Throughout my work I've clustered various large clusters, and we had various issues with properly storing blobs resulting in downtime or unexpected behaviour. Local file storage is difficult when using a containerized environment (e.g. Kubernetes) and normally I'd say to not use this as well for this reason. As containers may be swapped out at any time, and thus storage would be lost. Where "static" servers are used, this problem is kinda mitigated - as when you restart the system it could continue to process the locally stored files. Personally, when designing an app I always try to follow [The Twelve-Factor App](https://12factor.net/processes) principles, which also recommends against it. This would make the application more suitable for High Available environments, possibly using a containerized environment. I've done some testing: on average it takes about 500ms to upload 1MB and 100KB to the current iteration of the Scaleway Object Storage. Not that I would expect 1MB of logs, but it gives a good indication of how fast it'll be. 100KB takes also about 500ms, so it seems that it does not really differs that much depending on the upload size. I'd definitely agree that 500ms is way, way to much for the ping endpoint - and so an alternative solution should be found. Especially for the scale on Healthchecks.io. To ease the maintenance burden, I'd say this would be a possible solution for the open-source available ping handler. A thing I see often is a queue, such as RabbitMQ (or Redis). This would introduce an additional component in the infrastructure, but would be suitable for processing the bodies. The only downside on this however, is that there is still a window between the ping is received and when the body is available in the UI. Redundancy in this case is handled by the message broker. On the other hand, if you'd decide on introducing the option to directly upload it to S3 in the open source ping handler, you could rely on the local files for Healthchecks.io - if that's a suitable solution for it. Regardless, these are just my 2 cents - and would love to talk further about it.

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@cuu508 commented on GitHub (Feb 16, 2022):

As it happens, I was eyeing Scaleway Object Storage too! With no per-request fee, and free inbound bandwidth, it looks attractive. Good for us, but I wonder if Scaleway could have a problem with this usage pattern. I contacted them with a quick description of the use case, their preliminary answer was "should be OK". I followed up with more details and will see if they have any additional comments.

I don't have any experience with PostgreSQL, however - but I could imagine it has the same constraints.

Similar with PostgreSQL. Frequent inserts and deletes cause table and index bloat. There are solutions for this, but would be better to not have that problem in the first place.

Local file storage is difficult when using a containerized environment (e.g. Kubernetes) and normally I'd say to not use this as well for this reason. As containers may be swapped out at any time, and thus storage would be lost.

I have zero experience with Kubernetes, but perhaps volumes can help here? Or perhaps there's a way to do graceful shutdowns – wait until a process finishes up before killing the container?

A message broker sounds like the perfect tool for the job, except it's another piece that needs to be set up, maintained and monitored. Another moving part that can break. So far I've managed to get by with a single stateful component, the database.

On the other hand, if you'd decide on introducing the option to directly upload it to S3 in the open source ping handler, you could rely on the local files for Healthchecks.io - if that's a suitable solution for it.

For the open source handler, upload the object while handling the request, correct?

@cuu508 commented on GitHub (Feb 16, 2022): As it happens, I was eyeing Scaleway Object Storage too! With no per-request fee, and free inbound bandwidth, it looks attractive. Good for us, but I wonder if Scaleway could have a problem with this usage pattern. I contacted them with a quick description of the use case, their preliminary answer was "should be OK". I followed up with more details and will see if they have any additional comments. > I don't have any experience with PostgreSQL, however - but I could imagine it has the same constraints. Similar with PostgreSQL. Frequent inserts and deletes cause table and index bloat. There are solutions for this, but would be better to not have that problem in the first place. > Local file storage is difficult when using a containerized environment (e.g. Kubernetes) and normally I'd say to not use this as well for this reason. As containers may be swapped out at any time, and thus storage would be lost. I have zero experience with Kubernetes, but perhaps volumes can help here? Or perhaps there's a way to do graceful shutdowns – wait until a process finishes up before killing the container? A message broker sounds like the perfect tool for the job, except it's another piece that needs to be set up, maintained and monitored. Another moving part that can break. So far I've managed to get by with a single stateful component, the database. > On the other hand, if you'd decide on introducing the option to directly upload it to S3 in the open source ping handler, you could rely on the local files for Healthchecks.io - if that's a suitable solution for it. For the open source handler, upload the object while handling the request, correct?

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@Wouter0100 commented on GitHub (Feb 16, 2022):

I've been using Scaleway for quite some time now (1 year) and it's been good thus far. Especially the location in NL, the location in FR seems to have outages from time to time. Not sure why it differs, but it surely does.

They recently introduced a new backend for their Object Storage service (literally last month) called "HIVE" and since then the service improved significantly on many parts. In the past there was a limit with 500k items in a bucket, but with the new backend this was lifted as well.

If you want to chat about it, I'm quite active on the Scaleway Community slack.

For the open source handler, upload the object while handling the request, correct?

Indeed. I think this would greatly reduce the size of the implementation:

we need to upload it in the ping handler.
remove it when old pings are removed.
being able to view it in the UI.

As with this case it would not require any additional processing and no added complexity. Perhaps the removal of old ping should be moved outside of the ping handler, but besides this I think it would be feasible.

I have zero experience with Kubernetes, but perhaps volumes can help here? Or perhaps there's a way to do graceful shutdowns – wait until a process finishes up before killing the container?

If we would go with the implementation proposal above - this would not be a problem, as running the open source variant won't store anything in that case. And, the closed source ping handler could be developed differently depending on the infrastructure (e.g. by using files) :).

@Wouter0100 commented on GitHub (Feb 16, 2022): I've been using Scaleway for quite some time now (1 year) and it's been good thus far. Especially the location in NL, the location in FR seems to have outages from time to time. Not sure why it differs, but it surely does. They recently introduced a new backend for their Object Storage service (literally last month) called "HIVE" and since then the service improved significantly on many parts. In the past there was a limit with 500k items in a bucket, but with the new backend this was lifted as well. If you want to chat about it, I'm quite active on the Scaleway Community slack. > For the open source handler, upload the object while handling the request, correct? Indeed. I think this would greatly reduce the size of the implementation: - we need to upload it in the ping handler. - remove it when old pings are removed. - being able to view it in the UI. As with this case it would not require any additional processing and no added complexity. Perhaps the removal of old ping should be moved outside of the ping handler, but besides this I think it would be feasible. > I have zero experience with Kubernetes, but perhaps volumes can help here? Or perhaps there's a way to do graceful shutdowns – wait until a process finishes up before killing the container? If we would go with the implementation proposal above - this would not be a problem, as running the open source variant won't store anything in that case. And, the closed source ping handler could be developed differently depending on the infrastructure (e.g. by using files) :).

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@cuu508 commented on GitHub (Feb 23, 2022):

A quick update:

Got a positive reply from Scaleway: the use case fits, there should be no rate limits or technical limitations
Started work on implementation, and have an early, experimental version working locally. Will write test cases for the new bits, and then push the changes.

@cuu508 commented on GitHub (Feb 23, 2022): A quick update: * Got a positive reply from Scaleway: the use case fits, there should be no rate limits or technical limitations * Started work on implementation, and have an early, experimental version working locally. Will write test cases for the new bits, and then push the changes.

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@cuu508 commented on GitHub (Mar 18, 2022):

An initial implementation is ready, and is available in v2.0.1, released today.

I've added new environment variables for specifying S3 credentials (see https://github.com/healthchecks/healthchecks#external-object-storage). When these environment variables are set, Healthchecks will upload ping bodies to S3 using the minio library (which needs to be installed, pip install minio).

This functionality is not yet enabled on the hosted service (https://healthchecks.io), but it's coming along.

I've tested the new functionality with two S3-compatible object storage services: Scaleway and OVH.

Scaleway: The DeleteObjects API call did not work from minio. I reported the issue, it got fixed 👍. The API response times were mediocre. The DeleteObjects call is especially slow, with ~50 object names in the payload, it takes 10-30 seconds to complete. Sometimes it takes over 60 seconds. The service also seemed to have stability issues, API calls sometimes return "InternalError" in the response, and I also encountered a service-wide outage, when even the web dashboard was down.

OVH: The DeleteObjects API call did not work from minio, same as Scaleway. I reported the issue, it's acknowledged but not fixed yet. The API response times are significantly better. The DeleteObjects call with 50 filenames, for example, takes 300ms to 1 second in my testing. I have not yet used OVH enough to have a sense of their reliability in general.

@cuu508 commented on GitHub (Mar 18, 2022): An initial implementation is ready, and is available in v2.0.1, released today. I've added new environment variables for specifying S3 credentials (see https://github.com/healthchecks/healthchecks#external-object-storage). When these environment variables are set, Healthchecks will upload ping bodies to S3 using the minio library (which needs to be installed, `pip install minio`). This functionality is not yet enabled on the hosted service (https://healthchecks.io), but it's coming along. I've tested the new functionality with two S3-compatible object storage services: Scaleway and OVH. Scaleway: The [DeleteObjects](https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html) API call did not work from minio. I reported the issue, it got fixed 👍. The API response times were mediocre. The DeleteObjects call is especially slow, with ~50 object names in the payload, it takes 10-30 seconds to complete. Sometimes it takes over 60 seconds. The service also seemed to have stability issues, API calls sometimes return "InternalError" in the response, and I also encountered a service-wide outage, when even the web dashboard was down. OVH: The DeleteObjects API call did not work from minio, same as Scaleway. I reported the issue, it's acknowledged but not fixed yet. The API response times are significantly better. The DeleteObjects call with 50 filenames, for example, takes 300ms to 1 second in my testing. I have not yet used OVH enough to have a sense of their reliability in general.

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@Wouter0100 commented on GitHub (Mar 18, 2022):

Oh whow! That was fast. Awesome to see.

Yeah, Scaleway does have outages from time to time. We haven't had them on their NL-AMS location, but on FR I see quite a couple of issues (luckily, we're not located there).

@Wouter0100 commented on GitHub (Mar 18, 2022): Oh whow! That was fast. Awesome to see. Yeah, Scaleway does have outages from time to time. We haven't had them on their NL-AMS location, but on FR I see quite a couple of issues (luckily, we're not located there).

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@mike503 commented on GitHub (Apr 2, 2022):

my $.02: I'd think it might be better to add this into runitor or similar tool - that tool itself has the S3/larger log support and uses the smaller log support of healthchecks.io to point to the customer owned S3/location. less burden and requirements on healthchecks.io.

@mike503 commented on GitHub (Apr 2, 2022): my $.02: I'd think it might be better to add this into runitor or similar tool - that tool itself has the S3/larger log support and uses the smaller log support of healthchecks.io to point to the customer owned S3/location. less burden and requirements on healthchecks.io.

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@cuu508 commented on GitHub (Apr 4, 2022):

On self-hosted instances the S3 functionality is optional: if you don't specify the S3 credentials, Healthchecks stores ping body in the database as it used to.

I agree at certain log size it makes sense to handle logs on the client side. The client has many options then – upload logs to something other than S3, stash the logs locally and have another machine pick them up, do additional filtering, compression, encryption etc.

But if the logs are small, and there are no other special requirements, it's nice to have an option to avoid that client-side complexity.

@cuu508 commented on GitHub (Apr 4, 2022): On self-hosted instances the S3 functionality is optional: if you don't specify the S3 credentials, Healthchecks stores ping body in the database as it used to. I agree at certain log size it makes sense to handle logs on the client side. The client has many options then – upload logs to something other than S3, stash the logs locally and have another machine pick them up, do additional filtering, compression, encryption etc. But if the logs are small, and there are no other special requirements, it's nice to have an option to avoid that client-side complexity.

kerem commented

2026-02-25 23:42:29 +03:00

Author

Owner

@mike503 commented on GitHub (Apr 4, 2022):

I agree. However this sounds like you’re exploring storing more logs than right now, wanted to chime in about that. :p

For example I’d much rather see #626 work for sure before adding more storage on :)

@mike503 commented on GitHub (Apr 4, 2022): I agree. However this sounds like you’re exploring storing more logs than right now, wanted to chime in about that. :p For example I’d much rather see #626 work for sure before adding more storage on :)

Rows
Columns

[GH-ISSUE #609] Implement Object Storage for body #442