[GH-ISSUE #609] Implement Object Storage for body #442
Originally created by @Wouter0100 on GitHub (Feb 12, 2022).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/609
Personally I'm thinking of setting up a Healthchecks.io instance, and one of the features I'd love to see is the ability to store all the body information from a success ping. As far as I know a database is not the ideal place for this, so integration with an Object Storage service (S3) would be ideal. That way we should be able to store hundreds, or thousands, of log entries, each with multiple MBs of body.
Is there any interest in this for Healthchecks.io, such that I should put time into developing it and opening a PR? :)
I would like to look into the ability to stream the body directly to S3 (not store it in memory), but I don't have any experience with Django, nor Python, so it would be a challenge. So far I've looked into the code and it seems reasonable to do.
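As a rough illustration of what the simplest (non-streaming) version of that upload could look like with an S3-compatible client (the thread below settles on the minio Python library; the endpoint, credentials, bucket, and key scheme here are placeholders, not the project's actual code):

```python
import io

from minio import Minio

# Hypothetical wiring: endpoint and credentials are placeholders.
client = Minio("s3.example.com", access_key="ACCESS_KEY", secret_key="SECRET_KEY")

def upload_ping_body(check_code: str, n: int, body: bytes) -> None:
    """Store one ping body as its own object, keyed by check code and ping number."""
    client.put_object(
        "ping-bodies",        # bucket name (placeholder)
        f"{check_code}/{n}",  # one object per ping
        io.BytesIO(body),
        length=len(body),
    )
```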
@cuu508 commented on GitHub (Feb 14, 2022):
Thanks for the suggestion, this is an interesting idea! I've been thinking about it over the weekend.
My first gut reaction was – no, it would be too slow, too complex, and potentially too expensive.
But thinking more about it...
Fleshing out this idea, here is a list of random thoughts. From now on, when I say "S3" I mean any S3-compatible provider.
The api_ping.body column would effectively be used as temporary storage before the data is offloaded to S3. Moving bulk data in and out of the database is not ideal: DB writes are expensive, and moving bulk data in and out at a fast rate will cause fragmentation. Instead of saving large ping bodies to the database, the ping handler could save them to files in a designated location on the local filesystem. The uploader process would then look for any new files in that area and upload them. There are pros and cons to this approach.
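A minimal sketch of that spool-and-upload split, assuming a spool directory on the local filesystem and the minio client; the path, bucket, and flat object names are illustrative only:

```python
import os
import tempfile

from minio import Minio

SPOOL_DIR = "/var/spool/healthchecks"  # assumed location, not from the project

def spool_ping_body(object_name: str, body: bytes) -> None:
    """Ping handler side: write the body atomically, then return immediately."""
    fd, tmp_path = tempfile.mkstemp(dir=SPOOL_DIR)
    with os.fdopen(fd, "wb") as f:
        f.write(body)
    # Atomic rename, so the uploader never sees a half-written file.
    os.replace(tmp_path, os.path.join(SPOOL_DIR, object_name))

def upload_pass(client: Minio, bucket: str) -> None:
    """Uploader side: push every spooled file to S3, then delete it locally."""
    for entry in os.scandir(SPOOL_DIR):
        if entry.is_file():
            client.fput_object(bucket, entry.name, entry.path)
            os.unlink(entry.path)
```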
This is where I am in terms of brainstorming at the moment. This would be a bunch of work to implement. But it would also materially improve the service: bigger ping bodies and/or more log entries per ping. So it's worth exploring!
@Wouter0100 commented on GitHub (Feb 16, 2022):
Wow. Thanks for the consideration and the detailed answer. Love to see that you're potentially interested in this as well. First off, it's unfortunate to hear that the Go implementation is closed source. I get it, but Go is the language I do know and would be better able to implement this in.
Indeed, I did not mean the AWS S3 service specifically, but rather the S3 API that many providers implement. Which, indeed, could be a cost saving. Personally I'm a fan of the Scaleway Object Storage service, where large buckets are supported and there is no per-request API fee, nor a bandwidth fee (with a specific setup).
Besides the benefits, I would not recommend using local storage nor the database as temporary storage. Databases are not designed for it, and you put additional read/write load on the DB. As an example, MySQL won't shrink the database files when using InnoDB. This would result in very large table files, and potentially large backups as well, if backups are taken while large amounts of bodies are stored. I don't have any experience with PostgreSQL, but I could imagine it has the same constraints. Throughout my work I've operated various large database clusters, and we had repeated issues with properly storing blobs, resulting in downtime or unexpected behaviour.
Local file storage is difficult in a containerized environment (e.g. Kubernetes), and normally I'd advise against it for this reason as well: containers may be swapped out at any time, and any locally stored data would be lost. Where "static" servers are used this problem is somewhat mitigated, as after a restart the system can continue processing the locally stored files. When designing an app I always try to follow The Twelve-Factor App principles, which also recommend against local state. Avoiding it would make the application more suitable for highly available, possibly containerized, environments.
I've done some testing: on average it takes about 500ms to upload either 1MB or 100KB to the current iteration of the Scaleway Object Storage. Not that I would expect 1MB of logs, but it gives a good indication of how fast it'll be. Since 100KB also takes about 500ms, the time does not seem to differ much with upload size. I'd definitely agree that 500ms is way, way too much for the ping endpoint, so an alternative solution should be found, especially at the scale of Healthchecks.io. To ease the maintenance burden, I'd say this would still be a possible solution for the open-source ping handler.
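For reference, a rough way to reproduce this kind of measurement with the minio client (endpoint, credentials, and bucket are placeholders):

```python
import io
import os
import time

from minio import Minio

client = Minio(
    "s3.nl-ams.scw.cloud",  # a Scaleway endpoint, for illustration
    access_key=os.environ["ACCESS_KEY"],
    secret_key=os.environ["SECRET_KEY"],
)

for size in (100 * 1024, 1024 * 1024):  # 100KB and 1MB test payloads
    payload = os.urandom(size)
    start = time.monotonic()
    client.put_object("test-bucket", f"timing-{size}", io.BytesIO(payload), len(payload))
    print(f"{size} bytes uploaded in {time.monotonic() - start:.3f}s")
```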
A pattern I see often is a queue, such as RabbitMQ (or Redis). This would introduce an additional component in the infrastructure, but would be suitable for processing the bodies. The only downside is that there is still a window between when the ping is received and when the body is available in the UI. Redundancy in this case is handled by the message broker. On the other hand, if you'd decide to introduce the option to upload directly to S3 in the open source ping handler, you could rely on the local files for Healthchecks.io itself, if that's a suitable solution for it.
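A minimal sketch of the queue idea, using Redis as the broker; the queue name, payload framing, and upload_to_s3 helper are all assumptions for illustration:

```python
import redis

r = redis.Redis()      # assumes a local Redis instance
QUEUE = "ping_bodies"  # hypothetical queue name

def enqueue_body(object_name: str, body: bytes) -> None:
    """Ping handler: hand the body to the broker and return immediately."""
    r.rpush(QUEUE, object_name.encode() + b"\x00" + body)

def worker_loop() -> None:
    """Separate worker: block on the queue and upload each body to S3."""
    while True:
        _, item = r.blpop(QUEUE)
        object_name, _, body = item.partition(b"\x00")
        upload_to_s3(object_name.decode(), body)  # placeholder for the actual S3 upload
```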
Regardless, these are just my 2 cents, and I would love to talk further about it.
@cuu508 commented on GitHub (Feb 16, 2022):
As it happens, I was eyeing Scaleway Object Storage too! With no per-request fee, and free inbound bandwidth, it looks attractive. Good for us, but I wonder if Scaleway could have a problem with this usage pattern. I contacted them with a quick description of the use case, their preliminary answer was "should be OK". I followed up with more details and will see if they have any additional comments.
It's similar with PostgreSQL: frequent inserts and deletes cause table and index bloat. There are solutions for this, but it would be better to not have that problem in the first place.
I have zero experience with Kubernetes, but perhaps volumes can help here? Or perhaps there's a way to do graceful shutdowns – wait until a process finishes up before killing the container?
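For what it's worth, a sketch of what a graceful shutdown could look like for the uploader process (plain Python signal handling, independent of any orchestrator; Kubernetes sends SIGTERM before force-killing a container, so finishing the current pass covers that case):

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes (and `docker stop`) send SIGTERM before SIGKILL.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # One unit of work per iteration, e.g. one pass over the spool
    # directory (see the upload_pass sketch earlier in the thread).
    time.sleep(1)
# The loop exits only between passes, so no upload is cut off mid-flight.
```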
A message broker sounds like the perfect tool for the job, except it's another piece that needs to be set up, maintained and monitored. Another moving part that can break. So far I've managed to get by with a single stateful component, the database.
For the open source handler, upload the object while handling the request, correct?
@Wouter0100 commented on GitHub (Feb 16, 2022):
I've been using Scaleway for quite some time now (about a year) and it's been good thus far. Especially the NL location; the FR location seems to have outages from time to time. Not sure why they differ, but they surely do.
They recently introduced a new backend for their Object Storage service (literally last month), called "HIVE", and since then the service has improved significantly in many areas. In the past there was a limit of 500k items per bucket, but with the new backend this was lifted as well.
If you want to chat about it, I'm quite active on the Scaleway Community slack.
Indeed. I think this would greatly reduce the size of the implementation.
In this case it would not require any additional processing and adds no complexity. Perhaps the removal of old pings should be moved outside of the ping handler, but besides this I think it would be feasible.
If we went with the implementation proposal above, this would not be a problem, as the open source variant wouldn't store anything locally in that case. And the closed source ping handler could be developed differently depending on the infrastructure (e.g. by using files) :).
@cuu508 commented on GitHub (Feb 23, 2022):
A quick update:
@cuu508 commented on GitHub (Mar 18, 2022):
An initial implementation is ready, and is available in v2.0.1, released today.
I've added new environment variables for specifying S3 credentials (see https://github.com/healthchecks/healthchecks#external-object-storage). When these environment variables are set, Healthchecks will upload ping bodies to S3 using the minio library (which needs to be installed: pip install minio). This functionality is not yet enabled on the hosted service (https://healthchecks.io), but it's coming along.
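As a hedged illustration of how such variables could be consumed (the variable names follow the README section linked above, which has the authoritative list; this wiring is illustrative, not the project's actual code):

```python
import os

from minio import Minio

# Variable names as documented in the README's "External object storage" section.
client = Minio(
    os.environ["S3_ENDPOINT"],
    access_key=os.environ["S3_ACCESS_KEY"],
    secret_key=os.environ["S3_SECRET_KEY"],
    region=os.environ.get("S3_REGION"),
)
```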
I've tested the new functionality with two S3-compatible object storage services: Scaleway and OVH.
Scaleway: The DeleteObjects API call did not work from minio. I reported the issue, and it got fixed 👍. The API response times were mediocre. The DeleteObjects call is especially slow: with ~50 object names in the payload it takes 10-30 seconds to complete, and sometimes over 60 seconds. The service also seemed to have stability issues: API calls sometimes return "InternalError" in the response, and I also encountered a service-wide outage where even the web dashboard was down.
OVH: The DeleteObjects API call did not work from minio either, same as with Scaleway. I reported the issue; it's acknowledged but not fixed yet. The API response times are significantly better: the DeleteObjects call with 50 object names, for example, takes 300ms to 1 second in my testing. I have not yet used OVH enough to have a sense of their reliability in general.
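For context, a batch delete in the minio Python client looks roughly like this (bucket and object names are placeholders). remove_objects issues the DeleteObjects API call under the hood and returns a lazy iterator of errors, which has to be consumed for the request to actually go out:

```python
from minio import Minio
from minio.deleteobjects import DeleteObject

client = Minio("s3.example.com", access_key="ACCESS_KEY", secret_key="SECRET_KEY")

# Delete 50 objects in a single DeleteObjects request.
to_delete = (DeleteObject(f"some-check-code/{n}") for n in range(1, 51))
for err in client.remove_objects("ping-bodies", to_delete):
    # remove_objects is lazy: requests are sent as this iterator is consumed.
    print("deletion error:", err)
```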
@Wouter0100 commented on GitHub (Mar 18, 2022):
Oh wow! That was fast. Awesome to see.
Yeah, Scaleway does have outages from time to time. We haven't had any in their NL-AMS location, but in FR I see quite a few issues (luckily, we're not located there).
@mike503 commented on GitHub (Apr 2, 2022):
My $.02: I'd think it might be better to add this to runitor or a similar tool. That tool itself would have the S3/larger log support, and would use the smaller log support of healthchecks.io to point to the customer-owned S3 location. Less burden and fewer requirements on healthchecks.io.
@cuu508 commented on GitHub (Apr 4, 2022):
On self-hosted instances the S3 functionality is optional: if you don't specify the S3 credentials, Healthchecks stores ping bodies in the database as it used to.
I agree that at a certain log size it makes sense to handle logs on the client side. The client then has many options – upload logs to something other than S3, stash the logs locally and have another machine pick them up, do additional filtering, compression, encryption, etc.
But if the logs are small, and there are no other special requirements, it's nice to have an option to avoid that client-side complexity.
@mike503 commented on GitHub (Apr 4, 2022):
I agree. However, it sounds like you're exploring storing more logs than right now, so I wanted to chime in about that. :p
For example, I'd much rather see #626 working for sure before adding more storage on top. :)