[GH-ISSUE #1023] Ping logs performance issue #710

Closed
opened 2026-02-25 23:43:20 +03:00 by kerem · 16 comments
Owner

Originally created by @Athorcis on GitHub (Jul 5, 2024).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/1023

Hi, I'm using the self-hosted version of healthchecks (and this is great work).

I increased the ping log limit to 40,000. Now all checks with 40,000 pings are slow or won't load (I see harakiris in the container logs). The problematic requests are last_ping and the event lists.

Is it supposed to happen? Is there a way to fix it without decreasing the ping log limit?

kerem closed this issue 2026-02-25 23:43:20 +03:00

@cuu508 commented on GitHub (Jul 5, 2024):

Thanks for the report.

Can you post specific URLs that take long to load, and perhaps a screenshot from browser's developer tools with timings?

What database are you using, and what hardware are you running on?


@Athorcis commented on GitHub (Jul 10, 2024):

I use MySQL, and the server hardware is an Intel i7-7700K (4c/8t, 4.2 GHz/4.5 GHz), 32 GB RAM, 450 GB SSD.

I managed to reduce the duration of some requests by adding indexes to some tables:

  • I added a compound index on api_ping.owner_id and api_ping.created. It improved the performance of the /checks/{check_id}/last_ping/ route.
  • I added a compound index on api_ping.kind, api_ping.n, api_ping.created. It improved the performance of /checks/{check_id}/status/ and /checks/{check_id}/log_events/.

But I still see issues, specifically on /checks/{check_id}/log_events/?fail=on

The request gets canceled because it takes too long. If I resend it, I get a 502 (because of harakiris).
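[Screenshot: https://github.com/healthchecks/healthchecks/assets/4405110/45e35feb-16a3-4394-a267-f26174379963]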
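
As a rough illustration of the indexing approach described in this comment, the sketch below builds a stand-in api_ping table in an in-memory SQLite database, adds the two compound indexes, and asks the planner how it would run a lookup. The column list, the index names, and the "latest ping for one check" query are assumptions made for the example, not the actual Healthchecks schema or queries. The thread concerns MySQL, where the equivalent check would be done with EXPLAIN.

```python
import sqlite3

# Stand-in for the api_ping table; only the columns named in the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_ping ("
    "id INTEGER PRIMARY KEY, owner_id INTEGER, n INTEGER, "
    "kind TEXT, created TEXT)"
)

# The two compound indexes described above (index names are made up).
conn.execute("CREATE INDEX ping_owner_created ON api_ping (owner_id, created)")
conn.execute("CREATE INDEX ping_kind_n_created ON api_ping (kind, n, created)")


def query_plan(sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Hypothetical "latest ping for one check" query.
plan = query_plan(
    "SELECT id FROM api_ping WHERE owner_id = ? ORDER BY created DESC LIMIT 1",
    (1,),
)
print(plan)
```

If the printed plan names ping_owner_created, the lookup is served from the index rather than a full table scan. Running the same kind of check against the slow query is one way to see whether a new index is actually being used.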


@cuu508 commented on GitHub (Jul 10, 2024):

Thanks for the details.

I haven't managed to reproduce the issue so far. My setup:

  • Intel 9700K, 32GB RAM
  • Ubuntu 24.04
  • MariaDB 10.11.8, stock settings
  • Current git master version of Healthchecks
  • I'm using ./manage.py runserver to run the webserver
  • ping log limit is 40'000
  • I used a script to create 1000 checks, and ping them randomly.
  • There are ~350K pings in the database total, and I'm testing with a check which has ~40'000 stored pings
  • /checks/fa4d810e-940f-414a-8b4d-3ee62254a056/log_events/?fail=on&start=on&log=on&ign=on&flip=on takes ~600ms

Would you be able to install django-debug-toolbar and see in which queries the time is spent?


@cuu508 commented on GitHub (Jul 10, 2024):

For pinging, do you use HTTP POST with request body? If yes,

  • How big are the request bodies?
  • What is PING_BODY_LIMIT set to?
  • Do you have object storage configured (S3_* settings)?

@cuu508 commented on GitHub (Jul 10, 2024):

An unrelated thing I noticed though, when spamming lots of ping requests simultaneously, I fairly regularly see requests failing with:

File "[...]/healthchecks/lib/python3.12/site-packages/MySQLdb/connections.py", line 261, in query
_mysql.connection.query(self, query)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Exception Type: OperationalError at /ping/glhyOjZavRP-U5rnxA1ITA/check-282/1
Exception Value: (1213, 'Deadlock found when trying to get lock; try restarting transaction')

@cuu508 commented on GitHub (Jul 10, 2024):

Update –

  • I switched to making HTTP POST requests with 10KB bodies containing random data
  • I've got ~800K rows in the api_ping table by this point
  • The request times for log_events kept slowly creeping up – 700ms, 900ms, above one second
  • I gave MariaDB more memory (innodb_buffer_pool_size = 8G)
  • Request times dropped back to ~500ms

@Athorcis commented on GitHub (Jul 10, 2024):

> For pinging, do you use HTTP POST with request body? If yes,
>
> * How big are the request bodies?

It can vary from few 1ko to 50 Mo

> * What is PING_BODY_LIMIT set to?

100 Mo

> * Do you have object storage configured (S3_* settings)?

No


@Athorcis commented on GitHub (Jul 10, 2024):

> An unrelated thing I noticed though, when spamming lots of ping requests simultaneously, I fairly regularly see requests failing with:
>
> File "[...]/healthchecks/lib/python3.12/site-packages/MySQLdb/connections.py", line 261, in query
> _mysql.connection.query(self, query)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Exception Type: OperationalError at /ping/glhyOjZavRP-U5rnxA1ITA/check-282/1
> Exception Value: (1213, 'Deadlock found when trying to get lock; try restarting transaction')

Since switching to MySQL I have not had lock issues (they did happen before, with SQLite).


@Athorcis commented on GitHub (Jul 10, 2024):

> • I've got ~800K rows in the api_ping table by this point

My api_ping table currently contains about 5 million rows.


@Athorcis commented on GitHub (Jul 10, 2024):

> Would you be able to install django-debug-toolbar and see in which queries the time is spent?

I'll try as soon as I have some time.


@cuu508 commented on GitHub (Jul 10, 2024):

> It can vary from few 1ko to 50 Mo

50 Mo as in 50 megabytes?


@Athorcis commented on GitHub (Jul 11, 2024):

> 50 Mo as in 50 megabytes?

yes


@cuu508 commented on GitHub (Jul 11, 2024):

That explains it then.

> Is it supposed to happen? Is there a way to fix it without decreasing the ping log limit?

Yes – if you push the limits far enough, you will eventually run into performance problems.

Consider lowering PING_BODY_LIMIT to, say, 1MB.

And additionally consider offloading ping bodies to object storage, see https://healthchecks.io/docs/self_hosted_configuration/#S3_ACCESS_KEY


@cuu508 commented on GitHub (Jul 11, 2024):

I implemented an experimental performance optimization: when querying pings for display in the "Log" page, instead of loading entire stored ping bodies, load only the initial 150 bytes of each body (we're displaying only ~100 or so characters in the UI, so nothing should change visually for the user).

@Athorcis if you get a chance, please give this a try, and let me know if the log_events performance is better.
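
The idea behind this optimization can be sketched as follows, using an in-memory SQLite database so the example runs on its own. The table layout and the body column name are assumptions for the example, and this is not the code from the actual commit. It only shows the difference between fetching a whole stored body and asking the database for its first 150 bytes.

```python
import sqlite3

PREVIEW_BYTES = 150  # figure quoted in the comment above

conn = sqlite3.connect(":memory:")
# Hypothetical table layout; the real schema has more columns.
conn.execute("CREATE TABLE api_ping (id INTEGER PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO api_ping (body) VALUES (?)", (b"x" * 1_000_000,))

# Full load: the whole body travels from the database to the application.
(full,) = conn.execute("SELECT body FROM api_ping").fetchone()

# Preview load: the database truncates the body before returning it.
(preview,) = conn.execute(
    "SELECT substr(body, 1, ?) FROM api_ping", (PREVIEW_BYTES,)
).fetchone()

print(len(full), len(preview))
```

With bodies in the tens of megabytes, as reported earlier in the thread, the preview query transfers a tiny fraction of the data per row. Whether that helps a given request depends on where the time is actually spent.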


@Athorcis commented on GitHub (Jul 13, 2024):

@cuu508 I applied your commit as a patch, but I didn't see any performance improvement on the request generating the harakiris. On the other hand, I installed django-debug-toolbar (with some difficulty), identified which query was taking the time, and added another index (owner_id, n DESC, created). That reduced the query time enough that the request no longer fails with a 502. However, the request still takes too long (6 seconds) and gets aborted by the JavaScript.
Do you know if it would be possible to increase the timeout of AJAX requests (or at least make it configurable)?


@cuu508 commented on GitHub (Jul 14, 2024):

The request gets aborted when a new request is about to be run: https://github.com/healthchecks/healthchecks/blob/1877a8324f7c2f07dc241aa32e5ed7f9768b41e4/static/js/log.js#L40

The refresh runs every 3 seconds, and the interval is specified here: https://github.com/healthchecks/healthchecks/blob/1877a8324f7c2f07dc241aa32e5ed7f9768b41e4/static/js/adaptive-setinterval.js#L13

You could increase the refresh interval there, but the root problem is still the request taking excessively long.
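
To see why a 6-second request never finishes under a 3-second refresh, here is a small Python analogue of the abort-on-refresh behaviour described above. It is not the project's JavaScript: the poll function and its parameters are made up for the example, and the timings are scaled down 100x so it runs quickly.

```python
import asyncio


async def poll(request_seconds, refresh_seconds, ticks):
    """Start a request on every refresh, cancelling the one still in flight."""
    completed = 0
    in_flight = None

    async def request():
        nonlocal completed
        await asyncio.sleep(request_seconds)  # stands in for the HTTP call
        completed += 1

    for _ in range(ticks):
        if in_flight is not None and not in_flight.done():
            in_flight.cancel()  # analogous to aborting the previous request
        in_flight = asyncio.create_task(request())
        await asyncio.sleep(refresh_seconds)

    in_flight.cancel()  # stop whatever is still running at the end
    return completed


# Scaled down 100x from the thread: 6 s request vs. 3 s refresh.
slow = asyncio.run(poll(request_seconds=0.06, refresh_seconds=0.03, ticks=4))
# A request that fits comfortably inside the refresh interval.
fast = asyncio.run(poll(request_seconds=0.005, refresh_seconds=0.03, ticks=4))
print(slow, fast)
```

In the slow case every request is cancelled before it can finish, so nothing completes no matter how many refreshes run. Raising the interval would let requests through, but as noted above the request duration is still the underlying problem.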
