[GH-ISSUE #1054] ERROR: Too many open files #735

Closed
opened 2026-02-25 23:43:24 +03:00 by kerem · 8 comments
Owner

Originally created by @bilalvadion on GitHub (Sep 2, 2024).
Original GitHub issue: https://github.com/healthchecks/healthchecks/issues/1054

Healthcheck ping returned a 500 internal server error to my client app. Upon investigation, I found the following error:

`OSError: [Errno 24] Too many open files`

Please find a syslog subset in the log file attached below:
[too-many-open-files.log](https://github.com/user-attachments/files/16834389/too-many-open-files.log)

I have tried increasing the ulimit, but the issue still persists.
Reference: https://stackoverflow.com/questions/39537731/errno-24-too-many-open-files-but-i-am-not-opening-files

Current ulimit is 65535.

Environment:
Ubuntu 20.04.6 LTS
Python 3.8

kerem closed this issue 2026-02-25 23:43:24 +03:00

@cuu508 commented on GitHub (Sep 2, 2024):

Can you please check if the limit is in fact 65535:

```shell
cat /proc/<pid>/limits | grep files
```

Also, please check what files are open:

```shell
lsof -p <PID>
```
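As an aside (not from the thread): when `lsof` is unavailable, the same count can be read straight from procfs. A minimal sketch, using the shell's own PID as a stand-in:

```shell
# Count open file descriptors via /proc instead of lsof.
# $$ is this shell's own PID; substitute the healthchecks process PID instead.
ls /proc/$$/fd | wc -l
```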

@bilalvadion commented on GitHub (Sep 2, 2024):

@cuu508 here are the command results:

```shell
lsof -p 3887889 | wc -l
78
```

```shell
cat /proc/<PID>/limits | grep files
Max open files            1024                 524288               files
```

```shell
ulimit -Sn
65535
ulimit -Hn
65535
ulimit
unlimited
ulimit -n
65535
```

@cuu508 commented on GitHub (Sep 2, 2024):

`/proc/<PID>/limits` shows the soft limit is 1024 and the hard limit is 524288. Neither is 65535, so something's not right with your ulimit usage.

Can you reproduce the "Too many open files" error? If yes, please run `lsof -p <pid>` and send me the full output (you can send it privately to contact at healthchecks io). I'd like to figure out how the file handles are being spent.


@bilalvadion commented on GitHub (Sep 4, 2024):

I increased the soft limit using the steps below. First I edited the systemd defaults:

```shell
sudo nano /etc/systemd/system.conf
```

changing

```
#DefaultLimitNOFILE=1024:524288
```

to

```
DefaultLimitNOFILE=65535:524288
```

then:

```shell
sudo systemctl daemon-reload                       # reload systemd config
systemctl restart healthchecks-runserver.service   # restart the service
ps aux | grep python                               # find the process ID
cat /proc/<PID>/limits | grep files                # verify the limit went from 1024 to 65535
```
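As a side note (not from the thread): instead of raising the system-wide default in `/etc/systemd/system.conf`, the limit can also be raised for just this one unit with a drop-in override, for example:

```ini
# /etc/systemd/system/healthchecks-runserver.service.d/override.conf
# (create with: sudo systemctl edit healthchecks-runserver.service)
[Service]
LimitNOFILE=65535
```

followed by `sudo systemctl daemon-reload` and a service restart. This avoids changing the limit for every other service on the host.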

But this had an unexpected side effect on the healthchecks service; the error below keeps repeating for every ping request:

```
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]: ----------------------------------------
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]: Exception happened during processing of request from ('xxx.xx.xx.xxx', 56782)
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]: Traceback (most recent call last):
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:   File "/usr/lib/python3.8/socketserver.py", line 316, in _handle_request_noblock
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:     self.process_request(request, client_address)
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:   File "/usr/lib/python3.8/socketserver.py", line 697, in process_request
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:     t.start()
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:   File "/usr/lib/python3.8/threading.py", line 852, in start
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]:     _start_new_thread(self._bootstrap, ())
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]: RuntimeError: can't start new thread
Sep 4 00:00:00 ip-172-31-6-192 python3.8[3902760]: ----------------------------------------
```
Some other logs at the time of exception:

```shell
ps aux | grep python
intelli+ 3902758  0.0  0.2  61072  5244 ?        Ss   Sep03   0:00 /home/intelliagent/healthchecks/hc-venv/bin/python3.8 manage.py runserver 0.0.0.0:8000
intelli+ 3902760  1.2  8.5 19154060 163380 ?     Sl   Sep03  17:03 /home/intelliagent/healthchecks/hc-venv/bin/python3.8 manage.py runserver 0.0.0.0:8000
```

```shell
cat /proc/3902758/limits | grep files
Max open files            65535                524288               files
cat /proc/3902760/limits | grep files
Max open files            65535                524288               files
```

```shell
lsof -p 3902758 | wc -l
78
lsof -p 3902760 | wc -l
2285
```

@cuu508 commented on GitHub (Sep 4, 2024):

I haven't run into the `RuntimeError: can't start new thread` error during request handling before, and don't know what might be causing it.

A few observations:

  • `manage.py runserver` is a development server. For production use, consider switching to uwsgi or gunicorn.
  • `lsof -p 3902760 | wc -l` returning 2285 seems high. On my dev system, with the development server running, there are only ~100 file handles open. Can I see the list of open files (same command, without `wc -l`)?
  • Seeing as you are using Python 3.8, I assume you are also running an old version of Healthchecks. Note that Python 3.8 reaches end-of-life this October. By using old versions you are missing out on security, performance, and functionality improvements in Healthchecks, Django, and Python.
  • There are two `python manage.py runserver` processes started at different times. This looks suspicious to me. Do you perhaps have an old process holding on to the port?
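For reference, a minimal sketch of what switching to gunicorn could look like. This is an assumption-laden illustration, not from the thread: it reuses the venv path seen in this thread and assumes the project's WSGI module is `hc.wsgi`; verify against the Healthchecks deployment docs before relying on it.

```shell
cd /home/intelliagent/healthchecks
hc-venv/bin/pip install gunicorn
# Serve the Django app via WSGI instead of the development server.
hc-venv/bin/gunicorn hc.wsgi --bind 0.0.0.0:8000 --workers 2
```

Unlike `runserver`, a production WSGI server manages its worker pool and closes idle connections, so file descriptors are not leaked per request.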

@bilalvadion commented on GitHub (Sep 4, 2024):

@cuu508 I have emailed you the lsof output on contact at healthchecks io.

We are using systemctl to manage our service. Given below is the configuration:

```ini
[Unit]
Description=Healthchecks Run Server

[Service]
WorkingDirectory = /home/intelliagent/healthchecks
ExecStart=/home/intelliagent/healthchecks/hc-venv/bin/python3.8 manage.py runserver 0.0.0.0:8080
SuccessExitStatus=143
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
User=bilal
Group=bilal

[Install]
WantedBy=multi-user.target
```

I am using Healthchecks 2.10 and would be happy to migrate to the latest version; is there a migration guide for that? We don't want to re-enter all the projects and integrations, and ideally the ping URLs should not change for our production clients.


@cuu508 commented on GitHub (Sep 4, 2024):

I received the full lsof logs, thanks!

They show ~2000 open TCP connections to port 8080. Normally a client closes the connection once it is done with the request, and if it doesn't, the server should close it after some timeout. I guess `manage.py runserver` doesn't do that.
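To make the failure mode concrete, here is an illustration (not from the thread) of how accumulating descriptors eventually produces Errno 24. It lowers the soft limit in a throwaway subshell and opens descriptors until the kernel refuses; the parent shell is unaffected:

```shell
# Reproduce "Too many open files" safely in a subshell.
(
  ulimit -n 16                    # soft cap: fd numbers must stay below 16
  for i in $(seq 3 32); do        # fds 0-2 are stdin/stdout/stderr
    eval "exec $i</dev/null" 2>/dev/null \
      || { echo "hit the limit at fd $i"; break; }
  done
)
```

Each leaked connection in `runserver` consumes one descriptor the same way, until new sockets (or any open) fail with `[Errno 24]`.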

Regarding upgrade, there's no full guide, but a minimal version would be:

  • make a backup of the database (the SQLite file in your case). All user state (projects, checks, integrations, ping URLs etc.) is stored in the database. As long as you keep it safe, you will not lose it. Worst case, you can roll back to your current version and use the backed-up database.
  • switch the current 2.10 codebase to whichever release you would like to use
  • run `manage.py migrate` to apply database migrations
  • run `manage.py collectstatic` and `manage.py compress`

You can jump multiple releases; you do not need to upgrade one release at a time.

If you are also upgrading the Python version, you will need to recreate the virtualenv. A virtualenv created with Python 3.8 will not work with Python 3.9+.
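The steps above might look roughly like this in the shell. A hedged sketch: the working directory and venv path are taken from this thread, but the database filename and the `<release-tag>` placeholder are examples, not verified specifics.

```shell
cd /home/intelliagent/healthchecks
cp hc.sqlite hc.sqlite.bak                       # 1. back up the SQLite database
git fetch --tags
git checkout <release-tag>                       # 2. switch to the chosen release
hc-venv/bin/pip install -r requirements.txt      #    update dependencies
hc-venv/bin/python manage.py migrate             # 3. apply database migrations
hc-venv/bin/python manage.py collectstatic --noinput
hc-venv/bin/python manage.py compress            # 4. rebuild static assets
```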


@bilalvadion commented on GitHub (Sep 5, 2024):

@cuu508 running on uWSGI fixed the issue; possibly a Django dev server limitation. I am making a PR adding minimal uWSGI setup steps to the Production section.
