[GH-ISSUE #1296] New install, all agents overdue, uWSGI reports: "nats: encountered error" #2747

Closed
opened 2026-03-14 05:20:38 +03:00 by kerem · 10 comments

Originally created by @ZenDevMaster on GitHub (Sep 27, 2022).
Original GitHub issue: https://github.com/amidaware/tacticalrmm/issues/1296

Server Info (please complete the following information):

  • OS: Debian 11
  • Browser: Both Firefox / Chrome
  • RMM Version (as shown in top left of web UI): v0.15.0

Installation Method:

  • Standard

Agent Info (please complete the following information):

  • Agent version (as shown in the 'Summary' tab of the agent from web UI): Version:2.4.0
  • Agent OS: Variety - Win11, Win7

What I did:

  1. Launched a clean Debian 11 VPS on Amazon EC2.
  2. Ran apt update/upgrade on everything.
  3. Configured DNS records: api/mesh/rmm @ mydomain.com. Resolving and ping-able.
  4. Opened inbound ICMP and TCP ports 22, 80, and 443 via Amazon's EC2 firewall and ufw.
  5. Installed with traditional Option 1 per the instructions here: https://docs.tacticalrmm.com/install_server/#option-1-easy-install-on-a-vps
  6. Installed the agents via Agent -> Install Agent -> Dynamically generated exe, copied it (using another RMM), and installed. Service starts normally.

What happened:

At first the agents showed green (online), but they quickly fell offline and are now all reporting overdue. I've rebooted the TRMM server several times, as well as all PCs that have agents.

Mesh features appear to work, as I can remote-control systems.

When I run journalctl --identifier uwsgi --follow, the following is logged periodically, especially after agents restart. A clue?:

Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: nats: encountered error
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: Traceback (most recent call last):
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/streams.py", line 47, in open_connection
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     transport, _ = await loop.create_connection(
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1049, in create_connection
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     sock = await self._connect_sock(
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/base_events.py", line 960, in _connect_sock
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     await self.sock_connect(sock, address)
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 500, in sock_connect
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     return await fut
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: asyncio.exceptions.CancelledError
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: During handling of the above exception, another exception occurred:
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: Traceback (most recent call last):
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     return fut.result()
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: asyncio.exceptions.CancelledError
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: The above exception was the direct cause of the following exception:
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: Traceback (most recent call last):
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/rmm/api/env/lib/python3.10/site-packages/nats/aio/client.py", line 1202, in _select_next_server
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     r, w = await asyncio.wait_for(
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:   File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]:     raise exceptions.TimeoutError() from exc
Sep 27 07:20:36 ip-xxx-xx-xx-xxx uwsgi[4616]: asyncio.exceptions.TimeoutError
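
This timeout is the Python nats client inside uwsgi giving up on its connection to the NATS server. A quick way to reproduce what that client is doing, run from the TRMM server itself (a sketch; substitute your real API domain for api.mydomain.com):

# On a standard install, the API hostname should resolve to a
# loopback address when queried from the server itself:
getent hosts api.mydomain.com

# NATS should then answer on 4222 via that same hostname; a healthy
# server prints an INFO {...} banner within a few seconds:
nc -w 5 api.mydomain.com 4222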

Finally, I ran troubleshoot_server.sh to verify; here's the output (scrubbed):

Verified api.mydomain.com
Verified rmm.mydomain.com
Verified mesh.mydomain.com
Checking IPs
Locally Resolved:  Remotely Resolved: 52.221.xxx.xxx
Your Local and Remote IP for api.mydomain.com all agents will require non-public DNS to find TRMM server
Locally Resolved:  Remotely Resolved: 52.221.xxx.xxx
echo Your Local and Remote IP for rmm.mydomain.com all agents will require non-public DNS to find TRMM server
Locally Resolved:  Remotely Resolved: 52.221.xxx.xxx
Your Local and Remote IP for mesh.mydomain.com all agents will require non-public DNS to find TRMM server
Checking Services
Success RMM Service is Running
Success daphne Service is Running
Success celery Service is Running
Success celerybeat Service is Running
Success nginx Service is Running
Success nats Service is running
Success nats-api Service is running
Success meshcentral Service is running
Success mongod Service is running
Success postgresql Service is running
Success redis-server Service is running
Checking Open Ports
WAN IP is 52.221.xxx.xxx
HTTPs Port is open
Checking For Proxy
No Proxy detected using Certificate
No Proxy detected using IP
Checking SSL
Certificate is up to date
SSL Certificate for mydomain.com is fine
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Found the following certs:
  Certificate Name: mydomain.com
    Serial Number: 43191239a8fd60f10d1b7d075814efc2226
    Key Type: RSA
    Domains: *.mydomain.com
    Expiry Date: 2022-12-25 03:31:59+00:00 (VALID: 88 days)
    Certificate Path: /etc/letsencrypt/live/mydomain.com/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/mydomain.com/privkey.pem
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Getting summary output of logs

  File "/rmm/api/env/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/rmm/api/env/lib/python3.10/site-packages/django/db/backends/postgresql/base.py", line 215, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/rmm/api/env/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: connection to server at "localhost" (127.0.0.1), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

2022/09/26 13:39:03 [crit] 582#582: *52 connect() to unix:/rmm/daphne.sock failed (2: No such file or directory) while connecting to upstream, client: 171.96.xxx.xxx, server: api.mydomain.com, request: "GET /ws/dashinfo/?access_token=xxx HTTP/1.1", upstream: "http://unix:/rmm/daphne.sock:/ws/dashinfo/?access_token=xxx", host: "api.mydomain.com"
2022/09/27 02:46:58 [crit] 15485#15485: *11251 connect() to unix:////rmm/api/tacticalrmm/tacticalrmm.sock failed (2: No such file or directory) while connecting to upstream, client: 171.96.xxx.xxx, server: api.mydomain.com, request: "PATCH /alerts/ HTTP/1.1", upstream: "uwsgi://unix:////rmm/api/tacticalrmm/tacticalrmm.sock:", host: "api.mydomain.com", referrer: "https://rmm.mydomain.com/"
2022/09/27 04:14:14 [error] 18330#18330: *1543 directory index of "/rmm/api/tacticalrmm/static/" is forbidden, client: 58.136.xxx.xxx, server: api.mydomain.com, request: "HEAD /static/ HTTP/1.1", host: "api.mydomain.com"
2022/09/27 04:23:15 [crit] 584#584: *1 connect() to unix:////rmm/api/tacticalrmm/tacticalrmm.sock failed (2: No such file or directory) while connecting to upstream, client: 171.97.xxx.xxx, server: api.mydomain.com, request: "GET /api/v3/iBtPXPfCxhllGxtGdSaYvDgfdITFGJhlEdjfwWdU/checkinterval/ HTTP/1.1", upstream: "uwsgi://unix:////rmm/api/tacticalrmm/tacticalrmm.sock:", host: "api.mydomain.com"
2022/09/27 04:23:15 [crit] 584#584: *4 connect() to unix:/rmm/daphne.sock failed (2: No such file or directory) while connecting to upstream, client: 58.136.xxx.xxx, server: api.mydomain.com, request: "GET /ws/dashinfo/?access_token=xxx HTTP/1.1", upstream: "http://unix:/rmm/daphne.sock:/ws/dashinfo/?access_token=xxx", host: "api.mydomain.com"
2022/09/27 04:23:16 [crit] 584#584: *17 connect() to unix:////rmm/api/tacticalrmm/tacticalrmm.sock failed (2: No such file or directory) while connecting to upstream, client: 171.97.xxx.xxx, server: api.mydomain.com, request: "GET /api/v3/VyQimlePqoOkhjGyXRohvRnyqWbDmzCyYcqIYIMz/checkinterval/ HTTP/1.1", upstream: "uwsgi://unix:////rmm/api/tacticalrmm/tacticalrmm.sock:", host: "api.mydomain.com"
2022/09/27 04:23:17 [crit] 584#584: *23 connect() to unix:////rmm/api/tacticalrmm/tacticalrmm.sock failed (2: No such file or directory) while connecting to upstream, client: 171.97.xxx.xxx, server: api.mydomain.com, request: "GET /api/v3/MSobQnanhMEPtpIWbwTmscRxvshNPVBKYytfGYuT/checkinterval/ HTTP/1.1", upstream: "uwsgi://unix:////rmm/api/tacticalrmm/tacticalrmm.sock:", host: "api.mydomain.com"
2022/09/27 04:23:17 [crit] 584#584: *25 connect() to unix:////rmm/api/tacticalrmm/tacticalrmm.sock failed (2: No such file or directory) while connecting to upstream, client: 171.97.xxx.xxx, server: api.mydomain.com, request: "GET /api/v3/SwIwYETiBcpmcMJLvdtilqIBFTJbWheamGDaNQtW/checkinterval/ HTTP/1.1", upstream: "uwsgi://unix:////rmm/api/tacticalrmm/tacticalrmm.sock:", host: "api.mydomain.com"
2022/09/27 04:23:18 [crit] 584#584: *29 connect() to unix:/rmm/daphne.sock failed (2: No such file or directory) while connecting to upstream, client: 58.136.xxx.xxx, server: api.mydomain.com, request: "GET /ws/dashinfo/?access_token=xxx HTTP/1.1", upstream: "http://unix:/rmm/daphne.sock:/ws/dashinfo/?access_token=xxx", host: "api.mydomain.com"
2022/09/27 04:23:22 [crit] 584#584: *53 connect() to unix:/rmm/daphne.sock failed (2: No such file or directory) while connecting to upstream, client: 58.136.xxx.xxx, server: api.mydomain.com, request: "GET /ws/dashinfo/?access_token=xxx HTTP/1.1", upstream: "http://unix:/rmm/daphne.sock:/ws/dashinfo/?access_token=xxx", host: "api.mydomain.com"

I note some errors in the output from troubleshoot_server.sh but wonder if these were logged before the services had fully come online after a reboot. I checked all services manually and can see that they are all running.
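
One way to rule out stale entries would be to check whether the unix sockets from those nginx errors exist right now (a sketch, using the paths from the log lines above and the service names from the troubleshooter output):

# Both sockets should be present while uwsgi (rmm) and daphne are up;
# "No such file or directory" would mean they are not:
ls -l /rmm/api/tacticalrmm/tacticalrmm.sock /rmm/daphne.sock

# And query the owning services directly:
sudo systemctl status rmm daphne --no-pager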

Lastly,

sudo netstat -tulpn | grep LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      595/sshd: /usr/sbin
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN      600/postgres
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      582/nginx: master p
tcp        0      0 127.0.0.1:27017         0.0.0.0:*               LISTEN      568/mongod
tcp        0      0 0.0.0.0:5355            0.0.0.0:*               LISTEN      7587/systemd-resolv
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      580/redis-server 12
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      582/nginx: master p
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      7587/systemd-resolv
tcp6       0      0 :::22                   :::*                    LISTEN      595/sshd: /usr/sbin
tcp6       0      0 :::443                  :::*                    LISTEN      582/nginx: master p
tcp6       0      0 :::4222                 :::*                    LISTEN      2731/nats-server
tcp6       0      0 :::1024                 :::*                    LISTEN      638/node
tcp6       0      0 :::5355                 :::*                    LISTEN      7587/systemd-resolv
tcp6       0      0 ::1:6379                :::*                    LISTEN      580/redis-server 12
tcp6       0      0 :::4430                 :::*                    LISTEN      638/node
tcp6       0      0 :::80                   :::*                    LISTEN      582/nginx: master p
tcp6       0      0 :::4433                 :::*                    LISTEN      638/node
tcp6       0      0 :::9235                 :::*                    LISTEN      2731/nats-server

It really appears that everything is online. I'm at a loss about what to check next - why are all the agents offline on a brand new install?

kerem closed this issue 2026-03-14 05:20:44 +03:00

@dinger1986 commented on GitHub (Sep 27, 2022):

Try opening TCP port 4222 just in case (it shouldn't be needed). I'm just going into a meeting, so I haven't looked properly at the error yet.


@ZenDevMaster commented on GitHub (Sep 27, 2022):

> Try opening TCP port 4222 just in case (it shouldn't be needed). I'm just going into a meeting, so I haven't looked properly at the error yet.

They all came online within a minute of opening 4222. So I closed it again to see if they'd all go back offline. Nope - stayed green. Perhaps for a new installation there's a one-time check-in on port 4222 that reassigns it to 443?


@dinger1986 commented on GitHub (Sep 27, 2022):

I'm gonna let someone who knows more than me answer that, but I've seen this once or twice, and interestingly it's always on AWS that port 4222 needs to be opened for communications.

Leave it shut for an hour or so and make sure everything still communicates fine; please report back your findings.


@silversword411 commented on GitHub (Sep 27, 2022):

You said new install, but then you're talking about "lots of" agents. Is this a new server, or a restore of an old one?

You replaced your real domain with mydomain.com, right?


@wh1te909 commented on GitHub (Sep 27, 2022):

> They all came online within a minute of opening 4222. So I closed it again to see if they'd all go back offline. Nope - stayed green. Perhaps for a new installation there's a one-time check-in on port 4222 that reassigns it to 443?

No, 4222 is the internal port used to communicate between Django (uwsgi), nats-api, and nats, which all run on the same machine in a standard install. They communicate via hostnames (api.example.com:4222), so my guess is that the hosts file didn't get set up correctly during install and requests between the TRMM services are being routed externally rather than internally, which would explain why it only works when you open up 4222. Internally, all 3 subdomains should resolve to 127.0.0.1 or 127.0.1.1.
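
A quick way to check that on the server (a sketch; substitute your real domain for example.com):

# Each TRMM subdomain should resolve to a loopback address when
# queried from the server itself; a public IP here means the
# inter-service traffic is leaving the box:
for h in api.example.com rmm.example.com mesh.example.com; do
    echo -n "$h -> "; getent hosts "$h" | awk '{print $1}'
done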


@dinger1986 commented on GitHub (Sep 27, 2022):

Ah! That will be why, then. AWS overwrites the hosts file, so it doesn't resolve to localhost.

@ZenDevMaster please do cat /etc/hosts and share the snip. Feel free to redact your IP.


@ZenDevMaster commented on GitHub (Sep 28, 2022):

> You said new install, but then you're talking about "lots of" agents. Is this a new server, or a restore of an old one?

I loaded up 7 new agents across two external networks. All of them reached the TRMM VPS via a public IP external to their networks. All of them are brand new to TRMM (as am I, which is why I ran out of troubleshooting ideas).

> You replaced your real domain with mydomain.com, right?

Yes

cat /etc/hosts

# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.debian.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
#     /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 ip-172-31-xxx-xxx.ap-southeast-1.compute.internal ip-172-31-xxx-xxx
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

127.0.0.1 api.mydomain.com
127.0.0.1 mesh.mydomain.com
127.0.0.1 rmm.mydomain.com

I added the associations to 127.0.0.1 myself during the troubleshooting process. It didn't appear to make any difference at the time. In my testing I could query NATS using the following:

nc localhost 4222
nc 127.0.0.1 4222

INFO {"server_id":"xyzabc","server_name":"xyzabc","version":"2.9.1","proto":1,"git_commit":"2363a2c","go":"go1.19.1","host":"0.0.0.0","port":4222,"headers":true,"auth_required":true,"tls_required":true,"max_payload":67108864,"client_id":85,"client_ip":"127.0.0.1"}

But the following failed until I allowed 4222 in ufw (when run from the TRMM VPS):

nc 52.221.xxx.xxx 4222

So yes, perhaps the issue did have something to do with resolving to and routing over the public IP. I'd be happy to run the scenario again from the start using all-new VPS systems and see if I can replicate it on EC2... if that would be helpful.

At the moment I think I'm up and running - the agents are online, and port 4222 from outside has remained closed for about 12 hours after I'd opened it briefly. Still unsure how that could have fixed anything :|
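
One way to see where the NATS traffic is actually going (a sketch; run on the VPS while uwsgi is up):

# List established connections on port 4222; peers of 127.0.0.1/::1
# mean the traffic is staying on loopback, while the public 52.x
# address would confirm it is routing out and back in:
sudo ss -tnp | grep ':4222'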


@dinger1986 commented on GitHub (Sep 28, 2022):

So, as it says: change /etc/cloud/cloud.cfg and remove the value of manage_etc_hosts, then add your localhost entries back in (one line after the 127.0.1.1 ip-172-31-xxx-xxx entry) and test. A reboot of the server would probably be worthwhile too, as that will create fresh connections.

Opening the port briefly will have allowed the connection to be established, and because the firewall tracks connection state, that established connection stays up even after the port is closed again.
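
A sketch of that fix, assuming the Debian cloud image sets manage_etc_hosts: true in /etc/cloud/cloud.cfg (per the header of the /etc/hosts file above; replace mydomain.com with your real domain):

# Stop cloud-init from regenerating /etc/hosts on every boot:
sudo sed -i 's/^manage_etc_hosts:.*/manage_etc_hosts: false/' /etc/cloud/cloud.cfg

# Pin the TRMM subdomains to loopback so the entries survive a reboot:
echo '127.0.0.1 api.mydomain.com rmm.mydomain.com mesh.mydomain.com' | sudo tee -a /etc/hosts

# Reboot and confirm the agents stay online:
sudo reboot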


@ZenDevMaster commented on GitHub (Sep 30, 2022):

Coming back to update. Port 4222 has been closed at the firewall for about 3 days.

  • api/mesh/rmm.mydomain.com associations to 127.0.0.1 are now static in /etc/hosts, and I've rebooted the VPS several times.
  • New agents are showing up in the TRMM panel as one would expect.

Aside from briefly opening and then closing 4222, I can't point to anything that permanently changed between when I opened this issue and now. Except that now everything works.

Perhaps it was some transient issue unique to this Debian 11 AMI on Amazon.

@silversword411 commented on GitHub (Sep 30, 2022):

It was cloudinit: https://docs.tacticalrmm.com/troubleshooting/#issues-with-agents-offline-after-reboot-or-new-install-on-aws-and-other-cloud-platforms-using-cloudinit-or-similar