[GH-ISSUE #1296] New install, all agents overdue, uWSGI reports: "nats: encountered error" #2747
Originally created by @ZenDevMaster on GitHub (Sep 27, 2022).
Original GitHub issue: https://github.com/amidaware/tacticalrmm/issues/1296
Server Info (please complete the following information):
Installation Method:
Agent Info (please complete the following information):
What I did:
ufw

What happened:
At first the agents showed green (online), but they quickly fell off and now all report overdue. I've rebooted the TRMM server several times, as well as all PCs that have agents.
Mesh features appear to work, as I can remote control systems.
When I run `journalctl --identifier uwsgi --follow`, the following is periodically logged, especially after agents are restarted. A clue?

Finally, I ran `troubleshoot_server.sh` to verify; here's the output (scrubbed):

I note some errors in the output from `troubleshoot_server.sh`, but I wonder if these were logged before the services had fully come online after a reboot. I checked all services manually and can see that they are all running.

It really appears that everything is online. I'm at a loss about what to check next: why are all the agents offline on a brand-new install?
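For reference, one way to re-run those manual service checks after a reboot, assuming the service names of a standard TRMM install (they may differ on other setups):

```bash
# Follow the uwsgi log to catch the recurring "nats: encountered error"
journalctl --identifier uwsgi --follow

# Check that each TRMM-related unit is active
# (unit names assume a standard TRMM install)
for svc in rmm daphne celery celerybeat nats nats-api nginx meshcentral mongod postgresql; do
    echo -n "$svc: "; systemctl is-active "$svc"
done
```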
@dinger1986 commented on GitHub (Sep 27, 2022):
Try opening TCP port 4222 just in case (it shouldn't be needed). I'm just going into a meeting, so I haven't looked properly at the error.
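If it helps, the ufw commands for that test would be roughly the following, assuming ufw is the only firewall in play:

```bash
# Temporarily allow NATS traffic on 4222/tcp from outside
sudo ufw allow 4222/tcp

# ...and remove the rule again once the agents have checked in
sudo ufw delete allow 4222/tcp
```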
@ZenDevMaster commented on GitHub (Sep 27, 2022):
They all came online within a minute of opening 4222. So I closed it again to see if they'd all go back offline. Nope, they stayed green. Perhaps for a new installation there's a one-time check-in on port 4222 that reassigns it to 443?
@dinger1986 commented on GitHub (Sep 27, 2022):
I'm gonna let someone who knows more than me answer that, but I have seen it once or twice, and interestingly it's always on AWS that 4222 needs to be opened for communications.

Leave it shut for an hour or so and make sure it still communicates fine; please report back your findings.
@silversword411 commented on GitHub (Sep 27, 2022):
You said new install, but then you're talking about "lots of" agents. Is this a new server, or a restore of an old one?
You replaced your real domain with `mydomain.com`, right?

@wh1te909 commented on GitHub (Sep 27, 2022):
No, 4222 is the internal port used to communicate between Django (uwsgi), nats-api, and nats, which all run on the same machine on a standard install. They communicate via hostnames (`api.example.com:4222`), so my guess is that the hosts file didn't get set up correctly during install and requests between the TRMM services are being routed externally rather than internally, which explains why it only works when you open up 4222. All 3 subdomains should resolve internally to 127.0.0.1 or 127.0.1.1.
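A quick way to verify how those subdomains resolve from the TRMM server itself (`example.com` is a placeholder for your real domain):

```bash
# Each of these should print 127.0.0.1 or 127.0.1.1, not the public IP
getent hosts api.example.com
getent hosts rmm.example.com
getent hosts mesh.example.com
```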
@dinger1986 commented on GitHub (Sep 27, 2022):
Ah! That will be why then: AWS overwrites the hosts file so it doesn't resolve to localhost.
@ZenDevMaster, please run `cat /etc/hosts` and share the snippet. Feel free to redact your IP.
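For comparison, a healthy hosts file on a standard install should look roughly like this, per the comment above (the hostname and `example.com` are placeholders):

```bash
# /etc/hosts on the TRMM server: the three subdomains resolve locally
127.0.0.1   localhost
127.0.1.1   trmm-host api.example.com rmm.example.com mesh.example.com
```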
@ZenDevMaster commented on GitHub (Sep 28, 2022):
I loaded up 7 new agents across two external networks. All would reach the TRMM VPS via a public IP external to their networks. All of them are brand new to TRMM (as am I, which is why I ran out of troubleshooting ideas).
Yes
I built the associations to 127.0.0.1 myself during the troubleshooting process. It didn't appear to make any difference at the time. In my testing I could query NATS using the following:
But the following failed until I added 4222 to ufw (when run from the TRMM VPS):

`nc 52.221.xxx.xxx 4222`

So yes, perhaps the issue did have something to do with resolving to and routing over the public IP. I'd be happy to run the scenario again from the start using all-new VPS systems and see if I can replicate it on EC2, if that would be helpful.
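For anyone comparing, the two reachability checks side by side might look like this (assuming the openbsd variant of nc shipped with Debian; `example.com` and the IP are placeholders):

```bash
# From the TRMM server itself: should connect via the local hosts-file entry
nc -zv api.example.com 4222

# From a machine outside the VPS: should be refused or filtered once 4222
# is closed again, since the port is internal-only on a standard install
nc -zv 52.221.xxx.xxx 4222
```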
At the moment I think I'm up and running: the agents are online, and port 4222 has remained closed from the outside for about 12 hours since I briefly opened it. Still unsure how that could have fixed anything :|
@dinger1986 commented on GitHub (Sep 28, 2022):
So, as it says, change `/etc/cloud/cloud.cfg` and remove the value of `manage_etc_hosts`, then add in your localhost entries for 127.0.1.1 on one line after `ip-172-31-xxx-xxx`, and test again. Maybe a reboot of the server would be worthwhile, as that will create new connections.

Opening the port briefly will have allowed the connection to be made, and that connection will have stayed open even after the firewall was closed.
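A sketch of that fix, assuming a stock cloud-init layout; one common variant is to set `manage_etc_hosts` to false rather than delete it, and the domain and instance hostname here are placeholders:

```bash
# Stop cloud-init from rewriting /etc/hosts on every boot
sudo sed -i 's/^manage_etc_hosts:.*/manage_etc_hosts: false/' /etc/cloud/cloud.cfg

# Re-add local entries so the TRMM subdomains resolve internally
echo '127.0.1.1 ip-172-31-xxx-xxx api.example.com rmm.example.com mesh.example.com' | sudo tee -a /etc/hosts
```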
@ZenDevMaster commented on GitHub (Sep 30, 2022):
Coming back to update: port 4222 has been closed at the firewall for about 3 days.

Aside from briefly opening and then closing 4222, I can't point to anything that has permanently changed between now and the time I opened this issue. Except that now everything works.
Perhaps it was some transient issue unique to this Debian 11 AMI on Amazon.
@silversword411 commented on GitHub (Sep 30, 2022):
It was cloud-init:
https://docs.tacticalrmm.com/troubleshooting/#issues-with-agents-offline-after-reboot-or-new-install-on-aws-and-other-cloud-platforms-using-cloudinit-or-similar