mirror of
https://github.com/amidaware/rmmagent.git
synced 2026-04-26 06:45:48 +03:00
[GH-ISSUE #29] Resiliency against startup issues #97
Originally created by @NiceGuyIT on GitHub (Jan 16, 2023).
Original GitHub issue: https://github.com/amidaware/rmmagent/issues/29
One server was offline and, after researching the cause, I discovered an event log entry stating "A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.". It would be nice if the service (on all OSes) was configured to stay running as best it can. For connectivity issues, retry logic is preferable to exiting after an initial failure to connect. If there's a domain configured, doing a fresh DNS lookup (can the agent clear the DNS cache?) and pinging the API until it's able to connect would be nice. If there's no domain configured, or if the agent configuration is corrupt, of course generate a friendly error message and exit.
Note: It's possible this could happen if the agent was restarted (computer rebooted) while the server was being updated and the API unavailable.
@silversword411 commented on GitHub (Jan 16, 2023):
That looks like a Windows service error. Windows tried to start the RMM service and, when it checked whether it was running after 30000 milliseconds, it wasn't. Are you sure AV didn't kill it?
Check agent.log to see if there's an error in there.
@NiceGuyIT commented on GitHub (Jan 17, 2023):
I started TacticalAgent from Mesh and this server does not have Bitdefender. I did not find any other event logs relating to the service and there was nothing in the agent.log.
It seems this message is coming from Windows when trying to start the service. This kb article explains how the registry key ServicesPipeTimeout can be increased from the default 30 seconds to 60 seconds. I view that as a workaround, not a solution.
The answers to this StackOverflow question have many other scenarios where this error may occur, including scenarios that do not relate to a timeout.
While some of this may be due to my environment, there are things the TacticalAgent can do. One is to add dependencies to the service so the service control manager starts it a little later in the boot process. I.e. after networking is available. Another option is to add a scheduled task to start the service if it's not running.
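The "start the service if it's not running" idea can be sketched as a small check that a scheduled task might run periodically. This is only an assumption about how such a watchdog could be structured: `serviceController`, `ensureRunning`, and `fakeController` are hypothetical names, and the in-memory fake stands in for a real query against the OS service manager.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceController is a hypothetical abstraction over the OS service
// manager, so the sketch stays portable and testable.
type serviceController interface {
	IsRunning(name string) (bool, error)
	Start(name string) error
}

// ensureRunning starts the named service only when it is not already
// running. It reports whether a start request was issued.
func ensureRunning(sc serviceController, name string) (bool, error) {
	running, err := sc.IsRunning(name)
	if err != nil {
		return false, fmt.Errorf("query %s: %w", name, err)
	}
	if running {
		return false, nil
	}
	if err := sc.Start(name); err != nil {
		return false, fmt.Errorf("start %s: %w", name, err)
	}
	return true, nil
}

// fakeController is an in-memory stand-in used to exercise the sketch.
type fakeController struct {
	running map[string]bool
}

func (f *fakeController) IsRunning(name string) (bool, error) {
	r, ok := f.running[name]
	if !ok {
		return false, errors.New("service not installed")
	}
	return r, nil
}

func (f *fakeController) Start(name string) error {
	f.running[name] = true
	return nil
}

func main() {
	sc := &fakeController{running: map[string]bool{"tacticalrmm": false}}

	started, err := ensureRunning(sc, "tacticalrmm")
	fmt.Println(started, err) // first pass issues a start

	started, err = ensureRunning(sc, "tacticalrmm")
	fmt.Println(started, err) // second pass finds it already running
}
```

Run as written, this prints `true <nil>` followed by `false <nil>`. Because the check does nothing when the service is already running, it is safe to run on a short interval.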
@NiceGuyIT commented on GitHub (Jan 17, 2023):
Relevant Grafana issue 2060:
There's a PR linked in that issue that may be of use.
@NiceGuyIT commented on GitHub (Jan 17, 2023):
Here's the relevant Go issue to fix the runtime: Windows service timeout during system startup.
@silversword411 commented on GitHub (Jan 18, 2023):
...when I start on a computer, it's usually 1-2 seconds for the service to start and show as running by windows checks.
How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?
I'm thinking that under normal conditions the TRMM agent goes from start request to running in less than 5 seconds. Are you sure there aren't other extenuating circumstances in your test making it take longer than 30 seconds? Could nebula network delays be causing TRMM to stutter?
@NiceGuyIT commented on GitHub (Jan 18, 2023):
This happens only with high CPU usage. This thread specifically talks about this error happening only after rebooting to apply patches. This does not happen all the time. This is the first time I encountered this scenario while running TRMM for more than a year.
You can't measure this externally. The above thread mentioned they added an event log as the first action in main() but that was not getting called. This is because the timeout is in the actual startup of the program as explained in the Grafana issue.
Under normal circumstances, that's true.
Installing patches after a reboot could trigger this, but not all the time.
The link in the Grafana issue (here) explains how they are able to cause this to happen by limiting the CPU to 1/4 of a CPU in Hyper-V.
This issue is to address the Go runtime initialization slowness under high CPU load, as well as identify options that can be applied to alleviate the scenario.