mirror of
https://github.com/abh/ntppool.git
synced 2026-04-25 19:45:50 +03:00
[GH-ISSUE #269] Suggestion: Tighten Rules for Inaccurate or Unstable NTP Servers #651
Originally created by @ghost on GitHub (Sep 10, 2025).
Original GitHub issue: https://github.com/abh/ntppool/issues/269
I have been looking over the NTP Pool status page and I noticed that the peak is reported as 145,260 DNS queries per second. Assuming each of these queries returns four NTP server IPs, and that each of those IPs is then queried for the time, I tried to estimate the traffic involved.
Using 114 bytes per query as a reasonable estimate (I am aware that the actual NTP request and response will be smaller than this), 145,260 DNS queries per second returning four IPs works out at about 0.53 Gbps, or roughly 530 Mbps.
That does not seem like a large amount of traffic for a global free NTP service. Am I misunderstanding the calculation somewhere?
With 3,601 IPv4 NTP servers in the pool running at the lowest bandwidth setting of 512 kbps per server, the total capacity is about 1.8 Gbps, or roughly 3.5 times the total traffic demand. For the 2,084 IPv6 servers, the total capacity is about 1.1 Gbps, or a little over twice the total traffic demand.
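The estimate above can be reproduced with a few lines of arithmetic. This is only a sanity check under the stated assumptions (114 bytes per NTP packet including headers, four IPs per DNS response, each returned IP queried once), not a measurement:

```python
# Demand side: peak DNS query rate times IPs per response times packet size.
DNS_QPS = 145_260            # peak DNS queries per second (from the status page)
IPS_PER_RESPONSE = 4
BYTES_PER_PACKET = 114       # generous estimate; a bare NTPv4 packet is 48 bytes plus UDP/IP headers

demand_bps = DNS_QPS * IPS_PER_RESPONSE * BYTES_PER_PACKET * 8
print(f"estimated demand: {demand_bps / 1e9:.2f} Gbps")  # ~0.53 Gbps

# Capacity side: every server at the lowest netspeed setting of 512 kbps.
ipv4_capacity_bps = 3_601 * 512_000
ipv6_capacity_bps = 2_084 * 512_000
print(f"IPv4 capacity: {ipv4_capacity_bps / 1e9:.2f} Gbps, "
      f"~{ipv4_capacity_bps / demand_bps:.1f}x demand")   # ~1.84 Gbps, ~3.5x
print(f"IPv6 capacity: {ipv6_capacity_bps / 1e9:.2f} Gbps, "
      f"~{ipv6_capacity_bps / demand_bps:.1f}x demand")   # ~1.07 Gbps, ~2.0x
```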
This suggests that the pool is not short of volunteers. My initial thought was that the wide allowance for round trip time and offset was because there were not enough participants, but it seems this is not the case.
From my own checks, I have found around 10 percent of servers in the UK pool fall into one of the following categories:
- circular peering, where the client references servers already in the pool, which in turn reference other servers in the pool
- high round-trip times (over 250 ms)
- accuracy outside of 100 ms
- asymmetric routing, where incoming and outgoing packets take different paths and show different timings
I had assumed the NTP Pool would filter out such servers, but from my tests I am still quite often receiving them as part of the DNS responses.
I would like to suggest that the allowance rules are reviewed and perhaps tightened. Since the pool currently seems to have far more capacity than required, it should be possible to trim inaccurate or unstable servers without affecting the ability to provide NTP as a free service.
For my own testing, I queried 0.uk.pool.ntp.org, 1.uk.pool.ntp.org, 2.uk.pool.ntp.org, and 3.uk.pool.ntp.org every five minutes. I took all four IP addresses returned each time, sixteen server addresses per probe, and queried each of them for the time. Over 1,440 probes, I sent 23,040 NTP requests. Of these, 2,256 responses were off by more than 50 ms, and 1,408 were off by more than 100 ms. That is, 9.8 percent of responses were more than 50 ms wrong, and 6.1 percent were more than 100 ms wrong.
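The per-response check described above boils down to the standard SNTP offset calculation from the four on-wire timestamps (RFC 4330/5905), with the result compared against the 50 ms and 100 ms thresholds. A minimal sketch, using made-up example timestamps rather than data from the actual test run:

```python
def offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Standard SNTP calculation: t0 = client transmit, t1 = server receive,
    t2 = server transmit, t3 = client receive (all in seconds)."""
    offset = ((t1 - t0) + (t2 - t3)) / 2   # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)          # round-trip network delay
    return offset, delay

# Hypothetical probe: server clock ~80 ms ahead, ~30 ms round trip.
off, dly = offset_and_delay(t0=100.000, t1=100.095, t2=100.096, t3=100.031)
print(f"offset {off * 1000:.1f} ms, delay {dly * 1000:.1f} ms")

# Flag responses exceeding the thresholds discussed above.
for threshold_ms in (50, 100):
    print(f"off by more than {threshold_ms} ms:", abs(off) * 1000 > threshold_ms)
```

Note that a large measured offset can reflect path asymmetry as well as a genuinely wrong server clock; the formula assumes symmetric delay, which is relevant to the discussion further down.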
If the rules were changed so that any server which goes out by more than 50 ms is removed immediately (not stepped down slowly), the pool size would fall by less than 10 percent. Servers could then regain their status gradually as they provide correct responses. At present it seems unusual that there are strict rules to get into the pool, but once in, a server can drift significantly without being removed from DNS responses unless it is consistently bad for nearly a full day.
In my sample image, the NTP responses exceeding 50 ms are highlighted in red.
On the right side, you can observe where one pool server drifted, and the servers using the same upstream source drifted along with it, forming the visible arch.
@abh commented on GitHub (Sep 11, 2025):
Most/many NTP clients will do several NTP queries for each DNS response. And we get a large amount of DNS caching downstream, so I think the NTP query count might be off by an order of magnitude or more.
I don't think that changes the relevancy of your suggestion for picking better NTP servers though!
@abh commented on GitHub (Sep 11, 2025):
Could you add a monitor to the pool? It'd be interesting to see if the monitor collects different data than your test tool, and if so why.
@ghost commented on GitHub (Sep 12, 2025):
At the moment, I don’t meet the eligibility requirements to set up a monitor.
Monitor eligibility: Monitor management is available for accounts with existing monitors or servers that have been verified for 18 months.

@demsbjf8 commented on GitHub (Sep 12, 2025):
Unfortunately, that conclusion is quite wrong. Please take a look at how the servers are distributed among continents and countries. E.g., Europe currently has about 2,360 active servers, vs. only about 300 active IPv4 servers for Asia. China currently has 35 active IPv4 servers for an estimated 1,022,000,000 Internet users, vs. 224 active IPv4 servers in the UK serving an estimated 64,990,000 Internet users. So while tightening the rules for better-served zones may be feasible, tightening the rules that currently apply globally is likely to make an already dire situation in large parts of the world even worse.
E.g., consider India, with an estimated 644,000,000 Internet users sharing 42 currently active IPv4 servers. For reasons I have yet to figure out, many parts of the Internet in India have asymmetric paths to other parts of the world. There are some monitors in India itself, but most monitors are outside of India.
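The imbalance in these figures is easier to see as users per server. A quick check using only the numbers quoted above:

```python
# Active IPv4 servers and estimated Internet users per zone,
# as quoted in the comment above.
zones = {
    "CN": (35, 1_022_000_000),
    "IN": (42, 644_000_000),
    "UK": (224, 64_990_000),
}

for zone, (servers, users) in zones.items():
    print(f"{zone}: {users / servers / 1e6:.1f}M users per server")
# CN: 29.2M users per server
# IN: 15.3M users per server
# UK: 0.3M users per server
```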
As a consequence, if rules were tightened globally, that might mean that more of the already scarce servers in India that actually work well when compared to local references could be pushed out of the pool when assessed mostly by monitors seeing high offsets due to the path asymmetries (while the clocks are actually quite accurate), making the situation for the remaining servers even worse.
Similar considerations would apply to other underserved zones, some of which are barely kept viable with the help of servers from other countries, often even on other continents. I.e., if latency requirements were tightened globally, that might again mean that some of the servers that support precarious zones get pushed out as well, again worsening the situation for the remaining servers.
In a healthy zone, pushing some further servers out by tightening the quality rules may make no difference. In a zone that is already precarious and just barely surviving as it is, that might push such zones over the edge.
So while it is obviously always desirable to increase the quality of the pool, my preference would be to first make the pool actually work well globally, not only in Europe and the USA. Rather than potentially increasing the already existing bias of the pool even further.
@demsbjf8 commented on GitHub (Sep 13, 2025):
Again, that ignores the huge imbalance between zones. As was mentioned in the pool forum, the lowest netspeed setting can attract 100+ Mbps in the CN zone. I.e., that would give 3,500+ Mbps for the CN zone alone, and that is if all servers were at the lowest netspeed setting; at least some of them are well above it. Other zones are somewhat less extreme than the CN zone, but large parts of the world still see far higher traffic than a North America/Europe-centric perspective would suggest, e.g., Singapore, South Korea, Japan, India, Malaysia, and the Philippines, among others.
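For comparison with the earlier capacity estimate, the 100 Mbps per-server figure from the forum makes the CN zone alone exceed the estimated global demand; a one-line check using the numbers quoted above:

```python
# 35 active CN IPv4 servers, each reportedly seeing 100+ Mbps
# even at the lowest netspeed setting (forum-reported lower bound).
servers_cn = 35
mbps_per_server = 100
total_mbps = servers_cn * mbps_per_server
print(f"CN zone lower bound: {total_mbps} Mbps")  # 3500 Mbps
```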
I am not sure what is wrong with that: the pool instructions explicitly do not preclude it. While servers in the pool should not use any of the pool's zone names as upstreams, due to the dynamic and somewhat unpredictable nature of which actual upstream servers get assigned, they may statically pick individual servers as upstreams even if those are in the pool as well. I also wouldn't call this "circular" as long as no server appears in the chain twice.
Mandating to not use other servers from the pool even based on static configuration would also present notable challenges from a practical point of view. E.g., many popular, well-known servers are in the pool, e.g., ntppoolX.time.nl or time.cloudflare.com, or various university, research institute, national metrology institute and similar servers. Forbidding servers to pick any of those as upstreams would seriously limit the choices of solid upstream servers for large portions of the server population. Removing those well-known servers from the pool on the other hand would seriously harm many zones in the pool where those well-known servers are one crucial contributor that prevents many zones' collapse.
While the red color obviously makes it look bad, I don't really see the frequency of those events as concerning for most of the occurrences. Most of them seem to be somewhat isolated instances, events that may happen at any time for a large range of reasons. Unless they happen persistently, those individual events are nothing that would, in my view, warrant removing a server from the pool more aggressively than the current mechanisms already do.

As has been discussed in the forum on various occasions, most recently due to a recent event that caused issues for some clients, it is the responsibility of the client to validate the responses it gets. Trying to offload that task to the pool would overload it, and would not realistically even be possible: the probing by existing monitors happens only at discrete points in time and cannot catch anything happening in between, and it cannot represent every traffic path from every potential client to all of the servers. Conversely, a monitor seeing a glitch with a server does not necessarily mean that anything is wrong with the server, or even with infrastructure associated with the server; it could also be something on the monitor, or more closely associated with the monitor than with the server. Kicking out a server because of a single event, when it works well and is actually serving the large majority of its clients well, would do a disservice to those clients.
Those events where issues are clustered, especially over longer periods of time (vs. almost-overlapping events), are more concerning. That would be worth looking into, to see whether the current pool mechanisms could be tuned a bit more to catch those as well. But again, I am somewhat doubtful that the energy invested in such an undertaking would be well spent trying to replicate, on the pool side, a task that for a number of reasons is usually considered the responsibility of clients, from both efficiency and effectiveness points of view (i.e., the effort it would take in each place to actually detect and mitigate such events, and how completely the mechanism would catch the large majority of such cases).
The bump is an interesting effect; it would be interesting to learn more about how it came to be. There could be many reasons why groups of servers drift together, e.g., as you suggest, because a set follows a common drifting upstream. But it could also be that the set is experiencing similar networking effects because the servers share connectivity to the rest of the world. Again, that is potentially nothing that should, in my view, generally disqualify those servers, given the still somewhat limited sample size of monitoring vantage points, while they might work well for a sufficiently large number of other clients; see the India example in my previous comment. It is also something that a sensible NTP client implementation would be expected to handle anyway, as even the best monitoring and pruning will never prevent all such events from reaching clients directly. So again, I am not sure how much effort would be worthwhile to spend replicating such functionality (the extensive clustering and selection algorithms of an NTP client) on the infrastructure side, from both an efficiency and an effectiveness point of view.
Could you elaborate as to what those probes are? I get the impression that you were using some kind of tool for your investigation, but I am not yet clear which one it is, e.g., whether it is anything mentioned in recent forum discussions.