[GH-ISSUE #158] [Resolver] Connection latency measurements #375

Open
opened 2026-03-15 22:11:39 +03:00 by kerem · 4 comments

Originally created by @bluejekyll on GitHub (Jun 30, 2017).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/158


@vlmutolo commented on GitHub (Feb 14, 2022):

I've started thinking a bit about this issue, and just wanted to log some stuff here. I read through the code base a bit and tried to trace the paths that might have to do with latencies and name-server selection. I basically found three/four separate stats that go into selection (or are planned to go into selection):

  1. Number of successes
  2. Number of failures
  3. Latency of connection (should be time-averaged using something like EWMA)
  4. Whether a connection already exists
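As a minimal sketch of the time-averaged latency in item 3, an EWMA can be kept per name server as below. The type name `LatencyEwma` and the `alpha` smoothing parameter are assumptions made for illustration; nothing here is taken from the crate's actual stats object.

```rust
use std::time::Duration;

/// Exponentially weighted moving average of observed latencies.
/// Hypothetical type: `alpha` in (0, 1]; larger values weight recent samples more.
#[derive(Debug, Clone)]
pub struct LatencyEwma {
    alpha: f64,
    avg_secs: Option<f64>,
}

impl LatencyEwma {
    pub fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { alpha, avg_secs: None }
    }

    /// Fold one measured round-trip time into the average.
    pub fn record(&mut self, sample: Duration) {
        let x = sample.as_secs_f64();
        self.avg_secs = Some(match self.avg_secs {
            // The first sample seeds the average.
            None => x,
            // avg = alpha * sample + (1 - alpha) * previous avg
            Some(avg) => self.alpha * x + (1.0 - self.alpha) * avg,
        });
    }

    /// Current average, or `None` if nothing has been recorded yet.
    pub fn average(&self) -> Option<Duration> {
        self.avg_secs.map(Duration::from_secs_f64)
    }
}
```

With `alpha = 0.5`, recording 100 ms and then 200 ms yields an average of 150 ms. A smaller `alpha` would react more slowly to a single slow response.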

I wonder if we can get rid of that last one by holding an open connection to every configured name server. If a user configures a name server, and they configure trust-dns to use DoT or DoH, they have to expect that there will be some open connection to that upstream name server, no?

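For the other three, there should be a sane way to combine them into a single measure for deciding which of two name servers to select. This would pair well with the simple "best of two choices" algorithm (http://www.eecs.harvard.edu/~michaelm/postscripts/tpds2001.pdf), also used by popular load balancers such as nginx (https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/) and Envoy (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers#weighted-least-request).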

Can we just do a simple linear function of the three with user-configurable weights and sane defaults?

let cost = num_fail * fail_weight + num_success * success_weight + avg_latency * latency_weight;

This is simple and allows users to configure this to support their own use cases (what if latency is really important?), but I also feel like no user would ever really want to touch this.
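To make the linear cost concrete, here is a small sketch with configurable weights. The `CostWeights` and `Stats` types and the default values are hypothetical, chosen only so that failures raise the cost and successes lower it; they are not proposed defaults.

```rust
/// Hypothetical weights for the linear cost function.
#[derive(Debug, Clone, Copy)]
pub struct CostWeights {
    pub fail: f64,
    pub success: f64,
    pub latency: f64,
}

impl Default for CostWeights {
    fn default() -> Self {
        // Illustrative values only. The success weight is negative so that
        // successes reduce the cost (lower cost = preferred server).
        Self { fail: 10.0, success: -1.0, latency: 1.0 }
    }
}

/// Hypothetical per-server stats; latency is an average in milliseconds.
#[derive(Debug, Clone, Copy)]
pub struct Stats {
    pub num_fail: u32,
    pub num_success: u32,
    pub avg_latency_ms: f64,
}

/// cost = num_fail * fail_weight + num_success * success_weight
///      + avg_latency * latency_weight
pub fn cost(s: &Stats, w: &CostWeights) -> f64 {
    f64::from(s.num_fail) * w.fail
        + f64::from(s.num_success) * w.success
        + s.avg_latency_ms * w.latency
}
```

One thing the sketch makes visible: raw success and failure counts grow without bound while the latency average does not, so over a long-lived process the count terms would tend to dominate unless they are decayed or windowed.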

I also had a thought where trust-dns could just use "latency" as the stat, and then handle failures by doing something like doubling the recorded latency for each failure.
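A rough sketch of the "double the latency on failure" idea follows. The function name and the `floor` and `cap` parameters are assumptions added for illustration: a floor keeps a recorded latency of zero from staying at zero when doubled, and a cap keeps repeated failures from growing the value without limit.

```rust
use std::time::Duration;

/// Hypothetical failure penalty: double the recorded latency,
/// clamped to the range [2 * floor, cap].
pub fn penalize_failure(recorded: Duration, floor: Duration, cap: Duration) -> Duration {
    recorded
        .max(floor)          // avoid doubling zero into zero
        .saturating_mul(2)   // the penalty itself
        .min(cap)            // bound the effect of repeated failures
}
```

Starting from 40 ms, one failure gives 80 ms, and a run of further failures grows the value exponentially until it reaches `cap`.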

So I don't have a clear-cut solution. Just a bunch of random thoughts on this that I wanted to put down somewhere.


@bluejekyll commented on GitHub (Feb 14, 2022):

What you're talking about here, @vlmutolo, is similar to something we used to do but ended up removing. The history is that, in the general use case, re-sorting the name servers based on these stats caused issues for people with poorly configured forwarding servers in places like corporate offices.

The stats object is here: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server_stats.rs.

NameServerState tracks the current state of the connection: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server_state.rs#L16 (I think UDP may just claim to always be connected).

And then the Ord implementation is here on NameServer: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server.rs#L196

The change you'd want to make is to implement the cost function in the Ord impl differently from how it is today. We can definitely apply some of the changes you're suggesting, but we'd want them to be off by default.
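As a sketch of what an opt-in, cost-based ordering could look like, the comparator below only consults the stats when a flag is set. The `Candidate` type, the `use_stats` flag, and the choice to fall back to the configured order when the flag is off are all assumptions; this is not the crate's current `Ord` implementation for `NameServer`.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a name server entry.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    /// Position in the user's configured server list.
    pub config_index: usize,
    /// Stat-derived cost; lower is better.
    pub cost: f64,
}

/// Order two candidates; `Ordering::Less` means "try `a` first".
pub fn compare(a: &Candidate, b: &Candidate, use_stats: bool) -> Ordering {
    if use_stats {
        // Opt-in: order by cost, breaking ties by configured position.
        a.cost
            .total_cmp(&b.cost)
            .then(a.config_index.cmp(&b.config_index))
    } else {
        // Default (assumed here): keep the configured order untouched.
        a.config_index.cmp(&b.config_index)
    }
}
```

Used with `sort_by(|a, b| compare(a, b, use_stats))`, the list stays in configured order unless the flag is enabled. `f64::total_cmp` gives a total order, which plain `partial_cmp` on floats does not.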


@djc commented on GitHub (Feb 14, 2022):

Feels to me like the configuration value from #1632 (@[Edu4rdSHL]) wants to be named balance_load and then ideally we'd sample two name servers randomly from a PRNG and pick the one with the best stats.

I think it would make sense to start with an exponential moving latency average and use double latency for failures. Have to be very careful about which kinds of error responses count as failures in this case, I suppose?
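The sampling step described here can be sketched as follows: draw two distinct servers at random and keep the one with the lower score. The `XorShift64` generator is only a dependency-free stand-in for a real PRNG, and `pick_best_of_two` and the `scores` slice (for example, a penalized latency average per server) are hypothetical names.

```rust
/// Tiny xorshift64 generator; a stand-in for a real PRNG crate.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in `0..n`. Slightly biased by the modulo; acceptable for a sketch.
    pub fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Sample two distinct servers and return the index of the one with the
/// lower score (lower = better). Returns `None` for an empty list.
pub fn pick_best_of_two(scores: &[f64], rng: &mut XorShift64) -> Option<usize> {
    match scores.len() {
        0 => None,
        1 => Some(0),
        n => {
            let i = rng.below(n);
            // Draw j from the remaining n - 1 slots so that j != i.
            let mut j = rng.below(n - 1);
            if j >= i {
                j += 1;
            }
            Some(if scores[j] < scores[i] { j } else { i })
        }
    }
}
```

Because two distinct servers are always compared, the single worst server is never selected when there are at least two, while the remaining servers still share the load rather than all traffic going to the current best.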


@bluejekyll commented on GitHub (Feb 14, 2022):

> Have to be very careful about which kinds of error responses count as failures in this case, I suppose?

Yes, this is where we've run into issues in the past. I think overall we're treating errors properly at this point, or as best we can, but it's difficult, especially when there are badly configured DNS resolvers that respond authoritatively. It's all quite annoying, since we effectively have to trust a bunch of random DNS servers out there.
