mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 03:05:51 +03:00
[GH-ISSUE #158] [Resolver] Connection latency measurements #375
Originally created by @bluejekyll on GitHub (Jun 30, 2017).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/158
@vlmutolo commented on GitHub (Feb 14, 2022):
I've started thinking a bit about this issue, and just wanted to log some stuff here. I read through the code base a bit and tried to trace the paths that might have to do with latencies and name-server selection. I basically found three/four separate stats that go into selection (or are planned to go into selection):
I wonder if we can get rid of that last one by holding an open connection to every configured name server. If a user configures a name server, and they configure trust-dns to use DoT or DoH, they have to expect that there will be some open connection to that upstream name server, no?
For the other three, there should be some sane way to combine them into a single measure for deciding which of two name servers to select. This would pair well with the simple "best of two choices" algorithm, also used by popular load balancers such as nginx (see here) and Envoy (see here).
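A minimal sketch of the "best of two choices" idea, assuming each name server has already been reduced to a single cost where lower is better. The `XorShift64` generator and `pick_best_of_two` are hypothetical helpers written for this example so it needs no external crates; they are not part of trust-dns.

```rust
/// Tiny xorshift PRNG (hypothetical helper; the seed must be non-zero).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in 0..n; modulo bias is ignored for the purposes of this sketch.
    fn index(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Sample two distinct servers at random and return the index of the cheaper one.
fn pick_best_of_two(costs: &[f64], rng: &mut XorShift64) -> Option<usize> {
    match costs.len() {
        0 => None,
        1 => Some(0),
        n => {
            let a = rng.index(n);
            // Draw from the remaining n - 1 slots and step over `a` so that b != a.
            let mut b = rng.index(n - 1);
            if b >= a {
                b += 1;
            }
            Some(if costs[a] <= costs[b] { a } else { b })
        }
    }
}
```

Because the two samples are distinct, the single worst server is never selected, while the remaining servers still share the load instead of all traffic going to the current best one.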
Can we just do a simple linear function of the three with user-configurable weights and sane defaults?
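As a rough illustration of such a linear function, the sketch below assumes the three stats are a success count, a failure count, and a latency. The `Stats` and `Weights` types, their field names, and the default weight values are all assumptions made up for this example, not existing trust-dns configuration.

```rust
/// Hypothetical per-server statistics; the field names are assumptions.
#[derive(Debug, Clone, Copy)]
struct Stats {
    successes: u32,
    failures: u32,
    latency_ms: f64,
}

/// User-configurable weights (hypothetical; default values chosen arbitrarily).
#[derive(Debug, Clone, Copy)]
struct Weights {
    success: f64,
    failure: f64,
    latency: f64,
}

impl Default for Weights {
    fn default() -> Self {
        // Successes lower the cost, failures and latency raise it.
        Weights { success: -1.0, failure: 10.0, latency: 1.0 }
    }
}

/// Linear cost of a server: lower is better.
fn cost(stats: &Stats, w: &Weights) -> f64 {
    w.success * f64::from(stats.successes)
        + w.failure * f64::from(stats.failures)
        + w.latency * stats.latency_ms
}
```

A user who cares mostly about latency could raise `latency` relative to the other two weights; everyone else would stay on `Weights::default()`.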
This is simple and allows users to configure this to support their own use cases (what if latency is really important?), but I also feel like no user would ever really want to touch this.
I also had a thought where trust-dns could just use "latency" as the stat, and then handle failures by doing something like doubling the recorded latency for each failure.
So I don't have a clear-cut solution. Just a bunch of random thoughts on this that I wanted to put down somewhere.
@bluejekyll commented on GitHub (Feb 14, 2022):
What you're talking about here, @vlmutolo, is similar to something we used to do but ended up removing. The history is that, for the general use case, re-sorting the name servers based on these stats caused issues for people with poorly configured forwarding servers in places like corporate offices.
The stats object is here: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server_stats.rs.
NameServerState tracks the current state of the connection: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server_state.rs#L16 (I think UDP may just claim to always be connected).
And then the Ord implementation is here on NameServer: https://github.com/bluejekyll/trust-dns/blob/main/crates/resolver/src/name_server/name_server.rs#L196
The change you'd want to make is implementing the cost function in the Ord impl differently from how it is today. We can definitely apply some of the changes you're suggesting, but we'd want them to be off by default.
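To make the idea of a cost function inside an `Ord` impl concrete, here is a sketch on a stand-in type. `RankedServer`, its fields, and `FAILURE_PENALTY_US` are hypothetical and do not mirror the real `NameServer` struct or its current ordering; the convention assumed here is that the smaller value sorts first and is preferred.

```rust
use std::cmp::Ordering;

/// Assumed penalty added per recorded failure, in microseconds (arbitrary value).
const FAILURE_PENALTY_US: u64 = 50_000;

/// Hypothetical stand-in for a name server entry; not the real `NameServer`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RankedServer {
    connected: bool,
    failures: u32,
    latency_us: u64,
}

impl RankedServer {
    /// Illustrative cost: latency plus a fixed penalty per failure (saturating).
    fn cost(&self) -> u64 {
        let penalty = u64::from(self.failures).saturating_mul(FAILURE_PENALTY_US);
        self.latency_us.saturating_add(penalty)
    }
}

impl Ord for RankedServer {
    fn cmp(&self, other: &Self) -> Ordering {
        // `true > false`, so compare in reverse to sort connected servers first.
        other
            .connected
            .cmp(&self.connected)
            .then_with(|| self.cost().cmp(&other.cost()))
            // Tie-break on the raw fields so that `Ord` stays consistent with `Eq`.
            .then_with(|| self.failures.cmp(&other.failures))
            .then_with(|| self.latency_us.cmp(&other.latency_us))
    }
}

impl PartialOrd for RankedServer {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

Since `Ord::cmp` cannot take extra arguments, keeping such a scheme off by default would need a switch somewhere else, for example by sorting only when a resolver option is set.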
@djc commented on GitHub (Feb 14, 2022):
Feels to me like the configuration value from #1632 (@Edu4rdSHL) wants to be named `balance_load`, and then ideally we'd sample two name servers randomly from a PRNG and pick the one with the best stats.
I think it would make sense to start with an exponential moving latency average and use double latency for failures. Have to be very careful about which kinds of error responses count as failures in this case, I suppose?
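One possible reading of "exponential moving latency average with double latency for failures" is sketched below: a failure is fed into the average as a sample equal to twice the current average. The `LatencyEwma` type, the `alpha` smoothing factor, and the `fallback_ms` parameter are assumptions for this example, and deciding which error responses should call `record_failure` is left open, as in the discussion.

```rust
/// Exponentially weighted moving average of latency (hypothetical type).
#[derive(Debug, Clone)]
struct LatencyEwma {
    /// Smoothing factor in (0, 1]; higher values react faster to new samples.
    alpha: f64,
    /// Current average in milliseconds, `None` until the first sample arrives.
    value_ms: Option<f64>,
}

impl LatencyEwma {
    fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        LatencyEwma { alpha, value_ms: None }
    }

    /// Fold one sample into the average; the first sample seeds it directly.
    fn update(&mut self, sample_ms: f64) {
        self.value_ms = Some(match self.value_ms {
            None => sample_ms,
            Some(prev) => self.alpha * sample_ms + (1.0 - self.alpha) * prev,
        });
    }

    /// Record the measured latency of a successful query.
    fn record_success(&mut self, latency_ms: f64) {
        self.update(latency_ms);
    }

    /// Record a failure as a sample of twice the current average.
    /// `fallback_ms` is used when there is no average yet (an assumption).
    fn record_failure(&mut self, fallback_ms: f64) {
        let sample = match self.value_ms {
            Some(prev) => prev * 2.0,
            None => fallback_ms,
        };
        self.update(sample);
    }
}
```

With `alpha = 0.5`, successes of 100 ms and 200 ms give an average of 150 ms, and a following failure contributes a 300 ms sample, moving the average to 225 ms. Repeated failures therefore push a server's score up geometrically, and later successes pull it back down.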
@bluejekyll commented on GitHub (Feb 14, 2022):
Yes, this is where we've run into issues in the past. I think overall we're treating errors properly at this point, or as best we can, but it's difficult, especially when there are badly configured DNS resolvers that respond authoritatively. It's all quite annoying, since we effectively have to trust a bunch of random DNS servers out there.