mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 19:25:56 +03:00
[GH-ISSUE #1702] Name server sorting is problematic #739
Labels
No labels
blocked
breaking-change
bug
bug:critical
bug:tests
cleanup
compliance
compliance
compliance
crate:all
crate:client
crate:native-tls
crate:proto
crate:recursor
crate:resolver
crate:resolver
crate:rustls
crate:server
crate:util
dependencies
docs
duplicate
easy
easy
enhance
enhance
enhance
feature:dns-over-https
feature:dns-over-quic
feature:dns-over-tls
feature:dnsssec
feature:global_lb
feature:mdns
feature:tsig
features:edns
has workaround
ops
perf
platform:WASM
platform:android
platform:fuchsia
platform:linux
platform:macos
platform:windows
pull-request
question
test
tools
tools
trust
unclear
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/hickory-dns#739
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @peterthejohnston on GitHub (Apr 25, 2022).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/1702
Describe the bug
The Trust-DNS resolver sorts the name servers it's been configured with, by (1) "connection state", so that name servers that have been successfully reached are prioritized above unreachable ones; and (2) based on stats collected on how many queries to a given resolver were successful vs. failed.
However, there are a few problems with this sorting:
NameServerState) seems to be sorted backwards from what I would expect. BecauseFailed = 0andEstablished = 2, and servers are sorted in (default) increasing order, servers with which we've established communication are sorted last, while servers we've previously attempted and failed to reach are ranked first. (Maybe I'm missing something here?)NameServerStats) are collected and compared in unintuitive ways.Unfortunately, when you combine the last two properties, you can get into a situation where a nonresponsive name server always gets queried before a good one as long as the overall query never fails. :( If, for example, NameServerA always drops queries and NameServerB always responds, NameServerA's failure stats won't get incremented, so you can imagine that at some point A's stats are
successes: 0 failures: 0and B's stats aresuccesses: 10 failures: 0and A will get queried first. When in practice this happens with pairs or groups of name servers, this can result in long timeouts / unexpected latency.To Reproduce
Configure the resolver with multiple name servers, one or more of which is nonresponsive. Observe that over time (depending on the particulars of the network setup), the nonresponsive name server(s) get queried first.
Expected behavior
Here's what I think we should do:
NameServerStats(see this TODO in the code).NameServerState), but not successes vs. failures (NameServerStats), or something similar.I'm happy to take on this work, but I wanted to raise the issues that we've run into and see what folks think.
@bluejekyll commented on GitHub (Apr 26, 2022):
Yes, that sounds like a good idea.
This sounds good, I thought there were more tests on ordering than there are. Let me know if you need any additional clarifications, but it looks like you've understood the code really well at this point.
I agree
Yes, I've wanted to do this, but never got around to it.
I think one thing you've identified is that we also don't "reset" the stats at any point, I wonder if we would want to move to histograms instead? Please see this PR as well, which also was taking steps at improving the ordering, #1632
Excellent, thank you!
@djc commented on GitHub (Apr 26, 2022):
See also discussion in https://github.com/bluejekyll/trust-dns/issues/158.
@peterthejohnston in your ideal end state, would sorting ever be off? It'd be interesting to understand what your optimal configuration would look like and if/how you think that configuration might differ from other's use cases. Like, what are the trade-offs here? Because ideally we wouldn't expose many knobs and just wire it up so that it does the right thing for "everyone".
@peterthejohnston commented on GitHub (Aug 8, 2022):
Apologies for my very delayed reply.
Thanks for the pointer!
Yeah, I see what you mean, that's a good question. I think we'd potentially be OK with sorting being non-configurable if it worked sensibly out of the box, but I'm honestly not sure exactly what our optimal configuration would look like. I think that @nhurley3 (also working on Fuchsia's networking stack) is doing some investigation into that and might have more thoughts.
@nhurley3 commented on GitHub (Aug 9, 2022):
Thanks for looping me in, Peter!
My intuition is that Trust-DNS should generally just do the right thing. One potential option could be to adopt something similar to the SRTT based algorithms that the more heavyweight caching resolvers use (e.g. BIND, Unbound, etc.) [1]. For example, something like:
The benefit of such an algorithm is that it responds to both negative and positive changes while also generally preferring the least latent server. This may be overkill though. What does everyone think?
Additionally, I think it also makes sense to add the on/off knob that Peter mentioned above. This seems like a fairly common practice with resolvers and is the resolv.conf default [2]. Any objections to me starting with a pull request that adds this config? It sounds like there was probably earlier agreement with this type of change, but just wanted to double check before proceeding.
[1] https://irl.cs.ucla.edu/data/files/papers/res_ns_selection.pdf
[2] https://man7.org/linux/man-pages/man5/resolv.conf.5.html
@bluejekyll commented on GitHub (Aug 9, 2022):
@nhurley3, I really like your proposal on improving the existing algorithm. That is inline with where I was trying to get to, but never completed. This would be really useful in the recursive resolver implementation that I plan to reuse the name server pool for.
I also think we may want a dumb resolution option as well, as that seems to be what most people actually want. So if that's the first simplest thing to do (i.e. the on/off switch), I think that would be a great change to get in.
@djc commented on GitHub (Aug 9, 2022):
Doesn't sound like overkill -- sounds pretty good to me!
How dumb do they want? @nhurley3 in what scenario do you think you/someone might want to turn it off?
@nhurley3 commented on GitHub (Aug 10, 2022):
When we say "turn it off", we really mean that the resolver should respect a user provided ordering. As a result, my envisioned scenario is similar to the case where a user manually edits resolv.conf in Linux or sets the primary + alternate servers in Windows. That is, a power user simply wants to enforce a particular ordering or the implemented algorithm isn't doing what the user wants for whatever reason. This type of escape hatch seems to be the simplest and most common, which is why I think it makes sense.
@bluejekyll commented on GitHub (Aug 10, 2022):
Yes, this aligns with what I believe people would like as well.
@nhurley3 commented on GitHub (Aug 10, 2022):
Great, I'll start working on a PR.