Mirror of https://github.com/hickory-dns/hickory-dns.git (synced 2026-04-25 11:15:54 +03:00)
[GH-ISSUE #933] Spurious resolution failure with concurrent requests #571
Originally created by @ktff on GitHub (Nov 28, 2019).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/933
Describe the bug
With 3 servers and a configuration allowing 2 concurrent requests, name resolution sometimes fails and sometimes succeeds.
To Reproduce
Use the default value for the `ResolverOpts::num_concurrent_reqs` field and 3 authorities on different domains. Construct a `Resolver` with them. With such a resolver, `Resolver::lookup_ip` will sometimes spuriously fail to resolve.
Expected behavior
`Resolver::lookup_ip` should successfully resolve a name.
System:
Crate: trust-dns-resolver
Version: 0.12
Additional context
Client and Server were in the same process.
With `ResolverOpts::num_concurrent_reqs = 1`, `Resolver::lookup_ip` behaves as expected.
Servers were made with `trust-dns-server = 0.17`.
Discovered while working on https://github.com/timberio/vector/pull/1118
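The "sometimes fails, sometimes succeeds" behavior, and the fact that it disappears with `num_concurrent_reqs = 1`, is characteristic of a race between in-flight queries: if any one of them can deliver a terminal error before another delivers a success, the whole lookup fails only when the error happens to arrive first. A minimal std-only sketch of that failure shape (this is an illustration, not Trust-DNS code; all names here are mine):

```rust
use std::net::IpAddr;

#[derive(Clone, Debug, PartialEq)]
enum Response {
    Answer(IpAddr),
    NxDomain, // an authoritative "no such name"
}

/// Concurrent strategy: all candidate queries are in flight at once
/// and the first response to arrive is treated as final, so an early
/// NXDomain wins even if a later candidate would have succeeded.
fn first_response_wins(arrivals: &[Response]) -> Result<IpAddr, &'static str> {
    match arrivals.first() {
        Some(Response::Answer(ip)) => Ok(*ip),
        Some(Response::NxDomain) => Err("NXDomain"),
        None => Err("no responses"),
    }
}

/// Sequential strategy (num_concurrent_reqs = 1): candidates are tried
/// in order, and an NXDomain merely moves on to the next candidate.
fn try_until_answer(arrivals: &[Response]) -> Result<IpAddr, &'static str> {
    for r in arrivals {
        if let Response::Answer(ip) = r {
            return Ok(*ip);
        }
    }
    Err("NXDomain")
}

fn main() {
    let ip: IpAddr = "10.0.0.7".parse().unwrap();
    // The NXDomain for one candidate arrives before the real answer.
    let arrivals = [Response::NxDomain, Response::Answer(ip)];

    assert!(first_response_wins(&arrivals).is_err()); // spurious failure
    assert_eq!(try_until_answer(&arrivals), Ok(ip));  // sequential succeeds
    println!("race fails, sequential succeeds");
}
```

Whether the race is lost depends on which server answers first, which matches the intermittent failures reported here.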
@bluejekyll commented on GitHub (Nov 28, 2019):
Thank you for the report. We should investigate this after the big refactor that's coming in #913 or the follow up PR that I'm working on now.
@LucioFranco commented on GitHub (Dec 3, 2019):
You can find a repro of this in this test: https://github.com/timberio/vector/blob/master/src/dns.rs#L308
@bluejekyll commented on GitHub (Dec 3, 2019):
@LucioFranco, do you have any issues with me pulling that test directly into the trust-dns-resolver test cases?
@LucioFranco commented on GitHub (Dec 3, 2019):
@bluejekyll nope go for it!
@bluejekyll commented on GitHub (Dec 16, 2019):
@LucioFranco and @ktff, while I was working on updating the libraries to Tokio 0.2 I discovered a deadlock in the trust-dns-proto UdpStream. This has been patched, but I wonder if this is the root cause of the issues you were seeing here.
I took a look at your test, and I want to port it into the library; I just haven't had time. But maybe you could validate that this is still an issue in 0.18.0-alpha.3?
@LucioFranco commented on GitHub (Dec 16, 2019):
@bluejekyll, I am not sure we can bring in that version yet since we don't have tokio 0.2 support, but it would be good to bring the test over. I'm not sure I'll have much time before the end of the year to do so, though.
@bluejekyll commented on GitHub (Dec 16, 2019):
Understood.
This and another bug are the only things I need to look into before publishing 0.18, so it will be at the top of my list as I have time to work on it.
@bluejekyll commented on GitHub (Apr 17, 2020):
I think this was resolved in the 0.19.4 release. Would be good to validate.
@LucioFranco commented on GitHub (Apr 17, 2020):
We have not upgraded yet, but will do so once I do!
@gabelerner commented on GitHub (Apr 19, 2020):
I just upgraded to `0.19.4` to test my related issue and it's still broken. I was on `0.18.0-alpha.2` because that's what `0.9.0` of `actix` is on. My issue is as follows: this does not panic (I get the IP back twice), but when I comment out the `num_concurrent_reqs` line (i.e. set it to the default of `2`), I get a panic on the second call to `lookup_ip` on `0.19.4`. On `0.18.0-alpha.2`, I got a similar error, but instead of the labels formatted as a vec it was `subdomain.internaldomain.corp.internaldomain.corp`. My `/etc/resolv.conf`: it seems like when it's running concurrently, somewhere it's prepending the query onto the search defined in your config and exiting early?
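For context on the "prepending the query onto the search" observation: a stub resolver expands a name against the `resolv.conf` search list before querying, which is how a doubled name like `subdomain.internaldomain.corp.internaldomain.corp` can appear as a candidate. A rough std-only sketch of glibc-style expansion with `ndots = 1` (illustrative only, not the Trust-DNS implementation):

```rust
/// Expand a query name against a resolv.conf-style search list.
/// Classic ndots rule: a name containing fewer than `ndots` dots
/// tries the search-list suffixes first, then the name as-is;
/// otherwise the name as-is is tried first.
fn candidates(name: &str, search: &[&str], ndots: usize) -> Vec<String> {
    let base = name.trim_end_matches('.');
    let as_is = format!("{}.", base);
    let suffixed: Vec<String> = search
        .iter()
        .map(|s| format!("{}.{}.", base, s.trim_end_matches('.')))
        .collect();

    let dots = base.matches('.').count();
    let mut out = Vec::new();
    if dots < ndots {
        out.extend(suffixed);
        out.push(as_is);
    } else {
        out.push(as_is);
        out.extend(suffixed);
    }
    out
}

fn main() {
    // "subdomain.internaldomain.corp" has 2 dots (>= ndots), so it is
    // tried as-is first, then with each search suffix appended,
    // producing the doubled candidate seen in the error above.
    let c = candidates("subdomain.internaldomain.corp", &["internaldomain.corp"], 1);
    assert_eq!(
        c,
        vec![
            "subdomain.internaldomain.corp.".to_string(),
            "subdomain.internaldomain.corp.internaldomain.corp.".to_string(),
        ]
    );
    println!("{:?}", c);
}
```

When candidates like these are queried concurrently rather than in order, the response that arrives first is not necessarily the response for the best candidate.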
@bluejekyll commented on GitHub (Apr 20, 2020):
Thank you for the test case. I will use that to try and reproduce.
@bluejekyll commented on GitHub (Apr 27, 2020):
For folks running into this issue, are some people using internal DNS servers for domains that are not public? I'm wondering if this is related to the lookup order in Trust-DNS?
@gabelerner commented on GitHub (Apr 27, 2020):
Yes, my DNS server is internal/private -- that's why I put `internaldomain.corp` in my earlier reproduction steps. You can try it by running a DNS server in Docker, setting it up as I have above, and issuing a query to your local DNS server.
I'm also glad to see that somebody else successfully fixed it by setting concurrent requests to 1. I applied that fix locally as well, but now I know it's not just me.
@bluejekyll commented on GitHub (Apr 27, 2020):
I think I'm at least beginning to understand the root cause of this issue. There's a patch in #1085 to `name_server.rs` that it would be great if you could test (with concurrency of at least 2). My working theory here is that you're getting an NXDomain for `.corp`, which is a real domain, and therefore you're getting an authoritative negative answer. Trust-DNS may have to loosen up its definition of how it regards permanent lookup failures from authoritative servers.
@gabelerner commented on GitHub (Apr 28, 2020):
@bluejekyll, on my reproduction code above with `0.19.5` and the patch from the PR applied, I get: it seems like I do not hit any of the other match arms? Can I put `print`s anywhere to help you debug this more?
@bluejekyll commented on GitHub (Apr 28, 2020):
Can you double-check that the `nameserver 10.0.x.x # internal` configuration in your `resolv.conf` is making it into the configuration with the `let (c, mut o) = read_system_conf().unwrap();` from your example? It seems like it is doing the correct thing, in that it is searching a lot of different domains (thus all the NXDomain dbg lines).
@gabelerner commented on GitHub (Apr 28, 2020):
Oops 🤦 I was testing a domain that we had deactivated! @bluejekyll
I just tested again with another domain and it worked even w/ concurrency=2!
The only message I saw was a single "Nameserver responded with NXDomain". Hopefully @svenstaro can also test by undoing the concurrency=1.
@bluejekyll commented on GitHub (Apr 28, 2020):
Excellent. This finally answers that problem. The root cause is that trust-dns is strictly regarding the upstream response as authoritative. Given that it eventually errors correctly, I'm going to say we should go forward with that PR, and we can let that bake in master and allow people to experiment with it.
BTW, I will make the same comment here that I did in #1085, which is that your DNS configuration is potentially vulnerable to a form of attack where an external name could be used to redirect traffic to a nefarious endpoint, so I'd strongly suggest having your organization move away from that.
Thanks for helping debug this.
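The fix direction described above can be illustrated with a toy policy check (std-only; the names are mine, not the #1085 code): under a strict policy, an authoritative NXDomain for any search-list candidate is treated as a permanent failure of the whole lookup, while a loosened policy treats it as final only for that one candidate and keeps trying the rest.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome {
    Answer,
    NxDomain { authoritative: bool },
}

/// Strict policy: any authoritative NXDomain aborts the whole lookup,
/// even if a later candidate would have produced an answer.
fn strict(outcomes: &[Outcome]) -> bool {
    for o in outcomes {
        match o {
            Outcome::Answer => return true,
            Outcome::NxDomain { authoritative: true } => return false, // permanent
            Outcome::NxDomain { authoritative: false } => continue,
        }
    }
    false
}

/// Loosened policy: an NXDomain only rules out that one candidate.
fn loosened(outcomes: &[Outcome]) -> bool {
    outcomes.iter().any(|o| *o == Outcome::Answer)
}

fn main() {
    // An authoritative NXDomain (e.g. for the ".corp" candidate)
    // arrives before the successful answer for the internal zone.
    let outcomes = [Outcome::NxDomain { authoritative: true }, Outcome::Answer];

    assert!(!strict(&outcomes));  // spurious failure
    assert!(loosened(&outcomes)); // lookup succeeds
    println!("strict: {}, loosened: {}", strict(&outcomes), loosened(&outcomes));
}
```

Loosening the policy trades a few extra queries on genuinely nonexistent names for correctness when public and internal namespaces overlap, which is the configuration discussed in this thread.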
@bluejekyll commented on GitHub (Apr 29, 2020):
FYI, #1086 was merged; it should resolve this. It sounds like it resolves the issues, but I'll leave this open until we have more confidence from everyone who was affected.
@gabelerner commented on GitHub (Apr 29, 2020):
Yea, I myself will be waiting for `actix` to get on the latest version. I think that'll be common for many of your users that are waiting for a dependency, so they should just vendor, apply the patch, and test for now. Thanks for the quick fix!
@LucioFranco commented on GitHub (Apr 30, 2020):
I can confirm this is fixed on our end by upgrading.
@bluejekyll commented on GitHub (Apr 30, 2020):
This is great news. Thanks!
@bluejekyll commented on GitHub (Apr 30, 2020):
#1085 resolves this issue.