[GH-ISSUE #933] Spurious resolution failure with concurrent requests #571

Closed
opened 2026-03-15 23:12:53 +03:00 by kerem · 24 comments
Owner

Originally created by @ktff on GitHub (Nov 28, 2019).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/933

Describe the bug
With 3 servers and a configuration with 2 concurrent requests, name resolution sometimes fails, other times not.

To Reproduce
Use the default for the `ResolverOpts::num_concurrent_reqs` field and 3 authorities on different domains. Construct a `Resolver` with them. With such a resolver, `Resolver::lookup_ip` will sometimes spuriously fail to resolve.

Expected behavior
For `Resolver::lookup_ip` to successfully resolve a name.

System:

  • OS: Ubuntu
  • Architecture: x64
  • Version 18.04.3 LTS
  • rustc version: 1.39.0

Version:
Crate: trust-dns-resolver
Version: 0.12

Additional context
Client and Server were in the same process.

With ResolverOpts::num_concurrent_reqs = 1 Resolver::lookup_ip behaves as expected.

Servers were made with trust-dns-server = 0.17.

Discovered while working on https://github.com/timberio/vector/pull/1118
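The role of `num_concurrent_reqs` in this setup can be sketched with a small illustrative model (this is not the actual trust-dns-resolver pool code; the function name and batching shape here are assumptions for illustration):

```rust
// Illustrative sketch: `num_concurrent_reqs` controls how many of the
// configured name servers are queried at once. Three servers with the
// default of 2 split into a batch of two plus a batch of one, while a
// value of 1 degenerates to purely sequential queries — matching the
// workaround reported below.
fn server_batches<'a>(servers: &[&'a str], num_concurrent_reqs: usize) -> Vec<Vec<&'a str>> {
    servers
        .chunks(num_concurrent_reqs.max(1)) // guard against a zero setting
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let servers = ["10.0.x.x", "10.0.x.y", "8.8.8.8"];
    // Default concurrency of 2: one batch of two servers, one of one.
    println!("{:?}", server_batches(&servers, 2));
    // Concurrency of 1: strictly sequential, one server per batch.
    println!("{:?}", server_batches(&servers, 1));
}
```

With batches of more than one server, the responses within a batch race each other, which is where a spurious failure can slip in ahead of a successful answer.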

kerem 2026-03-15 23:12:53 +03:00
Author
Owner

@bluejekyll commented on GitHub (Nov 28, 2019):

Thank you for the report. We should investigate this after the big refactor that's coming in #913 or the follow up PR that I'm working on now.


@LucioFranco commented on GitHub (Dec 3, 2019):

You can find a repro of this in this test: https://github.com/timberio/vector/blob/master/src/dns.rs#L308


@bluejekyll commented on GitHub (Dec 3, 2019):

@LucioFranco, do you have any issues with me pulling that test directly into the trust-dns-resolver test cases?


@LucioFranco commented on GitHub (Dec 3, 2019):

@bluejekyll nope go for it!


@bluejekyll commented on GitHub (Dec 16, 2019):

@LucioFranco and @ktff, while I was working on updating the libraries to Tokio 0.2 I discovered a deadlock in the trust-dns-proto UdpStream. This has been patched, but I wonder if this is the root cause of the issues you were seeing here.

I took a look at your test, and I want to port it into the library; I just haven't had time. But maybe you could validate whether this is still an issue in 0.18.0-alpha.3?


@LucioFranco commented on GitHub (Dec 16, 2019):

@bluejekyll I am not sure we can bring in that version yet, since we don't have tokio 0.2 support yet, but it would be good to bring the test over. I'm not sure I'll have much time before the end of the year to do so, though.


@bluejekyll commented on GitHub (Dec 16, 2019):

Understood.

This and another bug are the only things I need to look into before publishing 0.18, so it will be at the top of my list as I have time to work on it.


@bluejekyll commented on GitHub (Apr 17, 2020):

I think this was resolved in the 0.19.4 release. Would be good to validate.


@LucioFranco commented on GitHub (Apr 17, 2020):

We have not upgraded yet, but will validate once I do!


@gabelerner commented on GitHub (Apr 19, 2020):

I just upgraded to `0.19.4` to test my related issue and it's still broken. I was on `0.18.0-alpha.2` because that's what `0.9.0` of `actix` is on. My issue is as follows:

```rust
use trust_dns_resolver::Resolver;
use trust_dns_resolver::system_conf::read_system_conf;

fn main() {
    let (c, mut o) = read_system_conf().unwrap();
    o.num_concurrent_reqs = 1;
    let resolver = Resolver::new(c, o).unwrap();
    resolver.lookup_ip("subdomain.internaldomain.corp").unwrap();
    resolver.lookup_ip("subdomain.internaldomain.corp").unwrap();
}
```

This does not panic; I get the IP back twice. But when I comment out the `num_concurrent_reqs` line (i.e., leave it at the default of 2), I get a panic on the second call to `lookup_ip`:

On `0.19.4`:

```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ResolveError { kind: NoRecordsFound { query: Query { name: Name { is_fqdn: true, labels: [subdomain, internaldomain, corp, internaldomain, corp] }, query_type: AAAA, query_class: IN }, valid_until: None }, backtrack: None }', src/main.rs:9:5
```

On `0.18.0-alpha.2`, I got a similar error, but instead of the labels formatted as a vec it was `subdomain.internaldomain.corp.internaldomain.corp`.

My `/etc/resolv.conf`:

```
search internaldomain.corp
nameserver 10.0.x.x # internal
nameserver 10.0.x.y # internal
nameserver 8.8.8.8 # google
nameserver 8.8.4.4 # google
```

It seems like when it's running concurrently, somewhere it's appending the search domain defined in the config onto the query and exiting early?
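The doubled name in the panic message above is consistent with ordinary resolv.conf search-list expansion. A minimal sketch of that expansion (the function and its ndots handling are simplified assumptions for illustration, not the library's code):

```rust
// Illustrative sketch of resolv.conf-style search-list expansion: a
// relative name produces several candidate FQDNs, which is how
// `subdomain.internaldomain.corp` can also be tried as
// `subdomain.internaldomain.corp.internaldomain.corp`.
fn candidate_names(name: &str, ndots: usize, search: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let dots = name.matches('.').count();
    // A name with at least `ndots` dots is first tried as-is (absolute).
    if dots >= ndots {
        out.push(format!("{}.", name));
    }
    // Each search suffix is also appended, producing the longer variants.
    for suffix in search {
        out.push(format!("{}.{}.", name, suffix));
    }
    // A name with fewer than `ndots` dots is tried as-is only last.
    if dots < ndots {
        out.push(format!("{}.", name));
    }
    out
}

fn main() {
    for candidate in candidate_names("subdomain.internaldomain.corp", 1, &["internaldomain.corp"]) {
        println!("{}", candidate);
    }
}
```

Under this model the suffixed variant is a normal fallback candidate; the question in the thread is why its negative answer wins the race under concurrency instead of the as-is candidate's positive one.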


@bluejekyll commented on GitHub (Apr 20, 2020):

Thank you for the test case. I will use that to try and reproduce.


@bluejekyll commented on GitHub (Apr 27, 2020):

For folks running into this issue, are some people using internal DNS servers for domains that are not public? I'm wondering if this is related to the lookup order in Trust-DNS?


@gabelerner commented on GitHub (Apr 27, 2020):

Yes, my DNS server is internal/private; that's why I put the `internaldomain.corp` in my earlier reproduction steps. You can try it by running a DNS server in a Docker container, setting it up as I have above, and issuing a query to your local DNS server.

I'm also glad to see that somebody else successfully fixed it by setting concurrent reqs to 1. I did that solution locally as well but now I know it's not just me.


@bluejekyll commented on GitHub (Apr 27, 2020):

I think I'm at least understanding the root cause of this issue. There's a patch in #1085 to `name_server.rs` that it would be cool if you could test (with concurrency of at least 2). My working theory is that you're getting an NXDomain for `.corp`, which is a real domain, and therefore you're getting an authoritative negative answer. Trust-DNS may have to loosen up its definition of how it regards permanent lookup failures from authoritative servers.
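The policy in question can be sketched as a small decision function. Only `distrust_nx_responses` is a real `ResolverOpts` field (it appears in the options dump later in this thread); everything else here is an illustrative assumption, not the actual `name_server.rs` logic:

```rust
// Illustrative sketch of the policy under discussion: when one name
// server in the pool answers NXDomain, should the lookup end there,
// or should the pool try the next server?
#[derive(Debug, PartialEq)]
enum Next {
    ReturnNxDomain,
    TryNextServer,
}

fn on_nxdomain(authoritative: bool, distrust_nx_responses: bool) -> Next {
    if authoritative && !distrust_nx_responses {
        // An authoritative negative answer is taken as final.
        Next::ReturnNxDomain
    } else {
        // Otherwise keep going: another server may hold the record.
        Next::TryNextServer
    }
}

fn main() {
    // Treating the authoritative NXDomain as final ends the lookup early,
    // even though another server (or candidate name) might have answered.
    println!("{:?}", on_nxdomain(true, false));
    // Distrusting NXDomain lets the pool continue past the negative answer.
    println!("{:?}", on_nxdomain(true, true));
}
```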


@gabelerner commented on GitHub (Apr 28, 2020):

@bluejekyll On my reproduction code above with `0.19.5` and the patch from the PR applied, I get:

```
Starting ...
Nameserver responded with NXDomain
Nameserver responded with NXDomain
<... clipped: 140 more lines of the same message ...>
Nameserver responded with NXDomain
Nameserver responded with NXDomain
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ResolveError { kind: Proto(ProtoError { kind: Message("Nameserver responded with NXDomain"), backtrack: None }), backtrack: None }', src/main.rs:8:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

It seems like I do not hit any of the other match arms? Can I put prints anywhere to help you debug this more?


@bluejekyll commented on GitHub (Apr 28, 2020):

Can you double check that the `nameserver 10.0.x.x # internal` configuration in your `resolv.conf` is making it into the configuration with the `let (c, mut o) = read_system_conf().unwrap();` from your example? It seems like it is doing the correct thing in that it is searching a lot of different domains (thus all the NXDomain dbg lines).


@gabelerner commented on GitHub (Apr 28, 2020):

```
ResolverConfig { domain: Some(Name { is_fqdn: false, labels: [local] }), search: [Name { is_fqdn: false, labels: [kernel, corp] }], name_servers: NameServerConfigGroup([NameServerConfig { socket_addr: V4(10.0.x.x:53), protocol: Udp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(10.0.x.x:53), protocol: Tcp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(10.0.x.y:53), protocol: Udp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(10.0.x.y:53), protocol: Tcp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(8.8.8.8:53), protocol: Udp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(8.8.8.8:53), protocol: Tcp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(8.8.4.4:53), protocol: Udp, tls_dns_name: None }, NameServerConfig { socket_addr: V4(8.8.4.4:53), protocol: Tcp, tls_dns_name: None }]) }

ResolverOpts { ndots: 1, timeout: 5s, attempts: 2, rotate: false, check_names: true, edns0: false, validate: false, ip_strategy: Ipv4thenIpv6, cache_size: 32, use_hosts_file: true, positive_min_ttl: None, negative_min_ttl: None, positive_max_ttl: None, negative_max_ttl: None, distrust_nx_responses: true, num_concurrent_reqs: 2, preserve_intermediates: false }
```

@gabelerner commented on GitHub (Apr 28, 2020):

Oops 🤦 I was testing a domain that we had deactivated! @bluejekyll

I just tested again with another domain and it worked even w/ concurrency=2!

The only message I saw was a single "Nameserver responded with NXDomain". Hopefully @svenstaro can also test by undoing the concurrency=1.


@bluejekyll commented on GitHub (Apr 28, 2020):

Excellent. This finally answers that problem. The root cause is that trust-dns is strictly regarding the upstream response as authoritative. Given that it eventually errors correctly, I'm going to say we should go forward with that PR, and we can let that bake in master and allow people to experiment with it.

BTW, I will make the same comment here that I did in #1085, which is that your DNS configuration is potentially vulnerable to a form of attack where an external name could be used to redirect traffic to a nefarious endpoint, so I'd strongly suggest having your organization move away from that.

Thanks for helping debug this.


@bluejekyll commented on GitHub (Apr 29, 2020):

FYI, #1086 was merged; it should resolve this. It sounds like this resolves the issues, but I'll leave this open until we have more confidence from everyone who was affected.


@gabelerner commented on GitHub (Apr 29, 2020):

Yeah, I myself will be waiting for `actix` to get on the latest version. I think that'll be common for many of your users who are waiting for a dependency, so they should just vendor, apply the patch, and test for now. Thanks for the quick fix!


@LucioFranco commented on GitHub (Apr 30, 2020):

I can confirm this is fixed on our end by upgrading.


@bluejekyll commented on GitHub (Apr 30, 2020):

This is great news. Thanks!


@bluejekyll commented on GitHub (Apr 30, 2020):

#1085 resolves this issue.
