[GH-ISSUE #3090] Excessive queries and stuck lookup in hickory-resolver with DNSSEC validation enabled #1132

Closed
opened 2026-03-16 01:41:29 +03:00 by kerem · 6 comments
Originally created by @cr-tk on GitHub (Jun 26, 2025).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/3090

Describe the bug
On some combinations of DNS resolvers and domains, hickory-resolver appears to behave incorrectly on lookups if DNSSEC is active and `validate = true` is requested. It continuously retries lookups and fires off thousands of queries over a long period of time (as seen in tcpdump/Wireshark). It does not abort after the configured number of attempts or on timeout. In this particular example, I observed a rate of about 30 DNS requests per second.

This may be related to #3042.

We ran into this while testing edge cases after enabling DNSSEC. We observed it with Alibaba DNS (see [wiki](https://en.wikipedia.org/wiki/Public_recursive_name_server#Notable_public_DNS_service_operators)), which is known not to support DNSSEC, in combination with resolving a test domain that has deliberately broken DNSSEC settings.

From our view, this has negative security implications if the chosen upstream resolver, or an equivalent MITM attacker on the network path to the resolver, can deliberately trigger this broken behavior.

To Reproduce
Example program:

```rust
use hickory_resolver::{
    Resolver,
    config::{NameServerConfigGroup, ResolverConfig, ResolverOpts},
    name_server::TokioConnectionProvider,
};
use std::net::IpAddr;

#[tokio::main]
async fn main() {
    // Alibaba DNS; this resolver strips DNSSEC records
    let resolver_ips = vec![IpAddr::V4([223, 5, 5, 5].into())];

    // For comparison: regular Google DNS resolver
    // let resolver_ips = vec![IpAddr::V4([8, 8, 8, 8].into())];

    let resolver_config = ResolverConfig::from_parts(
        None,
        vec![],
        NameServerConfigGroup::from_ips_clear(&resolver_ips, 53, true),
    );

    let mut resolver_builder =
        Resolver::builder_with_config(resolver_config, TokioConnectionProvider::default());

    let mut resolver_options: ResolverOpts = ResolverOpts::default();
    resolver_options.validate = true;

    // set our improved resolver options
    *resolver_builder.options_mut() = resolver_options;
    let resolver = resolver_builder.build();

    // does not trigger the issue
    // let response = resolver.lookup_ip("www.example.com.").await.unwrap();
    let response = resolver.lookup_ip("sigfail.ippacket.stream").await;

    println!("{response:?}");
}
```

Cargo.toml:

```toml
[package]
name = "minimal_hickory_crate"
version = "0.1.0"
edition = "2024"

[dependencies]
hickory-resolver = { version = "0.25.2", features = [
    "tokio",
    "dnssec-ring", # for DNSSEC
] }
tokio = { version = "1.45.0" }
```

Expected behavior
I expect the resolver to deliver a successful lookup result or an error with a minimal number of queries. Notably, this should happen within the specified timeout duration. Based on the [code documentation](https://github.com/hickory-dns/hickory-dns/blob/527c9f470a418cf6b92da902ea0aaa5749963d59/crates/resolver/src/config.rs#L775-L780), I would expect the timeout to kick in within `2 * 5s` = 10s in the provided example, which it does not. During testing, the provided example ran for several minutes before a manual abort.
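As a defensive workaround for the missing-timeout behavior described above, callers can enforce an outer deadline around the whole lookup, independent of the resolver's internal retry logic (in async code, wrapping `lookup_ip` in `tokio::time::timeout` serves the same purpose). A minimal std-only sketch; `with_deadline` is a hypothetical helper, not a hickory-resolver API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a worker thread and give up after `deadline`, even if the
/// work itself loops or retries forever. The worker thread is detached and
/// keeps running after a timeout; this only bounds how long the caller waits.
fn with_deadline<T: Send + 'static>(
    deadline: Duration,
    work: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out.
        let _ = tx.send(work());
    });
    rx.recv_timeout(deadline).ok()
}

fn main() {
    // A fast "lookup" completes within the deadline.
    assert_eq!(with_deadline(Duration::from_secs(1), || 42), Some(42));

    // Stand-in for a stuck resolver: loops forever, but the caller still
    // gets control back after the deadline.
    let stuck = with_deadline::<()>(Duration::from_millis(50), || loop {
        thread::sleep(Duration::from_millis(10));
    });
    assert!(stuck.is_none());
    println!("deadline enforced");
}
```

This bounds the caller's wait but does not stop the underlying work, so it limits the impact of a stuck lookup without fixing the query storm itself.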

System:
OS: Linux
rustc version: 1.87.0 (stable)

Version:
Crate: hickory-resolver
Version: 0.25.2

The latest available crate version, `v0.26.0-alpha.1`, shows the same general bug behavior, but with a different pattern of DNS requests.

kerem closed this issue 2026-03-16 01:41:35 +03:00

@djc commented on GitHub (Jun 26, 2025):

Can you try with current `main`? Who's "we"?


@cr-tk commented on GitHub (Jun 26, 2025):

Testing with the current `main` required some code changes, because some APIs changed since the last alpha version.

I've now adapted the example code to test against the newest commit, 88cb3033aeca19181066eb13d47662eba77060bf.
Interestingly, the behavior improved significantly: the lookup now finishes within the timeout period of one attempt and delivers an IP lookup result. It still sends about 250 DNS packets to do so, though, which feels excessive.

On 64c137d8988bf0645aa9d7dd3aa8863f31f78068, the behavior was still problematic. This result suggests that #3042 is likely responsible for this bug behavior and that the recently merged #3075 improved it.

"We" is Turnkey. We're using hickory-resolver in https://github.com/tkhq/qos/ .

```rust
use hickory_resolver::TokioResolver;
use hickory_resolver::{
    config::{NameServerConfig, ResolverConfig, ResolverOpts},
    proto::runtime::TokioRuntimeProvider,
};
use std::net::IpAddr;

#[tokio::main]
async fn main() {
    // Alibaba DNS; this resolver strips DNSSEC records
    let resolver_ip = IpAddr::V4([223, 5, 5, 5].into());

    // For comparison: regular Google DNS resolver
    // let resolver_ip = IpAddr::V4([8, 8, 8, 8].into());

    let resolver_config =
        ResolverConfig::from_parts(None, vec![], vec![NameServerConfig::udp(resolver_ip)]);

    // build a resolver from the explicit configuration above
    let mut resolver_builder =
        TokioResolver::builder_with_config(resolver_config, TokioRuntimeProvider::default());

    let mut resolver_options: ResolverOpts = ResolverOpts::default();
    resolver_options.validate = true;

    // set our improved resolver options
    *resolver_builder.options_mut() = resolver_options;
    let resolver = resolver_builder.build();

    // does not trigger the issue
    // let response = resolver.lookup_ip("www.example.com.").await.unwrap();
    let response = resolver.lookup_ip("sigfail.ippacket.stream").await;

    println!("{response:?}");
}
```
Cargo.toml:

```toml
[package]
name = "minimal_hickory_crate"
version = "0.1.0"
edition = "2024"

[dependencies]
hickory-resolver = { git = "https://github.com/hickory-dns/hickory-dns", rev = "88cb3033aeca19181066eb13d47662eba77060bf", features = [
    "tokio",
    "dnssec-ring", # for DNSSEC
] }
tokio = { version = "1.45.0" }
```

@marcus0x62 commented on GitHub (Jun 26, 2025):

I think the issue is that `verify_rrset` will attempt verification of a DS record that has no RRSIGs. `verify_default_rrset` will in turn call `find_ds_records`, which triggers a new lookup and a new verification, and so on until the DNSSEC depth limits are exceeded.

See the [attached log](https://github.com/user-attachments/files/20929539/infinite.txt).

A superficial fix that seems to work (but that I still need to put through the test suite) is to add a check in `verify_default_rrset` so that it does not call `find_ds_records` for a DS record with no RRSIGs.
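The guard described above can be sketched as follows. This is a toy model only: the type and function names (`RecordType`, `Rrset`, `verify_rrset`, the depth limit) are illustrative stand-ins, not the actual hickory-proto API or call graph.

```rust
/// Toy model of the recursion described above; all names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
enum RecordType {
    A,
    Ds,
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    /// A DS RRset arrived with no covering RRSIGs: nothing to validate.
    MissingRrsigs,
    /// Safety net: recursion went deeper than the configured limit.
    DepthExceeded,
}

struct Rrset {
    rtype: RecordType,
    rrsig_count: usize,
}

const MAX_DEPTH: usize = 8;

fn verify_rrset(rrset: &Rrset, depth: usize) -> Result<(), VerifyError> {
    if depth > MAX_DEPTH {
        return Err(VerifyError::DepthExceeded);
    }
    // The guard: an unsigned DS RRset cannot be proven secure, so fail fast
    // instead of triggering another DS lookup, a fresh verification, and so
    // on until the depth limit is exhausted.
    if rrset.rtype == RecordType::Ds && rrset.rrsig_count == 0 {
        return Err(VerifyError::MissingRrsigs);
    }
    // (Signature checking elided; this model only demonstrates the guard.)
    Ok(())
}

fn main() {
    let unsigned_ds = Rrset { rtype: RecordType::Ds, rrsig_count: 0 };
    assert_eq!(verify_rrset(&unsigned_ds, 0), Err(VerifyError::MissingRrsigs));

    let signed_ds = Rrset { rtype: RecordType::Ds, rrsig_count: 1 };
    assert_eq!(verify_rrset(&signed_ds, 0), Ok(()));

    let a = Rrset { rtype: RecordType::A, rrsig_count: 1 };
    assert_eq!(verify_rrset(&a, 0), Ok(()));
    println!("guard works");
}
```

The point of the guard is that the failure becomes immediate and local, rather than relying on the depth limit (or a timeout) to eventually stop the loop.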


@marcus0x62 commented on GitHub (Jun 26, 2025):

This should be resolved in `main` via #3092. @cr-tk please try your test again and let us know the results.


@cr-tk commented on GitHub (Jun 27, 2025):

@djc @marcus0x62
I've tested commit 82c85dcfdbee165e81d8392575b1183d644ae922, which includes the changes from #3092. The behavior has further improved: the lookup returns faster now and triggers just 7 DNS queries (1x A, 4x NS, 2x DS).

I have not looked into the details of the DNS queries and responses, but from my perspective the obvious problems of looping and excessive request counts seem fixed, at least for this particular combination of domain settings and DNS server behavior.

Thank you for the fast reaction and fixes!
Is there any chance these get backported to `0.25.x` in the future?

More broadly, I'm wondering if the code that does your timeout handling requires more attention. My naive expectation is that it should have stepped in and put a stop to whatever looping or recursion was going on. The lookup result would have been wrong/incomplete, but from my perspective that's still much better than getting stuck. This could be relevant to limit the impact of future bugs like this.
Additionally, it could be useful to count the number of DNS requests at some level and keep an eye on it in unit and integration tests. Without #3092, this problem would have looked fixed to a simple test that only checks for lookup success and the final lookup result.
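The query-budget idea above could look something like the following in a test harness. This is a hypothetical shim, not an existing hickory-resolver facility: a shared counter that a connection layer would bump once per outgoing request, so a test can assert an upper bound in addition to checking the lookup result.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical test shim: a cloneable counter that a connection layer could
/// bump once per outgoing DNS request.
#[derive(Clone, Default)]
struct QueryCounter(Arc<AtomicUsize>);

impl QueryCounter {
    fn record_query(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn total(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Stand-in for a lookup that issues `n` DNS requests through the shim.
fn fake_lookup(counter: &QueryCounter, n: usize) {
    for _ in 0..n {
        counter.record_query();
    }
}

fn main() {
    let counter = QueryCounter::default();
    fake_lookup(&counter, 7); // the post-fix behavior observed here: 7 queries
    // A regression test would assert both the result *and* a query budget,
    // so a fix that still sends thousands of packets fails loudly.
    assert!(counter.total() <= 10, "query budget exceeded: {}", counter.total());
    println!("queries issued: {}", counter.total());
}
```

Asserting a budget like this would have caught the pre-#3092 behavior even while the final lookup result looked correct.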


@djc commented on GitHub (Jun 27, 2025):

Great to hear this improves the situation!

I'm not sure we have the resources to backport these to 0.25.x. We could quite easily release a 0.26.0-alpha.2 which would contain these fixes, though.

> More broadly, I'm wondering if the code that does your timeout handling requires more attention. My naive expectation is that it should have stepped in and put a stop to whatever looping or recursion was going on. The lookup result would have been wrong/incomplete, but from my perspective that's still much better than getting stuck. This could be relevant to limit the impact of future bugs like this.

This is something we've noticed before. I tried a somewhat naive fix a few weeks ago and that did not seem to improve things.

In general, we have many improvements going on in the context of our push to get [HickoryDNS deployed at Let's Encrypt](https://github.com/hickory-dns/hickory-dns/issues/2725). If HickoryDNS is load-bearing for your product/team, it might be interesting for your company to purchase support (which my company offers).
