mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 03:05:51 +03:00
[GH-ISSUE #3090] Excessive queries and stuck lookup in hickory-resolver with DNSSEC validation enabled #1132
Originally created by @cr-tk on GitHub (Jun 26, 2025).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/3090
Describe the bug
On some combinations of DNS resolvers and domains, hickory-resolver appears to behave incorrectly on lookups if DNSSEC is active and `validate = true` is requested. It continuously retries lookups and fires off thousands of queries over a long period of time (as seen in tcpdump/Wireshark). It does not abort after the configured number of attempts or on timeout. In this particular example, I observed a rate of about 30 DNS requests per second. This may be related to #3042.
We ran into this while testing edge cases after enabling DNSSEC. Observed with the Alibaba DNS (see wiki) which is known to not support DNSSEC, in combination with resolving a test domain with deliberately broken DNSSEC settings.
From our view, this has negative security implications in case the chosen upstream resolver, or an equivalent MITM attacker on the network path to the resolver, can deliberately trigger this broken behavior.
To Reproduce
Example program:
Cargo.toml:
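The bodies of the example program and Cargo.toml were not captured by this mirror. For context only, a dependency section for such a reproduction might look roughly like this (the version pin and feature name are assumptions, not the reporter's actual file):

```toml
# Hypothetical reconstruction -- the reporter's actual Cargo.toml was
# not preserved in this mirror.
[dependencies]
# DNSSEC validation requires enabling an explicit crypto feature;
# "dnssec-ring" is assumed here.
hickory-resolver = { version = "0.25.2", features = ["dnssec-ring"] }
tokio = { version = "1", features = ["full"] }
```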
Expected behavior
I expect the resolver to deliver a successful lookup result or an error with a minimal number of queries. Notably, this should occur within the specified timeout duration. Based on the code documentation, I would expect the timeout to kick in within `2 * 5s = 10s` in the provided example, which it does not. During testing, the provided example ran for several minutes before a manual abort.

System:
OS: Linux
rustc version: 1.87.0 (stable)
Version:
Crate: hickory-resolver
Version: 0.25.2
The last available crate version v0.26.0-alpha.1 also shows the same general bug behavior, but with a different pattern of DNS requests.

@djc commented on GitHub (Jun 26, 2025):
Can you try with current `main`? Who's "we"?

@cr-tk commented on GitHub (Jun 26, 2025):
Testing with the current `main` required some code changes, because some APIs changed since the last alpha version. I've now adapted the example code to test against the newest 88cb3033ae.

Interestingly, the behavior improved significantly: the lookup now finishes within the timeout period of one attempt, and delivers an IP lookup. It still sends out about ~250 DNS packets to do so, though, which feels excessive.

On 64c137d898, the behavior was still problematic. This result suggests that #3042 is likely responsible for this bug behavior and that the recently merged #3075 improved it.

"We" is Turnkey. We're using hickory-resolver in https://github.com/tkhq/qos/ .

@marcus0x62 commented on GitHub (Jun 26, 2025):
I think the issue is that verify_rrset will attempt verification of a DS record with no RRSIGs. verify_default_rrset will in turn call find_ds_records, which will trigger a new lookup and a new verification, and so on until the DNSSEC depth limits are exceeded.
See attached log.
A superficial fix that seems to work -- but that I still need to put through the test suite -- is to add a check in verify_default_rrset so that it does not call find_ds_records for a DS record with no RRSIGs.
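The recursion described above can be sketched as a toy model (all names and the depth limit here are hypothetical stand-ins for illustration, not hickory-dns's actual types or code):

```rust
// Simplified, self-contained model of the DS verification loop.
// verify_ds, DsRecord, and MAX_DNSSEC_DEPTH are hypothetical.
const MAX_DNSSEC_DEPTH: usize = 12;

#[derive(Debug, PartialEq)]
enum Proof {
    Insecure, // provably unsigned: stop here
    Bogus,    // validation failed (e.g. depth limit exceeded)
}

struct DsRecord {
    rrsigs: Vec<String>,
}

fn verify_ds(ds: &DsRecord, depth: usize) -> Proof {
    if depth > MAX_DNSSEC_DEPTH {
        // Without the early return below, an unsigned DS would spin
        // through repeated lookups until it hits this limit.
        return Proof::Bogus;
    }
    // The sketched fix: do not chase parent DS records for a DS RRset
    // that carries no RRSIGs at all.
    if ds.rrsigs.is_empty() {
        return Proof::Insecure;
    }
    // Signature checks would go here; recurse toward the root.
    verify_ds(ds, depth + 1)
}

fn main() {
    let unsigned = DsRecord { rrsigs: Vec::new() };
    assert_eq!(verify_ds(&unsigned, 0), Proof::Insecure);
    let signed = DsRecord { rrsigs: vec!["rrsig".to_string()] };
    // This toy recursion bottoms out at the depth limit.
    assert_eq!(verify_ds(&signed, 0), Proof::Bogus);
    println!("unsigned DS short-circuits instead of looping");
}
```

The point of the guard is that an unsigned DS RRset terminates immediately instead of triggering another round of lookups.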
@marcus0x62 commented on GitHub (Jun 26, 2025):
This should be resolved in main via #3092. @cr-tk please try your test again and let us know the results.
@cr-tk commented on GitHub (Jun 27, 2025):
@djc @marcus0x62
I've tested commit rev 82c85dcfdb, which has the changes of #3092. The behavior has further improved: the lookup returns faster now, and triggers just 7 DNS queries (1x A record, 4x NS, 2x DS).

I have not looked into the details of the DNS queries and responses, but from my perspective, the obvious problems of looping and excessive request numbers seem fixed, at least for this particular combination of domain settings and DNS server behavior.
Thank you for the fast reaction and fixes!
Is there any chance these get backported to 0.25.x in the future?

More broadly, I'm wondering if the code that does your timeout handling requires more attention. My naive expectation is that it should have stepped in and put a stop to whatever looping or recursion was going on. The lookup result would have been wrong/incomplete, but from my perspective that's still much better than getting stuck. This could be relevant to limit the impact of future bugs like this.
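A coarse outer backstop of the kind wished for above can be sketched in plain std Rust (the function name and error string are hypothetical; hickory-resolver's real timeout machinery works differently):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a worker thread and give up after `limit`, no matter
/// how the work loops internally. A generic illustration of a hard
/// outer deadline, not hickory-resolver's actual timeout code.
fn with_deadline<T, F>(work: F, limit: Duration) -> Result<T, &'static str>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver may have timed out already.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).map_err(|_| "deadline exceeded")
}

fn main() {
    // A fast lookup completes within the deadline...
    assert_eq!(with_deadline(|| 42u32, Duration::from_secs(1)), Ok(42));
    // ...while a stuck one is cut off instead of running for minutes.
    let stuck = with_deadline(
        || -> u32 { loop { thread::sleep(Duration::from_millis(20)) } },
        Duration::from_millis(100),
    );
    assert_eq!(stuck, Err("deadline exceeded"));
    println!("outer deadline enforced");
}
```

Such an outer deadline would have returned an error instead of letting the lookup loop for minutes, at the cost of possibly masking the underlying bug.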
Additionally, it could be useful to think about counting the number of DNS requests on some level and keeping an eye on them in the unit tests / integration tests. Without #3092, this problem would have looked fixed to a simple test that checks for lookup success and the lookup end result.
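The query-counting idea could look roughly like this in a test harness (QueryCounter is a hypothetical helper, not an existing hickory-dns API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical test helper: every outgoing DNS query increments a
/// shared counter, so tests can assert on query volume rather than
/// only on whether the final lookup succeeded.
#[derive(Clone, Default)]
struct QueryCounter(Arc<AtomicUsize>);

impl QueryCounter {
    fn record_query(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    fn count(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let counter = QueryCounter::default();
    // Imagine the resolver's connection layer calling record_query()
    // once per packet sent; here we simulate the 7 queries observed
    // after the fix.
    for _ in 0..7 {
        counter.record_query();
    }
    // A regression test could then bound the query volume:
    assert!(counter.count() <= 10, "lookup used too many DNS queries");
    println!("observed {} queries", counter.count());
}
```

With such a bound in place, a regression like this one would fail the test suite even though the lookup itself eventually succeeds.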
@djc commented on GitHub (Jun 27, 2025):
Great to hear this improves the situation!
I'm not sure we have the resources to backport these to 0.25.x. We could quite easily release a 0.26.0-alpha.2 which would contain these fixes, though.
This is something we've noticed before. I tried a somewhat naive fix a few weeks ago and that did not seem to improve things.
In general, we have many improvements going on in the context of our push to get HickoryDNS deployed at Let's Encrypt. If HickoryDNS is load-bearing for your product/team, it might be interesting for your company to purchase support (which my company offers).