[GH-ISSUE #2788] recursor fails when cname is returned from an intermediate NS request #1062

Closed
opened 2026-03-16 01:29:32 +03:00 by kerem · 6 comments

Originally created by @DirectXMan12 on GitHub (Feb 19, 2025).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2788

Describe the bug

a CNAME response to an NS query seems to throw the recursor for a loop -- it (correctly?) skips the CNAME, but then goes "oh, i got an empty list of servers, oops idk what to do" instead of trying a longer prefix. this seems to be... backwards? from what a couple of other recursors do.

for instance, suppose we're trying to resolve docs.redhat.com.edgekey.net (skipping root querying):

this seems to be what hickory does now:

  1. NS query net for edgekey.net, get back a list of nameservers
  2. NS query $ONE_OF_THOSE for com.edgekey.net, get back 0 nameservers and 1 CNAME
  3. discard the CNAME, leaving us with an empty set of nameservers
  4. attempt to query that empty set, and fail with no connections available (afaict? i didn't dig too deep into the code, so this is my guess as to why this error message is being shown)

other resolvers [^1] seem to do:

  1. NS query net for docs.redhat.com.edgekey.net, get back an authority section pointing to edgekey.net
  2. NS query $ONE_OF_THOSE for docs.redhat.com.edgekey.net, get back the actual response.

To Reproduce

configure hickory as a recursor:

[[zones]]
zone = "."
zone_type = "External"
[[zones.stores]]
ns_cache_size = 1024
ns_recursion_limit = 16
record_cache_size = 1048576
recursion_limit = 12
roots = "/path/to/the/roots.zone"
type = "recursor"

then try querying for docs.redhat.com (or docs.redhat.com.edgekey.net to skip the intervening CNAME).

Expected behavior

We get back a valid CNAME (in the above case, pointing to somewhere on akamai's network).

System:

  • OS: linux (nixos)
  • Architecture: aarch64
  • Version: unstable
  • rustc version: 1.84 (i believe)

Version:
Crate: recursor (via the binary)
Version: 0.25.0-alpha.5

Additional context

it kinda? seems like maybe edgekey shouldn't be returning those CNAMEs on the NS record responses if i had to guess, but also it kinda seems like hickory's deviation from other resolvers is also causing issues.


[^1]: i tried dig +trace, then to confirm i found a random dns tracing tool online at https://simpledns.plus/lookup-dg and also "normal" public resolvers seem to handle this fine (e.g. 8.8.8.8).


@djc commented on GitHub (Feb 19, 2025):

@divergentdave this should probably be part of #2725?


@divergentdave commented on GitHub (Feb 19, 2025):

The reason Hickory DNS is querying for only com.edgekey.net is that it is doing "QNAME minimization". This has better privacy properties, but may require more queries when zone cuts are more than one label apart. Other resolvers also support QNAME minimization, though it is often controlled by a configuration parameter.

The correct behavior here is described in step 6c of the algorithm in RFC 9156 (https://www.rfc-editor.org/rfc/rfc9156.html#section-3-3.7.2.3). When we get the NOERROR response with a CNAME, since the iterative query was not for the final QNAME, we need to add another label and make another query to the same server.
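As a rough illustration of that rule, here is a toy Python sketch of the label-by-label walk. It is not Hickory DNS code and it simplifies RFC 9156 heavily (no caching, no DNAME handling, no query-count limit). The names `walk` and `toy_query` are hypothetical, and `toy_query` only imitates the shape of the edgekey.net case from this issue. The assumption that every intermediate name answers NOERROR is made up for the example.

```python
def labels(name: str) -> list[str]:
    """Split a fully qualified name into labels, dropping the empty root label."""
    return [label for label in name.split(".") if label]


def walk(qname: str, ancestor: str, query) -> list[tuple[str, str, str]]:
    """Trace the minimized queries sent on the way to qname.

    query(zone, name) is a caller-supplied stand-in for asking one of the
    zone's name servers; it returns "referral", "noerror", or "nxdomain".
    """
    target = labels(qname)
    depth = len(labels(ancestor))
    trace = []
    while depth + 1 < len(target):
        depth += 1  # add one more label from QNAME
        child = ".".join(target[-depth:]) + "."
        kind = query(ancestor, child)
        trace.append((ancestor, child, kind))
        if kind == "referral":
            ancestor = child  # found a zone cut: switch to the child's servers
        elif kind == "nxdomain":
            raise LookupError(f"{child} does not exist")
        # Any other NOERROR answer (NODATA, CNAME, ...): keep the same
        # servers and let the loop add another label.
    # The child now equals QNAME, so send the original query.
    trace.append((ancestor, qname, "final"))
    return trace


def toy_query(zone: str, name: str) -> str:
    """Hypothetical servers shaped like the edgekey.net case in this issue."""
    if zone == "net." and name == "edgekey.net.":
        return "referral"
    return "noerror"  # e.g. a CNAME at an intermediate name


for step in walk("docs.redhat.com.edgekey.net.", "net.", toy_query):
    print(step)
```

The point of the sketch is the last branch: a CNAME at com.edgekey.net does not change which servers are asked next, it only means the next query carries one more label.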

The fact that we can get both docs.redhat.com.edgekey.net. IN CNAME e21727.dsca.akamaiedge.net. and com.edgekey.net. IN CNAME e19.b.akamaiedge.net. from the same authoritative server seems weird to me, but oh well.

> @divergentdave this should probably be part of #2725?

Agreed, will add.


@DirectXMan12 commented on GitHub (Feb 26, 2025):

aah, cool, i did not know about qname minimization, will definitely add rfc 9156 to my backlog of reading material, that's neat :-D.


@bluejekyll commented on GitHub (Mar 2, 2025):

@divergentdave what's your opinion on this? offer an option to disable QNAME minimization?


@divergentdave commented on GitHub (Mar 2, 2025):

That's an option, but first and foremost we should fix the recursor so that it follows RFC 9156's algorithm


@divergentdave commented on GitHub (Apr 8, 2025):

I wrote up a test to reproduce this, and I was surprised to see that resolution succeeds if there is no CNAME at the name in between zone cuts. However, this seems like it only happens by accident, and it is fragile in how it depends on authoritative name server responses.

Here's what happens when looking up www.b.a.testing. IN A in the test. There are separate name servers for the ., testing., and b.a.testing. zones.

Zone cuts are two labels apart, no CNAME record in the middle

  • Name servers for testing. are located
  • Query the testing. zone for a.testing. IN NS
  • A no data response is received, since this name is an empty non-terminal
  • This response is transformed into ProtoErrorKind::NoRecordsFound, because its answer section is empty
  • RecursorDnsHandle::lookup() logs the error and propagates it up
  • RecursorDnsHandle::ns_pool_for_zone() logs "ns for a.testing forwarded to testing. via SOA record"
  • We then treat the name servers for the testing. zone as the name servers for a.testing. as well, though a.testing. is not really its own zone
  • Recursor pool logs "querying a.testing. for b.a.testing. IN NS"
  • The name servers for the leaf zone are queried for the original recursive query

This effectively achieves QNAME minimization with a different algorithm, but it's misleading that we refer to a.testing. as being a zone in the process. This only works if the authoritative server includes an SOA record in its response, which could introduce another compatibility issue. There is also an extra entry in name_server_cache under a.testing., and extra connections to the authoritative servers.
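The fallback described above can be modeled roughly like this. It is a toy Python sketch, not the actual `ns_pool_for_zone()` logic: `Response`, `pool_for_name`, and the server name `ns.testing.` are hypothetical. Returning an empty pool when no SOA is present is an assumption made only to show why the path depends on the SOA.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Response:
    """Toy response: record types in the answer section, SOA owner in authority."""
    answer_types: list = field(default_factory=list)
    soa_owner: Optional[str] = None


def pool_for_name(name: str, response: Response, pools: dict) -> list:
    """Pick the servers to use for `name` after an NS query for it."""
    if response.answer_types:
        raise NotImplementedError("only the no-data path is modeled here")
    if response.soa_owner is None:
        # Assumption: without an SOA nothing points back at the parent zone.
        return []
    # Reuse the enclosing zone's servers and record them under `name`,
    # even though `name` is not really its own zone.
    pools[name] = pools[response.soa_owner]
    return pools[name]


pools = {"testing.": ["ns.testing."]}
print(pool_for_name("a.testing.", Response(soa_owner="testing."), pools))
print(sorted(pools))  # note the extra entry for a.testing.
```

The extra `a.testing.` key in `pools` corresponds to the extra cache entry mentioned above.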

Zone cuts are two labels apart, with CNAME record in the middle

  • Name servers for testing. are located
  • Query the testing. zone for a.testing. IN NS
  • The response includes a CNAME in the answer section
  • ProtoError::from_response() calls DnsResponse::contains_answer(), which, for this query type, only requires that the answers section be non-empty. Thus, ProtoError::from_response() returns the response inside Ok(...).
  • RecursorDnsHandle::lookup() caches the response and returns it
  • RecursorDnsHandle::ns_pool_for_zone() logs "response is not NS ...; skipping" as it ignores the CNAME record
  • The name server pool that is constructed is an empty set
  • The next query for b.a.testing. IN NS fails with "no connections available"

(as discussed in previous comments)
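For comparison, the CNAME path can be reduced to a few lines. Again this is a toy Python sketch rather than Hickory DNS code: `contains_answer`, `build_ns_pool`, and `send` are simplified stand-ins for the functions named in the list above, and the CNAME target `elsewhere.example.` is made up.

```python
def contains_answer(answers: list) -> bool:
    """Toy check: any record in the answer section counts as an answer."""
    return len(answers) > 0


def build_ns_pool(answers: list) -> list:
    """Keep only NS targets; anything else (such as a CNAME) is skipped."""
    return [target for (rtype, target) in answers if rtype == "NS"]


def send(pool: list, name: str, qtype: str) -> str:
    """Pretend to send a query to the first server in the pool."""
    if not pool:
        raise ConnectionError("no connections available")
    return f"{qtype} query for {name} sent to {pool[0]}"


# Answer section of the a.testing. IN NS response: a lone CNAME.
answers = [("CNAME", "elsewhere.example.")]

print(contains_answer(answers))  # the response is accepted as a success
pool = build_ns_pool(answers)    # but no NS records survive the filter
try:
    send(pool, "b.a.testing.", "NS")
except ConnectionError as err:
    print(err)
```

The response passes the "has an answer" check and fails the "is an NS record" filter, so the error only surfaces one query later, when the empty pool is used.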
