[GH-ISSUE #2466] Recursor: follow NS referrals other than the first one #997

Closed
opened 2026-03-16 01:12:54 +03:00 by kerem · 2 comments

Originally created by @divergentdave on GitHub (Sep 18, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2466

There are three places in `RecursorDnsHandle` where we call `take(1)` on an iterator of NS records and then look up A/AAAA records. If the first eligible name server has multiple addresses, we set up a name server pool covering all of them, but we only ever use that one NS record. This means the recursor may only contact primary name servers, and never secondary name servers.
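
To make the effect concrete, here is a minimal sketch. It is not the actual `RecursorDnsHandle` code: `pool_addresses`, `example_table`, and the server names are hypothetical, and a plain `HashMap` stands in for the A/AAAA lookups.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Parse an address literal (helper for the example data only).
fn ip(s: &str) -> IpAddr {
    s.parse().expect("valid IP literal")
}

/// Hypothetical lookup results: name server name -> A/AAAA addresses.
fn example_table() -> HashMap<&'static str, Vec<IpAddr>> {
    HashMap::from([
        ("ns1.example.net.", vec![ip("192.0.2.1"), ip("2001:db8::1")]),
        ("ns2.example.net.", vec![ip("198.51.100.1")]),
    ])
}

/// Collect pool addresses from at most `limit` NS names (`None` means all).
/// `Some(1)` mimics the `take(1)` behavior described in this issue.
fn pool_addresses(
    ns_names: &[&str],
    table: &HashMap<&str, Vec<IpAddr>>,
    limit: Option<usize>,
) -> Vec<IpAddr> {
    ns_names
        .iter()
        .take(limit.unwrap_or(usize::MAX))
        .filter_map(|name| table.get(*name))
        .flatten()
        .copied()
        .collect()
}
```

With `Some(1)` the pool only ever holds the addresses of `ns1.example.net.`; with `None` it also includes the address of `ns2.example.net.`.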

This has first-order impacts on resilience because we can't fall back properly to the secondary, and second-order impacts on resilience because it increases the likelihood of getting rate-limited by the primary name server.

We should at least pick an NS record at random, or better yet, load balance between name server names based on past request statistics, much as we currently do between different IP addresses for the same name server name. RFC 1034 and RFC 1035 provide a recommended algorithm that is more sophisticated. For best reliability, we may want to look up addresses for one or two NS records, try making requests to some of those addresses, and then look up addresses for other NS records only if the initial requests are taking too long and we need more addresses.
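
A rough sketch of the statistics-based ordering and the incremental lookup idea follows. It assumes a hypothetical `NsStats` record per name server name and a simple ranking rule (names with more failures than successes go last, then lower smoothed round-trip time first). This is neither the RFC 1034 algorithm nor existing hickory-dns code; `rank_name_servers` and `lookup_batches` are illustrative names only.

```rust
use std::collections::HashMap;
use std::slice::Chunks;
use std::time::Duration;

/// Hypothetical per-name-server statistics (not a hickory-dns type).
#[derive(Clone, Copy, Debug, Default)]
struct NsStats {
    successes: u32,
    failures: u32,
    /// Smoothed round-trip time, if any request has completed.
    srtt: Option<Duration>,
}

/// Order NS names so that mostly-failing servers come last and, within each
/// group, lower SRTT comes first. Names without statistics are treated as
/// healthy with zero SRTT, so they get tried early.
fn rank_name_servers<'a>(
    ns_names: &[&'a str],
    stats: &HashMap<&str, NsStats>,
) -> Vec<&'a str> {
    let mut ranked: Vec<&'a str> = ns_names.to_vec();
    ranked.sort_by_key(|name| {
        let s = stats.get(*name).copied().unwrap_or_default();
        (s.failures > s.successes, s.srtt.unwrap_or(Duration::ZERO))
    });
    ranked
}

/// Yield the ranked names in small batches, so addresses for later batches
/// are only looked up if the earlier ones turn out to be too slow.
fn lookup_batches<'a>(ranked: &'a [&'a str], batch: usize) -> Chunks<'a, &'a str> {
    ranked.chunks(batch.max(1))
}
```

A caller would resolve addresses for the first batch, start querying, and pull the next batch from the iterator only when it needs more addresses.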

Would it be appropriate to put all the connections for different name servers in the same `NameServerPool` for a zone, rather than just connections for different addresses of one name server?

kerem closed this issue 2026-03-16 01:12:59 +03:00

@marcus0x62 commented on GitHub (Sep 18, 2024):

This has been on my list for a while. The short answer is that the `take(1)` calls can be removed without any issue (I almost included that change in the recent infinite recursion PR, but there was already a lot going on in that PR).

That doesn't fix the problem of tracking nameserver reliability, but it does add resiliency to the lookup process for the scenario where the parent nameserver(s) don't return glue records. Note that for lookups where the parent nameservers do return glue records, we will add whatever number of records are returned to the NS pool for the zone.

What this means, practically speaking, is that domains that have in-domain nameservers are very likely to have all or most of the NS records in their NS pool, while domains with out-of-domain nameservers are likely to have only a single entry (or a single v4 and a single v6 entry) in their NS pool.
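
The glue versus no-glue distinction can be sketched roughly as follows. This is an illustration under assumptions, not the recursor's code: `split_by_glue` is a hypothetical helper, and the `glue` map stands in for the A/AAAA records found in a referral's additional section.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Split a referral's NS names into (addresses available from glue,
/// names that still need a separate A/AAAA lookup).
fn split_by_glue<'a>(
    ns_names: &[&'a str],
    glue: &HashMap<&str, Vec<IpAddr>>,
) -> (Vec<IpAddr>, Vec<&'a str>) {
    let mut pool = Vec::new();
    let mut needs_lookup = Vec::new();
    for name in ns_names {
        match glue.get(*name) {
            // Glue present: every returned address can join the pool.
            Some(addrs) if !addrs.is_empty() => pool.extend(addrs.iter().copied()),
            // No glue: the name has to be resolved separately.
            _ => needs_lookup.push(*name),
        }
    }
    (pool, needs_lookup)
}
```

In terms of this sketch, the behavior described above corresponds to resolving only the first entry of the names that need a lookup, which is why zones with out-of-domain name servers tend to end up with a single name server in their pool.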


@divergentdave commented on GitHub (Nov 15, 2024):

This was fixed by #2522. Both `ns_pool_for_zone()` and `ns_pool_for_referral()` now use all NS records to build pools.
