[GH-ISSUE #606] Lookups fail when any nameserver returns SERVFAIL #248

Closed
opened 2026-03-07 23:00:28 +03:00 by kerem · 13 comments
Owner

Originally created by @stuartnelson3 on GitHub (Nov 7, 2018).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/606

Describe the bug
SRV lookup fails when using a resolv.conf with nameservers in the 10.10.x.x range.

err="pool failed to run"

To Reproduce
Perform an SRV lookup with a nameserver in this range

Expected behavior
I expected a lookup to work.

System:

  • OS: archlinux
  • Architecture: x86_64
  • rustc version: rustc 1.30.0 (da5f414c2 2018-10-24)

Version:
Crate: resolver
Version: 0.10

These nameservers are being pushed by the network, updating my NetworkManager-managed resolv.conf automatically. Resolution works as normal with dig, but the library doesn't like this address range.


@bluejekyll commented on GitHub (Nov 7, 2018):

Could you share your /etc/resolv.conf?

This is the first report I’ve seen like this. Also, to confirm: is it only the SRV record type that’s having this issue?


@stuartnelson3 commented on GitHub (Nov 7, 2018):

This also happens with A record lookups. I don't have any other lookups at hand to check, but since both fail with the same pool failed to run message, I'm guessing it applies to more or all record types.

```
$ cat /etc/resolv.conf
# Generated by NetworkManager
search ber.office.example.net office.example.net
nameserver 10.10.1.101
nameserver 10.10.1.102
```

When using different nameservers (for example, 10.33.x.x), it works fine.

It does seem weird that it would be a specific range? I don't know what's running at the 10.10.x.x addresses or their settings, but I would expect an error different from pool failed to run.


@bluejekyll commented on GitHub (Nov 7, 2018):

Have you verified that both DNS endpoints are responding properly with dig?

Also, to cut out the search path as a potential issue: in trust-dns you can query an FQDN by giving a fully qualified name of the form www.example.net., i.e. the final dot on the name marks it as fully qualified.
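
For illustration only, here is a minimal sketch of why the trailing dot matters. It is not trust-dns code: `candidate_names` is a hypothetical helper, and real resolvers also consult `ndots` when deciding the order in which to try names, which this sketch ignores.

```rust
/// Build the list of names a stub resolver might try for `name`.
/// Hypothetical helper; simplified (no `ndots` handling).
fn candidate_names(name: &str, search: &[&str]) -> Vec<String> {
    // A trailing dot marks the name as fully qualified: skip the search list.
    if name.ends_with('.') {
        return vec![name.to_string()];
    }
    // Otherwise append each search domain in turn.
    let mut names: Vec<String> = search
        .iter()
        .map(|suffix| format!("{}.{}.", name, suffix.trim_end_matches('.')))
        .collect();
    // Finally try the name as given, made absolute.
    names.push(format!("{}.", name));
    names
}

fn main() {
    // Search domains taken from the resolv.conf posted in this thread.
    let search = ["ber.office.example.net", "office.example.net"];
    println!("{:?}", candidate_names("www.example.net.", &search));
    println!("{:?}", candidate_names("www", &search));
}
```

With the trailing dot only one name is ever queried, so the search list cannot be the cause of a failure.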


@stuartnelson3 commented on GitHub (Nov 7, 2018):

The queries I was testing were fully qualified. I confirmed that dig against example.com. works against those endpoints:

```
# dig A example.com. +short @10.10.1.102
93.184.216.34
```

It seems the issue might be that for the srv query (internal name that I won't share) that I was issuing to the above address, the server responds with SERVFAIL?


@bluejekyll commented on GitHub (Nov 7, 2018):

It’s possible that we’re treating SERVFAIL responses incorrectly in the library.

You said you were getting this for A records as well, though, right?


@stuartnelson3 commented on GitHub (Nov 8, 2018):

Correct, I'm seeing this behavior for A record lookups as well


@bluejekyll commented on GitHub (Nov 8, 2018):

Ok, given your comment:

> It seems the issue might be that for the srv query (internal name that I won't share) that I was issuing to the above address, the server responds with SERVFAIL?

Are both the A and SRV queries getting SERVFAIL? What I really want to understand is the behavior you'd like in that case, so that I know what we need to fix here. Also, I'm not clear on where pool failed to run is coming from. I can't find that string in the trust-dns library or in tokio-rs.

At the moment I'm not clear on either what the root issue is or what the fix should be. Are you able to do some more research to track this down?

Logs from the trust-dns libs might be useful. If you use the env-logger, this enables the logging there:

export RUST_LOG=trust_dns=debug,trust_dns_resolver=trace


@stuartnelson3 commented on GitHub (Nov 9, 2018):

Ah, ok, sorry for all the runaround.

The error being returned is in fact for the SERVFAIL:

```
DNS Error: Server Failure
```

What's the behavior with respect to choosing nameservers? My /etc/resolv.conf has 9 nameservers listed, and 2 of the 9 will fail. Does the library use one and, if it fails, return that failure?

I found this issue about noting the differences between glibc and musl (https://github.com/bluejekyll/trust-dns/issues/249), but the readme seems to only say "NameServer pools with performance based priority usage". I think it'd be cool to send out parallel requests and return the first successful response, or a failure only if all of them fail.
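
To make the suggestion concrete, here is a rough sketch of "query everything in parallel, first success wins", using only std threads and a channel. It does not use the trust-dns API: `QueryError`, `QueryFn`, `first_success` and `mock_query` are all made-up names, and the mock simply pretends that the two 10.10.x.x servers fail.

```rust
use std::net::IpAddr;
use std::sync::mpsc;
use std::thread;

#[derive(Debug, Clone, PartialEq)]
enum QueryError {
    ServFail,
    Timeout,
}

/// Hypothetical stand-in for "send one query to one nameserver".
type QueryFn = fn(IpAddr, &str) -> Result<Vec<String>, QueryError>;

/// Query every server at once; return the first success,
/// or every collected error if all servers failed.
fn first_success(
    servers: &[IpAddr],
    name: &str,
    query: QueryFn,
) -> Result<Vec<String>, Vec<QueryError>> {
    let (tx, rx) = mpsc::channel();
    for &server in servers {
        let tx = tx.clone();
        let name = name.to_string();
        thread::spawn(move || {
            // The receiver may already be gone if another server won.
            let _ = tx.send(query(server, &name));
        });
    }
    drop(tx); // the loop below ends once every worker has reported

    let mut errors = Vec::new();
    for result in rx {
        match result {
            Ok(records) => return Ok(records),
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}

/// Mock: the 10.10.x.x servers from this issue fail, anything else answers.
fn mock_query(server: IpAddr, _name: &str) -> Result<Vec<String>, QueryError> {
    match server.to_string().as_str() {
        "10.10.1.101" => Err(QueryError::ServFail),
        "10.10.1.102" => Err(QueryError::Timeout),
        other => Ok(vec![format!("answer from {other}")]),
    }
}

fn main() {
    let servers: Vec<IpAddr> = ["10.10.1.101", "10.33.0.1"]
        .iter()
        .map(|s| s.parse().expect("valid IP literal"))
        .collect();
    match first_success(&servers, "example.com.", mock_query) {
        Ok(records) => println!("resolved: {records:?}"),
        Err(errors) => println!("all servers failed: {errors:?}"),
    }
}
```

The obvious cost is that every lookup sends one query per configured nameserver.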


@bluejekyll commented on GitHub (Nov 9, 2018):

Ah. I understand now. We could add some options to allow parallel queries, but I wanted to be cognizant of network load which is why this didn’t happen before.

I now understand your issue better. The SERVFAIL is being regarded as a terminal query failure, meaning no query for this name will succeed. We can improve this by trying again with the other nodes. Currently the library only continues to other nameservers on connectivity failure.

Many local resolvers end up improperly responding to domains for which they are not authoritative, but as though they are authoritative. We might want to make these configurable options.

To boil this down, there are two changes being asked for:

  1. perform multiple queries at once
  2. on SERVFAIL or other possibly inaccurate upstream responses, continue to other nameservers.

Is this accurate?


@stuartnelson3 commented on GitHub (Nov 9, 2018):

> To boil this down, there are two changes being asked for:
>
>   1. perform multiple queries at once
>   2. on SERVFAIL or other possibly inaccurate upstream responses, continue to other nameservers.
>
> Is this accurate?

2 is definitely something I would want. 1, as you mentioned, comes with its own trade-offs. While I would like that behavior, others might not.

If this conflicts with how you want the library to work that's 100% ok, just let me know!


@bluejekyll commented on GitHub (Nov 10, 2018):

I think we can support both. We should create a separate issue for 1, and we'll fix 2 as an answer to this issue. Also, I think we should rename this to "Lookups fail when any nameserver returns SERVFAIL".


@stuartnelson3 commented on GitHub (Nov 12, 2018):

How is the first nameserver to query chosen? Reading about glibc and musl's behavior, it seems they respect the order in resolv.conf. In my situation, the failing servers were in the last positions of the list. Perhaps being consistent with that behavior would be appropriate?


@bluejekyll commented on GitHub (Nov 12, 2018):

I’m aware of glibc and musl’s ordering.

The ordering I’ve implemented is based on success vs. failure rate and connected vs. unconnected streams. In the future I also want to take latency into consideration.
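
As a rough illustration of that kind of ordering (not the actual trust-dns implementation), the sketch below sorts servers by connection state first and success rate second. The `ServerStats` type, the precedence of the two criteria, and treating an untried server as having a perfect success rate are all assumptions made for the example.

```rust
use std::cmp::Ordering;

/// Hypothetical per-nameserver bookkeeping.
#[derive(Debug, Clone)]
struct ServerStats {
    addr: &'static str,
    connected: bool,
    successes: u32,
    failures: u32,
}

impl ServerStats {
    /// Fraction of queries that succeeded; untried servers count as 1.0 (assumption).
    fn success_rate(&self) -> f64 {
        let total = self.successes + self.failures;
        if total == 0 {
            1.0
        } else {
            f64::from(self.successes) / f64::from(total)
        }
    }
}

/// Sort so that connected servers come first, then higher success rates.
fn rank(servers: &mut [ServerStats]) {
    servers.sort_by(|a, b| {
        b.connected.cmp(&a.connected).then_with(|| {
            b.success_rate()
                .partial_cmp(&a.success_rate())
                .unwrap_or(Ordering::Equal)
        })
    });
}

fn main() {
    let mut servers = vec![
        ServerStats { addr: "10.10.1.101", connected: false, successes: 9, failures: 1 },
        ServerStats { addr: "10.10.1.102", connected: true, successes: 1, failures: 1 },
        ServerStats { addr: "10.33.0.1", connected: true, successes: 5, failures: 0 },
    ];
    rank(&mut servers);
    for s in &servers {
        println!("{} connected={} rate={:.2}", s.addr, s.connected, s.success_rate());
    }
}
```

Unlike glibc and musl, an ordering like this ignores the position of a server in resolv.conf.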

With this fix, it will try all nameservers, continuing through the list in certain cases: currently connection failures (timeouts) and, with this change, SERVFAIL. See #613.
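
The continue-on-failure behavior can be sketched as follows. This is a simplified model, not the code from #613: `QueryError`, `try_next_server` and `lookup` are hypothetical names, and treating NXDOMAIN as a terminal answer is an assumption made for the example.

```rust
#[derive(Debug, Clone, PartialEq)]
enum QueryError {
    Timeout,
    ServFail,
    NxDomain,
    NoServers,
}

impl QueryError {
    /// Errors that say something about the server rather than the name,
    /// so another nameserver may still answer.
    fn try_next_server(&self) -> bool {
        matches!(self, QueryError::Timeout | QueryError::ServFail)
    }
}

/// Try each server in turn; stop at the first success or terminal error.
fn lookup<F>(servers: &[&str], name: &str, mut query: F) -> Result<Vec<String>, QueryError>
where
    F: FnMut(&str, &str) -> Result<Vec<String>, QueryError>,
{
    let mut last_err = None;
    for &server in servers {
        match query(server, name) {
            Ok(records) => return Ok(records),
            // Retryable: remember the error and move on to the next server.
            Err(e) if e.try_next_server() => last_err = Some(e),
            // Anything else is treated as a final answer for this name.
            Err(e) => return Err(e),
        }
    }
    Err(last_err.unwrap_or(QueryError::NoServers))
}

fn main() {
    let servers = ["10.10.1.101", "10.10.1.102", "10.33.0.1"];
    // Mock query: the 10.10.x.x servers answer SERVFAIL, as in this issue.
    let result = lookup(&servers, "example.com.", |server, _name| {
        if server.starts_with("10.10.") {
            Err(QueryError::ServFail)
        } else {
            Ok(vec!["93.184.216.34".to_string()])
        }
    });
    println!("{result:?}");
}
```

If every server returns a retryable error, the last one seen is what the caller gets back.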
