mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 03:05:51 +03:00
[GH-ISSUE #606] Lookups fail when any nameserver returns SERVFAIL #248
Originally created by @stuartnelson3 on GitHub (Nov 7, 2018).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/606
Describe the bug
SRV lookup fails when using a resolv.conf with nameservers in the 10.10.x.x range.
err="pool failed to run"
To Reproduce
Perform an SRV lookup with a nameserver in this range
Expected behavior
I expected a lookup to work.
System:
archlinux
x86_64
rustc 1.30.0 (da5f414c2 2018-10-24)
Version:
Crate: resolver
Version: 0.10
These nameservers are being pushed by the network, updating my NetworkManager-managed resolv.conf automatically. Resolution works as normal with dig, but the library doesn't like this address range.
@bluejekyll commented on GitHub (Nov 7, 2018):
Could you share your `/etc/resolv.conf`?
This is the first report I've seen like this. Also, to confirm, it's only the SRV record type that's having this issue?
@stuartnelson3 commented on GitHub (Nov 7, 2018):
This also happens with A record lookups. I don't have any other lookups at hand to check (but since both are failing with the same `pool failed to run` message, I'm guessing it applies to more/all).
When using different nameservers (for example, `10.33.x.x`), it works fine.
It does seem weird that it would be a specific range? I don't know what's running at the 10.10.x.x addresses or their settings, but I would expect an error different from `pool failed to run`.
@bluejekyll commented on GitHub (Nov 7, 2018):
Have you verified both dns endpoints are responding properly with dig?
Also, to cut out the search path as a potential issue, in trust-dns you can issue an FQDN query by using a fully qualified name of the form `www.example.net.`, i.e. the final dot on the name does this.
@stuartnelson3 commented on GitHub (Nov 7, 2018):
The queries I was testing were fully qualified. I confirmed that dig against `example.com.` works against those endpoints:
It seems the issue might be that for the SRV query (internal name that I won't share) that I was issuing to the above address, the server responds with SERVFAIL?
@bluejekyll commented on GitHub (Nov 7, 2018):
It’s possible that we’re treating SERVFAIL responses incorrectly in the library.
You said you were also getting this for A records as well, though, right?
@stuartnelson3 commented on GitHub (Nov 8, 2018):
Correct, I'm seeing this behavior for A record lookups as well
@bluejekyll commented on GitHub (Nov 8, 2018):
Ok, given your comment:
Are both the A and SRV records getting SERVFAIL? What I really want to understand is what behavior you'd like in that case. I'm trying to understand what we need to fix here. Also, I'm not clear on where `pool failed to run` is coming from. I don't see that inside the `trust-dns` library in my search, nor `tokio-rs`.
I'm not currently clear on either what the root issue is or what the fix should be. Are you able to do some more research in tracking this down?
One thing that might be useful is logs from the trust-dns libs. If you use `env-logger`, this can be used to enable the logging there:
`export RUST_LOG=trust_dns=debug,trust_dns_resolver=trace`
@stuartnelson3 commented on GitHub (Nov 9, 2018):
Ah, ok, sorry for all the run around.
The error being returned is in fact for the SERVFAIL
What's the behavior wrt using nameservers? My /etc/resolv.conf has 9 nameservers listed, and 2/9 will fail. Does the library use one, and on failure, return the failure?
I found this issue regarding noting differences between glibc/musl (https://github.com/bluejekyll/trust-dns/issues/249), but the readme seems to only say "NameServer pools with performance based priority usage". To me, I think it'd be cool to send out parallel requests and if all are failures, return a failure, or return the first successful response.
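The parallel behavior suggested here could look roughly like the following. This is a stdlib-only sketch, not trust-dns code: `Outcome`, `fake_query`, and `query_parallel` are hypothetical names, and the network call is replaced by a canned reply after a delay.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of one query against one nameserver (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    Answer(String),
    ServFail,
}

/// Stand-in for a real network query: returns a canned outcome after a delay.
fn fake_query(delay_ms: u64, outcome: Outcome) -> Outcome {
    thread::sleep(Duration::from_millis(delay_ms));
    outcome
}

/// Query every server at once. Return the first successful answer,
/// or ServFail once every server has reported a failure.
fn query_parallel(servers: Vec<(u64, Outcome)>) -> Outcome {
    let (tx, rx) = mpsc::channel();
    for (delay_ms, outcome) in servers {
        let tx = tx.clone();
        thread::spawn(move || {
            // Ignore send errors: the receiver may already have returned.
            let _ = tx.send(fake_query(delay_ms, outcome));
        });
    }
    // Drop the original sender so the loop below ends when all workers finish.
    drop(tx);
    for outcome in rx {
        if let Outcome::Answer(_) = outcome {
            return outcome;
        }
    }
    Outcome::ServFail
}
```

Calling `query_parallel` with one fast failing server and one slower working server returns the working server's answer. The cost is one query per nameserver for every lookup.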
@bluejekyll commented on GitHub (Nov 9, 2018):
Ah. I understand now. We could add some options to allow parallel queries, but I wanted to be cognizant of network load which is why this didn’t happen before.
I now understand your issue better. The SERVFAIL is being regarded as a terminal query failure, meaning no query for this name will succeed. We can improve this by trying again with the other nodes. Currently the library only continues to other nameservers on connectivity failure.
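A minimal sketch of the proposed fallback, assuming a simplified reply type. `Reply` and `lookup_with_fallback` are hypothetical names and not part of the trust-dns API; treating NXDOMAIN as a terminal reply is an assumption made for illustration.

```rust
/// Result of asking a single nameserver (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
enum Reply {
    Answer(String),
    ServFail,
    NxDomain,
    Timeout,
}

/// Try each nameserver in order. Move on after a timeout or SERVFAIL;
/// stop at the first other reply (an answer, or NXDOMAIN in this sketch).
/// If every server fails, return the last failure seen.
/// An empty server list yields Timeout.
fn lookup_with_fallback<F>(servers: &[&str], mut ask: F) -> Reply
where
    F: FnMut(&str) -> Reply,
{
    let mut last = Reply::Timeout;
    for &server in servers {
        let reply = ask(server);
        if matches!(reply, Reply::Timeout | Reply::ServFail) {
            // Retryable: remember the failure and try the next server.
            last = reply;
        } else {
            return reply;
        }
    }
    last
}
```

With two of nine nameservers returning SERVFAIL, as in the report above, a lookup would still succeed as long as some server in the list answers.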
Many local resolvers end up improperly responding to domains for which they are not authoritative, but as though they are authoritative. We might want to make these configurable options.
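To boil this down, there are two changes being asked for:
1. Query the nameservers in parallel and return the first successful response.
2. Continue to the next nameserver when one returns SERVFAIL, rather than failing the lookup.
Is this accurate?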
@stuartnelson3 commented on GitHub (Nov 9, 2018):
2 is definitely something I would want. 1, as you mentioned, comes with its own trade-offs. While I would like that behavior, others might not.
If this conflicts with how you want the library to work that's 100% ok, just let me know!
@bluejekyll commented on GitHub (Nov 10, 2018):
I think we can support both. We should create a separate issue for 1, and we'll fix 2 as an answer to this issue. Also, I think we should rename this to "Lookups fail when any nameserver returns SERVFAIL".
@stuartnelson3 commented on GitHub (Nov 12, 2018):
How is the first nameserver to query chosen? Reading about glibc and musl's behavior, it seems they respect the order in resolv.conf. In my situation, the failing servers were in the last positions of the list. Perhaps being consistent with that behavior would be appropriate?
@bluejekyll commented on GitHub (Nov 12, 2018):
I’m aware of glibc and musl’s ordering.
The ordering I’ve implemented is based on success vs. failure rate and connected vs unconnected streams. In the future I want to also take into consideration latency.
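As an illustration of that kind of ordering, here is a small sketch. The `ServerStats` fields and the choice to rank connection state ahead of success rate are assumptions for the example; the library's actual bookkeeping may differ.

```rust
/// Per-nameserver bookkeeping (hypothetical fields, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct ServerStats {
    addr: &'static str,
    connected: bool,
    successes: u32,
    failures: u32,
}

impl ServerStats {
    /// Fraction of queries that succeeded; an untried server counts as 1.0.
    fn success_rate(&self) -> f64 {
        let total = self.successes + self.failures;
        if total == 0 {
            1.0
        } else {
            f64::from(self.successes) / f64::from(total)
        }
    }
}

/// Sort best-first: connected before unconnected, then higher success rate.
fn prioritize(servers: &mut [ServerStats]) {
    servers.sort_by(|a, b| {
        b.connected
            .cmp(&a.connected)
            .then(b.success_rate().total_cmp(&a.success_rate()))
    });
}
```

Unlike the glibc and musl behavior mentioned above, this order changes over time as servers succeed or fail, so the position of a server in resolv.conf does not determine when it is tried.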
With this fix, it will work through the list of nameservers, continuing to the next one in certain cases: previously only connection failures (timeouts), and now SERVFAIL as well. See #613