mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 03:05:51 +03:00
[GH-ISSUE #2649] Intermittent Windows test failures due to port access permissions denied #1037
Labels
No labels
blocked
breaking-change
bug
bug:critical
bug:tests
cleanup
compliance
compliance
compliance
crate:all
crate:client
crate:native-tls
crate:proto
crate:recursor
crate:resolver
crate:resolver
crate:rustls
crate:server
crate:util
dependencies
docs
duplicate
easy
easy
enhance
enhance
enhance
feature:dns-over-https
feature:dns-over-quic
feature:dns-over-tls
feature:dnsssec
feature:global_lb
feature:mdns
feature:tsig
features:edns
has workaround
ops
perf
platform:WASM
platform:android
platform:fuchsia
platform:linux
platform:macos
platform:windows
pull-request
question
test
tools
tools
trust
unclear
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/hickory-dns#1037
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @djc on GitHub (Dec 6, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2649
https://github.com/hickory-dns/hickory-dns/actions/runs/12200761181/job/34037642401?pr=2579 shows:
I've noticed similar failures a few times earlier.
(CI on Windows got enabled yesterday when #2589 was merged.)
@djc commented on GitHub (Dec 6, 2024):
@chrisdenton sorry to both you some more, but if you have any suggestions for this stuff, also much obliged!
@divergentdave commented on GitHub (Dec 6, 2024):
I did some testing on a local Windows machine, and I wasn't able to reproduce this so far. The integration tests pass, even if I don't act on the firewall pop-up I get. I tried writing a small program to bind two UDP sockets with the same port, and confirmed that gets an AddrInUse error, different from the permission error above.
@ChrisDenton commented on GitHub (Dec 7, 2024):
I don't know what the issue is off the top of my head and I've really only glanced at the code but my first suspicion would be that it's something to do with having multiple threads. If CI has less resources and/or more contention than tests run locally then it is possible things are being scheduled quite differently. At least that would explain why it's intermittent and hard to replicate.
@marcus0x62 commented on GitHub (Dec 11, 2024):
I saw this issue again here. This discussion seems relevant. I wonder if this is failing when we try to bind a port in Window's ephemeral range? Their default is 49,152 - 65535, whereas we try to pick a port from 1024-65535. That would explain why this works most of the time. If that is the case, we'll probably have to modify our behavior on Windows and just use their port allocation mechanism, perhaps with a doc note to users that are running hickory server on windows that they should be sure to set as big a dynamic range as possible...
@djc commented on GitHub (Dec 11, 2024):
Maybe on Windows we should narrow our default range?
@marcus0x62 commented on GitHub (Dec 11, 2024):
This sounds like the best course of action. If this hypothesis does turn out to be correct, we could narrow our range just on windows to be 1024 - 49151, which would give us around 3x the number of ports vs using their default range.
We'll need a documentation note to warn any windows users who change their ephemeral port range.
Maybe this should be a config option so that if someone does change their default range for whatever reason, hickory can be made to match.
Edit -
This may be getting into "too clever by half" territory, but we could also just default to selecting a port at random, then if we're running on Windows and our RNG returns a port in the range 49,152 - 65535, ask the OS for an ephemeral port. That would be easy to insert into the existing code (just as easy as changing the range on windows) and still give us the full entropy of the port range we use today on Windows.
@djc commented on GitHub (Dec 12, 2024):
Sounds good to me!
@divergentdave commented on GitHub (Dec 18, 2024):
Here's a CI failure including the port number log message:
@divergentdave commented on GitHub (Dec 18, 2024):
I did some more experimentation, and it looks like CI workers have the dynamic port range configured to be the top quarter of the port space. (https://github.com/divergentdave/hickory-dns/actions/runs/12402565933/job/34624328888#step:6:14) Thus, 62821 falls within the dynamic port range.
It looks like CI workers have "excluded port ranges" configured covering 200 UDP ports, split into two adjacent 100-port ranges. The ranges are different each run, but appear to be within the dynamic port range. I don't know if this is related to the problem at all. There's little documentation on what the effect of "excluded port ranges" are. I tried adding an excluded port range on a local machine, then binding a UDP socket to a port in that range, and I didn't get a permission error.
I wonder if we should just treat
PermissionDeniedsimilar toAddrInUsehere, and try again with a different port.@djc commented on GitHub (Dec 19, 2024):
That also seems reasonable and relies less on guessing knowledge about the configured port ranges. Should we only do that on Windows, or use the same strategy on other systems?
@divergentdave commented on GitHub (Dec 19, 2024):
Permission errors are also possible on Linux, particularly if SELinux policies restrict binding particular ports. See https://lists.nlnetlabs.nl/pipermail/unbound-users/2014-November/003634.html for an example of this. Though, if the policies are fixed and known ahead of time, this can also be fixed with
avoid_local_udp_ports.@divergentdave commented on GitHub (Feb 11, 2025):
#2693 has resolved this, as we can now tolerate permission denied errors, by retrying with another random port or port 0.