[GH-ISSUE #2649] Intermittent Windows test failures due to port access permissions denied #1037

Closed
opened 2026-03-16 01:23:29 +03:00 by kerem · 12 comments
Owner

Originally created by @djc on GitHub (Dec 6, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2649

https://github.com/hickory-dns/hickory-dns/actions/runs/12200761181/job/34037642401?pr=2579 shows:

---- server_future_tests::test_server_no_response_on_response stdout ----
udp_socket on port: 127.0.0.1:49718
thread 'server_future_tests::test_server_no_response_on_response' panicked at tests\integration-tests\tests\integration\server_future_tests.rs:175:76:
called `Result::unwrap()` on an `Err` value: ProtoError { kind: Io(Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." }) }

I've noticed similar failures a few times earlier.

(CI on Windows got enabled yesterday when #2589 was merged.)

Originally created by @djc on GitHub (Dec 6, 2024). Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2649 https://github.com/hickory-dns/hickory-dns/actions/runs/12200761181/job/34037642401?pr=2579 shows: ``` ---- server_future_tests::test_server_no_response_on_response stdout ---- udp_socket on port: 127.0.0.1:49718 thread 'server_future_tests::test_server_no_response_on_response' panicked at tests\integration-tests\tests\integration\server_future_tests.rs:175:76: called `Result::unwrap()` on an `Err` value: ProtoError { kind: Io(Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." }) } ``` I've noticed similar failures a few times earlier. (CI on Windows got enabled yesterday when #2589 was merged.)
kerem closed this issue 2026-03-16 01:23:34 +03:00
Author
Owner

@djc commented on GitHub (Dec 6, 2024):

@chrisdenton sorry to both you some more, but if you have any suggestions for this stuff, also much obliged!

<!-- gh-comment-id:2523472341 --> @djc commented on GitHub (Dec 6, 2024): @chrisdenton sorry to both you some more, but if you have any suggestions for this stuff, also much obliged!
Author
Owner

@divergentdave commented on GitHub (Dec 6, 2024):

I did some testing on a local Windows machine, and I wasn't able to reproduce this so far. The integration tests pass, even if I don't act on the firewall pop-up I get. I tried writing a small program to bind two UDP sockets with the same port, and confirmed that gets an AddrInUse error, different from the permission error above.

<!-- gh-comment-id:2524658922 --> @divergentdave commented on GitHub (Dec 6, 2024): I did some testing on a local Windows machine, and I wasn't able to reproduce this so far. The integration tests pass, even if I don't act on the firewall pop-up I get. I tried writing a small program to bind two UDP sockets with the same port, and confirmed that gets an AddrInUse error, different from the permission error above.
Author
Owner

@ChrisDenton commented on GitHub (Dec 7, 2024):

I don't know what the issue is off the top of my head and I've really only glanced at the code but my first suspicion would be that it's something to do with having multiple threads. If CI has less resources and/or more contention than tests run locally then it is possible things are being scheduled quite differently. At least that would explain why it's intermittent and hard to replicate.

<!-- gh-comment-id:2525020239 --> @ChrisDenton commented on GitHub (Dec 7, 2024): I don't know what the issue is off the top of my head and I've really only glanced at the code but my first suspicion would be that it's something to do with having multiple threads. If CI has less resources and/or more contention than tests run locally then it is possible things are being scheduled quite differently. At least that would explain why it's intermittent and hard to replicate.
Author
Owner

@marcus0x62 commented on GitHub (Dec 11, 2024):

I saw this issue again here. This discussion seems relevant. I wonder if this is failing when we try to bind a port in Window's ephemeral range? Their default is 49,152 - 65535, whereas we try to pick a port from 1024-65535. That would explain why this works most of the time. If that is the case, we'll probably have to modify our behavior on Windows and just use their port allocation mechanism, perhaps with a doc note to users that are running hickory server on windows that they should be sure to set as big a dynamic range as possible...

---- truncation_tests::test_truncation stdout ----
udp_socket on port: 127.0.0.1:62067
2024-12-11T14:33:26.136304Z TRACE hickory_proto::udp::udp_stream: binding UDP socket port=64538
2024-12-11T14:33:26.136403Z DEBUG hickory_proto::udp::udp_stream: failed to bind port: An attempt was made to access a socket in a way forbidden by its access permissions. (os error 10013)
thread 'truncation_tests::test_truncation' panicked at tests\integration-tests\tests\integration\truncation_tests.rs:60:56:
query failed: ProtoError { kind: Io(Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." }) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
<!-- gh-comment-id:2536235322 --> @marcus0x62 commented on GitHub (Dec 11, 2024): I saw this issue again [here](https://github.com/hickory-dns/hickory-dns/actions/runs/12278329943/job/34259768526). [This discussion](https://github.com/orgs/community/discussions/26011) seems relevant. I wonder if this is failing when we try to bind a port in Window's ephemeral range? Their default is 49,152 - 65535, whereas we try to pick a port from 1024-65535. That would explain why this works *most of the time.* If that is the case, we'll probably have to modify our behavior on Windows and just use their port allocation mechanism, perhaps with a doc note to users that are running hickory server on windows that they should be sure to set as big a dynamic range as possible... ``` ---- truncation_tests::test_truncation stdout ---- udp_socket on port: 127.0.0.1:62067 2024-12-11T14:33:26.136304Z TRACE hickory_proto::udp::udp_stream: binding UDP socket port=64538 2024-12-11T14:33:26.136403Z DEBUG hickory_proto::udp::udp_stream: failed to bind port: An attempt was made to access a socket in a way forbidden by its access permissions. (os error 10013) thread 'truncation_tests::test_truncation' panicked at tests\integration-tests\tests\integration\truncation_tests.rs:60:56: query failed: ProtoError { kind: Io(Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." }) } note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ```
Author
Owner

@djc commented on GitHub (Dec 11, 2024):

Maybe on Windows we should narrow our default range?

<!-- gh-comment-id:2536321835 --> @djc commented on GitHub (Dec 11, 2024): Maybe on Windows we should narrow our default range?
Author
Owner

@marcus0x62 commented on GitHub (Dec 11, 2024):

Maybe on Windows we should narrow our default range?

This sounds like the best course of action. If this hypothesis does turn out to be correct, we could narrow our range just on windows to be 1024 - 49151, which would give us around 3x the number of ports vs using their default range.

We'll need a documentation note to warn any windows users who change their ephemeral port range.

Maybe this should be a config option so that if someone does change their default range for whatever reason, hickory can be made to match.

Edit -

This may be getting into "too clever by half" territory, but we could also just default to selecting a port at random, then if we're running on Windows and our RNG returns a port in the range 49,152 - 65535, ask the OS for an ephemeral port. That would be easy to insert into the existing code (just as easy as changing the range on windows) and still give us the full entropy of the port range we use today on Windows.

<!-- gh-comment-id:2536489010 --> @marcus0x62 commented on GitHub (Dec 11, 2024): > Maybe on Windows we should narrow our default range? This sounds like the best course of action. If this hypothesis does turn out to be correct, we could narrow our range just on windows to be 1024 - 49151, which would give us around 3x the number of ports vs using their default range. We'll need a documentation note to warn any windows users who change their ephemeral port range. Maybe this should be a config option so that if someone does change their default range for whatever reason, hickory can be made to match. Edit - This may be getting into "too clever by half" territory, but we could also just default to selecting a port at random, then if we're running on Windows and our RNG returns a port in the range 49,152 - 65535, ask the OS for an ephemeral port. That would be easy to insert into the existing code (just as easy as changing the range on windows) and still give us the full entropy of the port range we use today on Windows.
Author
Owner

@djc commented on GitHub (Dec 12, 2024):

This may be getting into "too clever by half" territory, but we could also just default to selecting a port at random, then if we're running on Windows and our RNG returns a port in the range 49,152 - 65535, ask the OS for an ephemeral port. That would be easy to insert into the existing code (just as easy as changing the range on windows) and still give us the full entropy of the port range we use today on Windows.

Sounds good to me!

<!-- gh-comment-id:2539082654 --> @djc commented on GitHub (Dec 12, 2024): > This may be getting into "too clever by half" territory, but we could also just default to selecting a port at random, then if we're running on Windows and our RNG returns a port in the range 49,152 - 65535, ask the OS for an ephemeral port. That would be easy to insert into the existing code (just as easy as changing the range on windows) and still give us the full entropy of the port range we use today on Windows. Sounds good to me!
Author
Owner

@divergentdave commented on GitHub (Dec 18, 2024):

Here's a CI failure including the port number log message:

---- udp::udp_stream::tests::test_next_random_socket stdout ----
2024-12-18T21:38:06.032078Z TRACE hickory_proto::udp::udp_stream: binding UDP socket port=62821
2024-12-18T21:38:06.032257Z DEBUG hickory_proto::udp::udp_stream: failed to bind port: An attempt was made to access a socket in a way forbidden by its access permissions. (os error 10013)
thread 'udp::udp_stream::tests::test_next_random_socket' panicked at crates\proto\src\tests\udp.rs:30:14:
failed to get next socket address: Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." }
<!-- gh-comment-id:2552329646 --> @divergentdave commented on GitHub (Dec 18, 2024): Here's a [CI failure](https://github.com/hickory-dns/hickory-dns/actions/runs/12401634090/job/34621427451#step:7:1129) including the port number log message: ``` ---- udp::udp_stream::tests::test_next_random_socket stdout ---- 2024-12-18T21:38:06.032078Z TRACE hickory_proto::udp::udp_stream: binding UDP socket port=62821 2024-12-18T21:38:06.032257Z DEBUG hickory_proto::udp::udp_stream: failed to bind port: An attempt was made to access a socket in a way forbidden by its access permissions. (os error 10013) thread 'udp::udp_stream::tests::test_next_random_socket' panicked at crates\proto\src\tests\udp.rs:30:14: failed to get next socket address: Os { code: 10013, kind: PermissionDenied, message: "An attempt was made to access a socket in a way forbidden by its access permissions." } ```
Author
Owner

@divergentdave commented on GitHub (Dec 18, 2024):

I did some more experimentation, and it looks like CI workers have the dynamic port range configured to be the top quarter of the port space. (https://github.com/divergentdave/hickory-dns/actions/runs/12402565933/job/34624328888#step:6:14) Thus, 62821 falls within the dynamic port range.

It looks like CI workers have "excluded port ranges" configured covering 200 UDP ports, split into two adjacent 100-port ranges. The ranges are different each run, but appear to be within the dynamic port range. I don't know if this is related to the problem at all. There's little documentation on what the effect of "excluded port ranges" are. I tried adding an excluded port range on a local machine, then binding a UDP socket to a port in that range, and I didn't get a permission error.

I wonder if we should just treat PermissionDenied similar to AddrInUse here, and try again with a different port.

<!-- gh-comment-id:2552463434 --> @divergentdave commented on GitHub (Dec 18, 2024): I did some more experimentation, and it looks like CI workers have the dynamic port range configured to be the top quarter of the port space. (https://github.com/divergentdave/hickory-dns/actions/runs/12402565933/job/34624328888#step:6:14) Thus, 62821 falls within the dynamic port range. It looks like CI workers have "excluded port ranges" configured covering 200 UDP ports, split into two adjacent 100-port ranges. The ranges are different each run, but appear to be within the dynamic port range. I don't know if this is related to the problem at all. There's little documentation on what the effect of "excluded port ranges" are. I tried adding an excluded port range on a local machine, then binding a UDP socket to a port in that range, and I didn't get a permission error. I wonder if we should just treat `PermissionDenied` similar to `AddrInUse` here, and try again with a different port.
Author
Owner

@djc commented on GitHub (Dec 19, 2024):

I wonder if we should just treat PermissionDenied similar to AddrInUse here, and try again with a different port.

That also seems reasonable and relies less on guessing knowledge about the configured port ranges. Should we only do that on Windows, or use the same strategy on other systems?

<!-- gh-comment-id:2553042499 --> @djc commented on GitHub (Dec 19, 2024): > I wonder if we should just treat `PermissionDenied` similar to `AddrInUse` here, and try again with a different port. That also seems reasonable and relies less on guessing knowledge about the configured port ranges. Should we only do that on Windows, or use the same strategy on other systems?
Author
Owner

@divergentdave commented on GitHub (Dec 19, 2024):

Permission errors are also possible on Linux, particularly if SELinux policies restrict binding particular ports. See https://lists.nlnetlabs.nl/pipermail/unbound-users/2014-November/003634.html for an example of this. Though, if the policies are fixed and known ahead of time, this can also be fixed with avoid_local_udp_ports.

<!-- gh-comment-id:2555070907 --> @divergentdave commented on GitHub (Dec 19, 2024): Permission errors are also possible on Linux, particularly if SELinux policies restrict binding particular ports. See https://lists.nlnetlabs.nl/pipermail/unbound-users/2014-November/003634.html for an example of this. Though, if the policies are fixed and known ahead of time, this can also be fixed with `avoid_local_udp_ports`.
Author
Owner

@divergentdave commented on GitHub (Feb 11, 2025):

#2693 has resolved this, as we can now tolerate permission denied errors, by retrying with another random port or port 0.

<!-- gh-comment-id:2651925374 --> @divergentdave commented on GitHub (Feb 11, 2025): #2693 has resolved this, as we can now tolerate permission denied errors, by retrying with another random port or port 0.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/hickory-dns#1037
No description provided.