mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-25 03:05:51 +03:00
[GH-ISSUE #2539] tx_id_validation_test is flaky #1010
Originally created by @divergentdave on GitHub (Oct 29, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2539
The tx_id_validation_test e2e test has been flaking in CI, but I haven't been able to reproduce it locally. See https://github.com/hickory-dns/hickory-dns/pull/2388#issuecomment-2444731388 for the initial occurrence, #2536 for the additional logging, and https://github.com/hickory-dns/hickory-dns/actions/runs/11584003127/job/32250291256 for another occurrence, with more information.
I have a hypothesis that we have a race between timeouts here. The recursor logs show that about five seconds elapsed between when it dropped the response message with the wrong transaction ID, and when it sent a response. dig's default timeout is five seconds, and the command run in the client did not change this.
@marcus0x62 commented on GitHub (Oct 30, 2024):
That sounds plausible. I think the default retry timer for dig is probably making this even more difficult to reproduce. I added support for setting the dig timeout in #2540, but in testing I had to set the timeout all the way down to 1 second to reliably get the test to fail.
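For readers following along, here is a minimal sketch of how a test harness might pin dig's per-try timeout and number of tries explicitly rather than relying on the defaults. The `dig_args` helper, the server address, and the query name are hypothetical and are not the API added in #2540; only the `+timeout=` and `+tries=` dig options are taken as given.

```rust
/// Build the argument list for a dig invocation with an explicit
/// per-try timeout (seconds) and total number of tries.
/// Hypothetical helper for illustration; not the code from #2540.
fn dig_args(server: &str, name: &str, timeout_secs: u32, tries: u32) -> Vec<String> {
    vec![
        format!("@{}", server),              // nameserver to query
        name.to_string(),                    // query name
        "A".to_string(),                     // query type
        format!("+timeout={}", timeout_secs), // per-try timeout
        format!("+tries={}", tries),          // total attempts, including the first
    ]
}

fn main() {
    // Placeholder address and name, chosen only for the example.
    let args = dig_args("192.0.2.53", "example.com.", 1, 1);
    println!("dig {}", args.join(" "));
}
```

Setting both values makes the client side of the timing explicit, so a reproduction attempt does not depend on whatever defaults the installed dig happens to use.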
@marcus0x62 commented on GitHub (Nov 1, 2024):
I don't think there have been any test failures since making the timeout change. @divergentdave ?
@divergentdave commented on GitHub (Nov 1, 2024):
I haven't seen any either, that fix is probably sufficient.
I'm still a little confused by how dig's timeout and retry are working. It seems like dig might have a blind spot between timing out one request and resending a retry, possibly? Based on the recursor logs, the retried requests from dig share the same transaction ID, and responses are being sent back to the client after five seconds. My naive guess was that a late response to an earlier attempt would satisfy a retried query, but clearly that wasn't happening consistently. At any rate, putting a good amount of distance between Hickory and dig timeouts should make the test reliable. If we have similar issues in the future, we can start dumping packets with tshark and more logging from dig to diagnose further.
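The timeout race discussed in this thread can be sketched with a simplified timing model. It considers a single query attempt only and deliberately ignores dig's retry behavior, which the thread leaves unexplained. The `classify` function, the `Outcome` type, and the 200 ms jitter allowance are assumptions made for illustration, not measurements from CI.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// The response arrives before the client gives up, even with jitter.
    ResponseInTime,
    /// The two deadlines are within the jitter window of each other.
    Racy,
    /// The client gives up before the response can arrive.
    TimedOut,
}

/// Classify one attempt, given the client's timeout, the delay before the
/// server responds, and an assumed bound on scheduling/network jitter.
fn classify(client_timeout: Duration, response_delay: Duration, jitter: Duration) -> Outcome {
    if response_delay + jitter < client_timeout {
        Outcome::ResponseInTime
    } else if response_delay > client_timeout + jitter {
        Outcome::TimedOut
    } else {
        Outcome::Racy
    }
}

fn main() {
    let response_delay = Duration::from_secs(5); // roughly what the recursor logs showed
    let jitter = Duration::from_millis(200); // assumed allowance, not measured
    for secs in [1u64, 5, 10] {
        let outcome = classify(Duration::from_secs(secs), response_delay, jitter);
        println!("client timeout {}s -> {:?}", secs, outcome);
    }
}
```

Under these assumptions, a five-second client timeout against a five-second response delay falls in the racy band, while a timeout well above or well below that delay gives a deterministic result. That is consistent with the observation that separating the two timeouts made the test stable.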