[GH-ISSUE #2539] tx_id_validation_test is flaky #1010

Closed
opened 2026-03-16 01:16:29 +03:00 by kerem · 3 comments
Owner

Originally created by @divergentdave on GitHub (Oct 29, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2539

The tx_id_validation_test e2e-test has been flaking in CI, but I haven't been able to reproduce it locally. See https://github.com/hickory-dns/hickory-dns/pull/2388#issuecomment-2444731388 for the initial occurrence and #2536 for the additional logging, and https://github.com/hickory-dns/hickory-dns/actions/runs/11584003127/job/32250291256 for another occurrence, with more information.

I have a hypothesis that we have a race between timeouts here. The recursor logs show that about five seconds elapsed between when it dropped the response message with the wrong transaction ID, and when it sent a response. dig's default timeout is five seconds, and the command run in the client did not change this.

Originally created by @divergentdave on GitHub (Oct 29, 2024). Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2539 The `tx_id_validation_test` e2e-test has been flaking in CI, but I haven't been able to reproduce it locally. See https://github.com/hickory-dns/hickory-dns/pull/2388#issuecomment-2444731388 for the initial occurrence and #2536 for the additional logging, and https://github.com/hickory-dns/hickory-dns/actions/runs/11584003127/job/32250291256 for another occurrence, with more information. I have a hypothesis that we have a race between timeouts here. The recursor logs show that about five seconds elapsed between when it dropped the response message with the wrong transaction ID, and when it sent a response. `dig`'s default timeout is five seconds, and the command run in the client did not change this.
kerem closed this issue 2026-03-16 01:16:34 +03:00
Author
Owner

@marcus0x62 commented on GitHub (Oct 30, 2024):

That sounds plausible. I think the default retry timer for dig is probably making this even more difficult to reproduce. I added support for setting the dig timeout in #2540, but in testing I had to set the timeout all the way down to 1 second to reliably get the test to fail.

<!-- gh-comment-id:2447209852 --> @marcus0x62 commented on GitHub (Oct 30, 2024): That sounds plausible. I think the default retry timer for dig is probably making this even more difficult to reproduce. I added support for setting the dig timeout in #2540, but in testing I had to set the timeout all the way down to 1 second to reliably get the test to fail.
Author
Owner

@marcus0x62 commented on GitHub (Nov 1, 2024):

I don't think there have been any test failures since making the timeout change. @divergentdave ?

<!-- gh-comment-id:2452084859 --> @marcus0x62 commented on GitHub (Nov 1, 2024): I don't think there have been any test failures since making the timeout change. @divergentdave ?
Author
Owner

@divergentdave commented on GitHub (Nov 1, 2024):

I haven't seen any either, that fix is probably sufficient.

I'm still a little confused by how dig's timeout and retry is working. It seems like dig might have a blind spot between timing out one request and resending a retry, possibly? Based on the recursor logs, the retried requests from dig share the same transaction ID, and responses are being sent back to the client after five seconds. My naive guess was that a late response to an earlier attempt would satisfy a retried query, but clearly that wasn't happening consistently. At any rate, putting a good amount of distance between Hickory and dig timeouts should make the test reliable. If we have similar issues in the future, we can start dumping packets with tshark and more logging from dig to diagnose further.

<!-- gh-comment-id:2452222469 --> @divergentdave commented on GitHub (Nov 1, 2024): I haven't seen any either, that fix is probably sufficient. I'm still a little confused by how `dig`'s timeout and retry is working. It seems like `dig` might have a blind spot between timing out one request and resending a retry, possibly? Based on the recursor logs, the retried requests from `dig` share the same transaction ID, and responses are being sent back to the client after five seconds. My naive guess was that a late response to an earlier attempt would satisfy a retried query, but clearly that wasn't happening consistently. At any rate, putting a good amount of distance between Hickory and `dig` timeouts should make the test reliable. If we have similar issues in the future, we can start dumping packets with `tshark` and more logging from `dig` to diagnose further.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/hickory-dns#1010
No description provided.