[GH-ISSUE #2539] `tx_id_validation_test` is flaky #1010

New issue

Closed

opened 2026-03-16 01:16:29 +03:00 by kerem · 3 comments

kerem commented

2026-03-16 01:16:29 +03:00

Owner

Originally created by @divergentdave on GitHub (Oct 29, 2024).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2539

The tx_id_validation_test e2e-test has been flaking in CI, but I haven't been able to reproduce it locally. See https://github.com/hickory-dns/hickory-dns/pull/2388#issuecomment-2444731388 for the initial occurrence and #2536 for the additional logging, and https://github.com/hickory-dns/hickory-dns/actions/runs/11584003127/job/32250291256 for another occurrence, with more information.

I have a hypothesis that we have a race between timeouts here. The recursor logs show that about five seconds elapsed between when it dropped the response message with the wrong transaction ID, and when it sent a response. dig's default timeout is five seconds, and the command run in the client did not change this.

Originally created by @divergentdave on GitHub (Oct 29, 2024). Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/2539 The `tx_id_validation_test` e2e-test has been flaking in CI, but I haven't been able to reproduce it locally. See https://github.com/hickory-dns/hickory-dns/pull/2388#issuecomment-2444731388 for the initial occurrence and #2536 for the additional logging, and https://github.com/hickory-dns/hickory-dns/actions/runs/11584003127/job/32250291256 for another occurrence, with more information. I have a hypothesis that we have a race between timeouts here. The recursor logs show that about five seconds elapsed between when it dropped the response message with the wrong transaction ID, and when it sent a response. `dig`'s default timeout is five seconds, and the command run in the client did not change this.

kerem closed this issue

2026-03-16 01:16:34 +03:00

kerem commented

2026-03-16 01:16:41 +03:00

Author

Owner

@marcus0x62 commented on GitHub (Oct 30, 2024):

That sounds plausible. I think the default retry timer for dig is probably making this even more difficult to reproduce. I added support for setting the dig timeout in #2540, but in testing I had to set the timeout all the way down to 1 second to reliably get the test to fail.

@marcus0x62 commented on GitHub (Oct 30, 2024): That sounds plausible. I think the default retry timer for dig is probably making this even more difficult to reproduce. I added support for setting the dig timeout in #2540, but in testing I had to set the timeout all the way down to 1 second to reliably get the test to fail.

kerem commented

2026-03-16 01:16:46 +03:00

Author

Owner

@marcus0x62 commented on GitHub (Nov 1, 2024):

I don't think there have been any test failures since making the timeout change. @divergentdave ?

@marcus0x62 commented on GitHub (Nov 1, 2024): I don't think there have been any test failures since making the timeout change. @divergentdave ?

kerem commented

2026-03-16 01:16:51 +03:00

Author

Owner

@divergentdave commented on GitHub (Nov 1, 2024):

I haven't seen any either, that fix is probably sufficient.

I'm still a little confused by how dig's timeout and retry is working. It seems like dig might have a blind spot between timing out one request and resending a retry, possibly? Based on the recursor logs, the retried requests from dig share the same transaction ID, and responses are being sent back to the client after five seconds. My naive guess was that a late response to an earlier attempt would satisfy a retried query, but clearly that wasn't happening consistently. At any rate, putting a good amount of distance between Hickory and dig timeouts should make the test reliable. If we have similar issues in the future, we can start dumping packets with tshark and more logging from dig to diagnose further.

@divergentdave commented on GitHub (Nov 1, 2024): I haven't seen any either, that fix is probably sufficient. I'm still a little confused by how `dig`'s timeout and retry is working. It seems like `dig` might have a blind spot between timing out one request and resending a retry, possibly? Based on the recursor logs, the retried requests from `dig` share the same transaction ID, and responses are being sent back to the client after five seconds. My naive guess was that a late response to an earlier attempt would satisfy a retried query, but clearly that wasn't happening consistently. At any rate, putting a good amount of distance between Hickory and `dig` timeouts should make the test reliable. If we have similar issues in the future, we can start dumping packets with `tshark` and more logging from `dig` to diagnose further.

kerem referenced this issue

2026-03-16 02:36:56 +03:00

[PR #1010] [MERGED] Bump openssl from 0.10.26 to 0.10.27 #1862

kerem referenced this issue

2026-03-16 02:51:45 +03:00

[PR #1010] [MERGED] Bump openssl from 0.10.26 to 0.10.27 #1869