mirror of
https://github.com/hickory-dns/hickory-dns.git
synced 2026-04-24 18:55:55 +03:00
[GH-ISSUE #3435] Recursive resolver benchmarks #1181
Originally created by @djc on GitHub (Jan 9, 2026).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/3435
Originally assigned to: @cpu on GitHub.
We need some reproducible benchmarks for the recursive resolver. Ideally we'd have some kind of comparison to Unbound. Should think about how we come up with a realistic query load.
@cpu commented on GitHub (Jan 29, 2026):
I started to look at this, and I think it's both a tricky problem and one with a lot of potential ways to attack it. It feels like a case where it would be helpful to lay out some options and discuss their pros/cons. All benchmarking is a question of context and not over-reading the results but that feels even more true than usual for this particular domain. Very happy to hear suggestions from folks more familiar with operational DNS or benchmarking.
I found this post by ISC a useful introduction to some of the challenges in this space, but to criminally under-summarize, it's different from benchmarking authoritative servers because queries aren't independent and there's no shortage of confounding factors. There's also not a single metric that accurately distills overall performance in all conditions.
Design questions
I think the two most important axes of the benchmarking design are the client traffic, and the authoritative servers being recursively queried.
Client traffic
Real client traffic
On the client traffic side, the most realistic option is to capture a PCAP with representative real world query data, and then to replay it accurately with a tool like dns-shotgun.
Pros: realistic query distribution and timing patterns, more natural cache hit/miss ratios, closest to real user experience.
Cons: need to source that data from somewhere (privacy issues...), hard to pair with synthetic authoritative servers, harder to scale up/down
Synthetic client traffic
This would involve using a tool like dns-flamethrower and one or more of its generators to make synthetic client queries, or using a hard-coded list of data like Top-N domains.
Pros: No PCAP required, very reproducible, easy to scale up/down, pairs well with synthetic auth servers.
Cons: artificial query distribution, kinds of queries are also artificial, generally less representative of the real world.
Authoritative servers
Real authoritative servers
Like what ISC described, this would involve letting the recursor talk to the live internet and remote authoritative servers.
Pros: realistic latencies, zone data, etc. Exercises real timeout/retry logic.
Cons: hard to reproduce/compare across implementations, interference from stateful firewalls or bandwidth limitations
Synthetic authoritative servers
This would be a situation where we stand up our own test auth. servers and populate them with realistic zone data as best as we can.
Pros: reproducible, fewer confounding effects from the network, easier to compare between impls
Cons: unrealistic latency/network conditions, hard to generate realistic zone data
Recommendation
I don't think we have a good place to sample client traffic to get a PCAP, so I think we can exclude considering real client traffic for now. I also think we shouldn't let perfect stand in the way of good for a project that has no significant benchmarking. I think a closed loop environment would help us uncover plenty of interesting performance metrics even if it's not truly representative of what one would see in production. It would also let us side-step some of the thornier network provisioning issues that could confound a test setup that had to reach the wider internet.
I think we'd want three servers:
For metrics, we should be looking at:
For zone data, I think we should start simply and generate a single zone with lots of records and varying TTLs. We'd want to set up fake root data that delegates to itself so we can do full recursion against the one server. I think we should try to save ourselves some work and use an off-the-shelf tool like flamethrower to generate the client queries unless we find a reason that it doesn't meet our needs.
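To make the self-delegation idea concrete, here's a minimal sketch (my own illustration, not the thread's actual tooling) of generating a fake root zone whose delegations all point back at one local test server, so a recursor can perform full recursion without leaving the machine. The `ns.bench.test.` name and `127.0.0.1` glue address are assumptions for illustration:

```python
# Illustrative sketch: build a tiny fake root zone in which every TLD is
# delegated back to a single local test nameserver.

LOCAL_NS = "ns.bench.test."   # hypothetical local nameserver name
LOCAL_IP = "127.0.0.1"        # all delegations resolve back to the test host

def fake_root_zone(tlds, ttl=3600):
    lines = [
        f". {ttl} IN SOA {LOCAL_NS} admin.bench.test. 1 7200 3600 86400 {ttl}",
        f". {ttl} IN NS {LOCAL_NS}",
        f"{LOCAL_NS} {ttl} IN A {LOCAL_IP}",  # glue for the root's own NS
    ]
    for tld in tlds:
        # Delegate every TLD back to the same local server.
        lines.append(f"{tld}. {ttl} IN NS {LOCAL_NS}")
    return "\n".join(lines) + "\n"

zone = fake_root_zone(["com", "net", "org"])
print(zone)
```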
It will also take some care to tune each recursor equivalently, especially knowing that Hickory/Unbound/etc offer different configuration parameters. We'd also need to tune the hosts themselves to avoid common benchmarking gotchas like running out of FDs or bumping into CPU scaling.
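As a rough illustration of the host-tuning gotchas mentioned above, a preflight check along these lines could catch the common FD-limit problem before a run (a sketch of assumed checks, not something from the thread; the sysfs path is Linux-specific and may vary by cpufreq driver):

```python
# Rough preflight sketch: verify the host isn't configured in a way that
# commonly skews DNS benchmarks (low FD limits, CPU frequency scaling).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")
# A recursor under load can hold tens of thousands of sockets; a low soft
# limit (e.g. the common 1024 default) can surface as spurious timeouts.
if soft < 65536:
    print("warning: consider raising RLIMIT_NOFILE before benchmarking")

# On Linux, the CPU frequency governor is visible under /sys; an
# "ondemand"/"powersave" governor can distort latency measurements.
try:
    with open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor") as f:
        print("cpu0 governor:", f.read().strip())
except OSError:
    print("cpufreq sysfs not available on this host")
```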
We should probably run a handful of the dns-flamethrower generators, since the query patterns test different caching effects. E.g. using 'randomlabel' is going to benchmark raw resolution throughput, since it will be generating cache misses. Using 'static' is the other end of the spectrum and would be dominated by cache hits.
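A toy model (my own sketch, not flamethrower code) of why the generator choice dominates what you measure: 'randomlabel'-style queries are essentially all unique, so every lookup misses the cache, while 'static'-style queries repeat one name and are served from cache after the first lookup:

```python
# Simulate an unbounded resolver cache and compare hit ratios for
# unique-name vs repeated-name query streams.
import random
import string

def hit_ratio(queries):
    cache, hits = set(), 0
    for q in queries:
        if q in cache:
            hits += 1
        else:
            cache.add(q)
    return hits / len(queries)

rng = random.Random(0)
random_label = [
    "".join(rng.choices(string.ascii_lowercase, k=10)) + ".example.com"
    for _ in range(10_000)
]
static = ["www.example.com"] * 10_000

print(f"randomlabel-style hit ratio: {hit_ratio(random_label):.4f}")
print(f"static-style hit ratio:      {hit_ratio(static):.4f}")
```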
We should probably try a handful of protocols, starting with UDP Do53, then TCP Do53, then DoT.
Future work
In terms of directions to head afterwards:
Thoughts? Questions? Areas where I'm missing an important detail?
@djc commented on GitHub (Feb 2, 2026):
(Exceeded my target SLA for answering this -- sorry for the slowish response.)
I agree that using a completely synthetic setup probably makes for a good start. I'm still wondering if we can glean something useful from an existing large deployment (like Unbound at Let's Encrypt) -- like if we'd get aggregate statistics on something like cache hit ratios I think that might be very useful in trying to make our synthetic data more realistic.
Similarly, maybe we can use some top domain list but mirror it into our synthetic authoritatives? After all, the DNS is of necessity mostly public anyway.
@divergentdave can you or someone else at the ISRG (maybe Andrew) have a look at the discussion here, and maybe figure out if there is some data you can share that helps us get a handle on the scope of this problem?
I think the high-level question here is, what are the major factors influencing the latency involved in answering a recursive query?
@bdaehlie commented on GitHub (Feb 2, 2026):
@mcpherrinm ^
@divergentdave commented on GitHub (Feb 2, 2026):
I looked at metrics a while back and pulled out some summary statistics about our workload. The overall cache hit ratio is around 25%. The ratio between incoming recursive queries and outgoing queries is about 1:4.
@cpu commented on GitHub (Feb 12, 2026):
I think that makes sense as a place to start for our synthetic auth. server zone data. I've started playing with this idea and the Tranco Top 1M, and have a prototype bootstrapping zone data from the top 10k of the 1M loaded into `knot` on my local machine. One downside of this approach is that the data is exclusively eTLDs (public suffixes) and eTLD+1s (no deeper subdomains), so I think the resulting delegation depth is going to be pretty shallow relative to real-world query patterns.

That's helpful, thank you. I'm going to start with trying to gin up the input domains to `dns-flamethrower` to see if I can reliably produce something approaching this before scaling up to the full 1M and a proper dedicated benchmarking server.

@mcpherrinm gentle ping: if you have any additional ideas on how to provide us more data to better reflect your operating conditions, it would be helpful to hear them soon.
@mcpherrinm commented on GitHub (Feb 12, 2026):
Hi! Sorry, missed the ping earlier.
One possible source of realistic names to look up is Sunlight CT log "name tiles". They are much smaller than the full CT logs, as they just have the issued names in them. https://github.com/FiloSottile/sunlight/blob/main/names-tiles.md That might be better than using the Tranco lists, and it's relatively easy to grab the JSON.
Our issuance is split something like 5% TLS-ALPN-01, 45% DNS-01, and 50% HTTP-01, though which method is used for any particular name isn't public. You could synthesize queries matching those proportions of challenges, along with the CAA lookups for them.
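A hypothetical sketch of turning that issuance split into a query mix (my own reconstruction, not tooling from the thread). It assumes, beyond what's stated above, that DNS-01 implies a TXT lookup for `_acme-challenge.<name>` (per RFC 8555) while HTTP-01/TLS-ALPN-01 only need address lookups, and that every name also gets a CAA lookup:

```python
# Generate (name, rrtype) query tuples weighted by the stated challenge split.
import random

CHALLENGE_WEIGHTS = {"tls-alpn-01": 5, "dns-01": 45, "http-01": 50}

def queries_for(name, challenge):
    qs = [(name, "A"), (name, "AAAA"), (name, "CAA")]
    if challenge == "dns-01":
        # DNS-01 validation fetches a TXT record at a well-known subdomain.
        qs.append((f"_acme-challenge.{name}", "TXT"))
    return qs

def synthesize(names, seed=0):
    rng = random.Random(seed)
    challenges = rng.choices(
        list(CHALLENGE_WEIGHTS), weights=list(CHALLENGE_WEIGHTS.values()), k=len(names)
    )
    out = []
    for name, challenge in zip(names, challenges):
        out.extend(queries_for(name, challenge))
    return out

qs = synthesize([f"site{i}.example" for i in range(1000)])
txt = sum(1 for _, t in qs if t == "TXT")
print(f"{len(qs)} queries, {txt} TXT lookups")  # TXT share tracks the DNS-01 weight
```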
@cpu commented on GitHub (Feb 13, 2026):
That's a really good idea 👍
That's also helpful, thanks! I've been reconstituting SOA, NS, TXT, A, AAAA, and CAA records, and I think that'll cover all bases. HTTP-01 and TLS-ALPN-01 challenges won't be meaningfully different at the DNS layer (just A/AAAA lookups + CAA, right?) but DNS-01 will be, and it's useful to know it's nearly half of all domain validation methods.
@mcpherrinm commented on GitHub (Feb 13, 2026):
The only difference between TLS-ALPN-01 and HTTP-01 I can think of would be any additional resolution from following HTTP redirects, but that seems unlikely to be relevant for this testing.
We do about an equal number of A and AAAA lookups.
@cpu commented on GitHub (Mar 10, 2026):
I've spent some time running a few different benchmarking experiments and I think I've made enough progress that it makes sense to try and summarize some results so far.
There are some confusing elements to dig into further, but the top-level takeaways are that:
- With a single `flamethrower` instance, we do about equivalent QPS, but I think this is a case where we're comparing performance of fully cached data.
- With multiple `flamethrower` instances, Hickory starts to produce lots of timeouts as I increase the load, in a way Unbound does not. This is troubling to me.

I think digging into those last two takeaways is the best next step before trying to fiddle with the benchmarking setup itself, or switching to other sources of domains/zone data (e.g. CT tiles).
In general I continue to be fairly concerned that a fully local test scenario is not a great way to benchmark a recursive resolver, and that with real world latencies and a variety of remote authoritative server implementations we would see vastly different results. To paraphrase Mark Twain, "There are three kinds of lies: lies, damned lies, and benchmarks".
Before getting into the numbers, here's a general summary of the setup:
Server
These tests were done on a dedicated bare metal OVH cloud server:
For software versions, I tested with the 24.04 Ubuntu packages for knot and unbound:
DNS-flamethrower isn't packaged for Ubuntu, so I built it from source:
Similarly, I built HickoryDNS from source with a `--release` configuration: `9f4b15ea55`.

I haven't concerned myself with testing the latest & greatest of either knot (3.5.3) or unbound (1.24.2), but it wouldn't be too much work to change to those versions in the future.
Zone data
As described in my earlier comments, I bootstrapped artificial zone data for the `knotd` authoritative DNS server using the Tranco Top 1M domains list. I wrote some throwaway scripts/tooling that performed iterative DNS resolution (RD=0 queries) from the root servers downward for each domain, recording the delegation steps observed (NS records, glue presence/absence, TTLs, and answer records for A, AAAA, TXT and CAA). These were used to generate RFC 1035 zone files for each eTLD+1 and a TSV file with the delegation metadata + general harvest info.

I did some very light post-processing to transform this harvested data into a complete synthetic DNS hierarchy suitable for local load testing. I used the TSV + delegation data to generate synthetic root and TLD zones with NS records pointing back to the local knot server. I also rewrote the harvested NS records in each zone file to use synthetic local nameservers, with glue pointing to localhost. I also filtered out IDN domains because I was having trouble feeding them through `flamethrower` (though I might just need to punycode them first?). Lastly, I wrote a script that created a Knot configuration that loads the synthetic root + eTLD+1 zones. I haven't DNSSEC-signed any of the zones, or configured the recursors to enforce DNSSEC; this is exclusively Do53 testing without DNSSEC so far.

This ended up producing ~920,000 zone files with a total of 11.9 million RRs (averaging ~13 per zone). The breakdown of record types was roughly ~35% NS, ~25% A, ~23% TXT, ~7% SOA, ~4% CAA, ~4% AAAA. In sum it's about 3.8GB of zone data, and it represents domains across ~1,045 TLDs.
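On the IDN-filtering point above: rather than dropping IDN domains, they could likely be converted to their ASCII (punycode) form first. Python's stdlib `idna` codec does this; a quick sketch (my suggestion, not something tested in the thread):

```python
# Convert possibly-internationalized domain names to their ASCII-compatible
# (ACE / punycode) form using Python's built-in "idna" codec.

def to_ascii(domain: str) -> str:
    """Convert a possibly-internationalized domain to its ACE form."""
    return domain.encode("idna").decode("ascii")

print(to_ascii("münchen.de"))   # xn--mnchen-3ya.de
print(to_ascii("example.com"))  # plain ASCII names pass through unchanged
```

Note the stdlib codec implements the older IDNA 2003 rules; that's probably fine for generating benchmark query inputs, but a stricter IDNA 2008 library exists if it matters.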
Server configuration
I only made some minor tweaks to the default 24.04 server installation.
I applied a few system-level tunings:
`knot` was used for the authoritative DNS server hosting the synthetic zone data for all tests. I used `flamethrower` as a client feeding queries to a recursive resolver (either `unbound` or `hickory-dns`, depending on the test).

Here's the knot config I used (minus the many, many `zone` file definitions):
Here's the unbound config I used:
Here's the hickory-dns config I used (along with --workers=14 on the CLI):
For `flamethrower`, I used either `-n1` for the cache simulation tests, or `-l` for the duration-based tests, and left other parameters at their defaults (e.g. allowing `flamethrower` to make as many queries as it can, with a 1ms delay between queries, and an intra-process concurrent traffic generator setting of 10):

Since `nproc` on this machine reports `24`, I tried to pin relevant software components to dedicated cores to avoid contention, and tried to configure relevant worker thread settings to match the cores. In particular:
- `knot` was started under `taskset -c "0-3"`
- `unbound` or `hickory-dns` were started under `taskset -c "4-15"`
- `flamethrower` instances were started under `taskset` using cores `16..21`
- `node-exporter`, `htop` and other misc admin tasks used the last 2 cores.

I restarted `knot` between tests, though that was probably unnecessary. I made sure `unbound` was stopped when `hickory-dns` was running, and vice versa.

As noted in the flamethrower docs on concurrency, it's a single-threaded application, so I experimented with running 1 ... 6 copies of the program at a time, pinned to the cores in the allocated range (more on that later...).
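A small helper along the lines of that pinning scheme (my sketch, not the author's scripts) for building `taskset` command lines so each component only runs on its dedicated cores:

```python
# Build taskset-wrapped command lines from a core-pinning map. The core
# ranges mirror the layout described above; the component names are mine.
import shlex

PINNING = {
    "knot": "0-3",          # authoritative server
    "recursor": "4-15",     # unbound OR hickory-dns (never both at once)
    "flamethrower": "16-21", # load generator instances
}

def pinned_cmd(component: str, argv: list[str]) -> list[str]:
    """Wrap argv in taskset so it only runs on the component's cores."""
    return ["taskset", "-c", PINNING[component]] + argv

cmd = pinned_cmd("recursor", ["unbound", "-d", "-c", "/etc/unbound/unbound.conf"])
print(shlex.join(cmd))
```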
Even with 6 flamethrower instances I was never able to max out the CPU usage of the system. The flamethrower instances seemed to stall out at around ~40% CPU usage per core, and the unbound/hickory instances hovered around ~35% ... 20% usage per core. Notably, the `knot` CPU usage never went over ~1%, and we could probably "steal" cores from there if needed.

Query input for flamethrower
My first thought was to try and match the cache hit ratio and RR type diversity described in the comments above by crafting the domain list provided to `flamethrower`. In particular, I used a repetition-based cache simulation where a "hot set" of 8,000 domains per record type is repeated 4 times in the query file, while a "cold set" of 64,000 domains per type appears only once. This was trying to achieve a 25% cache hit ratio: after warm-up, the first occurrence of each hot query populates the cache, and the subsequent 3 repetitions are served from cache. The queries were shuffled so hot entries are distributed throughout the file rather than appearing consecutively, trying to ensure the cache hit pattern emerges organically during steady-state operation.

The query mix targets a 25/25/25/25 split across A, AAAA, TXT, and CAA record types to roughly approximate the pattern at LE. Importantly, I tried to use "positive-only" inputs for each record type, only including the (domain, type) pair if the zone actually has that record type. This avoids conflating positive and negative caching behavior, and I hoped it would give a cleaner measurement signal. With that approach, CAA records are the limiting factor (~70k domains), which constrains the total pool size but keeps the experiment well-defined.
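The hot/cold construction above can be sketched as follows (my own reconstruction of the scheme, not the actual tooling; the stand-in domain names are invented, and the real input was capped by the positive-only CAA pool, hence the slightly smaller real total):

```python
# Repetition-based cache simulation: per record type, a hot set repeated
# REPEATS times plus a cold set seen once, shuffled together. After warm-up
# each hot entry yields (REPEATS - 1) cache hits, targeting a 25% hit ratio.
import random

HOT, COLD, REPEATS = 8_000, 64_000, 4
RTYPES = ["A", "AAAA", "TXT", "CAA"]

def build_queries(domains_by_type, seed=0):
    queries = []
    for rtype in RTYPES:
        domains = domains_by_type[rtype]
        hot, cold = domains[:HOT], domains[HOT:HOT + COLD]
        queries += [(d, rtype) for d in hot] * REPEATS
        queries += [(d, rtype) for d in cold]
    random.Random(seed).shuffle(queries)
    return queries

# Stand-in positive-only (domain, type) pools; the real ones were harvested
# from the synthetic zones.
domains = {t: [f"{t.lower()}{i}.example" for i in range(HOT + COLD)] for t in RTYPES}
queries = build_queries(domains)

hits = HOT * (REPEATS - 1) * len(RTYPES)  # 3 repeat-hits per hot entry
print(f"{len(queries)} queries, ideal hit ratio {hits / len(queries):.2%}")
```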
Ultimately that meant we had a query input file of ~382,000 domains. At the QPS that both recursors are able to support, this meant a "single run" through `flamethrower` with `-n1` only lasts a few seconds. That most faithfully targets the cache hit/miss ratio but makes for very short load testing runs, so I also ran some runs of `flamethrower` that used `-l` with a fixed duration of 15 minutes, looping through the query file as needed (randomizing between loops). This obviously means the "cold set" isn't cold in this model, but it did allow for stressing the recursors.

Single run results
For the tests where I used `-n1` with `flamethrower`, the high-level flamethrower results for `unbound` were:

and for `hickory-dns`:

In summary, `unbound` achieved a higher QPS (94,952 avg r qps vs 76,443 avg r qps) but encountered more timeouts, while `hickory-dns` had a lower QPS but no timeouts. One mystery I've yet to figure out is why, for the same set of input domains, the `hickory-dns` instance saw ~35k `NXDOMAIN` results while `unbound` saw 0. The difference in `SERVFAIL` results is also interesting, but less pronounced.

15m soak, 1 flamethrower instance
For a test where I dropped `-n1` and used `-R` and `-l 900`, the high-level flamethrower results for `unbound` were:

and for `hickory-dns`:

In this case the avg r qps values are closer, but `hickory-dns` again reports more `SERVFAIL` and `NXDOMAIN` results. The number of timeouts seen by both recursors also increases.

Increasing flamethrower instances
Once I started increasing the number of flamethrower instances beyond 1, things looked more and more dire for `hickory-dns`. At the extreme, with 6 flamethrower instances, the aggregate statistics across all instances looked like this for `unbound`:

and for `hickory-dns`:

The `unbound` instance managed to get to ~568,000 avg r QPS quite happily, while the `hickory-dns` instance was only at ~200,000 and, much worse, over 64% of requests timed out.

Scaling down the number of `flamethrower` instances would reduce the % of timeouts observed, indicating there's some kind of bottleneck. E.g. with 4 instances instead of 6, only 47% timeouts were observed; using 3 instances, it was ~28% timeouts.

I think investigating the reason for these timeouts is the most pressing thing we can take away from this exercise.
Most troubling is that, in one case I saw many lines being logged from hickory stdout of the form:
I haven't been able to reproduce that since, but it happened on one of the runs where many timeouts were observed and I suspect the root cause is related...
Anyway, that's where I'm at so far. In general I'm finding this a challenging task and would welcome input from folks while I start to look into the timeouts/NXDOMAIN results.
@djc commented on GitHub (Mar 11, 2026):
This makes a lot of sense to me. Two thoughts:
@cpu commented on GitHub (Mar 11, 2026):
SGTM on both fronts.
@cpu commented on GitHub (Mar 11, 2026):
One chunk of these is from holes in my delegations in the zone data interacting differently with the two recursors' implementations of qname minimization.

`unbound` was using `qname-minimisation-strict: no` (the default) and would fall back to sending the full query when it hit a buggy NXDOMAIN during the NS probing. If I switch to `qname-minimisation-strict: yes`, `unbound` goes from 0 NXDOMAIN results to 1,764 in the single run. `hickory` still sees more (35,393), so there are other root causes to dig out.

We can't turn off qname minimization (https://github.com/hickory-dns/hickory-dns/issues/2917), and don't have the same fallback "relaxed" option either, so probably testing against `unbound` with strict enabled makes sense for the moment.