[GH-ISSUE #3435] Recursive resolver benchmarks #1181

Open
opened 2026-03-16 01:49:41 +03:00 by kerem · 12 comments

Originally created by @djc on GitHub (Jan 9, 2026).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/3435

Originally assigned to: @cpu on GitHub.

We need some reproducible benchmarks for the recursive resolver. Ideally we'd have some kind of comparison to Unbound. Should think about how we come up with a realistic query load.


@cpu commented on GitHub (Jan 29, 2026):

I started to look at this, and I think it's both a tricky problem and one with a lot of potential ways to attack it. It feels like a case where it would be helpful to lay out some options and discuss their pros/cons. All benchmarking is a question of context and not over-reading the results but that feels even more true than usual for this particular domain. Very happy to hear suggestions from folks more familiar with operational DNS or benchmarking.

I found this post by ISC (https://www.isc.org/blogs/bind-resolver-performance-july-2021/) a useful introduction to some of the challenges in this space, but to criminally under-summarize, it's different from benchmarking authoritative servers because queries aren't independent and there's no shortage of confounding factors. There's also not a single metric that accurately distills overall performance in all conditions.

Design questions

I think the two most important axes of the benchmarking design are the client traffic, and the authoritative servers being recursively queried.

Client traffic

Real client traffic

On the client traffic side, the most realistic option is to capture a PCAP of representative real-world query data, and then to replay it accurately with a tool like dns-shotgun (https://dns-shotgun.readthedocs.io/en/stable/).

Pros: realistic query distribution and timing patterns, more natural cache hit/miss ratios, closest to real user experience.
Cons: need to source that data from somewhere (privacy issues...), hard to pair with synthetic authoritative servers, harder to scale up/down

Synthetic client traffic

This would involve using a tool like dns-flamethrower (https://github.com/DNS-OARC/flamethrower/) and one or more of its generators (https://github.com/DNS-OARC/flamethrower/blob/main/man/flame.1.md#generators) to make synthetic client queries, or using a hard-coded list of data like Top-N domains.

Pros: No PCAP required, very reproducible, easy to scale up/down, pairs well with synthetic auth servers.
Cons: artificial query distribution, artificial kinds of queries, generally less representative of the real world.

Authoritative servers

Real authoritative servers

Like what ISC described, this would involve letting the recursor talk to the live internet and remote authoritative servers.

Pros: realistic latencies, zone data, etc. Exercises real timeout/retry logic.
Cons: hard to reproduce/compare across implementations, interference from stateful firewalls or bandwidth limitations

Synthetic authoritative servers

This would be a situation where we stand up our own test auth. servers and populate them with realistic zone data as best as we can.

Pros: reproducible, fewer confounding effects from the network, easier to compare between impls
Cons: unrealistic latency/network conditions, hard to generate realistic zone data

Recommendation

I don't think we have a good place to sample client traffic to get a PCAP, so I think we can exclude considering real client traffic for now. I also think we shouldn't let perfect stand in the way of good for a project that has no significant benchmarking. I think a closed loop environment would help us uncover plenty of interesting performance metrics even if it's not truly representative of what one would see in production. It would also let us side-step some of the thornier network provisioning issues that could confound a test setup that had to reach the wider internet.

I think we'd want three servers:

  1. A load generator running flamethrower
  2. A resolver machine running hickory/unbound
  3. An auth server running hickory/knot/nsd

For metrics, we should be looking at:

  1. QPS
  2. Latency percentiles
  3. Response codes/timeouts

For zone data, I think we should start simply and generate a single zone with lots of records and varying TTLs. We'd want to set up fake root data that delegates to itself so we can do full recursion against the one server. I think we should try to save ourselves some work and use an off-the-shelf tool like flamethrower to generate the client queries unless we find a reason that it doesn't meet our needs.
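
To make the self-delegating root concrete, here is a minimal sketch of what such a root zone could look like (the names and the 127.0.0.1 glue address are purely illustrative, not an actual file from this setup):

```
; hypothetical root.zone: every TLD delegates back to the one local auth server
.             3600 IN SOA ns.fake-root. admin.fake-root. 1 3600 900 86400 3600
.             3600 IN NS  ns.fake-root.
ns.fake-root. 3600 IN A   127.0.0.1
; each synthetic TLD points at the same server, so full recursion stays local
com.          3600 IN NS  ns.fake-root.
net.          3600 IN NS  ns.fake-root.
```

The same file could then double as the recursor's root hints, so every referral chain resolves against the local server.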

It will also take some care to tune each recursor equivalently, especially knowing that Hickory/Unbound/etc offer different configuration parameters. We'd also need to tune the hosts themselves to avoid common benchmarking gotchas like running out of FDs or bumping into CPU scaling.

We should probably run a handful of the dns-flamethrower generators since the query patterns test different caching effects. E.g., using 'randomlabel' is going to benchmark raw resolution throughput since it will be generating cache misses. Using 'static' is the other end of the spectrum and would be dominated by cache hits.

We should probably try a handful of protocols, starting with UDP Do53, then TCP Do53, then DoT.

Future work

In terms of directions to head afterwards:

  1. I think a single unsigned zone stuffed with fake record data is very un-representative, and something with more complex delegations, record types, and DNSSEC would change the observed performance characteristics greatly. I'm not sure the best strategy to generate this and would appreciate input in this direction.
  2. We could remove the synthetic authoritative servers, and test against the live internet using the Tranco Top 100k domain list and a dns-flamethrower client using the FILE generator. This would be very realistic, but operationally more difficult and harder to reproduce. I think it's worth doing (particularly to get a sense of performance when things like retries and timeouts are a factor) but it could be done after we knock out low-hanging fruit from a closed/artificial environment.

Thoughts? Questions? Areas I'm missing an important detail?


@djc commented on GitHub (Feb 2, 2026):

(Exceeded my target SLA for answering this -- sorry for the slowish response.)

I agree that using a completely synthetic setup probably makes for a good start. I'm still wondering if we can glean something useful from an existing large deployment (like Unbound at Let's Encrypt) -- like if we'd get aggregate statistics on something like cache hit ratios I think that might be very useful in trying to make our synthetic data more realistic.

Similarly, maybe we can use some top domain list but mirror it into our synthetic authoritatives? After all, the DNS is of necessity mostly public anyway.

@divergentdave can you or someone else at the ISRG (maybe Andrew) have a look at the discussion here, and maybe figure out if there is some data you can share that helps us get a handle on the scope of this problem?

I think the high-level question here is, what are the major factors influencing the latency involved in answering a recursive query?


@bdaehlie commented on GitHub (Feb 2, 2026):

@mcpherrinm ^


@divergentdave commented on GitHub (Feb 2, 2026):

I looked at metrics a while back and pulled out some summary statistics about our workload. The overall cache hit ratio is around 25%. The ratio between incoming recursive queries and outgoing queries is about 1:4.


@cpu commented on GitHub (Feb 12, 2026):

Similarly, maybe we can use some top domain list but mirror it into our synthetic authoritatives? After all, the DNS is of necessity mostly public anyway.

I think that makes sense as a place to start for our synthetic auth. server zone data. I've started playing with this idea and the Tranco Top 1M and have a prototype bootstrapping zone data from the top 10k of the 1M loaded into knot on my local machine. One downside of this approach is that the data is exclusively eTLDs (public suffixes) and eTLD+1s (no deeper subdomains), so I think the resulting delegation depth is going to be pretty shallow relative to real-world query patterns.

The overall cache hit ratio is around 25%. The ratio between incoming recursive queries and outgoing queries is about 1:4.

That's helpful, thank you. I'm going to start with trying to gin up the input domains to dns-flamethrower to see if I can reliably produce something approaching this before scaling up to the full 1M and a proper dedicated benchmarking server.

@mcpherrinm ^

@mcpherrinm gentle ping: if you have any additional ideas on how to provide us more data to better reflect your operating conditions, it would be helpful to hear them soon.


@mcpherrinm commented on GitHub (Feb 12, 2026):

Hi! Sorry, missed the ping earlier.

One possible source of realistic names to look up is Sunlight CT log "name tiles". They are much smaller than the full CT logs, as they just have the issued names in them. https://github.com/FiloSottile/sunlight/blob/main/names-tiles.md That might be better than using the Tranco lists, and it's relatively easy to grab the JSON.

Our issuance is split something like 5% TLS-ALPN-01, 45% DNS-01 and 50% HTTP-01, though which is used for any particular name isn't public, if you wanted to synthesize queries matching those portions of challenges, along with the CAA lookups for them.
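
For illustration, a quick Python sketch of synthesizing a query mix from those proportions. The 5/45/50 split is from the comment above and the `_acme-challenge.<domain>` TXT form for DNS-01 is standard ACME, but the helper names and the output shape here are hypothetical, not any tool mentioned in this thread:

```python
import random

# Reported issuance split from the comment above. Which challenge any given
# name actually used isn't public, so we sample challenge types by weight.
CHALLENGE_WEIGHTS = {"tls-alpn-01": 5, "dns-01": 45, "http-01": 50}

def queries_for(domain: str, challenge: str) -> list[tuple[str, str]]:
    """(qname, qtype) pairs a validation of `domain` might trigger."""
    queries = [(domain, "CAA")]  # CAA is looked up regardless of challenge
    if challenge == "dns-01":
        # DNS-01 places the token in a TXT record at _acme-challenge.<domain>
        queries.append((f"_acme-challenge.{domain}", "TXT"))
    else:
        # HTTP-01 / TLS-ALPN-01 both just connect to the host over A/AAAA
        queries += [(domain, "A"), (domain, "AAAA")]
    return queries

def synthesize(domains, rng=random):
    """Expand a domain list into a weighted query mix."""
    kinds, weights = zip(*CHALLENGE_WEIGHTS.items())
    out = []
    for d in domains:
        (challenge,) = rng.choices(kinds, weights=weights)
        out.extend(queries_for(d, challenge))
    return out
```

The resulting (qname, qtype) pairs could then be dumped in whatever input format the load generator expects.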


@cpu commented on GitHub (Feb 13, 2026):

One possible source of realistic names to look up is Sunlight CT logs

That's a really good idea 👍

Our issuance is split something like 5% TLS-ALPN-01, 45% DNS-01 and 50% HTTP-01

That's also helpful, thanks! I've been reconstituting SOA, NS, TXT, A, AAAA and CAA records and I think that'll cover all bases. HTTP-01 and TLS-ALPN-01 challenges won't be meaningfully different at the DNS layer (just A/AAAA lookups + CAA, right?) but DNS-01 will be, and it's useful to know it's nearly half of all domain validation methods.


@mcpherrinm commented on GitHub (Feb 13, 2026):

The only difference between TLS-ALPN-01 and HTTP-01 I can think of would be any additional resolution from following HTTP redirects, but that seems unlikely to be relevant for this testing.

We do about an equal number of A and AAAA lookups.


@cpu commented on GitHub (Mar 10, 2026):

I've spent some time running a few different benchmarking experiments and I think I've made enough progress that it makes sense to try and summarize some results so far.

There are some confusing elements to dig into further, but the top-level takeaways are that:

  1. In the shorter test scenario where we're aiming to match the cache hit ratio of LE, we do less QPS than Unbound.
  2. In the longer "soak" test scenarios, with one flamethrower instance, we do about equivalent QPS but I think this is a case where we're comparing performance of fully cached data.
  3. In the longer "soak" test scenarios, with many flamethrower instances, Hickory starts to produce lots of timeouts when I increase the load in a way Unbound does not. This is troubling to me.
  4. There are some mysterious effects at play, like NXDOMAIN results from hickory that aren't observed with unbound (and similar differences in SERVFAIL rates). This too seems troubling.

I think digging into 3 and 4 is the best next step before trying to fiddle with the benchmarking setup itself, or switching to other sources of domains/zone data (e.g. CT tiles).

In general I continue to be fairly concerned that a fully local test scenario is not a great way to benchmark a recursive resolver, and that with real world latencies and a variety of remote authoritative server implementations we would see vastly different results. To paraphrase Mark Twain, "There are three kinds of lies: lies, damned lies, and benchmarks".

Before getting into the numbers, here's a general summary of the setup:

Server

These tests were done on a dedicated bare metal OVH cloud server:

  • CPU: AMD EPYC 4465P - 12c/24t - 3.4 GHz/5.4 GHz
  • RAM: 64 GB 5600 MHz
  • Disk: 2×960 GB SSD NVMe (Soft RAID)
  • OS: Ubuntu 24.04

For software versions, I tested with the 24.04 Ubuntu packages for knot and unbound:

  • Knot 3.3.4
  • Unbound 1.19.2

DNS-flamethrower isn't packaged for Ubuntu, so I built it from source:

  • dns-flamethrower 0.12.0.

Similarly, I built HickoryDNS from source with a --release configuration.

I haven't concerned myself with testing the latest & greatest of either knot (3.5.3) or unbound (1.24.2) but it wouldn't be too much work to change to those versions in the future.

Zone data

As described in my earlier comments, I bootstrapped artificial zone data for the knotd authoritative DNS server using the Tranco Top 1M domains list. I wrote some throwaway scripts/tooling that performed iterative DNS resolution (RD=0 queries) from the root servers downward for each domain, recording the delegation steps observed (NS records, glue presence/absence, TTLs, and answer records for A, AAAA, TXT and CAA). These were used to generate RFC 1035 zone files for each eTLD+1 and a TSV file with the delegation metadata + general harvest info.
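
Those scripts aren't shown here, but the core trick (clearing the RD bit so each server answers non-recursively and hands back a referral) can be sketched with nothing but the stdlib. Everything below, down to the hard-coded wire-format constants, is a from-scratch illustration rather than the actual tooling:

```python
import struct

QTYPE_A, QCLASS_IN = 1, 1
FLAG_RD = 0x0100  # "Recursion Desired" bit in the DNS header flags field

def make_query(qname: str, qtype: int = QTYPE_A, rd: bool = False) -> bytes:
    """Build a wire-format DNS query; rd=False makes it iterative (RD=0)."""
    flags = FLAG_RD if rd else 0
    # Header: ID, flags, QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack(">HH", qtype, QCLASS_IN)

def rd_bit(packet: bytes) -> bool:
    """Read the RD flag back out of a wire-format DNS message."""
    (flags,) = struct.unpack(">H", packet[2:4])
    return bool(flags & FLAG_RD)
```

Sending such a packet over UDP to a root server's address, pulling the referral (NS records + glue) out of each response, and re-querying the next server down is then the harvesting loop.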

I did some very light post-processing to transform this harvested data into a complete synthetic DNS hierarchy suitable for local load testing. I used the TSV + delegation data to generate synthetic root and TLD zones with NS records pointing back to the local knot server, and rewrote the harvested NS records in each zone file to use synthetic local nameservers, with glue pointing to localhost. I also filtered out IDN domains because I was having trouble feeding them through flamethrower (though I might just need to punycode them first?). Lastly, I wrote a script that created a Knot configuration that loads the synthetic root + eTLD+1 zones. I haven't DNSSEC-signed any of the zones, or configured the recursors to enforce DNSSEC; this is exclusively Do53 testing without DNSSEC so far.
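
On the IDN question above: Python's built-in "idna" codec can do the punycoding, so a pre-processing pass over the domain list might be enough. A stdlib-only sketch (the helper name is made up):

```python
def to_punycode(domain: str) -> str:
    """ASCII-compatible (Punycode) form of a possibly-IDN domain."""
    # The stdlib "idna" codec implements per-label IDNA 2003 encoding.
    # That should be fine for feeding a load generator, though modern
    # registries use IDNA 2008 (covered by the third-party `idna` package).
    return domain.encode("idna").decode("ascii")
```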

This ended up producing ~920,000 zone files with a total of 11.9 million RRs (averaging ~13 per zone). The breakdown of record types was roughly ~35% NS, ~25% A, ~23% TXT, ~7% SOA, ~4% CAA, ~4% AAAA. In sum it's about 3.8 GB of zone data and represents domains across ~1,045 TLDs.

Server configuration

I only made some minor tweaks to the default 24.04 server installation.

I applied a few system-level tunings:
sysctl -w net.core.rmem_max=134217728  
sysctl -w net.core.wmem_max=134217728  
sysctl -w net.core.rmem_default=16777216  
sysctl -w net.core.wmem_default=16777216
sysctl -w net.netfilter.nf_conntrack_max=1048576  
sysctl -w net.nf_conntrack_max=1048576
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
iptables -t raw -A PREROUTING -i lo -j NOTRACK
iptables -t raw -A OUTPUT -o lo -j NOTRACK
ulimit -n 1048576

knot was used for the authoritative DNS server hosting the synthetic zone data for all tests. I used flamethrower as a client feeding queries to a recursive resolver (either unbound or hickory-dns depending on the test).

Here's the knot config I used (minus the many, many `zone` file definitions):
server:
    listen: 127.0.0.1@53
    udp-workers: 6
    tcp-workers: 2
    background-workers: 2
    tcp-reuseport: on
    udp-max-payload: 4096
    tcp-max-clients: 100
    identity: "knot-loadtest"
    version: "3.3"
    user: hickory:users
    pidfile: "/tmp/knot-loadtest.pid"
control:
    listen: "/tmp/knot-loadtest.sock"
    timeout: 5
database:
    storage: "/tmp/knot-loadtest-db"
    journal-db: "/tmp/knot-loadtest-db/journal"
    journal-db-max-size: 1G
    kasp-db: "/tmp/knot-loadtest-db/kasp"
    kasp-db-max-size: 100M
    timer-db: "/tmp/knot-loadtest-db/timer"
    timer-db-max-size: 2G
log:
    - target: stderr
      any: info
template:
  - id: default
    storage: "./zone_files"
    file: "%s.zone"
    zonefile-sync: -1
    journal-content: none
    semantic-checks: off
zone:
  - domain: .
    file: root.zone
    template: default

Here's the unbound config I used:

server:
    username: "hickory"
    interface: 127.0.0.2@53
    access-control: 127.0.0.0/8 allow
    access-control: 0.0.0.0/0 refuse
    do-ip4: yes
    do-ip6: yes
    do-udp: yes
    do-tcp: yes
    num-threads: 14
    msg-cache-slabs: 16
    rrset-cache-slabs: 16
    infra-cache-slabs: 16
    key-cache-slabs: 16
    outgoing-range: 8192
    so-rcvbuf: 16m
    so-sndbuf: 16m
    msg-cache-size: 8g
    rrset-cache-size: 16g
    infra-cache-numhosts: 32768
    key-cache-size: 256m
    cache-min-ttl: 60
    cache-max-ttl: 3600
    cache-max-negative-ttl: 3600
    root-hints: "/home/hickory/tranco-1m/zone_files/root.zone"
    do-not-query-localhost: no
    module-config: "iterator"
    serve-expired: no
    prefetch: no
    prefetch-key: no
    jostle-timeout: 200
    infra-host-ttl: 900
    infra-cache-max-rtt: 5000
    verbosity: 1
    log-queries: no
    log-replies: no
    use-syslog: no
    log-time-ascii: yes
    statistics-interval: 0
    extended-statistics: yes
    statistics-cumulative: yes
remote-control:
    control-enable: yes
    control-interface: 127.0.0.1
    control-port: 8953

Here's the hickory-dns config I used (along with --workers=14 on the CLI):

user = "hickory"
group = "users"

listen_addrs_ipv4 = ["127.0.0.2"]
listen_port = 53

[[zones]]
zone = "."
zone_type = "External"

[zones.stores]
type = "recursor"
roots = "/home/hickory/tranco-1m/zone_files/root.zone"
ns_cache_size = 32768
response_cache_size = 8388608
recursion_limit = 24
ns_recursion_limit = 24
allow_server = ["127.0.0.1/32"]
deny_server = []

[zones.stores.cache_policy.default]
positive_max_ttl = 86400
positive_min_ttl = 60
negative_max_ttl = 3600

[zones.stores.cache_policy.A]
positive_max_ttl = 3600
positive_min_ttl = 60

[zones.stores.cache_policy.AAAA]
positive_max_ttl = 3600
positive_min_ttl = 60

[zones.stores.cache_policy.TXT]
positive_max_ttl = 3600
positive_min_ttl = 60

[zones.stores.cache_policy.CAA]
positive_max_ttl = 3600
positive_min_ttl = 60

For flamethrower, I used either -n1 for the cache simulation tests, or -l for the duration-based tests, and left other parameters at their defaults (e.g. allowing flamethrower to make as many queries as it can, with a 1ms delay between queries, and an intra-process concurrent traffic generator setting of 10):

flame -f flame-queries.txt -n 1 -o $OUTPUT.json 127.0.0.2
flame -f flame-queries.txt -l $DURATION -R -o $OUTPUT.json 127.0.0.2

Since nproc on this machine reports 24, I tried to pin relevant software components to dedicated cores to avoid contention, and tried to configure relevant worker thread settings to match the cores. In particular:

  • knot was started under taskset -c "0-3"
  • unbound or hickory-dns were started under taskset -c "4-15"
  • flamethrower instances were started under taskset using cores 16..21
  • node-exporter, htop and other misc admin tasks used the last 2 cores.

I restarted knot between tests, though that was probably unnecessary. I made sure unbound was stopped when hickory-dns was running, and vice versa.

As noted in the flamethrower docs on concurrency, it's a single-threaded application, so I experimented with running 1-6 copies of the program at a time pinned to the cores in the allocated range (more on that later...).

Even with 6 flamethrower instances I was never able to max out the CPU usage of the system. The flamethrower instances seemed to stall out at around ~40% CPU usage per core, and the unbound/hickory instances hovered around ~20-35% usage per core. Notably, knot's CPU usage never went over ~1%, and we could probably "steal" cores from there if needed.

Query input for flamethrower

My first thought was to try and match the cache hit ratio and RR type diversity described in the comments above by crafting the domain list provided to flamethrower. In particular, I used a repetition-based cache simulation where a "hot set" of 8,000 domains per record type is repeated 4 times in the query file, while a "cold set" of 64,000 domains per type appears only once. This was trying to achieve a 25% cache hit ratio: after warm-up, the first occurrence of each hot query populates the cache, and the subsequent 3 repetitions are served from cache. The queries were shuffled so hot entries are distributed throughout the file rather than appearing consecutively, trying to ensure the cache hit pattern emerges organically during steady-state operation.

The query mix targets a 25/25/25/25 split across A, AAAA, TXT, and CAA record types to roughly approximate the pattern at LE. Importantly I tried to use "positive-only" inputs for each record type, only including the (domain, type) pair if the zone actually has that record type. This avoids conflating positive and negative caching behavior, and I hoped would give a cleaner measurement signal. With that approach, CAA records are the limiting factor (~70k domains), which constrains the total pool size but keeps the experiment well-defined.

Ultimately that meant we had a query input file of ~382,000 domains. At the QPS both recursors are able to support, a "single run" through flamethrower with -n1 only lasts a few seconds. That most faithfully targets the cache hit/miss ratio but makes for very short load-testing runs, so I also ran flamethrower with -l and a fixed duration of 15 minutes, looping through the query file as needed (randomizing between loops). This obviously means the "cold set" isn't cold in this model, but it did allow for stressing the recursors.
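
The hot/cold construction described above can be sketched in a few lines of Python (the function name and the placeholder domains are illustrative, not the actual scripts used):

```python
import random

def build_query_file(hot, cold, hot_repeats=4, rng=random):
    """Shuffle together a repeated "hot set" and a once-seen "cold set".

    After warm-up, the first occurrence of each hot entry is a miss and
    the remaining (hot_repeats - 1) are hits, so the expected hit ratio is
        len(hot) * (hot_repeats - 1) / (len(hot) * hot_repeats + len(cold))
    """
    queries = list(hot) * hot_repeats + list(cold)
    rng.shuffle(queries)
    return queries

# Hitting the 25% target described above, per record type:
# 8,000 hot * 4 + 64,000 cold = 96,000 queries, of which
# 8,000 * 3 = 24,000 should be cache hits -> 24,000 / 96,000 = 25%.
```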

Single run results

For the tests where I used -n1 with flamethrower, the high-level flamethrower results for unbound were:

run id      : 7ffefc59f790
run start   : 2026-03-10T19:11:20Z
runtime     : 7.00067 s
total sent  : 382220
total rcvd  : 379811
min resp    : 0.05844 ms
avg resp    : -nan ms
max resp    : 586.507 ms
avg r qps   : 94952
avg s qps   : 76443
avg pkt     : 42.7419 bytes
tcp conn.   : 0
timeouts    : 2409 (0.630265%)
bad recv    : 0
net errors  : 0
responses   :
  NOERROR: 379765
  SERVFAIL: 46

and for hickory-dns:

run id      : 7fffa5025300
run start   : 2026-03-10T19:14:21Z
runtime     : 4.50107 s
total sent  : 382220
total rcvd  : 382220
min resp    : 0.118573 ms
avg resp    : -nan ms
max resp    : 195.735 ms
avg r qps   : 76443
avg s qps   : 63702
avg pkt     : 42.8649 bytes
tcp conn.   : 0
timeouts    : 0 (0%)
bad recv    : 0
net errors  : 0
responses   :
  NXDOMAIN: 35394
  NOERROR: 346422
  SERVFAIL: 404

In summary, unbound achieved a higher QPS (94,952 avg r qps vs 76,443 avg r qps) but encountered more timeouts, while hickory-dns had a lower QPS but no timeouts. One mystery I've yet to figure out is why, for the same set of input domains, the hickory-dns instance saw ~35k NXDOMAIN results while unbound saw 0. The difference in SERVFAIL results is also interesting, but less pronounced.
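To put the NXDOMAIN mystery in proportion, the rcode counts above can be converted into shares of received responses (a throwaway helper; the counts are transcribed from the two runs above):

```python
def rcode_shares(total_rcvd, rcodes):
    """Express each RCODE count as a percentage of received responses."""
    return {k: round(100.0 * v / total_rcvd, 2) for k, v in rcodes.items()}

unbound = rcode_shares(379811, {"NOERROR": 379765, "SERVFAIL": 46})
hickory = rcode_shares(382220, {"NOERROR": 346422, "NXDOMAIN": 35394,
                                "SERVFAIL": 404})
# NXDOMAIN is ~9% of hickory's responses, versus none at all for unbound.
```

So roughly 1 in 11 hickory-dns responses in this run was an NXDOMAIN.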

15m soak, 1 flamethrower instance

For a test where I dropped -n1, and used -R and -l 900, the high-level flamethrower results for unbound were:

run id      : 7ffdb3ec1d00
run start   : 2026-03-10T19:19:48Z
runtime     : 901.303 s
total sent  : 86027500
total rcvd  : 85930402
min resp    : 0.107421 ms
avg resp    : -nan ms
max resp    : 1899.95 ms
avg r qps   : 95362
avg s qps   : 95354
avg pkt     : 42.7416 bytes
tcp conn.   : 0
timeouts    : 97098 (0.112869%)
bad recv    : 0
net errors  : 0
responses   :
  NOERROR: 85925797
  SERVFAIL: 4605

and for hickory-dns:

run id      : 7ffcd9b30280
run start   : 2026-03-10T19:37:44Z
runtime     : 901.306 s
total sent  : 82225300
total rcvd  : 82225083
min resp    : 0.099446 ms
avg resp    : -nan ms
max resp    : 310.027 ms
avg r qps   : 91229
avg s qps   : 91138
avg pkt     : 42.7421 bytes
tcp conn.   : 0
timeouts    : 217 (0.000263909%)
bad recv    : 0
net errors  : 0
responses   :
  NXDOMAIN: 7611458
  NOERROR: 74526729
  SERVFAIL: 86896

In this case the avg r qps values are closer, but hickory-dns again reports more SERVFAIL and NXDOMAIN results. The number of timeouts seen by both recursors also increases.

Increasing flamethrower instances

Once I started increasing the number of flamethrower instances beyond 1 things looked more and more dire for hickory-dns. At the extreme, with 6 flamethrower instances the aggregate statistics across all instances looked like this for unbound:

Per-instance results:
--------------------------------------------------------------------------------
Core                 Sent         Rcvd     Timeouts     QPS(r)     QPS(s)
--------------------------------------------------------------------------------
core16           86001800     85566904       434896      94958      95336
core17           85927200     85495190       432010      94879      95263
core18           85822800     85400830       421970      94872      95236
core19           85837600     85417019       420581      94784      95151
core20           85801500     85378841       422659      94752      95110
core21           85688900     85262032       426868      94610      94987
--------------------------------------------------------------------------------

=== AGGREGATE TOTALS ===
--------------------------------------------------------------------------------
Total Sent:          515079800
Total Received:      512520816
Total Timeouts:      2558984 (0.50%)
Bad Recv:            0
Net Errors:          0
--------------------------------------------------------------------------------
Aggregate QPS (r):   568855
Aggregate QPS (s):   571083
--------------------------------------------------------------------------------
Min Response:        0.112911 ms
Max Response:        1911.72 ms
--------------------------------------------------------------------------------
NOERROR:             512506979
SERVFAIL:            13837

and for hickory-dns:

Per-instance results:
--------------------------------------------------------------------------------
Core                 Sent         Rcvd     Timeouts     QPS(r)     QPS(s)
--------------------------------------------------------------------------------
core16           85545200     30311501     55233699      33482      94817
core17           85564300     30291589     55272711      33462      94837
core18           85544500     30286876     55257624      33456      94827
core19           85524200     30176323     55347877      33342      94793
core20           85672200     30520464     55151736      33717      94971
core21           85406900     30044454     55362446      33188      94669
--------------------------------------------------------------------------------

=== AGGREGATE TOTALS ===
--------------------------------------------------------------------------------
Total Sent:          513257300
Total Received:      181631207
Total Timeouts:      331626093 (64.61%)
Bad Recv:            0
Net Errors:          0
--------------------------------------------------------------------------------
Aggregate QPS (r):   200647
Aggregate QPS (s):   568914
--------------------------------------------------------------------------------
Min Response:        12.4984 ms
Max Response:        614.321 ms

The unbound instance quite happily reached ~568,000 avg r QPS, while the hickory-dns instance managed only ~200,000 and, much worse, over 64% of its requests timed out.

Scaling down the number of flamethrower instances reduced the percentage of timeouts observed, indicating some kind of bottleneck: with 4 instances instead of 6, 47% of queries timed out, and with 3 instances it was ~28%.
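As a sanity check, the aggregate totals and the 64.61% timeout rate can be recomputed from the per-instance rows (values transcribed from the hickory-dns table above):

```python
# (sent, received, timeouts) per flamethrower instance, from the
# 6-instance hickory-dns run reported above
hickory = {
    "core16": (85545200, 30311501, 55233699),
    "core17": (85564300, 30291589, 55272711),
    "core18": (85544500, 30286876, 55257624),
    "core19": (85524200, 30176323, 55347877),
    "core20": (85672200, 30520464, 55151736),
    "core21": (85406900, 30044454, 55362446),
}

def aggregate(per_instance):
    """Sum the per-instance counters and derive the timeout percentage."""
    sent = sum(s for s, _, _ in per_instance.values())
    rcvd = sum(r for _, r, _ in per_instance.values())
    timeouts = sum(t for _, _, t in per_instance.values())
    return sent, rcvd, timeouts, 100.0 * timeouts / sent

sent, rcvd, timeouts, pct = aggregate(hickory)
```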

I think investigating the reason for these timeouts is the most pressing thing we can take away from this exercise.

Most troubling: in one case I saw many lines of the following form logged to hickory's stdout:

thread 'hickory-server-runtime' (265804) panicked at /home/hickory/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.32/src/future/future/shared.rs:300:25:
inner future panicked during poll

I haven't been able to reproduce that since, but it happened on one of the runs where many timeouts were observed, and I suspect the root cause is related.

Anyway, that's where I'm at so far. In general I'm finding this a challenging task and would welcome input from folks while I start to look into the timeouts/NXDOMAIN results.

@djc commented on GitHub (Mar 11, 2026):

I think digging in to 3 and 4 are the best next steps before trying to fiddle with the benchmarking setup itself, or switching to other sources of domains/zone data (e.g. CT tiles).

This makes a lot of sense to me. Two thoughts:

  • I think it would be great if you can minimize the reproduction path for both of the issues you've observed, which should make it easier for other people to help out with the analysis (although I will be fairly busy for the rest of this month).
  • It sounds like Hickory might lack backpressure on one or more paths, maybe that's something to look into as well?
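To illustrate the backpressure idea in general terms (this is a generic asyncio sketch, not Hickory's actual tokio-based architecture; `handle_query` is a stand-in for the real resolution path): a semaphore caps the number of in-flight resolutions, so overload queues at admission instead of accumulating unbounded concurrent work.

```python
import asyncio

async def handle_query(query: str) -> str:
    await asyncio.sleep(0)  # stand-in for actual recursive resolution work
    return f"answer:{query}"

async def serve(queries, max_in_flight=512):
    """Resolve `queries`, admitting at most `max_in_flight` at a time."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(q):
        async with sem:  # excess queries wait here instead of piling up work
            return await handle_query(q)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(q) for q in queries))

answers = asyncio.run(serve([f"q{i}" for i in range(20)], max_in_flight=4))
```

Without a bound like this, a flood of timed-out client retries can keep spawning new resolution work that competes with the work already in progress.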
@cpu commented on GitHub (Mar 11, 2026):

Two thoughts

SGTM on both fronts.

@cpu commented on GitHub (Mar 11, 2026):

One mystery I've yet to figure out is why for the same set of input domains, the hickory-dns instance saw ~35k NXDOMAIN results while unbound saw 0.

One chunk of these comes from holes in the delegations in my zone data interacting differently with the two recursors' implementations of qname minimization. unbound was using qname-minimisation-strict: no (the default) and would fall back to sending the full query when it hit a buggy NXDOMAIN during the NS probing. If I switch to qname-minimisation-strict: yes, unbound goes from 0 NXDOMAIN results to 1,764 in the single run. hickory still sees more (35,393), so there are other root causes to dig out.

We can't turn off qname minimization (https://github.com/hickory-dns/hickory-dns/issues/2917), and we don't have the same "relaxed" fallback option either, so testing against unbound with strict mode enabled probably makes sense for the moment.
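A toy model of the strict-vs-relaxed difference (neither implementation's actual logic; `lookup` stands in for querying the authoritative servers): a qname-minimizing resolver probes suffixes from the root downward, and on an intermediate NXDOMAIN either surfaces the failure (strict) or retries with the full qname (relaxed, unbound's default).

```python
def minimized_probe_names(qname: str):
    """Yield probe names shortest-suffix first, e.g. for
    "www.example.com.": "com.", "example.com.", "www.example.com."."""
    labels = qname.rstrip(".").split(".")
    for i in range(len(labels) - 1, -1, -1):
        yield ".".join(labels[i:]) + "."

def resolve(qname, lookup, strict):
    """`lookup(name)` returns an rcode string for a probe query."""
    for name in minimized_probe_names(qname):
        if lookup(name) == "NXDOMAIN":
            if strict:
                return "NXDOMAIN"  # strict: surface the intermediate failure
            return lookup(qname)   # relaxed: retry with the full qname
    return "NOERROR"
```

With a delegation "hole" at an intermediate label, strict mode returns NXDOMAIN for the full name while relaxed mode recovers, matching the rcode gap observed above.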
