[dns-wg] next steps for DNS, Transitive Trust and How Many is Many

22 May 2026

I just watched Ondřej Surý's excellent talk, "How many DNS queries does
it take to resolve a single domain name?" (I declined to get up to watch it
live, and now I regret missing the interactive part...)

Jim Reid asked about turning this into a RIPE document.
And Joe Abley noted that an empty cache is not really the steady state.

My completely unsurprising comment is that the right "goldilocks" number is
going to depend on the situation, and I think that what we probably need is
a way to do reproducible tests.
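
To make "how many queries" concrete, here is a minimal toy sketch of the
counting such a test would do, comparing an empty cache with a warm one. The
zone list and the function name are made up for illustration, and the model
deliberately ignores the things that inflate real numbers (resolving the
addresses of NS names, QNAME minimisation, DNSSEC, retries), so it shows only
the shape of the measurement, not any real resolver's behaviour.

```python
# Hypothetical delegation chain, listed from the root downwards.
ZONE_CUTS = [".", "org.", "example.org."]

def count_queries(qname, cache):
    """Count upstream queries needed to resolve qname in this toy model.

    `cache` is a set of zones whose delegation is already known; it is
    updated in place as referrals are learned. The root is always known.
    """
    # Zone cuts that are ancestors of (or equal to) qname, root first.
    chain = [z for z in ZONE_CUTS
             if z == "." or qname == z or qname.endswith("." + z)]
    # Start at the deepest zone we already have a delegation for.
    start = max((i for i, z in enumerate(chain) if z in cache), default=0)
    queries = 0
    for i in range(start, len(chain)):
        queries += 1                    # one query to a server for chain[i]
        if i + 1 < len(chain):
            cache.add(chain[i + 1])     # the referral tells us the next zone
    return queries

cache = set()
cold = count_queries("www.example.org.", cache)    # 3: root, org., example.org.
warm = count_queries("mail.example.org.", cache)   # 1: straight to example.org.
```

The gap between `cold` and `warm` is the point: a test procedure has to say
which cache state it starts from, or the numbers are not comparable.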

The observation was that BIND 9.11 through 9.21 have radically different
numbers... And I imagine Unbound, and Windows Server's resolver, and dnsmasq,
and whatever Cloudflare, Google, and the other public resolvers use... each
will have different results.

Is there also some dependency on distance to the authoritative servers?
Certainly, the latency test is done for . and then the queries tend to stick
to the best one.  And LocalRoot changes this.
I've never been quite clear what the behaviour is for other levels.
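
For reference, the "stick to the best one" behaviour can be sketched as
picking the server with the lowest smoothed RTT and updating that estimate
after every reply. This is a generic sketch only: the server names, the
starting values and the smoothing factor are assumptions, and real resolvers
differ in the details.

```python
def pick_server(srtt):
    """Return the server with the lowest smoothed RTT estimate (ms)."""
    return min(srtt, key=srtt.get)

def update_srtt(srtt, server, sample_ms, alpha=0.3):
    """Fold a new RTT sample into the estimate (exponential smoothing).

    alpha is an assumed smoothing factor, not taken from any implementation.
    """
    srtt[server] = (1 - alpha) * srtt[server] + alpha * sample_ms

# Hypothetical servers with made-up starting estimates.
srtt = {"a.example-servers.": 20.0, "b.example-servers.": 80.0}
first = pick_server(srtt)                 # the nearby server wins
update_srtt(srtt, first, 400.0)           # it suddenly becomes slow
second = pick_server(srtt)                # selection moves to the other one
```

A test that moves the resolver further away changes the starting estimates,
which is one reason the same resolver could give different query counts in
different places.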

It's "easy" for most of the people reading this list to pop off and start a
container or VM or ./named-I-just-compiled and do some measurements.
I'm not sure how I'd test an empty public resolver instance!

Whether people like public resolvers or hate them, some portion of people use
them, and most zone owners probably want to make sure they aren't tripping up
some pathology in one place by optimizing another.

So one thing that I think we need is something that explains how to do the
tests.  Command lines in appendices or better, on a forkable-on-coding-site wiki.
Concepts in the document.

The next thing I think that we need is then a way (a procedure, not a tool),
given the above, to simulate some kind of failure.
Ondřej replied that if I'm running potatocoding.org, and the .org servers
are all down, then I'm toast, even if I've decided to put an NS in .net/.com
and .it.  That's relevant, but it's not the whole story.

Such broad, high-level outages are now rare, I think.
What isn't rare are 2016-style Mirai attacks on parts of the infrastructure.

I wonder how long teams.office.com takes to resolve if ns3-39.azure-dns.org
and ns3-39.azure-dns.info are both down or under a multi-TB/s attack, along
with ns3-05.azure.dns.org (where office.com is).  A resolver that did more
queries and cached more answers might be ahead... unless caching more answers
meant that it was more likely to have LRU'ed out some other useful answers.
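
The LRU concern can be shown with a small sketch: a fixed-size cache that
evicts the least recently used entry. The capacity and the cache keys are
invented for illustration, and real resolver caches are sized by memory and
also expire entries by TTL, which this ignores.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache with a fixed number of entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=3)
cache.put(("office.com.", "NS"), "useful delegation")
# Three extra answers cached "just in case" push the useful one out.
for name in ("extra1.example.", "extra2.example.", "extra3.example."):
    cache.put((name, "A"), "192.0.2.1")
evicted = cache.get(("office.com.", "NS")) is None   # True
```

Whether the extra answers help or hurt then depends on cache size and query
mix, which is another thing a reproducible test would have to pin down.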

How do I simulate/test that?
How do I simulate my resolver being in some other continent?
Can I still resolve canada.ca when all our fiber to the US is turned off as
part of another trade dispute?  (Maybe more relevant to smaller *island* nations!!!)
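
Before building a real testbed (where one might drop packets or add delay
with firewall rules or traffic shaping such as Linux tc netem), the questions
above can at least be framed with a back-of-the-envelope model. Everything in
this sketch is an assumption: the server names, the RTTs, the timeout, and the
idea that a resolver tries servers strictly in order with one attempt each.

```python
def time_to_answer(servers, down, rtt_ms, timeout_ms=800, extra_rtt_ms=0):
    """Estimate milliseconds until the first answer, or None if all fail.

    servers      : names tried in order (assumed; real selection differs)
    down         : set of unreachable servers; each costs one full timeout
    rtt_ms       : dict of round-trip times for the reachable servers
    extra_rtt_ms : added to every RTT, to model a resolver further away
    """
    elapsed = 0
    for server in servers:
        if server in down:
            elapsed += timeout_ms              # wait out the timeout
            continue
        return elapsed + rtt_ms[server] + extra_rtt_ms
    return None                                # nothing answered

# Hypothetical NS set spread over three TLDs.
servers = ["ns.example.org.", "ns.example.info.", "ns.example.net."]
rtt = {"ns.example.org.": 20, "ns.example.info.": 25, "ns.example.net.": 30}

healthy = time_to_answer(servers, set(), rtt)                        # 20
two_down = time_to_answer(servers, set(servers[:2]), rtt)            # 1630
far_away = time_to_answer(servers, set(servers[:2]), rtt,
                          extra_rtt_ms=150)                          # 1780
all_down = time_to_answer(servers, set(servers), rtt)                # None
```

The interesting output is not the success or failure but the curve in
between, which is the "how, not if" of degradation.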

It seems to me that knowing how things degrade (not if, but how) could become
an important part of due-diligence.

I think someone will need to fund this, even if it's "only" in the form of
research grants to graduate students.

--
Michael Richardson <mcr+IETF@sandelman.ca>   . o O ( IPv6 IøT consulting )
           Sandelman Software Works Inc, Ottawa and Worldwide

**       My working hours and your working hours may be different.         **
** Please do not feel obligated to reply outside your normal working hours **