Joe,

Joe Abley wrote:
Encouraging people to fix unresponsive nameservers seems like a more obviously good idea than encouraging people to fix measurably-lame delegations.
This is true. The current document (RIPE 400) does not make any distinction between timeouts and other responses. Making that distinction would probably be a nice improvement (and it is something that can be easily measured to determine the scope of the lameness problem).
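To illustrate the kind of distinction I mean, here is a minimal sketch of how probe results could be bucketed. It assumes each probe has already been reduced to two flags, whether the server responded at all and whether the response had the authoritative (AA) bit set. The names `Outcome`, `classify`, and the sample probe data are hypothetical and are not taken from RIPE 400 or from any existing checker.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    OK = "authoritative answer"
    TIMEOUT = "no response (unresponsive server)"
    LAME = "responded, but not authoritatively"


def classify(responded: bool, authoritative: bool) -> Outcome:
    """Bucket a single probe result (hypothetical input format)."""
    if not responded:
        return Outcome.TIMEOUT
    return Outcome.OK if authoritative else Outcome.LAME


# Made-up probe results: (responded, authoritative). Not measured data.
probes = [(True, True), (False, False), (True, False), (True, True)]
tally = Counter(classify(r, a) for r, a in probes)

for outcome, count in tally.items():
    print(f"{outcome.value}: {count}")
```

Counting the two failure buckets separately would show how much of the reported lameness is really unresponsive servers.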
The other consideration that has come up in other regions (and which I have also not seen a study supporting) is that lame delegations cause increased traffic for the servers that publish them, and that the increased traffic presents an operational problem.
Which servers are supposed to be seeing increased traffic? The parent servers should see exactly the same number of queries. The non-lame servers in the NS-set will of course receive an increased number of queries, but this is the same number they would receive if you removed the lame servers from the set. The people who will see more traffic are people running recursive resolvers, due to failed queries. The RIPE document initiating the lame delegation mails says 11-13% of name servers are lame. We have no idea what the actual effect is, of course, since this has not been measured. It could be more (if these lame servers are in heavily-queried zones), or it could be less (if these lame servers are in lightly-queried zones). Basically, we have no data at all.
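A small worked example shows why the share of lame servers says little about the share of affected queries. This is only a sketch: the server names and query volumes below are invented for illustration and are not measurements, and the function names are hypothetical.

```python
# Hypothetical data: (server name, is_lame, queries per day). Not measured values.
servers = [
    ("ns1.example.net", False, 9000),
    ("ns2.example.net", True, 500),
    ("ns3.example.net", False, 400),
    ("ns4.example.net", True, 100),
]


def lame_share_by_count(servers):
    """Fraction of servers that are lame (a per-server figure)."""
    return sum(1 for _, is_lame, _ in servers if is_lame) / len(servers)


def lame_share_by_queries(servers):
    """Fraction of all queries that are sent to lame servers."""
    total = sum(queries for _, _, queries in servers)
    lame = sum(queries for _, is_lame, queries in servers if is_lame)
    return lame / total if total else 0.0


print(f"lame by server count: {lame_share_by_count(servers):.0%}")
print(f"lame by query volume: {lame_share_by_queries(servers):.0%}")
```

With these made-up numbers, half of the servers are lame but only 6% of queries reach them. Swap the volumes around and the effect goes the other way, which is why a per-server percentage alone does not tell us the operational impact.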
Philosophically, it makes sense to me for the zone operator (the NCC) to care about the accuracy of the data published in the zone. It's not clear to me that doing so has any operational benefit to the NCC, however.
What I hear Shane saying is that since he sees no operational benefit in the checks, as the operator of the number resource, he ought to have the option of opting out of them. I note that opting out of lame delegation checks is possible at ARIN (since I happened to get some robot lame delegation checker mail from them the other day, and the mail mentioned it).
My basic position is that DNS works as well as it does because it allows an operator to spend as much time and money as it takes to make their DNS service as fast and reliable as they need it to be. The benefits go to the operator, and they pay the costs.

If all the hosts in my domain run on a server at home at the end of a flaky DSL connection, I don't need multiple servers running from diverse autonomous systems. However, if I am running an online auction site where downtime means lawsuits, then I *can* build an anycast cloud running with full redundancy in hundreds of sites around the world.

I realize that this only works if all parent domains are also fast and reliable. But this is normally only the parent and a TLD, and these organizations tend to take DNS very, very seriously.

What does this mean for lameness checking? If lameness causes problems for an operator, they will fix it. If it does not, they don't care (and that is OKAY).

I am fully in favor of parent zones trying to help people keep their delegations working. The RIPE NCC already provides tools to check zone quality, and also checks the quality of delegations when you add or change reverse DNS configuration. Without actual evidence that there is a real problem that harms the user experience, anything else seems pointless.

--
Shane