Got a weird DNS issue that's stumped me.

Here's a good one.

dns2 is primary nameserver for a local zone 'servers.datacentre.foo'. Running Debian Lenny, bind9, nothing unusual. Update this zone to add a Xen guest recently provisioned at the datacentre. Reload dns with rndc reload, the updated zone is transferred to the slave nameserver, which is dns1 (yeah I don't know why either, what's in a name).

/etc/resolv.conf on all the servers shows that dns1 is first, dns2 second - as far as I understand resolv.conf etc, the first nameserver will always be queried unless it's not healthy.. or unless 'options rotate' is specified in the resolv.conf which provides a round-robin approach of lookups for some programs..

The /etc/resolv.conf on dns2 (the primary), shows 127.0.0.1 as first, then the IP of dns1 as second.
This might be my clue, as dns1's resolv.conf is itself first (but not as localhost, its 192.168 address), and the IP of dns2 as second. But should that make a difference? It should be able to query itself on its 192.168 address

Despite the zone transferring to dns1, and I can look at the zone itself and see it's updated, the other servers get no A record for the newly added machine when querying dns1!

A dig @dns2, no problem. Sure I could reverse the order in the resolv.conf but I'd rather understand why this issue is happening first! :)

Even when I do a dig @localhost on dns1, the slave, for the host, there's no result!

After some time (never sat there and continued to query to see how long, maybe a couple hours?) it is as though a TTL expires and dns1 starts to correctly respond with the answer it should already know about.

Yes, we are using views, the servers querying dns1 are in the 'local' acl and I have confirmed this by running tcpdumps on dns1 to observe the requesting query from one of the other servers.

The fact that it works eventually, after a long while, with no other changes made, suggests a caching, or ttl thing. But why would a slave nameserver cache it's own lack of knowledge of a hostname for resolution, when it's already told dns2 that it's picked up the new zone as requested?

What am I missing?

Next time it occurs I'm going to run the lookups through strace on slave to see what it's trying to do.. the answer is probably obvious but I've been staring at it for too long!

Update: We fixed it, but not sure how, bunch of different things were applied. Details to follow.

Tags: