DNS lookups sometimes take 5 seconds

I have a VM running Debian Wheezy on which some hostname lookups take several seconds to complete, even though the resolver replies immediately. Strangely, lookups with getaddrinfo() are affected, but gethostbyname() is not.

I’ve switched to the Google resolvers to exclude the possibility that the local ones are broken, so my /etc/resolv.conf looks like:

search my-domain.com
nameserver 8.8.4.4
nameserver 8.8.8.8

My nsswitch.conf has the line:

hosts: files dns

and my /etc/hosts doesn’t contain anything unusual.

If I try telnet webserver 80, it hangs for several seconds before getting a name resolution. An ltrace output [1] shows that the hang is in a getaddrinfo() call:

getaddrinfo("ifconfig.me", "telnet", { AI_CANONNAME, 0, SOCK_STREAM, 0, 0, NULL, '00', NULL }, 0x7fffb4ffc160) = 0 <5.020621>

However, tcpdump reveals that the nameserver replied immediately, and it was only on the second reply that telnet unblocked. The replies look identical:

05:52:58.609731 IP 192.168.1.75.43017 > 8.8.4.4.53: 54755+ A? ifconfig.me. (29)
05:52:58.609786 IP 192.168.1.75.43017 > 8.8.4.4.53: 26090+ AAAA? ifconfig.me. (29)
05:52:58.612188 IP 8.8.4.4.53 > 192.168.1.75.43017: 54755 4/0/0 A 219.94.235.40, A 133.242.129.236, A 49.212.149.105, A 49.212.202.172 (93)

[...five second pause...]

05:53:03.613811 IP 192.168.1.75.43017 > 8.8.4.4.53: 54755+ A? ifconfig.me. (29)
05:53:03.616424 IP 8.8.4.4.53 > 192.168.1.75.43017: 54755 4/0/0 A 219.94.235.40, A 133.242.129.236, A 49.212.149.105, A 49.212.202.172 (93)
05:53:03.616547 IP 192.168.1.75.43017 > 8.8.4.4.53: 26090+ AAAA? ifconfig.me. (29)
05:53:03.618907 IP 8.8.4.4.53 > 192.168.1.75.43017: 26090 0/1/0 (76)

I’ve checked host firewall logs and nothing on port 53 is being blocked.

What is causing the first DNS reply to be ignored?

[1] I’ve added a couple of lines to my ltrace.conf so I can see inside the addrinfo struct.

Asked By: Flup


The first DNS reply isn’t ignored. getaddrinfo() didn’t return until it received a response to the AAAA query (ID: 26090). So the real question is why your machine didn’t immediately receive the response to the AAAA query, even though it did receive the response to the A query (ID: 54755).
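To make the pairing of queries and replies explicit, here is a minimal Python sketch that matches the tcpdump lines from the question by DNS transaction ID and reports how long each query waited for its first reply. The function name and the regular expressions are my own and only handle the line shapes shown in that capture; treat it as an illustration, not a general tcpdump parser.

```python
import re
from datetime import datetime

# The seven packets from the question's capture, copied verbatim.
CAPTURE = """\
05:52:58.609731 IP 192.168.1.75.43017 > 8.8.4.4.53: 54755+ A? ifconfig.me. (29)
05:52:58.609786 IP 192.168.1.75.43017 > 8.8.4.4.53: 26090+ AAAA? ifconfig.me. (29)
05:52:58.612188 IP 8.8.4.4.53 > 192.168.1.75.43017: 54755 4/0/0 A 219.94.235.40, A 133.242.129.236, A 49.212.149.105, A 49.212.202.172 (93)
05:53:03.613811 IP 192.168.1.75.43017 > 8.8.4.4.53: 54755+ A? ifconfig.me. (29)
05:53:03.616424 IP 8.8.4.4.53 > 192.168.1.75.43017: 54755 4/0/0 A 219.94.235.40, A 133.242.129.236, A 49.212.149.105, A 49.212.202.172 (93)
05:53:03.616547 IP 192.168.1.75.43017 > 8.8.4.4.53: 26090+ AAAA? ifconfig.me. (29)
05:53:03.618907 IP 8.8.4.4.53 > 192.168.1.75.43017: 26090 0/1/0 (76)
"""

# Queries look like "54755+ A? name."; replies look like "54755 4/0/0 ...".
QUERY = re.compile(r"^(\S+) IP \S+ > \S+: (\d+)\+ (\w+)\? ")
REPLY = re.compile(r"^(\S+) IP \S+ > \S+: (\d+) \d+/\d+/\d+")


def parse_time(stamp):
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def first_reply_delays(capture):
    """Map transaction ID -> (query type, seconds from first query to first reply)."""
    first_query = {}
    delays = {}
    for line in capture.splitlines():
        query = QUERY.match(line)
        if query:
            stamp, qid, qtype = query.groups()
            # Keep only the first time each ID was sent; retries reuse the ID.
            first_query.setdefault(qid, (parse_time(stamp), qtype))
            continue
        reply = REPLY.match(line)
        if reply:
            stamp, qid = reply.groups()
            if qid in first_query and qid not in delays:
                sent, qtype = first_query[qid]
                delays[qid] = (qtype, (parse_time(stamp) - sent).total_seconds())
    return delays


for qid, (qtype, delay) in first_reply_delays(CAPTURE).items():
    print(f"{qid} {qtype}: {delay:.3f}s")
# 54755 A: 0.002s
# 26090 AAAA: 5.009s
```

The A query is answered within a few milliseconds, while the AAAA query only sees a reply after the retry five seconds later, which lines up with the 5.02 seconds that ltrace reported for getaddrinfo().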

One of the differences between getaddrinfo() and gethostbyname() is that the former supports both IPv4 and IPv6, while the latter only supports IPv4. So when you call getaddrinfo() with ai_family set to 0 (AF_UNSPEC), it won’t return until it gets a response (or hits a timeout) for both A and AAAA queries for the domain name provided. gethostbyname() only queries for an A record.
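A quick way to check this on an affected machine is to time the same lookup with and without an address family. The sketch below uses Python's socket.getaddrinfo(), which wraps the C library call; the helper name timed_lookup is hypothetical. With socket.AF_INET the resolver should only need the A record, so on a host showing this problem I would expect that variant to return quickly while the socket.AF_UNSPEC variant stalls.

```python
import socket
import time


def timed_lookup(host, port, family):
    """Return (elapsed_seconds, addresses) for a single getaddrinfo() call."""
    start = time.monotonic()
    results = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = [sockaddr[0] for _fam, _type, _proto, _canon, sockaddr in results]
    return elapsed, addresses
```

For example, compare timed_lookup("ifconfig.me", 80, socket.AF_UNSPEC) with timed_lookup("ifconfig.me", 80, socket.AF_INET). The first corresponds to the ai_family of 0 seen in the ltrace output; the second behaves more like gethostbyname() in that it asks for IPv4 addresses only.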

It’s hard to determine remotely what may be causing your problem, especially since you’ve cut out some of the tcpdump output. Something might be selectively filtering or dropping the DNS traffic between your VM and the Google Public DNS resolvers. I tried to reproduce your problem using a KVM Debian Wheezy VM, but telnet ifconfig.me almost immediately printed the Trying <IP_address_here>... line (meaning it had already resolved the name by then).

Answered By: Kempniu

This was caused by an overly restrictive ruleset on a Juniper firewall that sits in front of the VMware infrastructure.

I built a test resolver so that I could see both sides of the conversation, and the missing packet identified by Kempniu in his excellent answer was indeed being dropped somewhere along the way. As noted in that answer, getaddrinfo() with no address family specified will wait for answers relating to all supported families before returning (or, in my case, timing out).

My colleague who runs the network noted that

The default behavior on the Juniper firewall is to close a DNS-related
session as soon as a DNS reply matching that session is received.

So the firewall was seeing the response to the A query, noting that it answered the VM’s query, and closing the inbound path for that port. The response to the AAAA query that followed was therefore dropped. I’ve no idea why both packets made it through the second time, but disabling this feature on the firewall fixed the problem.

This is a related extract from the Juniper KB:

Here’s a scenario where DNS Reply packets are dropped:

  1. A session for DNS traffic is created when the first DNS query packet hits the firewall and there is a permitting policy configured.
    The default timeout is 60 sec.
  2. Immediately before the session is closed, a new DNS query is transmitted, and since it matches an existing session (since source
    and destination port/IP pair is always the same), it is forwarded by
    the firewall. Note that the session timeout is not refreshed
    according to any newly arriving packet.
  3. The created DNS session is aged out when the first DNS query response (reply) hits the device, regardless how much the timeout
    remains.
  4. When a DNS reply is passed through the firewall, the session is aged out.
  5. All subsequent DNS replies are dropped by the firewall, since no session exists.
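The behaviour described above can be sketched as a toy model in Python. This is not Juniper's implementation: the class, the event lists and the single session key are all assumptions made for illustration. The first event list mirrors the order of packets in the initial attempt from my capture (both queries sent before any reply); the second mirrors the retry, where the AAAA query was only sent after the A reply had arrived.

```python
class ToyFirewall:
    """Toy stateful filter that can close a DNS session on the first reply."""

    def __init__(self, close_on_first_reply=True):
        self.close_on_first_reply = close_on_first_reply
        self.sessions = set()

    def outbound_query(self, client, server):
        # A query opens (or reuses) the session for this address/port pair.
        self.sessions.add((client, server))

    def inbound_reply(self, client, server):
        """Return True if the reply is forwarded, False if it is dropped."""
        key = (client, server)
        if key not in self.sessions:
            return False
        if self.close_on_first_reply:
            self.sessions.discard(key)
        return True


def replay(events, firewall):
    """Replay ('query' | 'reply', label) events and return the dropped reply labels."""
    client, server = ("192.168.1.75", 43017), ("8.8.4.4", 53)
    dropped = []
    for kind, label in events:
        if kind == "query":
            firewall.outbound_query(client, server)
        elif not firewall.inbound_reply(client, server):
            dropped.append(label)
    return dropped


# Packet order of the first attempt: both queries leave back to back.
FIRST_ATTEMPT = [("query", "A"), ("query", "AAAA"), ("reply", "A"), ("reply", "AAAA")]
# Packet order of the retry: the AAAA query follows the A reply.
RETRY = [("query", "A"), ("reply", "A"), ("query", "AAAA"), ("reply", "AAAA")]

print(replay(FIRST_ATTEMPT, ToyFirewall()))                            # ['AAAA']
print(replay(RETRY, ToyFirewall()))                                    # []
print(replay(FIRST_ATTEMPT, ToyFirewall(close_on_first_reply=False)))  # []
```

In this model the retry succeeds simply because the second query opens a fresh session after the first one was closed. That is consistent with the packet order in the capture, but I have not confirmed that it is what the real firewall was doing.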

If you’re thinking of upvoting this answer, please also upvote Kempniu’s answer. Without it I’d still be thrashing around trying to find some configuration problem on the VM.

Answered By: Flup