Getting Rid of DNS Timeouts Causing Speed Issues with Outbound Connections

In our continuing efforts to drive faster site performance, we’ve uncovered and already fixed an intermittent issue affecting code that creates outbound connections.

Summary:

  1. Some customers may have noticed that our caching DNS resolution was timing out, which caused outgoing API requests to 3rd-party services either to take a long time to complete or to fail outright.
  2. Slow DNS resolution caused slow page loads. When the total time of a request exceeded 60 seconds, the platform’s self-healing mechanism forcibly canceled the connection, resulting in a 500 error.
  3. The fix has already been rolled out on all servers.
  4. Not all servers were affected. However, it’s difficult to quantify how many were, because the problem was intermittent (i.e., it appeared and disappeared over time, depending on the frequency of outbound requests).

See below for an example graph of page-load time for one of our customers before and after this fix. The green area is “time taken for outbound API requests”—that’s the metric this fix addressed. It’s really obvious that the fix was applied at 01:38!

[Graph: Web Transactions Response Time]

For the Curious: Here are the Technical Details

In our investigation into platform performance with a few dozen customer sites, we kept finding slow external API calls to places like api.wordpress.org (where WordPress goes to check whether there are upgrades for core, plugins, and themes) and many other 3rd-party services corresponding to popular plugins and themes.

But the thing is, we know that api.wordpress.org is not slow! It doesn’t make sense that it would be slow to access from our platform. Furthermore, we would occasionally see something like this:

[Trace screenshot: Trace Time]

Do all of these services, two of which are run by different vendors, take exactly 5 seconds each to complete? No way. Something else was going on.
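As an aside, a quick way to confirm a suspicion like this is to time the DNS lookup separately from the rest of the request. Here’s a minimal sketch of that kind of check using Python’s standard library; the hostnames are just examples, and it’s illustrative rather than the tooling we actually used:

```python
# Time DNS resolution on its own, separate from connecting or sending a request.
# A healthy lookup takes a few milliseconds; a stuck resolver shows up as a
# delay of almost exactly the resolver timeout (5 seconds by default on Linux).
import socket
import time

HOSTS = ["api.wordpress.org", "example.com"]  # illustrative hostnames

for host in HOSTS:
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 443)  # DNS resolution happens here
        status = "resolved"
    except socket.gaierror as exc:
        status = f"failed ({exc})"
    elapsed = time.monotonic() - start
    print(f"{host}: {status} in {elapsed:.3f}s")
```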

The problem turned out to be a timeout when querying our external DNS provider, which was Google DNS.

The reason for the timeout is that Google has policies designed to protect its service from DNS-based abuse, in particular DRDoS attacks on port 53, which have been responsible for several of the largest DDoS attacks ever carried out on the internet. Google states that it has policies designed to prevent abuse, but isn’t specific about what those policies are or about whether a particular user of Google DNS has tripped the abuse system.

Regardless, we were tripping the system. Some servers were tripping it almost all the time, due (we assume) to a large volume of requests, while others tripped it only occasionally, depending on volume and on whatever Google’s abuse algorithms take into account.

It is completely reasonable that Google has such counter-measures in their system. Thus, we needed to stop using Google DNS as our primary outbound nameserver system.

The solution is to use data-center-specific internal DNS systems which have no such limits.

Why didn’t we use that system to begin with?

We did, at first. What we found was that when the primary internal server failed, sometimes the secondary would fail too, leaving us with no outgoing DNS and thus causing failures for our customers.

So we switched to Google DNS and haven’t had connectivity problems since.

Now we’re putting monitoring in place for various things surrounding DNS, in particular a once-per-minute check that we can resolve external domains in under 100 ms, rather than the more than 1 second (and sometimes more than 5 seconds) we were seeing with this problem.
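The sketch below shows the general shape of such a check, again using Python’s standard library. The host list, threshold, and alert hook are placeholders; in practice this lives inside our monitoring system rather than a standalone script:

```python
# Once-per-minute check that external hostnames resolve quickly.
import socket
import time

CHECK_HOSTS = ["api.wordpress.org"]  # domains we expect to resolve quickly
THRESHOLD_S = 0.100                  # alert if resolution takes longer than 100 ms

def resolution_time(host: str) -> float:
    """Return how long a single DNS lookup takes, in seconds."""
    start = time.monotonic()
    socket.getaddrinfo(host, 80)
    return time.monotonic() - start

def alert(message: str) -> None:
    # Placeholder: a real check would page on-call or hit a monitoring API.
    print(f"ALERT: {message}")

while True:
    for host in CHECK_HOSTS:
        try:
            elapsed = resolution_time(host)
            if elapsed > THRESHOLD_S:
                alert(f"DNS lookup for {host} took {elapsed * 1000:.0f} ms")
        except socket.gaierror:
            alert(f"DNS lookup for {host} failed outright")
    time.sleep(60)  # run the check once per minute
```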

Because of our previous experience with internal DNS, the fix we’re rolling out is a combination platter. We use internal DNS for the primary (and tertiary) servers, so that in the 99.9% case DNS is unrestricted and as fast as possible. We then use Google DNS for the secondary and quaternary servers, so that we automatically fail over to a service with even higher availability. And because we’re no longer hitting Google DNS constantly, it’s far less likely to trip the abuse limits and time out, even in that fail-over case.

Furthermore, we’ve reduced the DNS timeout from 5 seconds (the Linux default) to 1 second. This means we can try all four of the configured DNS servers in less time than the old configuration took to time out on a single server.
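To make that concrete, here’s a sketch of the ordering and timeouts using the third-party dnspython library. The internal resolver addresses are placeholders, and this only mimics the behaviour of the system resolver configuration described above; it isn’t how our servers are actually wired up:

```python
# Mimic the resolver ordering and 1-second per-server timeout described above.
# The 10.x addresses stand in for the data-center-internal resolvers;
# 8.8.8.8 and 8.8.4.4 are Google Public DNS.
import dns.resolver  # third-party package: dnspython

resolver = dns.resolver.Resolver(configure=False)  # don't read /etc/resolv.conf
resolver.nameservers = [
    "10.0.0.2",  # internal primary (placeholder address)
    "8.8.8.8",   # Google secondary
    "10.0.0.3",  # internal tertiary (placeholder address)
    "8.8.4.4",   # Google quaternary
]
resolver.timeout = 1.0   # wait at most 1 second per server, not the 5-second default
resolver.lifetime = 4.0  # all four servers can be tried in under the old 5-second single-server timeout

answer = resolver.resolve("api.wordpress.org", "A")
print([rr.address for rr in answer])
```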

Our new configuration, coupled with monitoring that proactively alerts us to any DNS failure or timeout within 60 seconds of the problem arising, further improves the performance of the DNS component of our platform.