Post-mortem: DynDNS Outage on October 21, 2016
On Friday, Oct 21, 2016, from 8:30am to 1:30pm PDT, approximately 27% of clients were unable to connect to Mixpanel’s API endpoints, due to a DDoS attack on our primary DNS provider, Dyn. A small number of clients were unable to access api.mixpanel.com until Saturday, Oct 22 at 2:00pm, and decide.mixpanel.com until Monday, Oct 24 at 10:05am.
Our mobile clients buffer data locally and successfully re-submitted it once the DNS outage ended, so no mobile data was permanently lost. For web integrations, some data was likely lost during this period, though our week-over-week data volume was not significantly affected.
We are very sorry for this disruption in our service.
We conducted a post-mortem early last week and have begun work on action items designed to make us more resilient to this type of incident in the future. The remainder of this post provides details of our findings and plans.
At 9:40am Friday, our internal monitoring revealed an anomalous, steep decline in our data ingestion volume. We alerted customers to the problem on status.mixpanel.com at 9:42am. Having seen a tweet from Dyn at 7:00am that morning saying they were investigating an attack, we immediately connected the two issues and began to consider our options for failing over DNS.
At 1:20pm Friday, we modified the NS entry for mixpanel.com to point to a secondary DNS provider, Amazon Route 53, after configuring and testing it internally. Despite a 48-hour TTL on the NS record, the gTLD nameservers authoritative for .com picked up this change almost immediately, and intermediate recursive resolvers followed suit. Widely-used public resolver services, including Google Public DNS and OpenDNS, provide forms allowing service operators to force cache flushes, which we used for each of Mixpanel’s domains. As a result, by 1:30pm, there was no discernible difference in week-over-week event traffic.
Unlike Dyn, Route 53 does not allow us to return only a subset of the IP addresses for a geographic region in each DNS response, so we instead responded with all of the Mixpanel IP addresses for the datacenter closest to the client. This increased the size of our DNS responses from below 300 bytes to above 512 bytes. For most clients, this simply meant that a request that would normally be answered in a single UDP datagram came back truncated, and the client retried the query over TCP to accept the larger response. Some older clients, however, have not implemented the IETF proposed standard for TCP retry, and as a result fail when the DNS response is larger than 512 bytes.
On Saturday at 2:00pm, we reduced the number of IPv4 addresses in the A record for api.mixpanel.com to 29, allowing the response to fit within the 512-byte limit required by legacy clients. On Monday morning, we learned that some customers were using legacy DNS clients that did not support TCP retry and were also querying decide.mixpanel.com, which is a CNAME for api.mixpanel.com. The additional CNAME entry in the DNS response adds 32 bytes, pushing the response back over the 512-byte limit. At 10:00am Monday, we reduced the number of IP addresses returned so that the response fits within the limit even with the additional overhead of the CNAME record, restoring service for those clients.
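The arithmetic above can be sketched with a back-of-the-envelope size estimator. This is a simplification rather than Route 53’s actual wire encoding: it assumes each A record’s owner name is a 2-byte compression pointer and that the CNAME target name is written uncompressed, so the exact byte counts depend on how the server compresses names. Function names here are illustrative.

```python
def wire_name_len(name):
    """Uncompressed wire length of a DNS name: one length byte per
    label plus the label bytes, plus one byte for the root label."""
    return sum(1 + len(label) for label in name.split(".")) + 1

def estimated_response_size(qname, num_a_records, cname_target=None):
    """Rough size in bytes of a DNS response carrying num_a_records
    IPv4 addresses, optionally preceded by a CNAME record."""
    size = 12                          # fixed DNS header
    size += wire_name_len(qname) + 4   # question: QNAME + QTYPE + QCLASS
    if cname_target is not None:
        # CNAME RR: 2-byte owner pointer + TYPE + CLASS + TTL + RDLENGTH,
        # plus the target name (assumed uncompressed in this model)
        size += 2 + 2 + 2 + 4 + 2 + wire_name_len(cname_target)
    # each A RR: 2-byte owner pointer + TYPE + CLASS + TTL + RDLENGTH + 4-byte IPv4
    size += num_a_records * (2 + 2 + 2 + 4 + 2 + 4)
    return size
```

Under these assumptions, 29 addresses for api.mixpanel.com come to 498 bytes, just under the limit, while the same 29 addresses behind the decide.mixpanel.com CNAME come to 531 bytes, just over it, which is consistent with the failure mode described above.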
We are taking the following actions to improve our availability in the event of future DNS outages:
- Thoroughly documenting our traffic engineering best practices, including DNS fail-over procedures, and maintaining hot-standby zones with a secondary DNS provider.
- Adding automated tests for DNS response sizes, ensuring a zone push does not break legacy UDP-only clients.
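One way to implement the response-size test in the second bullet is to query the zone exactly the way a legacy client would — plain UDP with no EDNS0 OPT record — and fail the check if the answer is truncated or exceeds 512 bytes. A minimal sketch using only the Python standard library (the query ID, server address, and function names are illustrative, not our production tooling):

```python
import socket
import struct

def build_query(qname, qtype=1):
    """Build a minimal DNS query packet (QTYPE 1 = A) with no EDNS0
    OPT record, mimicking a legacy 512-byte-limited client."""
    # header: ID, flags (RD set), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.split(".")
    ) + b"\x00" + struct.pack(">HH", qtype, 1)  # root byte, QTYPE, QCLASS=IN
    return header + question

def udp_response_size(server, qname):
    """Send the query over UDP and return (num_bytes, truncated_flag).
    A truncated response means legacy UDP-only clients would fail."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(3)
        s.sendto(build_query(qname), (server, 53))
        data, _ = s.recvfrom(4096)
    truncated = bool(data[2] & 0x02)  # TC bit in the flags field
    return len(data), truncated

# Example gate for a CI check (resolver address is illustrative):
# size, tc = udp_response_size("8.8.8.8", "api.mixpanel.com")
# assert size <= 512 and not tc
```

Running a check like this on every zone push would have caught both the Saturday and Monday regressions before they reached legacy clients.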
If you have any further questions regarding this incident, please don’t hesitate to reach out to Mixpanel Support at email@example.com.