Post-mortem: API Downtime on July 31st, 2012
Mixpanel is a company that prides itself on being able to provide
extraordinarily advanced analytics in real time. Paramount to this goal is
being able to accurately and reliably send data to our api.mixpanel.com
servers, something that we failed at from 8:00 AM to 12:20 PM PDT this
morning. For this we are extremely sorry, and you as our customers deserve
better. We realize that you are trusting us by placing our JS on your websites
and that your businesses are affected when we go down. Here is what happened,
and what we are doing to make sure this sort of failure never happens again.
When you send a track request to api.mixpanel.com, round-robin DNS routes
the request to one of our two load balancer machines running nginx that act as
reverse proxies to the actual API processing servers. A sudden, substantial
increase in API requests resulted in both these load balancer machines
simultaneously running out of memory, causing them to begin swapping to disk.
This drastically increased the amount of time it took to service each request,
and since we also use these machines to serve up the static mixpanel.js
The problem was quickly identified, but solving it turned out to be more
difficult. We could not simply update DNS to point api.mixpanel.com at a
different set of servers, since we had previously set the TTL on those DNS
records to an entire day. However, we could reassign the portable IPs that
DNS was pointing at to a different machine. Unfortunately, the only machine we
had available was behind a different router from the one handling the
portable IP addresses assigned to the load balancer machines. We then
determined that the next best thing to do would just be to order several new
load balancer machines and temporarily lower the client header buffer size on
nginx requests to reduce the amount of memory nginx was using.
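To see why a smaller header buffer frees up memory, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the request count and buffer sizes are made-up numbers, not figures from the incident, and it assumes one header buffer is held per in-flight request, which simplifies how nginx really allocates memory.

```python
def header_buffer_memory_mib(concurrent_requests: int, buffer_size_kib: int) -> float:
    """Approximate memory held by per-request header buffers, in MiB."""
    return concurrent_requests * buffer_size_kib / 1024


# Hypothetical load: 200,000 in-flight requests (not a number from the incident).
before = header_buffer_memory_mib(200_000, 32)  # assumed original buffer size
after = header_buffer_memory_mib(200_000, 4)    # assumed lowered buffer size

print(f"before: {before:.2f} MiB, after: {after:.2f} MiB")
```

Under these assumed numbers the buffers shrink from roughly 6 GiB to under 1 GiB, which is the kind of relief a temporary reduction is meant to buy while new machines are provisioned.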
What this means for you
The vast majority of track requests sent during this time were lost, and once
again we sincerely apologize. In an hourly report on July 31st, you will see a
drop-off in events for the hours spanning 8AM – 11AM PDT. Furthermore, 12PM
PDT’s event counts should be about half of what they normally would be. Your
daily data for July 31st will be slightly depressed as well.
How we are preventing this from happening in the future
Here at Mixpanel, we spend a tremendous amount of engineering effort making
sure that our data collection infrastructure can withstand the thousands of
events per second our customers send to us. While our custom data store
handled the sudden event rate increase without a single hiccup, it is somewhat
ironic that a simple load balancer failed simply because it ran out of memory.
And frankly, also quite embarrassing. Here’s how we’re going to avoid getting
caught with our pants down again.
1. Far more proactive monitoring
We use the Munin monitoring tool to keep tabs on server status. Munin provides
an incredible variety of plugins to monitor everything from simple CPU usage
to MongoDB write lock percentage. It also provides warning and critical
thresholds for any numeric values that these plugins report. As we’ve grown
our server count to over 200, properly setting these thresholds has fallen by
the wayside. We’ve gone through each and every one to make sure these
thresholds exist and make sense. Furthermore, we are adding an email
notification every time a value crosses a warning or critical threshold.
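The threshold logic described here can be sketched in a few lines. This is not Munin's code or our actual alerting setup: the `min:max` range syntax is only loosely modeled on the style Munin uses, and the metric names, threshold values, and `build_alerts` helper are all hypothetical. Sending the email is left out; the sketch only builds the alert lines a notification would contain.

```python
from typing import Dict, List, Optional, Tuple


def parse_range(spec: str) -> Tuple[Optional[float], Optional[float]]:
    """Parse a 'min:max' range; either side may be empty.

    A bare number with no colon is treated as an upper bound.
    """
    if ":" in spec:
        low, high = spec.split(":", 1)
    else:
        low, high = "", spec
    return (float(low) if low else None, float(high) if high else None)


def outside(value: float, spec: str) -> bool:
    """True if value falls outside the allowed range."""
    low, high = parse_range(spec)
    return (low is not None and value < low) or (high is not None and value > high)


def classify(value: float, warning: str, critical: str) -> str:
    """Check the critical range first so it takes precedence over warning."""
    if outside(value, critical):
        return "critical"
    if outside(value, warning):
        return "warning"
    return "ok"


def build_alerts(readings: Dict[str, float],
                 thresholds: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return one alert line per reading that is not 'ok'."""
    alerts = []
    for name, value in sorted(readings.items()):
        warning, critical = thresholds[name]
        level = classify(value, warning, critical)
        if level != "ok":
            alerts.append(f"{level.upper()}: {name} = {value}")
    return alerts


# Hypothetical readings and thresholds (memory usage in percent).
readings = {"lb1.memory_percent": 93.0, "lb2.memory_percent": 71.0}
thresholds = {
    "lb1.memory_percent": (":80", ":90"),
    "lb2.memory_percent": (":80", ":90"),
}
alerts = build_alerts(readings, thresholds)
print(alerts)
```

A metric with no threshold entry raises a `KeyError` here on purpose: a missing threshold is exactly the gap that let this outage go unnoticed, so it is better surfaced loudly than skipped.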
2. Long term capacity planning
It is not enough to simply add more memory to the machine and call it a day.
While we have increased the memory on the load balancing machines by 8x,
effectively removing it as a bottleneck in the future, we also want to resolve
the underlying issue. We’ve gone through our infrastructure to identify what
the next bottlenecks would be and made sure we have specific plans on how we
are going to upgrade them in the future. Beyond monitoring, we will
know exactly when and how to mitigate each bottleneck.
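One simple way to put a date on a bottleneck is to project current usage forward. The sketch below assumes linear growth, which is a simplification (a sudden spike like the one on July 31st would not fit it), and the figures are hypothetical rather than taken from our servers.

```python
from typing import Optional


def days_until_exhausted(current: float, capacity: float,
                         growth_per_day: float) -> Optional[float]:
    """Days until usage reaches capacity under linear growth.

    Returns None when usage is flat or shrinking, and 0.0 when the
    resource is already at or over capacity.
    """
    if growth_per_day <= 0:
        return None
    return max(0.0, (capacity - current) / growth_per_day)


# Hypothetical machine: 6 GB used of 8 GB, growing by 0.25 GB per day.
remaining = days_until_exhausted(current=6, capacity=8, growth_per_day=0.25)
print(remaining)
```

A projection like this is only a planning aid; it pairs with the thresholds above, which catch the cases where growth is not gradual.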
3. Using a CDN for delivery of mixpanel.js
not affect your page load times, there have still been several reports of a
working to get to the bottom of this issue. However, there is still absolutely
no reason why we cannot get virtually 100% uptime for a single snippet of JS.
We will be moving the JS snippet off the API servers and onto a proper content
If you have any other questions about the downtime, please feel free to email