At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we’re going to prevent it from happening again.
Before explaining exactly what happened, it’s useful to have a rough overview of what happens to an event when it’s sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel’s side, however, each HTTP request hits a few different servers in succession:
- First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.
- Our API servers do basic event validation and optionally add some useful information to the event, such as time if it’s missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.
- For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. Writing to the queue is the final step of the initial track request. Normally, the queue depths hover around 0, meaning that you’ll see your data in real time.
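Both hops described above (balancer to API server, API server to queue server) follow the same pattern: keep a list of servers believed to be running and pick one per event. Here is a minimal sketch of that pattern in Python. The names `ServerPool`, `mark_down`, `mark_up`, and `pick` are hypothetical, and random selection is an assumption; this post does not describe how a server is actually chosen.

```python
import random


class ServerPool:
    """Tracks which servers are believed to be up and picks one per event."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()

    def mark_down(self, server):
        # Called when a failure is detected for this server.
        self.down.add(server)

    def mark_up(self, server):
        # Called when the server is known to be healthy again.
        self.down.discard(server)

    def pick(self):
        # Assumption: uniform random choice among servers not marked down.
        live = [s for s in self.servers if s not in self.down]
        if not live:
            raise RuntimeError("no live servers")
        return random.choice(live)
```

The pool is only as good as whatever calls `mark_down`: a server that fails in a way nobody detects stays in the live list and keeps receiving events.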
Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:
    A fatal error has been detected by the Java Runtime Environment:
    Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312
    guarantee(PageArmed == 0) failed: invariant
    JRE version: 6.0_26-b03
    Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)
    An error report file with more information is saved as:
    /tmp/hs_err_pid13379.log
    If you would like to submit a bug report, please visit:
    http://java.sun.com/webapps/bugreport/crash.jsp
Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP level errors, the kind you would see if the machine were down or kestrel weren’t running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending its time doing the Java equivalent of a core dump.
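To illustrate why TCP/IP level errors are not enough, here is a small Python sketch, not our production code. It stands in for the hung process with a socket that listens but never accepts or replies. The function names and the `ping` probe are hypothetical (it is not a kestrel command); the point is only that the operating system completes the TCP handshake on behalf of a process that is no longer doing any work.

```python
import socket


def tcp_connect_ok(host, port, timeout=1.0):
    """Health check that only looks for TCP-level errors."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reply_within(host, port, timeout=0.5):
    """Health check that also requires the server to answer in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"ping\r\n")  # hypothetical probe, not a real command
            return len(conn.recv(1024)) > 0
    except OSError:  # socket.timeout is a subclass of OSError
        return False


# Stand-in for the hung process: it listens, but never accepts or replies.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))
hung.listen(5)
host, port = hung.getsockname()

connect_only = tcp_connect_ok(host, port)  # kernel completes the handshake
with_reply = reply_within(host, port)      # no reply ever arrives
hung.close()

print("connect-only check passed:", connect_only)
print("reply-required check passed:", with_reply)
```

On a typical Linux machine the connect-only check should pass while the reply-required check fails, which matches what we saw: the server looked alive to anything that only watched for connection errors.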
Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.
What this means for you
Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.
How we’re going to fix the problem
We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it’s a networking problem or a JVM crash. It’s clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.
With that in mind, we’re rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
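As a rough sketch of the idea, not the actual client, the following Python puts one overall deadline on connect, send, and receive, and treats a missed deadline exactly like a connection error. It assumes kestrel's memcache-style text protocol (`set <queue> <flags> <expiration> <bytes>`, answered with `STORED`). The function names, the 250 ms default, and the try-servers-in-order policy are all assumptions made for illustration.

```python
import socket
import time


def enqueue(host, port, queue, payload, timeout=0.25):
    """Return True only if the server answers STORED before the deadline."""
    deadline = time.monotonic() + timeout

    def remaining():
        # Time left in the overall budget; raise once it is used up.
        left = deadline - time.monotonic()
        if left <= 0:
            raise socket.timeout("enqueue deadline exceeded")
        return left

    try:
        with socket.create_connection((host, port), timeout=remaining()) as conn:
            header = "set {} 0 0 {}\r\n".format(queue, len(payload)).encode()
            conn.settimeout(remaining())
            conn.sendall(header + payload + b"\r\n")
            reply = b""
            while not reply.endswith(b"\r\n"):
                conn.settimeout(remaining())
                chunk = conn.recv(64)
                if not chunk:
                    return False  # server closed the connection early
                reply += chunk
            return reply == b"STORED\r\n"
    except OSError:  # covers refused connections and timeouts alike
        return False


def enqueue_with_failover(servers, down, queue, payload, timeout=0.25):
    """Try live servers in order; mark any that fail or stall as down."""
    for server in servers:
        if server in down:
            continue
        if enqueue(server[0], server[1], queue, payload, timeout):
            return server
        down.add(server)
    return None
```

With this shape, a queue server that accepts connections but never answers costs each request at most one timeout before it is marked down, instead of tying up API server connections indefinitely.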
Additionally, it’s clear that our testing does not currently cover all failure scenarios, including a process that keeps accepting connections but never responds. We’re working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.
If you have any questions about the downtime, please email firstname.lastname@example.org