API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, the length of these queues hovers around 0, meaning that you'll see your data in real time.
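The "maintain a list of running servers and pick one per event" behavior in steps 1 and 2 can be sketched roughly as follows. This is an illustration only, not Mixpanel's actual balancer or API server code: the `ServerPool` class, its method names, the random selection policy, and the `retry_after` interval are all assumptions made for the example.

```python
import random
import time


class ServerPool:
    """Tracks which backend servers are believed to be up and picks one per event.

    Hypothetical helper for illustration; names and policy are assumptions.
    """

    def __init__(self, servers, retry_after=30.0):
        self.servers = list(servers)
        self.retry_after = retry_after  # seconds before a down server is retried
        self.down_since = {}            # server -> monotonic time it was marked down

    def mark_down(self, server):
        self.down_since[server] = time.monotonic()

    def mark_up(self, server):
        self.down_since.pop(server, None)

    def live_servers(self):
        # A server is eligible if it was never marked down, or if it has been
        # down long enough that it is worth trying again.
        now = time.monotonic()
        return [
            s for s in self.servers
            if s not in self.down_since
            or now - self.down_since[s] >= self.retry_after
        ]

    def pick(self):
        live = self.live_servers()
        if not live:
            raise RuntimeError("no live servers")
        return random.choice(live)
```

A pool like this is only as good as whatever calls `mark_down`: if the failure detector never fires, a broken server stays in the rotation.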

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP level errors, such as those that would occur if the machine were down or kestrel weren't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending time doing the Java equivalent of a core dump.
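To see why TCP/IP level errors are not enough, here is a small self-contained sketch. It is an analogy, not a reproduction of the kestrel crash: a listening socket whose owner never calls `accept()` stands in for the wedged process. The operating system still completes the TCP handshake and buffers the write, so the client sees no error at all until it waits for a reply.

```python
import socket

# A listening socket whose owner never accepts or reads stands in for a
# process that is alive according to the OS but no longer doing any work.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
hung.listen(5)
host, port = hung.getsockname()

# Connecting succeeds: the kernel completes the handshake on its own.
client = socket.create_connection((host, port), timeout=1.0)
# Sending succeeds too: the small payload is simply buffered.
client.sendall(b"ping\r\n")

try:
    client.recv(1024)  # without a timeout, this call would block indefinitely
    outcome = "got a reply"
except socket.timeout:
    outcome = "timed out"
finally:
    client.close()
    hung.close()

print(outcome)
```

The only signal that anything is wrong is the timeout on the read, which is why a client without timeouts just accumulates stuck connections.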

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
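As a rough illustration of the "fail fast, then mark the server down" idea, here is a minimal sketch. It is not our kestrel client: the wire format (a payload followed by a one-line acknowledgement), the function names, the `down_servers` set, and the 0.5 second default are all hypothetical choices made for the example.

```python
import socket

down_servers = set()  # hypothetical in-memory record of servers marked down


def enqueue(server, payload, timeout=0.5):
    """Send payload to a (host, port) server and wait for an acknowledgement.

    Returns True on success. On any socket error or timeout, marks the
    server down and returns False so the caller can try another server.
    """
    try:
        # The timeout given here applies to the connect and to every
        # later send and recv on the returned socket.
        with socket.create_connection(server, timeout=timeout) as conn:
            conn.sendall(payload + b"\r\n")
            ack = conn.recv(64)  # hypothetical one-line acknowledgement
            if not ack:
                raise ConnectionError("connection closed before acknowledgement")
            return True
    except OSError:  # covers timeouts, refused connections, and resets
        down_servers.add(server)
        return False


def enqueue_with_failover(servers, payload, timeout=0.5):
    """Try each server not already marked down until one accepts the event."""
    for server in servers:
        if server in down_servers:
            continue
        if enqueue(server, payload, timeout):
            return server
    return None  # every server failed or was already marked down
```

Note that the timeout in this sketch bounds each individual socket call rather than the whole enqueue, so a production client would likely also want an overall deadline and a way to bring recovered servers back into rotation.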

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
