API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP-level failover. Each balancer maintains a list of API servers that are currently running and picks one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, the depth of these queues hovers around 0, meaning that you'll see your data in real time.
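
The "list of running servers, pick one per event" pattern used by both the balancers and the API servers can be sketched roughly as follows. This is only an illustration, not our production code: the `BackendPool` class, its method names, and the random selection policy are all assumptions made for the example.

```python
import random


class BackendPool:
    """Hypothetical pool: tracks which backends are believed up and picks one per event."""

    def __init__(self, backends):
        self.up = set(backends)
        self.down = set()

    def mark_down(self, backend):
        # Stop routing events to a backend that failed a request or a health check.
        self.up.discard(backend)
        self.down.add(backend)

    def mark_up(self, backend):
        # Put a recovered backend back into rotation.
        self.down.discard(backend)
        self.up.add(backend)

    def pick(self):
        # Choose any backend currently believed to be up (random choice is an assumption).
        if not self.up:
            raise RuntimeError("no live backends")
        return random.choice(sorted(self.up))
```

The pool is only as good as whatever calls `mark_down`: a backend that is never reported as failed keeps receiving events.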

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP-level errors, such as those that occur when the machine is down or kestrel isn't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending its time doing the Java equivalent of a core dump.
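
To illustrate why a connection-level check can pass against a stuck process, here is a small sketch. It is not a reproduction of the actual crash: `hung_listener` stands in for the frozen server by opening a listening socket that never calls `accept()`, and `probe` and the `ping` bytes it sends are hypothetical. The operating system still completes the TCP handshake for such a socket, so the client sees a successful connection and only notices a problem if it waits for a reply with a timeout.

```python
import socket


def hung_listener():
    """Stand-in for a stuck server: listens on a free local port but never accepts."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(1)
    return sock


def probe(host, port, timeout=0.2):
    """Return (connected, replied) for one TCP probe of host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused or unreachable: the kind of error connection-level failover can see.
        return (False, False)
    with conn:
        try:
            conn.sendall(b"ping\r\n")  # arbitrary probe bytes, not a real kestrel command
            return (True, bool(conn.recv(64)))
        except OSError:
            # Connected, but no reply arrived within the timeout.
            return (True, False)
```

Calling `probe` against the port of a `hung_listener()` socket is expected to report a successful connection with no reply, which is the case a connect-only health check cannot distinguish from a healthy server.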

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
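
As a rough sketch of what "strict timeouts on every operation" can look like, the function below bounds the connect, the send, and each read with the same timeout, and treats anything other than a confirmed reply as a failure. It assumes kestrel's memcache-style text protocol (`set <queue> 0 0 <bytes>` answered with `STORED`); the `enqueue` name, the 0.5 second default, and the return convention are assumptions for illustration, not our actual client.

```python
import socket


def enqueue(host, port, queue, payload, timeout=0.5):
    """Try to enqueue one event; return True only on a confirmed STORED reply.

    The connect, the send, and each read are bounded by `timeout` seconds,
    so a server that accepts connections but never answers fails fast.
    """
    header = "set {} 0 0 {}\r\n".format(queue, len(payload)).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)  # also applies to sendall and recv below
            conn.sendall(header + payload + b"\r\n")
            reply = b""
            while not reply.endswith(b"\r\n"):
                chunk = conn.recv(64)
                if not chunk:
                    return False  # server closed the connection without answering
                reply += chunk
            return reply == b"STORED\r\n"
    except OSError:
        # Covers refused connections as well as timeouts on any step.
        return False
```

A caller would treat a `False` result as a signal to mark that queue server as down and retry the event against another one.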

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
