API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, the number of items waiting in these queues hovers around 0, meaning that you'll see your data in real time.
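The "maintain a list of running servers and pick one per event" pattern used by both the balancers and the API servers can be sketched roughly as follows. This is a minimal illustration in Python, not Mixpanel's actual code: the `ServerPool` class, its method names, and the uniform random selection are all assumptions made for the example.

```python
import random


class ServerPool:
    """Hypothetical pool: tracks which servers are believed up, picks one per event."""

    def __init__(self, servers, rng=None):
        self._live = list(servers)
        self._down = []
        self._rng = rng or random.Random()

    def pick(self):
        # Choose uniformly among servers currently marked live
        # (the real selection policy is not described in the post).
        if not self._live:
            raise RuntimeError("no live servers")
        return self._rng.choice(self._live)

    def mark_down(self, server):
        # Take a failed server out of rotation so later events avoid it.
        if server in self._live:
            self._live.remove(server)
            self._down.append(server)

    def mark_up(self, server):
        # Return a recovered server to rotation.
        if server in self._down:
            self._down.remove(server)
            self._live.append(server)

    @property
    def live(self):
        return tuple(self._live)
```

The important property is that `pick()` only ever returns servers that have not been marked down, so everything depends on `mark_down()` actually being called when a server stops working.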

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

# A fatal error has been detected by the Java Runtime Environment:  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP level errors, such as those that would occur if the machine were down or kestrel weren't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending its time doing the Java equivalent of a core dump.
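To make the failure mode concrete, here is a small Python sketch of the difference between a server that is down and one that is hung. It is an illustration only: the `probe` function, the `ping` line it sends, and the `stalled_listener` helper are hypothetical and are not part of kestrel or of Mixpanel's tooling. A socket that is listening but never answers stands in for the wedged JVM.

```python
import socket


def probe(host, port, timeout=0.5):
    """Classify a server as 'down', 'hung', or 'ok' (hypothetical health check)."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused or unreachable: the TCP-level error that
        # connection-based failover already notices.
        return "down"
    with conn:
        try:
            conn.sendall(b"ping\r\n")  # made-up request line for the example
            reply = conn.recv(64)
        except socket.timeout:
            # The connection was accepted, but nothing ever answered.
            return "hung"
        except OSError:
            return "down"
    return "ok" if reply else "down"


def stalled_listener():
    """Listen on a free local port but never accept or reply."""
    sock = socket.socket()
    sock.bind(("127.0.0.1", 0))
    sock.listen(1)
    return sock
```

Connecting to the stalled listener succeeds because the operating system completes the TCP handshake on the process's behalf, so a check that only looks for connection errors sees nothing wrong. Only the read timeout reveals the problem.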

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
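A fail-fast enqueue along these lines might look like the Python sketch below. It is not the rewritten client itself. The function names and the 0.25 second default are assumptions, and the request format assumes the memcache-style text `set` command that kestrel accepts.

```python
import socket


def enqueue(server, queue, payload, timeout=0.25):
    """Write one item with a memcache-style `set`; raise OSError on failure or stall."""
    host, port = server
    # The timeout applies to the connect and to each send/recv after it.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        header = "set {} 0 0 {}\r\n".format(queue, len(payload)).encode()
        conn.sendall(header + payload + b"\r\n")
        reply = conn.recv(64)
    if not reply.startswith(b"STORED"):
        raise OSError("unexpected reply: {!r}".format(reply))


def enqueue_with_failover(servers, queue, payload, timeout=0.25):
    """Try each server in order; return (server used, servers to mark down)."""
    marked_down = []
    for server in servers:
        try:
            enqueue(server, queue, payload, timeout)
            return server, marked_down
        except OSError:
            # socket.timeout is an OSError, so a silent server is handled
            # exactly like a refused connection.
            marked_down.append(server)
    raise RuntimeError("all queue servers failed")
```

Because a timeout is treated the same way as a connection error, a server that accepts connections but never replies costs each request at most one timeout period before the event is written elsewhere.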

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.

Scaling without losing focus on meaningful metrics - how Twitch gets it right

We're in the cafeteria of Twitch’s San Francisco office. Screens are on every wall, streaming games of League of Legends, Counter-Strike, and DotA 2. Dance music emanates from unseen speakers. A friendly golden retriever is wandering our way, looking for some attention. None of this fazes Drew Harry. He's focused - excited even - about documentation. "I know it's not everyone's favorite thing," he says. "But it's important." It is. Particularly at a company growing at the rate that Twitch is. Last year they more than doubled their monthly viewers, from 45 million to 100 million, and added a handful of new ways to watch, including PS4, Xbox One, and Chromecast apps. Educating everyone internally about how to find meaning in the midst of massive data and user growth has been a major challenge. "We've brought on a dozen new product managers. If they show up and just see 150 even...

Community Tip: Revisiting AARRR!

This Community Tip will revisit analyzing user behavior according to the AARRR analytics framework. By using Mixpanel to measure and iterate on the customer lifecycle, you can measure how business and product decisions independently affect acquisitions, activations, retention, revenue, and referrals. Please note that this Community Tip will focus solely on our Engagement reports; if you’re interested in learning how to apply the same AARRR framework for People notifications, check out our AARRR! for People Community Tip! Making decisions without data leaves us in the dark, guessing as to whether or not we’ve optimized our site or app. When one is faced with data overload, it can be difficult to get through the noise and find the most impactful insights, which means lost time, lost efficiency, and potentially frustration. Dave McClure advocates for lean metrics with his Sta...

Community Tip: Maintaining User Identity

This Community Tip focuses on the nuances of identity management to ensure users are tracked within your Mixpanel implementation. This post describes best practices for single users on multiple devices, multiple users on single devices, and tips for handling user logouts. After you get started with Mixpanel, you will realize that identity management is one of the most important aspects of a Mixpanel integration. To attribute a user’s activity correctly, you must maintain the same unique identifier for a user throughout the course of their interactions with your app or site. Luckily, our SDK contains methods to manage this for both anonymous and authenticated users. Single Users on Multiple Devices: Alias on Signup, Identify on Login Nowadays, many users have multiple devices they can use to access their accounts: phones, laptops, and perhaps a tablet. How do you ensure that a ...

iOS 9 adoption hits 12% after 24 hours

After only 24 hours, and despite the usual delays caused by early adopters rushing to download the latest release from Apple, over 12% of users have upgraded to iOS 9. What's new The updated operating system comes with Apple News, a long overdue reworking of Newsstand, and iCloud Drive, giving you access to any file in iCloud from your mobile device. Both Siri and the Notes app saw the addition of new features, and Apple Maps finally added a public transportation option for directions. Meanwhile, multitasking iPad users get new split-screen and picture-in-picture functionality. And users had to delete fewer selfies to clear up hard drive space for the download. While last year's iOS 8 tipped the scales at 4.58 GB, this year's update was a noticeably slimmer 1.3 GB. Head over to Mixpanel Trends to see the real-time adoption numbers as they come in. ...

How OpenTable improves UX by getting you in, out, and on your way

Creating the shortest path between you and a delightful meal "It’s not about how much time you spend in our app," says Alexa Andrzejewski, eschewing any notion that simply more user engagement is better. Alexa wants the right engagement. "We want to create the shortest path between you and a delightful meal." Exactly what that path is to your meal could be different for a hundred different reasons. At OpenTable, where Alexa is Mobile Experience Director, their top concern is that your trip is short and the meal is perfect for the occasion. Sometimes that path means looking at a single restaurant profile page, sometimes it means looking at seven. "In a high stakes situation - like if you're going on a date or taking someone out for a business meal - you're more likely to book a familiar restaurant." So finding and booking one specific restaurant for next Friday has to be easy....

Community Tip: Getting Started with Mixpanel

This Community Tip provides best practices for getting started with Mixpanel's core tracking elements (user identity, events, and properties) so you can save time and increase quality within your implementation as well as better understand Mixpanel reporting. Distinct Id: Tracking Unique Users Imagine your most recent trip to a restaurant. After taking your seat and browsing the menu, the server comes by and takes your order. However, there's a problem: with many customers in the restaurant at the same time, your server needs a way to identify who you are so that your food order is delivered properly to you. The restaurant likely has some method of distinguishing users based on the table at which you are seated, who your server is, and the time you arrived. Like the customers in a restaurant, Mixpanel also has a way of distinguishing unique users: Distinct Id. When properly...

Community Tip: iOS Implementation Roundup

This Community Tip will walk you through implementing Mixpanel in your iOS app. This walkthrough will cover not just how to insert Mixpanel code into your app, but also how to decide what events and properties to track. Get acquainted with Mixpanel You heard from your developer friends that Mixpanel is awesome, and you want to see it for yourself. But where to begin? A great place is our Live Introductory Webinar. The Webinar is specially aimed toward new Mixpanel users and it broadly covers Mixpanel’s capabilities, how to think about your analytics, and how to get started with your own data tracking. Plan your implementation: It starts and ends with your business goals Mixpanel is a highly customizable data-tracking solution that gives you the power to decide what actions to track and how to track them. With flexibility and power comes great responsibility, and oftentimes it...

Community Tip: All About Time

In this Community Tip we discuss all aspects of time in Mixpanel, including how to set your project timezone, send time properties, and handle timestamps when exporting or importing data. Using these directions, Mixpanel can help standardize how you handle time to ensure your data reflects accurate timestamps. In today’s connected world, a startup can be headquartered in Berlin, Germany, have a core user-base in Beijing, China, and be analyzing their data through an analytics company in San Francisco, California, USA. With all of these different locations in play, seamlessly handling time data being received from around the world can be intimidating. Luckily, with a few of the following best practices, you can ensure your Mixpanel data is time agnostic to the physical location of your users. Setting Your Mixpanel Timezone When you create a new Mixpanel project, your project timez...

Community Tip: Implement Mixpanel in Swift Apps

This Community Tip will describe how to implement the Mixpanel iOS SDK, written in Objective-C, within your Swift app. We will walk through integration using CocoaPods and library set-up, provide code samples, and ultimately save you development time. Requirements: CocoaPods This guide assumes you are using CocoaPods since that is the recommended way to integrate your app with the Mixpanel iOS library. For assistance integrating Mixpanel into Swift projects without CocoaPods, please follow the instructions to build the Mixpanel iOS library from source. Please note the below instructions for integration will only work for Swift apps targeting iOS 8.0 or higher; if you are targeting any versions below iOS 8.0 please jump to the final section of this guide. Integration with CocoaPods With any version of CocoaPods greater than 0.36, integration of Objective-C libraries within...