API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP-level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. Writing to the queue is the final step of the initial track request. Normally, the number of events waiting in these queues hovers around 0, meaning that you'll see your data in real time.
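To make the "maintains a list of running servers and picks one" step concrete, here is a minimal sketch in Python. It is an illustration only, not our production code: the `ServerPool` class, its method names, and the `queue-1` / `queue-2` host names are all hypothetical, and random choice is just one plausible way to pick a server.

```python
import random


class ServerPool:
    """Tracks which backend servers are believed to be up (hypothetical helper)."""

    def __init__(self, servers):
        self.live = set(servers)
        self.down = set()

    def pick(self):
        # Choose any server currently believed to be up.
        if not self.live:
            raise RuntimeError("no live servers")
        return random.choice(sorted(self.live))

    def mark_down(self, server):
        # Stop routing events to a server that failed a request.
        self.live.discard(server)
        self.down.add(server)

    def mark_up(self, server):
        # Resume routing once the server is healthy again.
        self.down.discard(server)
        self.live.add(server)


# Hypothetical host names, for illustration only.
pool = ServerPool(["queue-1", "queue-2"])
pool.mark_down("queue-1")
chosen = pool.pick()  # only "queue-2" is left to choose from
```

The important design question is what triggers `mark_down`. If the only trigger is a connection-level error, a server that accepts connections but never answers stays in the live set.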

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

# A fatal error has been detected by the Java Runtime Environment:  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP-level errors, such as the ones we would see if the machine were down or kestrel weren't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that was presumably busy doing the Java equivalent of a core dump.
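The sketch below shows one way this kind of hang can look from the client side. It is not a reconstruction of the actual incident: it assumes a stand-in "hung" server that listens on a port but never accepts or answers, and it uses only Python's standard `socket` module. The connect and the write both succeed, so a check that waits for a TCP/IP error never fires; only the read reveals the problem, and only because the client sets a timeout.

```python
import socket

# A stand-in for a hung process: the port is bound and listening, but
# nothing ever calls accept() or reads from the connection.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
hung.listen(1)
addr = hung.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(0.5)        # without this, recv() below would block forever
client.connect(addr)          # succeeds: the kernel completes the handshake
client.sendall(b"event\n")    # succeeds: the bytes sit in kernel buffers

try:
    client.recv(1024)         # no reply ever arrives
    got_reply = True
except socket.timeout:
    got_reply = False
finally:
    client.close()
    hung.close()
```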

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
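As a rough sketch of the idea, and not the actual client we are writing, the Python below puts a timeout on every socket operation and treats a timeout exactly like a refused connection: the server is marked down and the event goes to the next one. The `enqueue` function, the `down` set, the 0.25 second timeout, and the "any reply counts as an acknowledgement" rule are all assumptions made for illustration; a real kestrel client would speak kestrel's own protocol.

```python
import socket
import threading


def enqueue(servers, down, payload, timeout=0.25):
    """Try each server not marked down; return the one that acknowledged."""
    for host, port in servers:
        if (host, port) in down:
            continue
        try:
            # The timeout applies to connect and to every later send/recv.
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(payload)
                if conn.recv(16):      # assumption: any reply is an ack
                    return (host, port)
        except OSError:                 # covers timeouts and refused connections
            pass
        down.add((host, port))         # fail fast: stop sending events here
    raise RuntimeError("all queue servers are down")


# Demo: one stand-in server that never answers, one that acknowledges.
hung = socket.socket()
hung.bind(("127.0.0.1", 0))
hung.listen(1)

healthy = socket.socket()
healthy.bind(("127.0.0.1", 0))
healthy.listen(1)


def serve_once():
    conn, _ = healthy.accept()
    conn.recv(1024)
    conn.sendall(b"ok\n")
    conn.close()


threading.Thread(target=serve_once, daemon=True).start()

down = set()
winner = enqueue([hung.getsockname(), healthy.getsockname()], down, b"event\n")
```

In this demo the hung server costs the caller one timeout interval before it lands in `down`, after which later events skip it entirely. A fuller version would also need a way to probe downed servers and bring them back.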

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
