API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. Enqueueing the event is the final step of the initial track request. Normally, the depth of these queues hovers around 0, meaning that you'll see your data in real time.
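
To make the "maintains a list of servers that are running and picks one" idea concrete, here is a minimal sketch in Python. It is an illustration only, not Mixpanel's production code: the ServerPool class, its method names, and the random selection policy are all assumptions, since the post does not say how a server is chosen.

```python
import random


class ServerPool:
    """Hypothetical pool that tracks which backend servers are believed to be up."""

    def __init__(self, servers):
        self.live = set(servers)  # servers currently considered healthy
        self.down = set()         # servers that have been marked as failed

    def pick(self):
        """Choose one live server at random (the selection policy is an assumption)."""
        if not self.live:
            raise RuntimeError("no live servers available")
        return random.choice(sorted(self.live))

    def mark_down(self, server):
        """Stop routing events to a server that has failed."""
        self.live.discard(server)
        self.down.add(server)

    def mark_up(self, server):
        """Resume routing events to a server that has recovered."""
        self.down.discard(server)
        self.live.add(server)
```

A pool like this is only as good as whatever calls mark_down: if a failure is never detected, the broken server stays in the live set and keeps receiving events.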

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with a kestrel process that appeared to be running, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP level errors, such as those that occur when the machine is down or kestrel isn't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending time doing the Java equivalent of a core dump.
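
The difference between a server that is down and one that is merely hung can be shown with a short Python sketch. This is an illustration under stated assumptions, not the actual failover code: the probe function, its return labels, and the "ping" payload are all hypothetical. A stopped process refuses the connection, which is a TCP-level error; a hung process still has an open listening socket, so the connection succeeds and only a timeout reveals the problem.

```python
import socket


def probe(host, port, timeout=0.5):
    """Classify a server as 'ok', 'closed', 'refused', or 'hung' (hypothetical labels)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # The kernel completes the TCP handshake even if the process is stuck,
            # so reaching this point does not prove the server is healthy.
            conn.sendall(b"ping\n")
            data = conn.recv(16)  # raises socket.timeout if nothing answers in time
            return "ok" if data else "closed"
    except ConnectionRefusedError:
        # Nothing is listening on the port: a TCP-level error, easy to detect.
        return "refused"
    except socket.timeout:
        # Connected, but no response: without a timeout this call would block.
        return "hung"
```

A failover check that only reacts to the "refused" case would treat a hung server as healthy, which matches the behavior described above.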

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report for April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly fewer events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API can continue to work.
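
As a rough sketch of what "strict timeouts on every operation" could look like, the Python example below applies one timeout to connecting, sending, and waiting for an acknowledgement, and marks a server as down on any failure. It is not the rewritten kestrel client: the QueueClient class, the 0.25 second default, and the "any non-empty reply counts as an acknowledgement" wire format are assumptions made for illustration.

```python
import socket


class QueueClient:
    """Hypothetical enqueue client that fails fast and routes around bad servers."""

    def __init__(self, servers, timeout=0.25):
        self.live = list(servers)  # (host, port) tuples believed to be up
        self.down = []             # servers marked down after a failure
        self.timeout = timeout     # applied to connect, send, and receive

    def _send(self, server, payload):
        # create_connection uses the timeout for the connect and leaves it set
        # on the socket, so sendall and recv below are bounded as well.
        with socket.create_connection(server, timeout=self.timeout) as conn:
            conn.sendall(payload)
            ack = conn.recv(64)  # assumed protocol: any reply means "stored"
            if not ack:
                raise ConnectionError("server closed without acknowledging")

    def enqueue(self, payload):
        """Try each live server in turn; return the one that accepted the event."""
        for server in list(self.live):
            try:
                self._send(server, payload)
                return server
            except OSError:
                # Covers refused connections, resets, and timeouts alike.
                self.live.remove(server)
                self.down.append(server)
        raise RuntimeError("no queue server accepted the event")
```

Treating a timeout exactly like a connection error is the key design choice here: a hung server costs one short wait and is then skipped, rather than tying up connections indefinitely.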

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
