API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, these queues hover around 0, meaning that you'll see your data in real time.
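The "maintain a list of running servers and pick one per event" pattern used by both the balancers and the API servers can be sketched roughly as follows. This is a minimal illustration, not our production code: the `ServerPool` class, its method names, and the random selection policy are all assumptions made for the example.

```python
import random


class ServerPool:
    """Tracks which downstream servers are considered up (hypothetical helper)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()

    def mark_down(self, server):
        """Stop routing events to a server that failed."""
        self.down.add(server)

    def mark_up(self, server):
        """Resume routing events to a server that recovered."""
        self.down.discard(server)

    def pick(self):
        """Return one live server chosen at random (assumed policy)."""
        live = [s for s in self.servers if s not in self.down]
        if not live:
            raise RuntimeError("no live servers available")
        return random.choice(live)
```

A pool like this is only as good as whatever calls `mark_down`: if a failed server is never reported as down, it keeps receiving its share of events.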

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running instance of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP-level errors, such as those that occur when the machine is down or kestrel isn't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending time doing the Java equivalent of a core dump.
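To see why a check based only on TCP/IP-level errors can miss this kind of failure, consider the sketch below. It assumes a connect-only health check (`tcp_connect_ok`, a hypothetical name) and simulates a stalled process with a local listener that never accepts or replies. The operating system completes the TCP handshake on the listener's behalf, so the check still passes.

```python
import socket


def tcp_connect_ok(host, port, timeout=1.0):
    """Return True if a TCP connection can be opened (connect-only check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_hung_listener():
    """Listen on a local port but never accept or reply, like a stalled process."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(5)
    return listener


if __name__ == "__main__":
    hung = start_hung_listener()
    host, port = hung.getsockname()
    # The kernel completes the handshake even though nothing calls accept().
    print(tcp_connect_ok(host, port))
    hung.close()
```

From the client's point of view the server looks healthy right up until it waits for a response that never comes.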

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
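The general shape of "fail fast, then mark the server down" might look like the sketch below. It is an illustration under stated assumptions rather than our actual client: the function names, the 0.5 second default, and the "any non-empty reply counts as an acknowledgement" rule are all hypothetical, and it does not implement kestrel's real wire protocol.

```python
import socket


def enqueue_with_timeout(host, port, payload, timeout=0.5):
    """Send payload and wait for a reply, bounding every socket operation.

    Returns True if the server replied, False on any error or timeout.
    """
    try:
        # create_connection leaves `timeout` set on the socket, so it also
        # bounds the sendall and recv calls below, not just the connect.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            ack = conn.recv(64)
            return len(ack) > 0  # assumed rule: any reply means success
    except OSError:  # socket.timeout is a subclass of OSError
        return False


def enqueue_with_failover(servers, down, payload, timeout=0.5):
    """Try each server not already marked down; mark failures as down.

    `servers` is a list of (host, port) pairs and `down` is a set that is
    updated in place. Returns the server that accepted the event, or None.
    """
    for server in servers:
        if server in down:
            continue
        host, port = server
        if enqueue_with_timeout(host, port, payload, timeout):
            return server
        down.add(server)  # no TCP error needed: a timeout is enough
    return None
```

With this approach, a process that accepts connections but never answers costs at most one timeout per operation before it is taken out of rotation, instead of tying up connections indefinitely.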

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
