API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP-level failover. Each balancer maintains a list of API servers that are currently running and picks one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, the depth of these queues hovers around 0, meaning that you'll see your data in real time.
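The first two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mixpanel's actual code: the `ServerPool` class, the `prepare_event` function, and the `event`, `properties`, `time`, and `ip` field names are all assumptions made for the example.

```python
import random
import time


class ServerPool:
    """Tracks which downstream servers are currently believed to be running."""

    def __init__(self, servers):
        self.running = set(servers)

    def mark_down(self, server):
        self.running.discard(server)

    def mark_up(self, server):
        self.running.add(server)

    def pick(self):
        # Choose one running server for this event.
        if not self.running:
            raise RuntimeError("no downstream servers are running")
        return random.choice(sorted(self.running))


def prepare_event(event, client_ip, want_ip=False, now=None):
    """Validate an event and fill in optional fields (hypothetical field names)."""
    if not isinstance(event.get("event"), str) or not event["event"]:
        raise ValueError("event needs a non-empty name")
    props = dict(event.get("properties", {}))
    if "time" not in props:
        # Add the time only if the client did not send one.
        props["time"] = int(now if now is not None else time.time())
    if want_ip:
        # Add the client IP only if it was requested.
        props["ip"] = client_ip
    return {"event": event["event"], "properties": props}
```

The pool is only as good as whatever calls `mark_down`: if nothing notices that a server has stopped responding, `pick` will keep handing events to it.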

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP-level errors, such as those that occur when the machine is down or kestrel isn't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that was presumably busy doing the Java equivalent of a core dump.
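To see why a TCP-level check is not enough, here is a minimal sketch in Python that simulates a wedged process with a socket that listens but never reads or replies. The function names and the `ping` probe are placeholders invented for the example; they are not kestrel commands or part of our failover system.

```python
import socket


def tcp_connect_ok(addr, timeout=0.5):
    """TCP-level check: can we complete a connection handshake?"""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def replies_in_time(addr, probe=b"ping\r\n", timeout=0.5):
    """Application-level check: does the process answer before the deadline?"""
    try:
        with socket.create_connection(addr, timeout=timeout) as sock:
            sock.sendall(probe)
            return bool(sock.recv(64))
    except OSError:  # includes timeouts
        return False


def hung_listener():
    """Simulate a wedged process: the OS accepts connections, nothing reads them."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    return srv
```

Against `hung_listener()`, `tcp_connect_ok` succeeds because the operating system completes the handshake on the process's behalf, while `replies_in_time` fails once its deadline passes. A failure detector built only on the first kind of check sees a healthy server.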

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly fewer events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
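A rough sketch of that approach in Python, under stated assumptions: the `TimedQueueClient` class, its parameters, and the 0.5 second and 30 second defaults are hypothetical, and the wire format is a memcached-style `set` used here for illustration. This is not our actual client code.

```python
import socket
import time


class QueueServerDown(Exception):
    """Raised when no queue server accepted the event in time."""


class TimedQueueClient:
    def __init__(self, servers, timeout=0.5, retry_after=30.0):
        self.servers = list(servers)   # [(host, port), ...]
        self.timeout = timeout         # seconds allowed per socket operation
        self.retry_after = retry_after # seconds before retrying a down server
        self.down_until = {}           # server -> monotonic time of next retry

    def _live_servers(self):
        now = time.monotonic()
        return [s for s in self.servers if self.down_until.get(s, 0.0) <= now]

    def _send(self, server, queue, payload):
        # The timeout covers connect, send, and the wait for a reply.
        with socket.create_connection(server, timeout=self.timeout) as sock:
            header = "set {} 0 0 {}\r\n".format(queue, len(payload)).encode("ascii")
            sock.sendall(header + payload + b"\r\n")
            reply = sock.recv(64)
        if not reply.startswith(b"STORED"):
            raise OSError("unexpected reply: {!r}".format(reply))

    def enqueue(self, queue, payload):
        for server in self._live_servers():
            try:
                self._send(server, queue, payload)
                return server
            except OSError:  # timeouts are OSError subclasses in Python 3
                # Slow or broken: stop sending here for a while and move on.
                self.down_until[server] = time.monotonic() + self.retry_after
        raise QueueServerDown("no queue server accepted the event")
```

The important property is that a server which accepts connections but never answers is treated the same way as one that refuses them: the write gives up after `timeout` seconds, the server is skipped for `retry_after` seconds, and the event goes to the next server in the list.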

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.