API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, these queues hover around 0, meaning that you'll see your data in real time.
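
The "keep a list of running servers and pick one per event" pattern used by both the balancers and the API servers can be sketched roughly as follows. This is a minimal illustration, not our production code: the `ServerPool` class, its method names, and the server names are hypothetical.

```python
import random


class ServerPool:
    """Tracks which backends are currently considered up (hypothetical helper)."""

    def __init__(self, servers):
        self.live = set(servers)
        self.down = set()

    def mark_down(self, server):
        # Stop routing events to a server that failed a health check.
        self.live.discard(server)
        self.down.add(server)

    def mark_up(self, server):
        # Put a recovered server back into rotation.
        self.down.discard(server)
        self.live.add(server)

    def pick(self):
        # Choose one live server for the next event.
        if not self.live:
            raise RuntimeError("no live servers")
        return random.choice(sorted(self.live))


pool = ServerPool(["queue-1", "queue-2"])  # hypothetical server names
pool.mark_down("queue-1")
target = pool.pick()  # only "queue-2" is eligible now
```

The whole scheme depends on `mark_down` actually being called when a server stops working, which is the part that failed during this incident.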

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP-level errors, such as those that occur when the machine is down or kestrel isn't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that was presumably busy doing the Java equivalent of a core dump.
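
To illustrate why a check based only on TCP/IP errors can miss this kind of failure, here is a small sketch. It assumes a wedged process behaves like a listening socket whose owner never accepts or reads; that is a stand-in for illustration, not a reproduction of the kestrel crash. The function name `probe_connect_only` is hypothetical.

```python
import socket

# A listening socket that nobody ever accept()s from: a stand-in for a
# process that is alive as far as the operating system is concerned but
# is doing no useful work.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
server.listen(5)
port = server.getsockname()[1]


def probe_connect_only(host, port):
    """Health check that only notices TCP-level errors."""
    try:
        with socket.create_connection((host, port), timeout=0.3):
            return True
    except OSError:
        return False


# The kernel completes the handshake on the listener's behalf,
# so the server looks healthy to this check.
looks_up = probe_connect_only("127.0.0.1", port)

with socket.create_connection(("127.0.0.1", port), timeout=0.3) as conn:
    conn.sendall(b"event\n")  # succeeds: the kernel buffers the bytes
    try:
        conn.recv(16)  # but no reply ever comes back
        got_reply = True
    except socket.timeout:
        got_reply = False

server.close()
```

In this sketch the connection and the write both succeed, and only the missing reply reveals the problem. Without a timeout on the read, the client would wait indefinitely.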

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
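
A rough sketch of the fail-fast idea is below. It is not the rewritten client itself: the function names, the newline-terminated payload, and the `OK` acknowledgement are placeholders invented for illustration rather than kestrel's actual wire protocol, and the timeout value is arbitrary.

```python
import socket


def enqueue_with_timeout(addr, payload, timeout=0.5):
    """Return True only if the server acknowledges the write in time.

    The timeout applies to each socket operation (connect, send, recv),
    so a server that accepts connections but never answers still fails.
    """
    try:
        with socket.create_connection(addr, timeout=timeout) as conn:
            conn.settimeout(timeout)  # explicit: bounds every send/recv
            conn.sendall(payload + b"\n")
            return conn.recv(16).startswith(b"OK")  # placeholder ack
    except OSError:  # covers timeouts as well as refused connections
        return False


def enqueue(live, payload, timeout=0.5):
    """Try live servers in turn, dropping any that fail or stall."""
    for addr in list(live):
        if enqueue_with_timeout(addr, payload, timeout):
            return addr
        live.remove(addr)  # mark the server as down
    raise RuntimeError("no queue server accepted the event")
```

With this shape, a stalled server costs at most a few timeouts on one request before it is taken out of rotation, instead of tying up connections on every request that follows.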

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
