API Downtime on April 16, 2012

At Mixpanel, we take the stability and reliability of our product very seriously. Yesterday, our track API was down from 2:55 PM to 4:20 PM PDT. We are extremely sorry and feel that we owe you, our customers, an explanation of what happened and how we're going to prevent it from happening again.

What happened

Before explaining exactly what happened, it's useful to have a rough overview of what happens to an event when it's sent to Mixpanel. From the client perspective, each event is simply an HTTP request. On Mixpanel's side, however, each HTTP request hits a few different servers in succession:

  1. First, we have HTTP load balancers running on api.mixpanel.com with IP level failover. Each balancer maintains a list of API servers that are currently running and will pick one for each event it processes.

  2. Our API servers do basic event validation and optionally add some useful information to the event, such as time if it's missing or the client IP address if requested. Similar to our balancers, each API server maintains a list of queue servers that are running and will pick one for each event.

  3. For queueing, we use a piece of software called kestrel, which is a hybrid in-memory / on-disk queue. This is the final step during the initial track request. Normally, the length of these queues hovers around 0, meaning that you'll see your data in real time.

Yesterday, we saw failures on each of these pieces of infrastructure. We spent most of the downtime tracking down the root problem (and we made a few mistakes changing settings on otherwise working servers). Ultimately, we found one queue server with an apparently running version of kestrel, but with the following in the log:

#  
# A fatal error has been detected by the Java Runtime Environment:  
#  
# Internal Error (safepoint.cpp:308), pid=13379, tid=140221886957312  
# guarantee(PageArmed == 0) failed: invariant  
#  
# JRE version: 6.0_26-b03  
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)  
# An error report file with more information is saved as:  
# /tmp/hs_err_pid13379.log  
#  
# If you would like to submit a bug report, please visit:  
# http://java.sun.com/webapps/bugreport/crash.jsp  
#

Unfortunately, because the process was still running according to Linux, none of our automated failover systems worked. They rely on TCP/IP level errors, such as those that would occur if the machine were down or kestrel weren't running. Instead of marking the machine as down, our API servers simply filled up with connections trying to write events to a process that presumably was spending time doing the Java equivalent of a core dump.

Once we figured this all out, fixing the problem was simply a matter of restarting the affected server.

What this means for you

Regrettably, track requests sent during this time were lost. In an hourly report on April 16th, you will see a drop-off in events for the hour of 3pm PDT, and slightly lower events for 2pm and 4pm. Consequently, your daily data for April 16th will be slightly depressed as well.

How we're going to fix the problem

We are extremely proud of our uptime and stability, and are working to prevent this from happening again. We realize that all sorts of failures happen in the real world and we need to be able to recover from them quickly and transparently regardless of whether it's a networking problem or a JVM crash. It's clear that our current detection system is not good enough. The queue server should have been marked down almost immediately.

With that in mind, we're rewriting our kestrel client code to put strict timeouts on every operation when enqueueing events. Even if we never get any TCP/IP errors, we should still fail and fail fast, so that we can mark the server as down and our API continues to work.
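As a rough illustration of the fail-fast idea (not our actual client code), here is a minimal Python sketch. The names `QueueServerPool` and `send_with_timeout`, the default timeout values, and the one-line acknowledgement are all assumptions made for the example; in particular, the wire exchange is a placeholder and not kestrel's real protocol. The point it shows is that a timeout is treated exactly like a connection error, so a process that accepts connections but never answers still gets marked down.

```python
import socket
import time


class QueueServerPool:
    """Tracks which queue servers are considered up (hypothetical helper)."""

    def __init__(self, servers, timeout=0.5, retry_after=30.0):
        self.servers = list(servers)    # (host, port) pairs
        self.timeout = timeout          # seconds allowed per socket operation
        self.retry_after = retry_after  # seconds before a down server is retried
        self.down_until = {}            # server -> monotonic time it may be retried

    def live_servers(self):
        now = time.monotonic()
        return [
            s for s in self.servers
            if s not in self.down_until or self.down_until[s] <= now
        ]

    def mark_down(self, server):
        self.down_until[server] = time.monotonic() + self.retry_after

    def enqueue(self, payload, send=None):
        """Try each live server in turn; return the one that accepted the event."""
        send = send or send_with_timeout
        for server in self.live_servers():
            try:
                send(server, payload, self.timeout)
                return server
            except OSError:
                # OSError covers refused connections *and* socket.timeout,
                # so a hung process is handled the same way as a dead machine.
                self.mark_down(server)
        raise RuntimeError("no queue server accepted the event")


def send_with_timeout(server, payload, timeout):
    """Send payload and wait for a reply, with every socket call bounded by timeout."""
    with socket.create_connection(server, timeout=timeout) as sock:
        sock.settimeout(timeout)
        sock.sendall(payload)
        # Placeholder acknowledgement: any non-empty reply counts as success.
        if not sock.recv(64):
            raise ConnectionError("server closed the connection without replying")
```

With a pool built from a list of hypothetical `(host, port)` pairs, `pool.enqueue(event_bytes)` returns the server that took the event. A server that accepts the connection and then stalls causes `recv` to raise `socket.timeout` after `timeout` seconds, at which point the pool skips it for `retry_after` seconds instead of piling up connections against it.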

Additionally, it's clear that our testing does not currently cover all failure scenarios. We're working on expanding that test coverage so that we can be sure that the failover systems we have in place will work correctly in production.

If you have any questions about the downtime, please email support@mixpanel.com.
