The hard thing about machine learning
Every December, our CEO Suhail takes advantage of the holiday downtime to try and learn something new. Two Decembers ago, that led him to enroll in a Machine Learning class on Coursera, the online education site that offers video courses from top universities like Stanford, Johns Hopkins, and Penn.
Machine learning has long been appreciated by the engineering world, and it is increasingly finding practical application in products we use every day. And yet most tech companies are still some time away from integrating machine learning practices into their products.
After weeks of online lectures on regression, regularization, neural networks, unsupervised learning, and anomaly detection, it became clear to Suhail: Mixpanel’s users could use this.
Mixpanel already helps you see what actions your users have taken in the past, but with machine learning, Mixpanel could predict what your users would do next, and that could be very valuable.
Suhail knew he needed to bring in someone who had more experience with machine learning than a holiday crash course. If Mixpanel was going to do this, they would do it right.
This is how robots take over, right?
To the uninitiated, machine learning is synonymous with artificial intelligence. It conjures mental images of pop culture AI, sentient robots like HAL 9000 and Skynet, bent on world domination. Or, at the very least, Marvin the Paranoid Android, bent on making us pretty sad.
But artificial intelligence is the broad concept of a computer that can mimic the human mind. And creating AI is a tall order that includes programming reason, motives, and general abstract thinking. Machine learning is more precise. It is a branch of artificial intelligence often focused on extracting knowledge from a set of data and using that knowledge to improve how a machine performs a task based on similar data in the future.
Supervised machine learning looks at many instances of something, like how long it takes me to get into work in the morning, and accounts for the factors that play into it, like the time I walk out my front door, the day of the week, or the route that I take. Based on all of this data, a machine learning model could tell me when it thinks I should leave each morning and what route I should take to get to the office on time.
And the more I go to work, the more data is gathered. Machine learning uses this new data to continuously refine and improve the model to produce increasingly accurate suggestions. It could even take into account new factors, like if I start riding a bike, or if I’m stopping at a new coffee shop every morning.
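To make the commute example a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the commute log, the single input (minutes after 7:00 that I leave), and the choice of a straight-line fit. A real model would weigh far more factors, like the route and the day of the week.

```python
from statistics import mean

# Hypothetical commute log: (minutes after 7:00 when I left, commute length in minutes)
history = [(0, 30), (15, 34), (30, 40), (45, 43), (60, 48)]

def fit_line(samples):
    """Least-squares fit of: commute_minutes ~ slope * departure + intercept."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in samples)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def latest_departure(samples, arrive_by, candidates=range(0, 121, 5)):
    """Latest candidate departure whose predicted arrival is still on time."""
    slope, intercept = fit_line(samples)
    on_time = [d for d in candidates if d + slope * d + intercept <= arrive_by]
    return max(on_time) if on_time else None

# To arrive by 8:30 (90 minutes after 7:00):
print(latest_departure(history, arrive_by=90))  # 45, i.e. leave by 7:45

# More mornings, more data: refit on the longer log and ask again.
more_history = history + [(45, 60), (50, 58)]
print(latest_departure(more_history, arrive_by=90))
```

The second call is the "learning" part in miniature: two slow mornings get added to the log, the line is refit, and the suggested departure time moves earlier.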
And while true artificial intelligence is still a ways off (from bringing an end to the human race as we know it), machine learning is already being put to use in the world around us. It’s behind the TV shows that Netflix suggests as we endlessly scroll, looking for something to watch. It’s in the search results that Google spits out after we search “rio olympics.” And it’s in the climate models that give us the weather forecast for this Saturday’s hike.
Machine learning is especially suited to solve prediction and real-world optimization problems. And every day we are finding more practical applications for it. Google recently used machine learning to minimize the power usage of its data centers – a significant expense for the world’s largest tech company. The result was a 15% decrease in electricity usage, which, over years, will save the company hundreds of millions of dollars.
But machine learning isn’t some sentient robot that does this all on its own. Behind every good machine learning model is a team of engineers who took a long, thoughtful look at the problem and crafted a model that gets better at solving the problem the more it encounters it.
And finding that problem and crafting the right model is what makes machine learning really hard.
The hard thing about machine learning
“Think of it this way,” Jenny Finkel, Mixpanel’s head of machine learning, tells me. “There’s a spectrum of problems that you can try and solve with ML. On one side are the very concrete problems and on the other are the abstract problems.”
Jenny joined Mixpanel a year and a half ago and has experience solving all varieties of machine learning problems.
As she explains it to me, my morning commute problem is concrete. It is a very well defined problem. There are a series of routes to the office I can take and a series of times I can leave my house. From those, the model tries to pick the best ones based on the current day’s circumstances.
But the broader and more abstract a problem becomes, the more varying instances it must account for, and the more difficult it becomes to solve with the same model.
Imagine we add someone else to my commute problem. And they’re leaving from a different neighborhood, or maybe even a different city. Maybe they’re driving instead of taking a bus. They may even have a different metric that they’re judging their commute on. Where I want to get into the office by a specific time, they might have a window of time when they’d like to get in, and are more concerned with having the shortest commute possible.
The broader the problem, the more universal the model needs to be. But the more universal the model, the less accurate it is for each particular instance.
The hard part of machine learning is thinking about a problem critically, crafting a model to solve the problem, finding how that model breaks, and then updating it to work better. A universal model can’t do that.
“There’s a black art to making a really good machine learning model,” Jenny says. “The hard part isn’t the math. The hard part is figuring out how to represent your problem in a way that it’s learnable.”
Discovering the right problem
When Jenny looks for the right problem to work on, she’s looking for something that is sufficiently difficult, but ultimately solvable. At Stanford, where Jenny earned a PhD in computer science, she focused on machine learning and natural language processing.
“Human language is really fun to work on,” she tells me. “It’s really, really hard to model, but you know you can do it, because humans can do it.”
At Stanford she co-authored papers, including one with Andrew Ng, the Stanford professor and Coursera co-founder who taught Suhail’s online crash course in machine learning.
Afterwards, Jenny spent three years working at Prismatic, a small startup that made a smart personalized news reader. She loved working on the product, but ultimately the company wasn’t successful. For her next role she was looking to be the first machine learning person at a post-product-market-fit company with an obvious need for machine learning; a company where she could build the foundation according to her vision of how it should be done.
“You talk with some companies and they just say, ‘Oh, data! Big data is popular. Come do stuff with our data.’ and I think, ‘No, I promise you, if that’s your mindset going into it, you will not be happy coming out of it.’”
Blindly diving into data without an understanding of what you want to get out of it is an easy way to set yourself up for failure. A huge part of doing machine learning well is starting with the right problem to solve.
Through a friend, Jenny got connected with Mixpanel. And it was the problem that Suhail was looking to help Mixpanel users solve that interested Jenny.
Prototyping the model
Every product using Mixpanel to track the actions users are taking has some variation of a “conversion event.” For this publication, it might be getting you to subscribe or create a Mixpanel account. For a streaming music app, it could be upgrading to a paid account. For a shopping app, it’s making a purchase.
Mixpanel was looking to build a tool that could help predict which of your users were most likely to convert, regardless of what a conversion was for you.
“It’s a meta problem,” Jenny says. “You aren’t trying to predict who will do one specific action, like buy a shirt. It could be any action, or even performing a certain action a specific amount of times.”
But it’s not completely abstract either. The Mixpanel data model, regardless of whether it’s for a gaming app or a photo-sharing app, is consistent. Each and every implementation of Mixpanel is different, but they all send events in the same format.
Jenny was intrigued.
“I like tasks where you don’t know if you’ll succeed,” she tells me. “It means you’re trying.”
With the problem laid out for her, and as the sole member of the new machine learning team, Jenny began pulling data from Mixpanel and modeling it.
“I was on my own for six months. I just had my mandate and I went to my corner and went for it.”
She’s not joking. I joined Mixpanel during her early days and I first knew Jenny as the engineer constantly working from a couch, tucked away in a low-trafficked hallway at the far end of the office.
From that couch she built early prediction models and saw when they broke. She ran data through the model from big companies and small companies; from products with high conversion rates, like those with sign-up events, and products with low conversion rates, like a big-ticket e-commerce store.
“Predicting conversion events on data is hard because circumstances are different, companies are different, users are different.”
The problem was broad, and that meant Jenny started with a more generalized model. Now it was time to apply the black art and make it sing. For several months she pulled test data, ran it through the model to see how it performed, and fixed the model whenever it output inaccurate predictions.
One iteration at a time, the model got better and better, and closer to the product that would become Mixpanel Predict.
Turning a prototype into a product
When it came time to turn the Predict model into a product, the prototype needed to be integrated into Mixpanel’s engineering environment. And that meant writing some C++.
But coding in C++ is not fast, and it can be a bit of an arduous process. C++ requires mapping out every little detail, and Jenny wasn’t enjoying it.
Then, at a company outing to watch the Giants play a day game, an opportunity fell into Jenny’s lap. Wonja, a new addition to the Mixpanel engineering team, mentioned how much she missed writing C++.
And Wonja was interested in machine learning. She had data analysis experience from her days as a research assistant/specialist at the Tufts University NeuroCognition Lab and the UC Davis M.I.N.D. Institute. To Jenny, it seemed like a perfect fit.
“That evening, as soon as I got home from the ballgame, I wrote Joe an email,” Jenny recalls. Joe is Joe Xavier, Mixpanel’s head of engineering. “I was like, ‘Hey, I was just thinking … maybe Wonja should work with me.’”
Joe bought in, and Wonja joined the team to help turn Jenny’s model into a product that Mixpanel users could run on their own data.
That transition would also require a good deal of communication to the end user. It’s not enough merely to find the statistical insights; the Predict product would have to communicate those insights to its users in a way that was understandable and actionable.
Typically in a machine learning problem, the output is a number. In my morning commute example, the model would assign a value to all of the possible routes and return to me, the user, the highest value route. In Predict, the model would assign a number value to every user in a project based on how likely they were to convert. But while that precision might work for an engineer, it wouldn’t be as accessible to the product person or marketer.
Instead, the team settled on turning the scores into conversion grades of A, B, C, or D. Users with an A score were likely to convert, B scores would possibly convert, C scores had a marginal chance of converting, and users with a D score were unlikely to convert.
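As a rough illustration of that last step, a score can be bucketed into a letter grade with a few cutoffs. The sketch below assumes the score is a likelihood between 0 and 1, and the cutoff values are invented for the example; they are not Predict's actual thresholds, which this article doesn't describe.

```python
def conversion_grade(score, cutoffs=(0.75, 0.50, 0.25)):
    """Map a conversion likelihood in [0, 1] to a grade of A, B, C, or D.

    The cutoffs are illustrative assumptions, not Predict's real values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Walk the cutoffs from highest to lowest; the first one met wins.
    for grade, cutoff in zip("ABC", cutoffs):
        if score >= cutoff:
            return grade
    return "D"

# Hypothetical users and scores
scores = {"ana": 0.91, "ben": 0.62, "cy": 0.30, "dee": 0.04}
grades = {user: conversion_grade(s) for user, s in scores.items()}
print(grades)  # {'ana': 'A', 'ben': 'B', 'cy': 'C', 'dee': 'D'}
```

The trade is deliberate: a grade throws away precision that an engineer might want, in exchange for something a marketer can act on at a glance.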
The next issue, which Jenny saw coming from a mile away, was communicating correlation vs. causation in the prediction data – the bane of any statistician’s existence.
“Our models can tell you who we believe will convert based off indicators, but then of course people want to know why a user converts,” Jenny explains. “That’s understandable, but those aren’t the same thing.”
When you tell someone that a user is likely to convert, the first question you’re bound to get is “why?” The honest answer, which isn’t always what someone wants to hear, is that correlation doesn’t say why.
Before she goes into her explanation, she pauses. “I’m not very good at explaining these kinds of things.”
(Jenny has a habit of prefacing her explanation with statements like this before launching into a perfect example.)
“For Mixpanel, an indicator that a user will pay us could be that she keeps looking at the pricing page,” she explains. “But looking at the pricing page isn’t what gets you to pay us. It just means you are thinking of paying us, and if we stick a salesperson on you, you’re more likely to pay us.”
Mixpanel can’t just funnel as many users as possible to the pricing page and expect them to convert. The event was correlated with the conversion, but it wasn’t causing it.
Correlation does not imply causation. Just because an action is a good indicator of an impending conversion event doesn’t mean that it is causing the conversion event.
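A toy simulation makes the distinction easier to see. Everything here is made up for illustration: a hidden "interested" flag stands in for a user who is already thinking about paying, and the rates are arbitrary. The key assumption is written into the code itself: conversion depends only on that hidden interest, never on the pricing page visit.

```python
import random

def simulate(n, force_visit=False, seed=0):
    """Return (visited_pricing, converted) pairs for n made-up users."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        # Hidden cause: is this user already thinking about paying?
        interested = rng.random() < 0.2
        # Organic behavior: interested users look at pricing far more often.
        # With force_visit=True, everyone is pushed to the page instead.
        visited = True if force_visit else rng.random() < (0.8 if interested else 0.1)
        # Assumption: conversion depends only on interest, never on the visit.
        converted = rng.random() < (0.5 if interested else 0.01)
        users.append((visited, converted))
    return users

def conversion_rate(users, visited):
    """Conversion rate among users whose pricing-page flag equals `visited`."""
    outcomes = [converted for saw_page, converted in users if saw_page == visited]
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")

organic = simulate(20_000)
forced = simulate(20_000, force_visit=True)

print("organic, visited pricing:", round(conversion_rate(organic, True), 3))
print("organic, did not visit:  ", round(conversion_rate(organic, False), 3))
print("everyone forced to visit:", round(conversion_rate(forced, True), 3))
```

With these invented rates, roughly a third of the users who visit the pricing page on their own go on to convert, against only a few percent of those who don't. Push everyone to the page and the overall rate settles around one in ten, the same as it would be with no pricing page at all. The visit was a signal of interest, not a source of it.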
For Predict, correlation worked for the problem that it was trying to solve, which was who is most likely to convert, not why they are likely to convert.
What makes Predict’s grades accurate is that they aren’t based on one big indicator, like visiting a pricing page, but on many indicators of varying importance. The output isn’t why a user is ready to convert, but that they are likely to convert. It’s a product that sets you up to take action on the results, to figuratively (and sometimes literally) stick a salesperson on the user and drive an outcome.
Finally, on November 17th of last year, Predict went live to the world.
It was covered by the likes of Fast Company, VentureBeat, and Business Insider. It spent the day atop Product Hunt. And in the first few days, it calculated millions of conversion grades for Mixpanel customers.
Mixpanel announces a new prediction feature, I tested it, awesome as usual! https://t.co/wS3RxRyWUF
— François-G. Ribreau (@FGRibreau) November 17, 2015
After almost a year of work, the product was in the hands of users. And much like a machine learning model, the more a product is used, and the more information there is to extract from the data, the better you can make it. The team was able to see how Predict performed in the wild and what people were using it to… predict. With that information, they were able to make improvements to the prediction model.
And as Predict evolved, so did the machine learning team. Growing from Jenny and Wonja to a team of five, machine learning at Mixpanel expanded to new products that solve new problems with new models.
For a start, they want to build smart mobile alerts that will message a user when her analytics take an uncharacteristic spike or fall off a cliff. This is a problem that might sound simple enough, until you factor in how much normal variation there is in user events, based on the day of the week or time of the year, or on completely non-cyclical factors. It’s a product that means a little something different for each and every person who uses it.
A product that could use a little machine learning.