This is the difference between statistics and data science
It’s unclear whether there is a greater demand for data scientists or for articles about data science. So it goes when terms make their way towards buzzwords. There’s a rush to produce content about whatever it is we are all searching for that day: “responsive”, “the Cloud”, “Omni-channel”.
And there is certainly no lack of demand for data scientists. A few months ago, Glassdoor named data scientist the top job of 2016 – with more than 1,700 job openings and an average salary of $116k.
But after trudging from data science blog post to Quora response to b-school article – some of which were quite thoughtful – trying to understand the booming trend, I only had more questions. Everyone had a slightly different definition of what it was or wasn’t. After a couple hours, I wasn’t even sure if data science was actually a thing.
I feared my own data science article would just be another in the pile. That’s just what the world needs, another marketer running their mouth off about something they don’t fully understand. What is data science? How is it different than statistics? And why are data scientists in such demand?
The answer, as I would soon find out, had to do with not just the ability to program, but an immense knowledge of the product.
A skeptical statistician
Nate Silver doesn’t seem to think data science is different than statistics. The well-known number cruncher behind the media site FiveThirtyEight – and the guy who famously and correctly predicted the electoral outcome of 49 of 50 states in the 2008 U.S. Presidential election, and went a perfect 50 for 50 in 2012 – is more than a bit skeptical of the term.
“I think data-scientist is a sexed up term for a statistician,” Silver told an audience of statisticians in 2013 at the Joint Statistical Meetings.
“Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”
For statisticians, the entire data science trend seems a bit patronizing. No matter what your exact definition of data science is, it’s going to sound pretty similar to the work that statisticians have been doing for decades.
And while there are a myriad of arguments suggesting otherwise, it’s a difficult opinion to refute without consensus on what data science actually is. Too many definitions rely on past generations of buzzwords to prop up the data science title. Mining big data for business intelligence. Ambiguous buzzwords, one after another. Turtles all the way down.
Even if data science is something distinct, I wasn’t sure what all these companies wanted armies of data scientists for. Why is it such a hot job? Are companies just copying Google, Facebook, and Netflix, longing for their outcomes and valuations?
Frustrated, I switched browser tabs and shot a quick text message to a CTO friend. “Don’t get me started on data scientists,” he fired back within seconds.
Off and on for the last several months, he had been interviewing candidates for a data science position they had created at their company. And it turns out self-proclaimed data scientists themselves were more than a bit murky on the role. Each applicant had a slightly different skill set and an even more different concept of what they should be doing.
“99% of the applicants are not actually data scientists,” he told me. “They can’t do what we need.”
It seems that even those trying to secure a data science role aren’t entirely sure what it entails and where it diverges from statistics.
Someone with answers
Looking for answers, I email Drew Harry, the director of data science at Twitch. We chatted last fall for an article about how Twitch has scaled. If anyone can point me in the right direction, Drew can.
“Actually, I’ve got a colleague with some interesting thoughts on this,” he writes back.
And a few days later, on a rainy Tuesday morning, I meet Brad Schumitsch at a cafe a couple blocks from the San Francisco Twitch headquarters.
“So tell me where you’re at with data science and statistics,” Brad asks. Then he patiently sits back, sipping a hot chocolate, and listening intently while I, already two coffees in, ramble from R to data pipeline management to algorithms.
Brad’s a Fulbright scholar. A dozen years ago he wrote an important paper detailing how a mathematical technique called convex optimization improved H.264 video encoding. He has a PhD in Machine Learning from Stanford, and he spent a year at Google X, the experimental research department behind Google moonshots like the self-driving car and Google Glass. Brad is someone with answers, which I’m looking for, but like a good data scientist, he begins by asking questions, establishing a baseline.
“That makes a bunch of sense,” Brad says, kindly, after I finish my rambling. “I think it can be tricky. This is a great topic because if it was something that you didn’t have to think about, it wouldn’t be interesting.”
After a pause, he begins, “First, I want to say that I have a lot of respect for statisticians.”
He’s deliberate and not afraid of taking a moment to gather his thoughts.
“Statistics is a crucial component of data science. At Twitch, our data science team brings together three things: statistics, programming, and product knowledge. And we would never hire someone who wasn’t strong in stats. You can be a great programmer, but if you don’t know what Bayes Rule is, then we have an engineering department I can point you to.”
“Some people might just say data science is applied stats,” Brad says. “We’re certainly not pure statisticians. But I don’t necessarily need people who are going to do theoretical statistics research. No one is writing papers that Fisher would write,” he continues, referencing Ronald Fisher, considered the father of modern statistics and experimental design. “It’s much more about applying those learnings.”
And at a tech company like Twitch, it’s clear that applying those learnings requires a deep understanding of computer science.
Expanding beyond statistics
There have been calls to do more in the statistics community, to expand its boundaries, to look more to data collection, management, and presentation, to focus more on predicting future outcomes and less on merely inferring relationships. There are many ways statistics could grow. Instead of just handing off learnings and then returning to their theoretical statistical pursuits, statisticians need to better communicate and take action on the learnings.
For example, a few decades ago, quants (statisticians working as quantitative analysts) were crunching numbers in windowless rooms and passing on their results for others, often financial traders, to take action on. Today data scientists are writing the algorithms to ingest real-time data, crunch the numbers, and make trades, all automated, all within fractions of a second.
The origin in statistics is undeniable. I understand why many, including the revered Nate Silver, might conflate the two. But the scope of the work data scientists are doing has become so much more than that of statistics.
“Back when I was looking at colleges I distinctly remember some cocky guys at MIT say, ‘Look, a computer science degree is like the liberal arts degree for the next century. It’s useful for everything,’” Brad recalls.
That’s undeniable. Just a couple weeks ago the same concept came up in my chat about growth hacking with Andrew Chen. Computer science is bringing new dimensions to many fields. Marketing + coding = growth hacking. And maybe, statistics + coding = data science. I make a mental note to get back on those Udemy classes I’ve been neglecting.
“And it kinda makes sense,” Brad continues. “We can talk about a lot of ideas, but at the end of the day, how does it get done? You type some shit into a computer. And someone who can do that is just going to be more productive.”
The age of dynamic products
Twenty years ago, the pages I visited on the Macintosh IIsi in the Harlan Elementary computer lab were mostly static documents. But static pages can only get you so far, and soon more complicated websites would respond to user input. Like a site called Google, which allowed you to enter text, and then returned a list of webpages related to that text.
But obviously Google wasn’t going to have a static document for every single possible text input. Instead it crawled webpages, going from one to another, collecting as much data as possible about each page. Then, when you entered “bicycle parts” into the search field, Google would programmatically look through all of its data and build a page for you with links to the pages that seemed most closely associated with that term.
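To make the idea concrete, here is a toy sketch in Python of that crawl-then-lookup pattern. It is not how Google actually works; it is just a minimal word index, and the page URLs, the page text, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for crawled pages: URL -> page text.
PAGES = {
    "example.com/bike-shop": "bicycle parts and bicycle repair",
    "example.com/cars": "car parts and accessories",
    "example.com/recipes": "easy dinner recipes",
}

def build_index(pages):
    """Map each word to the URLs that contain it, with a count per URL."""
    index = defaultdict(lambda: defaultdict(int))
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, query):
    """Return URLs ranked by how often the query words appear on them."""
    scores = defaultdict(int)
    for word in query.lower().split():
        # .get() avoids creating empty entries for words never seen.
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

INDEX = build_index(PAGES)
RESULTS = search(INDEX, "bicycle parts")
```

The index is built once, ahead of time, and each search is just a lookup plus a sort. That split between slow collection and fast answering is the part that matters here; the word-count scoring is only a placeholder for whatever a real search engine does.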
Of course, today, we just assume sites and apps are dynamic, based not only on what you input, but also on the troves of information a product has about you. My Netflix homepage will have movies recommended specifically for me based on my past behavior. Spotify builds my weekly “Discover” playlist.
When you open Facebook, an untold number of variables factor into creating a better news feed. Will Oremus, Slate’s senior technology writer, explains the process in his excellent exploration of the algorithm behind the Facebook news feed:
Every time you open Facebook, one of the world’s most influential, controversial, and misunderstood algorithms springs into action. It scans and collects everything posted in the past week by each of your friends, everyone you follow, each group you belong to, and every Facebook page you’ve liked. For the average Facebook user, that’s more than 1,500 posts. If you have several hundred friends, it could be as many as 10,000. Then, according to a closely guarded and constantly shifting formula, Facebook’s news feed algorithm ranks them all, in what it believes to be the precise order of how likely you are to find each post worthwhile. Most users will only ever see the top few hundred.
And someone needs to write an algorithm to power those features. Facebook could take all that historical data and hand it off to a very talented statistician. And she would put her immense knowledge and experience to use, diving into R and producing an excellent model that infers the relationship between all of these variables. And that would, no doubt, yield valuable insights into which ads would perform best in different situations.
But how do you bake that into the product? How much difference can that make when it’s only backward-looking? Facebook needs an algorithm that can do all of that in the time it takes for the page to load, predicting and delivering the best news feed. That’s what a data scientist does.
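Here is a rough sketch, in Python, of what “baking it in” might look like: a model fitted ahead of time on historical data, then used to score and rank candidate posts on every page load. This is not Facebook’s formula, which is closely guarded. The feature names, the weights, and the logistic model are all assumptions chosen to keep the example small.

```python
import math

# Hypothetical weights, standing in for a model fitted offline on historical data.
WEIGHTS = {"from_close_friend": 1.5, "has_photo": 0.5, "hours_old": -0.1}
BIAS = -1.0

def predict_engagement(post):
    """Logistic model: estimated probability the user finds the post worthwhile."""
    z = BIAS + sum(WEIGHTS[name] * post[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def rank_feed(candidates, top_k=2):
    """Score every candidate post at request time and keep the top_k."""
    ranked = sorted(candidates, key=predict_engagement, reverse=True)
    return ranked[:top_k]

# Made-up candidate posts for one user opening the app.
CANDIDATES = [
    {"id": "a", "from_close_friend": 0, "has_photo": 0, "hours_old": 30},
    {"id": "b", "from_close_friend": 1, "has_photo": 1, "hours_old": 2},
    {"id": "c", "from_close_friend": 0, "has_photo": 1, "hours_old": 1},
]
FEED = [post["id"] for post in rank_feed(CANDIDATES)]
```

The statistician’s model and this one could share the same math. The difference is where it runs: the fitted weights live inside the product and get applied to fresh posts in milliseconds, rather than ending up in a report.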
And that is why tech companies need data scientists. And it’s why data scientists, though working with statistics, are so much more than rebranded statisticians.
But excelling in data science requires something even more – a deep understanding of the product you are working on.
The question behind the question
“There are a lot of great people at Twitch that don’t necessarily know statistics. And so, in order to have impact, you have to be a bridge between the data and what a product manager cares about,” says Brad.
A word Brad uses a lot when we talk about data science’s role in product is “efficiency.”
“It’s much more efficient to have the same mind with the product sense decide what metrics are relevant, the programming sense to implement tracking, and the statistical sense to provide analysis.”
Without an understanding of how people are using the product, and what the company goals are, the data analysis can get lost in translation. It’s a data scientist’s job to hold all of those things in her head at once, so when someone comes to the department with a problem that isn’t very well defined, she knows what data she has at hand to answer the question.
It reminds me of a phrase that has become something of a mantra within the Mixpanel support team: the question behind the question. It’s a phrase repeated so often that it’s been shortened to simply “the QBQ.”
People will often come to our support team with a very specific question they are looking to have answered. But sometimes that question isn’t going to yield the best answer to inform whatever actual decision they are looking to make. Whether in customer support or data science, it’s valuable to take a step back to figure out the broader question behind the question. Then you can reformulate your query to get at the right data that gives you an informative answer. And that requires, along with those statistical and programming proficiencies, a solid understanding of how a product works.
Eclectic oddballs
Stepping back, I can understand why it’s such a difficult field to define distinctly, as the practitioners bridge programming and stats, and stats and product. And I understand even more why everyone is looking to build a data science team.
Google and Netflix have been doing this for years, but today’s eight-person startup wants in on the game. Nearly every app has its own use case for delivering content optimized for each individual user. The better the algorithm is for a dating app like Hinge, the better the recommended dating partners are, and the more likely a person is to find a match. It’s obvious to me why companies need this discipline, but it’s even more obvious why we’re all struggling to fill the roles themselves. And so the demand for data scientists only increases.
Today’s data scientists are an eclectic mix of economists, physicists, and mathematicians. Oddballs who by some series of events and education happen to be both skilled engineers and number crunchers. But they’re hard to find. And as my friend’s ongoing search for a great data scientist illustrates, not everyone who claims the data science title has earned it or can even define it.
Perhaps if we get on the same page for what a data scientist does, then there might be less of a need for all these “what is data science” blog posts. Still, I get the feeling that the high demand for actual data scientists is going to be around for a while.