Why most A/B tests give you bullshit results
Last edited: Feb 25, 2022
By now, anyone in product or marketing knows what A/B testing is. What we don’t know, or at least won’t admit, is that too many A/B tests yield nothing.
Too often they measure meaningless variants, produce inconclusive results, and nothing comes from them. Of course, some A/B tests yield real, meaningful, actionable results. Those are the ones you hear about. We’ve all seen the articles. Company X increases conversions 38% with this simple trick. Hell, I’ve written some of them.
But those success stories have hidden the grey underbelly of testing and experimentation.
So many new testers walk into A/B testing thinking it’ll be quick and easy to get results. After running a handful of simple tests, they expect to find the right color for this button or the right tweak to that subject line, and conversions will, poof, increase by 38% like magic.
Then they start running tests on their apps or sites, and reality suddenly sets in. Tests are inconclusive. They yield “statistically insignificant” results and no valuable insights about the product or users. What’s happening? Where’s that 38% bump and subsequent pat on the back?
Don’t get frustrated. If you’re going to be running A/B tests, you’re going to have some tests that fail to produce meaningful results you can learn from. But if you run good tests, you’ll have fewer failures and more successes. By running thoughtful A/B tests, you’ll get more statistically significant results and real learnings to improve your product.
And that makes it seem like you learn from every A/B test you run. You don’t, but (almost) no one sits down to write a blog post about the time they tested three variants and saw no noticeable difference in conversions. When that happens, the results are not statistically significant enough to draw any conclusions from the experiment. Essentially, statistical significance is asking what is the chance of getting these same results, or results with an even larger difference in performance, without there being any actual difference between your A and B.*
Imagine you’re tossing two coins, twenty times each. Coin A lands on heads 12 times, and Coin B lands on heads 9 times. You wouldn’t rush out proclaiming you’ve found a coin that is 33% more successful at landing on heads, right? From your understanding of coins, you know the difference is simply chance. It’s not statistically significant.
Now, if you tossed each coin another 180 times, and Coin A landed on heads 120 times while Coin B landed on heads 90 times, clearly something significant would be happening. But, again, we know that isn’t what would happen. After 200 tosses, there might still be a small difference in how many times each landed on heads, but it would be chance. Any difference is just noise.
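You can put numbers on that intuition with a quick two-proportion z-test. This is a rough sketch using only Python’s standard library (the post doesn’t prescribe any particular test), but it shows why the same 60%-vs-45% split is noise at 20 tosses and meaningful at 200:

```python
from math import erf, sqrt

def two_proportion_p_value(heads_a, n_a, heads_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = heads_a / n_a, heads_b / n_b
    p_pool = (heads_a + heads_b) / (n_a + n_b)              # pooled rate under "no real difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_a - p_b) / se
    # Probability of a gap at least this large if the coins were identical
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(two_proportion_p_value(12, 20, 9, 20))     # p ≈ 0.34: easily chance alone
print(two_proportion_p_value(120, 200, 90, 200)) # p < 0.01: a real difference
```

Same proportions both times; only the sample size changed, and that alone is what turns noise into signal.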
And that might seem like a silly experiment. Of course two coins aren’t going to perform noticeably different. But, honestly, this is precisely why so many A/B tests yield inconclusive results. We waste our time testing variants without any real meaningful differences and, unsurprisingly, we end up with a bunch of tests with statistically insignificant results.
And if anyone is to blame, it’s that stupid button example’s fault.
The button color experiment is the “Hello, World!” of A/B testing. It’s a simple example that does an excellent job of explaining the concept. And so, without fail, any time A/B testing is being explained for the first time, someone is using the button color example, where one variant of a page has a green purchase button and the other has a red one. You run the test and see which color button has a higher conversion rate.
And the truth is, some companies have conducted the button experiment and actually received meaningful results to improve their product. If you want your user to interact with something, there is certainly value to making it stand out. That said, as most who have run the experiment have discovered, while button color is an excellent way to describe A/B testing, it’s rarely a meaningful way to improve your product.
I ran my own meaningless test about a month and a half ago.
Mixpanel rarely sends out emails to our master list. We usually only email out our new articles to users that have subscribed to the blog (which you can do at the bottom of this article). But it had been some time since a large send, so we got the okay to email the latest in our Grow & Tell series, a feature on QuizUp’s transition into a social platform, to a large chunk of our users. It seemed like the perfect opportunity to run a really quick A/B test.
The email had a subject line of “Why 15 million users weren’t good enough for this mobile trivia app”. But I’d heard that starting out an email with your company name can improve open rate, so I made a variant with the subject line, “Mixpanel – Why 15 million users weren’t good enough for this mobile trivia app.” Easy, right? And if it performed better, we could put what we learned to use, starting every subject with our name, increasing open rates on all of our emails, and hopefully increasing results – people doing what you’re doing right now, reading our articles.
The email went out to hundreds of thousands of users, split between the two versions. And then I waited impatiently for my success to come rolling in.
When the results did come in, they could not have been less statistically significant. The subject line without “Mixpanel” had a 22.75% open rate. The subject line with “Mixpanel” had a 22.73% open rate. A difference of 0.02 percentage points.
Hundreds of thousands of email sends later, the difference in my test was 20 opens. For all intents and purposes, I was flipping coins.
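Here’s roughly what that looks like as a significance check. The exact send counts aren’t in this post, so this sketch assumes 100,000 sends per variant, which is the size at which a 0.02-point gap works out to exactly 20 opens:

```python
from math import erf, sqrt

# Hypothetical reconstruction: ~100,000 sends per variant would make a
# 0.02-percentage-point gap equal the 20-open difference described above.
n = 100_000
opens_a, opens_b = 22_750, 22_730   # 22.75% vs 22.73% open rates

p_pool = (opens_a + opens_b) / (2 * n)          # pooled open rate under "no difference"
se = sqrt(p_pool * (1 - p_pool) * (2 / n))      # standard error of the gap
z = (opens_a / n - opens_b / n) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.2f}")        # z ≈ 0.1, p ≈ 0.9: pure coin-flipping
```

A p-value that high means a gap like this would show up all the time between two identical subject lines.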
Even with such a large sample size, there just wasn’t enough contrast in my test to yield significant results. I learned nothing, except to take my tests more seriously.
So what could I have done to get more significant results?
Well, first, I could have tested a completely different subject line altogether, like the less scintillating but more semantic article title of “Why QuizUp turned the fastest-growing game in history into a social platform.” That contrast would have had a much greater chance of producing statistically significant results.
But even then, what would I have learned besides that one did better than the other? What actions would I have taken from it? Perhaps if I tested it a few more times I could reach the broader conclusion of whether our readers prefer scintillating subject lines or semantic ones.
My test was meaningless because it wasn’t constructed well and it wasn’t part of a larger strategy asking meaty questions about what matters to our readers. It was quick and simple, but it didn’t go anywhere. A/B testing is never as easy as it seems. If you want results, it takes work, and serious testers tend to fall into one of two camps. The “optimize your way to success” camp puts in the time to thoughtfully and strategically test many little things, like different pictures, slightly different designs, and changes in the text of calls to action, hoping to find an array of small improvements. The other camp builds out features of the product and tests drastically different experiences, like reworking the entire user onboarding process.
You can find valuable lessons and improve your product with A/B testing, but it takes some hard work.
I’m not the only one mulling this over. Recently, I spoke with Hari Ananth, co-founder of Jobr, about some not-so-meaningless A/B tests they conducted to improve user acquisition.
Jobr is an app that allows job seekers to swipe, Tinder-style, through curated job opportunities.
“We wanted to improve our onboarding flow to get more users in the app and swiping,” Hari told me.
“We identified two crucial steps in our funnel and built a sufficiently wide list of variants for each experiment to ensure proper coverage. After sending enough traffic through each variant, we were able to incorporate the optimized flow and boost conversions by 225%.”
Jobr essentially rebuilt their onboarding process, informed by data on where users dropped out of the previous process.
Tara and the team at Cozi, a family organizer app, took a similar approach with their signup flow. After testing hypothesis after hypothesis, they were able to incorporate bits of learning into the flow. Some were small aesthetic tweaks, like switching to a lighter background. Others were larger changes that required fewer steps from the user and removed friction from the process, like pre-populating forms and eliminating checkboxes.
No single change resulted in a major increase in conversions. But combined, the improvements raised the signup completion rate from 55% to 76%.
Run tests that produce meaningful results
It wasn’t random that these experiments escaped the all-too-common plight of A/B testing and delivered meaningful results. The experiments were constructed to test meaningful aspects of the product, aspects with a strong impact on how users behave. And, of course, they sent enough traffic through each experiment to produce statistically significant results.
So if you’re sick of bullshit results, and you want to produce that 38% lift in conversions to get that pat on the back and the nice case study, then put in the work. Take the time to construct meaningful A/B tests and you’ll get meaningful results.
*Editor’s note: This sentence previously read, “Essentially it’s asking what is the likelihood that the performance difference in your variants was merely a result of chance.” After a little feedback and a chat with Trey Causey it was revised.