Why Airbnb conducts scientific experiments

Webster’s Dictionary defines trust as: “Confidence; a reliance or resting of the mind on the integrity, veracity, justice, friendship or other sound principle of another person,” or at least that’s what I say it does.

There are factors you’re not even consciously aware of that affect your judgment about whether or not you trust something. How did you come to read that definition? Does it sound accurate? Do you trust this website? This author? Does this all feel like some sort of weird reverse-psychology thing designed to prove some kind of point about trust?

In an instant, your brain takes all those factors into account, and you make a decision: you either trust that definition or you don’t. (See for yourself.) But quantifying those things is nearly impossible. And it’s only “nearly” impossible because of the work of people like Airbnb Director of Data Science Alok Gupta. He and his team work to put numbers to fuzzy concepts like “trust.”

For a company like Airbnb, trust is essential to their business—the company basically only exists because they thought they could build a product that would capitalize on a reservoir of trust nobody else had noticed. So trust is a big deal for them.

And turning that important but nebulous concept into something quantifiable is, in Alok’s wonderfully English parlance, part of his remit. But even defining what kind of trust they’re after is hard.

“Is it peer-to-peer trust? Is it peer-to-platform-trust? Is it platform-to-host trust? What is the hierarchy?” Alok asks. “And for us, the most important question was: what metric can we put on trust so that we can measure it over time?”

Perhaps it shouldn’t be surprising that someone who studied Mathematics at Cambridge and Imperial College London and got his PhD in Statistical Finance from Oxford takes a quantitative approach to emotional areas that usually receive more qualitative study.

But for Alok, tackling the least quantitative problems is where a data scientist can make the biggest impact.

We spoke with Alok about how he defines “data science,” where data scientists can have the biggest impact, and what he found when he worked with a team of researchers to quantify trust on Airbnb’s platform.

Three core pieces of data science work

While the title “data scientist” is vague and broad, Alok has a framework that is helpful for understanding the things that tend to fall under that umbrella. He sees three key types of data science work, each of which requires different skillsets: analytics, measurement, and optimization. These roles range from analyzing the company’s performance as a whole to actually building software products designed to improve its performance.

“Analytics is: ‘What are the summary data we should care about as a business, and what we should look at to know if we are operating well?’” Alok tells us. “This category has always existed in financial firms—sometimes what tech companies call ‘data scientists’ is the same as what financial firms call ‘quants’ – short for ‘quantitative analysts’.

“The next piece, measurement, is about accurately observing the metric that the analysts care about and knowing what moves it. This means developing hypotheses and working with teams to build features that test them. Figuring out how we can build an experiment that will demonstrate causality is an enormous task.”

Optimization differs in that it is the only area where data scientists are actually building the products themselves. “Once you have good measurement in place, people want to automate a lot of the products we’re building through data; so the third type of data scientist builds machine learning products, with the idea of automating the optimization of existing functions and predicting future outcomes,” Alok says.

All three of these types of data scientists have an important role, but Alok sees one as perhaps the most impactful for its ability to get at the truth of things.

Measure twice, cut once

Of those three, Alok homes in on measurement as a place where data scientists can be most impactful. To him, “measurement” means designing experiments, scientific method-style: observation, hypothesis, experimentation, conclusion. “I’d say it’s ‘science’ in the truest sense, of those three categories,” Alok says.

The key to any good experiment is a hypothesis worth testing, but that’s not enough. An experimental design that can produce falsifiable results is a crucial next step. “Measurement is often overlooked and is often the most impactful,” Alok tells us. “In our product teams at Airbnb, there’s usually half a dozen engineers and a data scientist, and maybe a designer. The role of the data scientist is to help the product manager value and prioritize opportunity sizes and hypotheses right at the start of the product lifecycle.”

So when Stanford University researchers came to Airbnb with an interesting experiment idea, Alok saw a potentially impactful outcome for Airbnb. The research planned to “investigate whether and to what extent a sharing-economy platform can design technological features to counteract natural behavioral tendencies that may lead to social biases.” In plain English, they wanted to see how much trust people put in ratings like those on Airbnb.

There is a strong body of evidence that demonstrates people are more likely to trust people who are like them, demographically speaking. The Stanford researchers wanted to see to what extent people will weigh a good rating from a trusted service over their initial inclinations. For Airbnb, this is the idea upon which they’ve built their business, so they were fairly confident that to some extent, this is true. What mattered was the degree.

Quantifying trust

In setting up an experiment, they had to decide what they meant by trust. Alok says, “We distilled trust down to Airbnb’s reputation system: guests rate hosts and hosts rate guests. We wanted to understand what was the incremental power of the rating system over natural biases, what we call homophily—that is, the tendency to feel safer, or more secure with people that look like you, based on shallow attributes,” which in this case meant location, age, gender, and marital status.

The goal of the experiment was to test to what extent actual people would trust Airbnb’s rating system, and reviews left by previous hosts or guests. Airbnb’s review and rating system is intentionally designed to help foster and facilitate trust. Hosts and guests only review each other after a reservation is complete—meaning the information you see is informed and real.

The experimenters did this by putting their nearly 9,000 participants, all real Airbnb users, into a game in which they were given 100 “credits” to invest in manufactured profiles that differed from them to varying degrees on those four attributes: location, age, gender, and marital status. The more credits a participant had at the end of the experiment, the likelier they were to receive a cash prize.

The subjects of the experiment were told that their credits would triple when given away, and then it would be up to the recipient (that is, the manufactured profiles designed by the experimenters) to return as much to them as they saw fit. If a subject gave a profile 10 credits, it would be tripled to 30 credits, and an even split of the 20-credit gain would mean 20 credits returned to the subject (the original 10 plus half the gain), a 100% return on investment. Thus the more faith participants had that a profile would return an amount in excess of their original investment, the more they ought to invest in that profile. This was how Alok’s team and the researchers proxied trust.

If they trusted someone to return more credits than they had given them, they would invest, and the more they expected from that person, the more they would invest.
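The arithmetic of the investment game can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not the researchers’ actual code; the function names, and the reading of an “even split” as returning the original stake plus half the gain, are assumptions made for the example.

```python
def even_split_return(invested, multiplier=3):
    """Credits a profile returns if it hands back the stake plus half the gain.

    Assumption: "even split" means the gain (tripled amount minus the
    original stake) is shared equally between the two sides.
    """
    gain = invested * multiplier - invested
    return invested + gain / 2


def trust_game_payoff(endowment, invested, returned, multiplier=3):
    """Return (participant_final, profile_final, roi) for one round.

    endowment: credits the participant starts with (100 in the article)
    invested:  credits the participant gives to the profile
    returned:  credits the profile sends back
    """
    if not 0 <= invested <= endowment:
        raise ValueError("invested must be between 0 and the endowment")
    pot = invested * multiplier  # credits triple when given away
    if not 0 <= returned <= pot:
        raise ValueError("returned must be between 0 and the tripled amount")
    participant_final = endowment - invested + returned
    profile_final = pot - returned
    # Return on investment relative to the credits put at risk.
    roi = (returned - invested) / invested if invested else 0.0
    return participant_final, profile_final, roi


# The article's example: give 10 credits, get 20 back.
back = even_split_return(10)
print(trust_game_payoff(100, 10, back))
```

With the article’s numbers, the participant ends the round with 110 credits, the profile keeps 10, and the return on investment is 1.0 (that is, 100%). A participant who expects nothing back has no reason to invest at all, which is why the amount invested works as a proxy for trust.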

The participants were split into two groups. The first group (World 1) showed conclusively that people invested the most heavily with those who shared all four of the recorded attributes, less if they shared three of four, even less if they shared two, and so on. This behavior was consistent with previous research on homophily.

The second group (World 2) was shown the same profiles as the first group, with one key difference: the profiles that differed from the participants on all four measured attributes also had a high Airbnb rating and a high number of positive reviews. These were the profiles Alok and the researchers cared about. They expected them to do better than a profile that differed on all four attributes and did not have a high rating. But would they do better than a profile that shared one attribute? Two attributes? All of the attributes? If so, by how much?

The results were surprising. The highly rated, but characteristically different manufactured profiles received more investment than anyone else. To put a precise number on it, profiles in that group—all characteristics different, high rating and number of reviews—received 51.5% more investment than those within the same experimental group who had the same characteristics as the experiment participant.
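A figure like “51.5% more investment” is a relative lift: the difference between two average investments, divided by the baseline. The sketch below shows that calculation. The article does not report the underlying averages, so the credit amounts used here are made-up placeholders chosen only to reproduce the percentage, and the function name is hypothetical.

```python
def relative_lift(treatment_mean, baseline_mean):
    """Percentage by which treatment_mean exceeds baseline_mean."""
    if baseline_mean <= 0:
        raise ValueError("baseline_mean must be positive")
    return (treatment_mean - baseline_mean) / baseline_mean * 100


# Hypothetical averages, NOT figures from the study:
# baseline  = mean credits invested in all-attributes-same profiles
# treatment = mean credits invested in all-different, highly rated profiles
baseline = 20.0
treatment = 30.3
print(round(relative_lift(treatment, baseline), 1))
```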

“We did not expect the fifth group, in which all attributes are different but with a high rating, to be able to beat all attributes are the same, no rating, and by so much. I think that is really a stellar finding,” Alok says.

[Chart: how much experiment participants invested in users in the control group and in the experiment group]

Changing the tiebreakers

It makes sense, after all. What is homophily but a decidedly faulty rating system? Could there be something better to replace it? The participants in the experiment certainly seemed ready to trust something over their own intuitions. “This has implications way beyond home sharing, way beyond Airbnb,” Alok says. It’s a finding that is at once unsettling and deeply encouraging.

“We’ve often thought about how to give people the opportunity to use their Airbnb social currency in other ways,” Alok says. “Figuring out how to help people to do that in other ways is a really cool opportunity, and it’s definitely something that we might think about in the future.”

It’s an interesting situation for Alok. In his own words, the biggest adjustment academics have in transitioning into business is “not having the luxury always to tie up loose ends to make up for an incremental 5, 10% impact.” In business, “pretty confident” is good enough and “totally sure” takes too many resources. But in this case, Airbnb was able to partner with ongoing research and get that conclusive data to confirm and quantify their suspicions, and can now move forward more confidently. It’s an opportunity more companies should try to seize.

Science marches on

Alok Gupta may be outside of the ivory tower now, but he’ll continue to apply a scientific approach to his work, and hopes others will do the same. “I don’t think people value enough, how important measurement is,” he tells us. “A lot of problems today, outside of Airbnb, outside of tech, on a global scale, come from a lack of good measurement.

“When we think about global warming, poverty, inequality, literacy, unemployment, automation, there are a lot of hypotheses. We make a lot of decisions based on hypotheses: in education, in healthcare, in nearly everything. We have an aversion to running experiments. But there are great benefits to be had from measuring things rather than hypothesizing about them.”

At the very least we can be sure that whatever falls within Alok’s remit will be thoroughly measured.