Last Friday, over the course of ten hours, large areas of North America and Europe experienced internet outages that made a significant portion of the web unreachable. The scale was almost without precedent. The list of sites affected was massive, from large companies to small projects, and included names like Etsy, The New York Times, Paypal, Reddit, Yelp … on and on. And while the attack is still being investigated, the cost in dollars is almost immeasurable.
The downtime was so widespread because, instead of targeting an individual site, the attack went after the Domain Name System (DNS) provider Dyn. DNS translates the address a person types into their browser into the network location of the servers that host that site. The distributed denial of service (DDoS) attack essentially flooded Dyn with an unmanageable volume of phony traffic, so that when a legitimate request was made by a user, it could not be handled.
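To see why a DNS outage takes sites offline even though their servers are running fine, here's a minimal sketch of the lookup step using Python's standard library. The hostnames are placeholders; the point is that when the resolver can't answer, the browser never learns an address to connect to, and to the user the site is simply "down."

```python
import socket

def resolve(hostname):
    """Ask the configured DNS resolver for the IP addresses behind a name.

    This is the step that failed during the Dyn outage: the sites'
    servers were up, but the name-to-address translation couldn't
    complete, so browsers had nowhere to send their requests.
    """
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        # Collect the distinct addresses from the resolver's answer.
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # No answer from DNS -- to the user, the site is unreachable.
        return []

print(resolve("localhost"))            # resolves locally, no DNS needed
print(resolve("no-such-host.invalid")) # .invalid never resolves: []
```

The second lookup uses the reserved `.invalid` top-level domain, which is guaranteed never to resolve, to simulate what every lookup against an unreachable resolver looks like.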
That meant that, if you’re like me, you spent much of the morning and early afternoon looking at Chrome’s “This site can’t be reached” page. And ops teams around the world scrambled to react as best they could.
By midafternoon, Dyn had resolved the issue, and once again we were able to tweet nonsense and browse Reddit for cute animal gifs. But the cost, while impossible to tally precisely, was surely millions and millions of dollars.
Calculating the cost of downtime is pretty difficult. While it might be frustrating, and even a bit embarrassing, when your little side project is unreachable, the cost is pretty minimal. But for some companies, downtime can cost more than $100,000 per minute. In 2012, Ponemon Institute estimated that the average cost of a minute of downtime was $22,000.
And when these outages are extended, that number adds up very, very quickly. A handful of years ago Virgin Blue airlines struggled with eleven days of outages, costing the company as much as $20 million.
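To put those figures together, here's a back-of-the-envelope calculation using the Ponemon average. The rate is the study's 2012 figure; the duration is the roughly ten hours of the Dyn attack, so treat the result as illustrative, not a measured loss.

```python
def downtime_cost(minutes, cost_per_minute=22_000):
    """Rough cost of an outage at the Ponemon 2012 average of $22,000/minute."""
    return minutes * cost_per_minute

# A ten-hour outage, like the Dyn attack, at the average rate:
ten_hours = 10 * 60
print(f"${downtime_cost(ten_hours):,}")  # $13,200,000
```

And that's per affected company, at the *average* rate – for the sites at the $100,000-per-minute end of the scale, the same arithmetic lands at $60 million.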
And so, not surprisingly, an entire industry of sophisticated tools has evolved to support uptime. But, as anyone who has run a product knows, it's not as simple as the binary "up" and "down." Bugs create issues within a product that might make a particular feature go down, or that might take the product down for a very specific subset of users.
While it’s abundantly clear when your entire system is down – just check for the people screaming at you on Twitter (if that’s still up) – it’s much less obvious when a bug means that your onboarding flow is broken for visitors using mobile Safari on iOS 10. And today’s companies are supporting a fragmented user base across more platforms than ever before: desktop and mobile; iOS and Android; Safari, Chrome, and Internet Explorer. That’s not to mention the early internet-connected devices like Apple TV, and the whole slew of new ones, from watches to lightbulbs, that are on the way.
Bugs might not have as immediate an effect as a whole portion of the internet going down, but they also have a tendency to fly under the radar, unnoticed. And that, over a longer period, can be much more costly to a company.
Like calculating the cost of outages, wrapping your head around the total cost of a bug is a difficult thought experiment. Some bugs are much more costly than others. But the total is certainly astronomical. Three years ago a Cambridge University study estimated that software bugs cost the global economy $312 billion – that’s “billion” with a “b” – per year.
This is one of the reasons why, after putting all the right systems and tools in place, it is absolutely vital to have a real-time understanding of your user metrics.
Last August I talked with Dan Wolchonok, a senior product manager at HubSpot, about how the team on their Sidekick product uses data to debug changes in user behavior. This is a particularly difficult task for Sidekick because their product works on top of popular email clients. So an issue might not be of their own doing; it may be the result of an underlying change to the client itself.
“Gmail can decide to test something out with just a small percentage of their users. Our developers might not even be able to recreate it. You just can’t test everything,” Dan told me.
That’s when being able to dig into the usage data and understand what is actually happening, and why, is so important.
“You can get to the root of a problem by segmenting your data. Follow the chain of events, and segment those funnels, and that retention, to really get a feel for why usage has changed.”
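The segmenting Dan describes can be sketched in a few lines. This is a hypothetical example, not Sidekick's actual pipeline: the event records, segment names, and completion rates below are invented to show how breaking a single top-line metric into per-segment rates exposes exactly where usage changed.

```python
from collections import defaultdict

def completion_by_segment(events):
    """Break an overall completion rate down by user segment.

    `events` is a list of dicts like
    {"user": "u1", "segment": "ios10-safari", "completed": False};
    the schema is hypothetical -- real analytics event shapes vary.
    """
    started = defaultdict(int)
    completed = defaultdict(int)
    for e in events:
        started[e["segment"]] += 1
        if e["completed"]:
            completed[e["segment"]] += 1
    # Per-segment rate: a healthy product shows similar numbers across
    # segments; one outlier points at a platform-specific bug.
    return {seg: completed[seg] / started[seg] for seg in started}

events = [
    {"user": "u1", "segment": "desktop-chrome", "completed": True},
    {"user": "u2", "segment": "desktop-chrome", "completed": True},
    {"user": "u3", "segment": "ios10-safari", "completed": False},
    {"user": "u4", "segment": "ios10-safari", "completed": False},
]
print(completion_by_segment(events))
```

The blended rate here is 50%, which looks like a general dip – but segmented, one group converts at 100% and the other at 0%, which tells you exactly where to start digging.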
Issues aren’t always as clear-cut as a DDoS attack that leaves your site unreachable. Sometimes they’re as small as a visual bug that makes it difficult for a user to convert. And when that’s the case, you need to track how your users behave, across all sorts of segments, to tell when things are working – and when you may have a very costly problem.