This post is an update regarding the data loss incident we had a month ago. The purpose is to describe what we are doing to prevent similar failures in the future.
On February 22, 2016, we had a data loss incident that resulted in 9% of our customers losing their events from February 22 or 23, 2016, depending on the customer’s time zone. For more than half of those affected, we lost less than 1% of events from that day. However, in many cases it was more, and in some cases, it was the entire day’s worth of events. It was our worst instance of data loss in four years and a very stressful, disruptive experience for our affected customers, the Mixpanel engineering team, and the rest of the company.
The incident raised two fundamental questions. First, how did a bug get rolled out to all production data servers without being detected? Second, given that bugs will happen, how did a single bug cause unrecoverable data loss?
We have four primary tools for preventing and limiting the scope of production bugs: automated tests, code review, staging, and canary deploys. Unfortunately, in this case, the bug slipped past all of them for various reasons.
- We had one set of tests for compaction that exercised the conditions that triggered the bug (repeated runs for data belonging to the same day), but these tests only asserted that the right files existed after a compaction run, not that the contents of the files were correct. Conversely, we had another set of tests for compaction that verified the contents of the output files, but not under the conditions that triggered the bug. (We have since fixed the compaction tests so that they would have caught this bug.)
- The code was given a thorough review by the right people, but the bug was not obvious.
- The change was staged multiple times, first in our staging cluster and then on selected canary hosts in production. In each case, we ran compaction and everything looked as expected; however, we did not trigger multiple runs on a project with enough volume and latent data to trigger the bug. Since compaction only runs once per day in production for any given project, it wasn’t until hours after the deploy that the bug gradually began to manifest.
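To illustrate the gap in the first bullet, here is a minimal sketch of a test that combines both properties: it runs compaction more than once for the same day and asserts on the contents of the output, not only on its existence. The `compact` function, the file names, and the JSON-lines layout are hypothetical stand-ins for illustration; they are not Arb's actual compaction code or file format.

```python
import json
import tempfile
from pathlib import Path


def compact(day_dir: Path) -> Path:
    """Toy compaction: fold every pending *.log file into compacted.jsonl.

    Hypothetical stand-in; the real compaction logic is not shown in this post.
    """
    out = day_dir / "compacted.jsonl"
    # Start from whatever earlier runs already compacted for this day.
    events = out.read_text().splitlines() if out.exists() else []
    for pending in sorted(day_dir.glob("*.log")):
        events.extend(pending.read_text().splitlines())
        pending.unlink()
    out.write_text("".join(line + "\n" for line in events))
    return out


def write_events(day_dir: Path, name: str, events: list) -> None:
    (day_dir / name).write_text("".join(json.dumps(e) + "\n" for e in events))


def read_events(path: Path) -> list:
    return [json.loads(line) for line in path.read_text().splitlines()]


def check_repeated_compaction_preserves_contents() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        day_dir = Path(tmp)
        first_batch = [{"id": 1}, {"id": 2}]
        late_batch = [{"id": 3}]  # arrives after the first compaction run

        write_events(day_dir, "a.log", first_batch)
        compact(day_dir)

        # Second run for the same day: the condition that triggered the bug.
        write_events(day_dir, "b.log", late_batch)
        out = compact(day_dir)

        # Asserting only that the file exists would not have caught data loss.
        assert out.exists()
        # Asserting on contents catches events dropped by the second run.
        result = read_events(out)
        assert result == first_batch + late_batch
        return result
```

A compaction bug that overwrote the earlier output on the second run would pass the existence check here and fail the contents check.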
The reality is that it is difficult to write tests that anticipate every possible scenario, and it is similarly difficult to catch every single bug in code review. Given that changes to compaction can result in data corruption or loss, and that we know there can be latent effects because compaction only runs daily, we should have let the new code run for a couple of days on one set of replicas prior to rolling it out everywhere.
Why did this not happen? While we have tools to test, stage and canary compaction, they are manual steps that get applied slightly differently every time. Instead, we need a repeatable, automated way to deploy services. We are now convinced that we need to move to a continuous deployment approach. What that means is that changes to our code repository will kick off a process that will move those changes into production in a completely automated, well-defined way. This will be critical as we scale the team too.
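As a rough sketch of what "completely automated, well-defined" could mean for a service like compaction, the promotion rule below encodes a minimum soak time per stage so that the wait no longer depends on a person remembering it. The stage names, the soak durations, and the `advance` function are assumptions made for illustration, not a description of our actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stage:
    name: str
    min_soak: timedelta  # how long a change must run here before promotion


# Hypothetical pipeline; the two-day canary soak mirrors the "couple of days
# on one set of replicas" that would have exposed the compaction bug.
PIPELINE = [
    Stage("staging", timedelta(hours=1)),
    Stage("canary-replicas", timedelta(days=2)),
    Stage("all-production", timedelta(0)),
]

ROLLBACK = -1


def advance(stage_index: int, entered_at: datetime, now: datetime, healthy: bool) -> int:
    """Return the stage index a change should be in, or ROLLBACK if unhealthy."""
    if not healthy:
        return ROLLBACK
    stage = PIPELINE[stage_index]
    is_last = stage_index == len(PIPELINE) - 1
    if is_last or now - entered_at < stage.min_soak:
        return stage_index  # stay put: final stage, or soak time not yet elapsed
    return stage_index + 1
```

Because the rule is a pure function of the current stage, timestamps, and a health signal, it behaves the same way on every deploy, which is the property the manual steps lacked.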
Moving to continuous deployment will be a long-term effort led by our Site Reliability Engineering team. In the short term, we have made a set of deployment runbooks that explain exactly how to adhere to current best practices. Even if they unfortunately still require manual actions, at least we’ll be operating as safely as we know how using our current tools and in a way that is accessible and obvious to everyone on the team. Since building continuous deployment and runbooks across all services will take a while, we also reviewed our services and split them into tiers. This allows us to move through them in priority order.
Preventing Unrecoverable Data Loss
To protect against data loss, event data stored in our database, Arb, is written to at least four, and typically eight, separate disks. We also do periodic, automated backups of all events and ship them to two separate data centers.
However, while this does a reasonable job of providing redundancy for disk failures, it is not very resilient against bugs. Bugs are an inevitable part of software development. If you accept that they will happen–and when they do, all bets are off–then it becomes clear that a single bug could comprehensively delete production data. It follows, then, that to prevent unrecoverable data loss, we need a complete, non-production copy of the data.
Since the incident, we have rewritten our backup system. It is now much more efficient, reliable and flexible. The biggest change is that Arb backups, which used to happen weekly, now happen daily.
However, what about the last 24 hours, between daily backups? Those events are still at risk and were, in fact, much of what was lost in this incident. To address that, we are in the midst of implementing a system that stores the last seven days of ingested traffic, siphoned off at the head of our ingestion API. We ship the forked copy of the data directly from the load balancers, which means that the copy gets made at the outer edge of our system, before the data hit any of our other servers or processes running our code. In that way, we minimize the surface area for potential bugs or ops mistakes that might affect both the online and offline data. In the event of another incident like this, we’ll have backups for anything older than a day, and for anything more recent, we will be able to replay our traffic from the ephemeral store.
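The retention and replay behavior described above can be sketched in a few lines. This is an in-memory illustration only, assuming records arrive in time order; the class name `EphemeralTrafficStore` and its methods are hypothetical, and the real system would write to durable storage fed from the load balancers.

```python
from collections import deque
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # the post describes keeping seven days of traffic


class EphemeralTrafficStore:
    """Keeps raw ingested requests for a fixed window so they can be replayed."""

    def __init__(self, retention: timedelta = RETENTION) -> None:
        self._retention = retention
        self._records = deque()  # (received_at, raw_request), oldest first

    def append(self, received_at: datetime, raw_request: bytes) -> None:
        # Assumes records are appended in non-decreasing time order.
        self._records.append((received_at, raw_request))
        self._expire(received_at)

    def _expire(self, now: datetime) -> None:
        cutoff = now - self._retention
        while self._records and self._records[0][0] < cutoff:
            self._records.popleft()

    def replay(self, since: datetime) -> list:
        """Return raw requests received at or after `since`, oldest first."""
        return [raw for ts, raw in self._records if ts >= since]

    def __len__(self) -> int:
        return len(self._records)
```

In a recovery, `since` would be the time of the last good daily backup, so the replayed requests cover exactly the window the backup does not.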
Once again, we appreciate your patience and understanding as we have been figuring out the best short- and long-term solutions to prevent such incidents in the future. If you have any questions regarding this incident, please don’t hesitate to reach out to your account representative or Mixpanel Support at email@example.com.