Data Loss Incident Post-mortem
On Feb 22, 2016, at 5:21 pm PT, we inadvertently deployed a bug to production that had a latent effect on our event data compression process, which ultimately resulted in partially dropping the previous day’s data for a subset of our customers.
Unfortunately, most of these lost events were unrecoverable. People data was not affected, however. Once we identified the error, on Feb 23 at 10:08 am PT, we immediately took action to prevent further data loss by reverting the change.
We care deeply about data integrity, so once we had reverted the change and confirmed that the problem had stopped, our engineers took steps to identify three things:
- Root cause
- Customer impact
- Possible data recovery strategies
Root cause
After some investigation, we narrowed the root cause down to a bug that accidentally overwrote customer data under specific triggering conditions. By the time the effects of the bug became apparent, it had unfortunately been deployed to all our replicas, which severely limited our ability to recover data from redundant copies.
We have a daily compaction process that sorts and compresses new data, and writes out a canonical data file for each day. If event data arrives late—which is quite common in mobile since events can be tracked when the phone doesn’t have connectivity—this compaction process can run multiple times for a given day. When this happens, it creates a new data file for that day by combining the contents of the existing file with the freshly arrived data.
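The merge step described above can be sketched as follows. This is only an illustration, not our production code: the `Event` record shape and the `compact_day` function are hypothetical names, and compression and file I/O are left out.

```python
from typing import Iterable, List, Tuple

# Hypothetical record shape: (timestamp, event name).
Event = Tuple[int, str]


def compact_day(existing: Iterable[Event], fresh: Iterable[Event]) -> List[Event]:
    """Return the new canonical contents for one day.

    Combines whatever the existing canonical file holds with the freshly
    arrived events, then sorts the result by timestamp.
    """
    return sorted(list(existing) + list(fresh))


# First run for the day: no canonical file exists yet.
canonical = compact_day([], [(1000, "signup"), (900, "app_open")])

# A phone comes back online and delivers an older event, so compaction
# runs again for the same day and must include the existing contents.
canonical = compact_day(canonical, [(950, "purchase")])
```

The important property is that every run after the first takes the existing canonical contents as input. If that input is missing, the output contains only the late events.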
For the past few months, we have been producing two canonical data files for each day, one in our old storage format and one in our new column-oriented storage format. This has allowed us to seamlessly transition customers to the new format and check for correctness by comparing the output from the old and the new files.
We have been serving customer queries exclusively from the new format files for several weeks, and are beginning to remove the old files altogether from production data servers. The first step in the removal plan was to have the compaction process start writing files in the old format to a new directory. Our understanding was that nothing relied on the old format files anymore, and that this change would therefore have no effect. We had verified that access times on the old files were always the same as their creation times, which we took to mean that they were no longer being read.
However, the compaction process still read from the old format file when combining existing data with new data for the same day, as described above. When it did so, it overwrote the existing file with a new file, which would have the same creation and access times; this is why the access-time check did not reveal the read. Once files generated in the old format were written to a different directory, the compaction process no longer picked them up. This was fine the first time compaction ran for a given day. However, if new data arrived for that day and triggered subsequent compaction runs, the process would miss the existing data and create a new data file containing only the new data, effectively dropping the existing data.
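A minimal reproduction of this failure mode, under simplifying assumptions, looks like the sketch below. The `compact` and `run_twice` functions, the JSON file layout, and the directory and event names are all hypothetical; the only thing taken from the incident is the mismatch between the directory compaction reads from and the directory it writes to.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def compact(read_dir: Path, write_dir: Path, day: str, fresh: List[str]) -> List[str]:
    """Merge the day's existing file (if any) from read_dir with fresh
    events and write the result to write_dir. Returns the events written."""
    existing_path = read_dir / f"{day}.json"
    existing = json.loads(existing_path.read_text()) if existing_path.exists() else []
    merged = sorted(existing + fresh)
    write_dir.mkdir(parents=True, exist_ok=True)
    (write_dir / f"{day}.json").write_text(json.dumps(merged))
    return merged


def run_twice(read_dir: Path, write_dir: Path) -> List[str]:
    """Run compaction for one day, then again when late data arrives."""
    compact(read_dir, write_dir, "day1", ["e1", "e2"])
    return compact(read_dir, write_dir, "day1", ["e3-late"])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Before the change: compaction reads and writes the same directory.
    before = run_twice(root / "a" / "old", root / "a" / "old")
    # After the change: output goes to a new directory, reads still use the old one.
    after = run_twice(root / "b" / "old", root / "b" / "new")
```

In this sketch `before` contains all three events, while `after` contains only the late event. A single compaction run produces identical output in both configurations, so the difference only shows up on the second run for the same day.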
Our safety checks failed to catch the bug before it hit production. We have integration tests for our compaction process, and they ran throughout the day on Feb 22. On Feb 22, 2016, at 12:29 pm PT, the change was deployed to our staging cluster, where we forced a compaction run and everything looked fine. On Feb 22, 2016, at 5:02 pm PT, the change was deployed to a single production replica, where we triggered a compaction run and again observed the expected results. However, none of these measures tested the case of multiple compaction runs for the same day triggered by late data.
Additionally, our automated alerting did not catch the problem. Our alerts monitor drop-offs in write or query volume to the data servers, as well as the frequency and success rate of the compaction process, but none of these metrics reveals when compaction drops events.
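One kind of check that would target this gap is an invariant on event counts around each compaction run. The sketch below is illustrative only and does not describe an alert we have in place; it assumes that compaction never legitimately removes events, and the function names and message format are hypothetical.

```python
from typing import Optional


def dropped_events(count_before: int, count_after: int) -> int:
    """Number of events missing after a compaction run (0 if none)."""
    return max(0, count_before - count_after)


def alert_message(project: str, day: str, count_before: int, count_after: int) -> Optional[str]:
    """Return an alert string if the day's event count shrank, else None."""
    lost = dropped_events(count_before, count_after)
    if lost == 0:
        return None
    return f"compaction dropped {lost} events for project {project} on {day}"


# The count grew after late data was merged in: no alert.
ok = alert_message("p1", "day1", 100, 120)

# The count shrank: the existing data was not carried over.
bad = alert_message("p1", "day1", 100, 5)
```

Unlike volume or success-rate metrics, this check compares the contents of the day's file before and after the run, which is the quantity the bug actually affected.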
Customer impact
To understand customer impact, we had to identify which Mixpanel projects had lost events and how many events were lost. To do so, we had to construct a timeline by coalescing logs from various servers and identifying a pattern of sudden drops in the count of accumulated data points. Once we identified the affected customers and the impact, we moved on to investigating data recovery strategies.
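The timeline analysis can be illustrated with a small sketch. It assumes each log line has already been parsed into a (timestamp, project id, accumulated event count) tuple; that record shape, the `find_drops` function, and the sample data are hypothetical and stand in for the real logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical parsed log record: (timestamp, project id, accumulated count).
LogEntry = Tuple[int, str, int]


def find_drops(entries: Iterable[LogEntry]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each project to a list of (timestamp, events_lost) for every
    point where its accumulated count went down."""
    drops: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    last_count: Dict[str, int] = {}
    # Sorting coalesces entries from all servers into a single timeline.
    for ts, project, count in sorted(entries):
        previous = last_count.get(project)
        if previous is not None and count < previous:
            drops[project].append((ts, previous - count))
        last_count[project] = count
    return dict(drops)


# Entries from two hypothetical servers, interleaved out of order.
server_a = [(1, "proj-1", 100), (3, "proj-1", 20), (2, "proj-2", 50)]
server_b = [(2, "proj-1", 150), (4, "proj-2", 70)]
report = find_drops(server_a + server_b)
```

Here `proj-1` shows a drop from 150 to 20 accumulated events, while `proj-2` only ever grows and is not reported.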
Possible data recovery strategies
Since recovering overwritten data from hard disks is extremely sensitive to additional write operations, we immediately began the recovery process by running various file recovery tools. The data recovered this way required further processing to identify it and establish how correct it was. After writing the necessary tooling, we found that the number of recoverable data points was relatively small, because of how overwriting files affects the underlying blocks on the physical disks.
In parallel with our internal data recovery efforts, we reached out to various data recovery services and consulted with them about the situation. Unfortunately, because of how the data had been overwritten, none of these services could offer a better solution than the processes we already had in flight.
Another opportunity for data recovery came from inspecting Live View snapshots. Our Live View infrastructure is what allows Mixpanel to surface a real-time stream of events for each project. Since Live View retains a limited number of the most recent events for each project, we were able to recover some data by writing more tooling to extract events from Live View and re-process them. However, for high volume projects, Live View snapshots contained only a fraction of the necessary data.
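Why high volume projects fared worse can be seen in a toy model. The sketch below assumes a fixed-size per-project buffer of recent events; the capacity value and the `recoverable_fraction` function are hypothetical and say nothing about how Live View is actually implemented.

```python
from collections import deque

# Hypothetical per-project retention limit.
LIVE_VIEW_CAPACITY = 5


def recoverable_fraction(events_sent: int, capacity: int = LIVE_VIEW_CAPACITY) -> float:
    """Fraction of a project's events still held in a bounded buffer of
    the most recent events after events_sent events have arrived."""
    buffer = deque(maxlen=capacity)  # oldest entries are evicted automatically
    for i in range(events_sent):
        buffer.append(i)
    return len(buffer) / events_sent if events_sent else 1.0


low_volume = recoverable_fraction(4)    # everything still fits in the buffer
high_volume = recoverable_fraction(50)  # only the most recent events remain
```

With a fixed retention limit, the recoverable share shrinks as a project's event volume grows.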
We understand data availability is important for our customers, and we apologize for any inconvenience we may have caused. We are completing a full post-mortem to determine all of the steps we can take in the future to make sure an incident of this nature does not happen again. This will require making a number of changes to our code, tooling and monitoring. We will also follow up with an additional blog post describing these action items. If you have any questions regarding this incident, please don’t hesitate to reach out to Mixpanel Support at firstname.lastname@example.org.