Downtime: Read API post-mortem
Read API downtime impacted ~20% of Mixpanel customers from 08:00 to 09:00 UTC on 10/03/2014. Incoming data was not lost.
On 10/03/2014 at approximately 08:00 UTC, a disk in a single cluster returned a disk full error after receiving an uncharacteristically large amount of data over the course of several hours. Mixpanel’s write architecture is robust to a lack of free disk space: writes to this cluster were automatically queued to prevent any loss. Mixpanel engineering responded within 5 minutes and actively resolved the issue over a ~60 minute period. However, no one was on call to communicate about the issue and our response.
Projects stored on this cluster experienced an intermittent “disk full error” while loading Engagement reports during this period. Generating reports from our read API requires a small amount of disk space, so while the disk was full, report requests failed and the backend error was shown to end users.
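To illustrate how a raw backend error like this could be kept away from end users, here is a minimal Python sketch. It is not Mixpanel's actual code: the function name `user_facing_error` and the message wording are assumptions made for illustration. It only shows the general idea of recognizing an out-of-space condition (`errno.ENOSPC`) and replacing it with a message that tells the user what to expect.

```python
import errno


def user_facing_error(exc: Exception) -> str:
    """Map a backend exception to a message suitable for end users.

    Hypothetical helper: the name and messages are illustrative only.
    """
    # An out-of-space condition surfaces in Python as an OSError with ENOSPC.
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return (
            "Reports are temporarily unavailable. "
            "Your incoming data is safe; please retry shortly."
        )
    # Anything else gets a generic message instead of the raw backend text.
    return "Something went wrong while generating this report. Please try again."
```

The raw exception would still be logged internally; only the text shown to the user changes.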
The impacted cluster contains about 20% of projects. Other projects should not have experienced any disruption. For those who did, we recognize that losing access to vital business data causes difficulty, frustration, and delay. We apologize for this inconvenience and the impact it caused. We have also developed a plan to prevent this or a similar situation from blocking access in the future.
Plan to prevent recurrence
We will make the following changes to our infrastructure and alert systems:
- We are improving our disk usage alerts to provide earlier warnings about potential problems
- We will detect potentially faulty integrations such as the one that caused this issue and redirect writes appropriately
- We will return more useful error messages to end users instead of passing the backend error through
- We will create a plan to activate communication during non-business hours so that impacts can be publicly acknowledged
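
As a rough illustration of the earlier disk usage warnings mentioned in the plan, the following Python sketch classifies a disk as "ok", "warning", or "critical" based on the fraction of space used. The thresholds (80% and 90%) and the function names `usage_level` and `check_path` are assumptions chosen for this example, not Mixpanel's actual configuration.

```python
import shutil


def usage_level(used: int, total: int,
                warn: float = 0.80, critical: float = 0.90) -> str:
    """Classify disk usage; the thresholds are illustrative assumptions."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= critical:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


def check_path(path: str = "/") -> str:
    """Read real usage for the filesystem containing `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage_level(usage.used, usage.total)
```

A monitoring job could call `check_path` periodically and page an engineer at "warning", well before writes or reports begin to fail.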