Post-mortem: Intermittent website outage
From August 17th to the 25th, some customers noticed intermittent failures to load pages on the Mixpanel web application. We apologize for the degraded reliability of our service.
We’d like to explain the issue and the remedies we’ve implemented. We’ll also share our strategy for preventing this scenario in the future.
The website outages were the result of two unrelated issues:
First, Mixpanel’s Messages and Campaign features connect to a database when delivering messages. Certain conditions caused these deliveries to open a large number of database connections simultaneously and begin sending commands, resulting in a sudden load spike on the database. Our service reached a scale where these traffic spikes caused latency to increase and some connections to time out.
Second, Mixpanel recently launched a new product that notifies customers of data anomalies in the web app. The backend for this product had an error-handling bug that led to an out-of-memory condition. Failures from the backend caused our load balancer to mark all hosts as unhealthy, resulting in customer-visible error pages.
Both issues led to periods of reduced availability of mixpanel.com.
During this period, the website saw a total outage for approximately five minutes. The website had intermittent failures to load pages for a total of approximately 95 minutes across incidents on multiple days.
Several changes have been made to address these issues, detailed below. We have not seen any website availability issues since these three occurrences.
The team has investigated each issue as it occurred and has implemented the following changes.
Changes made related to Messages and Campaign:
- Reduced the spikes of concurrent database connections by gradually ramping and adding jitter to requests coming from the notifications service.
- Reduced page load failures by adjusting database connection retries from the mixpanel.com web application.
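The two changes above can be illustrated with a minimal Python sketch. It is not Mixpanel's implementation: the function names, parameters, and default values are hypothetical, and because this post does not say how the retries were adjusted, exponential backoff with jitter is assumed here as one common approach.

```python
import random
import time


def ramped_jittered_delays(num_requests, ramp_seconds, max_jitter_seconds, rng=None):
    """Return a start delay in seconds for each request (hypothetical helper).

    Requests are spread evenly across ramp_seconds instead of starting all at
    once, and each one is shifted by a random jitter in
    [0, max_jitter_seconds] so they do not line up on the same instant.
    """
    rng = rng or random.Random()
    if num_requests <= 0:
        return []
    step = ramp_seconds / num_requests
    return [i * step + rng.uniform(0, max_jitter_seconds) for i in range(num_requests)]


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=None):
    """Call operation(), retrying on ConnectionError with jittered backoff.

    Assumption: the real retry policy is not described in the post. Here the
    wait before each retry is a random value between 0 and an exponentially
    growing cap, so clients that fail together do not retry together.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, cap))
```

For example, `ramped_jittered_delays(1000, 60, 0.5)` would spread 1,000 deliveries across roughly a minute rather than opening 1,000 connections at the same moment, which is the kind of spike described above.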
Changes made related to anomaly detection:
- Fixed the error-handling issue in the anomaly detection service and reduced the connection timeout for requests to the service.
- Backend errors no longer cascade as frontend errors. We catch and gracefully degrade on backend errors.
- Increased the number of web app workers to serve requests.
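As a rough illustration of the graceful degradation described above, the Python sketch below treats anomaly data as optional when building a page. The function name, the page structure, and the `fetch_anomalies` callable are assumptions made for this example, not Mixpanel's actual code.

```python
import logging

logger = logging.getLogger("webapp")


def render_dashboard(fetch_anomalies, timeout_seconds=0.5):
    """Build page data, treating anomaly data as optional (hypothetical).

    fetch_anomalies is any callable that accepts a `timeout` keyword and
    either returns a list of anomalies or raises on failure.
    """
    page = {"status": 200, "reports": "core content", "anomalies": None}
    try:
        # A short timeout keeps a slow backend from tying up a web worker.
        page["anomalies"] = fetch_anomalies(timeout=timeout_seconds)
    except Exception as exc:  # broad on purpose: any backend failure degrades
        logger.warning("anomaly backend unavailable, hiding widget: %s", exc)
    return page
```

With this shape, a failing or slow anomaly backend leaves `page["anomalies"]` as `None` and the rest of the page still renders with a 200 status, so the failure stays contained in one widget instead of turning into an error page.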
Prevention & Next Steps
Mixpanel runs engineering post-mortems on a regular cadence to deep dive on incidents and identify action plans for preventing recurrence. After reviewing this incident, we are taking the following preventative steps:
- We will decouple dependencies between the main web app and any individual feature. This way, potential issues with one aspect of the product cannot cause outages in other areas. For the incident related to Messages and Campaign, we will do this by moving the high-volume services for these features onto separate backing datastores. For the incident related to data anomaly detection, we made a change so that backend failures no longer propagate to the user, which will prevent bugs in new features from taking down our site.
- We have also improved monitoring and alerting across multiple services to allow for faster incident response times. This would have eliminated a significant portion of the partial outage duration discussed in the Impact section.
If you have further questions regarding this incident or any others you may have experienced, please don’t hesitate to reach out to Mixpanel Support at firstname.lastname@example.org.