Downtime: Ingestion API post‑mortem
On January 20th at 4:05pm PST, Mixpanel's ingestion API became intermittently unavailable, recovering at approximately 8:55pm PST. The incident had two main causes: insufficient capacity in our backend queuing system and a bug in our Android SDK that produced a thundering herd of retries.
At 2:30pm, we rerouted traffic away from one of our datacenters due to planned network maintenance, placing double the load on our Washington, DC datacenter.
At 4:00pm, ingestion traffic spiked to 10X normal levels. Combined with the additional load from the maintenance, this caused us to start queuing data. Enough data accumulated that our queue servers became disk IO bound, which lowered enqueue throughput and caused some API requests to time out.
Due to a bug, our Android client library responded to these timeouts by retrying every second instead of backing off. This triggered a thundering herd: the influx of retry requests increased the load on our servers, which increased response latency, which in turn caused more timeouts and still more retries.
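The failure mode is easiest to see in a sketch. This is not the actual SDK code (the library is written in Java, and `send` here is a hypothetical transport function); it only illustrates the fixed one-second retry loop:

```python
import time

def send_with_fixed_retry(send, payload, interval=1.0, timeout=10.0):
    """Retry forever at a fixed interval after each timeout.

    This mirrors the buggy behavior: every stalled client retries on
    the same cadence, so retry traffic arrives at the servers in
    lockstep instead of spreading out.
    """
    while True:
        try:
            return send(payload, timeout=timeout)
        except TimeoutError:
            time.sleep(interval)  # fixed delay: no backoff, no jitter
```

Because every timed-out client sleeps the same fixed interval, their retries synchronize, and server load never gets a chance to drain.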
Our first remediation step was to return a response with an HTTP "Retry-After" header to our Android clients in an attempt to lengthen their retry interval. Once we had tested and deployed this to production, we were able to bring the load on our servers under control. After response latencies returned to manageable levels, we slowly ramped Android client traffic back up at a rate low enough not to retrigger the thundering herd, eventually accepting all traffic.
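On the client side, honoring that header might look like the following sketch (the response object and its `headers` dict are assumptions for illustration; this handles only the delay-seconds form of Retry-After, not the HTTP-date form):

```python
def retry_delay(response, default=1.0, cap=300.0):
    """Read a Retry-After header (delay-seconds form) from a response.

    Falls back to a default when the header is absent or unparseable,
    and caps the result so a bad server value cannot stall the client
    indefinitely.
    """
    value = response.headers.get("Retry-After")
    try:
        delay = float(value)
    except (TypeError, ValueError):
        delay = default
    return min(max(delay, 0.0), cap)
```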
By 8:55pm the API had completely recovered. By 10:27pm all queued data was consumed and normal operation resumed.
We placed banners at the top of every report to alert customers to the degraded service. Both the issue and the corrective plan were tweeted from MixpanelStatus as soon as the solution was in place. Questions emailed to firstname.lastname@example.org were answered with periodic updates throughout the incident. However, because of a configuration error, status.mixpanel.com did not check the affected data center and failed to reflect the incident correctly.
To remediate this issue in the short term, we will replace our spinning disks with SSDs to increase performance while we consider longer-term solutions.
We are taking several steps to prevent this issue in the future. We are currently working on fixes for the status server. We are implementing randomized exponential backoff logic in our Android and iOS mobile SDKs. In the meantime, we have made a change in our load balancer configuration to work around the Android client bug. Finally, we have ordered additional queue servers to add capacity and are reconfiguring all of our queue servers to use SSDs, as opposed to spinning disks.
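A minimal sketch of the kind of randomized ("full jitter") exponential backoff we are adding, written in Python for brevity rather than in the SDKs' own languages; the base and cap values are illustrative, not our production settings:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter randomized exponential backoff.

    The ceiling grows as base * 2**attempt (capped), and the actual
    delay is drawn uniformly from [0, ceiling], so clients that fail
    at the same moment spread their retries out instead of returning
    in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

The random draw is what breaks the synchronization: even if thousands of clients time out simultaneously, their retries are smeared across the whole backoff window.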
We apologize for this incident. We know our availability is important to our customers, so we take these events very seriously.