Post-mortem: Latency Issues
Recently, some of our customers may have noticed periods of increased latency and report timeouts on mixpanel.com and in our query API, particularly during the three-week period from January 30 to February 20, 2017. We understand how frustrating it is when your reports don’t load quickly, so we’d like to explain what caused these problems, outline the remedies we’ve put in place, and share our strategy for preventing this from happening in the future.
First, a timeline of events:
January 11: Our Infrastructure Engineering team began an investigation into gradually increasing query latencies on one of our production data clusters. The investigation was inconclusive as to any particular cause, but we noted an overall increase in customer data volume and queries on that cluster.
January 30 – February 3: Our alerting showed us that query latencies on this cluster were spiking sharply at certain times of day. Because the computing power of the cluster is shared among all customers, a daily jump in queries from some high-volume customers was causing slowness for others. We lowered our API rate limits to 30 QPS per customer to try to prevent this monopolization of resources.
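A per-customer QPS cap like this is commonly enforced with a token bucket. The sketch below is purely illustrative, not our actual implementation; only the 30 QPS figure comes from the limit described above.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests per second, with a burst of up to `rate`."""

    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate  # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.rate, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per customer, capped at 30 QPS as in the limit above.
buckets = defaultdict(lambda: TokenBucket(rate=30))

def handle_query(customer_id: str) -> bool:
    """Return True if this customer's query should be served now."""
    return buckets[customer_id].allow()
```

A burst of 30 rapid-fire requests drains the bucket; further requests from that customer are rejected until tokens refill, while other customers' buckets are unaffected.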
February 10 – 12: Our efforts so far had not been successful in reducing the query latency spikes. We set ourselves the goal of getting our 90th percentile query latency back below one second and started taking more substantial measures. We doubled the size of our pool of query nodes, which are responsible for gathering and aggregating our query results. This had a positive effect on query speed, but it wasn’t sufficient to get us under the target. We also further investigated the causes of these spikes and found some internal processes that were contributing to them.
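For reference, a 90th-percentile figure like the one we targeted can be computed from raw latency samples with the nearest-rank method (one common definition among several; the millisecond units here are an assumption for illustration):

```python
import math

def p90(latencies_ms):
    """90th-percentile latency via the nearest-rank method:
    the smallest sample that is >= 90% of all samples."""
    ranked = sorted(latencies_ms)
    rank = math.ceil(0.90 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1]

# 100 samples: 95 fast queries and 5 slow outliers.
samples = [120] * 95 + [3000] * 5
```

Note that p90 deliberately ignores the slowest 10% of queries: in the example above it reports 120 ms despite the 3-second outliers, which is why it works well as a "most customers are fine" target.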
February 15 – 17: We lowered our API rate limits further to 10 QPS per customer and took some steps to mitigate the load caused by our own internal processes. We also ordered more hardware to provision a completely new cluster.
February 24 – 25: We started moving customers’ data to the new cluster. Once we had moved over a sufficient amount of data, we hit our goal of getting 90th percentile latency below one second for all customers.
The root cause of this incident was our customers’ increasing volume of data and queries significantly outpacing our capacity planning over the preceding month. We also took too long to increase our capacity once the increased latency became an issue. In light of these failings, we are taking a number of steps to help prevent this in the future.
- Elastic capacity: A project that has been underway for several months will allow us to add capacity in response to demand much more easily and quickly.
- Increased isolation: We are also working on improving resource isolation for our customers, so that the queries of one high-volume customer do not slow down those of others.
- Service level objectives: We are beginning to roll out a set of internal service level objectives that are focused on the factors that directly impact the customer experience. We will be monitoring these closely so that we can quickly respond when our customers are feeling pain.
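As a rough illustration of the last point, an internal latency SLO can be checked continuously against a rolling window of measurements. This sketch is hypothetical (the class name, window size, and alerting hook are all assumptions), but the one-second p90 objective matches the target described in the timeline above.

```python
import math
from collections import deque

class LatencySLO:
    """Track a rolling window of query latencies and flag when the
    90th-percentile latency exceeds the objective (1 second here)."""

    def __init__(self, objective_ms: float = 1000.0, window: int = 1000):
        self.objective_ms = objective_ms
        self.samples = deque(maxlen=window)  # keep only the most recent samples

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def breached(self) -> bool:
        """True when the rolling p90 is over the objective."""
        if not self.samples:
            return False
        ranked = sorted(self.samples)
        rank = math.ceil(0.90 * len(ranked))  # nearest-rank p90
        return ranked[rank - 1] > self.objective_ms

slo = LatencySLO()
```

In practice `breached()` would feed an alerting pipeline, so on-call engineers are paged as soon as the customer-facing objective is at risk rather than after complaints arrive.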
We know that our customers trust us to provide them with the numbers and insights needed to make critical business decisions and that slow load times are disruptive. We apologize for failing to maintain the high level of service that our customers expect, and we are working hard to make sure that an incident like this does not happen again.
– Alex Hofsteede, Manager, Performance Team