Data Inconsistency on January 16
On January 16th, from 6:00 to 6:20 pm Pacific Time, there was an error in our
production system, causing every event sent to Mixpanel to be counted four
times. All of our customers, if looking at an hourly view of their data, will
see an artificial spike for the hour between 6:00pm and 7:00pm, similar to the
one pictured below:
We are very proud of the accuracy of our data, and are extremely sorry that
this error occurred. We know that you make key business decisions based on the
data you see in Mixpanel, and even though the miscounting only lasted for
twenty minutes, we wanted to make you immediately aware of it. The rest of
this blog post will discuss the full details of this error and possible
ramifications for your decision-making.
The over-counting impacts all reports on Mixpanel.com or any API calls that
request a total count for any event and include 6pm PST on 1/16/12 within
the time period of the query. However, this over-counting does not impact
queries for unique event counts. This means that the Funnels report and the
Retention report, which are entirely based on uniques, are completely
unaffected. In addition, any queries in the Segmentation or Trends report in
uniques mode are not impacted either.
How can I adjust my data to account for this?
In most cases, we do not recommend trying to adjust your data to account for this error. For daily reports, the difference will be trivial: total event counts for January 16th will be roughly 5% higher than the true counts. For monthly reports, total event counts will be only about 0.1% higher than the true counts. The only case where you might want to adjust the data coming out of Mixpanel is hourly reports. If you are basing a decision on an hourly report, divide the total event count for the 6 PM hour by 2. Events were counted four times for 20 of that hour's 60 minutes, so at a steady event rate the reported total is about double the true one, and the result will be very close to the true event count for that hour.
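For readers who want to check the arithmetic, here is a minimal Python sketch of where these correction factors come from. It assumes events arrive at a steady rate, which real traffic will not do exactly, and the function name and constants are illustrative only, not part of any Mixpanel API.

```python
# Sketch: how much is a reported total inflated when events were counted
# 4x for 20 minutes? Assumes a steady event rate (an assumption; real
# traffic varies), so the results are approximations.

AFFECTED_MINUTES = 20   # 6:00 to 6:20 pm
MULTIPLIER = 4          # each event was counted four times


def inflation_factor(window_minutes: int) -> float:
    """Reported count divided by true count for a window containing the incident."""
    # Each affected minute contributes (MULTIPLIER - 1) surplus minutes of events.
    extra = (MULTIPLIER - 1) * AFFECTED_MINUTES
    return (window_minutes + extra) / window_minutes


hourly = inflation_factor(60)              # the 6 pm hour
daily = inflation_factor(24 * 60)          # all of January 16th
monthly = inflation_factor(31 * 24 * 60)   # all of January

print(f"hourly x{hourly:.2f}, daily x{daily:.3f}, monthly x{monthly:.4f}")
```

Under the steady-rate assumption the hourly factor is exactly 2, which is why dividing by 2 works. The daily figure comes out a little above 4%; an evening hour that is busier than the daily average would push it toward the roughly 5% quoted above.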
Will I be billed for these data points?
Absolutely not. You will not be charged for any overages to your plan caused by this error.
I’m a geek – tell me what really happened
First, it’s necessary to describe a small part of our infrastructure. When a
user sends an event to Mixpanel, we do a small amount of validation — mostly
checking for syntactical correctness — and then immediately put the event on
a queue. Under normal circumstances, the number of items on the queue stays
very close to zero, meaning that within seconds of sending an event it should
show up in your reports. However, decoupling receiving events from processing
them allows us to easily perform server maintenance that would otherwise
require significant downtime.
For a long time now, we’ve had multiple queue servers so we aren’t reliant on
a single machine, but we haven’t had automated failover. In practical terms,
that means that if a queue server goes down in the middle of the day we can do
a manual failover within minutes, but if it goes down in the middle of the
night, it could be quite a bit longer before we can switch everything over.
The change we pushed out Monday was intended to remedy this situation.
Basically, when an event comes in, we try each queue server that we currently
think is up one at a time until we successfully enqueue an item.
Unfortunately, our code to check whether putting an item on a queue was
successful or not was incorrect, and consequently each event was added to each
queue server (currently, there are four). We noticed the problem almost
immediately and had the fix within 20 minutes.
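To make the intended behavior concrete, here is a minimal Python sketch of "try each queue server until one accepts the event". It is not our production code: `QueueServer` and `enqueue_with_failover` are hypothetical names, and the in-memory class merely stands in for a real queue server.

```python
# Sketch of failover enqueueing: try each server in turn and stop at the
# first success. All names here are hypothetical, for illustration only.

class QueueServer:
    """In-memory stand-in for a remote queue server."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.items = []

    def put(self, event: dict) -> bool:
        """Return True if the event was stored, False if the server is down."""
        if not self.up:
            return False
        self.items.append(event)
        return True


def enqueue_with_failover(event: dict, servers: list) -> str:
    """Enqueue on the first server that accepts; return that server's name."""
    for server in servers:
        if server.put(event):
            # Stop at the first success. If this check wrongly reported
            # failure, the loop would go on to enqueue on every server.
            return server.name
    raise RuntimeError("no queue server accepted the event")


# Four servers, the first one down: the event should land on exactly one.
servers = [QueueServer("q1", up=False), QueueServer("q2"),
           QueueServer("q3"), QueueServer("q4")]
accepted_by = enqueue_with_failover({"event": "signup"}, servers)
copies = sum(len(s.items) for s in servers)
print(accepted_by, copies)
```

The key design point is the early return on success. Counting the copies across several servers, as the last lines do, is the kind of check that only works when more than one queue server is present.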
What we are doing to keep this from happening again
Unfortunately, although we have queueing-related tests, our test
environment has only one queue server, so none of our tests caught
this particular problem. That particular hole will be fixed in the coming
Once again, our most sincere apologies and regrets for this error. If you have
any questions please do not hesitate to reach out to us at