Between 5 AM and 9 AM PST, we experienced significant downtime on our
application server, which is what shows you your reports and your data.
Why were you down?
We were down due to a piece of core infrastructure we use called Cassandra.
Over the past 6-8 months, while we were using Cassandra to help with data
processing, we ran into many issues with it, and those issues were a large
part of why we were repeatedly not real-time. Since then we’ve removed it
from our stack, save one small portion.
In short, Cassandra runs into severe IO issues as your data set grows, because
it tries to compact larger and larger amounts of data. Compaction can saturate
your IO so completely that the node appears “Unavailable”.
Our application server only recently (about a week ago) became dependent on
Cassandra, to solve a different scaling issue we encountered.
What are you doing to fix it?
We put a caching layer between the application server and Cassandra to avoid
lookups to Cassandra as much as possible. The share of lookups served from the
cache will approach 100% for all users as they use Mixpanel. If you’ve looked
at your report today, you’re good to go. This is by no means a permanent fix,
but it will save us from going down hard in the future and give us time to get
rid of Cassandra.
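For readers curious what a caching layer like this looks like, here is a
minimal read-through cache sketch in Python. It is an illustration only, not
our production code: the names ReadThroughCache and slow_lookup are
hypothetical, slow_lookup stands in for a real Cassandra query, and the sketch
assumes entries never need to expire or be invalidated.

```python
from typing import Any, Callable, Dict, Hashable


class ReadThroughCache:
    """Serve repeated lookups from memory and only call the backend on a miss."""

    def __init__(self, load: Callable[[Hashable], Any]) -> None:
        self._load = load  # hypothetical stand-in for the Cassandra lookup
        self._entries: Dict[Hashable, Any] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        # Cache miss: go to the backend once, then remember the result.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


backend_calls = []  # records every time the (fake) backend is actually hit


def slow_lookup(key: Hashable) -> str:
    # Assumption: in a real system this would be the query against Cassandra.
    backend_calls.append(key)
    return f"report-data-for-{key}"


cache = ReadThroughCache(slow_lookup)
cache.get("user-1")  # first view: miss, goes to the backend
cache.get("user-1")  # second view: served from the cache
```

The first lookup for a key still reaches the backend, which is why the hit
rate climbs toward 100% only as users view their reports.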
Was there any data loss?
There was no data loss. The outage only affected the application server that
shows you your reports.
Again, we’re sorry for the downtime during that period. Email us at
email@example.com if you have any questions.