Cisco’s data scientist explains how to measure bots with Mixpanel
In a series of posts, Irzana Golding, data scientist from Cisco, will give a brief tour on the novel uses of Mixpanel, plus a few tips on getting the most from the tool. This article was written by Irzana Golding.
I’m a “Data Jockey” (or Junkie?) at Cisco, working in the realm of developer platforms. I call myself a “Data Scientist” although by training I’m a statistician and financial modeler who traveled the R and Python road to end up in data la-la-land.
Much has been said by folks cleverer than myself about the definition of data science. But my view is simple. Science is the attempt to prove (or falsify) hypotheses. This is the more important part, especially as the biggest obstacle I’ve met is not on the data processing side, but on the hypothesis forming side.
I’ll digress too far if I get into it, but know that assumptions (easily made by all of us) are not the same as testable hypotheses. Assumptions often go hand in hand with “proving” vanity metrics whereas hypotheses and insights are more important bed fellows. It depends on what you really want to measure.
Job definitions aside, I never quite find myself in a situation where any one job description fits the bill, nor where I get to use sexy techniques that one hopes will reveal a mountain of insights. Life in data la-la-land is messy — constantly. Much of the job is dealing with messiness, not sexiness.
The most important tool in my entire toolkit — I kid you not — is the good old-fashioned paper, pencil and critical thinking. It comes in handy to sanity check numbers. Of course, as a former finance analyst, the rigor of cross-checking comes naturally. But I’ve come to learn that the phrase, “These numbers don’t look right” can mean many things, such as, “I don’t like your numbers.”
The old adage of garbage-in, garbage-out, reigns supreme in data la-la-land and it’s easy to spend effort crunching crappy data. Therefore, I seek to find the simplest solutions possible that stand the best chance of providing statistically useful and meaningful results with veracity.
To that end, I recently found that repurposing Mixpanel to measure an NLP-bot provided an immediately usable way to visualize first-approximations of user behaviors.
Qualifying bot effectiveness
Firstly, I should mention that bots play an important part of the Cisco Spark collaboration ecosystem. And so figuring out usage patterns is essential, both in aggregate and in particular. But bots are tricky beasts because there are no straightforward ways to measure their effectiveness (versus usage) due to the many disparate use cases. Attempts to standardize on derived metrics, like “chattiness” (approximately message volume/user/time) can be deceiving.
For example, some bots generate large volumes of messages, for things like monitoring. In such cases, usage (by messaging volume) could be high, but relevance (for many of the messages) is probably low — high data rate, low information rate.
In fact, that’s probably due to a poor implementation that doesn’t pay enough attention to exceptions or anomalies.
Indeed, many of these sorts of messages (from overly chatty bots) are probably ignored. But this is true for most messaging systems, especially multi-conversation environments where there’s a potentially high cognitive load in keeping track of what’s being said.
But this is precisely why bots can be so useful . They can help automate what to pay attention to and when.
Many developers have yet to think of them in these terms. For example, the “killer app” for (enterprise) bots offloads various workflow steps that could be small yet significant. Think of the messaging tool (i.e. Cisco Spark) as the glue that helps orchestrate various process and human interactions as part of a contiguous manageable workflow. Interacting with bots should be a common occurrence, not a novelty in business.
A highly effective bot is one that doesn’t chat much at all, but takes care of a vital offloaded task with precision and timeliness. Viewed via naive usage metrics, such a bot could be deemed ineffective. (This is where clustering analysis could be useful — i.e. to reveal subtle, yet important and valid classes of bot characteristics in order to guide towards more meaningful derived metrics.)
Bot implementations run the gamut from simple notifications to highly elaborate interactive dialogs that act more like AI “agents.”
Inside Cisco’s NLP bot
When one of the Cisco teams decided to build an NLP bot that was more “agent-like,” in terms of its relatively sophisticated vocabulary and decision-tree, the team turned to API.AI to handle the NLP aspects — i.e. deciphering user utterances into intents.
The question then arose as how best to measure user behaviors and give easily-digestible insights to product folks.
As Mixpanel’s strength is behavioral analytics, at least from a mobile/web perspective, I decided to see how well bot interactions (behaviors) could be mapped to the Mixpanel event model. This was also a good test of how well Mixpanel really handles behaviors (via events) versus page views (and other classical web/app metrics).
It turns out that it’s a fairly easy to get a good overview of bot interaction by mapping intents to a single event space in Mixpanel. The single event is Intent Fired (say) and the property is the intent name (as labelled in the API.AI dashboard).
(All data that follows is a contrived example for purposes of this article.)
On the API.AI dashboard, an intent might look something like this example:
Here the intent is labelled capital_time and a typical utterance is “tell me the time in London” — and the reply might be: “Brexit time” 😉
With Intent Fired as a first-class event, we assign Intent Name as an event property. We can then visualize the usual segmentations in Mixpanel, in this case by Intent Name:
Going a step further, we can exploit the API.AI parsing of entities (like capital_time in the above example) and push these as parameters into another event attribute so that we can see which cities are being requested.
But what about conversion? Is there such a thing for bots?
Do bots even ‘convert’?
Assuming that the bot is helping users to complete tasks, then the concept of conversion still applies. In this case, conversion would be how many users completed a given task that the bot supports.
Therefore, we can apply funnel analysis (by making sure to use the same distinct id — probably your bot user id — for each event call).
This is particularly useful for dialogs , or the back-and-forth conversation threads within a certain context (e.g. asking for help) to arrive at an eventual final outcome via a series of intents.
We can also use funnel analysis in a very simple way to look at things like how often the first utterance isn’t recognized by API.AI — i.e. fires the Default Intent (fallback). I put this into a funnel and called it “mis-recognition rate”.
Of course, we have to be careful here as to how we interpret this data because here is an example of a funnel where we want the completion rate to be low. This is because completion rate here means that the user ran into a default_intent — i.e. the user was not understood by the bot.
My example is contrived with a few test messages, so ignore the 50%, but that would be a very poor result.
It’s fairly easy to make these kinds of data-points available at a glance via the latest dashboards feature of Mixpanel:
We could also get fancy here with the bot logic and tag first-time events — i.e. what’s the very first thing a user did with the bot. (It’s also possible to filter those via Mixpanel JQL.)
This would allow us to get a feel for a kind of “bounce mis-recognition” rate — i.e. users whose first interaction with the bot was a #fail and then they left.
Of course, retention is also possible. The usefulness — and meaning — of retention will depend upon the bot’s function. Because Mixpanel enables slicing of retention by event property, we could see if a particular feature of the bot (such as telling the time in London) causes the user to come back more often than another feature.
Here I have only scratched the surface, but hopefully it’s obvious that with just a few lines of code in your bot to push events to Mixpanel, it’s possible to get a fairly decent behavioral analysis up and running that offers good first-approximation behavioral insights.
Other areas worth exploring (outside of Mixpanel) would be things like sentiment analysis on the utterances that result in Default Intent because it’s possible that users are expressing frustration.
To learn more about how Mixpanel can drive the unique edge cases to your product, be sure to reach out to email@example.com.
In Part 2, Irzana will discuss AARRR metrics in a developer platform context, including folding data from multiple sites into a single Mixpanel project to measure events.