AI agent analytics: why your click data can’t see the whole picture
Most advice about measuring AI products comes from builders who are still figuring it out. Kylan Gibbs isn’t one of them.
As co-founder and CEO of Inworld AI, Gibbs has spent five years building AI infrastructure for some of the largest consumer applications in the world. Products with millions of users, razor-thin cost tolerances, and zero patience for "we'll figure out measurement later."
At MXP San Francisco, he made a case that was equal parts practical and unsettling. The analytics playbook most product teams rely on was built for a world where users tap buttons and navigate pages. That world is shrinking fast.
AI adoption kept climbing in 2026 even as engagement depth fell 38% year-over-year in North America, a signal that the old ways of measuring product success are losing their grip. And if you're building an AI-native product and still primarily watching click data, the story your dashboard tells you is incomplete at best.
AI agent analytics and the black box problem
The uncomfortable reality for teams shipping AI features is that when a user enters a conversation with your agent, they step into a space that traditional funnels can't see into. You know they signed up. You might know they churned. But everything in between lives inside the conversation itself, not on a page you can track with click events. What the agent said. How it said it. Where the user lost interest. Why they never came back.
This is what Gibbs calls the black box problem, and it's more consequential than most teams realize. When a system gives you outputs without showing how it got there, you can't see what to improve. That's true of AI analytics tools, and it's equally true of the AI products you're building. If you don’t have visibility into the conversation, everything after the signup is a guess.
"The page is no longer the product. If it wasn't even a voice conversation, there is no interface other than the actual conversation itself."
For teams building chat agents, voice assistants, or any AI experience where the UI recedes into the background, this is a measurement problem as much as a product one. AI isn't too unpredictable to track. The funnel just looks different now, running through model outputs, prompt behavior, voice tone, and tool calls rather than button clicks and page views.
A useful breakdown of how to measure AI features draws a distinction here between infrastructure metrics (is the feature working?) and behavioral metrics (is it actually succeeding with users?). Both matter, but for agentic products, the behavioral layer is where the leverage is, and most teams aren't measuring it yet.
Every layer of the agent stack is a variable
Traditional product teams run A/B tests on button copy, landing page layouts, onboarding flows. The instinct is exactly right—change one thing, measure the impact, promote the winner. The mistake is thinking that instinct doesn't apply to AI products.
It does, just to a different set of variables.
Gibbs walked through the layers that Inworld AI treats as dials worth tuning:
- Model: which LLM you're using, at what cost, and whether it's the right fit for your specific use case
- Prompt: the agent’s personality, values, and reasoning style—who it is and how it thinks
- Voice: tone, accent, and pacing; small changes here move engagement more than most teams expect
- Tools: what the agent can and can't do, and when it decides to act
- Memory: how much context the agent carries, and for how long
Each of those variables has a measurable downstream effect on the metrics that actually drive the business—retention, CSAT, LTV, conversion. And each one can be tested.
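One way to make those layers testable is to stamp every analytics event with the exact stack configuration that produced it. The following is a minimal sketch, not Inworld's or any analytics vendor's schema: the `AgentVariant` fields, the `agent_` property prefix, and the `build_event` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentVariant:
    """One configuration of the agent stack (hypothetical field names)."""
    model: str               # which LLM is serving the conversation
    prompt_version: str      # identifier for the personality/reasoning prompt
    voice: str               # tone, accent, and pacing preset
    tools: tuple[str, ...]   # tools the agent is allowed to call
    memory_turns: int        # how many past turns the agent carries


def build_event(user_id: str, event_name: str, variant: AgentVariant, **props) -> dict:
    """Return an event dict whose properties include every stack variable."""
    stack = {f"agent_{key}": value for key, value in asdict(variant).items()}
    # Store tools as a sorted list so equal tool sets always compare equal.
    stack["agent_tools"] = sorted(variant.tools)
    return {"user_id": user_id, "event": event_name, "properties": {**props, **stack}}


variant = AgentVariant(
    model="model-a",            # placeholder, not a real model name
    prompt_version="prompt-v2",
    voice="voice-x",
    tools=("search", "calendar"),
    memory_turns=10,
)
event = build_event("user-123", "conversation_ended", variant, turns=7)
```

With properties like these on each event, retention or conversion can be segmented by model, prompt, or voice in the same way a team would segment by a button-copy variant.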
The problem is that most teams aren't testing them. They ship an agent, watch the high-level traffic numbers, and assume that if users seem engaged, the product is working. Gibbs saw this pattern repeatedly. Teams integrate AI, get excited about the possibilities, and then never build the measurement infrastructure to know whether their choices are optimal. There are hundreds of model options available, each with different cost and performance profiles. Teams that settle for the first one that works are almost certainly leaving both quality and margin on the table.
For a complete picture of what to track, this breakdown of 30+ AI product metrics is a practical starting point, including autonomy rate, which may be the single most important metric for agent-based features.
Two customer examples
Frameworks can be easy to follow, but what made Gibbs' talk land with MXP attendees was the customer stories behind them.
One Inworld customer, a large consumer app with millions of users, had a problem that snuck up on them. Engagement was high. In fact, their AI feature was working so well that the cost of running it was threatening to bankrupt them. Their original setup fed a single massive prompt into one model and got back an expensive response at scale.
Inworld helped them replace that single prompt with dozens of smaller, specialized models, each handling a narrower piece of the interaction. The result was a 95% reduction in month-over-month costs while maintaining quality. That didn't happen because they found a magic model. It happened because they treated the agent stack like a product surface, ran experiments, and promoted the winners.
The second example was smaller in scope but just as telling. A single voice accent swap (same general personality, different regional inflection) shifted engagement metrics within a week. Gibbs attributed this to something that sounds soft but shows up hard in the data. How personally connected a user feels to the agent matters. If the voice feels foreign or robotic, users disengage. If it feels familiar, they lean in. That's not a UX nicety; it compounds over time into retention.
The teams seeing results like these aren't doing anything exotic. They're applying the same experimental discipline that good product teams have always applied. They've just extended it to cover the full agent stack.
Where to start
Gibbs' advice for teams earlier in this journey was refreshingly practical. Don't try to instrument everything at once. Pick one variable, tie it to one metric you already care about, and run the experiment. A few starting points he suggested:
- Start with model or prompt: an engineer can spin this up in a day, and the performance impact is usually the most visible
- Expose a small cohort first: don't push changes to your full user base until you have signal
- Measure against your actual business metrics: LLM-level evals only tell part of the story, and user behavior is where the real signal lives
- Build the muscle before you automate it: the goal eventually is a system that promotes winning variants automatically, but you need the measurement foundation first
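The starting points above can be sketched in a few lines. This is a rough illustration under stated assumptions, not a description of Inworld's tooling: it assumes each user record is a dict with a `user_id` and a boolean `retained` flag, and the function names, the 5% default exposure, and the experiment name are all hypothetical.

```python
import hashlib


def in_test_cohort(user_id: str, experiment: str, exposure: float = 0.05) -> bool:
    """Deterministically place roughly `exposure` of users in the test cohort."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < exposure


def retention_rate(users: list[dict]) -> float:
    """Share of users flagged as retained; 0.0 for an empty cohort."""
    return sum(1 for u in users if u["retained"]) / len(users) if users else 0.0


def compare_cohorts(users: list[dict], experiment: str, exposure: float = 0.05) -> dict:
    """Split users into test and control and report one business metric for each."""
    test, control = [], []
    for user in users:
        cohort = test if in_test_cohort(user["user_id"], experiment, exposure) else control
        cohort.append(user)
    return {
        "test_size": len(test),
        "control_size": len(control),
        "test_retention": retention_rate(test),
        "control_retention": retention_rate(control),
    }
```

Hashing the experiment name together with the user ID keeps each user in the same cohort across sessions, and it gives different experiments independent splits. A real rollout would also need a significance check before promoting a variant, which this sketch leaves out.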
The broader point Gibbs was making is that AI product teams are still in an early window where building this discipline is a genuine competitive advantage. Teams that figure out how to continuously optimize their agent stack will compound those gains. The ones that don't will eventually find their churn high and their costs climbing, with nothing in their dashboard to explain why.
The new funnel
Building an AI product has never been easier. Gibbs' argument at MXP was that understanding what the product’s doing to your users is still the hard part most teams skip.
The funnel has moved, not disappeared. It lives inside the conversation now, inside the prompt, the model, the voice, and the tools. And the teams who treat those things as testable variables aren't just measuring better. They're building something their competitors can't easily replicate.
The teams getting this right aren't checking dashboards and hoping for the best. They have always-on product intelligence watching their most important metrics continuously. Analytics built specifically for AI products tie changes in the model, prompt, voice, and tools directly to engagement, retention, and conversion. That's how you measure whether your AI is actually working, not just whether it's live.

