
The ultimate AI eval is your user

The past few years have ushered in a seismic shift in how digital products are built. With the rise of AI, we are witnessing the dawn of a new product paradigm—one powered by agents, large language models, and machine learning systems capable of mimicking, augmenting, or entirely automating tasks once thought to be the domain of humans alone. And this renaissance is happening fast: Already, over 80% of recent Y Combinator startups are declaring themselves AI-first.
This explosion of AI tooling and products has led to breathtaking innovation—but it has also surfaced a gnarly, universal question:
How do you measure whether your AI product is any good?
This isn't just a theoretical concern. It's a very real, very present challenge for companies across the ecosystem. As the Mixpanel PM responsible for startups and SMBs, I work closely with a wide array of AI-driven companies, and I've seen this challenge up close.
The problem with AI evals
In AI, the conventional answer to "how good is it?" is the "eval." Evals, or evaluations, are processes designed to assess the performance of a model or agent, often via predefined benchmarks. They aim to capture a set of metrics that quantify how well an AI system performs a given task. But here's the problem: Evals are hard.
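To make that concrete, here's a minimal sketch of what an eval can look like when the task has unambiguous answers. The benchmark cases and string-match scoring below are illustrative, not a real dataset or framework:

```python
from typing import Callable

# Hypothetical benchmark: prompts paired with a single expected answer.
BENCHMARK = [
    {"prompt": "What is 17 * 3?", "expected": "51"},
    {"prompt": "In what year did Apollo 11 land on the Moon?", "expected": "1969"},
]

def run_eval(model: Callable[[str], str]) -> float:
    """Score a model as the fraction of prompts whose output contains the expected answer."""
    correct = sum(
        1 for case in BENCHMARK
        if case["expected"] in model(case["prompt"])
    )
    return correct / len(BENCHMARK)
```

String matching works fine for arithmetic. It falls apart the moment answers are open-ended.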
Let’s say your AI startup provides a platform for querying scientific literature. A user might ask, “What are the latest advancements in reversing the aging process?” Your AI agent generates a response—maybe it even includes citations. But now comes the hard part:
- Is that response accurate?
- Is it helpful?
- Is it clearly written?
- Is it scientifically sound?
In traditional software, success might be as straightforward as whether a button was clicked or a form was submitted. But in AI, outputs are probabilistic, not deterministic. The same input may yield subtly different results each time. Worse, these outputs are often judged qualitatively: Was the answer "good"? That question resists easy automation.
Sure, you might try to quantify the quality of that scientific response:
- How many citations did it include?
- Did the user upvote it?
- Did it pass a hallucination check?
But all of these are proxies. The gold standard would be an expert manually scoring each response—and that simply doesn't scale.
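In code, those proxies are cheap to compute. Here's a sketch; the [n]-style citation pattern and the hallucination checker passed in as a callable are hypothetical stand-ins for whatever your stack actually uses:

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProxyScores:
    citation_count: int
    user_upvoted: bool
    passed_hallucination_check: bool

# Assumes citations appear as inline markers like [1], [2], ...
CITATION_PATTERN = re.compile(r"\[\d+\]")

def score_response(
    response_text: str,
    user_upvoted: bool,
    hallucination_check: Callable[[str], bool],
) -> ProxyScores:
    """Bundle cheap, automatable proxies for the quality of one response."""
    return ProxyScores(
        citation_count=len(CITATION_PATTERN.findall(response_text)),
        user_upvoted=user_upvoted,
        passed_hallucination_check=hallucination_check(response_text),
    )
```

Each of these is trivially measurable, and none of them tells you whether the answer was actually right. That's the gap behavioral signals fill.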
Your best eval is behavioral
This is where product analytics enters the scene. What you can't measure reliably through model-level metrics, you can measure through user behavior. This isn't new. Mixpanel has been helping product teams analyze user behavior since 2009. What is new is applying this lens to AI-driven products.
You can spend weeks building complex eval pipelines. Or you can zoom out and ask: Do people come back? Do they keep using your product? Do they invite others? Do they buy?
These behavioral signals are your best proxy for product quality. Especially when traditional evals fail to capture the full picture, user behavior becomes your North Star. If your AI is actually helping people solve problems, they’ll show it by engaging, returning, and converting.
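Those signals are also cheap to compute. As a sketch, here's week-one return rate calculated straight from a raw event log; the event shape is an assumption, so adapt the field names to your own schema:

```python
from datetime import datetime, timedelta

# Assumed event shape: one record per product interaction.
events = [
    {"user_id": "u1", "timestamp": datetime(2024, 6, 3)},
    {"user_id": "u1", "timestamp": datetime(2024, 6, 11)},
    {"user_id": "u2", "timestamp": datetime(2024, 6, 4)},
]

def week1_return_rate(events, cohort_start, cohort_end):
    """Fraction of users first seen in [cohort_start, cohort_end) who return 7-14 days later."""
    first_seen = {}
    for e in sorted(events, key=lambda e: e["timestamp"]):
        first_seen.setdefault(e["user_id"], e["timestamp"])
    cohort = {u for u, t in first_seen.items() if cohort_start <= t < cohort_end}
    returned = {
        e["user_id"]
        for e in events
        if e["user_id"] in cohort
        and timedelta(days=7) <= e["timestamp"] - first_seen[e["user_id"]] < timedelta(days=14)
    }
    return len(returned) / len(cohort) if cohort else 0.0

print(week1_return_rate(events, datetime(2024, 6, 1), datetime(2024, 6, 8)))  # 0.5
```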
Version AI like you version apps
This approach isn’t novel—it echoes how product teams approached the mobile app boom. Back then, teams would monitor user behavior across different app versions:
- Are users who onboarded in version 2.0 more likely to activate than those in 1.5?
- Does the new onboarding flow improve day 1 retention?
Today, AI teams can do the same. Each iteration of your model or agent is a new "version" of the product. Track it. Compare behavioral outcomes across versions. Is usage growing? Are users more satisfied? Is churn down?
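Here's a minimal sketch of that comparison, assuming every tracked event carries the model version as a property. The column names and data are illustrative:

```python
import pandas as pd

# Assumed export: one row per user, tagged with the model version they
# onboarded on and their downstream behavioral outcomes.
df = pd.DataFrame({
    "user_id":       ["u1", "u2", "u3", "u4"],
    "model_version": ["1.5", "1.5", "2.0", "2.0"],
    "activated":     [True, False, True, True],
    "retained_d7":   [False, False, True, False],
})

# The same analysis mobile teams ran across app releases,
# applied to model iterations: outcome rates per version.
print(df.groupby("model_version")[["activated", "retained_d7"]].mean())
```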
When the signals say "this isn't working"
Of course, not all signals will be positive. What happens when users don’t like your AI product? When retention is low, conversions drop, or engagement flatlines, it's tempting to point fingers at the model or assume it's just not "good enough." But these moments are gold.
Digging into failed conversions, drop-off points, or low engagement funnels can reveal exactly where things go wrong. Maybe your agent is too slow. Maybe it misses key use cases. Maybe your onboarding sets the wrong expectations.

With Mixpanel, you can dissect every moment of the user journey, look at where users churn, and analyze which prompts fail to convert. You can break down behavior by cohort, version, or even specific user intent. This data is the map to your product-market fit.
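None of those breakdowns are possible unless version and intent metadata is on the events in the first place. Here's a minimal instrumentation sketch using Mixpanel's Python SDK; the event name and properties are illustrative choices, not required fields:

```python
import time
from mixpanel import Mixpanel  # pip install mixpanel

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # placeholder: your project token

def answer_and_track(user_id: str, prompt: str, model_version: str, generate) -> str:
    """Generate an AI response and log the interaction with version metadata."""
    start = time.time()
    response = generate(prompt)  # your model/agent call goes here
    mp.track(user_id, "AI Response Generated", {
        "model_version": model_version,  # enables version-over-version comparison
        "latency_seconds": round(time.time() - start, 2),
        "prompt_length": len(prompt),
        "response_length": len(response),
    })
    return response
```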
Improving an AI product isn't just about training a better model; it's about tuning the full user experience. And the insights that drive that tuning? They live in product analytics.
AI needs product analytics
Ultimately, we’re entering a world where model quality is necessary but not sufficient. In this world, product analytics is essential. Because the only real way to know if your AI product is good is to know whether people actually use it—and keep using it.
So yes, build your evals. Define your metrics. Benchmark your agents. But never forget the ultimate eval: your user.