Experimentation terms 101: Health checks that make your A/B test results trustworthy
Before trusting what an A/B test reveals, it’s crucial to make sure the experiment itself is set up correctly. A result may look statistically significant while bias, noise, or tracking bugs actually drive it. In reviewing experiments run in Mixpanel over the past year, we noticed that health checks frequently surfaced setup problems (like broken randomization or tracking), yet when those experiments moved forward anyway, they still produced “significant” results.
It’s a reminder that without experiment health checks, even seemingly meaningful tests can lead to misleading conclusions. They act as guardrails that help you validate, stabilize, and debug your experiments, ensuring your product decisions are based on trustworthy results rather than statistical accidents.
Keep reading to learn about key experimentation terms, how they work, and why to use them, including:
- Sample ratio mismatch (SRM) detection
- Winsorization
- CUPED
- Retrospective A/A testing
- The Bonferroni correction
Experimentation terms & statistical health checks
The five health checks below will help you validate your experiment so you can trust your results.
1. Sample ratio mismatch (SRM) detection
A sample ratio mismatch check ensures that your experiment is actually fair and each variant receives the right share of users.
What is SRM?
SRM happens when the actual number of users in each variant differs from the originally planned split in a statistically significant way. For example, you planned a 50/50 test, but one group somehow ended up with 70% of users, which may indicate that the randomization process is broken, making the results unreliable.
In a properly randomized experiment, user counts should closely match the planned split, aside from small random variation. SRM refers to differences that are larger than what randomness would reasonably explain.
SRM often happens because of mistakes in user bucketing, differences in the start times of each variant, or technical issues like broken redirects.
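One common way to detect SRM is a chi-square goodness-of-fit test comparing observed user counts against the planned split. Here’s a minimal sketch in plain Python; the user counts are hypothetical, and in practice you’d pull them from your experiment’s assignment logs:

```python
# A minimal SRM check using a chi-square goodness-of-fit test (df = 1).
# The user counts below are illustrative, not real data.

def srm_check(observed_a: int, observed_b: int, planned_ratio: float = 0.5) -> bool:
    """Return True if the observed split deviates from the plan at p < 0.001."""
    total = observed_a + observed_b
    expected_a = total * planned_ratio
    expected_b = total * (1 - planned_ratio)
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # 10.828 is the chi-square critical value for df = 1 at alpha = 0.001
    return chi2 > 10.828

print(srm_check(50_421, 49_123))   # a large imbalance for ~100k users
print(srm_check(50_080, 49_920))   # within normal random variation
```

A strict threshold like p < 0.001 is typical for SRM checks because assignment bugs tend to produce large, unmistakable deviations, and you don’t want to halt experiments over ordinary randomness.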
Why SRM matters
SRM can be a silent killer for experiments. If you haven't distributed users evenly:
- Differences may appear because the groups are no longer comparable
- Tracking errors or bugs may distort results
- The experiment’s final results may not be actionable
When to use it
Whenever you run an experiment that splits users into multiple variants, but especially when you’re:
- Launching new experiments where the user assignment logic is untested or complex.
- Experimenting across multiple platforms or regions where technical differences might affect user assignment.
💡Pro tip: SRM is often the first sign that something is broken, so catching it early can save wasted time and incorrect conclusions.
2. Winsorization
Winsorization stabilizes metrics that can be skewed by extreme values, making results less sensitive to a handful of outlier users.
What is winsorization?
Winsorization is a way to handle extreme values (outliers) by capping them at a threshold instead of removing them entirely. You don’t drop extreme users; you simply replace their values with a maximum (or minimum) threshold.
Imagine you’re measuring the average number of messages sent per user in a social app. Most users send 5 to 15 messages per day, but a few power users send hundreds. Without winsorization, those extreme users would inflate the (mean) average, making it look like typical users are way more active than they really are.
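The messaging example above can be sketched in a few lines of Python. The data and the 95th-percentile cap are illustrative assumptions:

```python
# A sketch of winsorizing per-user message counts at the 95th percentile.

def winsorize(values, lower_pct=0.0, upper_pct=0.95):
    """Cap values below/above the given percentiles instead of removing them."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[max(0, int(lower_pct * (n - 1)))]
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

messages = [5, 7, 8, 9, 10, 11, 12, 12, 14, 300]  # one power user
raw_mean = sum(messages) / len(messages)
win_mean = sum(winsorize(messages)) / len(messages)
print(raw_mean, win_mean)  # the capped mean is far closer to typical behavior
```

Note that every user still counts toward the sample; only the single extreme value is pulled back to the cap.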
Why winsorization matters
Most products have some users who behave differently from everyone else, which might skew metrics like revenue per user or session length.
If you’re running an experiment, these users can also skew the results by making it seem like your experiment is having a bigger or smaller effect than it really is. Winsorization reduces the impact of these outliers while preserving the sample size, which provides a more realistic view.
Note: While winsorization improves stability, it does change the definition of success. If extreme behavior is part of what you're testing, using it can actually undermine your results. For example, say you’re testing a pricing change: a few high-spend users are on the “extreme” end of purchases, but that isn’t noise—it’s an indication of opportunity.
When to use it
- Before running the experiment, not applied afterward to “fix” results. Changing your metric after seeing the results can introduce bias and undermine trust in your conclusions.
- When you’re looking at metrics that can be distorted by a few extreme users, making it hard to see typical behavior (like revenue per user or session length).
- When extreme user behavior isn’t the focus of your test. Say your goal is to see how your overall CSAT (customer satisfaction) scores are trending over the past quarter. You wouldn’t want to react to the most extreme and rarely occurring scores, so you might cap values at the 5th and 95th percentiles to reduce the influence of the most polarized responses.
💡Pro tip: Don’t use winsorization when those extreme outcomes matter to what you’re looking for, like if you’re testing fraud detection or high-spend behaviors. In these cases, you’d want to identify the outliers where users are spending an unusually large amount of money.
3. CUPED (Controlled-experiment Using Pre-experiment Data)
CUPED is a variance reduction technique that reduces noise in an experiment, making results more precise and easier to interpret. This helps you detect whether the experiment really had an impact, without needing huge sample sizes or long test periods.
What is CUPED?
CUPED “levels the playing field” by using historical, pre-experiment data to account for how users were already behaving before the experiment began. You adjust for prior user behavior and then measure how the actual result during the experiment differs from that baseline. By controlling for these pre-existing differences, CUPED helps isolate the true impact of the experiment.
This reduces random variation and makes it easier to detect a real effect. Imagine you’re testing a new onboarding flow and measuring activation. Even before the experiment launches, your users all have very different baselines: some already complete those actions almost every time, while others might rarely get past step one.
If you just compare raw activation rates, those built-in differences can hide a small but real lift from the new flow. Importantly, CUPED doesn’t change how users are assigned to variants; it only adjusts how outcomes are measured.
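Mechanically, the standard CUPED adjustment is Y_adjusted = Y − θ(X − mean(X)), where X is the pre-experiment metric, Y is the in-experiment metric, and θ = cov(X, Y) / var(X). Here’s a small sketch using synthetic data (all numbers are made up for illustration):

```python
# A minimal CUPED sketch: adjust each user's in-experiment metric (post)
# using their pre-experiment metric (pre). Synthetic data for illustration.
import random

random.seed(7)

# Simulate users whose in-experiment activity correlates with their baseline
pre = [random.gauss(10, 3) for _ in range(5000)]     # pre-experiment metric X
post = [x * 0.8 + random.gauss(2, 1) for x in pre]   # in-experiment metric Y

def mean(v): return sum(v) / len(v)
def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)
def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

theta = cov(pre, post) / var(pre)
mean_pre = mean(pre)
adjusted = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]

# The variance drops substantially while the mean stays unchanged,
# which is exactly what makes real effects easier to detect.
print(round(var(post), 2), round(var(adjusted), 2))
```

Because the adjustment subtracts a mean-zero quantity, the metric’s average is untouched; only its spread shrinks, which is why confidence intervals tighten.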
Why CUPED matters
- Makes real effects easier to detect
- Reduces the time and traffic needed to reach conclusions
- Especially useful when you’re analyzing “noisy” metrics that have high variability and random fluctuations
When to use it
- When you have reliable historical user data. Check that your key metrics have been consistently tracked over time. CUPED works best when pre-experiment data accurately reflects your users’ normal behavior and is strongly correlated with the metric you’re measuring.
- When experiments are costly or take a long time. Prioritize applying CUPED for experiments where running a full-size test would be expensive or slow. This can help you reduce required sample sizes or shorten test durations without sacrificing confidence in the results.
- When metrics are noisy (like revenue per user or session duration). Use CUPED when a metric naturally jumps around a lot from user to user or day to day. Stabilizing these fluctuating metrics makes it easier to spot real changes caused by your experiment rather than random ups and downs.
4. Retrospective A/A testing
Use retrospective A/A testing to validate your measurement and experiment setup. Run an A/A test when you want to be confident that differences in your experiments are valid and not the result of broken tracking or setup errors.
What is retrospective A/A testing?
An A/A test compares two identical experiences, with the goal of confirming that any observed differences are consistent with normal random variation.
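A retrospective version of this can be simulated against historical data: split past users into two random halves many times and confirm that “significant” differences show up at roughly the nominal rate. The sketch below uses synthetic data and a normal-approximation z-test; all numbers are illustrative:

```python
# A sketch of a retrospective A/A test: repeatedly split historical users
# into two random halves and check that "significant" differences appear
# at roughly the nominal 5% rate. Synthetic data for illustration.
import math
import random

random.seed(42)
historical = [random.gauss(10, 4) for _ in range(2000)]  # e.g. sessions/user

def z_stat(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

false_positives = 0
runs = 500
for _ in range(runs):
    shuffled = random.sample(historical, len(historical))
    half = len(shuffled) // 2
    if abs(z_stat(shuffled[:half], shuffled[half:])) > 1.96:
        false_positives += 1

fp_rate = false_positives / runs
# A healthy setup should land near 5%; a much higher rate suggests
# tracking, timing, or assignment problems.
print(fp_rate)
```

If the false-positive rate is far above the nominal level, that’s a sign the measurement pipeline itself, not user behavior, is generating differences.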
Why retrospective A/A testing matters
A/A tests help ensure your system isn’t generating unexpected or inflated differences. When two identical groups show unexpected differences, it can indicate tracking bugs, timing issues, or inconsistencies in how you define audiences. By catching these problems early, A/A testing gives you confidence that future A/B test results really reflect the impact of changes you made rather than errors in data collection or experiment setup.
When to use it
- When launching a new experiment setup. Running an A/A test helps you make sure your experiment infrastructure is working correctly, and you’re assigning users consistently across variants before measuring actual changes.
- When implementing a new experimentation platform. An A/A test can validate that your new solution is tracking metrics accurately and integrated properly with your existing data tech stack.
- To set a baseline or establish what “normal” looks like before any changes are applied. Collect A/A test results to understand typical metric variations so you can distinguish real experiment effects from natural fluctuations in your data.
5. The Bonferroni correction
The Bonferroni correction reduces the probability of a false positive when you’re looking at a large number of metrics, variants, or audience segments at once.
What is the Bonferroni correction?
When you run many tests simultaneously with multiple metrics, segments, or variants, the chance of seeing a “significant” difference by pure luck increases.
The Bonferroni correction is a simple way to adjust the statistical threshold so the overall risk of false positives stays under control when making multiple comparisons. More tests mean more chances for randomness to look real; Bonferroni adjusts for that.
For example, if you’re testing 10 metrics at the standard 0.05 significance level, there’s roughly a 40% chance (1 − 0.95^10 ≈ 0.40) that at least one appears significant as a fluke. The Bonferroni correction divides the threshold by the number of tests (0.05 ÷ 10 = 0.005), so only results that meet the stricter cutoff are considered meaningful.
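The arithmetic is simple enough to show directly. The p-values below are hypothetical:

```python
# A sketch of the Bonferroni correction applied to hypothetical p-values
# from 10 metrics tested in a single experiment.

p_values = [0.004, 0.020, 0.048, 0.300, 0.650, 0.012, 0.090, 0.500, 0.008, 0.700]
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # 0.05 / 10 = 0.005

# Without correction, four metrics (0.004, 0.020, 0.048, 0.012) would
# look "significant"; after correction, only one survives.
significant = [p for p in p_values if p < corrected_alpha]
print(corrected_alpha, significant)
```

Notice that even a p-value of 0.008, which would easily clear the usual 0.05 bar, no longer counts once the stricter threshold is applied.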
Why the Bonferroni correction matters
- Reduces the chance of claiming a false winner when testing many comparisons
- Helps control overall error rates without requiring additional data, but makes it harder to detect small real effects
When to use it
- When you’re running experiments with many metrics, segments, or variants simultaneously.
- As a complement to other health checks. The Bonferroni correction is useful, but it doesn’t fix underlying issues like tracking bugs, misassigned users, or experiment design problems.
When experiment health checks matter most
Even seasoned teams can be tempted to skip health checks, especially when experiments feel urgent, timelines are tight, or executives are pressing for quick results. Health checks are a safeguard against shipping the wrong insights under pressure, and they’re especially important when:
- Experiments are high-stakes, and unnoticed data issues can lead to costly, incorrect decisions.
- Multiple metrics or segments are analyzed simultaneously, which increases the risk of false positives or misleading patterns.
- The results are surprising or counterintuitive. Unexpected outcomes are sometimes a result of bugs, bias, or broken assumptions rather than true impact.
- Traffic sources are mixed or unstable. Shifts in traffic quality or composition during the experiment can be mistaken for experiment effects.
A 5-step experiment health checklist
These checks aren’t about second-guessing your experiment; they’re for verifying that the results are trustworthy. Think of this like a pre-flight checklist before making a product decision.
Before trusting your experiment’s results, ask:
1. Are users assigned correctly (no SRM)?
If not, stop and investigate further (e.g., check your bucketing logic or user assignment).
2. Is there a chance the results were driven by outliers (Winsorization)?
If it’s unclear, try comparing effect sizes with and without outlier treatment, or re-run the analysis with winsorization or more robust metrics to understand how sensitive your results are to extreme values.
3. Could using historical data make your results more precise (CUPED)?
If you have past data available, apply CUPED (if it was defined in advance) and reassess your effect size and confidence intervals. Otherwise, document why and move on.
4. Did you validate the experiment setup first (A/A tests)?
If not, consider running a retrospective A/A test if infrastructure issues are suspected. You don’t have to do this every time, but it’s good to confirm nothing infrastructural could be wrong.
5. Could the results be a fluke from testing many things at once (Bonferroni)?
If you’re testing multiple metrics, variants, or audience segments as separate hypotheses, apply the correction and re-evaluate the results.
Use these experiment health checks to make confident product decisions
Most experiment failures are preventable, especially when they come from hidden noise, bias, or measurement problems.
Before launching your next test, take some time to run through these health checks. They’ll help you catch problems early and separate real product insight from false signals, so you can trust what your data is actually telling you.
Learn more about experiments in Mixpanel today and turn every result—win or loss—into a deeper understanding of your product and your users.