Experimentation terms: Understanding power, uncertainty, and detectable effects
If you’re running product experiments, you probably sometimes wish you’d paid a bit more attention in statistics class. There are a lot of terms and concepts that sound complicated, and it can be intimidating to start learning them all.
But here at Mixpanel, we want to make experiments as self-serve as possible. We know from experience that you don’t need to be a statistician to run product experiments. But it’s important to understand the core ideas behind how experiments work so you don’t ship changes based on faulty, misleading, or misinterpreted results.
This guide introduces the statistical foundations behind experimentation in practical, plain-language terms for PMs, analysts, growth teams, and marketers who want to design better experiments and interpret results with confidence.
Keep reading to learn about:
- Frequentist vs. Bayesian logic
- Statistical power
- Sample size
- Minimum Detectable Effect (MDE)
- Confidence and credible intervals
- Sequential analysis
6 core experimentation concepts to know
1. Frequentist vs Bayesian
When you run an experiment, you’re trying to answer a simple question: “Is the difference I’m seeing real, or just random noise?”
Frequentist and Bayesian approaches are two different ways of answering that question. Most experimentation platforms use one of these approaches in their statistical models. Before we dive into each, it’s important to understand some concepts that underlie how frequentist and Bayesian methods work.
Every experiment has:
- Hypothesis: a testable, educated guess, typically in the format of “if ___, then ___”.
- Variables: specific factors in an experiment that are manipulated or measured to determine their effect on an outcome.
A lot of experiment concepts revolve around the concept of the null hypothesis: that the hypothesis is wrong, and there is no significant relationship between your variables.
When you’re doing an experiment, you’re conducting a test under the assumption that the null hypothesis is correct, and checking whether the evidence is strong enough to reject it. This assumption undergirds a lot of statistical tests and is what makes their results trustworthy.
Frequentist approach
The frequentist approach is the classic method for A/B testing. At its core, it evaluates evidence based on what would happen if you repeated the experiment many times. Frequentist:
- Has a fixed sample size: You have to determine how many people will be in the test from the start
- Can’t be checked or stopped mid-test (checking results early is sometimes called “peeking”)
- Only looks at data from the current test (ignores any prior data/experiments)
- Produces outputs like p-values and confidence intervals (a range of values that would contain the real impact in most repeated experiments)
❓What’s a p-value? A p-value measures how likely you’d be to see data at least this extreme if the null hypothesis were true. For example, a p-value of 0.04 means that if there were actually no real effect (the null hypothesis was true), you’d see a result this strong only about 4% of the time due to randomness. In other words, the lower your p-value, the more confident you can be that the experiment’s effect is real and not just noise. Typically, 0.05 (5%) is the standard threshold for considering a result statistically significant.
Let’s say you run an A/B test on a new onboarding flow. Variant A converts at 20% and variant B converts at 22%. Your experiment platform reports a p-value of 0.02.
That means if variant B isn’t actually better than variant A, you’d only see a gap this large (or larger) about 2% of the time due to random chance. Because that probability is low (less than 5%), we can conclude the result is statistically significant.
Remember, frequentist statistics don’t tell you how big or valuable the improvement is, only how unlikely it is that it’s a coincidence.
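To make the arithmetic concrete, here’s a minimal sketch of the two-proportion z-test behind that kind of p-value, using only Python’s standard library. The sample sizes below are hypothetical (the example above doesn’t specify them), so the computed p-value won’t exactly match the 0.02 above.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no real difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    # Probability of seeing a gap at least this large if the null were true
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 1,000/5,000 conversions for A vs. 1,100/5,000 for B
p_value = two_proportion_p_value(1000, 5000, 1100, 5000)
print(round(p_value, 4))
```

With these assumed sample sizes, the 2-point gap comes out significant at the usual 0.05 threshold; shrink the samples and the same gap stops being significant.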
Generally, you use frequentist testing when:
- You really want to avoid false positives
- You have enough traffic for properly powered tests
- You want strict, predefined decision rules
- You need consistency across many experiments
It may not be ideal when:
- You want direct probability statements (“How likely is B to win?”)
- You need flexible or continuous monitoring
- You have small sample sizes
- You want results framed more intuitively for decision-makers
So, in sum, while it is the most conservative method (i.e., the least likely to lead you astray), it has many prerequisites, and it can be difficult to communicate the results to non-technical stakeholders.
Bayesian approach
The Bayesian approach is a way of analyzing experiments that focuses on directly estimating how likely different outcomes are. Instead of asking, “How surprising would this data be if nothing changed?” it asks, “Given the data we’ve observed, how likely is each possible outcome?”
Unlike a frequentist approach, it:
- Starts with an initial assumption (called a prior) and updates it as new data comes in
- Treats uncertainty as something that can be measured and expressed as probabilities
- Produces a full range of likely outcomes, not just a single yes/no result
- Can be monitored in-flight without risk of polluting the results (or even stopped if you have proper confidence)
Typical Bayesian outputs sound like “there is an 85% probability that variant B is better than A” or “there’s a 90% chance the lift is between 2% and 6%.”
This tends to be more intuitive for decision-makers because results are expressed directly as probabilities, instead of being framed as how surprising the data would be if there were “no effect.”
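A minimal sketch of how a statement like “there is an 85% probability that variant B is better than A” can be computed: draw from each variant’s posterior distribution and count how often B comes out ahead. This assumes uniform Beta(1, 1) priors, and the conversion counts are hypothetical.

```python
import random

random.seed(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # Each variant's posterior is Beta(successes + 1, failures + 1)
        rate_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 200/1,000 conversions for A vs. 230/1,000 for B
p_b_wins = prob_b_beats_a(200, 1000, 230, 1000)
print(round(p_b_wins, 2))
```

The output reads directly as a probability that B is better, which is exactly the kind of statement decision-makers find intuitive.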
Typically, teams use the Bayesian approach when:
- They want intuitive probability-based results
- They need to monitor results continuously
- Speed of decision-making matters
- They sometimes run smaller experiments
- They want outputs framed around business risk
However, it may not be ideal when:
- You need strict long-term false positive guarantees
- You’re in a regulated or highly conservative environment
- Your organization already standardizes on p-values
Why this matters
Your statistical framework shapes:
- How results are reported
- Your stopping rules, the predetermined plans for when and under what conditions you’ll end an experiment
- How uncertainty is expressed
- What claims you can make about your results
Both frameworks can be useful depending on the situation; there is no “correct” way to do it. However, choosing one framework and standardizing how results are interpreted is critical. Without consistency, teams may misread outcomes, apply different decision rules, or unintentionally increase their risk of false positives.
2. Statistical power
Power is the probability that your experiment will detect a real effect, if one truly exists. In other words, how likely is your test to catch a meaningful difference?
A common mistake is running an experiment and concluding nothing happened, when actually the test was just too small to detect the change. Power analysis helps you avoid this.
You can think of power as your experiment’s “signal detection strength.” A high-power test is more likely to notice a real improvement. A low-power test might completely miss it. Running a low-power experiment is like trying to hear a whisper in a loud room. The signal might be there; you just don’t have enough clarity to detect it.
Power is mainly driven by three factors working together:
- Your sample size (how much data you collect)
- The size of the effect you’re trying to detect
- The noise/variance in your metric
Here’s a quick example: say your current signup conversion rate is 10%, and your new design improves it to 12%. That’s a real 2-point effect either way, but your ability to detect it depends on the experiment setup.
If you run the test with:
- 500 users per variant, the lift may be too small relative to the noise in conversion data, making the experiment underpowered and likely to miss the effect.
- 50,000 users per variant, the same 2-point lift stands out more clearly from the noise, making the experiment much more likely to detect it.
Why this matters
When experiments are underpowered, teams often end up with misleading or unusable results. Low-power experiments tend to:
- Miss real effects
- Produce unstable or noisy estimates
- End with inconclusive outcomes
- Increase the chance of false negatives
Meanwhile, high-power experiments lead to far more reliable decisions because they:
- Detect real differences more consistently
- Make it more credible when you conclude there’s no meaningful effect
- Reduce wasted experimentation cycles
- Support more confident product decisions
So how can you know if your experiment has enough statistical power to make the results matter? Before launching your experiment, you can do a power or sample size calculation.
Most experimentation platforms and sample size calculators will ask you for four inputs (more on some of these in just a bit):
- Your baseline metric (like current conversion or retention)
- Your desired statistical significance level (usually 5%)
- Your target power level (commonly 80% or 90%)
- Your minimum detectable effect (MDE), or the smallest lift worth detecting
From these, the tool estimates how many users or events you need per variant.
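As a rough sketch of what those calculators do under the hood, here’s a normal-approximation sample size formula for comparing two conversion rates. The baseline and MDE inputs below are illustrative, not prescriptive.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute lift of `mde`
    over `baseline` (normal approximation for a two-proportion test)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # ~1.96 for a 5% significance level
    z_power = nd.inv_cdf(power)          # ~0.84 for 80% power
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(variance * ((z_alpha + z_power) / mde) ** 2) + 1  # round up

# Illustrative inputs: 10% baseline, 2-point lift, 5% significance, 80% power
n_needed = sample_size_per_variant(0.10, 0.02)
print(n_needed)
```

For these inputs the requirement lands in the low thousands of users per variant, which is why low-traffic products often struggle to detect small lifts.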
💡 Pro tip: A non-significant result doesn’t prove there’s no effect; it may simply mean the test didn’t have enough power to detect it. (“We didn’t detect a difference” isn’t the same as “there is no difference.”)
3. Sample size
Sample size is the number of observations collected per variant. You could have sample sizes of users, sessions, accounts, or events, depending on what you’re analyzing. For example, if you run an A/B test with 10,000 users in variant A and 10,000 users in variant B, your sample size is 10,000 per variant.
Coin flip example: Flipping a coin 5 times could give you 4 heads. Yet we know flipping a coin is not 80% likely to be heads. Flipping it 5,000 times gives you a much clearer picture of the true odds.
Why this matters
Sample size impacts your ability to separate real signals from noise. User behavior naturally varies: some users convert, some don’t. Some days are higher than others. With too little data, randomness can easily masquerade as a meaningful effect, so larger samples average out that randomness.
When your sample is too small
- Results swing wildly from day to day
- Estimated lifts jump around
- Confidence intervals are wide
- “Winners” often flip with more data
A common mistake is stopping an experiment too early because early results look exciting. Small samples often exaggerate differences that disappear with more data.
When your sample is large enough for what you’re trying to detect
- Results stabilize
- Random fluctuations cancel each other out
- Confidence intervals narrow
- True differences become easier to detect
Bigger samples give you more precision, but they also take more time and traffic. The goal isn’t the biggest sample possible, but the right size for your question.
Make sure to plan your sample size before starting your experiment. Again, your analytics or experimentation platform should have a calculator that estimates the sample size you’ll need based on your baseline metric, desired power, significance level, and minimum detectable effect.
💡 Pro tip: Large sample sizes won’t save you if your metrics are poorly defined, the tracking is inconsistent, or the experiment design is flawed in other ways. Bad design at a large scale just produces very precise wrong answers.
4. Minimum Detectable Effect (MDE)
Minimum Detectable Effect (MDE) is the smallest true effect your experiment is designed to detect with acceptable power and statistical confidence.
It answers a simple question: what size of improvement can this test realistically discover? Every experiment has a detection limit. Changes smaller than that limit are unlikely to be detected reliably, even if they’re real. For example, if your MDE is 5%, you can’t reliably detect a 1% lift. It’s like trying to weigh something on a scale that only measures in 5-pound increments. If the change is 1 pound, the scale won’t reliably show it.
That’s why MDE is directly related to sample size: The smaller the improvement you want to detect, the more data you’ll need.
Why this matters
MDE helps teams plan smarter experiments by setting realistic expectations about what a test can actually detect. Instead of running experiments and hoping they surface small improvements, teams can evaluate ahead of time whether their design has enough sensitivity to find the kinds of effects they care about.
A common mistake is running experiments where the expected impact is smaller than the test can realistically detect.
This usually happens when teams test very small changes, like minor copy tweaks or subtle UI adjustments, but don’t have enough traffic or sample size to measure small lifts. The experiment runs, results come back “not significant,” and the team concludes the idea didn’t work.
5. Confidence and uncertainty
When you run an A/B test, you’re measuring behavior from a sample of users, not every possible user who could ever see your product. Because of natural variations in behavior, your measured lift will always include some uncertainty.
The point estimate is the single number your experiment reports, or your best guess of the true effect based on the data you observed. If your A/B test says “Variant B increased signups by 3%,” that 3% is the point estimate. In other words, it’s your best guess based on the data you saw, but not a guarantee that the true impact is exactly 3%.
🧠 Think of a point estimate as the spot a flashlight is pointing at. The interval is how wide the beam is. A narrow beam means you’re very precise. A wide beam means there’s more uncertainty.
Confidence (Frequentist) or credible (Bayesian) intervals are designed to show that uncertainty explicitly. Instead of giving you a single number, they give you a range of plausible values for the true effect. Treat the reported lift as your best estimate, and the interval as a measure of how precise that estimate is.
Why this matters
Point estimates alone can be misleading because they hide how much uncertainty sits behind the number. A reported +4% lift can mean very different things depending on the uncertainty range around it.
For example, a +4% estimate with an interval of:
- +3.8% to +4.2% suggests a tight, stable measurement
- –2% to +10% suggests a wide, uncertain measurement
Both have the same average, but very different levels of decision risk. The interval shows how far the true impact might reasonably differ from your estimate.
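A sketch of how such an interval can be computed, using a simple normal-approximation (Wald) interval. The conversion counts are hypothetical, chosen so both scenarios share the same +4-point estimate at very different sample sizes.

```python
from statistics import NormalDist

def lift_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation (Wald) confidence interval for the lift p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for a 95% interval
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical data: the same +4-point estimate at two very different scales
tight = lift_interval(1000, 10_000, 1400, 10_000)  # narrow beam
wide = lift_interval(10, 100, 14, 100)             # wide beam
print(tight)
print(wide)
```

The large-sample interval sits entirely above zero, while the small-sample one spans from a loss to a big win: the same point estimate, very different decision risk.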
6. Sequential testing (and “peeking”)
In real experiments, teams naturally want to check results as they come in. Dashboards update daily, and it’s tempting to stop a test early as soon as one variant looks like a winner. Teams often call this habit “peeking.”
Revisiting the coin flip: Imagine checking the results after every flip. If you stop after three consecutive heads, you might think the coin is weighted toward heads, even though it’s actually 50/50.
Why this matters
The problem is that most traditional statistical tests assume you only check results once, at the end of the experiment. If you keep checking repeatedly and stop the moment you see significance, you’re quietly increasing your chance of a false win.
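A quick simulation illustrates the inflation. This sketch runs A/A tests, where no real difference exists and every “significant” result is therefore a false positive, and stops each test at the first significant-looking check. The check count, batch size, and seed are arbitrary.

```python
import random
from statistics import NormalDist

random.seed(7)
Z_CRIT = NormalDist().inv_cdf(0.975)  # ~1.96, the usual 5% two-sided threshold

def peeking_false_positive_rate(n_checks=10, users_per_check=200, sims=500):
    """Fraction of A/A tests (no real difference exists) declared 'significant'
    when results are checked after every batch and stopped at the first win."""
    false_wins = 0
    for _ in range(sims):
        a_conv = b_conv = n = 0
        for _ in range(n_checks):
            a_conv += sum(random.random() < 0.5 for _ in range(users_per_check))
            b_conv += sum(random.random() < 0.5 for _ in range(users_per_check))
            n += users_per_check
            pool = (a_conv + b_conv) / (2 * n)
            se = (pool * (1 - pool) * 2 / n) ** 0.5
            if se > 0 and abs(b_conv - a_conv) / n / se > Z_CRIT:
                false_wins += 1  # a purely random "win" stopped the test early
                break
    return false_wins / sims

rate = peeking_false_positive_rate()
print(rate)  # typically well above the nominal 5% false positive rate
```

Even though each individual check uses a 5% threshold, stopping at the first win pushes the overall false positive rate far higher.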
Sequential testing methods solve this by allowing you to evaluate results as data accumulates, while still preserving statistical validity. They matter most when:
- Teams check experiment dashboards frequently
- Traffic is high, and results move quickly
- Variants carry risk and need early shutdown triggers
- Tests are expensive, and stopping early saves resources
- Product teams operate in fast iteration cycles
You don’t need to run these manually, because modern experimentation platforms like Mixpanel handle this under the hood. When properly implemented, sequential methods allow you to monitor results and stop early without inflating false positives.
A framework for designing experiments you can trust
Reliable experimentation involves using all the statistical concepts above together, in the right order, to support impactful decisions. A strong experimentation workflow looks like this:
1. Define your baseline and MDE first
Before running anything, be clear on two things:
- Your current baseline (for example, today’s conversion or retention rate)
- The smallest change that would actually be worth acting on (your MDE)
This sets expectations up front. If the effect you care about is smaller than what your experiment can detect, no amount of analysis later will fix that.
2. Calculate power and sample size
Once you know what you’re trying to detect, use your target power and MDE to calculate how much data you’ll need. Ask: Will this experiment be able to surface the signal we care about, or will it be lost in the noise?
A proper sample size calculator can do a lot of the heavy lifting here.
3. Choose a statistical framework that matches your decision velocity
Frequentist and Bayesian approaches answer different questions and support different decision styles.
- Use Frequentist testing when you want guardrails and discipline. It’s built to protect you from declaring winners too easily.
- Use Bayesian testing when you want flexibility and faster decision-making, expressed in intuitive probabilities.
In other words, if your top priority is minimizing false positives across many experiments, Frequentist methods are often the better fit. If your top priority is making timely, probability-based decisions with flexibility, Bayesian methods may be more practical.
4. Account for peeking with sequential testing
If your team checks results as data comes in (like most teams do), your workflow should account for that. Otherwise, repeatedly checking results can increase your chance of false positives.
5. Validate decisions using confidence or credible intervals
Finally, use confidence or credible intervals to understand how uncertain your result is and how risky a decision might be. Two experiments with the same average lift can imply very different levels of confidence.
Remember to ask not just “Is this positive?”, but also “How wrong could this reasonably be?”
Reliable experimentation starts with the basics
Strong experimentation isn’t about complex math or models. It’s about understanding uncertainty and the tradeoffs you’re willing to make, respecting the limits of your data, and planning experiments so they’re capable of detecting meaningful effects.
When teams know how power, sample size, MDE, and uncertainty intervals work together, they’re far less likely to be misled by random fluctuations or overconfident in results that won’t hold up with more data. The real skill is interpreting evidence correctly and making decisions with calibrated risk.
These fundamentals form the foundation of trustworthy experimentation programs. Teams that skip these fundamentals often waste weeks running experiments that were never capable of answering the question in the first place.
To learn more about experimentation in Mixpanel, visit our Docs.


