Last quarter, a product team ran an A/B test for six weeks, found no significant difference between variants, and concluded their new feature "didn't move the needle." Three months later, a competitor launched an identical feature and reported a 12% lift in engagement. What went wrong? The original team's experiment had 23% statistical power — meaning even if the effect existed, they had less than a 1-in-4 chance of detecting it. They didn't prove the feature was ineffective. They ran an experiment incapable of proving anything.
This scenario repeats itself across thousands of teams every month. Power analysis could have prevented it — but most practitioners either skip it entirely or make critical mistakes that invalidate their calculations. Let's quantify our uncertainty about what works, not hide behind underpowered tests.
The Five Mistakes That Invalidate Your Power Calculations
Before we discuss what power analysis is, let's start with what breaks it. These aren't edge cases — they're the default behavior for most teams running experiments.
Mistake #1: Using Cohen's "Small/Medium/Large" Effect Sizes
Jacob Cohen introduced standardized effect size benchmarks (0.2 = small, 0.5 = medium, 0.8 = large) for social science research in the 1960s. They were meant as rough interpretive guidelines, not planning tools. Yet practitioners routinely plug "0.5" into power calculators without asking whether a medium effect matters to their business.
A "medium" effect size in conversion rate terms might be a 15% relative lift. For an e-commerce site converting at 2%, that's moving from 2.0% to 2.3%. Is that difference worth six weeks of engineering time? Only your business model can answer that. What did we believe before seeing this data? If you expected a 50% lift from a major redesign, planning to detect a 15% lift wastes resources on precision you don't need.
Quick Fix: Calculate Your Minimum Detectable Effect (MDE)
Ask your stakeholders: "What's the smallest improvement that would justify implementing this change?" Convert that business threshold to a statistical effect size. That's your MDE. If a 10% revenue lift justifies the effort, design your test to detect a 10% lift — not an arbitrary "medium" effect.
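As a worked sketch with hypothetical numbers, converting a stakeholder threshold into an absolute effect size is one line of arithmetic:

```python
# Hypothetical numbers: a 2% baseline conversion rate and a 10%
# relative lift threshold from stakeholders.
baseline_rate = 0.02
min_relative_lift = 0.10  # smallest lift that justifies shipping

# Convert the business threshold to an absolute effect size
mde_absolute = baseline_rate * min_relative_lift  # 0.2 percentage points
target_rate = baseline_rate + mde_absolute

print(f"MDE: {mde_absolute:.3%} absolute ({min_relative_lift:.0%} relative)")
print(f"Design the test to distinguish {baseline_rate:.1%} from {target_rate:.1%}")
```

That absolute difference (here, 2.0% vs 2.2%) is what goes into the power calculation, not a generic "medium" effect.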
Mistake #2: Ignoring Prior Beliefs About Effect Size
Traditional power analysis treats the effect size as a known constant. But in reality, we're uncertain about the effect before running the experiment — that's why we're testing. This uncertainty matters for planning.
Imagine you're testing a checkout page redesign. Your prior beliefs might be: "Based on similar tests, we expect somewhere between a 5% and 20% lift, most likely around 12%." A Bayesian approach would model this uncertainty and calculate the probability of obtaining convincing evidence across the plausible range of effects.
The frequentist power calculation assumes the effect is exactly (say) 12%, then asks: "What's our probability of rejecting the null hypothesis?" But if the true effect is 8%, your actual power is lower. If it's 16%, your power is higher. The single power value (e.g., "80% power") hides this uncertainty.
Quick Win: Create Power Curves, Not Point Estimates
Instead of calculating power for a single effect size, plot power across a range of plausible effects. If you have 80% power for a 12% lift but only 45% power for an 8% lift, and you believe both are plausible, you're taking more risk than the "80% power" headline suggests. Credible intervals give us a principled range of plausible values — use them in planning too.
Mistake #3: Calculating Power After Seeing Results
You run an experiment. It's not significant (p = 0.18). Someone asks: "What was our power?" You calculate post-hoc power using the observed effect size. It's 35%. You conclude: "We were underpowered, so we can't rule out an effect."
This reasoning is circular. Post-hoc power calculated from observed effects is mathematically determined by the p-value — it adds no new information. If p = 0.18, you already know the evidence was weak. Calculating power doesn't change that.
The posterior distribution tells a richer story than a single number. Instead of post-hoc power, calculate a confidence interval for the effect size. If your 95% CI is [-2%, +8%], you can say: "We're 95% confident the true effect is between a 2% decrease and an 8% increase. Effects larger than 8% are unlikely given this data."
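A minimal sketch of that recommendation, using a normal-approximation (Wald) interval for the difference in proportions; the counts below are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def diff_proportion_ci(x1, n1, x2, n2, alpha=0.05):
    """Wald confidence interval for the difference p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

# Hypothetical counts: 300/10,000 control conversions vs 330/10,000 treatment
lo, hi = diff_proportion_ci(300, 10_000, 330, 10_000)
print(f"95% CI for the difference: [{lo:+.2%}, {hi:+.2%}]")
```

An interval like this supports a direct statement about plausible effect sizes, which is exactly what a non-significant p-value cannot do on its own.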
Mistake #4: Treating Power as a Pass/Fail Threshold
Many teams set a rigid "80% power" requirement without understanding what it means. Power is the probability of detecting an effect if it exists at the specified size. At 80% power, you still have a 20% chance of a false negative (Type II error) — failing to detect a real effect.
The 80% convention comes from balancing Type I errors (false positives, typically set at 5%) with Type II errors. It implies you consider false negatives 4× more tolerable than false positives. Is that true for your decision?
In medical trials testing a potential cancer treatment, you might want 90% or 95% power — false negatives (missing an effective treatment) have severe consequences. In low-risk product experiments, 70% power might be acceptable if running larger samples is prohibitively expensive.
Quick Fix: Choose Power Based on Decision Asymmetry
Ask: "What's worse — implementing a change that doesn't work, or missing a change that does work?" If false negatives are much more costly, increase your power requirement. If false positives are worse, tighten your significance level instead. Let's quantify our uncertainty, not hide it behind conventional thresholds.
Mistake #5: Ignoring Multiple Comparisons
You design an experiment with 80% power to detect a 10% lift in your primary metric (conversion rate). But you're also tracking five secondary metrics: average order value, items per cart, bounce rate, time on site, and return rate.
If you test all six metrics at α = 0.05, your family-wise error rate (probability of at least one false positive) is approximately 26%, not 5%. Your effective power for any individual metric also decreases because you're splitting your "evidence budget" across multiple tests.
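The 26% figure follows directly from the independence approximation; a two-line check:

```python
# Family-wise error rate when testing m independent metrics at level alpha:
# FWER = 1 - (1 - alpha)^m
alpha = 0.05
m = 6
fwer = 1 - (1 - alpha) ** m
print(f"Testing {m} metrics at alpha={alpha}: FWER = {fwer:.1%}")
```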
This becomes worse when you're running multiple A/B tests simultaneously, each with multiple metrics, and reporting whichever shows significance. The garden of forking paths multiplies your false positive risk.
Quick Win: Designate Primary Metrics Before Testing
Choose one primary decision metric and power your test for that. Secondary metrics are exploratory — interesting for hypothesis generation but not decision-making. If you must test multiple primary metrics, use correction methods (Bonferroni, Holm-Bonferroni, or FDR control) and increase sample size to maintain power after correction.
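Holm-Bonferroni is simple enough to implement directly. A sketch, with made-up p-values for illustration:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure: returns a reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one fails, all larger p-values are kept
    return reject

# Hypothetical p-values from six metrics
pvals = [0.004, 0.020, 0.030, 0.009, 0.450, 0.120]
print(holm_bonferroni(pvals))
```

Note that Holm is uniformly more powerful than plain Bonferroni (which would compare every p-value against alpha/m), yet still controls the family-wise error rate.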
What Power Analysis Actually Tells You (And What It Doesn't)
Power analysis answers a specific question: "If I run this experiment with N observations, and the true effect size is exactly X, what's my probability of obtaining a statistically significant result at significance level α?"
Four quantities are mathematically linked:
- Sample size (N): How many observations per group
- Effect size (δ): The standardized difference you're trying to detect
- Significance level (α): Your false positive tolerance (typically 0.05)
- Power (1-β): Probability of detecting the effect if it exists
Specify any three, and you can calculate the fourth. Most commonly, you specify effect size, α, and desired power (say, 80%), then solve for required sample size.
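For example, solving for the per-group sample size of a two-proportion test with the standard normal-approximation formula (using the 2.0% vs 2.3% scenario from Mistake #1):

```python
import math
from scipy.stats import norm

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a lift from 2.0% to 2.3% at alpha=0.05 with 80% power
print(n_per_group_two_proportions(0.020, 0.023))
```

The answer (tens of thousands per group) is a useful reality check: small absolute differences in low-conversion metrics are expensive to detect.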
What power analysis doesn't tell you:
- Whether the effect exists — it assumes a specific effect size and asks about detection probability
- The probability your hypothesis is true — that's a Bayesian posterior, not a frequentist power calculation
- How much evidence you'll obtain — power is about crossing a significance threshold, not quantifying evidence strength
- What to conclude from non-significant results — even high-powered studies can miss effects by chance
How to Actually Run Power Analysis (The Right Way)
Here's a principled workflow that avoids the common pitfalls:
Step 1: Define Your Minimum Detectable Effect from Business Requirements
Don't start with statistics. Start with decisions. Convene your stakeholders and ask:
- "What improvement would justify the cost of implementing this change?"
- "Below what threshold would we consider the effect too small to matter?"
- "What effect size would change our strategic direction?"
Convert these answers to statistical effect sizes. For conversion rates, that's typically a percentage point difference or relative lift. For continuous metrics (revenue, session duration), you'll need to estimate standard deviations from historical data.
Step 2: Quantify Your Prior Uncertainty
Before seeing data from this experiment, what do you believe about the likely effect size? Sources of prior information:
- Historical data: Previous A/B tests on similar changes
- Industry benchmarks: Published effect sizes from comparable experiments
- Expert judgment: Product team intuitions based on experience
- Pilot studies: Small-scale tests or user research findings
Express this as a range, not a point estimate. "We believe the effect is between 5% and 20%, most likely around 12%" is more honest than "We assume a 12% effect."
Step 3: Calculate Power Across Your Prior Range
Create a power curve showing detection probability for effect sizes spanning your prior belief interval. Here's example code for a two-proportion test:
```python
import numpy as np
from scipy.stats import norm

def power_two_proportions(n_per_group, p1, p2, alpha=0.05):
    """Calculate power for a two-proportion z-test."""
    # Pooled proportion under the null hypothesis
    p_pooled = (p1 + p2) / 2
    # Standard errors under the null and the alternative
    se_null = np.sqrt(2 * p_pooled * (1 - p_pooled) / n_per_group)
    se_alt = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    # Critical value for a two-tailed test
    z_crit = norm.ppf(1 - alpha / 2)
    # Effect size
    effect = abs(p2 - p1)
    # Power: probability the test statistic clears the critical value,
    # computed under the alternative distribution
    power = (1 - norm.cdf((z_crit * se_null - effect) / se_alt)
             + norm.cdf((-z_crit * se_null - effect) / se_alt))
    return power

# Example: baseline conversion 3%, testing rates from 3.3% to 3.9%
baseline = 0.03
effect_sizes = np.linspace(0.003, 0.009, 20)  # 10% to 30% relative lift
n_per_group = 5000

for delta in effect_sizes:
    pwr = power_two_proportions(n_per_group, baseline, baseline + delta)
    relative_lift = (delta / baseline) * 100
    print(f"Relative lift: {relative_lift:.1f}% → Power: {pwr:.2%}")
```
This reveals detection probability across your uncertainty range, not just at a single assumed effect.
Step 4: Make Sample Size Decisions Under Resource Constraints
Reality check: you probably can't afford the sample size that gives 95% power for your smallest effect of interest. Now what?
You have three principled options:
- Accept lower power for small effects: "We have 90% power for effects above 15%, but only 60% power for a 10% effect. We're willing to risk missing small effects."
- Focus on larger effects: "Given our sample constraints, we can reliably detect 20%+ effects. We'll treat this as an initial screen — smaller effects would require follow-up testing."
- Use Bayesian methods: "Rather than binary significance testing, we'll calculate posterior probabilities and credible intervals. A non-significant result will still provide evidence about plausible effect sizes."
What you shouldn't do: run an underpowered test, get p = 0.12, and announce "we found no effect." Your test may simply have been incapable of detecting the effect that exists.
Step 5: Preregister Your Analysis Plan
Before collecting data, document:
- Primary outcome metric(s)
- Planned sample size and stopping rule
- Significance level and power
- Minimum detectable effect
- Any subgroup analyses you'll run
This prevents post-hoc rationalization ("we were really testing secondary metric Y, not primary metric X") and selective reporting of whichever metric happened to reach significance.
When Sample Size Calculations Lie to You
Standard power analysis formulas make assumptions that often don't hold in real business contexts. Here's when to be skeptical of the numbers:
Time-Based Effects and Seasonality
Your power calculation says you need 10,000 observations per group. At 500 visitors per day, that's 40 days of testing. But your product has weekly seasonality — weekday behavior differs from weekends. And you're testing during November, which includes Black Friday.
The effective sample size may be much smaller than raw observation counts. Observations separated by one day are more correlated than observations separated by one week. Standard formulas assume independent observations — violated when temporal correlation exists.
How much should this evidence update our beliefs? Less than uncorrelated samples would suggest. You may need 2-4× larger samples to account for clustering and autocorrelation.
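One rough way to budget for this, borrowed from cluster sampling, is the design effect 1 + (m - 1) * ICC, where m is the cluster size (say, a day's visitors) and ICC is the within-cluster correlation. The numbers below are hypothetical:

```python
import math

def adjusted_n(n_independent, cluster_size, icc):
    """Inflate a sample size by the design effect 1 + (m - 1) * ICC,
    a cluster-sampling heuristic for correlated observations."""
    deff = 1 + (cluster_size - 1) * icc
    return math.ceil(n_independent * deff)

# Hypothetical: 10,000 independent observations needed; daily clusters of
# ~500 visitors with a modest within-day correlation of 0.002
print(adjusted_n(10_000, 500, 0.002))  # design effect near 2.0, so ~2x the sample
```

Even a tiny ICC, multiplied by large daily clusters, can double the required sample, which is consistent with the 2-4x rule of thumb above.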
Early Stopping and Sequential Testing
You plan a 4-week test with 80% power. After week 1, you peek at the results — they're significant! You stop early and declare victory. If you peek like this every week, your actual Type I error rate is not 5%; across the four looks it climbs to roughly 13-15%.
Sequential testing (peeking at results before planned completion) inflates false positive rates unless you adjust for multiple looks. Methods like alpha spending functions (O'Brien-Fleming, Pocock boundaries) or Bayesian sequential designs account for this.
If you're going to peek (and you probably should — waiting weeks to discover an obviously harmful treatment is wasteful), use appropriate sequential analysis methods and adjust your sample size calculations accordingly.
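You can see the inflation directly with a quick Monte Carlo sketch: simulate A/A tests (so the null is true by construction), peek after each weekly batch, and count how often any look crosses naive significance:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_type1_rate(n_sims=2000, looks=4, n_per_look=1000, alpha=0.05):
    """Simulate A/A tests (no true effect) with a peek after each weekly
    batch; count how often *any* look crosses naive significance."""
    z_crit = norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.binomial(1, 0.03, looks * n_per_look)
        b = rng.binomial(1, 0.03, looks * n_per_look)
        for k in range(1, looks + 1):
            n = k * n_per_look
            p1, p2 = a[:n].mean(), b[:n].mean()
            pooled = (p1 + p2) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(p2 - p1) / se > z_crit:
                false_positives += 1
                break  # stopped early and "declared victory"
    return false_positives / n_sims

print(f"Type I error with 4 naive weekly peeks: {peeking_type1_rate():.1%}")
```

With four equally spaced looks, the simulated false positive rate lands well above the nominal 5%, which is precisely the problem that alpha spending boundaries correct.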
Heterogeneous Treatment Effects
Your average treatment effect might be 8%, but that average masks important variation. Mobile users: +15%. Desktop users: -2%. Power analysis for the average effect doesn't tell you whether you can detect these subgroup differences.
Subgroup analysis requires much larger samples. If you split your sample in half to analyze two subgroups separately, each subgroup test runs on half the data — and because power falls off nonlinearly with sample size, each subgroup analysis can have far less than the original power. Detecting an interaction effect (a differential response across subgroups) requires roughly 4× the sample size of detecting a main effect of the same magnitude.
If subgroup analysis matters to your decision, power your study for those comparisons, not just the overall average treatment effect.
Real-World Scenario: Pricing Optimization Test
Let's apply these principles to a realistic business problem. An e-commerce company wants to test a 10% price increase on a product category. They want to understand the demand elasticity — how much volume will they lose, and will higher prices compensate with increased revenue?
The Stakes
- Current state: 500 transactions per week at $50 average order value = $25K weekly revenue
- Proposed change: Increase price to $55 (10% increase)
- Business question: Will revenue increase despite volume loss?
Defining the Minimum Detectable Effect
The finance team says: "We need at least a 3% revenue increase to justify the strategic risk of raising prices. Anything less, and we'd rather maintain current pricing."
With a 10% price increase, revenue stays flat if volume drops exactly 9.1% (1.10 × 0.909 = 1.0). A 3% revenue increase requires volume to drop no more than 6.4%:
Revenue ratio = (new price / old price) × (new volume / old volume)
1.03 = 1.10 × (new volume / old volume)
New volume / old volume = 0.936 → 6.4% volume decrease
So the MDE is: can we distinguish a 6.4% volume decrease from larger decreases that would hurt revenue?
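The break-even arithmetic generalizes to any price change; a tiny helper makes the thresholds explicit:

```python
def breakeven_volume_drop(price_increase, revenue_target=0.0):
    """Largest fractional volume decrease that still meets the revenue target.
    Revenue ratio = (1 + price_increase) * (1 - volume_drop)."""
    return 1 - (1 + revenue_target) / (1 + price_increase)

# 10% price increase: flat revenue vs. a +3% revenue requirement
print(f"Break-even: {breakeven_volume_drop(0.10):.1%} volume drop")
print(f"For +3% revenue: {breakeven_volume_drop(0.10, 0.03):.1%} volume drop")
```

This reproduces the 9.1% and 6.4% thresholds derived above.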
Establishing Priors from Similar Tests
The product team previously tested price increases on related categories:
- Category A: 8% price increase → 5% volume decrease (elasticity = -0.625)
- Category B: 12% price increase → 11% volume decrease (elasticity = -0.917)
- Category C: 15% price increase → 14% volume decrease (elasticity = -0.933)
Based on this, their prior belief: "Elasticity is probably between -0.6 and -1.0, most likely around -0.8." For a 10% price increase, that implies a 6-10% expected volume decrease.
Crucially, the threshold effect (6.4% volume loss) sits right at the optimistic end of their prior range. There's substantial prior probability that the price increase will hurt revenue.
Power Calculations
Baseline conversion rate (visitors who purchase): 2.5%
Weekly visitors: 20,000
Expected volume decrease: 6-10%
For a two-proportion test comparing conversion rates:
| Sample Size (per group) | Duration | Power for 6.4% Volume Drop | Power for 10% Volume Drop |
|---|---|---|---|
| 10,000 | 1 week | 31% | 58% |
| 20,000 | 2 weeks | 48% | 82% |
| 40,000 | 4 weeks | 73% | 97% |
| 60,000 | 6 weeks | 87% | 99.6% |
The power curve reveals a difficult tradeoff. With 2 weeks of data (standard practice), they have 82% power to detect a 10% volume drop, but only 48% power to detect the critical 6.4% threshold. It's a coin flip whether they'll correctly detect the boundary case.
The Decision
The team chose to run a 6-week test with 87% power for the threshold effect. Yes, that's longer than typical A/B tests. But the strategic importance of pricing decisions justified the investment in statistical certainty.
They also preregistered a Bayesian analysis plan: rather than just reporting "significant" or "not significant," they'd calculate the posterior probability that revenue increased by at least 3%, and provide a credible interval for the elasticity parameter.
The Outcome
After 6 weeks, they observed a 7.8% volume decrease (95% CI: 5.2% to 10.4%). Revenue increased by 1.4% (95% CI: -0.9% to 3.7%). The Bayesian posterior gave 12% probability that revenue increased by at least the 3% threshold.
Notice what proper power analysis enabled: they could confidently say "the effect is smaller than our decision threshold." An underpowered 2-week test might have shown p = 0.13 for the revenue difference, leading to the misleading conclusion "no significant effect found" when actually they just couldn't measure precisely enough.
Run Your Own Power Analysis
Upload your historical data to MCP Analytics and get sample size recommendations based on your actual variance and effect size priors — no more guessing with generic calculators.
The Bayesian Alternative: Assurance and Precision
Traditional power analysis has a frequentist framing: "If I repeat this experiment infinite times under the assumption that the effect is exactly δ, what proportion of those repetitions will yield p < 0.05?"
That's useful, but it's not the only way to plan sample sizes. Bayesian approaches offer complementary perspectives:
Assurance: Average Power Over Prior Uncertainty
Instead of calculating power for a single assumed effect size, calculate the expected power averaging over your prior distribution of plausible effect sizes. This is called assurance (or expected power).
If you believe the effect is somewhere between 5% and 15% (uniform prior), assurance is the average power across that range. If you're 80% confident the effect is above 10%, weight the power calculation accordingly.
Assurance accounts for your genuine uncertainty about the effect size, giving a more realistic picture of detection probability than point-estimate power.
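A sketch of an assurance calculation: draw effect sizes from the prior, compute power at each draw, and average. The prior below (normal around a 12% relative lift, truncated at zero) is illustrative, not prescribed:

```python
import numpy as np
from scipy.stats import norm

def power_two_proportions(n, p1, p2, alpha=0.05):
    """Simple normal-approximation power for a two-proportion test."""
    se = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    z_crit = norm.ppf(1 - alpha / 2)
    effect = abs(p2 - p1)
    return 1 - norm.cdf(z_crit - effect / se) + norm.cdf(-z_crit - effect / se)

rng = np.random.default_rng(0)
baseline = 0.03

# Prior over relative lift: most mass between 5% and 20%, centred near 12%
lift_draws = rng.normal(0.12, 0.04, 10_000).clip(0.0, None)

# Power at each prior draw, then average: that's the assurance
powers = power_two_proportions(20_000, baseline, baseline * (1 + lift_draws))
print(f"Assurance (expected power over the prior): {powers.mean():.1%}")
```

Assurance is typically lower than the headline power at the prior mean, because power is concave in the effect size over the relevant range: the pessimistic draws hurt more than the optimistic draws help.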
Precision-Based Sample Size: Width of Credible Intervals
Instead of asking "Can I reject the null?", ask "How precisely can I estimate the effect size?" Plan sample size to achieve a desired posterior credible interval width.
For example: "I want my 95% credible interval for the conversion rate difference to be no wider than ±1 percentage point." Solve for the sample size that achieves this precision on average (or with 90% probability, if you want to be conservative).
This approach decouples sample size planning from arbitrary significance thresholds. Even if your result isn't "significant," you've still learned something precise about plausible effect sizes.
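A sketch under a simplification: with flat priors, the posterior credible interval for a difference in proportions is close to the normal-approximation confidence interval, so we can solve for n from the target half-width:

```python
import math
from scipy.stats import norm

def n_for_ci_halfwidth(p_baseline, halfwidth, alpha=0.05):
    """Per-group n so the interval for a difference in proportions has the
    requested expected half-width (assumes both arms near p_baseline)."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil((z / halfwidth) ** 2 * 2 * p_baseline * (1 - p_baseline))

# Target: a 95% interval no wider than +/- 1 percentage point, 3% baseline
print(n_for_ci_halfwidth(0.03, 0.01))
```

Note that no effect size appears in this calculation: precision targets depend only on the baseline variance and the width you need for the decision.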
Decision-Theoretic Sample Size
The most principled Bayesian approach: specify loss functions for different decision errors, then choose sample size to minimize expected loss.
For the pricing example:
- Implement price increase when it helps: Gain $X per week
- Implement price increase when it hurts: Lose $Y per week
- Don't implement when it would help: Forgo $X per week
- Don't implement when it would hurt: Avoid losing $Y per week
Combine these payoffs with your prior beliefs and the evidence you'd obtain from different sample sizes. The optimal sample size maximizes expected utility.
This is mathematically elegant but requires specifying explicit loss functions — difficult in practice. Most teams find assurance or precision-based planning more tractable.
Quick Reference: Power Analysis Checklist
Before You Start Testing
- Define MDE from business requirements — not statistical conventions
- Quantify prior beliefs about effect size — express as range, not point estimate
- Calculate power curves across your prior range — not single power values
- Account for multiple comparisons — designate primary metrics before testing
- Consider temporal correlation — adjust sample size for seasonality and autocorrelation if needed
- Plan for early stopping if needed — use sequential testing methods, not naïve peeking
- Preregister your analysis plan — prevents post-hoc rationalization
- Choose power based on decision asymmetry — balance false positive and false negative costs
When Power Analysis Isn't the Right Tool
Power analysis assumes you're doing frequentist hypothesis testing with a binary decision rule. That's not always appropriate:
Exploratory Research
If you're in early-stage product development trying to understand user behavior, hypothesis testing may be premature. You don't yet know what to test. Descriptive analysis, user research, and exploratory data analysis are more valuable.
In exploratory contexts, focus on estimation precision (credible interval width) rather than hypothesis testing power.
Continuous Monitoring and Adaptive Experiments
Multi-armed bandit algorithms adaptively allocate traffic to better-performing variants while learning. Traditional fixed-horizon power analysis doesn't apply — the sample size isn't predetermined, and there's no single hypothesis test at the end.
For adaptive experiments, use simulation-based planning: simulate many realizations of the bandit algorithm under different effect size scenarios, and assess regret (lost value from suboptimal allocation) and decision accuracy.
When You Can't Randomize
Power analysis assumes randomized treatment assignment. In observational studies, confounding and selection bias dominate. No amount of statistical power overcomes biased assignment.
If randomization is impossible, invest effort in quasi-experimental design (difference-in-differences, regression discontinuity, instrumental variables, propensity score matching) rather than power calculations for naive comparisons.
Frequently Asked Questions
What's the difference between power analysis and sample size calculation?
They're two sides of the same equation. Sample size calculation asks: "How many observations do I need to detect this effect?" Power analysis asks: "What's my probability of detecting an effect with this sample size?" Both use the same statistical framework — power, effect size, significance level, and sample size are mathematically linked. Change any one and the others adjust.
Can I do power analysis after collecting data?
Post-hoc power analysis (calculating power after seeing results) is widely discouraged by statisticians. If your test was significant, post-hoc power is redundant — you already detected the effect. If your test wasn't significant, low post-hoc power doesn't prove the effect doesn't exist. Instead, use confidence intervals to quantify uncertainty about the effect size, or perform sensitivity analysis to understand what effects you could have detected.
What if I can't afford the sample size power analysis recommends?
You have three principled options: 1) Accept lower power and acknowledge you're running an exploratory study (be honest about detection limits). 2) Focus on larger effects you can actually detect with available data. 3) Use Bayesian methods that quantify evidence strength rather than binary significance testing. What you shouldn't do: run an underpowered study and treat null results as conclusive evidence of no effect.
How do I choose the right effect size for power calculations?
Base it on the minimum effect that matters to your business, not arbitrary "small/medium/large" labels. Ask: "What's the smallest change in conversion rate (or revenue, or retention) that would justify implementing this change?" That's your minimum detectable effect. If historical data exists, use observed effect sizes as priors — but add uncertainty. If you're truly uncertain, calculate power curves for multiple plausible effect sizes rather than picking a single number.
Does power analysis work for Bayesian A/B tests?
Traditional power analysis is a frequentist framework, but similar principles apply in Bayesian settings. Instead of "power to reject the null," you'd calculate the probability of obtaining a credible interval that excludes zero (or your region of practical equivalence). Bayesian approaches can also use simulation-based sample size determination, where you simulate data under different scenarios and assess whether your posterior distributions provide adequate evidence for decision-making.
Connected Techniques for Robust Experiments
Power analysis is one piece of rigorous experiment design. These related techniques strengthen your statistical inference:
- A/B Testing and Statistical Significance: Understanding the hypothesis testing framework that power analysis supports — how to interpret p-values and avoid common significance testing mistakes.
- Confidence Intervals: Moving beyond binary significance to quantify precision — the Bayesian alternative to power-focused planning prioritizes credible interval width.
- Multiple Testing Correction: When you're testing multiple hypotheses simultaneously, correction methods prevent inflated false positive rates — critical for accurate power calculations.
- Sequential Analysis: Methods for peeking at results before experiment completion without inflating Type I error — requires modified power calculations for early stopping.
- Effect Size Estimation: How to translate business requirements into statistical effect sizes, and how to estimate them from pilot data or historical experiments.
Moving from Power Calculations to Principled Decisions
Power analysis is fundamentally about honesty — acknowledging what your experiment can and cannot detect. The five mistakes we started with all involve self-deception: pretending you can detect effects you can't, treating arbitrary thresholds as meaningful, or ignoring the uncertainty inherent in planning.
How much should this evidence update our beliefs? The answer depends on how much evidence you designed your experiment to collect. An underpowered test that finds p = 0.09 provides different information than a well-powered test with the same p-value. In the former case, the data are consistent with a wide range of effect sizes. In the latter, you've meaningfully narrowed the plausible range.
Let's quantify our uncertainty, not hide it. Calculate power curves that reveal your detection probability across plausible effect sizes. Acknowledge when resource constraints force you to accept low power for small effects. Use credible intervals to show what you've learned, even when results aren't "significant."
The posterior distribution tells a richer story than a single number. Whether you use frequentist power analysis or Bayesian assurance calculations, the goal is the same: design experiments that provide adequate evidence for the decisions you need to make. That starts with clear thinking about effect sizes that matter, honest assessment of your prior uncertainty, and transparent reporting of what your data can — and cannot — tell you.
Your stakeholders don't need p-values. They need answers to questions like: "Should we implement this change? What's the likely impact on revenue? How confident should we be in that estimate?" Power analysis ensures you collect enough evidence to answer those questions responsibly. Everything else is just mathematics in service of better decisions.