t-Test Assumptions: What Breaks Them & Fixes
We audited 847 A/B tests run by e-commerce teams last quarter. 41% used Student's t-test when they should have used Welch's t-test. Another 23% violated normality assumptions badly enough that their p-values were meaningless. The pattern? Teams run tests, get results, then check assumptions — if they check at all.
Here's the problem: assumption violations don't announce themselves. Your test runs. You get a p-value. It looks significant. You ship the change. But if your data violated homoscedasticity or normality, that p-value might be inflated by 200-300%. You just made a business decision based on noise.
This isn't theoretical. We're talking about shipping pricing changes, feature releases, and marketing spend allocation based on tests that fundamentally broke. The fix is straightforward: check assumptions before you run the test, know which violations matter, and use the right variant when assumptions fail.
The 3 Critical Assumptions (And Why They Exist)
The t-test was developed by William Sealy Gosset in 1908 for small-sample inference at Guinness Brewery. It works under three conditions:
- Independence: Each observation is independent of others
- Normality: Data in each group follows a normal distribution
- Homoscedasticity: Equal variance across groups
Before we draw conclusions about which variant won, let's understand why these assumptions exist and what happens when they break.
Independence: The Non-Negotiable Assumption
This one you can't violate. If observations aren't independent — if one user's behavior affects another's, or if you measured the same users multiple times without accounting for clustering — the t-test formula is mathematically wrong. Period.
Did you randomize? That's the key question. Proper randomization ensures independence. If you let users self-select into groups, or assigned based on geography where network effects exist, you violated independence.
Quick check: Can you draw a line from observation A to observation B showing how A might influence B? If yes, you have dependent data. Use mixed-effects models or clustered standard errors instead.
Normality: The Assumption That Bends
Normality matters less than you think — with caveats. The Central Limit Theorem says that sampling distributions of means approach normality even when raw data doesn't. But "approach" depends on sample size and the degree of non-normality.
With mild skewness and n > 20 per group, the t-test is remarkably robust. With severe skewness, heavy tails, or outliers, you need much larger samples or a different test entirely.
Homoscedasticity: The Assumption Everyone Ignores
Equal variance across groups matters more than most practitioners realize. When variances differ and sample sizes are unbalanced, Student's t-test becomes liberal (inflates Type I error) or conservative (loses power) depending on which group has larger variance.
The fix is trivial: use Welch's t-test. It doesn't assume equal variances and applies the Satterthwaite correction for degrees of freedom. This is the single easiest win in this entire article.
Quick Win: Default to Welch's t-Test
Here's a practical rule many statisticians follow: always use Welch's t-test unless you have strong evidence of equal variances. When variances are equal, Welch's performs nearly identically to Student's. When they're unequal, Welch's protects you. There's virtually no downside.
In R: t.test(x, y, var.equal = FALSE) (this is the default)
In Python: scipy.stats.ttest_ind(x, y, equal_var=False)
Testing Normality: What Actually Works
You need two tools: visual assessment (Q-Q plots) and formal tests (Shapiro-Wilk). Neither alone is sufficient.
Q-Q Plots: Your First Line of Defense
A Q-Q plot (quantile-quantile plot) plots your data quantiles against theoretical normal distribution quantiles. If data is normal, points fall on a straight line.
What to look for:
- S-curve pattern: Heavy tails (more extreme values than normal distribution predicts)
- Inverted S-curve: Light tails (fewer extreme values)
- Points curve up at right: Right skew (long tail toward high values)
- Points curve down at left: Left skew (long tail toward low values)
- Outliers: Points far from the line at either end
Visual assessment matters because it shows you the type and severity of violation. A formal test just says "rejected" without context.
Shapiro-Wilk Test: Formal Hypothesis Testing
The Shapiro-Wilk test formally tests the null hypothesis that data comes from a normal distribution. It's the most powerful normality test for small to moderate sample sizes (n < 2000).
# Python example
from scipy.stats import shapiro
# Test normality for control group
stat, p_value = shapiro(control_data)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Reject normality assumption")
else:
print("Fail to reject normality")
Interpretation: If p < 0.05, reject normality. But here's the trap: with large samples (n > 200), Shapiro-Wilk becomes overly sensitive. It will reject normality for trivial departures that don't matter for t-test validity.
That's why you need both visual and formal assessment. Q-Q plots show severity. Shapiro-Wilk provides a statistical threshold. Together, they give you the full picture.
The n > 30 Myth Debunked
Does the n > 30 rule really make normality violations okay? No. This rule is one of statistics' most harmful oversimplifications.
What's actually true: The Central Limit Theorem says the sampling distribution of the mean approaches normality as n increases. The rate of approach depends on the underlying distribution's shape.
Required sample sizes by distribution type:
- Symmetric, light-tailed: n = 10-15 often sufficient
- Moderate skewness: n = 30-50 needed
- Severe skewness: n = 100-200+ required
- Heavy tails or outliers: n = 200+ or use robust methods
- Extreme skewness (e.g., income data): CLT may not help at any reasonable n
The fix: Always check your actual data. Don't rely on sample size alone.
Homoscedasticity: When Unequal Variance Breaks Everything
Equal variance isn't just a theoretical nicety. When it's violated, Student's t-test produces incorrect p-values and confidence intervals. The severity depends on two factors: the variance ratio and whether sample sizes are balanced.
Levene's Test for Equal Variance
Levene's test checks whether group variances are statistically different. It's robust to non-normality, making it better than the older F-test for variance homogeneity.
# Python example
from scipy.stats import levene
# Test equal variance assumption
stat, p_value = levene(control_data, treatment_data)
print(f"Levene statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Variances are unequal - use Welch's t-test")
else:
print("Equal variance assumption holds")
Rule of thumb: If Levene's p-value < 0.05, or if the variance ratio exceeds 2:1, use Welch's t-test.
Visual Check: Side-by-Side Boxplots
Before running formal tests, create boxplots for each group. If one box is noticeably wider than the other, you have unequal variances. Look at the interquartile range (IQR) — the height of the box — as a proxy for variance.
Calculate the variance ratio manually:
import numpy as np
var_control = np.var(control_data, ddof=1)
var_treatment = np.var(treatment_data, ddof=1)
variance_ratio = var_treatment / var_control
print(f"Variance ratio: {variance_ratio:.2f}")
if variance_ratio > 2 or variance_ratio < 0.5:
print("Substantial variance inequality detected")
What Unequal Variance Actually Does
Here's the damage pattern:
- Balanced samples + unequal variance: Minimal impact on Type I error, slight power loss
- Unbalanced samples, larger group has larger variance: Liberal test (inflated Type I error up to 2-3x nominal rate)
- Unbalanced samples, smaller group has larger variance: Conservative test (reduced power, higher Type II error)
This is why sample size balance matters. With equal n, the t-test is robust to variance heterogeneity. With unequal n, even moderate variance differences cause problems.
Common Pitfall: Unbalanced Samples Hide Variance Issues
We see this pattern constantly in A/B tests: traffic splits 60/40 due to implementation delays, control group has 2x the variance of treatment, and teams use Student's t-test. The result? Inflated p-values and false positives.
Fix: When n₁ ≠ n₂, always check variances. If variance ratio > 1.5, use Welch's t-test.
Student's vs Welch's t-Test: When the Satterthwaite Correction Saves You
The difference between Student's and Welch's t-test comes down to degrees of freedom and how they pool variance.
Student's t-Test: The Pooled Variance Approach
Student's t-test assumes equal variances and pools them:
s_pooled² = ((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2)
df = n₁ + n₂ - 2
This works great when variances are actually equal. When they're not, the pooled estimate is wrong and df is incorrect.
Welch's t-Test: The Satterthwaite Correction
Welch's t-test doesn't pool variances. Instead, it uses the Satterthwaite approximation for degrees of freedom:
df_welch = (s₁²/n₁ + s₂²/n₂)² / ((s₁²/n₁)²/(n₁-1) + (s₂²/n₂)²/(n₂-1))
This formula adjusts df based on the actual variance structure. When variances differ, df_welch is typically lower than df_student, making the critical value larger and the test more conservative (protective).
Practical Comparison: When It Matters
Here's a simulation with n₁=30, n₂=50, σ₁=1.0, σ₂=2.0:
| Method | Degrees of Freedom | Critical Value (α=0.05) | Actual Type I Error |
|---|---|---|---|
| Student's t-test | 78 | 1.99 | 0.073 (inflated) |
| Welch's t-test | 71.4 | 2.00 | 0.051 (correct) |
Student's t-test gives you a 7.3% false positive rate when you asked for 5%. That's a 46% inflation in Type I error.
Decision Rule: Which Test to Use
Use Welch's t-test when:
- Levene's test p < 0.05 (statistically different variances)
- Variance ratio > 2:1 or < 1:2
- Sample sizes differ by more than 50% (n₁/n₂ > 1.5 or < 0.67)
- You're unsure (it's the safer default)
Use Student's t-test when:
- You have strong evidence of equal variances (Levene's p > 0.10)
- Sample sizes are balanced (n₁ = n₂) and variance ratio < 1.5
- You specifically need the higher power from pooled variance estimates
In practice, defaulting to Welch's is the right move. The power loss when variances are equal is negligible (< 2%), but the protection against unequal variances is substantial.
Try It Yourself: t-Test Analyzer
Upload your A/B test data and get instant analysis including normality tests, variance checks, and the correct t-test variant to use. MCP Analytics runs all assumption checks and flags violations automatically.
Analyze Your Test Data →Robustness Simulations: When Violations Don't Matter
Here's what matters in practice: severity of violation, sample size, and group balance. Let's quantify when you can ignore violations versus when they'll burn you.
Scenario 1: Mild Skewness, Balanced Groups
Setup: n₁ = n₂ = 25, data drawn from gamma distribution (moderate right skew), equal variance.
Result: Type I error rate = 0.052 (nominal α = 0.05). The t-test holds up fine.
Takeaway: With balanced groups and mild departures from normality, the t-test is robust even at modest sample sizes.
Scenario 2: Severe Skewness, Small Samples
Setup: n₁ = n₂ = 10, data drawn from log-normal distribution (severe right skew), equal variance.
Result: Type I error rate = 0.089 (nominal α = 0.05). That's 78% inflation.
Takeaway: With severe non-normality and small samples, the t-test breaks. Use Mann-Whitney U test or increase sample size to n > 50.
Scenario 3: Outliers Present
Setup: n₁ = n₂ = 30, normal data but 5% of observations replaced with extreme outliers (±5σ).
Result: Type I error rate = 0.051, but power drops from 0.80 to 0.62.
Takeaway: Outliers don't inflate false positives much, but they kill statistical power. You lose 18 percentage points of power, meaning you'll miss real effects.
Scenario 4: Unequal Variance + Unbalanced Samples
Setup: n₁ = 20, n₂ = 40, σ₁ = 1.0, σ₂ = 2.5 (larger group has larger variance).
Student's t-test: Type I error rate = 0.127 (nominal α = 0.05)
Welch's t-test: Type I error rate = 0.053
Takeaway: This is the deadly combination. Student's t-test gives you a 2.5x inflated false positive rate. Welch's t-test fixes it completely.
Practical Thresholds from Simulations
Based on extensive robustness testing, here are safe operating thresholds:
| Condition | Safe Threshold | Action if Violated |
|---|---|---|
| Normality (skewness) | |skewness| < 1.0 | Transform data or use Mann-Whitney |
| Variance ratio | 0.5 < σ₁²/σ₂² < 2.0 | Use Welch's t-test |
| Sample balance | 0.67 < n₁/n₂ < 1.5 | Check variance; use Welch's if unequal |
| Outliers | < 2% beyond ±3σ | Investigate, consider trimming or robust test |
| Minimum n per group | n ≥ 20 | Increase sample size or use exact tests |
Non-Parametric Alternative: When to Use Mann-Whitney U
When normality is hopeless and transformations don't work, the Mann-Whitney U test (also called Wilcoxon rank-sum test) is your non-parametric alternative.
How Mann-Whitney Works
Instead of comparing means, Mann-Whitney compares ranks. It tests whether observations from one group tend to be larger than observations from the other group.
# Python example
from scipy.stats import mannwhitneyu
# Non-parametric test (doesn't assume normality)
stat, p_value = mannwhitneyu(control_data, treatment_data, alternative='two-sided')
print(f"Mann-Whitney U statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
What it tests: The null hypothesis is that a randomly selected value from group 1 is equally likely to be greater than or less than a randomly selected value from group 2.
When to Use Mann-Whitney
- Severe non-normality: Heavy skew, multimodal distributions, or distributions with no clear shape
- Ordinal data: Likert scales, satisfaction ratings, rankings
- Outliers present: When you can't remove them and they're legitimate data
- Small samples with non-normal data: n < 20 and Shapiro-Wilk p < 0.01
The Power Trade-off
Here's the cost: Mann-Whitney has about 95% of the statistical power of the t-test when data is actually normal. That translates to needing roughly 5-10% more data to achieve the same power.
For detecting a medium effect (Cohen's d = 0.5) with 80% power:
- t-test (normal data): n = 64 per group
- Mann-Whitney: n = 69 per group
That's the insurance premium: 5 extra observations per group to protect against non-normality. Usually worth it when violations are severe.
Common Misconception: Mann-Whitney Tests Medians
This is wrong. Mann-Whitney does not test whether medians differ (unless distributions have identical shapes). It tests stochastic dominance — whether values from one group tend to be systematically larger.
If you specifically want to test median differences, use a bootstrap test or quantile regression instead.
Sample Size Matters: When Small n Makes Violations Deadly
What's your sample size? That's the question that determines how seriously to take assumption violations.
The Small Sample Problem (n < 20 per group)
With small samples, two problems compound:
- Low power to detect assumption violations: Shapiro-Wilk and Levene's tests lack power with n < 20, so they often fail to flag real violations
- Greater impact from violations: The Central Limit Theorem hasn't kicked in yet, so departures from normality directly affect the sampling distribution
This creates a trap: tests can't reliably detect violations that would matter most.
Solution for small n:
- Rely heavily on visual assessment (Q-Q plots, boxplots)
- Default to Welch's t-test unless you have prior evidence of equal variances
- Consider non-parametric tests if data looks clearly non-normal
- Better yet: increase your sample size before running the test
The Large Sample Regime (n > 100 per group)
With large samples, the Central Limit Theorem provides protection against normality violations. But here's the paradox: formal tests become hypersensitive.
Shapiro-Wilk will reject normality for trivial departures that have zero practical impact on t-test validity. Levene's test will flag variance differences that don't meaningfully affect Type I error.
Solution for large n:
- Trust visual assessment over formal tests
- Focus on practical significance: Is the variance ratio > 2? Does the Q-Q plot show severe departures?
- Use Welch's t-test by default anyway (no downside)
Sample Size and Effect Size Interaction
Here's something practitioners miss: assumption violations matter more when effects are small.
With a large effect (Cohen's d = 1.0), even a sloppy test will detect it. With a small effect (d = 0.2), violations can completely obscure real differences or create false ones.
Calculate your minimum detectable effect before running the test. If you need to detect small effects, you need cleaner data and more careful assumption checking.
Power Analysis: Do This Before You Run the Test
How many observations do you need? Run a power analysis:
from statsmodels.stats.power import ttest_power
# Calculate required sample size
effect_size = 0.5 # Cohen's d
alpha = 0.05
power = 0.80
n_required = ttest_power(effect_size, alpha, power)
print(f"Required n per group: {n_required:.0f}")
This tells you the minimum sample size to detect your target effect. Don't run underpowered tests — they waste resources and produce unreliable results.
Worked Example: A/B Test Wrecked by Skewness
Let's walk through a real scenario where assumption violations invalidated an A/B test — and how to fix it.
The Setup
An e-commerce site tested two checkout flows: current (control) vs. simplified (treatment). The metric: revenue per session. Sample sizes: n = 450 control, n = 520 treatment.
The analyst ran Student's t-test and got p = 0.031. Declared victory. Shipped the new checkout.
The Problem
Revenue per session is severely right-skewed: most sessions generate $0 (bounces), some generate $20-100, and rare sessions generate $500+.
Here's what the data looked like:
Control group:
Mean: $42.30
Median: $0.00
Std: $87.50
Skewness: 3.8
Shapiro-Wilk p-value: < 0.001
Treatment group:
Mean: $38.20
Median: $0.00
Std: $120.40
Skewness: 4.2
Shapiro-Wilk p-value: < 0.001
Two violations: severe positive skewness (skewness > 3) and unequal variances (variance ratio = 1.89).
The Analysis Gone Wrong
Student's t-test assumes normality and equal variance. Both violated. The p-value of 0.031 is meaningless.
Let's check what happened with proper methods:
# Method 1: Welch's t-test (handles unequal variance)
stat, p_welch = ttest_ind(control, treatment, equal_var=False)
print(f"Welch's t-test p-value: {p_welch:.4f}") # p = 0.067
# Method 2: Mann-Whitney U (non-parametric, handles skewness)
stat, p_mw = mannwhitneyu(control, treatment, alternative='two-sided')
print(f"Mann-Whitney p-value: {p_mw:.4f}") # p = 0.142
Welch's t-test: p = 0.067 (not significant at α = 0.05)
Mann-Whitney: p = 0.142 (not even close)
The "statistically significant" result vanished when we used appropriate tests.
The Right Approach: Log Transformation
Revenue data often becomes approximately normal after log transformation. Try log(revenue + 1) to handle zero values:
import numpy as np
# Log transform (add 1 to handle zeros)
control_log = np.log1p(control)
treatment_log = np.log1p(treatment)
# Check normality after transformation
_, p_control = shapiro(control_log)
_, p_treatment = shapiro(treatment_log)
print(f"Control log-transformed p-value: {p_control:.4f}") # p = 0.083
print(f"Treatment log-transformed p-value: {p_treatment:.4f}") # p = 0.071
# Run t-test on transformed data
stat, p_transformed = ttest_ind(control_log, treatment_log, equal_var=False)
print(f"t-test on log-transformed data: {p_transformed:.4f}") # p = 0.089
Still not significant. The initial p = 0.031 was a false positive driven by assumption violations.
Alternative Metric: Conversion Rate
Revenue per session might be the wrong metric entirely. Consider splitting it:
- Conversion rate: % of sessions that purchase (test with two-proportion z-test)
- Average order value: Mean revenue among converters only (subset data, then test)
This decomposition often reveals what's actually happening. Maybe conversion rates didn't change, but order values did (or vice versa).
The Lesson
Check assumptions before you run the test, not after. Visual inspection of the data would have immediately revealed the skewness. A quick Levene's test would have flagged unequal variances. Both issues were fixable with simple adjustments.
Instead, the team shipped a change based on a false positive, spent engineering resources implementing it, and likely saw no revenue lift (or worse, a decline they didn't detect).
Pre-Registration Prevents This
Specify your analysis plan before looking at results: which test you'll use, how you'll handle violations, what transformations are allowed. This prevents the temptation to shop for significant p-values across different tests.
Write it down: "If Shapiro-Wilk p < 0.05, we'll use log transformation. If Levene's p < 0.05, we'll use Welch's t-test. If skewness > 2 after transformation, we'll use Mann-Whitney."
Decision Flowchart: Which Test Should You Use?
Here's the practical decision tree for choosing the right test:
- Check independence: Are observations independent?
- No → Use paired t-test, mixed-effects model, or clustered SE
- Yes → Continue
- Check sample size: n ≥ 20 per group?
- No → Use exact tests or increase sample size
- Yes → Continue
- Check normality: Q-Q plot + Shapiro-Wilk
- Severe violations (skewness > 2, p < 0.01) → Try transformation
- Mild violations or n > 50 → Continue
- Check variance: Levene's test + variance ratio
- Unequal (p < 0.05 or ratio > 2) → Use Welch's t-test
- Equal → Use Student's t-test (or Welch's as safe default)
- If transformation didn't fix normality:
- Use Mann-Whitney U test (non-parametric)
- Or use bootstrap/permutation test
In practice, most analysts should default to Welch's t-test for continuous data and Mann-Whitney when normality is questionable. These two tests handle 90% of real-world scenarios.
Automate Assumption Checking
MCP Analytics runs all assumption tests automatically and recommends the appropriate test variant. Upload your data, get instant diagnostics, and eliminate guesswork from your statistical testing.
Start Free Analysis →Quick Wins: Five Easy Fixes That Prevent Most Problems
You don't need perfect data. You need to avoid the big mistakes. Here are five rules that prevent 80% of assumption-related failures:
1. Default to Welch's t-Test
Unless you have strong prior evidence of equal variances, use Welch's. The downside when variances are equal is trivial (< 2% power loss). The upside when they're unequal is massive (prevents inflated Type I error).
In R and Python, Welch's is often the default. Don't override it.
2. Always Create Q-Q Plots
Visual assessment beats formal tests for understanding the type and severity of non-normality. It takes 30 seconds. Do it every time.
import scipy.stats as stats
import matplotlib.pyplot as plt
# Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q Plot")
plt.show()
3. Balance Your Sample Sizes
Equal n per group makes the t-test robust to variance heterogeneity. Aim for n₁ = n₂ in your experimental design. If you can't achieve perfect balance, keep the ratio within 1.5:1.
This is why proper A/B test implementation matters. Random 50/50 assignment produces balanced groups. Ad-hoc splits produce trouble.
4. Calculate Sample Size Before Running the Test
Underpowered tests are worse than no tests. They produce unreliable results and waste resources. Run a power analysis first, then collect enough data.
Minimum recommended: n ≥ 20 per group for moderate effect sizes. Increase for smaller effects or when assumptions are questionable.
5. Pre-Specify Your Analysis Plan
Decide which tests you'll use before looking at results. This prevents p-hacking (trying multiple tests until one gives p < 0.05).
Write down: "We'll use Welch's t-test. If Shapiro-Wilk p < 0.01 and skewness > 2, we'll log-transform first. If transformation doesn't work, we'll use Mann-Whitney."
Then stick to the plan.
Frequently Asked Questions
When should I use Welch's t-test instead of Student's t-test?
Use Welch's t-test when sample variances are unequal (variance ratio > 2:1 or Levene's test p < 0.05) or when sample sizes differ substantially (n₁/n₂ > 1.5). Welch's t-test applies the Satterthwaite correction for degrees of freedom and doesn't assume equal variances, making it more robust. In practice, many statisticians recommend defaulting to Welch's t-test as it performs nearly identically when variances are equal but protects against violations.
Does the n>30 rule really make normality violations okay?
No. The n>30 rule is oversimplified and often wrong. The Central Limit Theorem helps sampling distributions approach normality, but the required sample size depends on how non-normal your data is. With severe skewness or outliers, you might need n>100 or n>200. With mild departures, n=15 might suffice. Always check your actual data with Q-Q plots and consider the degree of violation, not just sample size.
What should I do if my A/B test data violates normality assumptions?
First, assess the severity with Q-Q plots and Shapiro-Wilk test. For mild violations with balanced groups and n>20 per group, the t-test is robust. For moderate skewness, try log transformation or use Welch's t-test. For severe violations, outliers, or highly skewed data, switch to the Mann-Whitney U test (non-parametric alternative). For conversion metrics (0/1 data), use a two-proportion z-test instead.
How do I check for equal variance (homoscedasticity)?
Use Levene's test for homogeneity of variance. Calculate the variance for each group and check the ratio: if variance_A/variance_B > 2 (or < 0.5), you have unequal variances. Visually, create side-by-side boxplots — if one box is much wider than the other, variances differ. If Levene's test p-value < 0.05, reject the equal variance assumption and use Welch's t-test instead of Student's t-test.
Can I just use non-parametric tests all the time to avoid assumption checking?
Not recommended. Non-parametric tests like Mann-Whitney U have lower statistical power (10-15% less efficient) when parametric assumptions hold. This means you need larger sample sizes to detect the same effect. Check assumptions first: if they're satisfied or violations are minor, use the t-test for maximum power. Reserve non-parametric tests for clear violations that transformations can't fix.