You have data and a question. Revenue went up after a campaign -- was it real or noise? Two customer segments show different churn rates -- is the gap statistically meaningful? Choosing the wrong test can produce a confident-looking p-value that means nothing, or miss a genuine effect hiding in your data. This guide maps every common scenario to the right statistical test so you stop guessing and start deciding.
Why Choosing the Right Test Matters
Every hypothesis test balances two risks. A Type I error (false positive) occurs when you declare a difference real when it is actually random variation -- you roll out a pricing change that had no true effect. A Type II error (false negative) happens when you miss a real difference -- you abandon a winning campaign because the test lacked statistical power to detect its impact.
The test you choose determines how well you control these risks. Parametric tests like the t-test assume your data follows a normal distribution and offer maximum statistical power when those assumptions hold. Use them on non-normal data, however, and your p-values become unreliable -- you either see effects that are not there or miss ones that are. Nonparametric alternatives like the Mann-Whitney U test make fewer assumptions at the cost of slightly less power, but they produce trustworthy results across a wider range of data shapes.
Matching the test to the data also matters for the number of groups you compare. A t-test handles two groups. Comparing three or more groups with repeated t-tests rapidly inflates your false-positive rate: with 10 groups and 45 pairwise comparisons, the chance of at least one spurious finding at alpha = 0.05 is 1 - 0.95^45, or about 90%. ANOVA and its nonparametric counterparts exist precisely to solve this problem.
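That arithmetic is easy to verify yourself. A quick sketch in Python (assuming independent tests, which is what makes the complement rule exact):

```python
# Family-wise error rate: the chance of at least one false positive
# across k independent tests, each run at significance level alpha.
alpha = 0.05
k = 45  # pairwise comparisons among 10 groups: 10 * 9 / 2
fwer = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {fwer:.2f}")  # ~0.90
```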
Rule of Thumb
If your data is continuous and roughly normal, use parametric tests (t-test, ANOVA). If it is ordinal, heavily skewed, or has outliers, use nonparametric tests (Mann-Whitney, Kruskal-Wallis). If it is categorical (counts), use Chi-Square or Fisher's Exact. When in doubt, run both -- if they agree, report the parametric result; if they disagree, trust the nonparametric.
Quick Comparison: Every Test at a Glance
| Test | Data Type | Groups | Assumptions | Guide |
|---|---|---|---|---|
| T-Test | Continuous | 2 (independent or paired) | Normality, equal variance (Welch's relaxes this) | Read guide |
| ANOVA | Continuous | 3+ | Normality, homogeneity of variance | Read guide |
| Chi-Square Test | Categorical (counts) | 2+ categories | Expected frequency ≥ 5 per cell | Read guide |
| Mann-Whitney U | Ordinal / Non-normal continuous | 2 (independent) | Independent samples, similar shape | Read guide |
| Wilcoxon Signed-Rank | Ordinal / Non-normal continuous | 2 (paired) | Paired observations, symmetric differences | Read guide |
| Kruskal-Wallis | Ordinal / Non-normal continuous | 3+ | Independent samples | Read guide |
| Fisher's Exact Test | Categorical (counts) | 2×2 (small samples) | None (exact calculation) | Read guide |
| McNemar's Test | Categorical (paired) | 2 (before/after) | Paired binary outcomes | Read guide |
| Kolmogorov-Smirnov | Continuous | 1 or 2 distributions | Continuous data | Read guide |
| Bonferroni Correction | Any (p-value adjustment) | Multiple tests | Conservative; controls family-wise error | Read guide |
| Holm-Bonferroni | Any (p-value adjustment) | Multiple tests | Less conservative than Bonferroni; step-down | Read guide |
| Benjamini-Hochberg | Any (p-value adjustment) | Multiple tests | Controls false discovery rate, not FWER | Read guide |
| Power Analysis | Any (design planning) | Pre-test | Requires effect size estimate | Read guide |
When to Use Each Test
T-Test
Use the t-test when comparing means between exactly two groups on a continuous, approximately normal outcome. The independent-samples t-test compares two separate groups (treatment vs. control), while the paired t-test compares the same subjects measured twice (before vs. after). Welch's t-test is the safer default because it does not assume equal variances. If your sample is large (n > 30 per group), the t-test is robust to moderate non-normality thanks to the central limit theorem. Full t-test guide →
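A minimal sketch with SciPy, on illustrative data (the group sizes and numbers below are made up for demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=50)    # e.g., order values
treatment = rng.normal(loc=108, scale=20, size=50)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Paired version, for the same subjects measured before and after:
# t_stat, p_value = stats.ttest_rel(after, before)
```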
ANOVA
Use one-way ANOVA when comparing means across three or more independent groups -- for example, testing whether conversion rates differ across four landing page variants. ANOVA tells you whether at least one group differs; follow up with post-hoc tests (Tukey HSD, Bonferroni) to find which pairs. Two-way ANOVA adds a second factor and tests for interaction effects. If normality or equal-variance assumptions fail, switch to Kruskal-Wallis. Full ANOVA guide →
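In SciPy, one-way ANOVA is a one-liner; the illustrative numbers below stand in for a metric measured under four variants:

```python
from scipy import stats

# Illustrative metric for four landing page variants.
a = [2.1, 2.4, 2.2, 2.6, 2.3]
b = [2.8, 3.0, 2.7, 3.1, 2.9]
c = [2.2, 2.5, 2.4, 2.3, 2.6]
d = [2.9, 3.2, 3.0, 2.8, 3.1]

f_stat, p_value = stats.f_oneway(a, b, c, d)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A significant F only says at least one group differs; a post-hoc test such
# as statsmodels' pairwise_tukeyhsd identifies which pairs.
```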
Chi-Square Test
Use the Chi-Square test of independence when both your variables are categorical and you want to know if they are associated. Common applications include testing whether customer segment (new vs. returning) is related to purchase category, or whether conversion rates differ across traffic sources. Requires expected cell counts of at least 5; if any cell falls below this, use Fisher's Exact Test instead. Full Chi-Square guide →
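A sketch with SciPy, using an illustrative 2×2 table of counts (segment by purchase):

```python
from scipy.stats import chi2_contingency

# Rows: new vs. returning customers; columns: purchased vs. did not (illustrative).
observed = [[120, 380],
            [200, 300]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("min expected count:", expected.min())  # verify the >= 5 rule
```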
Mann-Whitney U Test
Use the Mann-Whitney U test as the nonparametric alternative to the independent-samples t-test. It compares the rank distributions of two independent groups and works well with ordinal data, skewed distributions, or when outliers make the t-test unreliable. It tests whether one group tends to have larger values than the other. Common in A/B testing on revenue data (which is almost always right-skewed) and customer satisfaction scores. Full Mann-Whitney guide →
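A sketch with SciPy on simulated right-skewed revenue (lognormal draws stand in for real order values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
variant_a = rng.lognormal(mean=3.0, sigma=0.8, size=200)
variant_b = rng.lognormal(mean=3.1, sigma=0.8, size=200)

u_stat, p_value = stats.mannwhitneyu(variant_a, variant_b, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```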
Wilcoxon Signed-Rank Test
Use the Wilcoxon signed-rank test as the nonparametric alternative to the paired t-test. It compares two related measurements -- before and after an intervention, or matched pairs -- when the differences are not normally distributed. Particularly useful for small samples where normality is hard to verify, or when working with ordinal rating scales. Full Wilcoxon guide →
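A sketch with SciPy; the before/after page-load times are illustrative:

```python
from scipy import stats

# Load times (seconds) for the same 10 pages before and after an optimization.
before = [12.1, 9.8, 15.2, 11.0, 13.5, 10.2, 14.8, 9.5, 12.9, 11.7]
after  = [10.3, 9.9, 12.1, 10.5, 11.0, 10.0, 12.2, 9.1, 11.5, 10.9]

w_stat, p_value = stats.wilcoxon(after, before)
print(f"W = {w_stat:.1f}, p = {p_value:.4f}")
```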
Kruskal-Wallis Test
Use the Kruskal-Wallis test as the nonparametric alternative to one-way ANOVA. It compares rank distributions across three or more independent groups when normality assumptions are violated. Follow up a significant result with Dunn's post-hoc test to identify which pairs differ. Common in comparing customer satisfaction scores, response times, or order values across multiple segments. Full Kruskal-Wallis guide →
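A sketch with SciPy; the three segments below contain illustrative, outlier-heavy order values:

```python
from scipy import stats

seg_a = [12, 15, 14, 110, 13, 16, 18]   # note the outlier
seg_b = [22, 25, 21, 24, 200, 23, 26]
seg_c = [14, 13, 15, 12, 16, 14, 15]

h_stat, p_value = stats.kruskal(seg_a, seg_b, seg_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
# Follow a significant result with Dunn's test, e.g. posthoc_dunn from the
# scikit-posthocs package.
```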
Fisher's Exact Test
Use Fisher's Exact Test for 2×2 contingency tables when sample sizes are small (any expected cell count below 5). Unlike the Chi-Square test, Fisher's computes the exact probability rather than relying on an approximation, so it is always valid regardless of sample size. Use it for rare events, small A/B tests, or medical/safety data where precision matters more than computational convenience. Full Fisher's Exact guide →
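A sketch with SciPy on an illustrative small-sample 2×2 table:

```python
from scipy.stats import fisher_exact

# Conversions vs. non-conversions in a small A/B test (illustrative counts).
table = [[3, 22],   # variant A: 3 converted, 22 did not
         [9, 16]]   # variant B: 9 converted, 16 did not

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```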
McNemar's Test
Use McNemar's test when you have paired binary outcomes -- the same subjects measured on a yes/no variable at two time points. It tests whether the proportion of "yes" changed between measurements. Common applications include before/after surveys ("Did you purchase? Yes/No"), diagnostic test comparisons, and matched case-control studies. Full McNemar's guide →
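A sketch using statsmodels; the counts are illustrative, and note the table is laid out before-by-after:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired yes/no outcomes: purchased before vs. after a redesign.
# Rows = before (yes, no); columns = after (yes, no).
table = [[30, 10],   # yes -> yes, yes -> no
         [25, 35]]   # no -> yes,  no -> no

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```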
Kolmogorov-Smirnov Test
Use the Kolmogorov-Smirnov (KS) test to compare an observed distribution against a theoretical distribution (one-sample) or to compare two observed distributions (two-sample). It is sensitive to differences in both location and shape. In business analytics, the KS test is widely used to validate model calibration, detect data drift, and check normality assumptions before applying parametric tests. Full KS test guide →
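A sketch with SciPy covering both forms; the two "weeks" of data are simulated to illustrate drift detection:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
last_week = rng.normal(loc=50, scale=10, size=1000)
this_week = rng.normal(loc=52, scale=12, size=1000)  # shifted and wider

# Two-sample KS: has the metric's distribution changed between windows?
ks_stat, p_value = stats.ks_2samp(last_week, this_week)
print(f"D = {ks_stat:.3f}, p = {p_value:.4g}")

# One-sample KS against a fully specified normal. Caveat: estimating the
# parameters from the same data invalidates the standard p-value.
ks_stat, p_value = stats.kstest(last_week, "norm", args=(50, 10))
```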
Power Analysis
Use power analysis before you run any test to determine how many observations you need to detect a meaningful effect. It links four quantities: sample size, effect size, significance level (alpha), and statistical power (1 - beta). Skipping this step is the most common reason tests return inconclusive results -- the experiment simply did not have enough data. Run power analysis during experiment design, not after. Full power analysis guide →
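A sketch with statsmodels for an independent two-sample t-test design; the effect size (Cohen's d = 0.3) is the assumption you must supply:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group to detect d = 0.3 at alpha = 0.05 with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"n per group ≈ {n_per_group:.0f}")  # ~175
```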
Decision Flowchart: Choosing the Right Test
Start at the top and follow the branches based on your data characteristics.
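If you prefer code to diagrams, the same decision logic can be sketched as a small helper. This `suggest_test` function is illustrative, not a library API, and it deliberately oversimplifies (real data may need a normality check first):

```python
def suggest_test(data_type: str, groups: int, paired: bool = False) -> str:
    """Map data characteristics to a test, mirroring the flowchart."""
    if data_type == "categorical":
        if paired:
            return "McNemar's test"
        return "Chi-Square (or Fisher's Exact if any expected count < 5)"
    if data_type == "normal":        # continuous, roughly normal
        if groups == 2:
            return "paired t-test" if paired else "Welch's t-test"
        return "one-way ANOVA"
    if data_type == "non-normal":    # ordinal, skewed, or outlier-heavy
        if groups == 2:
            return "Wilcoxon signed-rank" if paired else "Mann-Whitney U"
        return "Kruskal-Wallis"
    raise ValueError("data_type must be categorical, normal, or non-normal")

print(suggest_test("normal", 2))       # Welch's t-test
print(suggest_test("non-normal", 3))   # Kruskal-Wallis
```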
Multiple Comparisons: When and How to Correct
Every time you run a hypothesis test at alpha = 0.05, you accept a 5% chance of a false positive. Run 20 tests on the same dataset and you expect one spurious "significant" result even if no real effects exist. Multiple comparison corrections control this inflation so your findings remain trustworthy.
When You Must Correct
Correction is mandatory when you test multiple hypotheses on the same dataset and report any result as significant. This includes: post-hoc pairwise comparisons after ANOVA, testing multiple outcome metrics in a single experiment, subgroup analyses that were not pre-specified, and running the same test across many segments (e.g., testing conversion lift in each of 15 countries).
Bonferroni Correction
The simplest and most conservative approach: divide your significance threshold by the number of tests. With 10 comparisons at alpha = 0.05, each individual test must reach p < 0.005 to be declared significant. Easy to implement and explain, but overly strict when you have many tests -- it dramatically increases the risk of missing real effects. Best for small numbers of planned comparisons where false positives are costly. Full Bonferroni guide →
Holm-Bonferroni Method
A step-down procedure that is uniformly more powerful than Bonferroni while still controlling the family-wise error rate. Sort p-values from smallest to largest and compare each to a progressively less strict threshold. It detects more true effects than Bonferroni with the same Type I error guarantee, making it the recommended default for most applications. There is no reason to use plain Bonferroni when Holm-Bonferroni is available. Full Holm-Bonferroni guide →
Benjamini-Hochberg Procedure
Controls the false discovery rate (FDR) rather than the family-wise error rate, making it less conservative and more suitable for exploratory analyses with many tests. Instead of guaranteeing that no false positive slips through, it controls the expected proportion of false positives among rejected hypotheses. Use it when you are screening many variables for potential effects (e.g., which of 200 product features predict churn) and can tolerate some false leads in exchange for detecting more real effects. Full Benjamini-Hochberg guide →
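All three corrections are one call away in statsmodels. The sketch below runs the same set of illustrative p-values through each method so you can see the counts of significant results diverge:

```python
from statsmodels.stats.multitest import multipletests

# Ten raw p-values from related tests (illustrative).
p_values = [0.004, 0.0055, 0.012, 0.020, 0.030,
            0.041, 0.049, 0.060, 0.220, 0.740]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(p_values)} significant")
# bonferroni: 1, holm: 2, fdr_bh: 4 -- same data, increasingly permissive.
```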
Which Correction Should You Use?
Few planned comparisons, high stakes: Bonferroni or Holm-Bonferroni. Many comparisons, exploratory: Benjamini-Hochberg. Always prefer Holm-Bonferroni over Bonferroni -- it is strictly more powerful with the same error control. Use Benjamini-Hochberg when you are screening for discoveries and can follow up with confirmatory tests.
Beyond Frequentist Testing
Classical hypothesis tests answer a narrow question: "Is this result unlikely under the null hypothesis?" Two complementary approaches provide richer answers for business decision-making.
Bayesian A/B Testing
Instead of a binary significant/not-significant verdict, Bayesian A/B testing gives you a probability that one variant beats another -- for example, "there is a 94% probability that the new checkout flow increases revenue." This maps directly to business risk decisions and does not require fixed sample sizes, making it ideal for continuous experimentation programs. Full Bayesian A/B testing guide →
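For conversion rates, the standard Beta-Binomial model makes this a few lines of NumPy. The conversion counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed conversions: A = 480/10000, B = 530/10000 (illustrative).
# With Beta(1, 1) priors, the posteriors are Beta(conversions + 1, failures + 1).
post_a = rng.beta(480 + 1, 10000 - 480 + 1, size=100_000)
post_b = rng.beta(530 + 1, 10000 - 530 + 1, size=100_000)

prob_b_beats_a = (post_b > post_a).mean()
print(f"P(B beats A) = {prob_b_beats_a:.1%}")  # ~95%
```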
Causal Impact Analysis
When you cannot run a randomized experiment -- a policy change affects all customers simultaneously, or you need to measure the impact of an external event -- Causal Impact uses Bayesian structural time series to construct a synthetic counterfactual. It estimates what would have happened without the intervention and quantifies the causal effect with credible intervals. Essential for measuring marketing campaigns, product launches, and operational changes that lack a clean control group.
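In Python this is typically done with a port of Google's R CausalImpact package; the sketch below assumes the pycausalimpact package and simulates its input, so treat both the import path and the numbers as illustrative:

```python
import numpy as np
import pandas as pd
from causalimpact import CausalImpact  # pycausalimpact (assumed installed)

rng = np.random.default_rng(3)
x = 100 + np.cumsum(rng.normal(0, 1, 100))   # control series, unaffected by the change
y = 1.2 * x + rng.normal(0, 2, 100)          # metric of interest
y[70:] += 10                                 # simulated lift after the intervention

data = pd.DataFrame({"y": y, "x": x})
pre_period, post_period = [0, 69], [70, 99]

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())
```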
Run the Right Test on Your Data
Upload a CSV, pick a hypothesis test, and get a publication-ready report with effect sizes, confidence intervals, and plain-language interpretation -- no code required.