When testing multiple hypotheses simultaneously, data scientists face a critical challenge: how to control false positives without sacrificing statistical power. The Holm-Bonferroni method offers an elegant solution, but many practitioners fall into common traps that undermine their analyses. This guide compares the Holm approach to alternative correction methods and reveals the mistakes that can lead to misleading conclusions in your data-driven decisions.
The Holm-Bonferroni method is a step-down procedure that controls the family-wise error rate (FWER) when performing multiple hypothesis tests, offering strictly more statistical power than the standard Bonferroni correction.
Multiple testing corrections are essential in modern analytics. Whether you're running A/B tests across different user segments, analyzing clinical trial outcomes, or exploring correlations in high-dimensional datasets, uncorrected multiple comparisons inflate your risk of false discoveries. The Holm method provides stronger statistical power than traditional Bonferroni correction while maintaining rigorous family-wise error rate control.
What is the Holm-Bonferroni Method?
The Holm-Bonferroni method, proposed by Sture Holm in 1979, is a step-down sequential procedure for controlling the family-wise error rate (FWER) when conducting multiple hypothesis tests. Unlike the standard Bonferroni correction that uses a single adjusted significance threshold for all tests, Holm employs a sequential process with decreasing alpha thresholds.
The method works by ordering your p-values from smallest to largest and comparing each against progressively less stringent thresholds. This sequential approach makes Holm uniformly more powerful than Bonferroni while guaranteeing the same FWER protection.
Understanding Family-Wise Error Rate
Before diving into the mechanics, it's essential to understand what FWER means. The family-wise error rate is the probability of making at least one Type I error (false positive) across all your hypothesis tests. If you set your significance level at alpha equals 0.05 and conduct 20 independent tests, your chance of at least one false positive jumps to approximately 64 percent without correction.
FWER control ensures that across your entire family of tests, the probability of any false rejection remains at or below your chosen alpha level. This stringent criterion is crucial when false positives carry serious consequences, such as in medical research or regulatory decisions.
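The inflation quoted above follows directly from the complement rule for independent tests. A minimal sketch in plain Python (no libraries needed):

```python
# FWER for m independent tests when every null hypothesis is true:
# P(at least one false positive) = 1 - (1 - alpha)^m
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: FWER = {fwer(m):.1%}")
# m = 20 gives roughly 64%, the figure cited above
```

Running the loop makes the problem vivid: by 100 tests, at least one false positive is nearly certain without correction.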
The Holm Procedure Step-by-Step
Here's how the Holm method works in practice:
- Order your p-values: Arrange all m p-values from smallest to largest, denoted as p(1), p(2), ..., p(m)
- Start with the smallest p-value: Compare p(1) against alpha/m
- If p(1) ≤ alpha/m: Reject that hypothesis and proceed to the next test
- Compare p(2) against alpha/(m-1): If significant, continue; otherwise, stop
- Continue sequentially: For the i-th ordered p-value, compare against alpha/(m-i+1)
- Stop at first non-rejection: When a p-value exceeds its threshold, retain that hypothesis and all remaining ones
The critical insight is that each successive comparison uses a less conservative threshold, allowing more power to detect true effects after controlling for multiple testing.
Key Mathematical Formula
For the i-th smallest p-value, the adjusted threshold is:
alpha_i = alpha / (m - i + 1)
Where alpha is your significance level (typically 0.05), m is the total number of tests, and i is the rank of the p-value (1 for smallest).
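As a quick sanity check, the full ladder of thresholds for a family of m = 5 tests at alpha = 0.05 can be generated in one line:

```python
alpha, m = 0.05, 5
# Holm threshold for the i-th smallest p-value (i is 1-indexed)
thresholds = [alpha / (m - i + 1) for i in range(1, m + 1)]
print([round(t, 4) for t in thresholds])
# [0.01, 0.0125, 0.0167, 0.025, 0.05]
```

Note that the first threshold equals the Bonferroni threshold (alpha/m) and the last equals the uncorrected alpha, which is where Holm's extra power comes from.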
| Method | Controls | Procedure | Power | Best For |
|---|---|---|---|---|
| Bonferroni | FWER | Single-step (α/m) | Lowest | Quick conservative correction |
| Holm-Bonferroni | FWER | Step-down | Higher than Bonferroni | Default FWER control |
| Hochberg | FWER | Step-up | Higher than Holm | Independent or positively dependent tests |
| Benjamini-Hochberg | FDR | Step-up | Highest | Exploratory analysis, many tests |
Comparing Approaches: Holm vs. Other Multiple Testing Corrections
Understanding when to use Holm requires comparing it to alternative approaches. Each method makes different tradeoffs between statistical power, computational complexity, and the type of error control.
Holm vs. Bonferroni: A Clear Winner
The traditional Bonferroni correction divides your alpha level equally among all m tests, using alpha/m as the threshold for each. While simple to implement, this approach is overly conservative because it doesn't account for the sequential nature of hypothesis testing.
The Holm method dominates Bonferroni in every scenario. It's uniformly more powerful, meaning it rejects every hypothesis that Bonferroni rejects (and sometimes more) while maintaining identical FWER control. The only advantage of Bonferroni is computational simplicity, but with modern statistical software, this distinction is negligible.
Consider testing 10 hypotheses at alpha equals 0.05. Bonferroni uses 0.005 as the threshold for all tests. Holm starts at 0.005 for the smallest p-value but increases to 0.0056 for the second, 0.00625 for the third, and so on. These incremental increases compound to provide substantially more power.
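The two threshold ladders for this 10-test example can be compared directly; this sketch simply evaluates the two formulas:

```python
alpha, m = 0.05, 10
bonferroni = [alpha / m] * m                 # 0.005 for every test
holm = [alpha / (m - i) for i in range(m)]   # i = 0 for the smallest p-value
print([round(t, 5) for t in holm[:3]])
# [0.005, 0.00556, 0.00625]
```

Every Holm threshold is at least as large as the Bonferroni threshold, which is exactly why Holm can never be less powerful.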
Holm vs. False Discovery Rate Methods
False Discovery Rate (FDR) methods like Benjamini-Hochberg take a fundamentally different approach to multiple testing. Instead of controlling the probability of any false positive (FWER), FDR controls the expected proportion of false discoveries among all rejections.
The comparison between Holm and FDR methods involves a strategic choice:
- Choose Holm when: Each false positive has serious consequences, you're conducting confirmatory analysis, regulatory or medical decisions depend on results, or you have a moderate number of tests (under 100)
- Choose FDR when: You're conducting exploratory analysis, working with genomics or high-dimensional data, can tolerate some false discoveries in exchange for more true discoveries, or have a large number of tests (hundreds or thousands)
FDR methods are more powerful than FWER methods like Holm, but they allow more false positives in absolute terms. This tradeoff makes sense in discovery science but is inappropriate for decision-making contexts where each error carries significant cost.
Holm vs. Hochberg and Hommel
The Hochberg method is a step-up procedure (it starts from the largest p-value and works down) that's slightly more powerful than Holm, but its FWER guarantee only holds when the tests are independent or positively dependent. The Hommel method is even more powerful but computationally intensive.
For most practical applications, Holm provides the best balance. It requires no independence assumptions beyond what you need for basic hypothesis testing, it's computationally efficient, and the power difference from more complex methods is typically minimal.
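The power ordering described above is easy to see empirically by running the same family of p-values through statsmodels' `multipletests` with each correction. Note the library's method names: Hochberg's step-up is `'simes-hochberg'` and Benjamini-Hochberg is `'fdr_bh'`.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p = np.array([0.003, 0.018, 0.042, 0.089, 0.234])
for method in ('bonferroni', 'holm', 'simes-hochberg', 'fdr_bh'):
    reject, p_adj, _, _ = multipletests(p, alpha=0.05, method=method)
    print(f"{method:>14}: {int(reject.sum())} rejection(s), adj p = {np.round(p_adj, 3)}")
```

On this particular family the three FWER methods happen to agree (one rejection each) while FDR control admits a second discovery; with other p-value configurations Holm and Hochberg pull ahead of Bonferroni.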
When to Use the Holm Method
The Holm-Bonferroni method excels in specific analytical contexts. Understanding these use cases helps you choose the right correction method for your data-driven decisions.
Ideal Scenarios for Holm
A/B Testing Multiple Variants: When you're testing multiple website designs, email campaigns, or product features against a control, Holm controls the risk that any of your "winning" variants is actually a false positive. This is critical when implementation costs are high.
Clinical Trials with Multiple Endpoints: Medical research often evaluates treatments across several outcomes. Regulatory agencies require FWER control to ensure reported benefits are genuine, making Holm an appropriate choice.
Quality Control Testing: Manufacturing processes with multiple quality metrics benefit from Holm's conservative approach. You can't afford false positives that suggest a problem where none exists.
Subgroup Analysis: When analyzing treatment effects across demographic segments, age groups, or customer cohorts, Holm prevents spurious subgroup findings that could misdirect strategy.
When to Consider Alternatives
Holm may not be optimal in certain situations:
Exploratory Genomics: With thousands or millions of tests, FWER methods become too conservative. FDR approaches better balance discovery and error control in gene expression studies.
Highly Correlated Tests: When your tests are positively correlated (common in financial time series or spatial data), Holm becomes overly conservative. Permutation-based methods may provide better power while maintaining valid error control.
Pre-specified Primary Endpoints: If you have a single primary outcome and several secondary endpoints, hierarchical testing procedures may be more appropriate than simultaneous correction methods like Holm.
Key Assumptions and Requirements
Like all statistical methods, the Holm procedure rests on important assumptions. Violating these can invalidate your results or reduce statistical power.
Independence Assumptions
The Holm method was originally derived assuming independent test statistics. However, research has shown that Holm maintains valid FWER control under positive dependence structures as well. This robustness is a significant advantage over some alternative procedures.
For tests with arbitrary dependence structures, Holm may be conservative (controlling FWER below the nominal level) but remains valid. This means you're protected against false positives even with correlated tests, though you may sacrifice some statistical power.
Pre-specification of the Hypothesis Family
A critical assumption is that you define your family of hypotheses before examining the data. Adding tests after seeing results (data snooping) invalidates the FWER control. This requirement demands discipline in your analytical workflow.
Document your testing plan before analysis begins. Specify which comparisons you'll make, which metrics you'll evaluate, and which subgroups you'll examine. Post-hoc additions require separate correction or should be treated as exploratory.
Proper P-value Calculation
The Holm method assumes your individual p-values are calculated correctly. This means verifying that:
- Sample sizes are adequate for your chosen tests
- Distributional assumptions are met (normality, equal variances, etc.)
- Tests are two-tailed unless you have strong prior directional hypotheses
- P-values account for any necessary corrections (continuity corrections, degrees of freedom adjustments)
Garbage in, garbage out applies here. Holm can't fix problems with your underlying statistical tests.
Python Implementation: Step-by-Step Walkthrough
While the Holm procedure is straightforward in concept, seeing it implemented line by line builds intuition for what the algorithm actually does at each step. The following walkthrough uses a realistic A/B testing scenario: an e-commerce team has run six simultaneous experiments on their site and needs to determine which results survive multiple testing correction.
Setting Up the Problem
Suppose your team ran six A/B tests over the same two-week window: checkout flow redesign, header layout, CTA button color, pricing page layout, testimonial placement, and search bar position. Each test produced a two-proportion z-test p-value. Before looking at the results, you pre-registered the Holm-Bonferroni procedure at alpha equals 0.05 to control the family-wise error rate across all six comparisons.
```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

# Six A/B tests: checkout flow, header design, CTA color,
# pricing layout, testimonial placement, search bar position
tests = ['Checkout Flow', 'Header Design', 'CTA Color',
         'Pricing Layout', 'Testimonials', 'Search Bar']
p_values = np.array([0.001, 0.009, 0.025, 0.040, 0.120, 0.350])
alpha = 0.05
m = len(p_values)

# Step 1: Sort p-values and track which test each belongs to
sorted_idx = np.argsort(p_values)
sorted_p = p_values[sorted_idx]
sorted_tests = [tests[i] for i in sorted_idx]

# Step 2: Apply Holm step-down procedure manually
print("Holm-Bonferroni Step-Down Procedure")
print("=" * 60)
rejected = []
for i in range(m):
    threshold = alpha / (m - i)
    reject = sorted_p[i] <= threshold
    status = "REJECT" if reject else "RETAIN (stop)"
    print(f"Step {i+1}: {sorted_tests[i]}")
    print(f"  p = {sorted_p[i]:.4f} vs threshold = {threshold:.4f} -> {status}")
    if not reject:
        break
    rejected.append(sorted_tests[i])

print(f"\nRejected hypotheses: {rejected}")
print(f"Retained (not significant): {[t for t in sorted_tests if t not in rejected]}")
```
Interpreting the Manual Output
Running the code above produces the following step-by-step trace:
```
Holm-Bonferroni Step-Down Procedure
============================================================
Step 1: Checkout Flow
  p = 0.0010 vs threshold = 0.0083 -> REJECT
Step 2: Header Design
  p = 0.0090 vs threshold = 0.0100 -> REJECT
Step 3: CTA Color
  p = 0.0250 vs threshold = 0.0125 -> RETAIN (stop)

Rejected hypotheses: ['Checkout Flow', 'Header Design']
Retained (not significant): ['CTA Color', 'Pricing Layout', 'Testimonials', 'Search Bar']
```
The procedure rejects the first two tests (Checkout Flow and Header Design) because their p-values fall below the progressively relaxing thresholds. At Step 3, CTA Color's p-value of 0.025 exceeds the threshold of 0.0125, so the procedure stops. All remaining tests (CTA Color, Pricing Layout, Testimonials, Search Bar) are retained regardless of their individual p-values.
Verifying with statsmodels
It is good practice to verify manual calculations against an established library. The multipletests function from statsmodels computes Holm-adjusted p-values and a rejection mask in a single call.
```python
# Step 3: Verify with statsmodels
reject_mask, pvals_adj, _, _ = multipletests(p_values, alpha=0.05, method='holm')

# Step 4: Build a results DataFrame for reporting
results = pd.DataFrame({
    'Test': tests,
    'P-value': p_values,
    'Adjusted P-value': pvals_adj,
    'Reject?': reject_mask
})

# Add the Holm threshold for each test (based on sorted rank)
thresholds = np.full(m, np.nan)
rank_in_sorted = np.argsort(np.argsort(p_values))  # rank of each test
for idx, rank in enumerate(rank_in_sorted):
    thresholds[idx] = alpha / (m - rank)
results['Holm Threshold'] = thresholds

# Sort by p-value for presentation
results = results.sort_values('P-value').reset_index(drop=True)
print("\nFull Results Table")
print(results.to_string(index=False))
```
This produces a clean summary table:
```
Full Results Table
          Test  P-value  Adjusted P-value  Reject?  Holm Threshold
 Checkout Flow    0.001            0.0060     True        0.008333
 Header Design    0.009            0.0450     True        0.010000
     CTA Color    0.025            0.1000    False        0.012500
Pricing Layout    0.040            0.1200    False        0.016667
  Testimonials    0.120            0.2400    False        0.025000
    Search Bar    0.350            0.3500    False        0.050000
```
Understanding Adjusted P-values vs. Thresholds
The table above shows both the Holm threshold (which decreases as rank increases) and the adjusted p-value from statsmodels. The two approaches are equivalent: comparing the original p-value against the Holm threshold gives the same accept/reject decision as comparing the adjusted p-value against the original alpha of 0.05. Use whichever framing is clearer for your audience. Adjusted p-values are convenient for reporting because they can all be compared to a single 0.05 cutoff.
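The equivalence is worth verifying once by hand. The Holm-adjusted p-value for each test is the running maximum of (m − rank + 1) × p over the sorted sequence, capped at 1. This sketch reproduces the statsmodels output from that definition:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p = np.array([0.001, 0.009, 0.025, 0.040, 0.120, 0.350])
m = len(p)
order = np.argsort(p)
# (m - rank) * p over the sorted sequence, made monotone with a running
# maximum and capped at 1
adj_sorted = np.minimum(np.maximum.accumulate((m - np.arange(m)) * p[order]), 1)
manual = np.empty(m)
manual[order] = adj_sorted          # undo the sort
_, library, _, _ = multipletests(p, method='holm')
print(np.allclose(manual, library))  # True
```

The adjusted p-value is the smallest alpha at which the step-down procedure would still reject that test, which is why comparing it against 0.05 gives identical decisions.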
Key Takeaways from the Code
- Sorting matters: The procedure operates on rank-ordered p-values. A sorting error cascades through every subsequent step.
- The stopping rule is strict: Once a p-value exceeds its threshold, all remaining hypotheses are retained, even if later p-values would individually pass their thresholds.
- Two tests survive correction: Although three tests (Checkout Flow, Header Design, CTA Color) had raw p-values below 0.05, only two survive Holm correction. CTA Color at p = 0.025 fails because its threshold tightens to 0.0125 after accounting for the family of six tests.
- Always verify: Comparing manual results with statsmodels output catches implementation bugs before they affect business decisions.
Common Mistakes to Avoid When Applying Holm
Despite its straightforward procedure, practitioners frequently make errors when implementing the Holm method. These mistakes can lead to incorrect conclusions and flawed business decisions.
Mistake 1: Incorrect P-value Ordering
The most fundamental error is failing to properly order p-values from smallest to largest before applying the sequential procedure. This mistake typically occurs when working with spreadsheets or custom code rather than dedicated statistical software.
Always double-check your sorting. A single misplaced p-value can cascade through the sequential procedure, leading to incorrect rejections or failures to reject. Use automated sorting functions rather than manual arrangement.
Mistake 2: Stopping the Procedure Prematurely
Some analysts mistakenly continue testing all hypotheses even after encountering a non-significant result in the sequence. The Holm procedure requires you to stop at the first p-value that exceeds its adjusted threshold and retain all remaining hypotheses.
This stopping rule is essential to the method's validity. Continuing past the first non-rejection violates the sequential nature of the procedure and can lead to inflated Type I error rates.
Mistake 3: Applying Holm to Post-hoc Comparisons
A dangerous practice is defining your hypothesis family after examining preliminary results. For example, seeing unexpected patterns in one customer segment and then deciding to test that segment separately invalidates the FWER control.
The hypothesis family must be pre-specified. If you discover interesting patterns during exploratory analysis, treat follow-up tests as a separate, prospective analysis with its own error control. Document the exploratory nature of initial findings and don't claim confirmatory evidence without proper testing.
Mistake 4: Ignoring the Multiple Testing Problem Entirely
Paradoxically, one of the biggest mistakes is not using Holm (or any correction) when multiple testing is present. Many business analysts conduct dozens of significance tests and report any p-value below 0.05 as a "finding" without correction.
This practice virtually guarantees false positives. With 20 tests, you expect one spurious significant result by chance alone. These false discoveries can lead to costly business decisions based on statistical noise rather than real effects.
Mistake 5: Confusing Adjusted P-values with Adjusted Thresholds
Some software packages report adjusted p-values (multiplying each p-value by its correction factor), while others report adjusted significance thresholds. Mixing these approaches causes confusion.
Be clear about which approach you're using. With adjusted p-values, you compare them against your original alpha. With adjusted thresholds (the Holm approach), you compare original p-values against modified thresholds. Document your method clearly in reports.
Mistake 6: Using Holm for Dependent Tests Without Justification
While Holm is robust to positive dependence, it can be overly conservative with certain dependency structures. Applying it to highly correlated tests without acknowledging the power loss represents a missed opportunity.
When you know your tests are dependent, consider whether resampling methods, permutation tests, or other approaches might provide better power while maintaining valid error control. At minimum, acknowledge the dependence structure and its implications for your results.
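For illustration, here is a minimal sketch of the Westfall-Young single-step minP idea (all data is simulated; the latent-factor construction is just one way to induce correlated metrics): permute the group labels, record the minimum p-value across all tests in each permutation, and use that null distribution to adjust every observed p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 200, 5
# five correlated metrics driven by a shared latent factor (simulated)
latent = rng.normal(size=(n, 1))
X = 0.7 * latent + rng.normal(size=(n, m))
labels = np.repeat([0, 1], n // 2)

def column_pvals(X, labels):
    # two-sample t-test for each metric (column)
    return stats.ttest_ind(X[labels == 0], X[labels == 1]).pvalue

p_obs = column_pvals(X, labels)
B = 2000
min_p = np.array([column_pvals(X, rng.permutation(labels)).min()
                  for _ in range(B)])
# adjusted p = share of permutations whose smallest p-value
# is at least as extreme as the observed one
p_adj = np.array([(min_p <= p).mean() for p in p_obs])
print(np.round(p_adj, 3))
```

Because the permutation null distribution of the minimum p-value reflects the correlation between metrics, this adjustment is typically less conservative than Holm when tests are strongly dependent.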
Critical Mistakes Summary
- Incorrectly ordering p-values before sequential testing
- Continuing the procedure after the first non-rejection
- Defining the hypothesis family after seeing data
- Skipping multiple testing correction entirely
- Confusing adjusted p-values with adjusted thresholds
- Ignoring known dependencies between tests
Interpreting Results from Holm-Bonferroni Analysis
Proper interpretation of Holm-corrected results requires understanding both what the method guarantees and what it doesn't tell you.
What Rejection Means
When the Holm procedure leads you to reject a null hypothesis, you can conclude that the effect is statistically significant after accounting for multiple testing. The probability that any of your rejections is a false positive remains controlled at your chosen alpha level across the entire family.
However, statistical significance doesn't imply practical significance. A tiny effect size might reach significance with a large sample, but lack meaningful business impact. Always pair Holm corrections with effect size reporting and confidence intervals.
What Retention Means
Failing to reject a hypothesis after Holm correction doesn't prove the null hypothesis is true. It indicates insufficient evidence to declare significance while controlling for multiple testing. The effect might be real but too small to detect with your sample size and the conservative correction.
Consider reporting confidence intervals for retained hypotheses. An effect that narrowly misses significance might still inform business decisions, especially if the confidence interval excludes meaningless effect sizes.
Reporting Your Results
Transparent reporting builds trust in your analysis. Include these elements:
- The total number of tests in your family and how you defined it
- Your chosen alpha level (typically 0.05)
- The sequential order of tests and their p-values
- Which hypotheses were rejected and at which step the procedure stopped
- Effect sizes and confidence intervals for all tests, not just significant ones
A table format works well for presenting Holm results. Show the original p-values in order, the adjusted threshold for each, and whether each hypothesis was rejected.
Real-World Example: E-commerce A/B Testing
Let's walk through a concrete example that demonstrates the Holm method in action and illustrates the comparison with uncorrected testing and Bonferroni correction.
The Business Context
An e-commerce company tests five new homepage designs against their current version, measuring conversion rate improvements. They run the experiment for two weeks with 10,000 visitors per variant, using a two-proportion z-test for each comparison at alpha equals 0.05.
The Raw Results
Here are the p-values for each design comparison:
- Design A: p = 0.003
- Design B: p = 0.018
- Design C: p = 0.042
- Design D: p = 0.089
- Design E: p = 0.234
Without Correction: Misleading Conclusions
Using the naive approach of testing each at alpha equals 0.05, three designs (A, B, and C) appear significant. The company might conclude they have three winning variants and begin implementation.
However, with five tests, the probability of at least one false positive is approximately 23 percent. The company faces a nearly one-in-four chance that at least one of their "winners" is actually no better than the current design.
With Bonferroni Correction
Bonferroni divides alpha by the number of tests: 0.05 / 5 = 0.01. Comparing each p-value against 0.01:
- Design A: 0.003 < 0.01 → Reject (significant)
- Design B: 0.018 > 0.01 → Retain
- Design C: 0.042 > 0.01 → Retain
- Design D: 0.089 > 0.01 → Retain
- Design E: 0.234 > 0.01 → Retain
Only Design A shows significance. While this controls FWER, Bonferroni may be throwing away real improvements in Design B.
With Holm-Bonferroni Method
First, order the p-values from smallest to largest (already done above). Then apply sequential thresholds:
- Design A: p = 0.003, threshold = 0.05/5 = 0.010 → 0.003 < 0.010, Reject
- Design B: p = 0.018, threshold = 0.05/4 = 0.0125 → 0.018 > 0.0125, Retain and stop
The procedure stops at Design B. We reject only Design A and retain all others (B, C, D, E).
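This hand calculation can be checked in a couple of lines with statsmodels:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p = np.array([0.003, 0.018, 0.042, 0.089, 0.234])  # Designs A-E
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method='holm')
for design, pv, pa, r in zip('ABCDE', p, p_adj, reject):
    print(f"Design {design}: p = {pv:.3f}, adjusted p = {pa:.3f}, reject = {r}")
# Only Design A is rejected (adjusted p = 0.015)
```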
Comparing the Approaches
Notice that Holm gives the same conclusion as Bonferroni in this example, but with a more nuanced approach. Design B (p = 0.018) was close to the Holm threshold of 0.0125. With a slightly larger sample size or stronger effect, Holm would have detected it while Bonferroni would still miss it.
The key insight: Design A shows robust evidence of improvement even after strict multiple testing correction. The company can confidently implement it. Designs B and C showed promising signals that might warrant follow-up testing in a separate, prospective analysis.
Business Decision
The company implements Design A immediately. They document the suggestive but non-significant results for Designs B and C, planning a focused head-to-head test between the new Design A and a refined version incorporating elements from Design B. This disciplined approach prevents false discoveries while maintaining an innovation pipeline.
Real-World Example: Clinical Trial with Multiple Endpoints
Multiple testing correction is not just a convenience in clinical research; it is a regulatory requirement. When a pharmaceutical company submits efficacy data to the FDA or EMA, each endpoint that drives a labeling claim must survive family-wise error rate control. The Holm-Bonferroni method is a common choice in this setting because it offers more power than standard Bonferroni while providing the strict FWER guarantee that regulators demand.
Trial Design
Consider a Phase III randomized, double-blind trial evaluating a new anti-inflammatory drug for rheumatoid arthritis. The trial enrolls 400 patients (200 treatment, 200 placebo) and follows them for 24 weeks. The statistical analysis plan pre-specifies five efficacy endpoints, each tested at alpha equals 0.05 with Holm-Bonferroni correction across the family:
- Pain reduction (Visual Analog Scale score change from baseline) — primary endpoint
- C-reactive protein (CRP) level (serum inflammatory marker reduction)
- Morning stiffness duration (minutes of stiffness after waking)
- Joint swelling count (number of swollen joints out of 28 assessed)
- Patient global assessment (self-reported overall disease activity on 100-point scale)
Why Holm Over FDR
In an exploratory genomics study, FDR control (Benjamini-Hochberg) would be appropriate because the goal is discovery and a small proportion of false positives is tolerable. A clinical trial is fundamentally different. Each endpoint that reaches significance may appear on the drug's approved label, directly influencing prescribing decisions for millions of patients. A false claim about joint swelling improvement, for example, could lead clinicians to choose this drug over an alternative that genuinely reduces swelling. Regulatory agencies therefore require FWER control, not FDR, for confirmatory trials.
Holm is preferred over plain Bonferroni because it provides uniformly greater power at no cost to FWER control. In a trial that cost hundreds of millions of dollars to run, the additional power to detect real endpoints translates directly into broader labeling claims and larger addressable market.
The Raw P-values
After unblinding and running pre-specified mixed-model analyses, the five endpoints yield the following p-values:
- Pain reduction (VAS): p = 0.002
- C-reactive protein: p = 0.008
- Morning stiffness: p = 0.031
- Joint swelling: p = 0.048
- Patient global assessment: p = 0.072
Without correction, four of five endpoints fall below 0.05. The temptation is to report all four as significant and seek labeling claims for each.
Applying the Holm Procedure
Order the p-values from smallest to largest and compare each against its sequential Holm threshold (alpha / (m - i + 1), where m = 5):
- Step 1 — Pain reduction: p = 0.002, threshold = 0.05 / 5 = 0.010. Since 0.002 < 0.010, reject. Proceed.
- Step 2 — CRP level: p = 0.008, threshold = 0.05 / 4 = 0.0125. Since 0.008 < 0.0125, reject. Proceed.
- Step 3 — Morning stiffness: p = 0.031, threshold = 0.05 / 3 = 0.0167. Since 0.031 > 0.0167, retain and stop.
The procedure stops at Step 3. Morning stiffness, joint swelling, and patient global assessment are all retained as non-significant after correction.
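The same three-step trace falls out of statsmodels directly, and the adjusted p-values make the regulatory summary compact (reject wherever the adjusted p is at or below 0.05):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

endpoints = ['Pain (VAS)', 'CRP', 'Morning stiffness',
             'Joint swelling', 'Patient global']
p = np.array([0.002, 0.008, 0.031, 0.048, 0.072])
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method='holm')
for e, pv, pa, r in zip(endpoints, p, p_adj, reject):
    print(f"{e:>18}: p = {pv:.3f}, adjusted p = {pa:.3f}, reject = {r}")
```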
Wait — What About Morning Stiffness?
Morning stiffness had a raw p-value of 0.031, well below the uncorrected 0.05 threshold. But its Holm threshold is 0.0167 (0.05 / 3), and it fails. This is the multiple testing correction doing its job: with five endpoints, the bar for the third-ranked p-value is nearly three times as stringent as the uncorrected threshold. The trial team notes this as a promising signal for a follow-up study but cannot claim it on the drug label.
Results Summary
The Holm procedure produces a clear partition of the five endpoints:
- Significant after Holm correction (2 of 5): Pain reduction (p = 0.002) and CRP level (p = 0.008)
- Not significant after correction (3 of 5): Morning stiffness (p = 0.031), joint swelling (p = 0.048), patient global assessment (p = 0.072)
Joint swelling is a particularly instructive case. Its raw p-value of 0.048 would have been declared significant without any correction. Under Bonferroni (threshold = 0.05 / 5 = 0.01), it clearly fails. Under Holm, it never even gets evaluated because the procedure already stopped at Step 3. Holm protected the trial from a potential false claim that uncorrected testing would have allowed.
FDA Labeling Decision
Based on the Holm-corrected results, the company submits the drug for approval with efficacy claims limited to two endpoints:
- Statistically significant reduction in pain (VAS score)
- Statistically significant reduction in C-reactive protein, an objective inflammatory biomarker
The FDA grants approval with labeling for pain reduction and CRP improvement. The label does not include claims for morning stiffness, joint swelling, or patient global assessment. The company plans a Phase IV post-marketing study to investigate the suggestive morning stiffness signal with adequate power for a single-endpoint test.
Key Lesson
This example illustrates why Holm-Bonferroni is the preferred method for confirmatory clinical trials. Without correction, the trial would have reported four significant endpoints, including joint swelling (p = 0.048) which barely crossed the 0.05 line. With Holm, two endpoints survive, and the drug's label accurately reflects only the endpoints with robust evidence. The cost of the correction is real (three fewer labeling claims), but the benefit is equally real: every claim on the label is defensible, and no patient or physician is misled by a false positive masquerading as efficacy.
Best Practices for Implementing Holm in Your Workflow
Successful application of the Holm method requires more than mathematical correctness. These best practices ensure your multiple testing correction supports better business decisions.
Pre-register Your Analysis Plan
Document your hypothesis family, significance level, and analysis approach before data collection. This prevents data snooping and creates accountability. For critical business decisions, share this plan with stakeholders so everyone understands what evidence you're seeking.
Pre-registration doesn't prohibit exploratory analysis. It simply creates a clear boundary between confirmatory tests (where Holm applies) and exploratory findings (which require separate validation).
Use Statistical Software, Not Manual Calculations
While the Holm procedure is conceptually simple, manual implementation invites errors. Use established statistical packages that implement the method correctly:
```r
# R example
p_values <- c(0.003, 0.018, 0.042, 0.089, 0.234)
p.adjust(p_values, method = "holm")
```

```python
# Python example
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.018, 0.042, 0.089, 0.234]
reject, pvals_corrected, _, _ = multipletests(p_values, method='holm')
```
These functions handle the ordering, sequential testing, and edge cases automatically.
Consider Statistical Power Before Testing
Multiple testing corrections reduce your effective power to detect true effects. Before collecting data, conduct power analysis accounting for the Holm correction. You may need larger sample sizes than you would for a single test.
If power analysis reveals you need impractically large samples, reconsider your testing strategy. Can you reduce the number of comparisons? Can you use hierarchical testing? Sometimes the answer is to run fewer, more focused tests rather than correcting for many underpowered comparisons.
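Closed-form power formulas get awkward under step-down procedures, so simulation is often the practical route. Here is a hedged sketch (all numbers hypothetical): five two-sample comparisons, two with real effects, asking how often Holm detects the d = 0.3 effect at n = 200 per group.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n, sims = 200, 1000                            # per-group n, Monte Carlo runs
effects = np.array([0.3, 0.2, 0.0, 0.0, 0.0])  # true mean shifts (hypothetical)
hits = 0
for _ in range(sims):
    a = rng.normal(0.0, 1.0, size=(n, 5))      # control group
    b = rng.normal(effects, 1.0, size=(n, 5))  # treatment group
    p = stats.ttest_ind(a, b).pvalue           # one p-value per test
    reject, *_ = multipletests(p, alpha=0.05, method='holm')
    hits += bool(reject[0])                    # detected the d = 0.3 effect?
print(f"Estimated Holm power for d = 0.3 at n = {n}/group: {hits / sims:.2f}")
```

If the simulated power comes out too low, the same loop makes it cheap to explore larger samples or a smaller hypothesis family before committing to data collection.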
Report Both Corrected and Uncorrected Results
Transparency builds credibility. Show stakeholders both the raw p-values and the Holm-corrected conclusions. This helps them understand what evidence exists and what meets your stringent significance threshold.
Effects that are significant before correction but not after represent suggestive evidence worth tracking. They might inform future testing priorities even if they don't warrant immediate action.
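A transparent report can be as simple as a table of raw and Holm-adjusted p-values side by side. The test names and p-values below are illustrative, not from a real experiment:

```python
# Sketch: side-by-side report of raw vs. Holm-adjusted p-values.
# Test names and p-values are made up for illustration.
from statsmodels.stats.multitest import multipletests

tests = ["checkout_cta", "hero_image", "pricing_copy", "email_subject", "onboarding"]
raw_p = [0.003, 0.018, 0.042, 0.089, 0.234]

reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for name, p, p_adj, sig in zip(tests, raw_p, adj_p, reject):
    status = "significant" if sig else ("suggestive" if p < 0.05 else "not significant")
    print(f"{name:15s} raw p = {p:.3f}  Holm p = {p_adj:.3f}  -> {status}")
```

Note how the "suggestive" rows surface exactly the effects that pass the raw 0.05 threshold but not the corrected one, which are worth tracking for future testing.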
Combine with Effect Size Reporting
The Holm method tells you which effects are statistically significant, not which are practically meaningful. Always report effect sizes (Cohen's d, odds ratios, conversion rate lifts) alongside significance tests.
An effect can be significant after Holm correction but too small to matter for your business. Conversely, a non-significant effect might be large enough to warrant follow-up with a larger sample.
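As a sketch of reporting both together, the snippet below runs a two-sample t-test and computes a pooled-SD Cohen's d on simulated data (the group means and sample sizes are invented for illustration):

```python
# Sketch: report significance and magnitude together. Data is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=0.10, scale=0.05, size=500)   # e.g. a conversion metric
variant = rng.normal(loc=0.11, scale=0.05, size=500)

t_stat, p_value = stats.ttest_ind(variant, control)

# Pooled-SD Cohen's d: standardized mean difference
pooled_sd = np.sqrt((control.var(ddof=1) + variant.var(ddof=1)) / 2)
cohens_d = (variant.mean() - control.mean()) / pooled_sd

print(f"p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```

The p-value feeds the Holm procedure; the effect size feeds the business decision about whether the lift is worth acting on.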
Document Your Family Definition
Be explicit about which tests constitute a family. In A/B testing, is it all tests run this month? All tests for a particular feature? All comparisons within a single experiment?
Different reasonable definitions exist. The key is consistency and transparency. Avoid the temptation to define families in ways that optimize for significant results.
Related Multiple Testing Techniques
The Holm method sits within a broader landscape of multiple testing approaches. Understanding related techniques helps you choose the right tool for each analytical challenge.
Bonferroni Correction
The Bonferroni correction is the simpler predecessor to Holm. While it's never more powerful than Holm, it remains widely used due to familiarity and ease of explanation to non-technical stakeholders. If you're using Bonferroni, consider switching to Holm for free power improvements.
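The "free power improvement" is easy to demonstrate. With the illustrative p-values below, Holm's sequential thresholds recover a rejection that Bonferroni's single alpha/m threshold misses:

```python
# Sketch: Bonferroni vs. Holm on the same illustrative p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.009, 0.012, 0.050, 0.210]

bonf_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
holm_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="holm")

print("Bonferroni rejects:", list(bonf_reject))   # p = 0.012 fails 0.05/5 = 0.01
print("Holm rejects:      ", list(holm_reject))   # p = 0.012 passes 0.05/3 ~ 0.0167
```

Bonferroni rejects two hypotheses; Holm rejects three, at the same FWER guarantee.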
Benjamini-Hochberg (FDR Control)
The Benjamini-Hochberg procedure controls the false discovery rate rather than the family-wise error rate. It's substantially more powerful than Holm in high-dimensional settings but allows some false positives. Choose this for exploratory genomics, market basket analysis, or other contexts where you're seeking signals among thousands of comparisons.
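The power gap widens as the number of tests grows. The simulation below (invented proportions: 40 null tests plus 10 true effects) sketches how Benjamini-Hochberg keeps discoveries that Holm's stricter FWER control drops:

```python
# Sketch: Holm (FWER) vs. Benjamini-Hochberg (FDR) on a simulated batch.
# The 40/10 null/effect split is illustrative.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
p_values = np.concatenate([
    rng.uniform(size=40),            # null tests: uniform p-values
    rng.uniform(0, 0.005, size=10),  # true effects: small p-values
])

holm_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="holm")
bh_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(f"Holm discoveries: {holm_reject.sum()}, BH discoveries: {bh_reject.sum()}")
```

BH's rejections always include Holm's, so switching to FDR control never loses a discovery; the question is only whether its tolerated false positives are acceptable for your context.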
Hochberg Step-Up Procedure
Hochberg reverses the Holm procedure, starting with the largest p-value and working down. It's slightly more powerful than Holm, but it is only guaranteed to control FWER when the test statistics are independent or satisfy certain positive-dependence conditions. Use Hochberg when you can verify those conditions and want maximum power under FWER control.
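The direction of the scan matters. In this deliberately constructed example (illustrative p-values), Holm stops at its second step while Hochberg's step-up scan, available in statsmodels as `simes-hochberg`, rejects everything:

```python
# Sketch: Holm step-down vs. Hochberg step-up on illustrative p-values.
# Hochberg's extra power assumes independence / positive dependence.
from statsmodels.stats.multitest import multipletests

p_values = [0.010, 0.020, 0.030, 0.040]

holm_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="holm")
hoch_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="simes-hochberg")

print("Holm:    ", list(holm_reject))   # stops when 0.020 > 0.05/3
print("Hochberg:", list(hoch_reject))   # largest p, 0.040 <= 0.05/1, rejects all
```

Holm rejects one hypothesis; Hochberg rejects all four, because its scan starts from the largest p-value and 0.040 already clears the least stringent threshold.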
Šidák Correction
The Šidák correction uses 1 - (1 - alpha)^(1/m) as the per-test threshold, accounting for the probabilistic combination of errors more precisely than Bonferroni. It's slightly less conservative than Bonferroni but requires independence. Holm generally performs better in practice.
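The difference from Bonferroni is tiny in practice. A quick sketch for a hypothetical family of m = 10 tests:

```python
# Sketch: Sidak vs. Bonferroni per-test thresholds for m = 10 tests.
alpha, m = 0.05, 10

sidak_threshold = 1 - (1 - alpha) ** (1 / m)
bonferroni_threshold = alpha / m

print(f"Sidak:      {sidak_threshold:.6f}")   # slightly above alpha/m
print(f"Bonferroni: {bonferroni_threshold:.6f}")
```

The Šidák threshold (about 0.00512) barely exceeds Bonferroni's 0.005, which is why the choice between them rarely changes conclusions.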
Permutation-Based Methods
When tests are dependent with known or estimable correlation structures, permutation methods can provide better power than Holm while maintaining valid FWER control. These computationally intensive approaches resample your data to estimate the null distribution of the maximum test statistic.
Consider permutation methods for time series analysis, spatial statistics, or other settings where correlations between tests are substantial and structured.
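A minimal max-T permutation sketch, assuming a two-group comparison on several correlated metrics (all data, group sizes, and effect sizes here are simulated for illustration):

```python
# Sketch: max-T permutation test for FWER control with correlated metrics.
# Simulated data: 5 correlated metrics, only metric 0 has a real effect.
import numpy as np

rng = np.random.default_rng(1)
n, n_metrics = 100, 5
group = np.repeat([0, 1], n // 2)
cov = 0.5 * np.ones((n_metrics, n_metrics)) + 0.5 * np.eye(n_metrics)
data = rng.multivariate_normal(np.zeros(n_metrics), cov, size=n)
data[group == 1, 0] += 0.8   # inject a group effect on metric 0

def abs_t(labels, x):
    """Absolute two-sample t statistic for each metric (column)."""
    a, b = x[labels == 0], x[labels == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs((b.mean(axis=0) - a.mean(axis=0)) / se)

observed = abs_t(group, data)
# Null distribution of the MAXIMUM |t| across metrics, via label shuffling
null_max = np.array([abs_t(rng.permutation(group), data).max()
                     for _ in range(2000)])

# Adjusted p-value per metric: how often the permutation max-T beats it
adj_p = (null_max[:, None] >= observed).mean(axis=0)
print("Adjusted p-values:", np.round(adj_p, 3))
```

Because the null distribution of the maximum statistic reflects the actual correlation between metrics, this approach spends less of the error budget than Holm's worst-case thresholds when tests are strongly dependent.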
Sequential Testing Procedures
In some contexts, you can test hypotheses sequentially rather than simultaneously, using alpha spending functions to allocate your error budget across interim analyses. This approach is common in clinical trials and can be adapted to business analytics when you need to make decisions before all data is collected.
Conclusion: Making Better Decisions with Holm
The Holm-Bonferroni method represents a best practice in multiple testing correction for confirmatory analysis. By comparing it to alternatives like Bonferroni and FDR methods, you can choose the right approach for your analytical context. The sequential nature of Holm provides more power than Bonferroni while maintaining the same rigorous FWER control.
Avoiding common mistakes is essential to valid implementation. Pre-specify your hypothesis family, order p-values correctly, apply the sequential procedure properly, and stop at the first non-rejection. These disciplines ensure your multiple testing correction supports rather than undermines your conclusions.
The Holm method shines when false positives carry serious costs and you need confirmatory evidence for decision-making. While it's more conservative than FDR approaches, this conservatism protects against costly mistakes in business strategy, medical interventions, and policy decisions.
Integrate Holm into your analytical workflow with proper planning, statistical software, and transparent reporting. Combine significance testing with effect size reporting and power analysis. Document your methods clearly so stakeholders understand both your evidence and your evidentiary standards.
As data-driven decision-making becomes more sophisticated, the multiple testing problem grows more acute. Organizations that master techniques like Holm-Bonferroni correction will make better decisions, avoid false discoveries, and build more robust analytical practices. The method's mathematical elegance translates into practical business value: fewer costly mistakes, more confident decisions, and a culture of statistical rigor.
Key Takeaways
- Sort p-values ascending, then compare each to α/(m−k+1) — reject sequentially until one fails
- Controls the family-wise error rate at α with strictly more power than standard Bonferroni
- No independence assumption required — works regardless of test correlation structure
- Use Benjamini-Hochberg instead when you can tolerate some false discoveries (FDR control) for greater power
- Pre-register your hypothesis family before analysis — adding tests post-hoc invalidates the correction
Frequently Asked Questions
What is the difference between Holm and Bonferroni methods?
The Holm method is a sequential procedure that adjusts alpha thresholds based on ordered p-values, while Bonferroni divides alpha equally among all tests. Holm is uniformly more powerful than Bonferroni while maintaining the same FWER control, meaning it can detect more true effects without increasing false positives.
When should I use Holm-Bonferroni instead of FDR methods?
Use Holm when you need strict control over family-wise error rate and can't afford even a single false positive. Choose FDR methods like Benjamini-Hochberg when you're conducting exploratory analysis with many tests and can tolerate a small proportion of false discoveries in exchange for more statistical power.
What are the most common mistakes when applying the Holm method?
The most common mistakes include incorrectly ordering p-values, stopping the sequential procedure too early, applying the method to dependent tests without justification, and defining the hypothesis family after seeing results. Always order p-values from smallest to largest and apply adjustments sequentially.
Can I use the Holm method with dependent tests?
Yes. The Holm method controls FWER under arbitrary dependence between tests; no independence assumption is required. It can, however, be conservative under some dependency structures: for strongly positively correlated tests, Holm still controls FWER but sacrifices power. For such cases, consider alternatives like permutation-based methods or resampling approaches that exploit the correlation structure.
How do I determine if my results are significant after Holm correction?
Compare each p-value to its adjusted threshold: alpha/(m-i+1) where m is total tests and i is the rank. Start with the smallest p-value and continue sequentially. The first p-value that exceeds its threshold stops the procedure, and all remaining hypotheses are retained (not significant).
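The sequential comparison described above can be spelled out in a few lines. The hypothesis names and p-values are illustrative:

```python
# Sketch: the Holm step-down procedure with explicit alpha/(m - i + 1)
# thresholds. Hypothesis names and p-values are made up for illustration.
alpha = 0.05
p_values = {"H1": 0.003, "H2": 0.018, "H3": 0.042, "H4": 0.089}
m = len(p_values)

rejected = []
for i, (name, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1]), start=1):
    threshold = alpha / (m - i + 1)
    if p <= threshold:
        rejected.append(name)
        print(f"{name}: p = {p:.3f} <= {threshold:.4f} -> reject")
    else:
        print(f"{name}: p = {p:.3f} >  {threshold:.4f} -> stop; retain the rest")
        break

print("Rejected:", rejected)
```

Here H1 passes its threshold of 0.05/4 = 0.0125, but H2 (p = 0.018) exceeds 0.05/3 ≈ 0.0167, so the procedure stops and H2 through H4 are all retained.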