Analysis Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| significance_level | 0.05 |
| primary_method | holm |
| comparison_methods | bonferroni,BH,hochberg |
| min_effect_size | 0.2 |
Purpose
This analysis applies multiple comparison correction methods to 84 educational hypothesis tests comparing student performance across demographic groups (lunch type, test prep, gender, parental education, race/ethnicity). The objective is to control the family-wise error rate (FWER) and identify which differences remain statistically significant after accounting for the inflated Type I error risk inherent in conducting 84 simultaneous tests.
Key Findings
- Uncorrected FWER: 0.987 – Without correction, and assuming independent tests, there is a 98.7% probability of at least one false positive across all 84 tests, demonstrating a critical need for adjustment
- Holm-Bonferroni Rejections: 36 significant findings – Holm method retains 36 rejections while controlling FWER at α=0.05
- Bonferroni Rejections: 29 findings – More conservative; Holm yields 24% more rejections than Bonferroni (36 vs 29)
- BH (FDR) Rejections: 56 findings – Most liberal method, controlling false discovery rate rather than FWER
- Practical Significance Alignment: All 36 Holm rejections also meet the practical-significance threshold (effect size ≥0.2), so every statistically significant finding is also practically meaningful; the converse does not hold, since the results table lists 58 tests with effect size ≥0.2
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for 84 hypothesis tests undergoing multiple comparison corrections. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of family-wise error rate (FWER) control calculations that depend on the complete test set.
Key Findings
- Initial & Final Rows: 84 observations retained across both stages—no data loss occurred during preprocessing
- Retention Rate: 100% indicates complete dataset integrity with zero exclusions
- Rows Removed: 0; either the input data were already clean or no quality threshold triggered a removal
- Train/Test Split: Not applicable, as this is a multiple comparison correction analysis rather than a predictive modeling task
Interpretation
The perfect retention rate ensures that all 84 tests remain eligible for correction methods (Holm, Bonferroni, BH, Hochberg). This is essential because FWER calculations depend on the complete family of tests; removing even one test would alter the stepdown thresholds and rejection decisions. The absence of preprocessing transformations preserves the raw p-values and effect sizes needed for accurate multiple comparison control.
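The family size of 84 can be reproduced from the factor levels that appear in the results table. The sketch below is an illustration, not the report's own pipeline: `factor_levels`, `subjects`, and `test_labels` are hypothetical names, and the level lists are copied from the `test_label` column.

```python
from itertools import combinations

# Factor levels as they appear in the test_label column of the results table.
factor_levels = {
    "Lunch Type": ["free/reduced", "standard"],
    "Test Prep": ["completed", "none"],
    "Gender": ["female", "male"],
    "Race/Ethnicity": ["Group A", "Group B", "Group C", "Group D", "Group E"],
    "Parental Education": [
        "associates degree", "bachelors degree", "high school",
        "masters degree", "Some college", "Some high school",
    ],
}
subjects = ["Math", "Reading", "Writing"]

# One test per subject and per unordered pair of levels within a factor.
test_labels = [
    f"{subject}: {a} vs {b}"
    for subject in subjects
    for levels in factor_levels.values()
    for a, b in combinations(levels, 2)
]

m = len(test_labels)           # size of the test family
alpha = 0.05
print(m)                       # 3 * (1 + 1 + 1 + 10 + 15) pairwise tests
print(alpha / m)               # strictest Holm/Bonferroni threshold for this family
print(alpha / (m - 1))         # the same threshold if one test were dropped
```

Dropping a single test changes every threshold in the sequence, which is why the complete family has to be carried through preprocessing.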
Context
No train/test split is applicable here since the objective is statistical inference across a fixed test family, not predictive generalization. The analysis assumes all 84 tests
Executive Summary
Executive summary of multiple comparison correction results
| finding | value | interpretation |
|---|---|---|
| Total Tests | 84 | Number of hypothesis tests corrected |
| Holm Rejections | 36 | Tests significant after Holm correction |
| Bonferroni Rejections | 29 | Tests significant after Bonferroni (baseline) |
| BH Rejections | 56 | Tests significant under FDR control |
| Uncorrected FWER | 98.7% | Probability of at least one false positive without correction (assuming independent tests) |
| Holm Power Gain vs Bonferroni | 24.1% | Additional rejections gained by Holm relative to Bonferroni |
| Marginal Test | Writing: associates degree vs Some high school | Last test rejected by Holm procedure |
Key Findings:
• Holm rejected 7 additional tests vs Bonferroni (24.1% power gain)
• BH (FDR control) rejected 56 tests (less conservative than FWER methods)
• Marginal test: Writing: associates degree vs Some high school — the last test where Holm stopped rejecting
Recommendation: Focus on the 36 Holm-significant tests for follow-up. These results are statistically robust, with the family-wise error rate controlled at 5%.
EXECUTIVE SUMMARY: MULTIPLE COMPARISON CORRECTION ANALYSIS
Purpose
This analysis applied four statistical correction methods to 84 hypothesis tests comparing student performance across demographic groups (lunch type, test prep, gender, race/ethnicity, parental education). The objective was to identify which findings remain statistically significant after controlling for multiple testing inflation, which would otherwise produce a 98.7% probability of at least one false positive.
Key Findings
- Uncorrected FWER: 98.7% – Without correction, the analysis would almost certainly report at least one false positive across the 84 tests
- Holm-Bonferroni Rejections: 36 tests – Maintains family-wise error rate at 5% while rejecting 7 more tests than strict Bonferroni (24.1% more rejections)
- Bonferroni Rejections: 29 tests – Most conservative FWER method; rejects only the strongest signals
- BH (FDR) Rejections: 56 tests – Less stringent control; holds the expected proportion of false discoveries among rejections at 5% rather than protecting against any false positive
- Effect Size Alignment: All 36 Holm rejections have practically significant effects (≥0.2), so statistical significance implies practical significance here; 22 further tests reach the effect-size threshold without surviving Holm correction
Interpretation
The analysis successfully
Holm Step-Down Procedure
Holm step-down sequential rejection procedure with progressively relaxed thresholds
Purpose
The Holm step-down procedure controls family-wise error rate (FWER) by sequentially testing hypotheses in order of increasing p-value, with thresholds that become progressively more lenient. This section demonstrates how Holm allocates the significance budget across 84 tests while maintaining strict control over false positives—critical for identifying genuinely significant effects amid multiple comparisons.
Key Findings
- Holm Rejections: 36 of 84 tests rejected (43%); the remaining 48 (57%) are retained, reflecting conservative FWER control
- Threshold Range: α/m ≈ 0.0006 for the smallest p-value up to α = 0.05 for the largest, following α/(m-i+1) at rank i; the staircase is strictest for the earliest tests
- Marginal Test: Writing: associates degree vs Some high school (p=0.0009, rank 36, threshold 0.05/49 ≈ 0.00102) marks the rejection boundary; the next test (p=0.0015) fails its threshold of 0.05/48 ≈ 0.00104 and the procedure stops
- Raw p-value Distribution: Mean=0.09, median≈0.005 (0 at two decimal places), indicating strong clustering of significant results at the lower tail
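The step-down rule described above can be sketched in a few lines. This is a minimal illustration of the textbook Holm procedure, not the code that produced this report; `holm_threshold` and `holm_step_down` are hypothetical helper names, and the four-value example is made up.

```python
def holm_threshold(rank, m, alpha=0.05):
    """Threshold for the rank-th smallest p-value (rank starts at 1)."""
    return alpha / (m - rank + 1)


def holm_step_down(pvalues, alpha=0.05):
    """Return one boolean per input p-value: True where Holm rejects."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= holm_threshold(rank, m, alpha):
            reject[idx] = True
        else:
            break  # first failure: this test and all larger p-values are retained
    return reject


# Toy example with four made-up p-values.
example = [0.04, 0.001, 0.03, 0.012]
print(holm_step_down(example))

# Boundary reported above for m = 84.
print(holm_threshold(36, 84))  # compared against p = 0.0009
print(holm_threshold(37, 84))  # compared against p = 0.0015
```

The thresholds grow with rank, so the smallest p-value faces the Bonferroni threshold α/m and each later test faces a slightly easier one.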
Interpretation
Holm's sequential rejection strategy is more powerful than Bonferroni (24% power gain) because it relaxes thresholds for weaker tests while maintaining FWER at α=0.05. The 36 rejections represent findings robust to multiple comparison correction. The marginal test at
Method Comparison: Adjusted P-values
Raw vs adjusted p-values across all correction methods
Purpose
This section compares how four multiple-comparison correction methods adjust raw p-values to control error rates across 84 simultaneous hypothesis tests. It demonstrates the trade-off between statistical conservatism (fewer false positives) and statistical power (ability to detect true effects). Understanding these differences is critical for determining which findings remain credible after correcting for multiple testing.
Key Findings
- Holm vs. Bonferroni: Holm rejected 36 tests versus Bonferroni's 29—a 24.1% power gain while maintaining identical FWER control, demonstrating Holm's superiority as a step-down procedure.
- BH (FDR) Method: Rejected 56 tests, the most permissive approach, because it controls False Discovery Rate rather than Family-Wise Error Rate, allowing more discoveries at the cost of accepting some false positives.
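- Hochberg Performance: Matched Holm's 36 rejections. Hochberg is a step-up procedure that is never less powerful than Holm but requires independence or positive dependence among test statistics; in this dataset the extra assumption yields no additional rejections.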
- Uncorrected FWER: At 0.987, the probability of at least one false positive without correction is nearly certain across 84 tests.
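To make the Holm and Hochberg comparison concrete, the following sketch computes adjusted p-values for both procedures. It follows the standard definitions (Holm takes a running maximum from the smallest p-value upward, Hochberg a running minimum from the largest downward); the function names and the four-value example are assumptions for illustration only.

```python
def _ordered_products(pvalues):
    """Sort ascending and multiply the rank-i p-value by (m - i + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    products = [pvalues[idx] * (m - pos) for pos, idx in enumerate(order)]
    return order, products


def holm_adjust(pvalues):
    """Step-down: running maximum from the smallest p-value, capped at 1."""
    order, products = _ordered_products(pvalues)
    adjusted = [0.0] * len(pvalues)
    running = 0.0
    for idx, value in zip(order, products):
        running = max(running, value)
        adjusted[idx] = min(1.0, running)
    return adjusted


def hochberg_adjust(pvalues):
    """Step-up: running minimum from the largest p-value downward."""
    order, products = _ordered_products(pvalues)
    adjusted = [0.0] * len(pvalues)
    running = 1.0
    for idx, value in zip(reversed(order), reversed(products)):
        running = min(running, value)
        adjusted[idx] = running
    return adjusted


# Made-up example with a tie, where the two procedures disagree at alpha = 0.05.
example = [0.01, 0.02, 0.02, 0.5]
holm = holm_adjust(example)
hochberg = hochberg_adjust(example)
print(holm)
print(hochberg)
print(sum(p <= 0.05 for p in holm), sum(p <= 0.05 for p in hochberg))
```

The same mechanism explains why the results table shows slightly different adjusted values for tied raw p-values (for example 0.045 under Holm and 0.0441 under Hochberg) even though both methods end with 36 rejections.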
Interpretation
The analysis reveals a clear hierarchy: Bonferroni is most conservative, Holm and
Decision Matrix
Which tests survive under which correction methods
Purpose
This section visualizes which of the 84 hypothesis tests remain statistically significant under four different multiple comparison correction methods. It reveals the power-conservativeness tradeoff: the FWER-controlling methods (Bonferroni, Holm, Hochberg) reject fewer tests but provide stronger Type I error control, while BH detects more signals by controlling the false discovery rate instead. Understanding method agreement identifies robust findings versus marginal discoveries.
Key Findings
- Bonferroni Rejections: 29 tests (35%) – most conservative FWER control; highest false negative risk
- Holm Rejections: 36 tests (43%) – 24% more rejections than Bonferroni while maintaining FWER control
- BH Rejections: 56 tests (67%) – most liberal; controls FDR rather than FWER, yielding 93% more rejections than Bonferroni
- Hochberg Rejections: 36 tests (43%) – matches Holm; step-up improvement over Bonferroni
- Perfect Agreement: First 5 tests rejected by all methods; last 5 retained by all methods, indicating clear signal/noise separation at distribution extremes
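A decision matrix of this kind can be summarized by grouping tests into those rejected by every method, by none, or by only some. The sketch below is an assumption-laden illustration: `agreement_summary` is a hypothetical helper, and the boolean vectors are rebuilt from the reported rejection counts, relying on the fact that in the results table each method rejects a leading block of the p-value-ordered rows.

```python
def agreement_summary(decisions):
    """Group test indices by whether all, none, or only some methods reject."""
    methods = list(decisions)
    n_tests = len(decisions[methods[0]])
    summary = {"all_reject": [], "none_reject": [], "split": []}
    for i in range(n_tests):
        votes = [decisions[method][i] for method in methods]
        if all(votes):
            summary["all_reject"].append(i)
        elif not any(votes):
            summary["none_reject"].append(i)
        else:
            summary["split"].append(i)
    return summary


# Rejection counts taken from the report; rows are in results-table order.
counts = {"bonferroni": 29, "holm": 36, "hochberg": 36, "BH": 56}
n_tests = 84
decisions = {
    method: [row < k for row in range(n_tests)] for method, k in counts.items()
}

summary = agreement_summary(decisions)
for group, rows in summary.items():
    print(group, len(rows))
```

The "split" group contains the tests whose status depends on the chosen method, which is where the choice between FWER and FDR control matters most.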
Interpretation
The decision matrix demonstrates that 36 tests form a robust consensus across FWER-controlling methods
Method Comparison: Rejection Counts
Number of rejections per correction method
Purpose
This section compares the statistical power and rejection rates across four multiple comparison correction methods applied to the same 84 hypothesis tests. It demonstrates the trade-off between controlling false positives (Type I error) and detecting true effects (statistical power), which is central to choosing an appropriate correction strategy for the analysis objective.
Key Findings
- Benjamini-Hochberg (BH) Rejections: 56 tests (66.7%) – highest rejection rate, with 1.93× as many rejections as Bonferroni; controls False Discovery Rate rather than Family-Wise Error Rate
- Holm & Hochberg Rejections: 36 tests each (42.9%) – equivalent rejection counts despite different algorithmic approaches; Holm uses step-down logic while Hochberg uses step-up
- Bonferroni Rejections: 29 tests (34.5%) – most conservative method with lowest power; provides strongest control but sacrifices detection ability
- Power Differential: 27-test gap between BH (most rejections) and Bonferroni (fewest rejections) reflects the fundamental difference between FDR and FWER control philosophies
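The percentages and ratios above follow directly from the rejection counts. A small sketch of the arithmetic, where `summarize_rejections` is a hypothetical helper and the counts are those reported in this section:

```python
def summarize_rejections(rejections, total_tests, baseline="bonferroni"):
    """Rejection rate (percent) and ratio to a baseline method, per method."""
    base = rejections[baseline]
    return {
        method: {
            "rate_pct": round(100 * count / total_tests, 1),
            "ratio_to_baseline": round(count / base, 3),
        }
        for method, count in rejections.items()
    }


rejections = {"bonferroni": 29, "holm": 36, "hochberg": 36, "BH": 56}
table = summarize_rejections(rejections, total_tests=84)
for method, row in table.items():
    print(method, row)
```

Note that these ratios compare rejection counts; they equal power ratios only if every rejection is a true effect, which the data cannot confirm.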
Interpretation
The 93% increase in rejections from Bonferroni (29) to BH (56) illustrates why method selection depends on study context.
Effect Size Analysis
Practical significance alongside statistical significance
Purpose
This section validates whether statistically significant findings (after Holm-Bonferroni correction) also represent meaningful, real-world differences. Statistical significance alone can be misleading with large sample sizes; effect size analysis ensures that rejected hypotheses reflect substantive rather than trivial differences. This bridges the gap between p-values and practical importance.
Key Findings
- Practical Significance Alignment: 36 of 84 tests (43%) are both statistically significant (Holm-corrected) AND practically significant (Cohen's d ≥ 0.20), indicating robust, meaningful effects.
- Mean Effect Size: 0.35 (small-to-medium by Cohen's conventions, where 0.2 is small and 0.5 is medium), with median 0.34 and range 0.02–0.94, showing considerable variation in effect magnitudes across comparisons.
- Confidence Interval Coverage: 69% of tests flagged as "Practically significant" have confidence intervals excluding zero, strengthening evidence of real differences.
- Small Effect Concern: 26 tests show effects below the threshold (d < 0.20). None of them is rejected by Holm, so no Holm-significant finding is practically negligible. Another 22 tests reach d ≥ 0.20 without surviving Holm correction.
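The report does not state how Cohen's d was computed. A common definition divides the absolute mean difference by the pooled standard deviation, and the sketch below assumes that definition; `cohens_d` and `is_practically_significant` are hypothetical names, the score lists are made up, and the 0.2 cutoff is the `min_effect_size` from the configuration table.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Absolute standardized mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return abs(mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def is_practically_significant(d, min_effect_size=0.2):
    """Flag effects at or above the configured minimum effect size."""
    return d >= min_effect_size


# Made-up scores for two small groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, is_practically_significant(d))

# Two effect sizes from the results table, on either side of the cutoff.
print(is_practically_significant(0.2105), is_practically_significant(0.1854))
```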
Interpretation
The analysis reveals a healthy concordance between statistical and practical significance for the majority of findings. The 36 practically significant results represent genuine, meaningful differences in student outcomes across
FWER Accumulation Curve
Why correction is needed: FWER accumulation without correction
Purpose
This section quantifies the multiple testing problem: conducting 84 independent hypothesis tests at α=0.05 without correction inflates the probability of at least one false positive to 98.7%. This demonstrates why statistical correction methods are essential and motivates the application of Holm-Bonferroni and alternative procedures in the overall analysis.
Key Findings
- Uncorrected FWER: 98.7% — nearly certain to observe at least one false positive by chance alone
- Mathematical Driver: FWER = 1 − (0.95)^84 = 0.987, showing exponential accumulation of error across tests
- Critical Threshold: Even 5 tests yield 23% FWER; by 80+ tests, false positive risk approaches certainty
- Correction Necessity: Holm-Bonferroni reduces FWER back to the nominal 5% level through adjusted p-value thresholds
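The accumulation curve follows from the formula quoted above. A minimal sketch, assuming independent tests as the formula does; `uncorrected_fwer` is a hypothetical name:

```python
def uncorrected_fwer(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Points along the curve, including the values quoted in this section.
for n in (1, 5, 10, 20, 40, 84):
    print(n, round(uncorrected_fwer(n), 3))
```

When the tests are positively correlated, as pairwise comparisons on the same students are likely to be, the true uncorrected FWER is lower than this independence-based figure, so the curve is best read as an upper-end illustration.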
Interpretation
Without correction, the 84 comparisons across Math, Reading, and Writing assessments would produce a 98.7% probability of spurious significance—rendering raw p-values unreliable for decision-making. The FWER curve illustrates that error accumulation is non-linear: the first 10 tests contribute modest inflation, but by test
Complete Results Table
Complete numerical results with all p-values and decisions
| test_label | raw_pvalue | cohens_d | n_per_group | family | statistical_test | holm_adjusted_p | holm_reject | bonferroni_adjusted_p | bonferroni_reject | BH_adjusted_p | BH_reject | hochberg_adjusted_p | hochberg_reject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Math: free/reduced vs standard | 0 | 0.7823 | 355 | Math by Lunch Type | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Reading: free/reduced vs standard | 0 | 0.4924 | 355 | Reading by Lunch Type | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: free/reduced vs standard | 0 | 0.5293 | 355 | Writing by Lunch Type | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Reading: completed vs none | 0 | 0.5192 | 358 | Reading by Test Prep | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: completed vs none | 0 | 0.6866 | 358 | Writing by Test Prep | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Reading: female vs male | 0 | 0.5037 | 482 | Reading by Gender | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: female vs male | 0 | 0.6316 | 482 | Writing by Gender | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: bachelors degree vs high school | 2e-10 | 0.7629 | 118 | Writing by Parental Education | welch_t_test | 1.46e-08 | True | 1.68e-08 | True | 1.4e-09 | True | 1.46e-08 | True |
| Writing: high school vs masters degree | 9e-10 | 0.9446 | 59 | Writing by Parental Education | welch_t_test | 6.48e-08 | True | 7.56e-08 | True | 5.815e-09 | True | 6.48e-08 | True |
| Math: Group C vs Group E | 1.9e-09 | 0.6212 | 140 | Math by Race/Ethnicity | welch_t_test | 1.349e-07 | True | 1.596e-07 | True | 1.14e-08 | True | 1.349e-07 | True |
| Math: Group B vs Group E | 5e-09 | 0.6691 | 140 | Math by Race/Ethnicity | welch_t_test | 3.5e-07 | True | 4.2e-07 | True | 2.8e-08 | True | 3.5e-07 | True |
| Math: Group A vs Group E | 1.08e-08 | 0.8048 | 89 | Math by Race/Ethnicity | welch_t_test | 7.452e-07 | True | 9.072e-07 | True | 5.67e-08 | True | 7.452e-07 | True |
| Math: completed vs none | 1.54e-08 | 0.3763 | 358 | Math by Test Prep | welch_t_test | 1.047e-06 | True | 1.294e-06 | True | 7.609e-08 | True | 1.047e-06 | True |
| Math: female vs male | 9.12e-08 | 0.3407 | 482 | Math by Gender | welch_t_test | 6.11e-06 | True | 7.661e-06 | True | 4.256e-07 | True | 6.11e-06 | True |
| Writing: associates degree vs high school | 1.465e-07 | 0.5242 | 196 | Writing by Parental Education | welch_t_test | 9.669e-06 | True | 0 | True | 6.477e-07 | True | 9.669e-06 | True |
| Reading: high school vs masters degree | 6.258e-07 | 0.7593 | 59 | Reading by Parental Education | welch_t_test | 0 | True | 0.0001 | True | 2.628e-06 | True | 0 | True |
| Reading: bachelors degree vs high school | 8.804e-07 | 0.5846 | 118 | Reading by Parental Education | welch_t_test | 0.0001 | True | 0.0001 | True | 3.522e-06 | True | 0.0001 | True |
| Writing: masters degree vs Some high school | 4.275e-06 | 0.7067 | 59 | Writing by Parental Education | welch_t_test | 0.0003 | True | 0.0004 | True | 0 | True | 0.0003 | True |
| Writing: bachelors degree vs Some high school | 4.628e-06 | 0.5535 | 118 | Writing by Parental Education | welch_t_test | 0.0003 | True | 0.0004 | True | 0 | True | 0.0003 | True |
| Reading: associates degree vs high school | 7.442e-06 | 0.4448 | 196 | Reading by Parental Education | welch_t_test | 0.0005 | True | 0.0006 | True | 0 | True | 0.0005 | True |
| Writing: high school vs Some college | 9.275e-06 | 0.4381 | 196 | Writing by Parental Education | welch_t_test | 0.0006 | True | 0.0008 | True | 0 | True | 0.0006 | True |
| Math: Group D vs Group E | 0 | 0.4483 | 140 | Math by Race/Ethnicity | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Math: bachelors degree vs high school | 0 | 0.4936 | 118 | Math by Parental Education | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: Group A vs Group E | 0 | 0.5726 | 89 | Writing by Race/Ethnicity | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Writing: Group A vs Group D | 0 | 0.5099 | 89 | Writing by Race/Ethnicity | welch_t_test | 0 | True | 0 | True | 0 | True | 0 | True |
| Reading: Group A vs Group E | 0.0001 | 0.5519 | 89 | Reading by Race/Ethnicity | welch_t_test | 0.0059 | True | 0.0084 | True | 0.0003 | True | 0.0058 | True |
| Math: associates degree vs high school | 0.0001 | 0.387 | 196 | Math by Parental Education | welch_t_test | 0.0059 | True | 0.0084 | True | 0.0003 | True | 0.0058 | True |
| Reading: masters degree vs Some high school | 0.0002 | 0.5594 | 59 | Reading by Parental Education | welch_t_test | 0.0114 | True | 0.0168 | True | 0.0006 | True | 0.0114 | True |
| Math: high school vs Some college | 0.0004 | 0.3461 | 196 | Math by Parental Education | welch_t_test | 0.0224 | True | 0.0336 | True | 0.0012 | True | 0.0224 | True |
| Math: high school vs masters degree | 0.0006 | 0.5182 | 59 | Math by Parental Education | welch_t_test | 0.033 | True | 0.0504 | False | 0.0016 | True | 0.0324 | True |
| Reading: high school vs Some college | 0.0006 | 0.3375 | 196 | Reading by Parental Education | welch_t_test | 0.033 | True | 0.0504 | False | 0.0016 | True | 0.0324 | True |
| Reading: bachelors degree vs Some high school | 0.0008 | 0.4036 | 118 | Reading by Parental Education | welch_t_test | 0.0424 | True | 0.0672 | False | 0.002 | True | 0.0408 | True |
| Reading: Group B vs Group E | 0.0008 | 0.3771 | 140 | Reading by Race/Ethnicity | welch_t_test | 0.0424 | True | 0.0672 | False | 0.002 | True | 0.0408 | True |
| Writing: Group B vs Group E | 0.0008 | 0.3768 | 140 | Writing by Race/Ethnicity | welch_t_test | 0.0424 | True | 0.0672 | False | 0.002 | True | 0.0408 | True |
| Math: Group A vs Group D | 0.0009 | 0.4106 | 89 | Math by Race/Ethnicity | welch_t_test | 0.045 | True | 0.0756 | False | 0.0021 | True | 0.0441 | True |
| Writing: associates degree vs Some high school | 0.0009 | 0.3347 | 179 | Writing by Parental Education | welch_t_test | 0.045 | True | 0.0756 | False | 0.0021 | True | 0.0441 | True |
| Writing: Group B vs Group D | 0.0015 | 0.3049 | 190 | Writing by Race/Ethnicity | welch_t_test | 0.072 | False | 0.126 | False | 0.0033 | True | 0.0705 | False |
| Math: bachelors degree vs Some high school | 0.0015 | 0.3791 | 118 | Math by Parental Education | welch_t_test | 0.072 | False | 0.126 | False | 0.0033 | True | 0.0705 | False |
| Writing: masters degree vs Some college | 0.0017 | 0.4633 | 59 | Writing by Parental Education | welch_t_test | 0.0782 | False | 0.1428 | False | 0.0037 | True | 0.0782 | False |
| Reading: Group A vs Group D | 0.0025 | 0.3738 | 89 | Reading by Race/Ethnicity | welch_t_test | 0.1125 | False | 0.21 | False | 0.0052 | True | 0.1125 | False |
| Reading: masters degree vs Some college | 0.0042 | 0.4223 | 59 | Reading by Parental Education | welch_t_test | 0.1848 | False | 0.3528 | False | 0.0086 | True | 0.1848 | False |
| Writing: Group A vs Group C | 0.0046 | 0.3415 | 89 | Writing by Race/Ethnicity | welch_t_test | 0.1978 | False | 0.3864 | False | 0.0092 | True | 0.1978 | False |
| Math: Group B vs Group D | 0.0049 | 0.2695 | 190 | Math by Race/Ethnicity | welch_t_test | 0.2058 | False | 0.4116 | False | 0.0095 | True | 0.205 | False |
| Math: associates degree vs Some high school | 0.005 | 0.2833 | 179 | Math by Parental Education | welch_t_test | 0.2058 | False | 0.42 | False | 0.0095 | True | 0.205 | False |
| Writing: associates degree vs masters degree | 0.0058 | 0.4074 | 59 | Writing by Parental Education | welch_t_test | 0.232 | False | 0.4872 | False | 0.0108 | True | 0.232 | False |
| Reading: associates degree vs Some high school | 0.0068 | 0.2731 | 179 | Reading by Parental Education | welch_t_test | 0.2652 | False | 0.5712 | False | 0.0123 | True | 0.2622 | False |
| Reading: Group C vs Group E | 0.0069 | 0.2751 | 140 | Reading by Race/Ethnicity | welch_t_test | 0.2652 | False | 0.5796 | False | 0.0123 | True | 0.2622 | False |
| Writing: bachelors degree vs Some college | 0.0077 | 0.3044 | 118 | Writing by Parental Education | welch_t_test | 0.2849 | False | 0.6468 | False | 0.0135 | True | 0.2849 | False |
| Math: masters degree vs Some high school | 0.0087 | 0.397 | 59 | Math by Parental Education | welch_t_test | 0.3132 | False | 0.7308 | False | 0.0149 | True | 0.3132 | False |
| Writing: Some college vs Some high school | 0.0104 | 0.2577 | 179 | Writing by Parental Education | welch_t_test | 0.364 | False | 0.8736 | False | 0.0171 | True | 0.3536 | False |
| Reading: Group A vs Group C | 0.0104 | 0.3087 | 89 | Reading by Race/Ethnicity | welch_t_test | 0.364 | False | 0.8736 | False | 0.0171 | True | 0.3536 | False |
| Math: Group C vs Group D | 0.0159 | 0.2017 | 262 | Math by Race/Ethnicity | welch_t_test | 0.5247 | False | 1 | False | 0.0257 | True | 0.5216 | False |
| Math: Some college vs Some high school | 0.0163 | 0.2413 | 179 | Math by Parental Education | welch_t_test | 0.5247 | False | 1 | False | 0.0258 | True | 0.5216 | False |
| Writing: Group C vs Group E | 0.0192 | 0.2383 | 140 | Writing by Race/Ethnicity | welch_t_test | 0.5952 | False | 1 | False | 0.0299 | True | 0.5952 | False |
| Reading: bachelors degree vs Some college | 0.0281 | 0.2504 | 118 | Reading by Parental Education | welch_t_test | 0.843 | False | 1 | False | 0.0429 | True | 0.843 | False |
| Reading: associates degree vs masters degree | 0.0293 | 0.3209 | 59 | Reading by Parental Education | welch_t_test | 0.8497 | False | 1 | False | 0.044 | True | 0.8497 | False |
| Writing: associates degree vs bachelors degree | 0.0351 | 0.2411 | 118 | Writing by Parental Education | welch_t_test | 0.9828 | False | 1 | False | 0.0517 | False | 0.882 | False |
| Reading: Group D vs Group E | 0.045 | 0.2105 | 140 | Reading by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.0652 | False | 0.882 | False |
| Reading: Group B vs Group D | 0.0524 | 0.1854 | 190 | Reading by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.0746 | False | 0.882 | False |
| Writing: Group C vs Group D | 0.0593 | 0.1576 | 262 | Writing by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.083 | False | 0.882 | False |
| Reading: Some college vs Some high school | 0.0873 | 0.1715 | 179 | Reading by Parental Education | welch_t_test | 1 | False | 1 | False | 0.1202 | False | 0.882 | False |
| Math: Group A vs Group C | 0.1104 | 0.1918 | 89 | Math by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.148 | False | 0.882 | False |
| Writing: Group B vs Group C | 0.111 | 0.1463 | 190 | Writing by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.148 | False | 0.882 | False |
| Writing: high school vs Some high school | 0.1141 | 0.1638 | 179 | Writing by Parental Education | welch_t_test | 1 | False | 1 | False | 0.1498 | False | 0.882 | False |
| Writing: Group A vs Group B | 0.1448 | 0.1878 | 89 | Writing by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.1843 | False | 0.882 | False |
| Reading: high school vs Some high school | 0.1448 | 0.1511 | 179 | Reading by Parental Education | welch_t_test | 1 | False | 1 | False | 0.1843 | False | 0.882 | False |
| Math: bachelors degree vs Some college | 0.1715 | 0.1556 | 118 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.2148 | False | 0.882 | False |
| Reading: Group A vs Group B | 0.1739 | 0.1751 | 89 | Reading by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.2148 | False | 0.882 | False |
| Reading: Group B vs Group C | 0.1867 | 0.1212 | 190 | Reading by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.2273 | False | 0.882 | False |
| Reading: associates degree vs bachelors degree | 0.1952 | 0.1479 | 118 | Reading by Parental Education | welch_t_test | 1 | False | 1 | False | 0.2342 | False | 0.882 | False |
| Math: masters degree vs Some college | 0.2176 | 0.1806 | 59 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.2574 | False | 0.882 | False |
| Reading: associates degree vs Some college | 0.2666 | 0.1051 | 222 | Reading by Parental Education | welch_t_test | 1 | False | 1 | False | 0.311 | False | 0.882 | False |
| Reading: bachelors degree vs masters degree | 0.2933 | 0.1681 | 59 | Reading by Parental Education | welch_t_test | 1 | False | 1 | False | 0.3375 | False | 0.882 | False |
| Writing: bachelors degree vs masters degree | 0.3188 | 0.1594 | 59 | Writing by Parental Education | welch_t_test | 1 | False | 1 | False | 0.3619 | False | 0.882 | False |
| Math: Group A vs Group B | 0.3503 | 0.1202 | 89 | Math by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.3923 | False | 0.882 | False |
| Math: associates degree vs bachelors degree | 0.3802 | 0.1001 | 118 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.4202 | False | 0.882 | False |
| Math: high school vs Some high school | 0.3881 | 0.0893 | 179 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.4234 | False | 0.882 | False |
| Math: associates degree vs masters degree | 0.401 | 0.1232 | 59 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.4318 | False | 0.882 | False |
| Writing: Group D vs Group E | 0.4104 | 0.0863 | 140 | Writing by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.4364 | False | 0.882 | False |
| Reading: Group C vs Group D | 0.4258 | 0.0665 | 262 | Reading by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.4471 | False | 0.882 | False |
| Writing: associates degree vs Some college | 0.4467 | 0.072 | 222 | Writing by Parental Education | welch_t_test | 1 | False | 1 | False | 0.4632 | False | 0.882 | False |
| Math: Group B vs Group C | 0.4648 | 0.067 | 190 | Math by Race/Ethnicity | welch_t_test | 1 | False | 1 | False | 0.4761 | False | 0.882 | False |
| Math: associates degree vs Some college | 0.5876 | 0.0513 | 222 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.5947 | False | 0.882 | False |
| Math: bachelors degree vs masters degree | 0.882 | 0.0237 | 59 | Math by Parental Education | welch_t_test | 1 | False | 1 | False | 0.882 | False | 0.882 | False |
Purpose
This section presents the complete numerical results from all 84 hypothesis tests with raw and adjusted p-values across four correction methods. It enables users to identify which findings remain statistically significant after controlling for multiple comparisons—the core objective of the analysis. By displaying all rejection decisions side-by-side, it facilitates comparison of method stringency and power.
Key Findings
- Holm Rejections: 36 of 84 – Step-down procedure rejects 43% of all tests while controlling family-wise error rate
- BH Rejections: 56 of 84 – False discovery rate control yields 93% more rejections than Bonferroni, reflecting its looser error criterion
- Effect Size Alignment: 36 tests – All Holm rejections have practically significant effects (Cohen's d ≥ 0.2); 22 further tests reach d ≥ 0.2 without surviving Holm correction
- Uncorrected FWER: 0.987 – Without correction, 98.7% probability of at least one false positive across 84 tests, justifying the correction approach
Interpretation
The analysis reveals a clear hierarchy: Bonferroni (most conservative, 29 rejections) < Holm/Hochberg (moderate, 36 rejections) < BH (least conservative, 56
Methodology Summary
Technical details and method comparison
| method | rejections | controls | assumptions | power_rank | power_relative_to_bonferroni |
|---|---|---|---|---|---|
| holm | 36 | FWER | General (weak assumptions) | 3 | 1.241 |
| bonferroni | 29 | FWER | General (weak assumptions) | 4 | 1 |
| BH | 56 | FDR | Independence or positive dependence | 1 | 1.931 |
| hochberg | 36 | FWER | Independence or positive dependence | 2 | 1.241 |
Purpose
This section compares four multiple comparison correction methods to demonstrate why Holm-Bonferroni was selected as the primary approach. It shows the trade-off between statistical power (ability to detect true effects) and error control (preventing false positives) across methods with different assumptions and objectives. Understanding these differences is critical for interpreting which of the 36 Holm-rejected hypotheses represent genuine findings versus false discoveries.
Key Findings
- Bonferroni (29 rejections): Most conservative FWER method; 19% fewer rejections than Holm (29 vs 36) and requires no dependence assumptions
- Holm (36 rejections): 24% more rejections than Bonferroni with identical FWER control; like Bonferroni, valid under arbitrary dependence
- Hochberg (36 rejections): Matches Holm's rejection count here but requires stronger assumptions (independence or positive dependence); ranks 2nd in power
- Benjamini-Hochberg (56 rejections): 93% more rejections than Bonferroni but controls FDR (a looser standard) rather than FWER; the expected proportion of false discoveries among rejected tests is held at 5%
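For contrast with the FWER methods, the Benjamini-Hochberg adjustment can be sketched as follows. This is the standard step-up definition (rank-i p-value times m/i, then a running minimum from the largest p-value downward), not necessarily the report's implementation; `bh_adjust` is a hypothetical name and the five p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running = min(running, pvalues[idx] * m / rank)
        adjusted[idx] = running
    return adjusted


# Made-up example with five p-values.
example = [0.001, 0.008, 0.039, 0.041, 0.9]
bh = bh_adjust(example)
bonferroni = [min(1.0, p * len(example)) for p in example]
print(bh)
print(bonferroni)
```

Because the BH multiplier m/i shrinks with rank while the Bonferroni multiplier stays at m, BH-adjusted values are never larger than Bonferroni-adjusted ones, which is the source of the extra rejections.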
Interpretation
The 7-test gap between Holm (36) and Bonferroni (29) represents the power gain from Holm's sequential step-down thresholds, which requires no assumption about the dependence structure among tests.