Logistic Regression
Analysis overview and configuration
| Parameter | Value |
|---|---|
| confidence_level | 0.95 |
| test_size | 0.3 |
| classification_threshold | 0.5 |
| positive_class | completed |
Purpose
This logistic regression analysis identifies which student characteristics predict test preparation completion at an Educational Research Institute. The model evaluates five predictors (math score, reading score, writing score, gender, and lunch plan) against a binary outcome (completed vs. none) using 1,000 complete student records with no missing data.
Key Findings
- AUC (0.80): Model demonstrates good discrimination ability between students who complete and don't complete test preparation, exceeding the 0.70 quality threshold.
- Accuracy (0.75): Overall correct classification rate. With only 35.8% positive cases, always predicting "none" would already score about 64%, so accuracy alone overstates model performance.
- Specificity (0.87): Strong at identifying non-completers, but Sensitivity (0.52) is notably weaker at identifying completers.
- Significant Predictors (4 of 5): Writing score (OR=1.26), student gender male (OR=7.46), math score (OR=0.92), and reading score (OR=0.91) are statistically significant; lunch plan is not.
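- McFadden R² (0.17): The fitted model improves the log-likelihood by about 17% over an intercept-only model. This is a pseudo-R², not a share of variance explained, and it indicates modest fit.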
Interpretation
The model successfully identifies non-completers but struggles with true positive detection. Male students show
Data preprocessing and column mapping
Purpose
This section documents the data cleaning and preparation phase for the logistic regression model predicting test preparation completion. Perfect data retention indicates no observations were excluded during preprocessing, which is critical for maintaining statistical power and representativeness when identifying student characteristics that predict completion behavior.
Key Findings
- Retention Rate: 100% (1,000 of 1,000 rows retained) - No observations were removed during cleaning, suggesting either minimal data quality issues or that missing values were handled through imputation rather than deletion
- Rows Removed: 0 - The dataset required no exclusions, contrasting with the metadata note that "missing values removed" yet maintaining full row count
- Train/Test Split: Not documented in this section - The configuration lists test_size = 0.3, but no split counts are reported here, so this section alone does not show which rows were used for fitting and which for evaluation
Interpretation
The perfect retention rate supports the model's ability to leverage the complete 1,000-student sample for logistic regression estimation. This maximizes statistical power for detecting significant predictors of test preparation completion. However, the discrepancy between "missing values removed" and zero rows removed suggests preprocessing may have occurred at the column level rather than row level, potentially through imputation or feature engineering that isn't explicitly documented here.
Context
The lack of train/test split documentation limits visibility into how model performance metrics (AUC=0.8, Accuracy=
Executive Summary
Executive summary of logistic regression classification
| Metric | Value | Interpretation |
|---|---|---|
| AUC | 0.8 | Good |
| Accuracy | 74.6% | Moderate |
| Sensitivity | 52.3% | Low |
| Specificity | 87% | High |
| F1 Score | 0.596 | Low |
| McFadden R² | 0.172 | Weak |
| Significant Predictors | 4 of 5 | Most predictors |
Key Findings:
• Model discrimination: AUC = 0.8 (good — model reliably separates the two classes)
• Sensitivity: 52.3% — proportion of 'completed' cases correctly identified
• Specificity: 87% — proportion of 'none' cases correctly identified
• McFadden R² = 0.172 (weak model fit — consider adding predictors)
Recommendation: The model provides useful discrimination. Use an optimal threshold of 0.367 to classify new cases. Focus interventions on predictors with large, significant odds ratios.
Purpose
This section synthesizes the logistic regression model's performance in predicting test preparation completion across 1,000 students. Understanding whether the model achieves sufficient predictive accuracy and identifies actionable student characteristics is critical for determining deployment viability and intervention strategy effectiveness.
Key Findings
- AUC (0.8): Model demonstrates good discrimination ability—reliably separates students who completed test prep from those who did not
- Accuracy (74.6%): Overall correct classification rate; because only 35.8% of cases are positive, always predicting "none" would already reach about 64%, so this figure overstates performance
- Sensitivity (52.3%): Captures only about half of students who actually completed prep; high false-negative risk
- Specificity (87%): Excellent at identifying non-completers, reducing false-positive interventions
- Significant Predictors (4 of 5): Student gender, writing score, math score, and reading score drive predictions; lunch plan is non-significant
- McFadden R² (0.172): Weak explanatory power suggests unmeasured factors influence completion behavior
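For reference, McFadden's R² is conventionally defined from two log-likelihoods rather than from variance. The notation below is the standard textbook form, not something stated in this report: the numerator is the maximized log-likelihood of the fitted model, and the denominator is that of an intercept-only (null) model.

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
```

Under this definition, a value of 0.172 means the fitted model's log-likelihood is about 17% closer to zero than the null model's. It should not be read as "17% of variance explained".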
Interpretation
The model achieves the business objective of identifying predictive characteristics with acceptable discrimination (AUC 0.8). However, the low sensitivity reveals a critical trade-off: while the model excels at identifying non-completers
ROC Curve
ROC curve showing model discrimination ability
Purpose
The ROC curve evaluates the logistic regression model's ability to discriminate between students who completed test preparation and those who did not across all possible classification thresholds. This section directly addresses the model's predictive quality for the stated objective of identifying student characteristics that predict test preparation completion.
Key Findings
- AUC (0.8): Indicates good discrimination ability—the model correctly ranks a randomly selected completer higher than a non-completer 80% of the time, substantially better than random guessing (0.5).
- Optimal Threshold (0.367): Balances sensitivity and specificity by maximizing Youden's J statistic, suggesting predictions below this probability should be classified as "none" and above as "completed."
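- Sensitivity-Specificity Trade-off: The reported operating point is ~52% sensitivity (true positive rate) at ~13% false positive rate, presumably at the configured 0.5 classification threshold. Lowering the cutoff toward the 0.367 optimum would raise sensitivity at the cost of more false positives.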
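The two ideas in these findings, AUC as a ranking probability and a threshold chosen by Youden's J, can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy scores and labels are hypothetical, not the report's data or its implementation.

```python
def auc_by_ranking(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_optimal_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical toy data (1 = completed, 0 = none); not taken from the report.
toy_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]

toy_auc = auc_by_ranking(toy_scores, toy_labels)
toy_threshold, toy_j = youden_optimal_threshold(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch at this scale. A production analysis would normally use a library routine instead.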
Interpretation
The AUC of 0.8 demonstrates the model has meaningful predictive power for test preparation completion. The model performs substantially better than chance, validating that the selected student characteristics (gender, lunch plan, and test scores) contain discriminative information. However, the moderate AUC reflects inherent complexity in predicting behavioral outcomes and suggests room for improvement through additional
Confusion Matrix
Confusion matrix showing classification accuracy by class
Purpose
This confusion matrix evaluates how well the logistic regression model predicts test preparation completion across the two outcome classes. It reveals the model's ability to correctly identify students who completed preparation versus those who did not, which directly addresses the core objective of identifying predictive student characteristics.
Key Findings
- Accuracy (74.6%): Overall correctness across both classes; the model correctly classifies nearly 3 in 4 students
- Sensitivity (52.3%): The model identifies only about half of students who actually completed preparation, missing 48% of true completers (51 false negatives)
- Specificity (87%): Strong performance identifying non-completers; correctly classifies 87% of students who did not complete
- F1 Score (0.596): Moderate balance between precision and recall, reflecting the trade-off between false positives (25) and false negatives (51)
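The reported rates can be reproduced from a 2x2 confusion matrix. In the sketch below, the false-positive (25) and false-negative (51) counts come from the findings above, while the true-positive (56) and true-negative (167) counts are assumptions back-calculated from the reported sensitivity and specificity; they are not stated in the report. The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of actual positives that were caught
    specificity = tn / (tn + fp)   # share of actual negatives that were caught
    precision = tp / (tp + fp)     # share of predicted positives that were right
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# fp and fn are reported; tp and tn are ASSUMED values inferred from the reported rates.
metrics = classification_metrics(tp=56, fn=51, fp=25, tn=167)
```

With these assumed counts the results round to the reported accuracy (74.6%), sensitivity (52.3%), specificity (87%), and F1 (0.596). This suggests the metrics were computed on roughly 300 held-out cases.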
Interpretation
The model demonstrates asymmetric performance: it excels at identifying non-completion but struggles with completion detection. The high specificity (87%) indicates the model conservatively predicts completion, resulting in many false negatives. This imbalance reflects the class distribution (35.8% positive cases) and suggests the model's decision boundary favors the majority class. The moderate F1 score indicates reasonable but imperfect predictive utility for
Odds Ratios
Odds ratios with 95% confidence intervals for all predictors
Purpose
This section quantifies the individual effect of each predictor on the odds of test preparation completion. The odds ratios and confidence intervals reveal which student characteristics are statistically reliable predictors and the magnitude of their influence on completion likelihood. This directly supports the analysis objective to identify which characteristics predict test preparation completion.
Key Findings
- Student Gender (Male): OR = 7.46 (95% CI: 4.3–13.25) - Male students have about 7.5 times the odds of completing test preparation compared with female students who have the same scores and lunch plan; highly significant (p<0.001) with a confidence interval far above 1.0
- Writing Score: OR = 1.26 (95% CI: 1.2–1.33) - Each unit increase in writing score increases completion odds by 26%; statistically significant (p<0.001)
- Math & Reading Scores: OR = 0.92 and 0.91 respectively - Each additional point lowers completion odds by about 8–9%, holding the other predictors fixed; both effects are statistically significant (p<0.001)
- Lunch Plan: OR = 0.82 (95% CI: 0.55–1.2) - Not statistically significant (p=0.305); confidence interval crosses 1.0, indicating no reliable effect
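Odds ratios and their intervals follow from the log-odds coefficients and standard errors in the coefficient table. The sketch below assumes Wald (normal-approximation) intervals with a critical value of 1.96. The report does not say which interval method it used, and the Wald result for gender (roughly 4.25 to 13.09) differs slightly from the reported 4.3–13.25, so the report may use another method. The function name is hypothetical.

```python
import math


def wald_summary(log_odds, std_error, z_crit=1.96):
    """Odds ratio, Wald confidence interval, z statistic, and two-sided p-value."""
    z = log_odds / std_error
    return {
        "odds_ratio": math.exp(log_odds),
        "ci_lower": math.exp(log_odds - z_crit * std_error),
        "ci_upper": math.exp(log_odds + z_crit * std_error),
        "z": z,
        # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
        "p_value": math.erfc(abs(z) / math.sqrt(2)),
    }


# Coefficients and standard errors as listed in the coefficient table.
lunch = wald_summary(log_odds=-0.2038, std_error=0.1986)
gender = wald_summary(log_odds=2.009, std_error=0.287)
```

For lunch plan the interval straddles 1.0 and the p-value is about 0.30, matching the non-significant finding. For gender the whole interval lies well above 1.0.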
Interpretation
Four of five predictors significantly influence completion odds.
Predicted Probability Distribution
Distribution of predicted probabilities by actual class
Purpose
This section visualizes how well the logistic regression model separates students who completed test preparation from those who did not. The distribution of predicted probabilities reveals the model's confidence in its classifications and identifies overlap regions where the model struggles to distinguish between classes. This directly supports the objective of identifying student characteristics that predict test preparation completion.
Key Findings
- Positive Class Percentage: 35.8% of students completed test preparation, creating moderate class imbalance that affects model calibration
- Classification Threshold: The Youden-optimal cutoff is 0.367, below the 0.5 default listed in the configuration; a lower cutoff suits the 35.8% base rate and balances sensitivity and specificity more evenly
- Predicted Probability Range: Mean of 0.36 (close to the 35.8% base rate) with standard deviation of 0.22 indicates moderate spread; skewness of 0.45 indicates a mild right skew, meaning most predictions sit at lower probabilities with a longer tail toward higher ones
- Class Separation: Moderate overlap between distributions suggests the model achieves reasonable but imperfect discrimination between completed and non-completed cases
Interpretation
The predicted probabilities show meaningful separation between the two classes, consistent with the model's AUC of 0.80. The threshold of 0.367 is optimized below 0.50 because only 35.8% of students completed preparation, allowing the model to balance false positives and false negatives. The observed overlap explains why sensitivity (0
Model Coefficients
Full coefficient table with log-odds, odds ratios, and significance
| variable | log_odds | std_error | z_stat | p_value | odds_ratio | ci_lower | ci_upper | significant |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -5.032 | 0.5788 | -8.694 | <0.001 | 0.007 | 0.002 | 0.02 | Yes |
| math score | -0.0882 | 0.0158 | -5.595 | <0.001 | 0.916 | 0.887 | 0.944 | Yes |
| reading score | -0.0946 | 0.0218 | -4.334 | <0.001 | 0.91 | 0.871 | 0.949 | Yes |
| writing score | 0.2322 | 0.0249 | 9.309 | <0.001 | 1.261 | 1.203 | 1.326 | Yes |
| student_gendermale | 2.009 | 0.287 | 7.001 | <0.001 | 7.457 | 4.295 | 13.25 | Yes |
| lunch_planstandard | -0.2038 | 0.1986 | -1.026 | 0.3048 | 0.816 | 0.552 | 1.204 | No |
Purpose
This section quantifies the relationship between each student characteristic and test preparation completion. The coefficient table reveals which factors statistically predict completion and the magnitude of their effects, directly addressing the research objective to identify predictive student characteristics through logistic regression.
Key Findings
- Student Gender (Male): Odds ratio of 7.46 (p<0.001); male students have about 7.5× the odds of completing test preparation compared with otherwise similar female students, the strongest predictor in the model
- Writing Score: Odds ratio of 1.26 (p<0.001); each additional point increases completion odds by 26%, the only positive academic predictor
- Math & Reading Scores: Odds ratios of 0.92 and 0.91 respectively (p<0.001); counterintuitively, each additional point lowers completion odds by about 8–9% with the other predictors held fixed, suggesting high-performing students may skip preparation
- Lunch Plan: Odds ratio of 0.82 (p=0.305); not statistically significant, so lunch plan (treated here as a socioeconomic proxy) shows no reliable effect in this model
- Model Fit: McFadden's R² = 0.172 indicates modest fit; this is a pseudo-R² measuring log-likelihood improvement over an intercept-only model, not a share of variance explained
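To make the coefficients concrete, the sketch below turns the table's log-odds into a predicted completion probability with the logistic function. The function name and the two example students are hypothetical. The sketch also assumes female and non-standard lunch are the reference categories, as implied by the `student_gendermale` and `lunch_planstandard` row names.

```python
import math

# Log-odds coefficients copied from the coefficient table.
COEFFICIENTS = {
    "intercept": -5.032,
    "math": -0.0882,
    "reading": -0.0946,
    "writing": 0.2322,
    "male": 2.009,
    "standard_lunch": -0.2038,
}


def completion_probability(math_score, reading_score, writing_score,
                           is_male, has_standard_lunch):
    """Predicted probability of 'completed' for one student (illustrative only)."""
    logit = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["math"] * math_score
        + COEFFICIENTS["reading"] * reading_score
        + COEFFICIENTS["writing"] * writing_score
        + COEFFICIENTS["male"] * (1 if is_male else 0)
        + COEFFICIENTS["standard_lunch"] * (1 if has_standard_lunch else 0)
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical students with identical scores, differing only in gender.
p_female = completion_probability(70, 70, 70, is_male=False, has_standard_lunch=False)
p_male = completion_probability(70, 70, 70, is_male=True, has_standard_lunch=False)
```

For these two hypothetical students the predicted probabilities are roughly 0.17 and 0.61. This shows how an odds ratio near 7.5 translates into probabilities at one particular score profile; the gap would differ at other profiles.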
Interpretation
The model identifies gender as the dominant predictor of completion, with writing ability as a secondary factor. The inverse relationship