Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| confidence_level | 0.95 |
| include_interaction_terms | FALSE |
| model_selection_method | forward |
| diagnostic_plots | TRUE |
| vif_threshold | 10 |
| cv_folds | 5 |
| cv_seed | 42 |
| cooks_d_threshold | 0.5 |
| include_prediction_intervals | TRUE |
| include_standardized_coefs | TRUE |
| heteroscedasticity_test | breusch_pagan |
| alpha | 0.05 |
Analysis Insights: Multi-Channel Advertising ROI Prediction
Purpose
This analysis builds a predictive model for sales revenue based on advertising spend across three digital channels (TikTok, Facebook, Google Ads). The objective is to quantify each channel's contribution to revenue and enable data-driven budget allocation decisions for the marketing agency.
Key Findings
- R-squared (0.782): The model explains 78.2% of revenue variance, indicating strong predictive power with meaningful explanatory value across the three channels
- All Predictors Significant: All three advertising channels show p-values below 0.001 (reported as 0.0000), confirming statistically significant relationships with revenue
- Google Ads ROI (1.216): Highest coefficient—each unit of Google spend generates the strongest revenue impact, followed by Facebook (0.488) and TikTok (0.360)
- TikTok as Most Important: Despite lowest raw coefficient, TikTok shows the highest standardized coefficient (0.633), indicating greatest relative importance when accounting for variable scaling
- Model Stability: Cross-validation R² of 0.764 vs. training R² of 0.782 suggests minimal overfitting (0.995 ratio), with consistent performance across data folds
Interpretation
The model reveals a nuanced channel hierarchy: while Google Ads delivers the highest marginal return per dollar spent, TikTok's greater spend variability gives it the largest standardized influence on revenue, so raw ROI and relative importance point to different channels.
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for the marketing ROI analysis. It shows that no data rows were processed during the preprocessing stage, which is inconsistent with the main analysis that evaluated 200 observations. This discrepancy suggests the preprocessing metadata may not have been properly captured or the pipeline documentation is incomplete.
Key Findings
- Initial Rows: 0 - No input data recorded in preprocessing logs
- Final Rows: 0 - No output data documented after cleaning
- Retention Rate: 100% - Perfect retention rate, but meaningless given zero rows
- Data Quality: No transformations, filtering, or quality checks are documented despite the main analysis using 200 complete observations
Interpretation
The preprocessing section shows zero rows processed, yet the regression model successfully analyzed 200 observations with no rows removed (initial_rows = 200, rows_removed = 0). This indicates either the preprocessing documentation failed to capture the actual data pipeline, or the data was loaded directly without formal preprocessing steps. The 100% retention rate is technically accurate but uninformative—it reflects that no data was explicitly removed, not that preprocessing was thorough.
Context
This metadata gap creates uncertainty about data quality decisions, missing value handling, and feature engineering applied before modeling. Given that the main analysis shows heteroscedasticity and non-normal residuals, understanding the preprocessing choices would be critical to interpreting those diagnostic failures and selecting appropriate remedies.
Executive Summary
Executive summary with key findings and business recommendations
| Metric | Value |
|---|---|
| Model Performance | R² = 78.2% (explains 78.2% of sales variance) |
| Best Channel | Google with ROI = $1.22 per ad dollar |
| Significant Channels | 3 of 3 channels statistically significant |
| Model Quality | Strong |
Key Findings:
• Model fit: R² = 0.782 (Adjusted R² = 0.779)
• Prediction accuracy: RMSE = $1,256.97 (typical prediction error)
• Statistical significance: F-test p-value < 0.001
• ✓ No multicollinearity: Max VIF = 1.02
Recommendation: Reallocate budget toward Google (highest ROI) and use the coefficient estimates to guide spend allocation across channels. Note, however, that the diagnostics detect heteroscedasticity and non-normal residuals, so while the coefficient estimates remain unbiased, confidence and prediction intervals should be interpreted with caution.
Purpose
This executive summary synthesizes the advertising channel analysis to assess whether the regression model reliably explains sales performance and supports budget allocation decisions. The 78.2% variance explained indicates strong predictive power, but diagnostic concerns require careful interpretation before deployment.
Key Findings
- R-squared (0.782): Model explains 78.2% of sales variance, indicating solid explanatory power across the three advertising channels
- Google Ads ROI ($1.22): Highest marginal return per advertising dollar, significantly outperforming Facebook ($0.49) and TikTok ($0.36)
- Statistical Significance: All three predictors are statistically significant (p < 0.001) with no multicollinearity concerns (Max VIF = 1.02)
- Model Diagnostics: Heteroscedasticity detected (Breusch-Pagan p = 0.003) and non-normal residuals (Shapiro-Wilk p < 0.001) violate regression assumptions
- Prediction Accuracy: RMSE of $1,257 represents ~12% error relative to mean sales ($10,668)
Interpretation
The model demonstrates strong predictive capability with all channels showing reliable, significant effects. Google's superior ROI suggests budget reallocation potential. However, the violated diagnostic assumptions (heteroscedasticity and non-normal residuals) mean the reported p-values and confidence intervals should be treated with caution until the model specification is revisited.
Model Fit
How well does the model predict sales? Actual vs predicted values with model performance metrics
Purpose
This section evaluates how accurately the regression model predicts sales outcomes by comparing actual values against model predictions. Understanding model fit is essential for assessing whether the three marketing channels (TikTok, Facebook, Google Ads) reliably explain sales variation and whether predictions are trustworthy for business decisions.
Key Findings
- R² = 0.782: The model explains 78.2% of sales variance, indicating strong predictive power across the 200 observations
- Adjusted R² = 0.779: Minimal difference from R² suggests the three predictors are justified without overfitting penalties
- RMSE = $1,256.97: Typical prediction error of roughly 11.8% of mean sales ($10,668), a reasonable margin for forecasting
- F-statistic p-value ≈ 0: Confirms the three marketing channels are jointly significant predictors of sales
Interpretation
The model demonstrates solid predictive capability, with actual sales clustering reasonably close to predicted values. The near-identical R² and adjusted R² values indicate the model complexity is appropriate: no unnecessary predictors inflate performance artificially. Residuals average zero but have a median of -$244, meaning the model slightly overpredicts the typical observation while underpredicting a smaller number of high-sales cases; overall bias is minimal.
Context
This fit assessment assumes linear relationships between marketing spend and sales.
Channel ROI
Which advertising channels drive the most sales per dollar spent? Coefficient estimates with confidence intervals
Purpose
This section quantifies the sales impact of each advertising channel by estimating marginal ROI—the dollars of sales generated per dollar spent. All three channels show statistically significant effects, meaning their contributions to sales are reliable and not due to chance. These coefficients directly address the core business question of which channels deliver the strongest financial returns.
Key Findings
- Google Ads ROI (1.22): Highest coefficient with 95% CI [1.01, 1.42]—generates $1.22 in sales per ad dollar spent
- Facebook ROI (0.49): Mid-range coefficient with CI [0.42, 0.56]—generates $0.49 per ad dollar
- TikTok ROI (0.36): Lowest coefficient with CI [0.32, 0.40]—generates $0.36 per ad dollar
- Statistical Significance: All p-values below 0.001 (reported as 0), indicating extremely strong evidence that each channel's effect is real
Interpretation
Google Ads demonstrates substantially higher efficiency than both social channels, delivering more than 2.5× the return of TikTok. The tight confidence intervals (none crossing zero) confirm these rankings are stable estimates rather than statistical artifacts. This is consistent with the model's R² of 0.782: these three channels together explain 78% of the variation in sales.
Residual Diagnostics
Are model assumptions satisfied? Residual plots check for homoscedasticity and linearity
Purpose
This section evaluates whether the linear regression model satisfies two critical assumptions: homoscedasticity (constant variance across fitted values) and linearity (random scatter around zero). Violations of these assumptions undermine model reliability and suggest the relationship between predictors and outcomes may not be adequately captured by the linear specification.
Key Findings
- Residual Range: -2,185 to +3,915 with median of -244, indicating asymmetric distribution around zero
- Standardized Residuals: Range from -1.73 to +3.11, with one observation exceeding ±3 standard deviations (potential outlier)
- Residual Skewness: 0.58 - right-skewed distribution; a minority of observations are underpredicted by large amounts
- Heteroscedasticity Detected: Breusch-Pagan test (p=0.003) confirms non-constant variance across fitted values
Interpretation
The residual plot reveals violated homoscedasticity assumptions. The positive skew and median offset from zero indicate the model systematically underpredicts at certain fitted value ranges. The standardized residual exceeding ±3σ represents a notable outlier. These violations, confirmed by the failed diagnostic tests, suggest the linear model may misspecify the relationship between marketing spend and sales.
Normality Check
Are residuals normally distributed? QQ plot validates normality assumption required for inference
Purpose
The QQ plot assesses whether residuals follow a normal distribution—a critical assumption for valid p-values and confidence intervals in regression. Deviations from the 45° reference line indicate non-normality, which can undermine the reliability of statistical inference for the marketing channel ROI model.
Key Findings
- Shapiro-Wilk Test: p < 0.001 (reported as 0.0000) - normality is rejected; residuals deviate significantly from a normal distribution
- Residual Skewness: 0.58 - Positive skew indicates right-tail heaviness; more large positive residuals than expected under normality
- Tail Behavior: Residuals range from -2,185 to 3,915, an asymmetric spread; the standardized upper tail (max 3.11) exceeds the approximately ±2.81 spread expected for 200 normal draws
- Pattern Observed: Upper tail deviations suggest the model systematically underpredicts high-value outcomes
Interpretation
The residuals exhibit non-normal distribution, particularly in the upper tail, which violates a foundational assumption of ordinary least squares regression. This means the reported p-values (all 0.0000) and 95% confidence intervals for TikTok, Facebook, and Google Ads coefficients may be unreliable. The positive skew, combined with the heteroscedasticity detected in the residual diagnostics, suggests considering a response transformation or robust inference before relying on the reported significance levels.
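The Shapiro-Wilk test and skewness statistic used in this section come from `scipy.stats`. As a sketch on right-skewed stand-in residuals (the real residuals are not available; the gamma shape and scale are assumptions chosen to produce positive skew):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Right-skewed stand-in residuals, mimicking the reported skewness of ~0.58
resid = rng.gamma(shape=4.0, scale=300.0, size=200)
resid -= resid.mean()          # OLS residuals are centered at zero

w, p = stats.shapiro(resid)
skew = stats.skew(resid)
print(f"Shapiro-Wilk W = {w:.4f}, p = {p:.4g}, skewness = {skew:.2f}")
```

With n = 200, even moderate skew pushes the Shapiro-Wilk p-value well below 0.05, matching the rejection reported above.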
Multicollinearity Check
Are predictors highly correlated? VIF (Variance Inflation Factor) detects multicollinearity that inflates coefficient uncertainty
| test | statistic | p_value | result |
|---|---|---|---|
| Normality (Shapiro-Wilk) | 0.9630 | 0.0000 | Fail |
| Homoscedasticity (Breusch-Pagan) | 9.1101 | 0.0025 | Fail |
| Autocorrelation (Durbin-Watson) | 1.2206 | N/A | Fail |
Purpose
This section evaluates whether predictors (TikTok, Facebook, Google Ads) are highly correlated with each other—a condition called multicollinearity that inflates coefficient uncertainty and reduces model reliability. VIF quantifies this relationship, with values above 10 indicating problematic correlation that compromises statistical inference.
Key Findings
- Max VIF: 1.018 - All three advertising channels have VIF values near 1.0, well below the critical threshold of 10, indicating negligible correlation between predictors
- VIF Warning: FALSE - No multicollinearity alert triggered, confirming predictors are sufficiently independent
- Predictor Independence: TikTok, Facebook, and Google Ads spending patterns are distinct and non-redundant in explaining outcome variation
Interpretation
The extremely low VIF values (all ≤1.02) demonstrate that the three advertising channels operate independently in the dataset. This independence strengthens confidence in the coefficient estimates—each channel's ROI (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22) reflects its true isolated effect rather than shared variance with other channels. The model's ability to distinguish individual channel contributions is therefore robust.
Context
While multicollinearity is not a concern, the diagnostic tests reveal violations of normality (Shapiro-Wilk, p < 0.001) and homoscedasticity (Breusch-Pagan, p = 0.0025), which affect the precision of inference rather than the independence of the predictors.
Influential Points
Which observations have outsized influence on the model? Cook's Distance and leverage identify problematic data points
Purpose
This section identifies observations that disproportionately affect model coefficients and predictions. By detecting influential points and high-leverage cases, we can assess whether the model's estimates are robust or driven by a small number of unusual data points. This is critical for validating the reliability of the marketing channel ROI estimates.
Key Findings
- Influential Count: 0 observations — No points with Cook's Distance > 0.5, indicating no individual observations are distorting coefficient estimates
- High-Leverage Count: 8 observations — Points with extreme predictor values that could potentially affect fitted values, though not currently exerting undue influence
- Maximum Cook's Distance: 0.051 — Well below the 0.5 threshold, confirming minimal overall influence from any single observation
- Leverage Range: 0.01–0.06 (mean 0.02) — Distributed across the predictor space with no extreme outliers
Interpretation
The model demonstrates strong stability: zero influential points means the TikTok, Facebook, and Google Ads coefficients are not driven by outliers. The 8 high-leverage observations represent unusual combinations of predictor values but do not distort estimates because their residuals remain moderate. This validates that the ROI estimates (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22) reflect the full dataset rather than a handful of extreme observations.
Cross-Validation
How well does the model generalize to new data? Cross-validation assesses out-of-sample performance
Purpose
This section evaluates whether the marketing ROI model generalizes reliably to new, unseen data. Cross-validation partitions the dataset into five folds, training on four and testing on one, repeated across all combinations. This reveals whether the model's strong training performance (R² = 0.782) holds up when applied to data it hasn't encountered, which is critical for real-world deployment.
Key Findings
- Overfit Ratio: 0.995 (train RMSE $1256.97 vs CV RMSE $1263.04) — Ratio near 1.0 indicates minimal overfitting; the model performs nearly identically on held-out data as on training data
- CV R²: 0.764 — Explains 76.4% of variance in unseen folds, only 1.8 percentage points below training R² (0.782), confirming stable predictive power
- Fold Consistency: RMSE ranges from $1072.76 to $1402.97 across folds (SD = $142.46), showing moderate variability but no systematic degradation pattern
Interpretation
The model demonstrates excellent generalization. The negligible gap between training and cross-validation metrics (about 0.5% difference in RMSE) suggests the three marketing channels (TikTok, Facebook, Google Ads) capture stable relationships rather than noise specific to the training sample.
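The fold-wise RMSE figures above correspond to a standard 5-fold scheme; a hedged sketch with scikit-learn, using the cv_folds = 5 and cv_seed = 42 from the configuration table and synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)        # cv_seed = 42 from the configuration
n = 200
# Hypothetical spend matrix and sales; the real data is not shown in the report
X = rng.uniform(0, 1000, size=(n, 3))
y = X @ np.array([0.36, 0.49, 1.22]) + rng.normal(0, 150, n)

cv = KFold(n_splits=5, shuffle=True, random_state=42)   # cv_folds = 5
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores                # sklearn reports negated RMSE
print(f"CV RMSE: {rmse_per_fold.mean():.2f} (SD {rmse_per_fold.std():.2f})")
```

The overfit ratio quoted above is then simply training RMSE divided by the mean of `rmse_per_fold`.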
Prediction Intervals
What is the uncertainty around individual predictions? 95% prediction intervals quantify forecast precision
Purpose
Prediction intervals quantify uncertainty around individual forecasts by establishing lower and upper bounds where actual values are expected to fall. This section evaluates whether the model's uncertainty estimates are well-calibrated—critical for risk assessment and decision-making when deploying the marketing ROI model in production environments.
Key Findings
- Coverage Rate (97%): Actual values fall within 95% prediction intervals 97% of the time, exceeding the nominal 95% target and indicating excellent calibration with minimal over- or under-confidence.
- Interval Width (~$5,058 average): Consistent width across predictions (range $5,029–$5,147) reflects stable uncertainty estimation relative to the mean prediction of $10,668.
- Near-Complete Containment: 194 of 200 observations (97%) fall within their respective intervals, with no systematic pattern to the few misses, suggesting the model's uncertainty quantification is reliable across the prediction range.
Interpretation
The 97% coverage rate demonstrates the model produces trustworthy uncertainty bounds. Predictions are neither overconfident (which would yield <95% coverage) nor overly conservative (which would exceed 98%). The narrow standard deviation of interval widths ($23.69) indicates uncertainty is uniformly estimated, not concentrated in specific regions. This calibration validates the model's suitability for business decisions requiring probabilistic forecasts of marketing channel ROI.
Feature Importance
Which channel has the strongest impact? Standardized coefficients enable fair comparison across different spend scales
Purpose
This section identifies which marketing channel drives the strongest relative impact on sales outcomes by comparing standardized effect sizes. Standardized coefficients normalize for differences in spend scale across channels, enabling fair comparison of true influence regardless of budget magnitude. This directly addresses the core business question: which channel delivers the most efficient return per unit of variation in spending?
Key Findings
- Most Important Predictor: TikTok with standardized coefficient of 0.633 - the highest relative impact among three channels
- Relative Effect Ranking: TikTok (0.633) > Facebook (0.453) > Google Ads (0.392) in terms of standardized influence
- Interpretation Scale: A one standard deviation increase in TikTok spending is associated with a 0.633 standard deviation increase in sales, versus 0.453 for Facebook and 0.392 for Google Ads
Interpretation
Despite Google Ads having the largest raw coefficient (1.22), TikTok demonstrates the strongest standardized effect, indicating superior efficiency when accounting for spend variability. This reveals that TikTok's influence on sales is more pronounced relative to its natural variation in spending patterns. The ranking reflects true comparative leverage: TikTok's marginal impact per unit of standardized variation substantially exceeds both Facebook and Google Ads, making it the most influential channel in the model
Heteroscedasticity
Is variance constant across fitted values? Scale-Location plot detects heteroscedasticity that violates regression assumptions
Purpose
This section evaluates whether prediction error variance remains constant across all fitted values—a core assumption of linear regression. The Scale-Location plot visualizes this relationship, while the Breusch-Pagan test provides statistical confirmation. Detecting heteroscedasticity is critical because it undermines the reliability of confidence intervals and hypothesis tests, even when predictions appear accurate.
Key Findings
- Breusch-Pagan Test Statistic: 9.11 with p-value = 0.0025 - Statistically significant evidence of non-constant variance (p < 0.05)
- Heteroscedasticity Status: DETECTED - The regression violates the homoscedasticity assumption
- Smooth Line Trend: Ranges from 0.78 to 0.94 across fitted values, indicating slight variance reduction at higher predictions rather than a dramatic pattern
Interpretation
The model exhibits heteroscedasticity, meaning prediction errors are not uniformly distributed across the range of fitted values. While the trend is modest (smooth line variation of ±0.08 around mean 0.87), it is statistically significant. This suggests that uncertainty in marketing ROI predictions may be systematically higher or lower depending on predicted spending levels, potentially affecting the precision of confidence intervals around the channel-specific ROI estimates (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22).