Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| confidence_level | 0.95 |
| include_interaction_terms | FALSE |
| model_selection_method | forward |
| diagnostic_plots | TRUE |
| vif_threshold | 10 |
| cv_folds | 5 |
| cv_seed | 42 |
| cooks_d_threshold | 0.5 |
| include_prediction_intervals | TRUE |
| include_standardized_coefs | TRUE |
| heteroscedasticity_test | breusch_pagan |
| alpha | 0.05 |
Analysis Insights: Multi-Channel Advertising ROI Prediction
Purpose
This analysis builds a predictive model for sales revenue based on advertising spend across three digital channels (TikTok, Facebook, Google Ads). The objective is to quantify each channel's contribution to revenue and enable data-driven budget allocation decisions for the marketing agency.
Key Findings
- R-squared (0.782): The model explains 78.2% of revenue variance, indicating strong predictive power with meaningful explanatory value across the three channels
- All Predictors Significant: All three advertising channels show p-values below 0.001 (reported as 0.0000), confirming statistically significant relationships with revenue
- Google Ads ROI (1.216): Highest coefficient—each unit of Google spend generates the strongest revenue impact, followed by Facebook (0.488) and TikTok (0.360)
- TikTok as Most Important: Despite lowest raw coefficient, TikTok shows the highest standardized coefficient (0.633), indicating greatest relative importance when accounting for variable scaling
- Model Stability: Cross-validation R² of 0.764 vs. training R² of 0.782 suggests minimal overfitting (0.995 ratio), with consistent performance across data folds
Interpretation
The model reveals a nuanced channel hierarchy: while Google Ads delivers the highest marginal return per dollar spent, TikTok's greater spend variability gives it the largest standardized influence on revenue, so raw ROI and relative importance point to different channels.
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for the marketing ROI analysis. It shows that no data rows were processed during the preprocessing stage, which is inconsistent with the main analysis that evaluated 200 observations. This discrepancy suggests the preprocessing metadata may not have been properly captured or the pipeline documentation is incomplete.
Key Findings
- Initial Rows: 0 - No input data recorded in preprocessing logs
- Final Rows: 0 - No output data documented after cleaning
- Retention Rate: 100% - Perfect retention rate, but meaningless given zero rows
- Data Quality: No transformations, filtering, or quality checks are documented despite the main analysis using 200 complete observations
Interpretation
The preprocessing section shows zero rows processed, yet the regression model successfully analyzed 200 observations with no rows removed (initial_rows = 200, rows_removed = 0). This indicates either the preprocessing documentation failed to capture the actual data pipeline, or the data was loaded directly without formal preprocessing steps. The 100% retention rate is technically accurate but uninformative—it reflects that no data was explicitly removed, not that preprocessing was thorough.
Context
This metadata gap creates uncertainty about data quality decisions, missing value handling, and feature engineering applied before modeling. Given that the main analysis shows heteroscedasticity and non-normal residuals, understanding the preprocessing choices would be critical to interpreting those diagnostic failures and selecting appropriate remedies.
Executive Summary
Executive summary with key findings and business recommendations
| Metric | Value |
|---|---|
| Model Performance | R² = 78.2% (explains 78.2% of sales variance) |
| Best Channel | Google with ROI = $1.22 per ad dollar |
| Significant Channels | 3 of 3 channels statistically significant |
| Model Quality | Strong |
Key Findings:
• Model fit: R² = 0.782 (Adjusted R² = 0.779)
• Prediction accuracy: RMSE = $1,256.97 (typical prediction error)
• Statistical significance: F-test p-value < 0.001
• ✓ No multicollinearity: Max VIF = 1.02
Recommendation: Reallocate budget toward Google (highest ROI) and use the coefficient estimates to guide spend allocation across channels. Note, however, that the diagnostics detect heteroscedasticity and non-normal residuals, so while the coefficient estimates remain unbiased, confidence and prediction intervals should be interpreted with caution.
Purpose
This executive summary synthesizes the advertising channel analysis to assess whether the regression model reliably explains sales performance and supports budget allocation decisions. The 78.2% variance explained indicates strong predictive power, but diagnostic concerns require careful interpretation before deployment.
Key Findings
- R-squared (0.782): Model explains 78.2% of sales variance, indicating solid explanatory power across the three advertising channels
- Google Ads ROI ($1.22): Highest marginal return per advertising dollar, significantly outperforming Facebook ($0.49) and TikTok ($0.36)
- Statistical Significance: All three predictors are statistically significant (p < 0.001) with no multicollinearity concerns (Max VIF = 1.02)
- Model Diagnostics: Heteroscedasticity detected (Breusch-Pagan p = 0.003) and non-normal residuals (Shapiro-Wilk p < 0.001) violate regression assumptions
- Prediction Accuracy: RMSE of $1,257 represents ~12% error relative to mean sales ($10,668)
Interpretation
The model demonstrates strong predictive capability with all channels showing reliable, significant effects. Google's superior ROI suggests budget reallocation potential. However, the violated diagnostic assumptions (heteroscedasticity and non-normal residuals) mean the reported p-values and confidence intervals should be treated with caution until the model specification is revisited.
Model Fit
How well does the model predict sales? Actual vs predicted values with model performance metrics
Purpose
This section evaluates how accurately the regression model predicts sales outcomes by comparing actual values against model predictions. Understanding model fit is essential for assessing whether the three marketing channels (TikTok, Facebook, Google Ads) reliably explain sales variation and whether predictions are trustworthy for business decisions.
Key Findings
- R² = 0.782: The model explains 78.2% of sales variance, indicating strong predictive power across the 200 observations
- Adjusted R² = 0.779: Minimal difference from R² suggests the three predictors are justified without overfitting penalties
- RMSE = $1,256.97: Typical prediction error of roughly 11.8% of mean sales ($10,668), a reasonable margin for forecasting
- F-statistic p-value ≈ 0: Confirms the three marketing channels are jointly significant predictors of sales
Interpretation
The model demonstrates solid predictive capability, with actual sales clustering reasonably close to predicted values. The near-identical R² and adjusted R² values indicate the model complexity is appropriate: no unnecessary predictors inflate performance artificially. Residuals average zero but have a median of -$244, meaning the model slightly overpredicts the typical observation while underpredicting a smaller number of high-sales cases; overall bias is minimal.
Context
This fit assessment assumes linear relationships between marketing spend and sales.
Channel ROI
Which advertising channels drive the most sales per dollar spent? Coefficient estimates with confidence intervals
Purpose
This section quantifies the sales impact of each advertising channel by estimating marginal ROI—the dollars of sales generated per dollar spent. All three channels show statistically significant effects, meaning their contributions to sales are reliable and not due to chance. These coefficients directly address the core business question of which channels deliver the strongest financial returns.
Key Findings
- Google Ads ROI (1.22): Highest coefficient with 95% CI [1.01, 1.42]—generates $1.22 in sales per ad dollar spent
- Facebook ROI (0.49): Mid-range coefficient with CI [0.42, 0.56]—generates $0.49 per ad dollar
- TikTok ROI (0.36): Lowest coefficient with CI [0.32, 0.40]—generates $0.36 per ad dollar
- Statistical Significance: All p-values below 0.001 (reported as 0), indicating extremely strong evidence that each channel's effect is real
Interpretation
Google Ads demonstrates substantially higher efficiency than both social channels, delivering more than 2.5× the return of TikTok. The tight confidence intervals (none crossing zero) confirm these rankings are stable estimates rather than statistical artifacts. This is consistent with the model's R² of 0.782: these three channels together explain 78% of the variation in sales.
Residual Diagnostics
Are model assumptions satisfied? Residual plots check for homoscedasticity and linearity
Purpose
This section evaluates whether the linear regression model satisfies two critical assumptions: homoscedasticity (constant variance across fitted values) and linearity (random scatter around zero). Violations of these assumptions undermine model reliability and suggest the relationship between predictors and outcomes may not be adequately captured by the linear specification.
Key Findings
- Residual Range: -2,185 to +3,915 with median of -244, indicating asymmetric distribution around zero
- Standardized Residuals: Range from -1.73 to +3.11, with one observation exceeding ±3 standard deviations (potential outlier)
- Residual Skewness: 0.58 - right-skewed distribution; a minority of observations are underpredicted by large amounts
- Heteroscedasticity Detected: Breusch-Pagan test (p=0.003) confirms non-constant variance across fitted values
Interpretation
The residual plot reveals violated homoscedasticity assumptions. The positive skew and median offset from zero indicate the model systematically underpredicts at certain fitted value ranges. The standardized residual exceeding ±3σ represents a notable outlier. These violations, confirmed by the failed diagnostic tests, suggest the linear model may misspecify the relationship between marketing spend and sales.
Normality Check
Are residuals normally distributed? QQ plot validates normality assumption required for inference
Purpose
The QQ plot assesses whether residuals follow a normal distribution—a critical assumption for valid p-values and confidence intervals in regression. Deviations from the 45° reference line indicate non-normality, which can undermine the reliability of statistical inference for the marketing channel ROI model.
Key Findings
- Shapiro-Wilk Test: p < 0.001 (reported as 0.0000) - normality is rejected; residuals deviate significantly from a normal distribution
- Residual Skewness: 0.58 - Positive skew indicates right-tail heaviness; more large positive residuals than expected under normality
- Tail Behavior: Residuals range from -2,185 to 3,915, an asymmetric spread; the standardized upper tail (max 3.11) exceeds the approximately ±2.81 spread expected for 200 normal draws
- Pattern Observed: Upper tail deviations suggest the model systematically underpredicts high-value outcomes
Interpretation
The residuals exhibit non-normal distribution, particularly in the upper tail, which violates a foundational assumption of ordinary least squares regression. This means the reported p-values (all 0.0000) and 95% confidence intervals for TikTok, Facebook, and Google Ads coefficients may be unreliable. The positive skew, combined with the heteroscedasticity detected in the residual diagnostics, suggests considering a response transformation or robust inference before relying on the reported significance levels.
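The Shapiro-Wilk test and skewness statistic used in this section come from `scipy.stats`. As a sketch on right-skewed stand-in residuals (the real residuals are not available; the gamma shape and scale are assumptions chosen to produce positive skew):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Right-skewed stand-in residuals, mimicking the reported skewness of ~0.58
resid = rng.gamma(shape=4.0, scale=300.0, size=200)
resid -= resid.mean()          # OLS residuals are centered at zero

w, p = stats.shapiro(resid)
skew = stats.skew(resid)
print(f"Shapiro-Wilk W = {w:.4f}, p = {p:.4g}, skewness = {skew:.2f}")
```

With n = 200, even moderate skew pushes the Shapiro-Wilk p-value well below 0.05, matching the rejection reported above.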
Multicollinearity Check
Are predictors highly correlated? VIF (Variance Inflation Factor) detects multicollinearity that inflates coefficient uncertainty
| test | statistic | p_value | result |
|---|---|---|---|
| Normality (Shapiro-Wilk) | 0.9630 | 0.0000 | Fail |
| Homoscedasticity (Breusch-Pagan) | 9.1101 | 0.0025 | Fail |
| Autocorrelation (Durbin-Watson) | 1.2206 | N/A | Fail |
Purpose
This section evaluates whether predictors (TikTok, Facebook, Google Ads) are highly correlated with each other—a condition called multicollinearity that inflates coefficient uncertainty and reduces model reliability. VIF quantifies this relationship, with values above 10 indicating problematic correlation that compromises statistical inference.
Key Findings
- Max VIF: 1.018 - All three advertising channels have VIF values near 1.0, well below the critical threshold of 10, indicating negligible correlation between predictors
- VIF Warning: FALSE - No multicollinearity alert triggered, confirming predictors are sufficiently independent
- Predictor Independence: TikTok, Facebook, and Google Ads spending patterns are distinct and non-redundant in explaining outcome variation
Interpretation
The extremely low VIF values (all ≤1.02) demonstrate that the three advertising channels operate independently in the dataset. This independence strengthens confidence in the coefficient estimates—each channel's ROI (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22) reflects its true isolated effect rather than shared variance with other channels. The model's ability to distinguish individual channel contributions is therefore robust.
Context
While multicollinearity is not a concern, the diagnostic tests reveal violations of normality (Shapiro-Wilk, p < 0.001) and homoscedasticity (Breusch-Pagan, p = 0.0025), which affect the precision of inference rather than the independence of the predictors.
Influential Points
Which observations have outsized influence on the model? Cook's Distance and leverage identify problematic data points
Purpose
This section identifies observations that disproportionately affect model coefficients and predictions. By detecting influential points and high-leverage cases, we can assess whether the model's estimates are robust or driven by a small number of unusual data points. This is critical for validating the reliability of the marketing channel ROI estimates.
Key Findings
- Influential Count: 0 observations — No points with Cook's Distance > 0.5, indicating no individual observations are distorting coefficient estimates
- High-Leverage Count: 8 observations — Points with extreme predictor values that could potentially affect fitted values, though not currently exerting undue influence
- Maximum Cook's Distance: 0.051 — Well below the 0.5 threshold, confirming minimal overall influence from any single observation
- Leverage Range: 0.01–0.06 (mean 0.02) — Distributed across the predictor space with no extreme outliers
Interpretation
The model demonstrates strong stability: zero influential points means the TikTok, Facebook, and Google Ads coefficients are not driven by outliers. The 8 high-leverage observations represent unusual combinations of predictor values but do not distort estimates because their residuals remain moderate. This validates that the ROI estimates (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22) reflect the full dataset rather than a handful of extreme observations.
Cross-Validation
How well does the model generalize to new data? Cross-validation assesses out-of-sample performance
Purpose
This section evaluates whether the marketing ROI model generalizes reliably to new, unseen data. Cross-validation partitions the dataset into five folds, training on four and testing on one, repeated across all combinations. This reveals whether the model's strong training performance (R² = 0.782) holds up when applied to data it hasn't encountered, which is critical for real-world deployment.
Key Findings
- Overfit Ratio: 0.995 (train RMSE $1256.97 vs CV RMSE $1263.04) — Ratio near 1.0 indicates minimal overfitting; the model performs nearly identically on held-out data as on training data
- CV R²: 0.764 — Explains 76.4% of variance in unseen folds, only 1.8 percentage points below training R² (0.782), confirming stable predictive power
- Fold Consistency: RMSE ranges from $1072.76 to $1402.97 across folds (SD = $142.46), showing moderate variability but no systematic degradation pattern
Interpretation
The model demonstrates excellent generalization. The negligible gap between training and cross-validation metrics (about 0.5% difference in RMSE) suggests the three marketing channels (TikTok, Facebook, Google Ads) capture stable relationships rather than noise specific to the training sample.
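The fold-wise RMSE figures above correspond to a standard 5-fold scheme; a hedged sketch with scikit-learn, using the cv_folds = 5 and cv_seed = 42 from the configuration table and synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)        # cv_seed = 42 from the configuration
n = 200
# Hypothetical spend matrix and sales; the real data is not shown in the report
X = rng.uniform(0, 1000, size=(n, 3))
y = X @ np.array([0.36, 0.49, 1.22]) + rng.normal(0, 150, n)

cv = KFold(n_splits=5, shuffle=True, random_state=42)   # cv_folds = 5
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores                # sklearn reports negated RMSE
print(f"CV RMSE: {rmse_per_fold.mean():.2f} (SD {rmse_per_fold.std():.2f})")
```

The overfit ratio quoted above is then simply training RMSE divided by the mean of `rmse_per_fold`.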
Prediction Intervals
What is the uncertainty around individual predictions? 95% prediction intervals quantify forecast precision
Purpose
Prediction intervals quantify uncertainty around individual forecasts by establishing lower and upper bounds where actual values are expected to fall. This section evaluates whether the model's uncertainty estimates are well-calibrated—critical for risk assessment and decision-making when deploying the marketing ROI model in production environments.
Key Findings
- Coverage Rate (97%): Actual values fall within 95% prediction intervals 97% of the time, exceeding the nominal 95% target and indicating excellent calibration with minimal over- or under-confidence.
- Interval Width (~$5,058 average): Consistent width across predictions (range $5,029–$5,147) reflects stable uncertainty estimation relative to the mean prediction of $10,668.
- Near-Complete Containment: 194 of 200 observations (97%) fall within their respective intervals, with no systematic pattern to the few misses, suggesting the model's uncertainty quantification is reliable across the prediction range.
Interpretation
The 97% coverage rate demonstrates the model produces trustworthy uncertainty bounds. Predictions are neither overconfident (which would yield <95% coverage) nor overly conservative (which would exceed 98%). The narrow standard deviation of interval widths ($23.69) indicates uncertainty is uniformly estimated, not concentrated in specific regions. This calibration validates the model's suitability for business decisions requiring probabilistic forecasts of marketing channel ROI.
Feature Importance
Which channel has the strongest impact? Standardized coefficients enable fair comparison across different spend scales
Purpose
This section identifies which marketing channel drives the strongest relative impact on sales outcomes by comparing standardized effect sizes. Standardized coefficients normalize for differences in spend scale across channels, enabling fair comparison of true influence regardless of budget magnitude. This directly addresses the core business question: which channel delivers the most efficient return per unit of variation in spending?
Key Findings
- Most Important Predictor: TikTok with standardized coefficient of 0.633 - the highest relative impact among three channels
- Relative Effect Ranking: TikTok (0.633) > Facebook (0.453) > Google Ads (0.392) in terms of standardized influence
- Interpretation Scale: A one standard deviation increase in TikTok spending is associated with a 0.633 standard deviation increase in sales, versus 0.453 for Facebook and 0.392 for Google Ads
Interpretation
Despite Google Ads having the largest raw coefficient (1.22), TikTok demonstrates the strongest standardized effect, indicating superior efficiency when accounting for spend variability. This reveals that TikTok's influence on sales is more pronounced relative to its natural variation in spending patterns. The ranking reflects true comparative leverage: TikTok's marginal impact per unit of standardized variation substantially exceeds both Facebook and Google Ads, making it the most influential channel in the model
Heteroscedasticity
Is variance constant across fitted values? Scale-Location plot detects heteroscedasticity that violates regression assumptions
Purpose
This section evaluates whether prediction error variance remains constant across all fitted values—a core assumption of linear regression. The Scale-Location plot visualizes this relationship, while the Breusch-Pagan test provides statistical confirmation. Detecting heteroscedasticity is critical because it undermines the reliability of confidence intervals and hypothesis tests, even when predictions appear accurate.
Key Findings
- Breusch-Pagan Test Statistic: 9.11 with p-value = 0.0025 - Statistically significant evidence of non-constant variance (p < 0.05)
- Heteroscedasticity Status: DETECTED - The regression violates the homoscedasticity assumption
- Smooth Line Trend: Ranges from 0.78 to 0.94 across fitted values, indicating slight variance reduction at higher predictions rather than a dramatic pattern
Interpretation
The model exhibits heteroscedasticity, meaning prediction errors are not uniformly distributed across the range of fitted values. While the trend is modest (smooth line variation of ±0.08 around mean 0.87), it is statistically significant. This suggests that uncertainty in marketing ROI predictions may be systematically higher or lower depending on predicted spending levels, potentially affecting the precision of confidence intervals around the channel-specific ROI estimates (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22).