Correlation Analysis
Analysis overview and configuration
| Parameter | Value |
|---|---|
| method | pearson |
| significance_level | 0.05 |
| top_n_pairs | 15 |
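As a rough illustration of what the `method = pearson` setting refers to, the sketch below computes the Pearson product-moment coefficient from scratch. The `CONFIG` dictionary and the toy `age`/`tenure` lists are hypothetical stand-ins and are not taken from the report's data or tooling.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical configuration mirroring the parameter table.
CONFIG = {"method": "pearson", "significance_level": 0.05, "top_n_pairs": 15}

# Toy data for illustration only (not the report's observations).
age = [25, 32, 41, 48, 55]
tenure = [1, 4, 9, 12, 17]
r = pearson_r(age, tenure)
```

The coefficient is scale-free, which is why variables measured in years, dollars, and hours can be compared in one matrix.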
Purpose
This correlation analysis examines relationships among 7 HR variables across 300 employees to identify which factors move together and may inform predictive modeling. Understanding these interdependencies helps prioritize which variables are most relevant for downstream analysis and reveals the underlying structure of HR metrics at Demo Corp.
Key Findings
- Strongest Correlation: Age vs YearsAtCompany (r=0.88, p<0.001) - A very strong positive relationship indicating older employees tend to have longer tenure
- Significant Pairs: 8 of 21 possible pairs (38.1%) show statistically significant relationships at p<0.05
- Job Satisfaction Link: JobSatisfaction vs PerformanceRating (r=0.59) is the second-strongest relationship, suggesting employee satisfaction correlates meaningfully with performance outcomes
- Weak Overall Pattern: 81% of correlations (17 of 21) are classified as weak (|r| < 0.3), indicating most HR variables operate relatively independently
- Data Completeness: 300 observations analyzed with minimal missing data (0-15 values per variable)
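The pair counts and percentages quoted above follow from simple arithmetic. The sketch below reproduces them; the count of 17 weak pairs is read off the strength column of the correlation table later in this report.

```python
import math

n_variables = 7
n_pairs = math.comb(n_variables, 2)   # unordered pairs of distinct variables
significant = 8                       # reported count at p < 0.05
weak = 17                             # rows labelled "Weak" in the correlation table

share_significant = 100 * significant / n_pairs
share_weak = 100 * weak / n_pairs
print(n_pairs, round(share_significant, 1), round(share_weak))  # prints 21 38.1 81
```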
Interpretation
The analysis reveals a sparse correlation landscape where most HR metrics don't strongly predict each other. The dominant Age-Tenure relationship is intuitive and expected. The moderate Job Satisfaction-Performance link is noteworthy for modeling, while variables like Distance
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for the correlation analysis of 7 variables across 300 observations. It demonstrates data integrity and completeness prior to statistical testing, which is critical for ensuring the reliability of the 8 significant correlations (38.1% of 21 pairs) identified in the analysis.
Key Findings
- Retention Rate: 100% (300/300 rows) - No observations were removed during preprocessing, indicating clean input data with minimal quality issues
- Rows Removed: 0 - The dataset required no filtering, deletion, or exclusion steps
- Train/Test Split: Not applied - The full dataset was used for correlation analysis rather than predictive modeling
- Data Completeness: Variable-level missing data exists (15 missing values each in MonthlySalary, JobSatisfaction, and TrainingHours) but did not trigger row-level removal
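The varying n_obs values in the correlation table (300, 285, 271) are what pairwise deletion would produce, although the report does not name the method explicitly. The sketch below assumes pairwise deletion; the specific row indices of the missing values are hypothetical, since only the counts are reported.

```python
def pairwise_n(total_rows, missing_a, missing_b):
    """Rows usable for a pair: those where neither variable is missing."""
    return total_rows - len(missing_a | missing_b)

TOTAL = 300
# Hypothetical positions of the missing values (only the counts, 15 each, are known).
missing = {
    "Age": set(),
    "MonthlySalary": set(range(0, 15)),
    "JobSatisfaction": set(range(14, 29)),  # shares one row with MonthlySalary
}

n_age_salary = pairwise_n(TOTAL, missing["Age"], missing["MonthlySalary"])
n_salary_sat = pairwise_n(TOTAL, missing["MonthlySalary"], missing["JobSatisfaction"])
print(n_age_salary, n_salary_sat)  # prints 285 271
```

Under this assumption, an n_obs of 271 means 29 rows were excluded for that pair, which would imply the two variables' missing values overlap in exactly one row.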
Interpretation
The 100% retention rate reflects a well-curated dataset entering the analysis phase. However, the 15 missing values in each of three variables (noted in variable_summary_data) suggest selective missingness rather than systematic data loss. This explains why correlation pairs have n_obs of 300, 285, or 271 depending on which variables are involved. The absence of train/test splitting confirms this is descriptive correlation analysis rather than predictive modeling, appropriate for the stated
Executive Summary
Executive summary of correlation analysis findings
| Finding | Value |
|---|---|
| Variables Analyzed | 7 |
| Total Pairs | 21 |
| Significant Pairs | 8 (38.1%) |
| Strongest Correlation | Age vs YearsAtCompany (r=0.8831) |
| Correlation Method | Pearson |
| Observations Used | 300 |
Key Findings:
• Strongest pair: Age vs YearsAtCompany (r = 0.8831)
• 8 of 21 pairs are statistically significant (p < 0.05)
• 300 observations analyzed
Recommendation: Use the correlation matrix to identify variable clusters. Pairs with |r| > 0.7 may indicate multicollinearity; consider removing one variable of each such pair from predictive models. Always inspect scatter plots for the strongest pairs to confirm linear relationships before drawing conclusions.
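A minimal sketch of the |r| > 0.7 screen described in the recommendation, applied to the four largest correlations reported in the correlation table. The function name `flag_collinear` is hypothetical.

```python
# (var1, var2, r) for the four largest correlations in the report.
pairs = [
    ("Age", "YearsAtCompany", 0.8831),
    ("JobSatisfaction", "PerformanceRating", 0.5901),
    ("Age", "MonthlySalary", 0.5803),
    ("MonthlySalary", "YearsAtCompany", 0.4788),
]

def flag_collinear(pairs, threshold=0.7):
    """Return the pairs whose absolute correlation exceeds the threshold."""
    return [(a, b, r) for a, b, r in pairs if abs(r) > threshold]

flagged = flag_collinear(pairs)
print(flagged)  # prints [('Age', 'YearsAtCompany', 0.8831)]
```

Only Age vs YearsAtCompany crosses the threshold, so it is the single pair where dropping one variable is worth considering.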
Purpose
This analysis examines relationships among 7 organizational variables across 300 employees using Pearson correlation. The objective is to identify which variable pairs have statistically significant associations, enabling data-driven decisions about workforce dynamics, compensation structures, and performance drivers.
Key Findings
- Strongest Correlation: Age vs YearsAtCompany (r = 0.883) - Very strong positive relationship indicating tenure strongly tracks with employee age
- Significant Pairs: 8 of 21 possible pairs (38.1%) show statistical significance at p < 0.05
- Secondary Strong Relationships: JobSatisfaction vs PerformanceRating (r = 0.59) and Age vs MonthlySalary (r = 0.58) demonstrate moderate-to-strong positive associations
- Weak Correlations Dominate: 81% of all pairs show weak strength, suggesting variables operate largely independently
- Data Quality: All 300 employee records retained, with minimal missing values (0-15 per variable)
Interpretation
The analysis reveals that organizational outcomes are driven by multiple independent factors rather than a few dominant relationships. Age-tenure alignment is expected and natural. The moderate job satisfaction-performance link suggests employee engagement meaningfully correlates with output, though other unmeasured factors likely drive performance. The weak correlations across most pairs indicate
Correlation Matrix
Pairwise correlation matrix showing all variable relationships
Purpose
This correlation matrix maps all pairwise relationships among 7 variables across 300 observations, identifying which variables move together systematically. It serves as a foundational diagnostic tool to detect potential dependencies and multicollinearity patterns that inform downstream modeling and variable selection decisions.
Key Findings
- Significant Pairs: 8 of 21 pairs (38.1%) show statistically significant correlations (p < 0.05), well above the roughly one pair expected by chance alone across 21 tests at this threshold
- Strongest Correlation: Age vs YearsAtCompany (r = 0.88) demonstrates a very strong positive relationship, suggesting tenure increases predictably with employee age
- Correlation Range: Values span from -0.03 to 1.0 (mean = 0.29, which includes the diagonal of 1.0), with most off-diagonal pairs clustering in the weak range
- Pattern: Positive correlations dominate (all 8 significant pairs and 86.7% of the top 15 pairs are positive), with DistanceFromHome showing near-zero relationships across all variables
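The reported matrix mean of 0.29 can be cross-checked against the 21 pairwise values in the correlation table. The sketch below assumes the mean was taken over the full symmetric 7×7 matrix, diagonal included; the report does not state this, but the arithmetic is consistent with it.

```python
# The 21 off-diagonal r values from the correlation table.
r_values = [
    0.8831, 0.5901, 0.5803, 0.4788, 0.2457, 0.1593, 0.14,
    0.1192, 0.1149, 0.0788, 0.0733, 0.0424, -0.0285, -0.0266,
    0.023, 0.0218, 0.02, -0.0144, -0.0128, 0.007, -0.0002,
]
n_vars = 7

# Mean over distinct pairs only.
off_diag_mean = sum(r_values) / len(r_values)

# Each pair appears twice in the symmetric matrix; the diagonal adds n_vars ones.
full_mean = (2 * sum(r_values) + n_vars) / n_vars ** 2

print(round(full_mean, 2), round(off_diag_mean, 2))  # prints 0.29 0.17
```

The off-diagonal mean of about 0.17 is the more informative summary of how strongly distinct variables relate.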
Interpretation
The matrix reveals a workforce where age and tenure are tightly coupled, while commute distance operates independently of other measured factors. Job satisfaction and performance show moderate positive association (r = 0.59), suggesting employee engagement correlates with output quality. However, 62% of variable pairs lack statistical significance, indicating limited multicollinearity concerns and relatively independent
Top Correlations
Top variable pairs ranked by correlation strength
Purpose
This section identifies and ranks the strongest variable relationships in the dataset to reveal which factors move together most consistently. Understanding these correlations is essential for feature selection in predictive modeling and for identifying potential multicollinearity that could affect model performance or interpretation.
Key Findings
- Strongest Pair (Age vs YearsAtCompany): r = 0.883 - An exceptionally strong positive relationship, indicating tenure increases predictably with employee age
- Strong Pairs Count: 3 pairs qualify as strong (|r| ≥ 0.5), all statistically significant at p < 0.05
- Moderate Relationships: 1 pair (MonthlySalary vs YearsAtCompany, r = 0.48) shows moderate strength
- Dominant Pattern: 86.7% of top correlations are positive, suggesting aligned directional movement across most variables
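The ranking and summary figures above can be reproduced from the pairwise results. The sketch below sorts the 21 reported pairs by |r|, keeps the top 15 (the `top_n_pairs` setting), and computes the share of positive values and the median. The `A~B` labels are an illustrative convention, not the report's.

```python
from statistics import median

# (pair label, r) for all 21 pairs in the correlation table.
rows = [
    ("Age~YearsAtCompany", 0.8831), ("JobSatisfaction~PerformanceRating", 0.5901),
    ("Age~MonthlySalary", 0.5803), ("MonthlySalary~YearsAtCompany", 0.4788),
    ("PerformanceRating~TrainingHours", 0.2457), ("MonthlySalary~JobSatisfaction", 0.1593),
    ("JobSatisfaction~TrainingHours", 0.14), ("Age~JobSatisfaction", 0.1192),
    ("YearsAtCompany~JobSatisfaction", 0.1149), ("Age~PerformanceRating", 0.0788),
    ("YearsAtCompany~PerformanceRating", 0.0733), ("MonthlySalary~DistanceFromHome", 0.0424),
    ("Age~TrainingHours", -0.0285), ("YearsAtCompany~TrainingHours", -0.0266),
    ("MonthlySalary~PerformanceRating", 0.023), ("YearsAtCompany~DistanceFromHome", 0.0218),
    ("MonthlySalary~TrainingHours", 0.02), ("TrainingHours~DistanceFromHome", -0.0144),
    ("Age~DistanceFromHome", -0.0128), ("JobSatisfaction~DistanceFromHome", 0.007),
    ("PerformanceRating~DistanceFromHome", -0.0002),
]
TOP_N = 15

# Rank by absolute strength, strongest first, and keep the top N.
ranked = sorted(rows, key=lambda row: abs(row[1]), reverse=True)[:TOP_N]
positive_share = 100 * sum(1 for _, r in ranked if r > 0) / len(ranked)
median_r = median([r for _, r in ranked])

print(ranked[0][0], round(positive_share, 1), median_r)
# prints Age~YearsAtCompany 86.7 0.1192
```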
Interpretation
The Age-YearsAtCompany relationship (r = 0.883) is exceptionally strong and highly significant, reflecting a natural organizational pattern where older employees tend to have longer tenure. The three strong pairs (Age vs YearsAtCompany, JobSatisfaction vs PerformanceRating, and Age vs MonthlySalary) suggest these variables share substantial common variance. However, the median correlation across the top 15 pairs is only 0.12, indicating most relationships are weak
Strongest Pair Scatter
Scatter plot of strongest pair: Age vs YearsAtCompany
Purpose
This scatter plot visualizes the strongest relationship identified in the correlation analysis: Age vs YearsAtCompany (r = 0.883). By displaying all 300 individual observations, it allows visual confirmation that the strong numerical correlation reflects a genuine linear pattern rather than statistical artifact or clustering effects. This section bridges summary statistics and raw data to validate the correlation's practical meaning.
Key Findings
- Correlation Coefficient (r = 0.883): Indicates a very strong positive linear relationship; among the 21 variable pairs analyzed, this is the strongest association found
- Sample Size (n = 300): Full dataset with no missing values, providing robust statistical power and confidence in the relationship's stability
- Data Range: Age spans 22–65 years (mean 38.0); YearsAtCompany spans 0–18 years (mean 6.3), showing realistic organizational tenure patterns
- Linear Pattern: Points cluster tightly around the trend line with minimal scatter, confirming the relationship is genuinely linear and not curved or segmented
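To put the tightness of the scatter in numbers, the squared correlation gives the share of variance in one variable captured by a straight-line fit on the other. A short sketch using the reported coefficient:

```python
r = 0.8831                    # Age vs YearsAtCompany, from the report
r_squared = r ** 2            # share of variance captured by a linear fit
unexplained = 1 - r_squared   # share left unexplained

print(round(r_squared, 2), round(unexplained, 2))  # prints 0.78 0.22
```

About 22% of the variance in tenure is therefore not accounted for by age alone, even for this strongest pair.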
Interpretation
The scatter plot demonstrates that older employees consistently have longer tenure at the company. This strong association (r = 0.883, roughly 78% shared variance) suggests age and organizational longevity carry largely overlapping information in this dataset. The tight clustering around the trend line indicates limited unexplained variance, meaning age
Variable Distributions
Standardized distributions of all analyzed variables
Purpose
This section visualizes the standardized distributions of all 7 variables to enable direct comparison across different measurement scales. By converting raw values to z-scores, variables with vastly different units (e.g., age in years vs. salary in dollars) can be assessed side-by-side for spread, symmetry, and outlier presence. Understanding distribution shape is critical because extreme outliers or skewness can inflate or deflate correlation coefficients, affecting the reliability of the 8 significant relationships identified in the overall analysis.
Key Findings
- Z-score Range: -3.31 to 4.43 - Indicates presence of moderate outliers across the dataset, with some observations extending 3+ standard deviations from the mean
- Overall Skewness: 0.26 - Slight positive skew suggests most variables cluster toward lower values with right-tail extensions
- Raw Value Spread: min=0, max=8,103 - Extreme range reflects heterogeneous variable scales (e.g., salary vs. distance)
- Median Offset: -0.09 z-score median vs. 0 mean - A median slightly below the mean is consistent with the slight right skew noted above, indicating mild concentration below average
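The standardization behind these figures can be sketched as follows. The sample values are toy data, and the use of the population standard deviation is an assumption, since the report does not say which variant was used.

```python
from statistics import mean, median, pstdev

def z_scores(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]

# Toy right-skewed sample (hypothetical, not from the report).
sample = [1, 2, 2, 3, 3, 3, 4, 10]
z = z_scores(sample)

# After standardization the mean is 0 by construction; with a long right
# tail the median falls below it.
print(round(median(z), 2), round(max(z), 2))
```

A negative median z-score alongside a large positive maximum is the same pattern the findings above describe: a right tail pulling the mean above the median.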
Interpretation
The standardized distributions reveal that while most variables are reasonably symmetric, the presence of outliers (z-scores beyond
Correlation Table
Complete pairwise correlation results with statistical details
| var1 | var2 | r_value | p_value | n_obs | significant | strength |
|---|---|---|---|---|---|---|
| Age | YearsAtCompany | 0.8831 | <0.0001 | 300 | True | Strong |
| JobSatisfaction | PerformanceRating | 0.5901 | <0.0001 | 285 | True | Strong |
| Age | MonthlySalary | 0.5803 | <0.0001 | 285 | True | Strong |
| MonthlySalary | YearsAtCompany | 0.4788 | <0.0001 | 285 | True | Moderate |
| PerformanceRating | TrainingHours | 0.2457 | <0.0001 | 285 | True | Weak |
| MonthlySalary | JobSatisfaction | 0.1593 | 0.0086 | 271 | True | Weak |
| JobSatisfaction | TrainingHours | 0.1400 | 0.0212 | 271 | True | Weak |
| Age | JobSatisfaction | 0.1192 | 0.0444 | 285 | True | Weak |
| YearsAtCompany | JobSatisfaction | 0.1149 | 0.0528 | 285 | False | Weak |
| Age | PerformanceRating | 0.0788 | 0.1733 | 300 | False | Weak |
| YearsAtCompany | PerformanceRating | 0.0733 | 0.2057 | 300 | False | Weak |
| MonthlySalary | DistanceFromHome | 0.0424 | 0.4763 | 285 | False | Weak |
| Age | TrainingHours | -0.0285 | 0.6319 | 285 | False | Weak |
| YearsAtCompany | TrainingHours | -0.0266 | 0.6544 | 285 | False | Weak |
| MonthlySalary | PerformanceRating | 0.0230 | 0.6989 | 285 | False | Weak |
| YearsAtCompany | DistanceFromHome | 0.0218 | 0.7073 | 300 | False | Weak |
| MonthlySalary | TrainingHours | 0.0200 | 0.7431 | 271 | False | Weak |
| TrainingHours | DistanceFromHome | -0.0144 | 0.8093 | 285 | False | Weak |
| Age | DistanceFromHome | -0.0128 | 0.8247 | 300 | False | Weak |
| JobSatisfaction | DistanceFromHome | 0.0070 | 0.9063 | 285 | False | Weak |
| PerformanceRating | DistanceFromHome | -0.0002 | 0.9975 | 300 | False | Weak |
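The significance column depends on both r and n_obs, which is why two pairs with similar coefficients can fall on opposite sides of 0.05. The sketch below uses the Fisher z normal approximation, which is a substitute chosen so that only the standard library is needed; the report's own p-values presumably come from an exact test, so small differences in the last digits are expected.

```python
import math
from statistics import NormalDist

def approx_p_value(r, n):
    """Two-sided p-value for H0: rho = 0 via the Fisher z approximation."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r) * math.sqrt(n - 3)   # approximately standard normal under H0
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_borderline = approx_p_value(0.1149, 285)   # table reports 0.0528 (not significant)
p_significant = approx_p_value(0.1192, 285)  # table reports 0.0444 (significant)
print(p_borderline > 0.05, p_significant < 0.05)  # prints True True
```

The two pairs differ by less than 0.005 in r yet land on opposite sides of the cutoff, a reminder that the significance flag is a threshold on a continuous quantity.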
Purpose
This section presents all 21 pairwise correlations among 7 variables, identifying which relationships are statistically significant at the 0.05 level. It serves as the comprehensive foundation for understanding variable interdependencies across the dataset, enabling prioritization of relationships worthy of deeper investigation.
Key Findings
- Significant Pairs: 8 of 21 relationships (38.1%) meet statistical significance, indicating moderate evidence of real associations beyond random variation
- Strongest Correlation: Age vs YearsAtCompany (r=0.88, p<0.001) demonstrates the dominant relationship in the dataset
- Strength Distribution: 81% of pairs (17 of 21) are classified as weak (|r|<0.3), with only 3 strong and 1 moderate relationship, reflecting sparse meaningful associations
- Sample Consistency: Observations range 271–300 across pairs, with 12 of 21 pairs using 285 observations, suggesting minimal data loss
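The strength labels in the table can be reproduced with a simple threshold rule. The cutoffs below are inferred from the report's own statements (strong at |r| ≥ 0.5, weak below 0.3); treating everything in between as moderate is an assumption, and `classify_strength` is a hypothetical helper name.

```python
def classify_strength(r, weak_below=0.3, strong_from=0.5):
    """Label a correlation by its absolute magnitude."""
    magnitude = abs(r)
    if magnitude >= strong_from:
        return "Strong"
    if magnitude >= weak_below:
        return "Moderate"
    return "Weak"

for r in (0.8831, 0.4788, 0.2457, -0.0285):
    print(r, classify_strength(r))
```

Using the absolute value matters: a correlation of -0.6 would be as strong as +0.6, even though no negative value in this dataset comes close.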
Interpretation
The correlation matrix reveals a dataset where most variables operate independently. The three strong relationships (Age with tenure and salary, plus job satisfaction with performance) account for most of the covariation. The predominance of weak, non-significant pairs (13 of 21) indicates that employee outcomes are not heavily determined by simple linear associations among these seven variables, suggesting either complex multivariate interactions or the influence of unmeasured factors