Overview

Analysis Overview

Analysis overview and configuration

Analysis Type: Exploratory Analysis

Company: IBM HR Analytics

Objective: Perform comprehensive exploratory data analysis on employee dataset

Analysis Date: 2026-02-19

Processing Id: test_1771572713

Total Observations: 500

Parameter	Value
outlier_method	iqr
outlier_threshold	1.5
correlation_method	pearson
correlation_threshold	0.7
alpha	0.05
max_categories	20
enabled_analyses	all

Interpretation

Purpose

This analysis provides a comprehensive exploratory assessment of an IBM HR employee dataset containing 500 observations across 27 variables (20 numeric, 7 categorical). The objective is to understand data quality, distributions, relationships, and predictive patterns relevant to employee outcomes, establishing a foundation for subsequent modeling or business intelligence activities.

Key Findings

Data Completeness: 500 observations with zero missing values and zero duplicates—dataset is clean and ready for analysis
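Outlier Prevalence: 348 outliers detected by the IQR method. This is 69.6% when divided by the 500 observations, but the Distribution Analysis section counts it against the 10,000 numeric values (20 variables × 500 observations), which gives 3.48%. Outliers are concentrated in numeric variables with high skewness (8 skewed variables identified)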
Target Imbalance: Binary target shows moderate imbalance (84.4% "No" vs. 15.6% "Yes"; ratio 5.41:1)—requires monitoring for model bias
Predictive Signals: 10 numeric variables show significant associations with target; 5 categorical variables significantly associated (chi-square tests, p<0.05)
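Correlation Structure: 7 high correlations (|r| > 0.7) identified; the top-ranked feature in the correlation-based importance ranking is numeric_14 (by column order, TotalWorkingYears; r=0.69 with Age)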

Interpretation

The dataset exhibits excellent data quality with no missing values or duplicates, but contains substantial outliers reflecting real variation in employee metrics. The

Data preprocessing and column mapping

Initial Rows: 500

Final Rows: 500

Rows Removed: 0

Retention Rate: 100%

Interpretation

Purpose

This section documents the data preprocessing pipeline, showing that all 500 observations were retained without any rows removed during cleaning. This perfect retention rate indicates either exceptionally clean source data or minimal preprocessing interventions, which is critical context for understanding the reliability of downstream statistical tests and predictive modeling efforts.

Key Findings

Retention Rate: 100% (500/500 rows preserved) - No observations were excluded during quality checks or cleaning procedures
Rows Removed: 0 - Aligns with earlier findings showing zero missing values, zero duplicates, and zero type mismatches across all 27 variables
Data Integrity: Complete dataset preservation suggests the raw data met quality thresholds without requiring imputation, deduplication, or outlier removal

Interpretation

The perfect retention rate reflects the high baseline quality of the input dataset. Combined with the earlier analysis showing zero missing values, zero exact duplicates, and no data type mismatches, this indicates minimal data quality friction. However, this also means no rows were filtered despite detecting 348 outliers (69.6% of observations), suggesting outliers were retained for analysis rather than treated as errors—a deliberate choice that preserves variance but may inflate standard errors in statistical tests.

Context

The absence of a documented train/test split suggests this analysis is exploratory rather than predictive modeling. The complete dataset retention is

Executive Summary

Executive summary and key takeaways

total_observations: 500

data_quality_pct: 100

missing_pct: 0

outliers: 348

high_correlations: 7

skewed_variables: 8

Finding	Value
Dataset Size	500 observations
Data Quality	100% complete
Missing Values	0%
Outliers Detected	348 observations
High Correlations	7 pairs
Skewed Variables	8 variables

Bottom Line: Exploratory analysis completed for 500 observations with 20 numeric and 7 categorical variables.

Data Quality:
• Missing values: 0.0% overall
• Outliers detected: 348 using iqr method
• Data retention: 100.0% (500/500 observations)

Key Patterns:
• High correlations: 7 pairs (|r| > 0.70)
• Skewed distributions: 8 variables
• Categorical variables: 7

Recommendations:
• Review outliers for data quality issues or genuine extreme values
• Consider transformations for skewed variables before modeling
• Investigate high correlations to address multicollinearity
• No imputation or removal is needed for missing values, since none were found (0.0% missing)

Interpretation

EXECUTIVE SUMMARY: DATA QUALITY & EXPLORATORY ANALYSIS

Purpose

This section synthesizes the complete exploratory data analysis across 500 observations with 27 variables (20 numeric, 7 categorical). It assesses data readiness for modeling and identifies structural patterns requiring attention before deployment.

Key Findings

Data Completeness: 100% (0% missing values) - Dataset requires no imputation; all 500 observations retained
Outlier Prevalence: 348 outliers flagged using IQR method (69.6% relative to the 500 observations, or 3.48% of the 10,000 numeric values) - Suggests either legitimate extreme values or measurement issues requiring investigation
Distribution Skewness: 8 variables exhibit significant skew (|skew| > 1; range: 1.01–1.94) - Indicates non-normal distributions; transformations recommended for parametric modeling
Multicollinearity Risk: 7 high correlations (|r| > 0.70) detected - Potential redundancy in feature space; feature selection needed
Target Imbalance: 84.4% vs. 15.6% class split (5.41:1 ratio) - Moderate severity; requires stratified sampling and class-weighted algorithms
Predictive Signal: 10 numeric features show significant associations with target; numeric_14, numeric_7, an

Data Table

Descriptive Statistics

Summary statistics and variance analysis for all numeric variables

variable	count	mean	median	sd	variance	min	max	q1	q3	iqr	skewness	kurtosis	cv	is_constant	is_low_variance
Age	500	36.9	36	9.36	87.6	18	60	30	43	13	0.4378	2.61	0.2537	False	False
MonthlyIncome	500	6599	4952	4815	2.318e+07	1102	19999	2900	8742	5842	1.323	3.798	0.7296	False	False
DistanceFromHome	500	9.12	6	8.255	68.15	1	29	2	14	12	0.9751	2.717	0.9052	False	False
Education	500	2.888	3	1.036	1.074	1	5	2	4	2	-0.2187	2.389	0.3588	False	False
EnvironmentSatisfaction	500	2.678	3	1.072	1.149	1	4	2	4	2	-0.2715	1.827	0.4002	False	False
JobInvolvement	500	2.73	3	0.6797	0.462	1	4	2	3	1	-0.5647	3.462	0.249	False	False
JobLevel	500	2.096	2	1.128	1.273	1	5	1	3	2	1.009	3.363	0.5384	False	False
JobSatisfaction	500	2.804	3	1.079	1.164	1	4	2	4	2	-0.4197	1.896	0.3848	False	False
NumCompaniesWorked	500	2.688	2	2.511	6.303	0	9	1	4	3	1.032	3.064	0.934	False	False
PercentSalaryHike	500	15.22	14	3.729	13.9	11	25	12	18	6	0.7952	2.572	0.245	False	False
PerformanceRating	500	3.16	3	0.367	0.1347	3	4	3	3	0	1.855	4.441	0.1161	False	False
RelationshipSatisfaction	500	2.822	3	1.07	1.145	1	4	2	4	2	-0.4672	1.966	0.3791	False	False
StockOptionLevel	500	0.752	1	0.8318	0.6919	0	3	0	1	1	1.013	3.502	1.106	False	False
TotalWorkingYears	500	11.46	10	7.777	60.49	0	40	6	16	10	1.106	4.027	0.6784	False	False
TrainingTimesLastYear	500	2.8	3	1.287	1.655	0	6	2	3	1	0.5177	3.338	0.4595	False	False
WorkLifeBalance	500	2.754	3	0.706	0.4985	1	4	2	3	1	-0.6041	3.479	0.2564	False	False
YearsAtCompany	500	7.038	5	6.458	41.71	0	40	3	9	6	1.925	7.589	0.9176	False	False
YearsInCurrentRole	500	4.238	3	3.72	13.84	0	18	2	7	5	0.9943	3.69	0.8777	False	False
YearsSinceLastPromotion	500	2.194	1	3.275	10.73	0	15	0	3	3	1.935	6.277	1.493	False	False
YearsWithCurrManager	500	4.184	3	3.576	12.79	0	17	2	7	5	0.7522	3.005	0.8547	False	False

Interpretation

Purpose

This section establishes the distributional foundation of the dataset by computing summary statistics for all 20 numeric variables. Understanding central tendency, spread, and shape characteristics is essential for identifying data quality issues, detecting outliers, and determining appropriate analytical methods for the predictive modeling objective.

Key Findings

Skewed Variables: 8 of 20 numeric variables exhibit significant skewness (|skew| > 1), with YearsSinceLastPromotion (1.94), YearsAtCompany (1.92), and PerformanceRating (1.86) showing the strongest right skew, followed by MonthlyIncome (1.32)
Variance Quality: All 20 numeric variables retain sufficient variance; zero constant or low-variance columns detected, indicating all features contribute meaningful information
Coefficient of Variation: Ranges from 0.12 (PerformanceRating) to 1.49 (YearsSinceLastPromotion), reflecting heterogeneous variability across features
Range Disparity: MonthlyIncome spans 1102–19999 (CV=0.73) while satisfaction metrics cluster tightly around 2–3 (CV=0.25–0.40)
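
As a rough illustration of how the cv and skewness columns can be reproduced, the sketch below recomputes the coefficient of variation from mean and sd values copied from the table, and shows a moment-based skewness function. The report does not state which skewness estimator it used, so the moment-based formula is an assumption, and the toy sample is illustrative rather than taken from the dataset.

```python
def coefficient_of_variation(mean, sd):
    """CV = standard deviation divided by the mean."""
    return sd / mean


def moment_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# (mean, sd) pairs copied from the summary table.
table_rows = {
    "Age": (36.9, 9.36),
    "PerformanceRating": (3.16, 0.367),
    "YearsSinceLastPromotion": (2.194, 3.275),
}
cv = {
    name: round(coefficient_of_variation(mean, sd), 2)
    for name, (mean, sd) in table_rows.items()
}

# Toy right-skewed sample (illustrative values, not from the dataset).
toy_sample = [0, 0, 1, 1, 1, 2, 3, 7, 15]
toy_skew = moment_skewness(toy_sample)
```

Here `cv` reproduces the tabulated values to two decimals (0.25, 0.12, 1.49), and `toy_skew` is positive because the long tail sits on the right.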

Interpretation

The dataset exhibits non-normal distributions across multiple dimensions,

Visualization

Correlation Analysis

Correlation matrix showing relationships between numeric variables

Interpretation

Purpose

This section identifies multicollinearity in the dataset by detecting strong linear relationships (|r| > 0.7) between numeric variables. Understanding these dependencies is critical for predictive modeling, as highly correlated features can inflate the variance of coefficient estimates, reduce model interpretability, and create instability in parameter estimation.

Key Findings

High Correlations Detected: 7 pairs of variables exceed the 0.7 threshold, indicating substantial linear dependence
Correlation Method: Pearson correlation captures linear relationships; notable pairs include YearsAtCompany–YearsWithCurrManager (r=0.77) and YearsInCurrentRole–YearsWithCurrManager (r=0.74)
Statistical Significance: All 7 high correlations are marked as significant, confirming these relationships are unlikely due to chance
Overall Correlation Structure: Mean correlation of 0.12 across all 400 cells of the 20 × 20 correlation matrix (190 unique variable pairs) suggests most features are relatively independent, with multicollinearity concentrated in specific clusters
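
The thresholding step described in these findings might look like the following minimal sketch. The function names and the toy columns are hypothetical and are not taken from the report's pipeline; only the Pearson method and the 0.7 threshold come from the configuration listed in the overview.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlation_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for each unique pair with |r| > threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs


# Toy columns (illustrative values, not taken from the dataset).
toy = {
    "years_at_company": [1, 3, 5, 7, 10, 12],
    "years_with_manager": [0, 2, 4, 5, 8, 9],
    "distance_from_home": [9, 2, 14, 1, 6, 3],
}
flagged = high_correlation_pairs(toy)

# Number of unique pairs among 20 numeric variables: 20 * 19 / 2.
unique_pairs = len(list(combinations(range(20), 2)))
```

Only the two tenure-like toy columns are flagged, and `unique_pairs` evaluates to 190, which is the number of distinct pairs behind a 20 × 20 matrix.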

Interpretation

The dataset exhibits localized multicollinearity rather than systemic redundancy. The strong correlations cluster around tenure-related variables (Years at Company, Years in Current Role, Years with Current Manager), which logically measure related but distinct temporal dimensions of employment history. This pattern suggests these variables capture overlapping information

Visualization

Distribution Analysis

Distribution shapes and outlier detection for numeric variables

Interpretation

Purpose

This section identifies and visualizes the distribution of numeric variables across the dataset, with specific focus on detecting anomalous values. The 348 outliers flagged using the IQR method (1.5 × IQR threshold) represent extreme observations that deviate significantly from typical patterns. Understanding these distributions is critical for assessing data quality and determining whether outliers represent genuine business phenomena or measurement errors.

Key Findings

Outliers Detected: 348 values (3.48% of the 10,000 values analyzed, that is, 20 numeric variables × 500 observations) flagged as anomalies; other sections of this report express the same count as 69.6% of the 500 observations
Detection Method: IQR-based approach flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, a standard statistical threshold
Distribution Characteristics: Pooled values range from 0 to 19,999 with moderate right skew (0.56), indicating right-tailed distributions with extreme high values
Variable Coverage: Outliers distributed across all 20 numeric variables, with consistent 500-observation samples per variable
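
A minimal sketch of the IQR rule, assuming the conventional fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. The quartiles passed to `iqr_fences` are copied from the Descriptive Statistics table; the helper names, the quartile interpolation method, and the toy sample are assumptions made for illustration.

```python
from statistics import quantiles


def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values lying outside the IQR fences of the sample."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]


# Quartiles copied from the Descriptive Statistics table.
income_fences = iqr_fences(2900, 8742)  # MonthlyIncome
rating_fences = iqr_fences(3, 3)        # PerformanceRating, IQR = 0

# Toy sample (illustrative values, not from the dataset).
toy = [2, 3, 3, 4, 4, 5, 5, 6, 30]
toy_outliers = flag_outliers(toy)
```

For MonthlyIncome the upper fence works out to 17,505, so incomes up to the observed maximum of 19,999 are flagged. For PerformanceRating the IQR is 0, so both fences collapse to 3 and every rating of 4 counts as an outlier under this rule, which is worth keeping in mind when reading the overall outlier count.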

Interpretation

The moderate outlier rate (3.48%) suggests the dataset contains legitimate extreme values rather than systematic data quality failures. The high skewness and wide value range (0–19,999) indicate several variables have naturally occurring outliers reflecting real business variation. These outliers warrant investigation to determine whether they represent valid observations (

Visualization

Missing Value Analysis

Missing data patterns and completeness assessment

Interpretation

Purpose

This section assesses data completeness by identifying missing values across all 27 variables in the dataset. Complete data is essential for reliable statistical analysis and modeling, as missing values can introduce bias, reduce statistical power, and complicate feature engineering. Understanding missingness patterns helps determine whether imputation, removal, or alternative analytical approaches are needed.

Key Findings

Overall Missing Percentage: 0% – The dataset contains no missing values across any variable
Rows with Missing Data: 0 – All 500 observations are complete with values for every variable
Variable Coverage: All 27 variables (20 numeric, 7 categorical) have 100% data completeness
No Imputation Required: Zero variables exceed the 10% missing threshold that would typically trigger imputation or removal decisions
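
The completeness check can be sketched as below, assuming missing entries are represented as `None`. The function name and the toy columns are hypothetical; the 10% threshold is the one quoted in the findings above.

```python
def missing_report(columns, threshold_pct=10.0):
    """Per-column missing percentage and whether it exceeds the threshold."""
    report = {}
    for name, values in columns.items():
        n_missing = sum(1 for v in values if v is None)
        pct = 100.0 * n_missing / len(values)
        report[name] = {
            "missing_pct": pct,
            "exceeds_threshold": pct > threshold_pct,
        }
    return report


# Toy columns with one gap (illustrative; the report's dataset has no gaps).
toy = {
    "Age": [41, 49, None, 33, 27],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No"],
}
report = missing_report(toy)
```

On the dataset described in this report, every column would come back with `missing_pct` equal to 0 and `exceeds_threshold` set to `False`.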

Interpretation

The dataset exhibits perfect data completeness, eliminating a major source of analytical uncertainty. This clean state means all 500 observations can be used without data loss from listwise deletion, and no imputation assumptions are needed. The absence of missing data strengthens the validity of statistical tests (140 conducted, 23 significant) and feature importance rankings, as these analyses operate on the full sample without bias from incomplete cases.

Context

This ideal completeness status simplifies downstream modeling but does not address other data quality concerns identified elsewhere: 348 outliers (69.

Visualization

Bivariate Analysis

Numeric variable distributions across categorical groups

Interpretation

Purpose

This section examines how Age varies across the three Department categories using boxplot analysis. By comparing group medians, spreads, and ranges, we identify whether department membership has a meaningful relationship with employee age—a key indicator of whether this categorical variable should be prioritized in predictive modeling or segmentation analysis.

Key Findings

Human Resources Mean Age: 39.79 years (n=14) - Oldest department on average, though smallest sample size
Research & Development Mean Age: 37.24 years (n=333) - Largest group with moderate age, median of 36
Sales Mean Age: 35.88 years (n=153) - Youngest department, median of 34
Spread Consistency: Standard deviations range 9.16–12.81, indicating similar variability across groups
Age Range: All departments span 18–60 years, with overlapping distributions suggesting modest departmental differences
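
As a quick consistency check on the group figures above, the size-weighted mean of the three department means should reproduce the overall Age mean from the Descriptive Statistics table. The sketch below uses only the counts and means quoted in this section; the variable names are illustrative.

```python
# (n, mean age) per department, copied from the findings above.
groups = {
    "Human Resources": (14, 39.79),
    "Research & Development": (333, 37.24),
    "Sales": (153, 35.88),
}

total_n = sum(n for n, _ in groups.values())
pooled_mean = sum(n * mean for n, mean in groups.values()) / total_n
```

`total_n` comes to 500 and `pooled_mean` rounds to 36.9, matching the overall Age mean, so the group sizes and means are mutually consistent.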

Interpretation

The three departments show a slight age gradient (HR > R&D > Sales), with Human Resources employees averaging ~4 years older than Sales. However, the substantial overlap in distributions and comparable standard deviations indicate that department explains only modest variation in age. This aligns with the Kruskal-Wallis test results showing no significant differences for numeric

Visualization

Statistical Tests

ANOVA and Kruskal-Wallis tests for group differences

Interpretation

Purpose

This section evaluates whether numeric variables differ significantly across categorical groups using statistical hypothesis testing. Of 140 tests performed, 23 revealed statistically significant differences (p < 0.05), indicating that certain numeric variables behave differently depending on group membership. This identifies which variables have meaningful predictive or explanatory power relative to categorical factors.

Key Findings

Total Tests Conducted: 140 across 20 numeric variables and 7 categorical variables
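Significant Results: 23 tests (16.4%) showed p-values below the 0.05 threshold. With 140 tests and no multiple-comparison correction reported, about 7 significant results (140 × 0.05) would be expected by chance alone, so the 23 hits suggest, but do not individually confirm, genuine group differences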
Test Method: Kruskal-Wallis tests used exclusively, appropriate given that 0 of 20 numeric variables follow normal distributions
P-Value Distribution: Mean p-value of 0.39 with median of 0.32 suggests most variable-group pairs show no significant association
Strongest Signals: Age and YearsWithCurrManager show multiple significant associations with categorical variables like JobRole
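
Because 140 tests were run at alpha = 0.05, it can help to see how a multiple-comparison adjustment would change the picture. The report does not say that any correction was applied, so the Bonferroni threshold and the Benjamini-Hochberg procedure below are techniques swapped in for illustration, and the p-values are a toy list rather than the report's results.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of p < alpha results if every null hypothesis holds."""
    return n_tests * alpha


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that bounds the family-wise error rate by alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


chance_hits = expected_false_positives(140, 0.05)
strict_alpha = bonferroni_threshold(0.05, 140)

# Toy p-values (illustrative, not the report's results).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
raw_hits = [i for i, p in enumerate(toy_p) if p < 0.05]
bh_hits = benjamini_hochberg(toy_p)
```

In the toy list, five p-values fall below 0.05 but only the first two survive the Benjamini-Hochberg step, which shows why the unadjusted count of 23 is best read as an upper bound on the number of reliable group differences.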

Interpretation

The 16.4% significance rate indicates selective rather than pervasive group differences. Most numeric-categorical pairs (83.6%) show no meaningful variation across groups, suggesting limited discriminatory power for those combinations. However, the 23 significant findings identify specific numeric variables that meaningfully stratify by categorical factors—

Visualization

Normality Tests

Shapiro-Wilk normality tests and Q-Q plots

Interpretation

Purpose

This section evaluates whether the 20 numeric variables follow normal distributions using the Shapiro-Wilk test. Normality is a critical assumption for parametric statistical methods (t-tests, ANOVA, linear regression). Understanding the distributional properties of your data determines which analytical techniques are appropriate and whether data transformations are needed.

Key Findings

Variables Tested: 20 numeric variables analyzed
Normal Distributions Found: 0 (0% of variables pass normality test at p > 0.05)
Test Used: Shapiro-Wilk with all p-values reported as 0 (that is, below the displayed precision), indicating strong evidence against normality across all variables
Pattern Observed: All variables show significant deviation from normality, with Q-Q plots revealing systematic departures from the theoretical normal line

Interpretation

The complete absence of normally distributed variables indicates your dataset exhibits non-normal characteristics across all numeric features. This finding is consistent with the earlier observation of 8 skewed variables and 348 outliers detected via IQR method. The data's departure from normality—evidenced by positive skewness in variables like MonthlyIncome (skew=1.32) and YearsSinceLastPromotion (skew=1.94)—means parametric assumptions are violated, potentially affecting the validity of standard statistical tests already

Visualization

Pairwise Scatterplots

Scatterplot matrix showing pairwise relationships

Interpretation

Purpose

This section visualizes pairwise relationships between numeric variables through scatterplots, revealing both linear and non-linear associations that correlation coefficients alone cannot capture. By examining 6 variable pairs across 500 observations, the analysis identifies which features move together and detects patterns that may inform predictive modeling and feature engineering decisions.

Key Findings

Variable Pairs Analyzed: 6 combinations examined across 3,000 plotted points
Age vs. MonthlyIncome: Shows moderate positive relationship (r=0.49), with income ranging 1,102–19,999 across age span 18–60
High Variability in Income: Standard deviation of 4,814.58 indicates substantial income dispersion; with r=0.49, age linearly accounts for only about 24% of income variance (r² ≈ 0.24)
Non-linear Patterns: Scattered distributions suggest relationships may not be purely linear; curved or clustered patterns could indicate threshold effects or categorical influences

Interpretation

The pairwise analysis reveals that while some numeric variables exhibit positive correlations (particularly Age with MonthlyIncome), the scatter and high variance indicate these relationships are moderate at best. The wide range of values and skewed distributions (skewness 1.57 for x-values, 1.05 for y-values) suggest outliers and non-uniform patterns that simple linear models may not fully capture. This aligns with earlier findings

Visualization

Feature Importance

Correlation-based feature importance ranking

Interpretation

Purpose

This section ranks 19 features by their linear correlation strength with Age, which this ranking treats as its target variable (the Target Variable Analysis section uses Attrition instead). The absolute correlation values quantify the strength of linear association; longer bars indicate stronger relationships. This ranking helps prioritize variables for modeling and reveals which factors most consistently relate to the target in linear frameworks.

Key Findings

Top Predictor (TotalWorkingYears): Correlation of 0.69 with Age—substantially stronger than other features, indicating career tenure is the most predictive variable
Secondary Predictors: JobLevel (0.51) and MonthlyIncome (0.49) show moderate correlations, suggesting career progression and compensation relate meaningfully to age
Weak Predictors: Bottom-ranked features (EnvironmentSatisfaction, JobInvolvement, DistanceFromHome) have correlations near zero with p-values >0.59, indicating negligible linear relationships
Statistical Significance: Top 10 features show p-values near 0, confirming strong relationships; bottom 9 features are not statistically significant

Interpretation

The feature importance ranking reveals a clear hierarchy: career-related variables (working years, job level, income) dominate predictive power for Age, while satisfaction and environmental factors contribute minimally. The steep drop-off after rank 5 suggests

Visualization

Target Variable Analysis

Target variable distribution, class imbalance, statistical associations, and feature importance

Interpretation

Purpose

This section evaluates the target variable (Attrition) to understand class distribution, identify which features predict attrition, and assess statistical relationships between predictors and the outcome. This foundation is critical for building reliable classification models and understanding which employee characteristics drive attrition risk.

Key Findings

Class Imbalance: 84.4% "No" attrition vs. 15.6% "Yes" (5.41:1 ratio, MODERATE severity) — the minority class is substantially underrepresented, requiring careful model evaluation
Top Predictive Feature: Age ranks first in importance (0.21 point-biserial correlation), followed by JobLevel and JobInvolvement
Significant Categorical Associations: 5 of 7 categorical variables show statistically significant relationships with Attrition (OverTime, JobRole, MaritalStatus, EducationField, BusinessTravel; p < 0.05)
Significant Numeric Associations: 10 of 20 numeric variables significantly differ across attrition classes, with JobLevel, YearsWithCurrManager, and MonthlyIncome showing strongest differences
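
The imbalance figures can be worked through directly. The class counts below are derived from the reported percentages (84.4% and 15.6% of 500 observations) rather than read from the data, and the weighting formula n / (k × class count) is the common "balanced" heuristic, shown here as one possible option rather than the report's method.

```python
# Counts implied by 84.4% / 15.6% of 500 observations (derived, not read
# from the dataset itself).
counts = {"No": 422, "Yes": 78}

n_total = sum(counts.values())
imbalance_ratio = counts["No"] / counts["Yes"]

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / n_total

# "Balanced" class weights: n_total / (n_classes * class_count).
balanced_weights = {
    label: n_total / (len(counts) * count) for label, count in counts.items()
}
```

The ratio comes to about 5.41 and the majority-class baseline to 0.844, so any classifier should be judged against that baseline and with minority-class metrics such as recall. The balanced weights are roughly 0.59 for "No" and 3.21 for "Yes".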

Interpretation

The moderate class imbalance indicates that standard accuracy metrics will be misleading—a model predicting "No" for all cases would achieve 84% accuracy

Data Table

Data Validation

Comprehensive data quality assessment: variance, cardinality, duplicates, and type profiling

variable	expected_range	actual_min	actual_max	validation_status
Age	No predefined range	18	60	PASS
MonthlyIncome	No predefined range	1102	19999	PASS
DistanceFromHome	No predefined range	1	29	PASS
Education	No predefined range	1	5	PASS
EnvironmentSatisfaction	No predefined range	1	4	PASS
JobInvolvement	No predefined range	1	4	PASS
JobLevel	No predefined range	1	5	PASS
JobSatisfaction	No predefined range	1	4	PASS
NumCompaniesWorked	No predefined range	0	9	PASS
PercentSalaryHike	No predefined range	11	25	PASS
PerformanceRating	No predefined range	3	4	PASS
RelationshipSatisfaction	No predefined range	1	4	PASS
StockOptionLevel	No predefined range	0	3	PASS
TotalWorkingYears	No predefined range	0	40	PASS
TrainingTimesLastYear	No predefined range	0	6	PASS
WorkLifeBalance	No predefined range	1	4	PASS
YearsAtCompany	No predefined range	0	40	PASS
YearsInCurrentRole	No predefined range	0	18	PASS
YearsSinceLastPromotion	No predefined range	0	15	PASS
YearsWithCurrManager	No predefined range	0	17	PASS

metric	value	recommendation
Total rows	500
Exact duplicates	0
% duplicates	0.0%
Unique rows	500
Rows after deduplication	500

variable	expected_type	actual_type	type_mismatch	sample_value	notes
Age	integer	integer	False	41
MonthlyIncome	integer	integer	False	5993
DistanceFromHome	integer	integer	False	1
Education	integer	integer	False	2
EnvironmentSatisfaction	integer	integer	False	2
JobInvolvement	integer	integer	False	3
JobLevel	integer	integer	False	2
JobSatisfaction	integer	integer	False	4
NumCompaniesWorked	integer	integer	False	8
PercentSalaryHike	integer	integer	False	11
PerformanceRating	integer	integer	False	3
RelationshipSatisfaction	integer	integer	False	1
StockOptionLevel	integer	integer	False	0
TotalWorkingYears	integer	integer	False	8
TrainingTimesLastYear	integer	integer	False	0
WorkLifeBalance	integer	integer	False	1
YearsAtCompany	integer	integer	False	6
YearsInCurrentRole	integer	integer	False	4
YearsSinceLastPromotion	integer	integer	False	0
YearsWithCurrManager	integer	integer	False	5
Department	character	character	False	Sales	Categorical variable
BusinessTravel	character	character	False	Travel_Rarely	Categorical variable
EducationField	character	character	False	Life Sciences	Categorical variable
Gender	character	character	False	Female	Categorical variable
JobRole	character	character	False	Sales Executive	Categorical variable
MaritalStatus	character	character	False	Single	Categorical variable
OverTime	character	character	False	Yes	Categorical variable

variable	unique_count	total_count	cardinality_ratio	top_10_pct	is_high_cardinality	is_very_high_cardinality	ohe_features_created	recommendation
Department	3	500	0.006	100	False	False	3	One-hot encoding safe
BusinessTravel	3	500	0.006	100	False	False	3	One-hot encoding safe
EducationField	6	500	0.012	100	False	False	6	One-hot encoding safe
Gender	2	500	0.004	100	False	False	2	One-hot encoding safe
JobRole	9	500	0.018	100	False	False	9	One-hot encoding safe
MaritalStatus	3	500	0.006	100	False	False	3	One-hot encoding safe
OverTime	2	500	0.004	100	False	False	2	One-hot encoding safe

Interpretation

Purpose

This section validates the structural integrity and quality of the dataset before analysis. It assesses whether columns have sufficient variance for modeling, whether categorical variables are suitable for encoding, whether duplicates exist, and whether data types are correctly assigned. These checks ensure the dataset is clean and ready for statistical testing and predictive modeling.

Key Findings

Constant Columns: 0 identified – all variables contain meaningful variation
Low-Variance Columns: 0 identified – no features with coefficient of variation below 5%
High-Cardinality Categoricals: 0 identified – all categorical variables safe for one-hot encoding
Exact Duplicates: 0 rows (0%) – dataset contains 500 unique observations
Type Mismatches: 0 columns – all variables correctly typed (numeric vs. character)
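Data Validation: All 20 numeric variables are marked PASS, but no predefined ranges were supplied, so the check only records the observed minimum and maximum rather than testing them against limits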
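
The cardinality figures in the table above follow from simple ratios, sketched below. The unique counts are copied from the table; the `one_hot` helper is a hypothetical illustration of an encoding that keeps every level (matching the ohe_features_created column), not the report's implementation.

```python
def cardinality_ratio(values):
    """Distinct levels divided by the number of rows."""
    return len(set(values)) / len(values)


def one_hot(values):
    """Map each distinct level to a 0/1 indicator column (no level dropped)."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Unique counts copied from the cardinality table (500 rows each).
unique_counts = {
    "Department": 3,
    "BusinessTravel": 3,
    "EducationField": 6,
    "Gender": 2,
    "JobRole": 9,
    "MaritalStatus": 3,
    "OverTime": 2,
}
ratios = {name: k / 500 for name, k in unique_counts.items()}
total_ohe_columns = sum(unique_counts.values())

# Toy column (illustrative values).
toy = ["Sales", "Research & Development", "Sales", "Human Resources"]
encoded = one_hot(toy)
toy_ratio = cardinality_ratio(toy)
```

Encoding all seven categorical variables this way would add 28 indicator columns in total, which stays small relative to 500 rows.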

Interpretation

The dataset demonstrates excellent structural quality with no data integrity issues that would compromise analysis. The absence of constant or low-variance columns means all features contribute meaningful information for statistical testing and modeling. With zero duplicates and correct data types, the dataset is free from preprocessing artifacts that could bias results. The moderate cardinality of categorical variables (2–9 unique values) supports straightforward encoding without dimensionality