Analysis overview and configuration
| Parameter | Value |
|---|---|
| outlier_method | iqr |
| outlier_threshold | 1.5 |
| correlation_method | pearson |
| correlation_threshold | 0.7 |
| alpha | 0.05 |
| max_categories | 20 |
| enabled_analyses | all |
This analysis provides a comprehensive exploratory assessment of an IBM HR employee dataset containing 500 observations across 27 variables (20 numeric, 7 categorical). The objective is to understand data quality, distributions, relationships, and predictive patterns relevant to employee outcomes, establishing a foundation for subsequent modeling or business intelligence activities.
The dataset exhibits excellent data quality with no missing values or duplicates, but contains substantial outliers reflecting real variation in employee metrics. The
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 500 |
| Final Rows | 500 |
| Rows Removed | 0 |
| Retention Rate | 100% |
This section documents the data preprocessing pipeline, showing that all 500 observations were retained without any rows removed during cleaning. This perfect retention rate indicates either exceptionally clean source data or minimal preprocessing interventions, which is critical context for understanding the reliability of downstream statistical tests and predictive modeling efforts.
The perfect retention rate reflects the high baseline quality of the input dataset. Combined with the earlier analysis showing zero missing values, zero exact duplicates, and no data type mismatches, this indicates minimal data quality friction. However, this also means no rows were filtered despite detecting 348 outliers (69.6% of observations), suggesting outliers were retained for analysis rather than treated as errors—a deliberate choice that preserves variance but may inflate standard errors in statistical tests.
The absence of a documented train/test split suggests this analysis is exploratory rather than predictive modeling. The complete dataset retention is
| Finding | Value |
|---|---|
| Dataset Size | 500 observations |
| Data Quality | 100% complete |
| Missing Values | 0% |
| Outliers Detected | 348 observations |
| High Correlations | 7 pairs |
| Skewed Variables | 8 variables |
This section synthesizes the complete exploratory data analysis across 500 observations with 27 variables (20 numeric, 7 categorical). It assesses data readiness for modeling and identifies structural patterns requiring attention before deployment.
Summary statistics and variance analysis for all numeric variables
| variable | count | mean | median | sd | variance | min | max | q1 | q3 | iqr | skewness | kurtosis | cv | is_constant | is_low_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 500 | 36.9 | 36 | 9.36 | 87.6 | 18 | 60 | 30 | 43 | 13 | 0.4378 | 2.61 | 0.2537 | False | False |
| MonthlyIncome | 500 | 6599 | 4952 | 4815 | 2.318e+07 | 1102 | 19999 | 2900 | 8742 | 5842 | 1.323 | 3.798 | 0.7296 | False | False |
| DistanceFromHome | 500 | 9.12 | 6 | 8.255 | 68.15 | 1 | 29 | 2 | 14 | 12 | 0.9751 | 2.717 | 0.9052 | False | False |
| Education | 500 | 2.888 | 3 | 1.036 | 1.074 | 1 | 5 | 2 | 4 | 2 | -0.2187 | 2.389 | 0.3588 | False | False |
| EnvironmentSatisfaction | 500 | 2.678 | 3 | 1.072 | 1.149 | 1 | 4 | 2 | 4 | 2 | -0.2715 | 1.827 | 0.4002 | False | False |
| JobInvolvement | 500 | 2.73 | 3 | 0.6797 | 0.462 | 1 | 4 | 2 | 3 | 1 | -0.5647 | 3.462 | 0.249 | False | False |
| JobLevel | 500 | 2.096 | 2 | 1.128 | 1.273 | 1 | 5 | 1 | 3 | 2 | 1.009 | 3.363 | 0.5384 | False | False |
| JobSatisfaction | 500 | 2.804 | 3 | 1.079 | 1.164 | 1 | 4 | 2 | 4 | 2 | -0.4197 | 1.896 | 0.3848 | False | False |
| NumCompaniesWorked | 500 | 2.688 | 2 | 2.511 | 6.303 | 0 | 9 | 1 | 4 | 3 | 1.032 | 3.064 | 0.934 | False | False |
| PercentSalaryHike | 500 | 15.22 | 14 | 3.729 | 13.9 | 11 | 25 | 12 | 18 | 6 | 0.7952 | 2.572 | 0.245 | False | False |
| PerformanceRating | 500 | 3.16 | 3 | 0.367 | 0.1347 | 3 | 4 | 3 | 3 | 0 | 1.855 | 4.441 | 0.1161 | False | False |
| RelationshipSatisfaction | 500 | 2.822 | 3 | 1.07 | 1.145 | 1 | 4 | 2 | 4 | 2 | -0.4672 | 1.966 | 0.3791 | False | False |
| StockOptionLevel | 500 | 0.752 | 1 | 0.8318 | 0.6919 | 0 | 3 | 0 | 1 | 1 | 1.013 | 3.502 | 1.106 | False | False |
| TotalWorkingYears | 500 | 11.46 | 10 | 7.777 | 60.49 | 0 | 40 | 6 | 16 | 10 | 1.106 | 4.027 | 0.6784 | False | False |
| TrainingTimesLastYear | 500 | 2.8 | 3 | 1.287 | 1.655 | 0 | 6 | 2 | 3 | 1 | 0.5177 | 3.338 | 0.4595 | False | False |
| WorkLifeBalance | 500 | 2.754 | 3 | 0.706 | 0.4985 | 1 | 4 | 2 | 3 | 1 | -0.6041 | 3.479 | 0.2564 | False | False |
| YearsAtCompany | 500 | 7.038 | 5 | 6.458 | 41.71 | 0 | 40 | 3 | 9 | 6 | 1.925 | 7.589 | 0.9176 | False | False |
| YearsInCurrentRole | 500 | 4.238 | 3 | 3.72 | 13.84 | 0 | 18 | 2 | 7 | 5 | 0.9943 | 3.69 | 0.8777 | False | False |
| YearsSinceLastPromotion | 500 | 2.194 | 1 | 3.275 | 10.73 | 0 | 15 | 0 | 3 | 3 | 1.935 | 6.277 | 1.493 | False | False |
| YearsWithCurrManager | 500 | 4.184 | 3 | 3.576 | 12.79 | 0 | 17 | 2 | 7 | 5 | 0.7522 | 3.005 | 0.8547 | False | False |
This section establishes the distributional foundation of the dataset by computing summary statistics for all 20 numeric variables. Understanding central tendency, spread, and shape characteristics is essential for identifying data quality issues, detecting outliers, and determining appropriate analytical methods for the predictive modeling objective.
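The shape columns in the table can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the report's own code: it assumes moment-based estimators with the population denominator for skewness and kurtosis, and it assumes the kurtosis column is non-excess (a normal distribution scores 3), which is consistent with the reported values clustering around 3. The `describe` helper and the toy sample are hypothetical.

```python
import statistics


def describe(values):
    """Return mean, sd, cv, skewness and non-excess kurtosis for a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    # Central moments, using the population (n) denominator (an assumption).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,               # coefficient of variation
        "skewness": m3 / m2 ** 1.5,    # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2,      # non-excess: a normal distribution gives 3
    }


# Toy right-skewed sample; these values are illustrative, not from the dataset.
toy = [1, 1, 2, 2, 3, 3, 4, 5, 8, 15]
toy_stats = describe(toy)

# The cv column is sd / mean; checking the Age row of the table:
age_cv = 9.36 / 36.9
print(round(age_cv, 4))
```

The Age check uses only the mean and sd printed in the table, so it verifies the definition of `cv` without needing the raw data.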
The dataset exhibits non-normal distributions across multiple dimensions,
Correlation matrix showing relationships between numeric variables
This section identifies multicollinearity in the dataset by detecting strong linear relationships (|r| > 0.7) between numeric variables. Understanding these dependencies is critical for predictive modeling, as highly correlated features inflate the variance of coefficient estimates, reduce model interpretability, and make parameter estimates unstable.
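The |r| > 0.7 screen can be sketched as follows. The function names and the toy columns are hypothetical; the column names are borrowed from the dataset, but the values are made up for illustration and the report's actual implementation is not shown in the source.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def high_correlation_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


# Toy columns with made-up values (not taken from the dataset).
toy = {
    "YearsAtCompany": [1, 3, 5, 7, 10, 20],
    "YearsInCurrentRole": [0, 2, 4, 4, 7, 15],
    "DistanceFromHome": [9, 2, 25, 1, 14, 6],
}
flagged = high_correlation_pairs(toy)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Each unordered pair is tested once, so 20 numeric variables yield 190 candidate pairs.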
The dataset exhibits localized multicollinearity rather than systemic redundancy. The strong correlations cluster around tenure-related variables (Years at Company, Years in Current Role, Years with Current Manager), which logically measure related but distinct temporal dimensions of employment history. This pattern suggests these variables capture overlapping information
Distribution shapes and outlier detection for numeric variables
This section identifies and visualizes the distribution of numeric variables across the dataset, with specific focus on detecting anomalous values. The 348 outliers flagged using the IQR method (1.5 × IQR threshold) represent extreme observations that deviate significantly from typical patterns. Understanding these distributions is critical for assessing data quality and determining whether outliers represent genuine business phenomena or measurement errors.
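The IQR rule with the configured threshold of 1.5 can be illustrated using only the quartiles, minima, and maxima printed in the summary statistics table. This is a sketch under that assumption; the helper names are hypothetical, and because it sees only summary values it can say whether a variable has any value beyond the fences, not how many.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def range_exceeds_fences(q1, q3, vmin, vmax, k=1.5):
    """True if the observed minimum or maximum lies outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return vmin < lower or vmax > upper


# (q1, q3, min, max) copied from the summary statistics table.
summary = {
    "Age": (30, 43, 18, 60),
    "MonthlyIncome": (2900, 8742, 1102, 19999),
    "PerformanceRating": (3, 3, 3, 4),
}

results = {
    name: range_exceeds_fences(q1, q3, vmin, vmax)
    for name, (q1, q3, vmin, vmax) in summary.items()
}
for name, flagged in results.items():
    print(name, iqr_fences(*summary[name][:2]), flagged)
```

PerformanceRating has an IQR of 0, so both fences collapse to 3 and every rating of 4 is flagged. Discrete, low-range variables like this may account for a sizable share of the outlier count without indicating data errors.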
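The 3.48% outlier rate is consistent with 348 flags out of 10,000 numeric values (500 observations × 20 variables); the 69.6% figure quoted elsewhere in this report divides the same 348 by the 500 observations, so the two rates are not directly comparable. At the per-value level the rate is moderate, which suggests the dataset contains legitimate extreme values rather than systematic data quality failures. The high skewness and wide value range (0–19,999) indicate several variables have naturally occurring outliers reflecting real business variation. These outliers warrant investigation to determine whether they represent valid observations (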
Missing data patterns and completeness assessment
This section assesses data completeness by identifying missing values across all 27 variables in the dataset. Complete data is essential for reliable statistical analysis and modeling, as missing values can introduce bias, reduce statistical power, and complicate feature engineering. Understanding missingness patterns helps determine whether imputation, removal, or alternative analytical approaches are needed.
The dataset exhibits perfect data completeness, eliminating a major source of analytical uncertainty. This clean state means all 500 observations can be used without data loss from listwise deletion, and no imputation assumptions are needed. The absence of missing data strengthens the validity of statistical tests (140 conducted, 23 significant) and feature importance rankings, as these analyses operate on the full sample without bias from incomplete cases.
This ideal completeness status simplifies downstream modeling but does not address other data quality concerns identified elsewhere: 348 outliers (69.
Numeric variable distributions across categorical groups
This section examines how Age varies across the three Department categories using boxplot analysis. By comparing group medians, spreads, and ranges, we identify whether department membership has a meaningful relationship with employee age—a key indicator of whether this categorical variable should be prioritized in predictive modeling or segmentation analysis.
The three departments show a slight age gradient (HR > R&D > Sales), with Human Resources employees averaging ~4 years older than Sales. However, the substantial overlap in distributions and comparable standard deviations indicate that department explains only modest variation in age. This aligns with the Kruskal-Wallis test results showing no significant differences for numeric
ANOVA and Kruskal-Wallis tests for group differences
This section evaluates whether numeric variables differ significantly across categorical groups using statistical hypothesis testing. Of 140 tests performed, 23 revealed statistically significant differences (p < 0.05), indicating that certain numeric variables behave differently depending on group membership. This identifies which variables have meaningful predictive or explanatory power relative to categorical factors.
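Because 140 tests are run at the same alpha (20 numeric variables × 7 categorical variables gives 140), some of the 23 significant results may be chance findings. The sketch below works through the arithmetic and shows one common adjustment, the Benjamini-Hochberg procedure. The report does not say whether any correction was applied, so this is an illustration rather than a description of its method, and the p-values in the example are made up.

```python
n_tests = 140
n_significant = 23
alpha = 0.05

significance_rate = n_significant / n_tests      # share of tests with p < alpha
expected_false_positives = n_tests * alpha       # if every null hypothesis were true
bonferroni_alpha = alpha / n_tests               # per-test threshold under Bonferroni


def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank whose p-value is at or below rank / m * q.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank
    return sorted(order[:cutoff])


# Made-up p-values for illustration only.
toy_p = [0.001, 0.045, 0.025, 0.2, 0.008]
rejected = benjamini_hochberg(toy_p)

print(f"{significance_rate:.1%} significant, "
      f"about {expected_false_positives:.0f} expected by chance, "
      f"Bonferroni threshold {bonferroni_alpha:.5f}")
```

In the toy example, the unadjusted rule p < 0.05 would accept four of the five results, while the adjusted procedure keeps three.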
The 16.4% significance rate indicates selective rather than pervasive group differences. Most numeric-categorical pairs (83.6%) show no meaningful variation across groups, suggesting limited discriminatory power for those combinations. However, the 23 significant findings identify specific numeric variables that meaningfully stratify by categorical factors—
Shapiro-Wilk normality tests and Q-Q plots
This section evaluates whether the 20 numeric variables follow normal distributions using the Shapiro-Wilk test. Many parametric statistical methods (t-tests, ANOVA, linear regression) assume approximately normal data or residuals. The distributional properties of the data therefore determine which analytical techniques are appropriate and whether data transformations are needed.
The complete absence of normally distributed variables indicates your dataset exhibits non-normal characteristics across all numeric features. This finding is consistent with the earlier observation of 8 skewed variables and 348 outliers detected via IQR method. The data's departure from normality—evidenced by positive skewness in variables like MonthlyIncome (skew=1.32) and YearsSinceLastPromotion (skew=1.94)—means parametric assumptions are violated, potentially affecting the validity of standard statistical tests already
Scatterplot matrix showing pairwise relationships
This section visualizes pairwise relationships between numeric variables through scatterplots, revealing both linear and non-linear associations that correlation coefficients alone cannot capture. By examining 6 variable pairs across 500 observations, the analysis identifies which features move together and detects patterns that may inform predictive modeling and feature engineering decisions.
The pairwise analysis reveals that while some numeric variables exhibit positive correlations (particularly Age with MonthlyIncome), the scatter and high variance indicate these relationships are moderate at best. The wide range of values and skewed distributions (skewness 1.57 for x-values, 1.05 for y-values) suggest outliers and non-uniform patterns that simple linear models may not fully capture. This aligns with earlier findings
Correlation-based feature importance ranking
This section ranks 19 features by their linear correlation strength with the target variable (Age), identifying which predictors have the strongest associations with outcomes. The absolute correlation values quantify predictive power—longer bars indicate stronger relationships. This ranking helps prioritize variables for modeling and reveals which factors most consistently relate to the target in linear frameworks.
The feature importance ranking reveals a clear hierarchy: career-related variables (working years, job level, income) dominate predictive power for Age, while satisfaction and environmental factors contribute minimally. The steep drop-off after rank 5 suggests
Target variable distribution, class imbalance, statistical associations, and feature importance
This section evaluates the target variable (Attrition) to understand class distribution, identify which features predict attrition, and assess statistical relationships between predictors and the outcome. This foundation is critical for building reliable classification models and understanding which employee characteristics drive attrition risk.
The moderate class imbalance indicates that standard accuracy metrics will be misleading—a model predicting "No" for all cases would achieve 84% accuracy
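The majority-class baseline mentioned above can be made concrete. The sketch assumes the 84% share corresponds to 420 "No" and 80 "Yes" cases out of 500; the report states only the percentage, so these counts are an assumption, and the function name is hypothetical.

```python
def always_negative_baseline(n_negative, n_positive):
    """Metrics for a classifier that predicts the negative class for every case."""
    total = n_negative + n_positive
    accuracy = n_negative / total      # every negative is right, every positive wrong
    recall_positive = 0.0              # no positive case is ever identified
    recall_negative = 1.0
    balanced_accuracy = (recall_positive + recall_negative) / 2
    return {
        "accuracy": accuracy,
        "recall_positive": recall_positive,
        "balanced_accuracy": balanced_accuracy,
    }


# Assumed counts: 84% of 500 observations labelled "No" (an assumption).
baseline = always_negative_baseline(n_negative=420, n_positive=80)
print(baseline)
```

Balanced accuracy and recall on the "Yes" class expose what plain accuracy hides: this baseline never identifies a single attrition case.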
Comprehensive data quality assessment: variance, cardinality, duplicates, and type profiling
| variable | expected_range | actual_min | actual_max | out_of_range_count | validation_status | notes |
|---|---|---|---|---|---|---|
| Age | No predefined range | 18 | 60 | 0 | PASS | |
| MonthlyIncome | No predefined range | 1102 | 19999 | 0 | PASS | |
| DistanceFromHome | No predefined range | 1 | 29 | 0 | PASS | |
| Education | No predefined range | 1 | 5 | 0 | PASS | |
| EnvironmentSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| JobInvolvement | No predefined range | 1 | 4 | 0 | PASS | |
| JobLevel | No predefined range | 1 | 5 | 0 | PASS | |
| JobSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| NumCompaniesWorked | No predefined range | 0 | 9 | 0 | PASS | |
| PercentSalaryHike | No predefined range | 11 | 25 | 0 | PASS | |
| PerformanceRating | No predefined range | 3 | 4 | 0 | PASS | |
| RelationshipSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| StockOptionLevel | No predefined range | 0 | 3 | 0 | PASS | |
| TotalWorkingYears | No predefined range | 0 | 40 | 0 | PASS | |
| TrainingTimesLastYear | No predefined range | 0 | 6 | 0 | PASS | |
| WorkLifeBalance | No predefined range | 1 | 4 | 0 | PASS | |
| YearsAtCompany | No predefined range | 0 | 40 | 0 | PASS | |
| YearsInCurrentRole | No predefined range | 0 | 18 | 0 | PASS | |
| YearsSinceLastPromotion | No predefined range | 0 | 15 | 0 | PASS | |
| YearsWithCurrManager | No predefined range | 0 | 17 | 0 | PASS | |
| metric | value | recommendation |
|---|---|---|
| Total rows | 500 | |
| Exact duplicates | 0 | |
| % duplicates | 0.0% | |
| Unique rows | 500 | |
| Rows after deduplication | 500 | |
| variable | expected_type | actual_type | type_mismatch | sample_value | notes |
|---|---|---|---|---|---|
| Age | integer | integer | False | 41 | |
| MonthlyIncome | integer | integer | False | 5993 | |
| DistanceFromHome | integer | integer | False | 1 | |
| Education | integer | integer | False | 2 | |
| EnvironmentSatisfaction | integer | integer | False | 2 | |
| JobInvolvement | integer | integer | False | 3 | |
| JobLevel | integer | integer | False | 2 | |
| JobSatisfaction | integer | integer | False | 4 | |
| NumCompaniesWorked | integer | integer | False | 8 | |
| PercentSalaryHike | integer | integer | False | 11 | |
| PerformanceRating | integer | integer | False | 3 | |
| RelationshipSatisfaction | integer | integer | False | 1 | |
| StockOptionLevel | integer | integer | False | 0 | |
| TotalWorkingYears | integer | integer | False | 8 | |
| TrainingTimesLastYear | integer | integer | False | 0 | |
| WorkLifeBalance | integer | integer | False | 1 | |
| YearsAtCompany | integer | integer | False | 6 | |
| YearsInCurrentRole | integer | integer | False | 4 | |
| YearsSinceLastPromotion | integer | integer | False | 0 | |
| YearsWithCurrManager | integer | integer | False | 5 | |
| Department | character | character | False | Sales | Categorical variable |
| BusinessTravel | character | character | False | Travel_Rarely | Categorical variable |
| EducationField | character | character | False | Life Sciences | Categorical variable |
| Gender | character | character | False | Female | Categorical variable |
| JobRole | character | character | False | Sales Executive | Categorical variable |
| MaritalStatus | character | character | False | Single | Categorical variable |
| OverTime | character | character | False | Yes | Categorical variable |
| variable | unique_count | total_count | cardinality_ratio | top_10_pct | is_high_cardinality | is_very_high_cardinality | ohe_features_created | recommendation |
|---|---|---|---|---|---|---|---|---|
| Department | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| BusinessTravel | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| EducationField | 6 | 500 | 0.012 | 100 | False | False | 6 | One-hot encoding safe |
| Gender | 2 | 500 | 0.004 | 100 | False | False | 2 | One-hot encoding safe |
| JobRole | 9 | 500 | 0.018 | 100 | False | False | 9 | One-hot encoding safe |
| MaritalStatus | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| OverTime | 2 | 500 | 0.004 | 100 | False | False | 2 | One-hot encoding safe |
This section validates the structural integrity and quality of the dataset before analysis. It assesses whether columns have sufficient variance for modeling, whether categorical variables are suitable for encoding, whether duplicates exist, and whether data types are correctly assigned. These checks ensure the dataset is clean and ready for statistical testing and predictive modeling.
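The cardinality columns follow directly from the unique counts in the table above. In the sketch below, treating `max_categories` (20 in the configuration) as the cutoff for "One-hot encoding safe" is an assumption, since the report does not state the rule it used.

```python
# Unique counts copied from the cardinality table.
unique_counts = {
    "Department": 3,
    "BusinessTravel": 3,
    "EducationField": 6,
    "Gender": 2,
    "JobRole": 9,
    "MaritalStatus": 3,
    "OverTime": 2,
}
total_rows = 500
max_categories = 20  # from the configuration table; its use as the cutoff is assumed

# cardinality_ratio = unique values / total rows
cardinality_ratio = {name: k / total_rows for name, k in unique_counts.items()}

# One indicator column per category, as in the ohe_features_created column.
ohe_columns_full = sum(unique_counts.values())

# Dropping one reference level per variable avoids perfectly collinear indicators.
ohe_columns_drop_first = sum(k - 1 for k in unique_counts.values())

all_safe = all(k <= max_categories for k in unique_counts.values())

print(cardinality_ratio["JobRole"], ohe_columns_full, ohe_columns_drop_first, all_safe)
```

Full one-hot encoding of the seven categorical variables adds 28 indicator columns; dropping a reference level per variable reduces this to 21, which matters for linear models that include an intercept.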
The dataset demonstrates excellent structural quality with no data integrity issues that would compromise analysis. The absence of constant or low-variance columns means all features contribute meaningful information for statistical testing and modeling. With zero duplicates and correct data types, the dataset is free from preprocessing artifacts that could bias results. The moderate cardinality of categorical variables (2–9 unique values) supports straightforward encoding without dimensionality