Analysis Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| outlier_method | iqr |
| outlier_threshold | 1.5 |
| correlation_method | pearson |
| correlation_threshold | 0.7 |
| alpha | 0.05 |
| max_categories | 20 |
| enabled_analyses | all |
Purpose
This analysis provides a comprehensive exploratory assessment of an IBM HR employee dataset containing 500 observations across 27 variables (20 numeric, 7 categorical). The objective is to understand data quality, distributions, relationships, and predictive patterns relevant to employee outcomes, establishing a foundation for subsequent modeling or business intelligence activities.
Key Findings
- Data Completeness: 500 observations with zero missing values and zero duplicates—dataset is clean and ready for analysis
- Outlier Prevalence: 348 outlier values flagged (3.48% of the 10,000 numeric values, i.e., 500 rows × 20 variables), concentrated in numeric variables with high skewness (8 skewed variables identified)
- Target Imbalance: Binary target shows moderate imbalance (84.4% "No" vs. 15.6% "Yes"; ratio 5.41:1)—requires monitoring for model bias
- Predictive Signals: 10 numeric variables show significant associations with target; 5 categorical variables significantly associated (chi-square tests, p<0.05)
- Correlation Structure: 7 high correlations (|r| > 0.7) identified; top predictor is numeric_14 (r=0.69), which matches the TotalWorkingYears correlation with Age reported under Feature Importance
Interpretation
The dataset exhibits excellent data quality with no missing values or duplicates, but contains substantial outliers reflecting real variation in employee metrics. The
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline, showing that all 500 observations were retained without any rows removed during cleaning. This perfect retention rate indicates either exceptionally clean source data or minimal preprocessing interventions, which is critical context for understanding the reliability of downstream statistical tests and predictive modeling efforts.
Key Findings
- Retention Rate: 100% (500/500 rows preserved) - No observations were excluded during quality checks or cleaning procedures
- Rows Removed: 0 - Aligns with earlier findings showing zero missing values, zero duplicates, and zero type mismatches across all 27 variables
- Data Integrity: Complete dataset preservation suggests the raw data met quality thresholds without requiring imputation, deduplication, or outlier removal
Interpretation
The perfect retention rate reflects the high baseline quality of the input dataset. Combined with the earlier analysis showing zero missing values, zero exact duplicates, and no data type mismatches, this indicates minimal data quality friction. However, no rows were filtered even though 348 outlier values were flagged (3.48% of the 10,000 numeric values), which suggests outliers were retained for analysis rather than treated as errors. This deliberate choice preserves variance but may inflate standard errors in statistical tests.
Context
The absence of a documented train/test split suggests this analysis is exploratory rather than predictive modeling. The complete dataset retention is
Executive Summary
Executive summary and key takeaways
| Finding | Value |
|---|---|
| Dataset Size | 500 observations |
| Data Quality | 100% complete |
| Missing Values | 0% |
| Outliers Detected | 348 values |
| High Correlations | 7 pairs |
| Skewed Variables | 8 variables |
Data Quality:
• Missing values: 0.0% overall
• Outliers detected: 348 using IQR method
• Data retention: 100.0% (500/500 observations)
Key Patterns:
• High correlations: 7 pairs (|r| > 0.70)
• Skewed distributions: 8 variables
• Categorical variables: 7
Recommendations:
• Review outliers for data quality issues or genuine extreme values
• Consider transformations for skewed variables before modeling
• Investigate high correlations to address multicollinearity
• No imputation or row removal is needed for missing values (0% missing)
EXECUTIVE SUMMARY: DATA QUALITY & EXPLORATORY ANALYSIS
Purpose
This section synthesizes the complete exploratory data analysis across 500 observations with 27 variables (20 numeric, 7 categorical). It assesses data readiness for modeling and identifies structural patterns requiring attention before deployment.
Key Findings
- Data Completeness: 100% (0% missing values) - Dataset requires no imputation; all 500 observations retained
- Outlier Prevalence: 348 values flagged (3.48% of the 10,000 numeric values) using IQR method - A moderate rate; flagged values may be legitimate extremes or measurement issues and warrant investigation
- Distribution Skewness: 8 variables exhibit significant skew (|skew| > 1; range: 1.01–1.94) - Indicates non-normal distributions; transformations recommended for parametric modeling
- Multicollinearity Risk: 7 high correlations (|r| > 0.70) detected - Potential redundancy in feature space; feature selection needed
- Target Imbalance: 84.4% vs. 15.6% class split (5.41:1 ratio) - Moderate severity; stratified sampling and class weighting are worth considering
- Predictive Signal: 10 numeric features show significant associations with target; numeric_14, numeric_7, an
Descriptive Statistics
Summary statistics and variance analysis for all numeric variables
| variable | count | mean | median | sd | variance | min | max | q1 | q3 | iqr | skewness | kurtosis | cv | is_constant | is_low_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 500 | 36.9 | 36 | 9.36 | 87.6 | 18 | 60 | 30 | 43 | 13 | 0.4378 | 2.61 | 0.2537 | False | False |
| MonthlyIncome | 500 | 6599 | 4952 | 4815 | 2.318e+07 | 1102 | 19999 | 2900 | 8742 | 5842 | 1.323 | 3.798 | 0.7296 | False | False |
| DistanceFromHome | 500 | 9.12 | 6 | 8.255 | 68.15 | 1 | 29 | 2 | 14 | 12 | 0.9751 | 2.717 | 0.9052 | False | False |
| Education | 500 | 2.888 | 3 | 1.036 | 1.074 | 1 | 5 | 2 | 4 | 2 | -0.2187 | 2.389 | 0.3588 | False | False |
| EnvironmentSatisfaction | 500 | 2.678 | 3 | 1.072 | 1.149 | 1 | 4 | 2 | 4 | 2 | -0.2715 | 1.827 | 0.4002 | False | False |
| JobInvolvement | 500 | 2.73 | 3 | 0.6797 | 0.462 | 1 | 4 | 2 | 3 | 1 | -0.5647 | 3.462 | 0.249 | False | False |
| JobLevel | 500 | 2.096 | 2 | 1.128 | 1.273 | 1 | 5 | 1 | 3 | 2 | 1.009 | 3.363 | 0.5384 | False | False |
| JobSatisfaction | 500 | 2.804 | 3 | 1.079 | 1.164 | 1 | 4 | 2 | 4 | 2 | -0.4197 | 1.896 | 0.3848 | False | False |
| NumCompaniesWorked | 500 | 2.688 | 2 | 2.511 | 6.303 | 0 | 9 | 1 | 4 | 3 | 1.032 | 3.064 | 0.934 | False | False |
| PercentSalaryHike | 500 | 15.22 | 14 | 3.729 | 13.9 | 11 | 25 | 12 | 18 | 6 | 0.7952 | 2.572 | 0.245 | False | False |
| PerformanceRating | 500 | 3.16 | 3 | 0.367 | 0.1347 | 3 | 4 | 3 | 3 | 0 | 1.855 | 4.441 | 0.1161 | False | False |
| RelationshipSatisfaction | 500 | 2.822 | 3 | 1.07 | 1.145 | 1 | 4 | 2 | 4 | 2 | -0.4672 | 1.966 | 0.3791 | False | False |
| StockOptionLevel | 500 | 0.752 | 1 | 0.8318 | 0.6919 | 0 | 3 | 0 | 1 | 1 | 1.013 | 3.502 | 1.106 | False | False |
| TotalWorkingYears | 500 | 11.46 | 10 | 7.777 | 60.49 | 0 | 40 | 6 | 16 | 10 | 1.106 | 4.027 | 0.6784 | False | False |
| TrainingTimesLastYear | 500 | 2.8 | 3 | 1.287 | 1.655 | 0 | 6 | 2 | 3 | 1 | 0.5177 | 3.338 | 0.4595 | False | False |
| WorkLifeBalance | 500 | 2.754 | 3 | 0.706 | 0.4985 | 1 | 4 | 2 | 3 | 1 | -0.6041 | 3.479 | 0.2564 | False | False |
| YearsAtCompany | 500 | 7.038 | 5 | 6.458 | 41.71 | 0 | 40 | 3 | 9 | 6 | 1.925 | 7.589 | 0.9176 | False | False |
| YearsInCurrentRole | 500 | 4.238 | 3 | 3.72 | 13.84 | 0 | 18 | 2 | 7 | 5 | 0.9943 | 3.69 | 0.8777 | False | False |
| YearsSinceLastPromotion | 500 | 2.194 | 1 | 3.275 | 10.73 | 0 | 15 | 0 | 3 | 3 | 1.935 | 6.277 | 1.493 | False | False |
| YearsWithCurrManager | 500 | 4.184 | 3 | 3.576 | 12.79 | 0 | 17 | 2 | 7 | 5 | 0.7522 | 3.005 | 0.8547 | False | False |
Purpose
This section establishes the distributional foundation of the dataset by computing summary statistics for all 20 numeric variables. Understanding central tendency, spread, and shape characteristics is essential for identifying data quality issues, detecting outliers, and determining appropriate analytical methods for the predictive modeling objective.
Key Findings
- Skewed Variables: 8 of 20 numeric variables exhibit significant skewness (|skew| > 1), with YearsSinceLastPromotion (1.94), YearsAtCompany (1.92), and PerformanceRating (1.86) showing the strongest right skew; MonthlyIncome follows at 1.32
- Variance Quality: All 20 numeric variables retain sufficient variance; zero constant or low-variance columns detected, indicating all features contribute meaningful information
- Coefficient of Variation: Ranges from 0.12 (PerformanceRating) to 1.49 (YearsSinceLastPromotion), reflecting heterogeneous variability across features
- Range Disparity: MonthlyIncome spans 1102–19999 (CV=0.73) while satisfaction metrics cluster tightly around 2–3 (CV=0.25–0.40)
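The skewness screen in the findings above can be checked directly against the summary table. The following is a minimal sketch, assuming the |skew| > 1 rule stated in the findings; the dictionary copies the skewness column from the table, and `flag_skewed` is a hypothetical helper name, not part of the report's pipeline.

```python
# Skewness values copied from the summary statistics table above.
SKEWNESS = {
    "Age": 0.4378,
    "MonthlyIncome": 1.323,
    "DistanceFromHome": 0.9751,
    "Education": -0.2187,
    "EnvironmentSatisfaction": -0.2715,
    "JobInvolvement": -0.5647,
    "JobLevel": 1.009,
    "JobSatisfaction": -0.4197,
    "NumCompaniesWorked": 1.032,
    "PercentSalaryHike": 0.7952,
    "PerformanceRating": 1.855,
    "RelationshipSatisfaction": -0.4672,
    "StockOptionLevel": 1.013,
    "TotalWorkingYears": 1.106,
    "TrainingTimesLastYear": 0.5177,
    "WorkLifeBalance": -0.6041,
    "YearsAtCompany": 1.925,
    "YearsInCurrentRole": 0.9943,
    "YearsSinceLastPromotion": 1.935,
    "YearsWithCurrManager": 0.7522,
}


def flag_skewed(skewness, cutoff=1.0):
    """Return (variable, skew) pairs with |skew| > cutoff, strongest first."""
    flagged = [(name, s) for name, s in skewness.items() if abs(s) > cutoff]
    return sorted(flagged, key=lambda pair: abs(pair[1]), reverse=True)


skewed = flag_skewed(SKEWNESS)
for name, s in skewed:
    print(f"{name}: {s:.2f}")
```

With the table's values, this screen returns eight variables, all right-skewed.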
Interpretation
The dataset exhibits non-normal distributions across multiple dimensions,
Correlation Analysis
Correlation matrix showing relationships between numeric variables
Purpose
This section identifies multicollinearity in the dataset by detecting strong linear relationships (|r| > 0.7) between numeric variables. Understanding these dependencies is critical for predictive modeling, as highly correlated features can inflate the variance of coefficient estimates, reduce model interpretability, and create instability in parameter estimation.
Key Findings
- High Correlations Detected: 7 pairs of variables exceed the 0.7 threshold, indicating substantial linear dependence
- Correlation Method: Pearson correlation captures linear relationships; notable pairs include YearsAtCompany–YearsWithCurrManager (r=0.77) and YearsInCurrentRole–YearsWithCurrManager (r=0.74)
- Statistical Significance: All 7 high correlations are marked as significant, confirming these relationships are unlikely due to chance
- Overall Correlation Structure: Mean correlation of 0.12 across all 400 cells of the 20 × 20 correlation matrix (190 unique variable pairs) suggests most features are relatively independent, with multicollinearity concentrated in specific clusters
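The threshold screen described above can be sketched as follows. This is an illustration under stated assumptions, not the report's implementation: `pearson` and `high_correlation_pairs` are hypothetical helper names, the toy columns are invented for the example, and constant columns (zero variance) are not handled.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    return cov / math.sqrt(var_x * var_y)


def high_correlation_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for each unique pair with |r| > threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical toy columns, not values from the dataset.
toy = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],  # exact multiple of x
    "z": [1, 0, 1, 0],  # only weakly related to x and y
}
pairs = high_correlation_pairs(toy)

# 20 numeric variables give 190 unique pairs but 400 matrix cells.
unique_pairs = math.comb(20, 2)
matrix_cells = 20 * 20
```

Iterating over unique pairs avoids counting each correlation twice and skips the diagonal, which matters when summarizing a full correlation matrix.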
Interpretation
The dataset exhibits localized multicollinearity rather than systemic redundancy. The strong correlations cluster around tenure-related variables (Years at Company, Years in Current Role, Years with Current Manager), which logically measure related but distinct temporal dimensions of employment history. This pattern suggests these variables capture overlapping information
Distribution Analysis
Distribution shapes and outlier detection for numeric variables
Purpose
This section identifies and visualizes the distribution of numeric variables across the dataset, with specific focus on detecting anomalous values. The 348 outliers flagged using the IQR method (1.5 × IQR threshold) represent extreme observations that deviate significantly from typical patterns. Understanding these distributions is critical for assessing data quality and determining whether outliers represent genuine business phenomena or measurement errors.
Key Findings
- Outliers Detected: 348 values (3.48% of the 10,000 distribution records, i.e., 500 rows × 20 variables) flagged as anomalies
- Detection Method: IQR-based approach flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, a standard statistical threshold
- Distribution Characteristics: Values range from 0 to 19,999 with an overall right skew of 0.56, indicating right-tailed distributions with extreme high values
- Variable Coverage: All 20 numeric variables were screened with 500 observations each; not every variable contains outliers (Age, with fences of 10.5 and 62.5 against an observed range of 18–60, has none)
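The IQR rule can be worked through with quartiles from the summary statistics table. This is a minimal sketch, assuming the configured multiplier of 1.5 and the table's q1 and q3 values as given; `iqr_fences` and `is_outlier` are hypothetical helper names.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them are flagged."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies strictly outside the IQR fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# Quartiles and extremes copied from the summary statistics table.
income_fences = iqr_fences(2900, 8742)  # MonthlyIncome
age_fences = iqr_fences(30, 43)         # Age
rating_fences = iqr_fences(3, 3)        # PerformanceRating (IQR = 0)

income_max_flagged = is_outlier(19999, 2900, 8742)  # maximum MonthlyIncome
age_max_flagged = is_outlier(60, 30, 43)            # maximum Age
rating_4_flagged = is_outlier(4, 3, 3)              # any rating of 4
```

Under these assumptions the maximum MonthlyIncome (19,999) lies above the upper fence of 17,505, no Age value falls outside 10.5 to 62.5, and every PerformanceRating of 4 is flagged because that variable's IQR is 0.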
Interpretation
The moderate outlier rate (3.48%) suggests the dataset contains legitimate extreme values rather than systematic data quality failures. The high skewness and wide value range (0–19,999) indicate several variables have naturally occurring outliers reflecting real business variation. These outliers warrant investigation to determine whether they represent valid observations (
Missing Value Analysis
Missing data patterns and completeness assessment
Purpose
This section assesses data completeness by identifying missing values across all 27 variables in the dataset. Complete data is essential for reliable statistical analysis and modeling, as missing values can introduce bias, reduce statistical power, and complicate feature engineering. Understanding missingness patterns helps determine whether imputation, removal, or alternative analytical approaches are needed.
Key Findings
- Overall Missing Percentage: 0% – The dataset contains no missing values across any variable
- Rows with Missing Data: 0 – All 500 observations are complete with values for every variable
- Variable Coverage: All 27 variables (20 numeric, 7 categorical) have 100% data completeness
- No Imputation Required: Zero variables exceed the 10% missing threshold that would typically trigger imputation or removal decisions
Interpretation
The dataset exhibits perfect data completeness, eliminating a major source of analytical uncertainty. This clean state means all 500 observations can be used without data loss from listwise deletion, and no imputation assumptions are needed. The absence of missing data strengthens the validity of statistical tests (140 conducted, 23 significant) and feature importance rankings, as these analyses operate on the full sample without bias from incomplete cases.
Context
This ideal completeness status simplifies downstream modeling but does not address other data quality concerns identified elsewhere: 348 outliers (69.
Bivariate Analysis
Numeric variable distributions across categorical groups
Purpose
This section examines how Age varies across the three Department categories using boxplot analysis. By comparing group medians, spreads, and ranges, we identify whether department membership has a meaningful relationship with employee age—a key indicator of whether this categorical variable should be prioritized in predictive modeling or segmentation analysis.
Key Findings
- Human Resources Mean Age: 39.79 years (n=14) - Oldest department on average, though smallest sample size
- Research & Development Mean Age: 37.24 years (n=333) - Largest group with moderate age, median of 36
- Sales Mean Age: 35.88 years (n=153) - Youngest department, median of 34
- Spread Consistency: Standard deviations range 9.16–12.81, indicating similar variability across groups
- Age Range: All departments span 18–60 years, with overlapping distributions suggesting modest departmental differences
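As a consistency check on the group figures above, the size-weighted average of the three department means should reproduce the overall Age mean in the summary statistics table (36.9). The sketch below uses only numbers quoted in the findings; the variable names are illustrative.

```python
# (mean age, group size) per department, copied from the findings above.
groups = {
    "Human Resources": (39.79, 14),
    "Research & Development": (37.24, 333),
    "Sales": (35.88, 153),
}

total_n = sum(n for _, n in groups.values())
pooled_mean = sum(mean * n for mean, n in groups.values()) / total_n
hr_sales_gap = groups["Human Resources"][0] - groups["Sales"][0]

print(total_n, round(pooled_mean, 2), round(hr_sales_gap, 2))
```

The group sizes sum to 500 and the pooled mean rounds to 36.9, so the department breakdown is consistent with the overall statistics; the gap between Human Resources and Sales is about 3.9 years.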
Interpretation
The three departments show a slight age gradient (HR > R&D > Sales), with Human Resources employees averaging ~4 years older than Sales. However, the substantial overlap in distributions and comparable standard deviations indicate that department explains only modest variation in age. This aligns with the Kruskal-Wallis test results showing no significant differences for numeric
Statistical Tests
ANOVA and Kruskal-Wallis tests for group differences
Purpose
This section evaluates whether numeric variables differ significantly across categorical groups using statistical hypothesis testing. Of 140 tests performed, 23 revealed statistically significant differences (p < 0.05), indicating that certain numeric variables behave differently depending on group membership. This identifies which variables have meaningful predictive or explanatory power relative to categorical factors.
Key Findings
- Total Tests Conducted: 140 across 20 numeric variables and 7 categorical variables
- Significant Results: 23 tests (16.4%) showed p-values below the 0.05 threshold; with 140 tests, about 7 such results would be expected by chance alone, and no multiple-comparison correction is reported
- Test Method: Kruskal-Wallis tests used exclusively, appropriate given that 0 of 20 numeric variables follow normal distributions
- P-Value Distribution: Mean p-value of 0.39 with median of 0.32 suggests most variable-group pairs show no significant association
- Strongest Signals: Age and YearsWithCurrManager show multiple significant associations with categorical variables like JobRole
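Because 140 tests were run at α = 0.05, some significant results are expected by chance. The sketch below shows the arithmetic and two common adjustments (Bonferroni and Benjamini-Hochberg). It is an illustration only: the report does not state that any correction was applied, the function names are hypothetical, and the example p-values are invented.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of p < alpha results if every null hypothesis held."""
    return n_tests * alpha


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that bounds the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return booleans, True where BH rejects at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


by_chance = expected_false_positives(140)
per_test = bonferroni_alpha(140)

# Hypothetical p-values for illustration only; not results from the report.
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_rejected = benjamini_hochberg(example_p)
```

Bonferroni is the more conservative choice; Benjamini-Hochberg controls the false discovery rate instead and typically keeps more findings when many tests are run.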
Interpretation
The 16.4% significance rate indicates selective rather than pervasive group differences. Most numeric-categorical pairs (83.6%) show no meaningful variation across groups, suggesting limited discriminatory power for those combinations. However, the 23 significant findings identify specific numeric variables that meaningfully stratify by categorical factors—
Normality Tests
Shapiro-Wilk normality tests and Q-Q plots
Purpose
This section evaluates whether the 20 numeric variables follow normal distributions using the Shapiro-Wilk test. Normality is a critical assumption for parametric statistical methods (t-tests, ANOVA, linear regression). Understanding the distributional properties of your data determines which analytical techniques are appropriate and whether data transformations are needed.
Key Findings
- Variables Tested: 20 numeric variables analyzed
- Normal Distributions Found: 0 (0% of variables pass normality test at p > 0.05)
- Test Used: Shapiro-Wilk with all p-values reported as 0 (below the displayed precision), indicating strong evidence against normality across all variables
- Pattern Observed: All variables show significant deviation from normality, with Q-Q plots revealing systematic departures from the theoretical normal line
Interpretation
The complete absence of normally distributed variables indicates your dataset exhibits non-normal characteristics across all numeric features. This finding is consistent with the earlier observation of 8 skewed variables and 348 outliers detected via IQR method. The data's departure from normality—evidenced by positive skewness in variables like MonthlyIncome (skew=1.32) and YearsSinceLastPromotion (skew=1.94)—means parametric assumptions are violated, potentially affecting the validity of standard statistical tests already
Pairwise Scatterplots
Scatterplot matrix showing pairwise relationships
Purpose
This section visualizes pairwise relationships between numeric variables through scatterplots, revealing both linear and non-linear associations that correlation coefficients alone cannot capture. By examining 6 variable pairs across 500 observations, the analysis identifies which features move together and detects patterns that may inform predictive modeling and feature engineering decisions.
Key Findings
- Variable Pairs Analyzed: 6 combinations examined across 3,000 plotted points
- Age vs. MonthlyIncome: Shows moderate positive relationship (r=0.49), with income ranging 1,102–19,999 across age span 18–60
- High Variability in Income: Standard deviation of 4,814.58 indicates substantial income dispersion, much of which is not explained by age given r=0.49
- Non-linear Patterns: Scattered distributions suggest relationships may not be purely linear; curved or clustered patterns could indicate threshold effects or categorical influences
Interpretation
The pairwise analysis reveals that while some numeric variables exhibit positive correlations (particularly Age with MonthlyIncome), the scatter and high variance indicate these relationships are moderate at best. The wide range of values and skewed distributions (skewness 1.57 for x-values, 1.05 for y-values) suggest outliers and non-uniform patterns that simple linear models may not fully capture. This aligns with earlier findings
Feature Importance
Correlation-based feature importance ranking
Purpose
This section ranks 19 features by their linear correlation strength with Age, the variable this ranking treats as its target (Attrition is analyzed separately under Target Variable Analysis). The absolute correlation values quantify the strength of linear association; longer bars indicate stronger relationships. This ranking helps prioritize variables for modeling and reveals which factors most consistently relate to the target in linear frameworks.
Key Findings
- Top Predictor (TotalWorkingYears): Correlation of 0.69 with Age—substantially stronger than other features, indicating career tenure is the most predictive variable
- Secondary Predictors: JobLevel (0.51) and MonthlyIncome (0.49) show moderate correlations, suggesting career progression and compensation relate meaningfully to age
- Weak Predictors: Bottom-ranked features (EnvironmentSatisfaction, JobInvolvement, DistanceFromHome) have correlations near zero with p-values >0.59, indicating negligible linear relationships
- Statistical Significance: Top 10 features show p-values near 0, confirming strong relationships; bottom 9 features are not statistically significant
Interpretation
The feature importance ranking reveals a clear hierarchy: career-related variables (working years, job level, income) dominate predictive power for Age, while satisfaction and environmental factors contribute minimally. The steep drop-off after rank 5 suggests
Target Variable Analysis
Target variable distribution, class imbalance, statistical associations, and feature importance
Purpose
This section evaluates the target variable (Attrition) to understand class distribution, identify which features predict attrition, and assess statistical relationships between predictors and the outcome. This foundation is critical for building reliable classification models and understanding which employee characteristics drive attrition risk.
Key Findings
- Class Imbalance: 84.4% "No" attrition vs. 15.6% "Yes" (5.41:1 ratio, MODERATE severity) — the minority class is substantially underrepresented, requiring careful model evaluation
- Top Predictive Feature: Age ranks first in importance (0.21 point-biserial correlation), followed by JobLevel and JobInvolvement
- Significant Categorical Associations: 5 of 7 categorical variables show statistically significant relationships with Attrition (OverTime, JobRole, MaritalStatus, EducationField, BusinessTravel; p < 0.05)
- Significant Numeric Associations: 10 of 20 numeric variables significantly differ across attrition classes, with JobLevel, YearsWithCurrManager, and MonthlyIncome showing strongest differences
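The imbalance figures above can be turned into a baseline and a set of class weights. This is a minimal sketch: the class counts are derived from 84.4% and 15.6% of 500 rows rather than reported directly, and the "balanced" weighting formula is one common heuristic, not necessarily what any downstream model in this analysis uses.

```python
# Class counts implied by 84.4% / 15.6% of 500 rows (derived, not reported).
counts = {"No": 422, "Yes": 78}
total = sum(counts.values())

imbalance_ratio = counts["No"] / counts["Yes"]
majority_baseline_accuracy = max(counts.values()) / total

# "Balanced" weights: total / (n_classes * class_count), a common heuristic.
class_weights = {
    label: total / (len(counts) * n) for label, n in counts.items()
}

print(round(imbalance_ratio, 2), majority_baseline_accuracy)
print({label: round(w, 3) for label, w in class_weights.items()})
```

A classifier that always predicts "No" already reaches the majority baseline, so metrics such as recall, precision, or balanced accuracy on the "Yes" class are more informative than raw accuracy here.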
Interpretation
The moderate class imbalance indicates that standard accuracy metrics will be misleading—a model predicting "No" for all cases would achieve 84% accuracy
Data Validation
Comprehensive data quality assessment: variance, cardinality, duplicates, and type profiling
| variable | expected_range | actual_min | actual_max | out_of_range_count | validation_status | notes |
|---|---|---|---|---|---|---|
| Age | No predefined range | 18 | 60 | 0 | PASS | |
| MonthlyIncome | No predefined range | 1102 | 19999 | 0 | PASS | |
| DistanceFromHome | No predefined range | 1 | 29 | 0 | PASS | |
| Education | No predefined range | 1 | 5 | 0 | PASS | |
| EnvironmentSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| JobInvolvement | No predefined range | 1 | 4 | 0 | PASS | |
| JobLevel | No predefined range | 1 | 5 | 0 | PASS | |
| JobSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| NumCompaniesWorked | No predefined range | 0 | 9 | 0 | PASS | |
| PercentSalaryHike | No predefined range | 11 | 25 | 0 | PASS | |
| PerformanceRating | No predefined range | 3 | 4 | 0 | PASS | |
| RelationshipSatisfaction | No predefined range | 1 | 4 | 0 | PASS | |
| StockOptionLevel | No predefined range | 0 | 3 | 0 | PASS | |
| TotalWorkingYears | No predefined range | 0 | 40 | 0 | PASS | |
| TrainingTimesLastYear | No predefined range | 0 | 6 | 0 | PASS | |
| WorkLifeBalance | No predefined range | 1 | 4 | 0 | PASS | |
| YearsAtCompany | No predefined range | 0 | 40 | 0 | PASS | |
| YearsInCurrentRole | No predefined range | 0 | 18 | 0 | PASS | |
| YearsSinceLastPromotion | No predefined range | 0 | 15 | 0 | PASS | |
| YearsWithCurrManager | No predefined range | 0 | 17 | 0 | PASS | |
| metric | value | recommendation |
|---|---|---|
| Total rows | 500 | |
| Exact duplicates | 0 | |
| % duplicates | 0.0% | |
| Unique rows | 500 | |
| Rows after deduplication | 500 | |
| variable | expected_type | actual_type | type_mismatch | sample_value | notes |
|---|---|---|---|---|---|
| Age | integer | integer | False | 41 | |
| MonthlyIncome | integer | integer | False | 5993 | |
| DistanceFromHome | integer | integer | False | 1 | |
| Education | integer | integer | False | 2 | |
| EnvironmentSatisfaction | integer | integer | False | 2 | |
| JobInvolvement | integer | integer | False | 3 | |
| JobLevel | integer | integer | False | 2 | |
| JobSatisfaction | integer | integer | False | 4 | |
| NumCompaniesWorked | integer | integer | False | 8 | |
| PercentSalaryHike | integer | integer | False | 11 | |
| PerformanceRating | integer | integer | False | 3 | |
| RelationshipSatisfaction | integer | integer | False | 1 | |
| StockOptionLevel | integer | integer | False | 0 | |
| TotalWorkingYears | integer | integer | False | 8 | |
| TrainingTimesLastYear | integer | integer | False | 0 | |
| WorkLifeBalance | integer | integer | False | 1 | |
| YearsAtCompany | integer | integer | False | 6 | |
| YearsInCurrentRole | integer | integer | False | 4 | |
| YearsSinceLastPromotion | integer | integer | False | 0 | |
| YearsWithCurrManager | integer | integer | False | 5 | |
| Department | character | character | False | Sales | Categorical variable |
| BusinessTravel | character | character | False | Travel_Rarely | Categorical variable |
| EducationField | character | character | False | Life Sciences | Categorical variable |
| Gender | character | character | False | Female | Categorical variable |
| JobRole | character | character | False | Sales Executive | Categorical variable |
| MaritalStatus | character | character | False | Single | Categorical variable |
| OverTime | character | character | False | Yes | Categorical variable |
| variable | unique_count | total_count | cardinality_ratio | top_10_pct | is_high_cardinality | is_very_high_cardinality | ohe_features_created | recommendation |
|---|---|---|---|---|---|---|---|---|
| Department | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| BusinessTravel | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| EducationField | 6 | 500 | 0.012 | 100 | False | False | 6 | One-hot encoding safe |
| Gender | 2 | 500 | 0.004 | 100 | False | False | 2 | One-hot encoding safe |
| JobRole | 9 | 500 | 0.018 | 100 | False | False | 9 | One-hot encoding safe |
| MaritalStatus | 3 | 500 | 0.006 | 100 | False | False | 3 | One-hot encoding safe |
| OverTime | 2 | 500 | 0.004 | 100 | False | False | 2 | One-hot encoding safe |
Purpose
This section validates the structural integrity and quality of the dataset before analysis. It assesses whether columns have sufficient variance for modeling, whether categorical variables are suitable for encoding, whether duplicates exist, and whether data types are correctly assigned. These checks ensure the dataset is clean and ready for statistical testing and predictive modeling.
Key Findings
- Constant Columns: 0 identified – all variables contain meaningful variation
- Low-Variance Columns: 0 identified – no features with coefficient of variation below 5%
- High-Cardinality Categoricals: 0 identified – all categorical variables safe for one-hot encoding
- Exact Duplicates: 0 rows (0%) – dataset contains 500 unique observations
- Type Mismatches: 0 columns – all variables correctly typed (numeric vs. character)
- Data Validation: All 20 numeric variables pass range checks with no out-of-range values
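The cardinality check can be sketched from the table above. The example assumes that the configured `max_categories` value (20) is the cutoff for treating a categorical variable as safe to one-hot encode; the report does not spell out the exact rule, and `cardinality_report` is a hypothetical helper name.

```python
MAX_CATEGORIES = 20  # from the configuration table (max_categories)
N_ROWS = 500

# Unique-level counts copied from the cardinality table.
unique_counts = {
    "Department": 3,
    "BusinessTravel": 3,
    "EducationField": 6,
    "Gender": 2,
    "JobRole": 9,
    "MaritalStatus": 3,
    "OverTime": 2,
}


def cardinality_report(unique_counts, n_rows, max_categories):
    """Map each variable to (cardinality_ratio, safe_for_one_hot)."""
    return {
        name: (k / n_rows, k <= max_categories)
        for name, k in unique_counts.items()
    }


report = cardinality_report(unique_counts, N_ROWS, MAX_CATEGORIES)
total_ohe_columns = sum(unique_counts.values())

for name, (ratio, safe) in report.items():
    print(f"{name}: ratio={ratio:.3f}, one-hot safe={safe}")
```

Encoding every level of all seven variables yields 28 indicator columns in total, which matches the sum of the `ohe_features_created` column.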
Interpretation
The dataset demonstrates excellent structural quality with no data integrity issues that would compromise analysis. The absence of constant or low-variance columns means all features contribute meaningful information for statistical testing and modeling. With zero duplicates and correct data types, the dataset is free from preprocessing artifacts that could bias results. The moderate cardinality of categorical variables (2–9 unique values) supports straightforward encoding without dimensionality