K-means clustering with automatic elbow (K=1-10) and silhouette (K=2-10) analysis, 25 random starts per K, PCA visualization, and comprehensive cluster profiling.
Tests K=1 to 10 with within-sum-of-squares elbow analysis and silhouette scores (K=2-10). Identifies optimal K using both methods. 25 random starts (nstart) per K value for stability.
Calculates between-SS ratio for each feature to identify which variables drive cluster separation. PCA visualization shows first 2 principal components with variance explained.
Per-cluster statistics: mean, SD, min, max for all features. Cluster sizes, percentages, within-SS. Centroids reported in the original scale (back-transformed if standardization was used).
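As a rough illustration of the per-cluster profile described above, the sketch below computes size, percentage, and mean/SD/min/max per feature from a list of rows and their cluster labels. The function name profile_clusters and the output layout are hypothetical, not the tool's actual interface, and the sample SD (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev

def profile_clusters(data, labels, feature_names):
    """Summarize each cluster: size, share of rows, and per-feature stats."""
    profiles = {}
    for c in sorted(set(labels)):
        members = [row for row, lab in zip(data, labels) if lab == c]
        stats = {}
        for j, name in enumerate(feature_names):
            col = [row[j] for row in members]
            stats[name] = {
                "mean": mean(col),
                # Sample SD is undefined for a single row; report 0.0 instead.
                "sd": stdev(col) if len(col) > 1 else 0.0,
                "min": min(col),
                "max": max(col),
            }
        profiles[c] = {
            "size": len(members),
            "percent": 100.0 * len(members) / len(data),
            "features": stats,
        }
    return profiles

# Tiny made-up example: two rows in cluster 0, one row in cluster 1.
profiles = profile_clusters(
    [[1.0, 10.0], [3.0, 30.0], [10.0, 100.0]], [0, 0, 1], ["a", "b"]
)
```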
Provide a features array (column names) and n_clusters (default 3, range 2-20). Leave scale_data=true (the default) to standardize features.
The algorithm converts non-numeric features to numeric, removes rows with missing values, runs K=1-10 for elbow analysis, calculates silhouette scores for K=2-10, uses 25 random starts per K, and generates a PCA visualization.
From preprocessing to stable clusters
Convert non-numeric features to numeric, remove rows with missing values, and optionally standardize (scale) features, saving each feature's mean and SD for interpretation.
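A minimal sketch of this preprocessing step, assuming rows arrive as lists of numbers with None or NaN marking missing values. The function name preprocess is hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption. A constant column (SD of zero) would need separate handling.

```python
import math
from statistics import mean, stdev

def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))

def preprocess(rows, scale_data=True):
    """Drop incomplete rows, then optionally standardize each column.

    Returns (data, means, sds). The means and SDs are kept so that
    centroids can later be mapped back to the original scale.
    """
    # Complete-case analysis: keep only rows with no missing entries.
    complete = [[float(v) for v in row] for row in rows
                if not any(is_missing(v) for v in row)]
    n_features = len(complete[0])
    columns = [[row[j] for row in complete] for j in range(n_features)]
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample SD (assumption)
    if not scale_data:
        return complete, means, sds
    scaled = [[(row[j] - means[j]) / sds[j] for j in range(n_features)]
              for row in complete]
    return scaled, means, sds

# The second row has a missing value and is dropped.
data, means, sds = preprocess(
    [[1.0, 10.0], [2.0, None], [3.0, 30.0], [5.0, 50.0]]
)
```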
Test K=1-10 with 25 random starts each, calculate within-SS for elbow method, compute silhouette scores for K=2-10, identify optimal K from both methods.
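The K-selection step can be sketched as below. This is an illustration under stated assumptions, not the tool's implementation: it uses plain Lloyd's algorithm (the source does not say which K-means variant is used), all function names are hypothetical, and the demo data are synthetic. The source does not say how the elbow is located, so the sketch only computes the within-SS curve and picks K from the silhouette scores.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(data, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen data points."""
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in data]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))
    return wss, labels

def kmeans(data, k, nstart=25, seed=0):
    """Best of nstart random starts, judged by total within-cluster SS."""
    rng = random.Random(seed)
    return min((kmeans_once(data, k, rng) for _ in range(nstart)),
               key=lambda run: run[0])

def mean_silhouette(data, labels):
    """Average silhouette width using Euclidean distance."""
    k = max(labels) + 1
    scores = []
    for i, p in enumerate(data):
        dists = [[] for _ in range(k)]  # distances from p, grouped by cluster
        for j, q in enumerate(data):
            if i != j:
                dists[labels[j]].append(math.sqrt(sq_dist(p, q)))
        own = dists[labels[i]]
        if not own:  # singleton cluster: silhouette is 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d)
                for c, d in enumerate(dists) if c != labels[i] and d)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Synthetic demo: three well-separated blobs of 15 points each.
demo_rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
data = [[cx + demo_rng.gauss(0, 1), cy + demo_rng.gauss(0, 1)]
        for cx, cy in centers for _ in range(15)]

runs = {k: kmeans(data, k) for k in range(1, 11)}
wss_by_k = {k: wss for k, (wss, _) in runs.items()}            # elbow curve
sil_by_k = {k: mean_silhouette(data, labels)
            for k, (_, labels) in runs.items() if k >= 2}      # K=2-10 only
best_k = max(sil_by_k, key=sil_by_k.get)
```

Silhouette is undefined for a single cluster, which is why the elbow curve starts at K=1 while the silhouette scores start at K=2.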
Calculate feature importance via between-SS ratio, generate cluster statistics (mean/SD/min/max), create PCA projection, compute R² and convergence metrics.
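One plausible reading of the between-SS ratio and R² is sketched below: for each feature, the between-cluster sum of squares divided by that feature's total sum of squares, and for R², the same ratio summed over all features. Both definitions and the function name ss_decomposition are assumptions for illustration. The source does not spell out the formulas.

```python
def ss_decomposition(data, labels):
    """Return (per-feature between-SS / total-SS, overall R²)."""
    n_features = len(data[0])
    grand = [sum(row[j] for row in data) / len(data) for j in range(n_features)]
    total_ss = [sum((row[j] - grand[j]) ** 2 for row in data)
                for j in range(n_features)]
    between_ss = [0.0] * n_features
    for c in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == c]
        for j in range(n_features):
            centroid_j = sum(row[j] for row in members) / len(members)
            # Each cluster contributes size * (centroid - grand mean)^2.
            between_ss[j] += len(members) * (centroid_j - grand[j]) ** 2
    ratios = [b / t if t > 0 else 0.0 for b, t in zip(between_ss, total_ss)]
    r_squared = sum(between_ss) / sum(total_ss)
    return ratios, r_squared

# Made-up example: the clusters differ on the first feature only.
ratios, r_squared = ss_decomposition(
    [[1.0, 5.0], [2.0, 6.0], [9.0, 5.0], [10.0, 6.0]], [0, 0, 1, 1]
)
```

In this example the first feature has a ratio of 64/65 and the second a ratio of 0, so the first feature alone drives the separation. The overall R² is 64/66.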
K-means with comprehensive validation: tests 10 values of K (K=1-10), identifies the optimal K by both the elbow and silhouette methods, and uses 25 random starts per K for stability.
Provides feature importance ranking via between-SS ratio, detailed cluster profiles with statistics, PCA visualization with variance explained, and R² metric showing proportion of variance explained by clusters.
Note: Uses Euclidean distance after optional standardization. Maximum iterations=100, convergence tracked. Centroids returned in original scale even if data scaled. Complete case analysis (removes rows with any NA).
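Returning centroids in the original scale amounts to reversing the standardization, as in the sketch below. It assumes each feature was scaled as (x - mean) / SD with the mean and SD saved during preprocessing. The function name unscale_centroids is hypothetical.

```python
def unscale_centroids(centroids, means, sds):
    """Map centroids from standardized units back to original units."""
    return [[value * sd + m for value, m, sd in zip(centroid, means, sds)]
            for centroid in centroids]

# Two centroids in standardized space, with the saved means and SDs.
original = unscale_centroids(
    [[-1.0, -1.0], [1.0, 1.0]], means=[3.0, 30.0], sds=[2.0, 20.0]
)
```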
Discover clear, stable clusters with profiles