CLUSTERING

K‑Means Clustering

K-means clustering with automatic elbow analysis (K=1-10) and silhouette analysis (K=2-10), 25 random starts per K, a 2D PCA visualization, and detailed per-cluster profiling.

What Makes This Reliable

Optimal K Selection

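Runs K=1 to 10 and records the within-cluster sum of squares (WSS) for elbow analysis, then computes silhouette scores for K=2 to 10 (the silhouette is undefined for a single cluster). Reports the optimal K suggested by each method. Every K value uses 25 random starts (nstart) for stability.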
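As a rough illustration of the WSS curve behind the elbow method, here is a minimal pure-Python sketch. It assumes that the lowest-WSS run among the random starts is the one kept for each K; the names kmeans_once and wss_curve and the toy data are hypothetical, not part of the tool.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One Lloyd's-algorithm run from a random start; returns (wss, labels)."""
    centroids = rng.sample(points, k)  # k rows picked as initial centroids
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[j])  # empty cluster: keep old centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return wss, labels

def wss_curve(points, k_values=range(1, 11), nstart=25, seed=0):
    """Lowest WSS over nstart random starts, for each K (assumed selection rule)."""
    rng = random.Random(seed)
    return {k: min(kmeans_once(points, k, rng)[0] for _ in range(nstart))
            for k in k_values}

# Toy data: three tight blobs, so the curve should flatten after K=3.
data_rng = random.Random(1)
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]
curve = wss_curve(points, k_values=range(1, 6))
for k, wss in curve.items():
    print(k, round(wss, 2))
```

The sketch only produces the curve. How the elbow itself is located on that curve is not specified here, so that step is left out.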

Feature Importance

Calculates the between-cluster sum-of-squares (between-SS) ratio for each feature to identify which variables drive cluster separation. The PCA visualization plots the first two principal components and reports the variance each explains.
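One plausible reading of the per-feature between-SS ratio is between-cluster SS divided by total SS, computed one feature at a time. The sketch below assumes that definition; the function name between_ss_ratio is hypothetical.

```python
def between_ss_ratio(rows, labels):
    """Per-feature between-cluster SS / total SS (assumed definition).

    rows: list of equal-length numeric tuples; labels: one cluster id per row.
    A ratio near 1 means the feature separates the clusters almost entirely.
    """
    ratios = []
    for f in range(len(rows[0])):
        col = [r[f] for r in rows]
        grand_mean = sum(col) / len(col)
        total_ss = sum((v - grand_mean) ** 2 for v in col)
        groups = {}
        for v, lab in zip(col, labels):
            groups.setdefault(lab, []).append(v)
        between_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                         for g in groups.values())
        # A constant feature has no variance to explain.
        ratios.append(between_ss / total_ss if total_ss > 0 else 0.0)
    return ratios

# The first feature splits the two clusters; the second does not.
rows = [(0, 5), (1, 6), (10, 5), (11, 6)]
labels = [0, 0, 1, 1]
print(between_ss_ratio(rows, labels))
```

Sorting features by this ratio in descending order gives an importance ranking under the stated assumption.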

Comprehensive Profiling

Per-cluster statistics: mean, SD, min, and max for every feature, plus cluster sizes, percentages, and within-SS. Centroids are reported in the original scale (back-transformed if standardization was used).
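A cluster profile of this kind can be sketched with the standard library alone. The output layout and the name profile_clusters are illustrative assumptions, and the SD is taken to be the sample standard deviation.

```python
import statistics

def profile_clusters(rows, labels, feature_names):
    """Size, percentage, and per-feature mean/SD/min/max for each cluster."""
    profiles = {}
    for lab in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == lab]
        features = {}
        for f, name in enumerate(feature_names):
            col = [r[f] for r in members]
            features[name] = {
                "mean": statistics.fmean(col),
                # Sample SD needs at least two rows; report 0.0 for singletons.
                "sd": statistics.stdev(col) if len(col) > 1 else 0.0,
                "min": min(col),
                "max": max(col),
            }
        profiles[lab] = {
            "size": len(members),
            "percent": 100.0 * len(members) / len(rows),
            "features": features,
        }
    return profiles

rows = [(1.0, 10.0), (3.0, 14.0), (10.0, 0.0)]
labels = [0, 0, 1]
for lab, prof in profile_clusters(rows, labels, ["x", "y"]).items():
    print(lab, prof["size"], round(prof["percent"], 1), prof["features"]["x"])
```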

What You Need to Provide

Numeric features required

Provide a features array (column names) and n_clusters (default 3, range 2-20). Leave scale_data=true (the default) to standardize features.

The algorithm converts non-numeric features to numeric, removes rows with missing values, runs K=1-10 for elbow analysis, calculates silhouette scores for K=2-10, uses 25 random starts per K, and generates the PCA visualization.
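The preparation steps can be sketched as below. Several details are assumptions rather than documented behavior: values that cannot be parsed as numbers are treated as missing, standardization uses the sample SD, and the names prepare and unscale_centroid are hypothetical.

```python
import math
import statistics

def to_float(value):
    """Coerce a value to float; return None when it is missing or unparseable."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(x) else x

def prepare(records, features, scale_data=True):
    """Complete-case filtering plus optional z-score standardization.

    Returns (rows, means, sds); means and sds are computed on the unscaled
    complete cases so centroids can be mapped back to original units later.
    """
    rows = []
    for rec in records:
        vals = [to_float(rec.get(f)) for f in features]
        if all(v is not None for v in vals):  # drop rows with any missing value
            rows.append(vals)
    means = [statistics.fmean(col) for col in zip(*rows)]
    sds = [statistics.stdev(col) for col in zip(*rows)]
    if scale_data:
        rows = [[(v - m) / s if s > 0 else 0.0
                 for v, m, s in zip(row, means, sds)] for row in rows]
    return rows, means, sds

def unscale_centroid(centroid, means, sds):
    """Map a centroid from standardized units back to the original scale."""
    return [c * s + m for c, m, s in zip(centroid, means, sds)]

records = [
    {"age": "30", "income": 50000},
    {"age": "40", "income": 70000},
    {"age": None, "income": 65000},   # dropped: missing age
    {"age": "n/a", "income": 80000},  # dropped: age cannot be parsed
    {"age": 50, "income": 90000},
]
rows, means, sds = prepare(records, ["age", "income"])
print(len(rows), means, sds)
print(unscale_centroid([-1.0, 1.0], means, sds))
```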

Tabular Schema / numeric features

Quick Specs

K Range: Tests 1-10, default n_clusters=3
Initialization: 25 random starts per K
Metrics: WSS, silhouette, R²
Visualization: PCA 2D projection

How We Segment

From preprocessing to stable clusters

1. Data Preparation

Convert non-numeric features to numeric, remove rows with missing values, and optionally standardize (scale) the features, saving each feature's mean and SD so results can be reported in the original units.

2. K Optimization

Test K=1-10 with 25 random starts each, calculate within-SS for the elbow method, compute silhouette scores for K=2-10, and identify the optimal K from both methods.

3. Analysis & Profiling

Calculate feature importance via the between-SS ratio, generate cluster statistics (mean/SD/min/max), create the PCA projection, and compute R² and convergence metrics.
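The silhouette score used in step 2 can be sketched as follows. This is the standard definition with Euclidean distance, written in plain Python for clarity rather than speed; the name mean_silhouette is hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Average silhouette width over all points; requires >= 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette is undefined for a single cluster")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclid(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(euclid(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # sensible split: close to 1
print(mean_silhouette(points, [0, 1, 0, 1]))  # poor split: negative
```

Scores lie between -1 and 1. Under the silhouette method, the K with the highest average score is the natural candidate for the optimal K.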

Why This Analysis Matters

This is K-means with built-in validation: it tests 10 values of K, reports the optimal K from both the elbow and silhouette methods, and uses 25 random starts per K for stability.

It also provides a feature importance ranking via the between-SS ratio, detailed cluster profiles with statistics, a PCA visualization with variance explained, and an R² metric giving the proportion of total variance explained by the clusters.
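For reference, the R² described here corresponds to the usual decomposition total SS = within SS + between SS, so R² = 1 - within SS / total SS. The sketch below assumes that definition; the function name cluster_r_squared is hypothetical.

```python
def cluster_r_squared(rows, labels):
    """Proportion of total variance explained by the clusters.

    Computed as 1 - within_ss / total_ss, which equals between_ss / total_ss.
    """
    n_features = len(rows[0])
    grand = [sum(r[f] for r in rows) / len(rows) for f in range(n_features)]
    total_ss = sum((r[f] - grand[f]) ** 2
                   for r in rows for f in range(n_features))
    groups = {}
    for r, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(r)
    within_ss = 0.0
    for members in groups.values():
        center = [sum(m[f] for m in members) / len(members)
                  for f in range(n_features)]
        within_ss += sum((m[f] - center[f]) ** 2
                         for m in members for f in range(n_features))
    return 1.0 - within_ss / total_ss if total_ss > 0 else 0.0

rows = [(0, 5), (1, 6), (10, 5), (11, 6)]
print(cluster_r_squared(rows, [0, 0, 1, 1]))  # two tight clusters: near 1
print(cluster_r_squared(rows, [0, 0, 0, 0]))  # a single cluster explains nothing
```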

Note: The analysis uses Euclidean distance, applied after optional standardization. Each run is capped at 100 iterations, and convergence is tracked. Centroids are returned in the original scale even if the data were scaled. Missing data are handled by complete-case analysis: any row containing an NA is removed.

Ready to Segment?

Discover clear, stable clusters with profiles
