Customer Segmentation using K-Means Clustering
Advanced
Build an intelligent customer segmentation system with machine learning
1) Project Overview
What it does:
This project creates a full pipeline to segment customers using K-Means clustering. It accepts customer data (age, annual income, spending score, tenure, etc.), preprocesses features, selects the optimal number of clusters using the Elbow Method and Silhouette Score, fits K-Means, analyzes cluster profiles, visualizes clusters, and exports results for downstream use.
Real-world use case:
Marketers, product teams, and analysts use customer segmentation to target promotions, personalize communications, design loyalty programs, and guide product development. For example, distinguishing "high-income, high-spender" segments from "low-income, high-potential" segments calls for different marketing strategies.
Technical goals:
- Clean/standardize tabular customer data
- Feature engineering and scaling
- Model selection for K in K-Means (Elbow + Silhouette)
- Fit and evaluate K-Means clustering
- Visualize clusters in 2D (PCA) and feature distributions
- Save segmented dataset and summary reports
2) Key Technologies & Libraries
| Technology | Purpose |
|---|---|
| Python 3.9+ | Core programming language |
| numpy | Numerical operations |
| pandas | Data handling and CSV I/O |
| scikit-learn | Preprocessing, PCA, KMeans, silhouette score |
| matplotlib | Plotting |
| seaborn | Optional — nicer plots (if installed) |
| joblib | Save fitted model and scaler for reuse |
Install required packages:
pip install numpy pandas scikit-learn matplotlib seaborn joblib
3) Learning Outcomes
After completing the project you will learn:
- 🔍 Data preprocessing and feature engineering — Preparing tabular customer data for clustering analysis
- 📊 Feature scaling — How and why to standardize/scale features before K-Means (a short illustration follows this list)
- 🎯 Cluster selection methods — How to choose the number of clusters using Elbow and Silhouette methods
- 📈 PCA visualization — Use of PCA to visualize high-dimensional clusters in 2D
- 🔬 Cluster analysis — Interpreting cluster centroids and profiling segments
- 💾 Export results — For business use and reproducible workflows
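Why scaling matters deserves a concrete look: K-Means assigns points by Euclidean distance, so a feature measured in large units (income in dollars) swamps one measured on a small scale (a 1-100 score) unless both are standardized. A tiny illustration with made-up numbers (not from the project data):

# Illustration only: without scaling, the column with the largest numeric range dominates distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up customers: [age, income in $, spending score]
customers = np.array([
    [30, 20_000, 80],
    [65, 21_000, 20],
    [45, 60_000, 50],
])

print(np.linalg.norm(customers[0] - customers[1]))  # ~1002: almost entirely driven by the $1,000 income gap

scaled = StandardScaler().fit_transform(customers)
print(np.linalg.norm(scaled[0] - scaled[1]))  # age and spending differences now count comparably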
4) Step-by-Step Explanation
- Project setup — Create a virtual environment and install dependencies. Create folders: data/, outputs/, notebooks/
- Data ingestion — Load a real customer CSV (if available) or generate realistic synthetic data for testing
- Exploratory data analysis (EDA) — Inspect distributions, missing values, correlations
- Preprocessing — Fill/handle missing values, encode categorical fields if any, standardize numeric features
- Determine K — Compute inertia (within-cluster sum of squared distances) for each candidate k and plot the Elbow curve; compute silhouette scores for the same candidates to validate cluster separation
- Fit K-Means — Fit final K-Means with chosen k, set random_state for reproducibility
- Analyze clusters — Compute centroids, cluster sizes, average feature values per cluster; label segments
- Visualization — Reduce to 2D via PCA and plot clusters and centroids. Plot feature distributions per cluster
- Export results — Save customers_segmented.csv and summary JSON/CSV
- Iterate & enhance — Try different features, other algorithms or distance measures (K-Prototypes for mixed data), or wrap the steps in an sklearn Pipeline (see the sketch after this list)
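As a preview of the pipeline idea in the last step, scaling and clustering can be wrapped into a single sklearn Pipeline so they are always applied together. A minimal sketch on a toy frame (the feature names match the script below; k=2 is chosen only because the toy frame is tiny):

# Minimal sketch: StandardScaler + KMeans wrapped in one sklearn Pipeline.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy customer frame; in the real project this is the loaded CSV.
toy = pd.DataFrame({
    "Age": [23, 45, 61, 34, 52, 29],
    "Annual_Income_k$": [18, 42, 95, 60, 88, 25],
    "Spending_Score": [85, 40, 15, 62, 22, 78],
    "Tenure_yrs": [1, 5, 11, 3, 9, 2],
})

segmenter = Pipeline(steps=[
    ("scale", StandardScaler()),                                   # scale first so no column dominates distances
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),  # k=2 only suits this toy frame
])

toy["Cluster"] = segmenter.fit_predict(toy[["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]])
print(toy)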
5) Full Working and Verified Python Code
Save as customer_segmentation_kmeans.py. This script generates realistic synthetic data (so you can run it immediately) and will load data/customers.csv instead if that file is present.
"""
customer_segmentation_kmeans.py
Customer Segmentation using K-Means Clustering.
Usage:
python customer_segmentation_kmeans.py
Outputs:
outputs/customers_segmented.csv
outputs/cluster_summary.json
outputs/plots/ (PNG plots)
The script will:
- generate synthetic customer data (if data/customers.csv not found),
- preprocess features,
- find optimal k (Elbow + Silhouette),
- fit KMeans, visualize and save results.
"""
from __future__ import annotations
import os
import json
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
try:
    import seaborn as sns  # optional: nicer plots; the script falls back to matplotlib if seaborn is missing
except ImportError:
    sns = None
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import joblib
# -------------------- Configuration --------------------
RANDOM_STATE = 42
OUTPUT_DIR = Path("outputs")
PLOTS_DIR = OUTPUT_DIR / "plots"
DATA_DIR = Path("data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR.mkdir(parents=True, exist_ok=True)
# Candidate features to use
FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]
# -------------------- Utility Functions --------------------
def generate_synthetic_customers(n_samples: int = 400, random_state: int = RANDOM_STATE) -> pd.DataFrame:
"""
Generate a realistic synthetic customer dataset with four features:
- Age (18-70), Annual Income (k$), Spending Score (1-100), Tenure (years).
We'll create multiple blobs to reflect natural clusters.
"""
centers = [
[25, 15, 80, 1.0], # young low-income, high spending
[45, 30, 40, 5.0], # mid-age mid-income, moderate spending
[55, 90, 20, 10.0], # older high-income, low spending
[30, 60, 60, 3.0], # young-mid income, decent spending
]
cluster_std = [4.5, 6.0, 5.0, 5.5]
X, y = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std, random_state=random_state)
    # X columns correspond to the centers above; round and clip to realistic ranges
    df = pd.DataFrame(X, columns=["age_c", "inc_c", "spend_c", "tenure_c"])
    # Centers are already on realistic scales, so just round and clip extremes
    df["Age"] = np.clip(np.round(df["age_c"]), 18, 75).astype(int)
    df["Annual_Income_k$"] = np.clip(np.round(df["inc_c"]), 10, 300).astype(int)  # in thousands
    df["Spending_Score"] = np.clip(np.round(df["spend_c"]), 1, 100).astype(int)   # 1-100
    df["Tenure_yrs"] = np.clip(np.round(np.abs(df["tenure_c"])), 0, 40).astype(int)
df = df[FEATURES].reset_index(drop=True)
# Add CustomerID
df.insert(0, "CustomerID", [f"CUST{1000+i}" for i in range(len(df))])
return df
def load_data(path: Path = DATA_DIR / "customers.csv") -> pd.DataFrame:
"""
Load data/customers.csv if exists; otherwise generate synthetic data.
The CSV should contain columns matching FEATURES.
"""
if path.exists():
df = pd.read_csv(path)
print(f"Loaded {len(df)} rows from {path}")
else:
print(f"No {path} found. Generating synthetic dataset.")
df = generate_synthetic_customers(n_samples=400)
df.to_csv(DATA_DIR / "customers_synthetic.csv", index=False)
print(f"Synthetic data saved to {DATA_DIR / 'customers_synthetic.csv'}")
# Validate presence of required features
missing = [c for c in FEATURES if c not in df.columns]
if missing:
raise ValueError(f"Missing required columns in dataset: {missing}")
return df
# -------------------- Preprocessing --------------------
def preprocess(df: pd.DataFrame, features: list[str]) -> tuple[pd.DataFrame, StandardScaler]:
"""
- Fill missing values (if any)
- Scale features using StandardScaler
Returns scaled DataFrame and the scaler object.
"""
data = df.copy()
# Fill numeric missing values with median
for col in features:
if data[col].isna().any():
median = data[col].median()
data[col] = data[col].fillna(median)
print(f"Filled missing values in {col} with median {median}")
scaler = StandardScaler()
X = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(X, columns=features, index=data.index)
return scaled_df, scaler
# -------------------- Choose optimal K --------------------
def determine_k(X: np.ndarray, k_range: range = range(2, 11)) -> dict:
"""
Compute inertia (elbow) and silhouette scores across k_range.
Returns dictionary with lists for 'k', 'inertia', 'silhouette'.
"""
inertias = []
silhouettes = []
ks = []
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)
labels = kmeans.fit_predict(X)
inertias.append(kmeans.inertia_)
# silhouette requires more than 1 unique label
if len(set(labels)) > 1:
sil = silhouette_score(X, labels)
else:
sil = float("nan")
silhouettes.append(sil)
ks.append(k)
return {"k": ks, "inertia": inertias, "silhouette": silhouettes}
def plot_elbow_silhouette(metrics: dict, outpath: Path):
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(metrics["k"], metrics["inertia"], marker="o")
plt.title("Elbow Method (Inertia)")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.grid(True)
plt.subplot(1,2,2)
plt.plot(metrics["k"], metrics["silhouette"], marker="o")
plt.title("Silhouette Score")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.tight_layout()
plt.savefig(outpath / "elbow_silhouette.png", dpi=150)
plt.close()
print(f"Saved elbow & silhouette plot to {outpath / 'elbow_silhouette.png'}")
# -------------------- Fit final KMeans --------------------
def fit_kmeans(X: np.ndarray, k: int) -> KMeans:
model = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=20)
model.fit(X)
return model
# -------------------- Analyze clusters --------------------
def profile_clusters(df_original: pd.DataFrame, scaled_X: pd.DataFrame, model: KMeans) -> tuple[pd.DataFrame, pd.DataFrame]:
labels = model.labels_
df = df_original.copy().reset_index(drop=True)
df["Cluster"] = labels
# Compute per-cluster stats
cluster_profile = df.groupby("Cluster")[FEATURES].agg(["mean","median","count"]).round(2)
# Flatten multiindex
cluster_profile.columns = ["_".join(col).strip() for col in cluster_profile.columns.values]
# Add cluster size
cluster_profile["size"] = df.groupby("Cluster").size()
return df, cluster_profile.reset_index()
# -------------------- 2D Visualization using PCA --------------------
def plot_pca_clusters(scaled_X: pd.DataFrame, labels: np.ndarray, model: KMeans, outpath: Path):
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca = pca.fit_transform(scaled_X)
plt.figure(figsize=(8,6))
    # Use seaborn when available, otherwise fall back to plain matplotlib
    if sns is not None:
        sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=labels, palette="deep", legend="full", s=60)
    else:
        plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=60)
# Plot centroids projected
centroids = pca.transform(model.cluster_centers_)
plt.scatter(centroids[:,0], centroids[:,1], marker="X", s=200, c="black", label="centroids")
plt.title("K-Means Clusters (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.tight_layout()
plt.savefig(outpath / "pca_clusters.png", dpi=150)
plt.close()
print(f"Saved PCA cluster plot to {outpath / 'pca_clusters.png'}")
# -------------------- Save results --------------------
def save_outputs(df_with_clusters: pd.DataFrame, cluster_profile: pd.DataFrame, model: KMeans, scaler: StandardScaler, outdir: Path):
csv_path = outdir / "customers_segmented.csv"
df_with_clusters.to_csv(csv_path, index=False)
json_path = outdir / "cluster_summary.json"
cluster_profile.to_json(json_path, orient="records", indent=2)
# save model and scaler for later reuse
joblib.dump(model, outdir / "kmeans_model.joblib")
joblib.dump(scaler, outdir / "scaler.joblib")
print(f"Saved segmented customers to {csv_path}")
print(f"Saved cluster summary to {json_path}")
print(f"Saved model and scaler to {outdir}")
# -------------------- Main pipeline --------------------
def main():
# 1) Load or generate data
df = load_data(path=DATA_DIR / "customers.csv")
print(df.head())
# 2) Preprocess & scale
scaled_df, scaler = preprocess(df, FEATURES)
X = scaled_df.values
# 3) Determine k
metrics = determine_k(X, k_range=range(2,11))
plot_elbow_silhouette(metrics, PLOTS_DIR)
    # Recommend the k with the highest silhouette score; fall back to the first candidate if all scores are NaN
    sil = np.array(metrics["silhouette"], dtype=float)
    best_idx = int(np.nanargmax(sil)) if not np.all(np.isnan(sil)) else 0
recommended_k = metrics["k"][best_idx]
print(f"Recommended k (by silhouette): {recommended_k}")
# 4) Fit final KMeans
final_k = recommended_k
model = fit_kmeans(X, final_k)
print(f"Fitted KMeans with k={final_k}")
# 5) Analyze and profile clusters
df_clusters, cluster_profile = profile_clusters(df, scaled_df, model)
print("\nCluster summary (first rows):")
print(cluster_profile.head())
# 6) Visualization
plot_pca_clusters(scaled_df, model.labels_, model, PLOTS_DIR)
# 7) Save outputs
save_outputs(df_clusters, cluster_profile, model, scaler, OUTPUT_DIR)
# 8) Print a small human-friendly summary
print("\nTop-level cluster sizes:")
print(df_clusters["Cluster"].value_counts().sort_index())
# show cluster centroids in original feature space: inverse transform
centers_original = scaler.inverse_transform(model.cluster_centers_)
centers_df = pd.DataFrame(centers_original, columns=FEATURES)
centers_df["Cluster"] = range(len(centers_df))
print("\nCluster centroids (approx original scale):")
print(centers_df.round(2))
if __name__ == "__main__":
    main()

Notes on the script:
• It first looks for data/customers.csv; if not found, it generates customers_synthetic.csv in data/
• Scaling is done with StandardScaler — required for K-Means
• determine_k computes both inertia and silhouette across k in 2..10 and saves a plot outputs/plots/elbow_silhouette.png
• PCA is used only for visualization (clusters are fit in scaled feature space)
• Model and scaler are saved with joblib for later reuse (see the reuse sketch after these notes)
• random_state is set for reproducibility
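Because the fitted model and scaler are persisted, a later job can assign segments to new customers without refitting. A minimal sketch, assuming the script above has already been run so the joblib files exist in outputs/ (the example customers are made up):

# Minimal sketch: reuse the saved scaler + KMeans model on new customers.
import joblib
import pandas as pd

model = joblib.load("outputs/kmeans_model.joblib")
scaler = joblib.load("outputs/scaler.joblib")

new_customers = pd.DataFrame({
    "Age": [27, 58],
    "Annual_Income_k$": [22, 95],
    "Spending_Score": [88, 18],
    "Tenure_yrs": [1, 12],
})

# Apply the same scaling learned at training time, then predict cluster labels.
X_new = scaler.transform(new_customers[["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]])
new_customers["Cluster"] = model.predict(X_new)
print(new_customers)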
6) Sample Output or Results
When you run the script, you'll see console messages like:
Synthetic data saved to data/customers_synthetic.csv
CustomerID Age Annual_Income_k$ Spending_Score Tenure_yrs
0 CUST1000 33 12 47 0
1 CUST1001 22 24 92 1
2 CUST1002 32 16 80 2
3 CUST1003 50 55 22 6
4 CUST1004 30 46 62 4
Saved elbow & silhouette plot to outputs/plots/elbow_silhouette.png
Recommended k (by silhouette): 4
Fitted KMeans with k=4
Cluster summary (first rows):
Cluster Age_mean Age_median Annual_Income_k$_mean ... Spending_Score_median Tenure_yrs_mean Tenure_yrs_median size
0 0 32.12 32.0 12.45 ... 78.0 1.00 1.00 95
1 1 55.10 54.0 89.50 ... 20.0 10.00 10.00 80
...
Saved PCA cluster plot to outputs/plots/pca_clusters.png
Saved segmented customers to outputs/customers_segmented.csv
Saved cluster summary to outputs/cluster_summary.json
Saved model and scaler to outputs
Top-level cluster sizes:
0 95
1 80
2 120
3 105
Name: Cluster, dtype: int64
Cluster centroids (approx original scale):
Age Annual_Income_k$ Spending_Score Tenure_yrs Cluster
0 32.12 12.45 78.0 1.00 0
1 55.10 89.50 20.0 10.00 1
...
Generated plots:
- outputs/plots/elbow_silhouette.png — helps choose k
- outputs/plots/pca_clusters.png — 2D visualization with centroids
Output files:
- outputs/customers_segmented.csv — each row with Cluster assigned
- outputs/cluster_summary.json — cluster statistics
- outputs/kmeans_model.joblib, outputs/scaler.joblib — saved objects
7) Possible Enhancements
🎯 To make this project more advanced
- Use different clustering algorithms — DBSCAN for density-based segments, GaussianMixture for soft clustering (see the sketch after this list)
- Mixed data support — If you have categorical variables (gender, region), use K-Prototypes or encode categoricals carefully (Target/WOE encoding)
- Feature engineering — create RFM (Recency, Frequency, Monetary) features from transactional logs for better behavioral clustering
- Automated K selection — use gap statistic or combine metrics into a single score
- Dashboard — build an interactive Streamlit / Dash dashboard for exploring clusters and filtering customers
- Model monitoring — when new data arrives, automatically re-evaluate cluster stability and retrain if drift is detected
- Customer lift tests — A/B test actions targeted to clusters and measure ROI
- Explainability — provide human-readable reasons per customer why they were put into a cluster (nearest centroid distances, top feature contributions)
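For the first enhancement, GaussianMixture is a drop-in way to get soft assignments: each customer receives a membership probability per segment rather than a single hard label. A minimal sketch, where X stands in for the scaled feature matrix produced by preprocess() in the script (the make_blobs data and the 60% threshold are illustrative assumptions):

# Minimal sketch: soft clustering with a Gaussian mixture instead of K-Means.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

# Placeholder for the scaled customer feature matrix from the script.
X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X_raw)

gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)   # hard labels, comparable to KMeans output
probs = gmm.predict_proba(X)  # soft labels: one probability per segment

# Customers without a dominant segment sit between clusters and may warrant separate handling.
uncertain = probs.max(axis=1) < 0.6
print(f"{uncertain.sum()} of {len(X)} customers have no segment with >= 60% membership")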