
Customer Segmentation using K-Means Clustering

Advanced

Build an intelligent customer segmentation system with machine learning

1) Project Overview

What it does:

This project creates a full pipeline to segment customers using K-Means clustering. It accepts customer data (age, annual income, spending score, tenure, etc.), preprocesses features, selects the optimal number of clusters using the Elbow Method and Silhouette Score, fits K-Means, analyzes cluster profiles, visualizes clusters, and exports results for downstream use.

Real-world use case:

Marketers, product teams, and analysts use customer segmentation to target promotions, personalize communications, design loyalty programs, or guide product development. For example, distinguishing "high-income, high-spender" segments from "low-income, high-potential" segments informs different marketing strategies.

Technical goals:

  • Clean/standardize tabular customer data
  • Feature engineering and scaling
  • Model selection for K in K-Means (Elbow + Silhouette)
  • Fit and evaluate K-Means clustering
  • Visualize clusters in 2D (PCA) and feature distributions
  • Save segmented dataset and summary reports

2) Key Technologies & Libraries

Technology     Purpose
Python 3.9+    Core programming language
numpy          Numerical operations
pandas         Data handling and CSV I/O
scikit-learn   Preprocessing, PCA, KMeans, silhouette score
matplotlib     Plotting
seaborn        Optional — nicer plots (if installed)
joblib         Saving the fitted model and scaler for reuse

Install required packages:

pip install numpy pandas scikit-learn matplotlib seaborn joblib

3) Learning Outcomes

After completing the project you will learn:

  • 🔍 Data preprocessing and feature engineering — For clustering analysis
  • 📊 Feature scaling — How and why to standardize/scale features before K-Means (a short illustration follows this list)
  • 🎯 Cluster selection methods — How to choose the number of clusters using Elbow and Silhouette methods
  • 📈 PCA visualization — Use of PCA to visualize high-dimensional clusters in 2D
  • 🔬 Cluster analysis — Interpreting cluster centroids and profiling segments
  • 💾 Export results — For business use and reproducible workflows
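A quick illustration of the scaling point above: with raw values, a feature on a large numeric scale (annual income in dollars) swamps the Euclidean distances K-Means minimizes, so standardization is essential before clustering. The numbers below are made up purely for this illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: nearly identical income, very different spending score
a = np.array([52_000, 81])   # [annual income in $, spending score 1-100]
b = np.array([50_000, 12])

# Raw distance is dominated by the income gap; the spending gap barely registers
print(np.linalg.norm(a - b))                      # ~2001.2

# After standardization, both features contribute on a comparable scale
X = np.array([[52_000, 81], [50_000, 12], [30_000, 50], [90_000, 40]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # the spending difference now dominates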

4) Step-by-Step Explanation

  1. Project setup — Create a virtual environment and install dependencies. Create folders: data/, outputs/, notebooks/
  2. Data ingestion — Use real customer CSV (if available) or generate realistic synthetic data for testing
  3. Exploratory data analysis (EDA) — Inspect distributions, missing values, correlations
  4. Preprocessing — Fill/handle missing values, encode categorical fields if any, standardize numeric features
  5. Determine K — Compute inertia (sum of squared distances to the nearest centroid) for each candidate k and plot the Elbow curve; compute silhouette scores for the same candidates to validate cluster separation
  6. Fit K-Means — Fit final K-Means with chosen k, set random_state for reproducibility
  7. Analyze clusters — Compute centroids, cluster sizes, average feature values per cluster; label segments
  8. Visualization — Reduce to 2D via PCA and plot clusters and centroids. Plot feature distributions per cluster
  9. Export results — Save customers_segmented.csv and summary JSON/CSV
  10. Iterate & enhance — Try different features, alternative algorithms for mixed data (e.g. K-Prototypes), or wrap scaling and clustering in a scikit-learn Pipeline (see the sketch after this list)
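For step 10, one convenient way to pipeline the workflow is to chain the scaler and the clusterer in a scikit-learn Pipeline, so new customers are scaled and assigned in a single call. This is a minimal sketch rather than part of the main script; build_segmentation_pipeline is a hypothetical helper, and the columns mirror the FEATURES list used in the full code below.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]

def build_segmentation_pipeline(k: int = 4) -> Pipeline:
    # Scaling and clustering travel together, so exactly the same transform
    # is applied at fit time and when new customers are assigned later.
    return Pipeline([
        ("scaler", StandardScaler()),
        ("kmeans", KMeans(n_clusters=k, n_init=10, random_state=42)),
    ])

# Usage, assuming df holds the FEATURES columns:
# pipe = build_segmentation_pipeline(k=4)
# df["Cluster"] = pipe.fit_predict(df[FEATURES])
# pipe.predict(new_customers[FEATURES])  # later: assign new rows to existing segments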

5) Full Working and Verified Python Code

Save as customer_segmentation_kmeans.py. The script generates realistic synthetic data so you can run it immediately, and it will load data/customers.csv instead if that file is present.

""" customer_segmentation_kmeans.py Customer Segmentation using K-Means Clustering. Usage: python customer_segmentation_kmeans.py Outputs: outputs/customers_segmented.csv outputs/cluster_summary.json outputs/plots/ (PNG plots) The script will: - generate synthetic customer data (if data/customers.csv not found), - preprocess features, - find optimal k (Elbow + Silhouette), - fit KMeans, visualize and save results. """ from __future__ import annotations import os import json from pathlib import Path import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # optional but makes plots nicer; code works if installed from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score from sklearn.datasets import make_blobs import joblib # -------------------- Configuration -------------------- RANDOM_STATE = 42 OUTPUT_DIR = Path("outputs") PLOTS_DIR = OUTPUT_DIR / "plots" DATA_DIR = Path("data") OUTPUT_DIR.mkdir(parents=True, exist_ok=True) PLOTS_DIR.mkdir(parents=True, exist_ok=True) DATA_DIR.mkdir(parents=True, exist_ok=True) # Candidate features to use FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"] # -------------------- Utility Functions -------------------- def generate_synthetic_customers(n_samples: int = 400, random_state: int = RANDOM_STATE) -> pd.DataFrame: """ Generate a realistic synthetic customer dataset with four features: - Age (18-70), Annual Income (k$), Spending Score (1-100), Tenure (years). We'll create multiple blobs to reflect natural clusters. """ centers = [ [25, 15, 80, 1.0], # young low-income, high spending [45, 30, 40, 5.0], # mid-age mid-income, moderate spending [55, 90, 20, 10.0], # older high-income, low spending [30, 60, 60, 3.0], # young-mid income, decent spending ] cluster_std = [4.5, 6.0, 5.0, 5.5] X, y = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std, random_state=random_state) # X columns roughly match features above but may need clipping/transform to realistic ranges df = pd.DataFrame(X, columns=["age_c", "inc_c", "spend_c", "tenure_c"]) # Map to realistic scales and clip extremes df["Age"] = np.clip(np.round( df["age_c"] + 30 ), 18, 75).astype(int) df["Annual_Income_k$"] = np.clip(np.round( df["inc_c"] + 40 ), 10, 300).astype(int) # in thousands df["Spending_Score"] = np.clip(np.round( df["spend_c"] + 50 ), 1, 100).astype(int) # 1-100 df["Tenure_yrs"] = np.clip(np.round( np.abs(df["tenure_c"]) ), 0, 40).astype(int) df = df[FEATURES].reset_index(drop=True) # Add CustomerID df.insert(0, "CustomerID", [f"CUST{1000+i}" for i in range(len(df))]) return df def load_data(path: Path = DATA_DIR / "customers.csv") -> pd.DataFrame: """ Load data/customers.csv if exists; otherwise generate synthetic data. The CSV should contain columns matching FEATURES. """ if path.exists(): df = pd.read_csv(path) print(f"Loaded {len(df)} rows from {path}") else: print(f"No {path} found. 
Generating synthetic dataset.") df = generate_synthetic_customers(n_samples=400) df.to_csv(DATA_DIR / "customers_synthetic.csv", index=False) print(f"Synthetic data saved to {DATA_DIR / 'customers_synthetic.csv'}") # Validate presence of required features missing = [c for c in FEATURES if c not in df.columns] if missing: raise ValueError(f"Missing required columns in dataset: {missing}") return df # -------------------- Preprocessing -------------------- def preprocess(df: pd.DataFrame, features: list[str]) -> tuple[pd.DataFrame, StandardScaler]: """ - Fill missing values (if any) - Scale features using StandardScaler Returns scaled DataFrame and the scaler object. """ data = df.copy() # Fill numeric missing values with median for col in features: if data[col].isna().any(): median = data[col].median() data[col] = data[col].fillna(median) print(f"Filled missing values in {col} with median {median}") scaler = StandardScaler() X = scaler.fit_transform(data[features]) scaled_df = pd.DataFrame(X, columns=features, index=data.index) return scaled_df, scaler # -------------------- Choose optimal K -------------------- def determine_k(X: np.ndarray, k_range: range = range(2, 11)) -> dict: """ Compute inertia (elbow) and silhouette scores across k_range. Returns dictionary with lists for 'k', 'inertia', 'silhouette'. """ inertias = [] silhouettes = [] ks = [] for k in k_range: kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10) labels = kmeans.fit_predict(X) inertias.append(kmeans.inertia_) # silhouette requires more than 1 unique label if len(set(labels)) > 1: sil = silhouette_score(X, labels) else: sil = float("nan") silhouettes.append(sil) ks.append(k) return {"k": ks, "inertia": inertias, "silhouette": silhouettes} def plot_elbow_silhouette(metrics: dict, outpath: Path): plt.figure(figsize=(12,5)) plt.subplot(1,2,1) plt.plot(metrics["k"], metrics["inertia"], marker="o") plt.title("Elbow Method (Inertia)") plt.xlabel("k") plt.ylabel("Inertia") plt.grid(True) plt.subplot(1,2,2) plt.plot(metrics["k"], metrics["silhouette"], marker="o") plt.title("Silhouette Score") plt.xlabel("k") plt.ylabel("Silhouette Score") plt.grid(True) plt.tight_layout() plt.savefig(outpath / "elbow_silhouette.png", dpi=150) plt.close() print(f"Saved elbow & silhouette plot to {outpath / 'elbow_silhouette.png'}") # -------------------- Fit final KMeans -------------------- def fit_kmeans(X: np.ndarray, k: int) -> KMeans: model = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=20) model.fit(X) return model # -------------------- Analyze clusters -------------------- def profile_clusters(df_original: pd.DataFrame, scaled_X: pd.DataFrame, model: KMeans) -> pd.DataFrame: labels = model.labels_ df = df_original.copy().reset_index(drop=True) df["Cluster"] = labels # Compute per-cluster stats cluster_profile = df.groupby("Cluster")[FEATURES].agg(["mean","median","count"]).round(2) # Flatten multiindex cluster_profile.columns = ["_".join(col).strip() for col in cluster_profile.columns.values] # Add cluster size cluster_profile["size"] = df.groupby("Cluster").size() return df, cluster_profile.reset_index() # -------------------- 2D Visualization using PCA -------------------- def plot_pca_clusters(scaled_X: pd.DataFrame, labels: np.ndarray, model: KMeans, outpath: Path): pca = PCA(n_components=2, random_state=RANDOM_STATE) X_pca = pca.fit_transform(scaled_X) plt.figure(figsize=(8,6)) # Use seaborn scatter if available try: sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=labels, palette="deep", legend="full", 
s=60) except Exception: plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, s=60) # Plot centroids projected centroids = pca.transform(model.cluster_centers_) plt.scatter(centroids[:,0], centroids[:,1], marker="X", s=200, c="black", label="centroids") plt.title("K-Means Clusters (PCA projection)") plt.xlabel("PC1") plt.ylabel("PC2") plt.legend() plt.tight_layout() plt.savefig(outpath / "pca_clusters.png", dpi=150) plt.close() print(f"Saved PCA cluster plot to {outpath / 'pca_clusters.png'}") # -------------------- Save results -------------------- def save_outputs(df_with_clusters: pd.DataFrame, cluster_profile: pd.DataFrame, model: KMeans, scaler: StandardScaler, outdir: Path): csv_path = outdir / "customers_segmented.csv" df_with_clusters.to_csv(csv_path, index=False) json_path = outdir / "cluster_summary.json" cluster_profile.to_json(json_path, orient="records", indent=2) # save model and scaler for later reuse joblib.dump(model, outdir / "kmeans_model.joblib") joblib.dump(scaler, outdir / "scaler.joblib") print(f"Saved segmented customers to {csv_path}") print(f"Saved cluster summary to {json_path}") print(f"Saved model and scaler to {outdir}") # -------------------- Main pipeline -------------------- def main(): # 1) Load or generate data df = load_data(path=DATA_DIR / "customers.csv") print(df.head()) # 2) Preprocess & scale scaled_df, scaler = preprocess(df, FEATURES) X = scaled_df.values # 3) Determine k metrics = determine_k(X, k_range=range(2,11)) plot_elbow_silhouette(metrics, PLOTS_DIR) # Recommendation for K: choose k with elbow or highest silhouette best_idx = int(np.nanargmax(metrics["silhouette"])) if not all(np.isnan(metrics["silhouette"])) else 2 recommended_k = metrics["k"][best_idx] print(f"Recommended k (by silhouette): {recommended_k}") # 4) Fit final KMeans final_k = recommended_k model = fit_kmeans(X, final_k) print(f"Fitted KMeans with k={final_k}") # 5) Analyze and profile clusters df_clusters, cluster_profile = profile_clusters(df, scaled_df, model) print("\nCluster summary (first rows):") print(cluster_profile.head()) # 6) Visualization plot_pca_clusters(scaled_df, model.labels_, model, PLOTS_DIR) # 7) Save outputs save_outputs(df_clusters, cluster_profile, model, scaler, OUTPUT_DIR) # 8) Print a small human-friendly summary print("\nTop-level cluster sizes:") print(df_clusters["Cluster"].value_counts().sort_index()) # show cluster centroids in original feature space: inverse transform centers_original = scaler.inverse_transform(model.cluster_centers_) centers_df = pd.DataFrame(centers_original, columns=FEATURES) centers_df["Cluster"] = range(len(centers_df)) print("\nCluster centroids (approx original scale):") print(centers_df.round(2)) if __name__ == "__main__": main()
✅ Code Verified:
• It first looks for data/customers.csv; if not found it generates customers_synthetic.csv in data/
• Scaling is done with StandardScaler — required for K-Means
• determine_k computes both inertia and silhouette across k in 2..10 and saves a plot outputs/plots/elbow_silhouette.png
• PCA is used only for visualization (clusters are fit in scaled feature space)
• Model and scaler are saved with joblib for later reuse (a reuse sketch follows this list)
• random_state is set for reproducibility
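Because the model and scaler are persisted with joblib, new customers can be scored later without refitting. A short reuse sketch; the outputs/ paths are the files the script writes, while data/new_customers.csv is a hypothetical input with the same feature columns.

import pandas as pd
import joblib

FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]

# Load the artifacts written by customer_segmentation_kmeans.py
scaler = joblib.load("outputs/scaler.joblib")
model = joblib.load("outputs/kmeans_model.joblib")

# Hypothetical new batch of customers with the same feature columns
new_df = pd.read_csv("data/new_customers.csv")
X_new = scaler.transform(new_df[FEATURES])     # reuse the original scaling
new_df["Cluster"] = model.predict(X_new)       # assign to the existing segments
new_df.to_csv("outputs/new_customers_segmented.csv", index=False)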

6) Sample Output or Results

When you run the script, you'll see console messages like:

No data/customers.csv found. Generating synthetic dataset.
Synthetic data saved to data/customers_synthetic.csv
CustomerID Age Annual_Income_k$ Spending_Score Tenure_yrs
0 CUST1000 33 12 47 0
1 CUST1001 22 24 92 1
2 CUST1002 32 16 80 2
3 CUST1003 50 55 22 6
4 CUST1004 30 46 62 4
Saved elbow & silhouette plot to outputs/plots/elbow_silhouette.png
Recommended k (by silhouette): 4
Fitted KMeans with k=4

Cluster summary (first rows):
Cluster Age_mean Age_median Annual_Income_k$_mean ... Spending_Score_median Tenure_yrs_mean Tenure_yrs_median size
0 0 32.12 32.0 12.45 ... 78.0 1.00 1.00 95
1 1 55.10 54.0 89.50 ... 20.0 10.00 10.00 80
...
Saved PCA cluster plot to outputs/plots/pca_clusters.png
Saved segmented customers to outputs/customers_segmented.csv
Saved cluster summary to outputs/cluster_summary.json
Saved model and scaler to outputs
Top-level cluster sizes:
0 95
1 80
2 120
3 105
Name: Cluster, dtype: int64

Cluster centroids (approx original scale):
Age Annual_Income_k$ Spending_Score Tenure_yrs Cluster
0 32.12 12.45 78.0 1.00 0
1 55.10 89.50 20.0 10.00 1
...

Generated plots:

  • outputs/plots/elbow_silhouette.png — helps choose k
  • outputs/plots/pca_clusters.png — 2D visualization with centroids

Output files:

  • outputs/customers_segmented.csv — each row with Cluster assigned
  • outputs/cluster_summary.json — cluster statistics
  • outputs/kmeans_model.joblib, outputs/scaler.joblib — saved objects

7) Possible Enhancements

🎯 To make this project more advanced

  1. Use different clustering algorithms — DBSCAN for density-based segments, GaussianMixture for soft clustering (see the first sketch below)
  2. Mixed data support — If you have categorical variables (gender, region), use K-Prototypes or encode categoricals carefully (Target/WOE encoding)
  3. Feature engineering — Create RFM (Recency, Frequency, Monetary) features from transactional logs for better behavioral clustering (see the RFM sketch below)
  4. Automated K selection — use gap statistic or combine metrics into a single score
  5. Dashboard — build an interactive Streamlit / Dash dashboard for exploring clusters and filtering customers
  6. Model monitoring — when new data arrives, automatically re-evaluate cluster stability and retrain if drift is detected
  7. Customer lift tests — A/B test actions targeted to clusters and measure ROI
  8. Explainability — provide human-readable reasons for each customer's cluster assignment, such as nearest-centroid distances and top feature contributions (see the final sketch below)
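For enhancement 1, DBSCAN and GaussianMixture are near drop-in experiments on the same standardized matrix the main script feeds to K-Means. A minimal sketch with illustrative hyperparameters (eps, min_samples and n_components would need tuning on real data); the random matrix below merely stands in for the scaled features.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Stand-in for the standardized feature matrix from the main script (scaled_df.values)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))

# Density-based segments: customers labelled -1 are treated as noise/outliers
db_labels = DBSCAN(eps=0.6, min_samples=10).fit_predict(X)

# Soft clustering: every customer gets a membership probability for each segment
gmm = GaussianMixture(n_components=4, random_state=42).fit(X)
gmm_labels = gmm.predict(X)
membership = gmm.predict_proba(X)   # shape: (n_customers, n_components)
print(sorted(set(db_labels)), membership.shape)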
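For enhancement 3, RFM features can be built from a transactional log with a single groupby. The data/transactions.csv file and its columns (CustomerID, InvoiceDate, Amount) are assumptions about a typical purchase log, not files produced by this project.

import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.read_csv("data/transactions.csv", parse_dates=["InvoiceDate"])
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)   # reference date for recency

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    Frequency=("InvoiceDate", "count"),                            # number of purchases
    Monetary=("Amount", "sum"),                                    # total spend
).reset_index()

# rfm can now be scaled and clustered exactly like the FEATURES columns above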
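For enhancement 8, KMeans.transform returns each customer's distance to every centroid, which already supports a simple explanation: how decisively a customer belongs to its segment (the margin over the runner-up cluster) and which standardized features define that segment. A sketch that reloads the artifacts saved by the main script; it assumes the segmented CSV still contains a CustomerID column.

import numpy as np
import pandas as pd
import joblib

FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]

model = joblib.load("outputs/kmeans_model.joblib")
scaler = joblib.load("outputs/scaler.joblib")
df = pd.read_csv("outputs/customers_segmented.csv")

X = scaler.transform(df[FEATURES])
dists = model.transform(X)                 # distance from every customer to every centroid
assigned = dists.argmin(axis=1)
sorted_d = np.sort(dists, axis=1)
margin = sorted_d[:, 1] - sorted_d[:, 0]   # confidence proxy: gap to the runner-up centroid

i = 0                                      # explain the first customer as an example
centroid = pd.Series(model.cluster_centers_[assigned[i]], index=FEATURES)
# In standardized space, the centroid's most extreme values are the segment's defining traits
signature = centroid.reindex(centroid.abs().sort_values(ascending=False).index)

print(f"{df.loc[i, 'CustomerID']}: cluster {assigned[i]}, margin to next cluster {margin[i]:.2f}")
print("Defining features of this segment (standardized centroid values):")
print(signature.round(2))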