Customer Segmentation using K-Means Clustering
Advanced
Build an intelligent customer segmentation system with machine learning
1) Project Overview
What it does:
This project creates a full pipeline to segment customers using K-Means clustering. It accepts customer data (age, annual income, spending score, tenure, etc.), preprocesses features, selects the optimal number of clusters using the Elbow Method and Silhouette Score, fits K-Means, analyzes cluster profiles, visualizes clusters, and exports results for downstream use.
Real-world use case:
Marketers, product teams, and analysts use customer segmentation to target promotions, personalize communications, design loyalty programs, and guide product development. For example, distinguishing "high-income, high-spender" segments from "low-income, high-potential" segments calls for different marketing strategies.
Technical goals:
- Clean/standardize tabular customer data
- Feature engineering and scaling
- Model selection for K in K-Means (Elbow + Silhouette)
- Fit and evaluate K-Means clustering
- Visualize clusters in 2D (PCA) and feature distributions
- Save segmented dataset and summary reports
2) Key Technologies & Libraries
| Technology | Purpose |
|---|---|
| Python 3.9+ | Core programming language |
| numpy | Numerical operations |
| pandas | Data handling and CSV I/O |
| scikit-learn | Preprocessing, PCA, KMeans, silhouette score |
| matplotlib | Plotting |
| seaborn | Optional — nicer plots (if installed) |
| joblib | Save fitted model and scaler for reuse |
Install required packages:
pip install numpy pandas scikit-learn matplotlib seaborn joblib
3) Learning Outcomes
After completing the project you will learn:
- 🔍 Data preprocessing and feature engineering — Preparing tabular customer data for clustering analysis
- 📊 Feature scaling — How and why to standardize/scale features before K-Means (a short illustration follows this list)
- 🎯 Cluster selection methods — How to choose the number of clusters using Elbow and Silhouette methods
- 📈 PCA visualization — Use of PCA to visualize high-dimensional clusters in 2D
- 🔬 Cluster analysis — Interpreting cluster centroids and profiling segments
- 💾 Export results — For business use and reproducible workflows
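Why scaling matters deserves a concrete look: K-Means assigns points by Euclidean distance, so a feature measured in large units (income in dollars) swamps one measured on a small scale (a 1-100 score) unless both are standardized. A tiny illustration with made-up numbers (not from the project data):

# Illustration only: without scaling, the column with the largest numeric range dominates distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three made-up customers: [age, income in $, spending score]
customers = np.array([
    [30, 20_000, 80],
    [65, 21_000, 20],
    [45, 60_000, 50],
])

print(np.linalg.norm(customers[0] - customers[1]))  # ~1002: almost entirely driven by the $1,000 income gap

scaled = StandardScaler().fit_transform(customers)
print(np.linalg.norm(scaled[0] - scaled[1]))  # age and spending differences now count comparably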
4) Step-by-Step Explanation
- Project setup — Create a virtual environment and install dependencies. Create folders: data/, outputs/, notebooks/
- Data ingestion — Load a real customer CSV (if available) or generate realistic synthetic data for testing
- Exploratory data analysis (EDA) — Inspect distributions, missing values, correlations
- Preprocessing — Fill/handle missing values, encode categorical fields if any, standardize numeric features
- Determine K — Compute inertia (within-cluster sum of squared distances) for each candidate k and plot the Elbow curve; compute silhouette scores for the same candidates to validate cluster separation
- Fit K-Means — Fit final K-Means with chosen k, set random_state for reproducibility
- Analyze clusters — Compute centroids, cluster sizes, average feature values per cluster; label segments
- Visualization — Reduce to 2D via PCA and plot clusters and centroids. Plot feature distributions per cluster
- Export results — Save customers_segmented.csv and summary JSON/CSV
- Iterate & enhance — Try different features, other algorithms or distance measures (K-Prototypes for mixed data), or wrap the steps in an sklearn Pipeline (see the sketch after this list)
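As a preview of the pipeline idea in the last step, scaling and clustering can be wrapped into a single sklearn Pipeline so they are always applied together. A minimal sketch on a toy frame (the feature names match the script below; k=2 is chosen only because the toy frame is tiny):

# Minimal sketch: StandardScaler + KMeans wrapped in one sklearn Pipeline.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy customer frame; in the real project this is the loaded CSV.
toy = pd.DataFrame({
    "Age": [23, 45, 61, 34, 52, 29],
    "Annual_Income_k$": [18, 42, 95, 60, 88, 25],
    "Spending_Score": [85, 40, 15, 62, 22, 78],
    "Tenure_yrs": [1, 5, 11, 3, 9, 2],
})

segmenter = Pipeline(steps=[
    ("scale", StandardScaler()),                                   # scale first so no column dominates distances
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),  # k=2 only suits this toy frame
])

toy["Cluster"] = segmenter.fit_predict(toy[["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]])
print(toy)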
5) Full Working and Verified Python Code
Save as customer_segmentation_kmeans.py. This script generates realistic synthetic data (so you can run it immediately) and will load data/customers.csv instead if that file is present.
"""
customer_segmentation_kmeans.py
Customer Segmentation using K-Means Clustering.
Usage:
python customer_segmentation_kmeans.py
Outputs:
outputs/customers_segmented.csv
outputs/cluster_summary.json
outputs/plots/ (PNG plots)
The script will:
- generate synthetic customer data (if data/customers.csv not found),
- preprocess features,
- find optimal k (Elbow + Silhouette),
- fit KMeans, visualize and save results.
"""
from __future__ import annotations
import os
import json
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
try:
    import seaborn as sns  # optional: nicer plots; the script falls back to matplotlib if seaborn is missing
except ImportError:
    sns = None
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import joblib
# -------------------- Configuration --------------------
RANDOM_STATE = 42
OUTPUT_DIR = Path("outputs")
PLOTS_DIR = OUTPUT_DIR / "plots"
DATA_DIR = Path("data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
PLOTS_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR.mkdir(parents=True, exist_ok=True)
# Candidate features to use
FEATURES = ["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]
# -------------------- Utility Functions --------------------
def generate_synthetic_customers(n_samples: int = 400, random_state: int = RANDOM_STATE) -> pd.DataFrame:
"""
Generate a realistic synthetic customer dataset with four features:
- Age (18-70), Annual Income (k$), Spending Score (1-100), Tenure (years).
We'll create multiple blobs to reflect natural clusters.
"""
centers = [
[25, 15, 80, 1.0], # young low-income, high spending
[45, 30, 40, 5.0], # mid-age mid-income, moderate spending
[55, 90, 20, 10.0], # older high-income, low spending
[30, 60, 60, 3.0], # young-mid income, decent spending
]
cluster_std = [4.5, 6.0, 5.0, 5.5]
X, y = make_blobs(n_samples=n_samples, centers=centers, cluster_std=cluster_std, random_state=random_state)
    # X columns correspond to the centers above; round and clip to realistic ranges
    df = pd.DataFrame(X, columns=["age_c", "inc_c", "spend_c", "tenure_c"])
    # Centers are already on realistic scales, so just round and clip extremes
    df["Age"] = np.clip(np.round(df["age_c"]), 18, 75).astype(int)
    df["Annual_Income_k$"] = np.clip(np.round(df["inc_c"]), 10, 300).astype(int)  # in thousands
    df["Spending_Score"] = np.clip(np.round(df["spend_c"]), 1, 100).astype(int)   # 1-100
    df["Tenure_yrs"] = np.clip(np.round(np.abs(df["tenure_c"])), 0, 40).astype(int)
df = df[FEATURES].reset_index(drop=True)
# Add CustomerID
df.insert(0, "CustomerID", [f"CUST{1000+i}" for i in range(len(df))])
return df
def load_data(path: Path = DATA_DIR / "customers.csv") -> pd.DataFrame:
"""
Load data/customers.csv if exists; otherwise generate synthetic data.
The CSV should contain columns matching FEATURES.
"""
if path.exists():
df = pd.read_csv(path)
print(f"Loaded {len(df)} rows from {path}")
else:
print(f"No {path} found. Generating synthetic dataset.")
df = generate_synthetic_customers(n_samples=400)
df.to_csv(DATA_DIR / "customers_synthetic.csv", index=False)
print(f"Synthetic data saved to {DATA_DIR / 'customers_synthetic.csv'}")
# Validate presence of required features
missing = [c for c in FEATURES if c not in df.columns]
if missing:
raise ValueError(f"Missing required columns in dataset: {missing}")
return df
# -------------------- Preprocessing --------------------
def preprocess(df: pd.DataFrame, features: list[str]) -> tuple[pd.DataFrame, StandardScaler]:
"""
- Fill missing values (if any)
- Scale features using StandardScaler
Returns scaled DataFrame and the scaler object.
"""
data = df.copy()
# Fill numeric missing values with median
for col in features:
if data[col].isna().any():
median = data[col].median()
data[col] = data[col].fillna(median)
print(f"Filled missing values in {col} with median {median}")
scaler = StandardScaler()
X = scaler.fit_transform(data[features])
scaled_df = pd.DataFrame(X, columns=features, index=data.index)
return scaled_df, scaler
# -------------------- Choose optimal K --------------------
def determine_k(X: np.ndarray, k_range: range = range(2, 11)) -> dict:
"""
Compute inertia (elbow) and silhouette scores across k_range.
Returns dictionary with lists for 'k', 'inertia', 'silhouette'.
"""
inertias = []
silhouettes = []
ks = []
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10)
labels = kmeans.fit_predict(X)
inertias.append(kmeans.inertia_)
# silhouette requires more than 1 unique label
if len(set(labels)) > 1:
sil = silhouette_score(X, labels)
else:
sil = float("nan")
silhouettes.append(sil)
ks.append(k)
return {"k": ks, "inertia": inertias, "silhouette": silhouettes}
def plot_elbow_silhouette(metrics: dict, outpath: Path):
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(metrics["k"], metrics["inertia"], marker="o")
plt.title("Elbow Method (Inertia)")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.grid(True)
plt.subplot(1,2,2)
plt.plot(metrics["k"], metrics["silhouette"], marker="o")
plt.title("Silhouette Score")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.tight_layout()
plt.savefig(outpath / "elbow_silhouette.png", dpi=150)
plt.close()
print(f"Saved elbow & silhouette plot to {outpath / 'elbow_silhouette.png'}")
# -------------------- Fit final KMeans --------------------
def fit_kmeans(X: np.ndarray, k: int) -> KMeans:
model = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=20)
model.fit(X)
return model
# -------------------- Analyze clusters --------------------
def profile_clusters(df_original: pd.DataFrame, scaled_X: pd.DataFrame, model: KMeans) -> tuple[pd.DataFrame, pd.DataFrame]:
labels = model.labels_
df = df_original.copy().reset_index(drop=True)
df["Cluster"] = labels
# Compute per-cluster stats
cluster_profile = df.groupby("Cluster")[FEATURES].agg(["mean","median","count"]).round(2)
# Flatten multiindex
cluster_profile.columns = ["_".join(col).strip() for col in cluster_profile.columns.values]
# Add cluster size
cluster_profile["size"] = df.groupby("Cluster").size()
return df, cluster_profile.reset_index()
# -------------------- 2D Visualization using PCA --------------------
def plot_pca_clusters(scaled_X: pd.DataFrame, labels: np.ndarray, model: KMeans, outpath: Path):
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca = pca.fit_transform(scaled_X)
plt.figure(figsize=(8,6))
    # Use seaborn when available, otherwise fall back to plain matplotlib
    if sns is not None:
        sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=labels, palette="deep", legend="full", s=60)
    else:
        plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=60)
# Plot centroids projected
centroids = pca.transform(model.cluster_centers_)
plt.scatter(centroids[:,0], centroids[:,1], marker="X", s=200, c="black", label="centroids")
plt.title("K-Means Clusters (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.tight_layout()
plt.savefig(outpath / "pca_clusters.png", dpi=150)
plt.close()
print(f"Saved PCA cluster plot to {outpath / 'pca_clusters.png'}")
# -------------------- Save results --------------------
def save_outputs(df_with_clusters: pd.DataFrame, cluster_profile: pd.DataFrame, model: KMeans, scaler: StandardScaler, outdir: Path):
csv_path = outdir / "customers_segmented.csv"
df_with_clusters.to_csv(csv_path, index=False)
json_path = outdir / "cluster_summary.json"
cluster_profile.to_json(json_path, orient="records", indent=2)
# save model and scaler for later reuse
joblib.dump(model, outdir / "kmeans_model.joblib")
joblib.dump(scaler, outdir / "scaler.joblib")
print(f"Saved segmented customers to {csv_path}")
print(f"Saved cluster summary to {json_path}")
print(f"Saved model and scaler to {outdir}")
# -------------------- Main pipeline --------------------
def main():
# 1) Load or generate data
df = load_data(path=DATA_DIR / "customers.csv")
print(df.head())
# 2) Preprocess & scale
scaled_df, scaler = preprocess(df, FEATURES)
X = scaled_df.values
# 3) Determine k
metrics = determine_k(X, k_range=range(2,11))
plot_elbow_silhouette(metrics, PLOTS_DIR)
    # Recommend the k with the highest silhouette score; fall back to the first candidate if all scores are NaN
    sil = np.array(metrics["silhouette"], dtype=float)
    best_idx = int(np.nanargmax(sil)) if not np.all(np.isnan(sil)) else 0
recommended_k = metrics["k"][best_idx]
print(f"Recommended k (by silhouette): {recommended_k}")
# 4) Fit final KMeans
final_k = recommended_k
model = fit_kmeans(X, final_k)
print(f"Fitted KMeans with k={final_k}")
# 5) Analyze and profile clusters
df_clusters, cluster_profile = profile_clusters(df, scaled_df, model)
print("\nCluster summary (first rows):")
print(cluster_profile.head())
# 6) Visualization
plot_pca_clusters(scaled_df, model.labels_, model, PLOTS_DIR)
# 7) Save outputs
save_outputs(df_clusters, cluster_profile, model, scaler, OUTPUT_DIR)
# 8) Print a small human-friendly summary
print("\nTop-level cluster sizes:")
print(df_clusters["Cluster"].value_counts().sort_index())
# show cluster centroids in original feature space: inverse transform
centers_original = scaler.inverse_transform(model.cluster_centers_)
centers_df = pd.DataFrame(centers_original, columns=FEATURES)
centers_df["Cluster"] = range(len(centers_df))
print("\nCluster centroids (approx original scale):")
print(centers_df.round(2))
if __name__ == "__main__":
    main()

Notes on the script:
• It first looks for data/customers.csv; if not found, it generates customers_synthetic.csv in data/
• Scaling is done with StandardScaler — required for K-Means
• determine_k computes both inertia and silhouette across k in 2..10 and saves a plot outputs/plots/elbow_silhouette.png
• PCA is used only for visualization (clusters are fit in scaled feature space)
• Model and scaler are saved with joblib for later reuse (see the reuse sketch after these notes)
• random_state is set for reproducibility
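Because the fitted model and scaler are persisted, a later job can assign segments to new customers without refitting. A minimal sketch, assuming the script above has already been run so the joblib files exist in outputs/ (the example customers are made up):

# Minimal sketch: reuse the saved scaler + KMeans model on new customers.
import joblib
import pandas as pd

model = joblib.load("outputs/kmeans_model.joblib")
scaler = joblib.load("outputs/scaler.joblib")

new_customers = pd.DataFrame({
    "Age": [27, 58],
    "Annual_Income_k$": [22, 95],
    "Spending_Score": [88, 18],
    "Tenure_yrs": [1, 12],
})

# Apply the same scaling learned at training time, then predict cluster labels.
X_new = scaler.transform(new_customers[["Age", "Annual_Income_k$", "Spending_Score", "Tenure_yrs"]])
new_customers["Cluster"] = model.predict(X_new)
print(new_customers)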
6) Sample Output or Results
When you run the script, you'll see console messages like:
Synthetic data saved to data/customers_synthetic.csv
CustomerID Age Annual_Income_k$ Spending_Score Tenure_yrs
0 CUST1000 33 12 47 0
1 CUST1001 22 24 92 1
2 CUST1002 32 16 80 2
3 CUST1003 50 55 22 6
4 CUST1004 30 46 62 4
Saved elbow & silhouette plot to outputs/plots/elbow_silhouette.png
Recommended k (by silhouette): 4
Fitted KMeans with k=4
Cluster summary (first rows):
Cluster Age_mean Age_median Annual_Income_k$_mean ... Spending_Score_median Tenure_yrs_mean Tenure_yrs_median size
0 0 32.12 32.0 12.45 ... 78.0 1.00 1.00 95
1 1 55.10 54.0 89.50 ... 20.0 10.00 10.00 80
...
Saved PCA cluster plot to outputs/plots/pca_clusters.png
Saved segmented customers to outputs/customers_segmented.csv
Saved cluster summary to outputs/cluster_summary.json
Saved model and scaler to outputs
Top-level cluster sizes:
0 95
1 80
2 120
3 105
Name: Cluster, dtype: int64
Cluster centroids (approx original scale):
Age Annual_Income_k$ Spending_Score Tenure_yrs Cluster
0 32.12 12.45 78.0 1.00 0
1 55.10 89.50 20.0 10.00 1
...
Generated plots:
- outputs/plots/elbow_silhouette.png — helps choose k
- outputs/plots/pca_clusters.png — 2D visualization with centroids
Output files:
- outputs/customers_segmented.csv — each row with Cluster assigned
- outputs/cluster_summary.json — cluster statistics
- outputs/kmeans_model.joblib, outputs/scaler.joblib — saved objects
7) Possible Enhancements
🎯 To make this project more advanced
- Use different clustering algorithms — DBSCAN for density-based segments, GaussianMixture for soft clustering (see the sketch after this list)
- Mixed data support — If you have categorical variables (gender, region), use K-Prototypes or encode categoricals carefully (Target/WOE encoding)
- Feature engineering — create RFM (Recency, Frequency, Monetary) features from transactional logs for better behavioral clustering
- Automated K selection — use gap statistic or combine metrics into a single score
- Dashboard — build an interactive Streamlit / Dash dashboard for exploring clusters and filtering customers
- Model monitoring — when new data arrives, automatically re-evaluate cluster stability and retrain if drift is detected
- Customer lift tests — A/B test actions targeted to clusters and measure ROI
- Explainability — provide human-readable reasons per customer why they were put into a cluster (nearest centroid distances, top feature contributions)
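For the first enhancement, GaussianMixture is a drop-in way to get soft assignments: each customer receives a membership probability per segment rather than a single hard label. A minimal sketch, where X stands in for the scaled feature matrix produced by preprocess() in the script (the make_blobs data and the 60% threshold are illustrative assumptions):

# Minimal sketch: soft clustering with a Gaussian mixture instead of K-Means.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

# Placeholder for the scaled customer feature matrix from the script.
X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X_raw)

gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)   # hard labels, comparable to KMeans output
probs = gmm.predict_proba(X)  # soft labels: one probability per segment

# Customers without a dominant segment sit between clusters and may warrant separate handling.
uncertain = probs.max(axis=1) < 0.6
print(f"{uncertain.sum()} of {len(X)} customers have no segment with >= 60% membership")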