Python Code to Demonstrate Customer Segmentation with K-Means

Here’s a step-by-step approach using Python:

Generating Sample Data: We’ll create a synthetic dataset.
Applying K-means: We’ll apply the K-means algorithm to segment customers.
Visualization: We’ll visualize the clusters.

First, install the required packages if you haven’t already (numpy is installed automatically as a dependency of both):

pip install matplotlib scikit-learn

Now, let’s write the Python code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: Generate synthetic data
# We create a dataset with 200 samples, 2 features (annual income and spending score) and roughly 4 clusters
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Apply K-means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Step 3: Visualization of clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)  # Red dots are cluster centers
plt.title('Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()

Detailed Explanation of the Python Code

Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

Generate Synthetic Data

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

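make_blobs generates 200 samples (data points) spread across 4 centers (clusters), with a standard deviation of 0.60 for each cluster. Each sample has 2 features, which is the function’s default, and random_state=0 makes the generated data reproducible.

X is an array of shape (200, 2) holding the coordinates of each point. The two columns stand in for features such as “Annual Income” and “Spending Score”; the values are unitless synthetic numbers, not real customer data.

_ (underscore) is a placeholder for the true cluster labels returned by make_blobs, which we discard here because we want to perform our own clustering.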
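To make the description of make_blobs concrete, here is a minimal sketch that keeps the labels instead of discarding them and inspects what the function returns. The variable name true_labels is an illustrative choice and does not appear in the lesson code.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the lesson, but the second return value is kept
# (true_labels is a hypothetical name used only for this illustration)
X, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# 200 rows (points) and 2 columns (features; n_features defaults to 2)
print(X.shape)  # (200, 2)

# One label per point, identifying the blob it was drawn from
print(true_labels.shape)  # (200,)

# With 200 samples and 4 centers, the samples are split evenly
print(np.bincount(true_labels))  # [50 50 50 50]
```

These labels are only available because the data is synthetic. With real customer data there is no ground truth to compare against, which is why the lesson ignores them.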

Apply K-means Clustering

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

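KMeans(n_clusters=4) creates a K-means clustering model with 4 clusters. The number must be chosen up front, either from prior knowledge (here we generated 4 blobs) or from an analysis such as the Elbow Method. Because no random_state is passed, the centroids are initialized randomly, so the numbering of the clusters can differ from run to run.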

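kmeans.fit(X) fits the K-means model on the dataset X. It computes the cluster centroids by iteratively reducing the within-cluster sum of squared distances between each point and its centroid, which scikit-learn exposes as kmeans.inertia_.

kmeans.predict(X) assigns each sample in X to the nearest of the 4 centroids and returns an array of cluster indices (0 to 3), which we store in y_kmeans. Because the model was fitted on the same X, this array matches kmeans.labels_.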
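The quantity that fit minimizes, and the Elbow Method mentioned above, can both be sketched in a few lines. This is an illustration rather than part of the lesson script: n_init=10 and random_state=0 are assumptions added here for reproducibility, and the range of k values (1 to 8) is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# n_init and random_state are set explicitly here; the lesson code leaves them at their defaults
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared distance from each point to its assigned centroid
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)
print(np.isclose(manual_inertia, kmeans.inertia_))

# Elbow Method sketch: fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Inertia keeps falling as k grows, so the lowest value is not the goal. The usual heuristic is to pick the k after which the decrease flattens out, which for well-separated blobs like these should be near the number of generated centers.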

Visualization of Clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis') plots all the points in X. The parameters:

X[:, 0] and X[:, 1] are the coordinates for each point (annual income and spending score, respectively).

c=y_kmeans colors each point based on its cluster assignment, providing visual differentiation of clusters.

s=50 sets the size of the points.

cmap='viridis' uses the ‘viridis’ color map for coloring the different clusters.

centers = kmeans.cluster_centers_ extracts the centroids of the clusters calculated by the K-means algorithm.

plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75) plots the centroids as red dots, where:

s=200 makes the centroid dots larger than the data points.

alpha=0.75 makes the red color slightly translucent.

The remaining plt functions set the title and labels for the axes, and plt.show() displays the plot.
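The information in the plot can also be read as numbers, which is often how a segmentation is reported. The sketch below prints the size and centroid of each cluster. It refits the model with n_init=10 and random_state=0, which are assumptions added here so the output is stable between runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Count how many points were assigned to each cluster index
sizes = np.bincount(kmeans.labels_, minlength=4)

# Each row of cluster_centers_ is one centroid: (feature 0, feature 1)
for k, (center, size) in enumerate(zip(kmeans.cluster_centers_, sizes)):
    print(f"cluster {k}: {size} points, centroid at ({center[0]:.2f}, {center[1]:.2f})")
```

In a real project the centroid coordinates would be interpreted in the units of the original features, for example a segment with high income and a low spending score.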

The four imports at the top of the script serve the following purposes:

numpy provides the array type that scikit-learn and matplotlib operate on. This script never calls np directly, so the import is optional here.

matplotlib.pyplot is used for plotting graphs to visualize data and clustering results.

KMeans from sklearn.cluster is the K-means clustering implementation in the scikit-learn library.

make_blobs from sklearn.datasets generates synthetic datasets made of isotropic Gaussian blobs, which makes it convenient for demonstrating clustering.
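One place where numpy becomes directly useful is assigning new data to the existing segments. The following is a minimal sketch, assuming two made-up points in the same unitless coordinates as X. The name new_customers and its values are hypothetical, and n_init=10 and random_state=0 are added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Hypothetical new customers, one row per customer, two features each
new_customers = np.array([[0.0, 4.0], [2.0, 1.0]])

# predict returns the index of the nearest centroid for each row
new_segments = kmeans.predict(new_customers)
print(new_segments)

# Sanity check: each centroid is nearest to itself
print(kmeans.predict(kmeans.cluster_centers_))  # [0 1 2 3]
```

The model is not refitted when predict is called, so the centroids stay fixed and new points are only labeled.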
