Here’s a step-by-step approach using Python:
- Generating Sample Data: We’ll create a synthetic dataset.
- Applying K-means: We’ll apply the K-means algorithm to segment customers.
- Visualization: We’ll visualize the clusters.
First, install the required packages if you haven’t already (NumPy is pulled in automatically as a dependency of both):
pip install matplotlib scikit-learn
Now, let’s write the Python code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: Generate synthetic data
# We create a dataset with 200 samples, 2 features (annual income and spending score) and roughly 4 clusters
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Apply K-means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Step 3: Visualization of clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)  # Red dots are cluster centers
plt.title('Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
Detailed Explanation of the Python Code
Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
Generate Synthetic Data
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)
- make_blobs generates 200 samples (data points), spread across 4 centers (clusters), with a standard deviation of 0.60 for each cluster.
- X contains the coordinates for each point (here representing features like “Annual Income” and “Spending Score”).
- _ (underscore) is a placeholder for the cluster labels returned by make_blobs, which we don’t use here since we want to perform our own clustering.
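To make the return values concrete, here is a small sketch that inspects what make_blobs produces. The name true_labels is chosen for illustration only; the script above discards this array by assigning it to _.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the tutorial, but keeping the labels instead of discarding them.
X, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

print(X.shape)                   # (200, 2): 200 points, 2 features
print(np.bincount(true_labels))  # [50 50 50 50]: samples are split evenly across the 4 centers
```

The two columns of X are just unitless coordinates; calling them “Annual Income” and “Spending Score” is a labeling convention for this demonstration.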
Apply K-means Clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
- KMeans(n_clusters=4) creates a K-means clustering model with 4 clusters, based on our initial assumption or analysis (e.g., the Elbow Method might have suggested this number). Because no random_state is set, the numbering of the clusters can differ between runs.
- kmeans.fit(X) fits the K-means model on the dataset X. This method computes the centroids of the clusters, trying to minimize the within-cluster variance (the sum of squared distances from each point to its centroid).
- kmeans.predict(X) assigns each sample in X to one of the 4 clusters, returning an array of cluster indices which we store in y_kmeans.
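The Elbow Method mentioned above can be sketched roughly as follows. This is an illustration rather than part of the original script: the range of candidate values (1 to 6) and the n_init and random_state settings are assumptions made so the run is reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# Fit one model per candidate k and record its inertia
# (the within-cluster sum of squared distances that K-means minimizes).
candidate_ks = range(1, 7)  # assumed search range, chosen for illustration
inertias = []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(candidate_ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertias against candidate_ks and looking for the point where the curve stops dropping steeply gives a rough suggestion for the number of clusters; it is a heuristic, not a guarantee.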
Visualization of Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('Customer Segmentation')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis') plots all the points in X. The parameters:
- X[:, 0] and X[:, 1] are the coordinates for each point (annual income and spending score, respectively).
- c=y_kmeans colors each point based on its cluster assignment, providing visual differentiation of clusters.
- s=50 sets the size of the points.
- cmap='viridis' uses the ‘viridis’ color map for coloring the different clusters.
centers = kmeans.cluster_centers_ extracts the centroids of the clusters calculated by the K-means algorithm.
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75) plots the centroids as red dots, where:
- s=200 makes the centroid dots larger than the data points.
- alpha=0.75 makes the red color slightly translucent.
The remaining plt functions set the title and labels for the axes, and plt.show() displays the plot.
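Beyond the plot, the same fitted model can be summarized numerically. The sketch below is an addition for illustration (n_init and random_state are assumed settings for reproducibility): it counts the points in each segment and checks that each centroid is approximately the mean of the points assigned to it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_            # cluster index for each training point
centers = kmeans.cluster_centers_  # one (x, y) centroid per cluster

# Number of points ("customers") in each segment.
cluster_sizes = np.bincount(labels, minlength=4)

# Recompute each centroid as the mean of its assigned points and compare.
recomputed = np.array([X[labels == k].mean(axis=0) for k in range(4)])
max_diff = np.abs(recomputed - centers).max()

print("Cluster sizes:", cluster_sizes)
print("Largest centroid difference:", max_diff)
```

A near-zero difference reflects how K-means works: at convergence, each red dot in the plot sits at the average position of its cluster’s points.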
The imports at the top of the script serve the following purposes:
- numpy is imported to handle data in array format efficiently, although this example does not call it directly.
- matplotlib.pyplot is used for plotting graphs to visualize data and clustering results.
- KMeans from sklearn.cluster is the K-means clustering implementation in the scikit-learn library.
- make_blobs from sklearn.datasets is used to generate synthetic datasets with a Gaussian distribution, ideal for demonstrating clustering.
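One reason synthetic data suits demonstrations is that make_blobs also returns the blob each point was drawn from, so the clustering can be checked against a known answer. The sketch below is an optional addition, not part of the original walkthrough; it uses adjusted_rand_score from sklearn.metrics, and the n_init and random_state values are assumptions for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Keep the true blob labels this time instead of discarding them.
X, true_labels = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# fit_predict fits the model and returns the cluster index of each point.
predicted = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 means identical groupings, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(true_labels, predicted)
print(f"Adjusted Rand index: {ari:.3f}")
```

Real customer data has no such ground truth, so this check only applies to synthetic examples like this one.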