Clustering with K-Means

Grouping similar data points together.

Clustering is a form of unsupervised learning: you give the algorithm data with no labels, and it looks for natural groups. K-Means is one of the most widely used clustering algorithms because it is simple, fast, and effective when the clusters are compact and well separated.

How K-Means Works

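The basic algorithm places K centroids at random starting positions, assigns each point to its nearest centroid, then moves each centroid to the mean of the points assigned to it. It repeats the assign-and-move steps until the assignments stop changing (convergence). The result is K clusters of similar data points. Note that scikit-learn's KMeans defaults to k-means++ initialization, which spreads the starting centroids apart instead of placing them purely at random.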
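
To make the loop concrete, here is a minimal NumPy sketch of the steps described above. The function name kmeans_sketch and its parameters are made up for illustration, and it assumes the starting centroids are K randomly chosen data points; it is a teaching aid, not how scikit-learn implements K-Means.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Basic K-Means loop (illustrative sketch, not scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        # An empty cluster simply keeps its old centroid (a simplification).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_sketch(X, k=2)
print(f"Centroids: {centroids}")
print(f"Labels: {labels}")
```

In practice you would reach for scikit-learn's KMeans instead, as in the example that follows.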


from sklearn.cluster import KMeans
import numpy as np

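# Six 2-D points: three near the origin, three farther out
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Fit K-Means with two clusters; random_state makes the result reproducible
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

# Final centroid coordinates and the cluster index assigned to each point
print(f"Centroids: {kmeans.cluster_centers_}")
print(f"Labels: {kmeans.labels_}")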

The n_init parameter sets how many times the algorithm runs with different starting centroids. scikit-learn keeps the run with the lowest inertia, which reduces sensitivity to initialization.
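
To see what n_init is doing, the sketch below runs ten single-start fits by hand and keeps the one with the lowest inertia. The three-blob dataset and the choice of init="random" are assumptions made for this illustration; on easy data like this, most or all runs may land on the same answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each (made up for this example).
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [10, 0],
])

# Ten independent single-start runs, each with a different seed.
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
inertias = [run.inertia_ for run in single_runs]

# Keep the run with the lowest inertia; conceptually this is what n_init=10 does.
best_run = min(single_runs, key=lambda run: run.inertia_)

print("Inertia per run:", np.round(inertias, 1))
print(f"Best inertia: {best_run.inertia_:.1f}")
```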

Choosing K with the Elbow Method

How do you know how many clusters to use? The elbow method plots the inertia (the within-cluster sum of squared distances) for different K values. Inertia always falls as K grows, so you look for the "elbow" where the curve flattens out; that K is a good candidate.


from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: three blobs of 50 points each
np.random.seed(42)
X = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [10, 0]
])

# Fit K-Means for K = 1..9 and record the inertia of each fit
inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot inertia against K and look for the bend in the curve
plt.plot(K_range, inertias, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

Look for the point where adding more clusters no longer reduces inertia by much. In this example, the elbow is at K=3, which matches the three groups we generated.
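
Inertia itself is easy to reproduce by hand, which helps when you want to check what the curve is measuring. This short sketch sums the squared distance from each point to its own centroid and compares the result with inertia_; the six-point dataset is just a small example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# Look up the centroid of the cluster each point was assigned to.
assigned_centroids = km.cluster_centers_[km.labels_]

# Inertia: sum of squared distances from each point to its own centroid.
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"sklearn inertia: {km.inertia_:.4f}")
```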

Evaluating Clusters

Without labels, evaluating clusters is tricky. The silhouette score compares how close each point is to the other points in its own cluster with how close it is to the nearest other cluster. Values range from -1 to 1, with higher being better.


from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Synthetic data: two blobs of 30 points each
np.random.seed(42)
X = np.vstack([
    np.random.randn(30, 2) + [0, 0],
    np.random.randn(30, 2) + [5, 5]
])

# Compare the silhouette score for several values of K
for k in [2, 3, 4]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}: Silhouette Score = {score:.4f}")

As a rule of thumb, a silhouette score above 0.5 indicates reasonably well-separated clusters, while a score below 0.3 suggests the clusters overlap significantly.
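
If you are curious where a single point's silhouette value comes from, the sketch below computes it by hand for one point and compares it with scikit-learn's silhouette_samples. It uses the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster. The variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

np.random.seed(42)
X = np.vstack([
    np.random.randn(30, 2) + [0, 0],
    np.random.randn(30, 2) + [5, 5],
])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

i = 0  # index of the point to inspect
dists = np.linalg.norm(X - X[i], axis=1)

# a: mean distance to the other points in the same cluster.
same = labels == labels[i]
same[i] = False  # exclude the point itself
a = dists[same].mean()

# b: mean distance to the points of the nearest other cluster.
b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_manual = (b - a) / max(a, b)
print(f"Manual:  {s_manual:.4f}")
print(f"sklearn: {silhouette_samples(X, labels)[i]:.4f}")
```

The overall silhouette score is the mean of these per-point values.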

Key Takeaways

K-Means groups data into K clusters based on similarity
The elbow method helps choose the right number of clusters
Silhouette score evaluates cluster quality without labels
Scale your features before clustering when they use different units or ranges, because K-Means relies on Euclidean distance
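
As a sketch of that last point, the example below clusters the same data with and without scaling. The age and income values are hypothetical, and wrapping StandardScaler and KMeans in a pipeline is one reasonable way to do this, not the only one. Without scaling, income differences in the thousands swamp age differences in the tens, so the unscaled run splits on income alone; after scaling, both features contribute on a comparable footing.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [age in years, income in dollars].
X = np.array([
    [25, 40_000], [27, 52_000], [26, 64_000],
    [58, 42_000], [60, 54_000], [59, 66_000],
], dtype=float)

# Without scaling, income dominates the Euclidean distance.
raw_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# StandardScaler rescales each feature to zero mean and unit variance first.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, random_state=42, n_init=10),
)
scaled_labels = pipeline.fit_predict(X)

print(f"Raw labels:    {raw_labels}")
print(f"Scaled labels: {scaled_labels}")
```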
