09 June 2023 - Raviteja Gullapalli

Mind of Machines Series: Clustering for Insights: K-Means and Hierarchical Clustering

Clustering is a powerful unsupervised learning technique used to identify natural groupings within data. Unlike supervised learning algorithms, clustering doesn’t require labeled data, making it useful for exploring datasets where the structure is not known. In this article, we’ll dive into two of the most popular clustering algorithms: K-Means and Hierarchical Clustering. Both methods are widely used for uncovering patterns in data and gaining insights into its structure.

What is Clustering?

Clustering algorithms aim to group data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. It’s a useful technique in a variety of fields, from customer segmentation to image compression.

We will explore two prominent clustering algorithms: K-Means and Hierarchical Clustering.

K-Means Clustering

K-Means clustering is one of the simplest and most commonly used clustering algorithms. It partitions the data into k clusters by minimizing the distance between data points and their respective cluster centroids. The number of clusters, k, is defined beforehand.
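To make the objective concrete, here is a minimal from-scratch sketch of a single K-Means iteration in NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is illustrative only (no empty-cluster handling, fixed initialization); the scikit-learn example below is what you would use in practice.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-Means iteration: assign points to nearest centroid, then update."""
    # Distance from every point to every centroid, shape (n_samples, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it
    new_centroids = np.array(
        [X[labels == j].mean(axis=0) for j in range(len(centroids))]
    )
    return labels, new_centroids

# Two well-separated toy blobs around (0, 0) and (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# Initialize with one data point from each blob (illustrative choice)
centroids = X[[0, 50]].copy()
for _ in range(10):
    labels, centroids = kmeans_step(X, centroids)
```

After a few iterations the centroids settle near the two blob centers, which is exactly the fixed point the algorithm minimizes toward.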

How K-Means Works

The algorithm proceeds iteratively:

1. Choose k initial centroids (often random data points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the assignments stop changing (convergence).

Let's see how to implement K-Means clustering in Python using scikit-learn.

Example: K-Means Clustering in Python

```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Create a KMeans model with 4 clusters
# (n_init is set explicitly to avoid version-dependent defaults)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)

# Predict cluster labels for the data points
y_kmeans = kmeans.predict(X)

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot the centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title("K-Means Clustering Example")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```

In this example, we generate synthetic data and cluster it into four groups using K-Means. The centroids of each cluster are marked in red. This method works well when you know the number of clusters in advance and is efficient on large datasets.
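When you don't know the number of clusters in advance, a common heuristic is the elbow method: fit K-Means for a range of k values and plot the inertia (scikit-learn's `inertia_` attribute, the sum of squared distances from points to their nearest centroid). A sketch, reusing the same synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Fit K-Means for a range of k and record the inertia
# (sum of squared distances of points to their nearest centroid)
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where adding clusters stops reducing inertia sharply;
# for this 4-blob dataset the curve flattens after k = 4.
```

Plotting `inertias` against k makes the elbow visible; inertia always decreases as k grows, so you look for the point of diminishing returns rather than the minimum.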

Hierarchical Clustering

Unlike K-Means, Hierarchical Clustering doesn’t require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by agglomerating smaller clusters (agglomerative clustering) or by splitting larger clusters (divisive clustering).

Agglomerative Clustering

In agglomerative clustering, each data point starts as its own cluster. The algorithm then iteratively merges the closest clusters until all data points belong to a single cluster or the desired number of clusters is reached. This method is commonly visualized using a dendrogram, which shows the hierarchy of merges.

How Hierarchical Clustering Works

Here’s how we can implement hierarchical clustering in Python using the scipy and scikit-learn libraries.

Example: Hierarchical Clustering in Python

```python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Perform hierarchical clustering with Ward linkage
Z = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
```

In this example, we generate synthetic data and apply agglomerative hierarchical clustering. The dendrogram shows the hierarchy of clusters and helps visualize how the data points are merged at each step.
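The dendrogram is a visualization; to get actual cluster labels you "cut" the tree. SciPy's `fcluster` does this from the same linkage matrix, for example requesting a fixed number of flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
Z = linkage(X, 'ward')

# Cut the hierarchy into (at most) 4 flat clusters
labels = fcluster(Z, t=4, criterion='maxclust')

# labels holds one cluster id (1..4) per data point
print(np.unique(labels))  # → [1 2 3 4]
```

Alternatively, `criterion='distance'` cuts at a chosen height on the dendrogram, which is often how the plot guides the choice of cluster count.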

K-Means vs. Hierarchical Clustering

Both K-Means and Hierarchical Clustering are effective clustering techniques, but they have different strengths and weaknesses:

- K-Means requires the number of clusters k up front; hierarchical clustering lets you choose it after inspecting the dendrogram.
- K-Means scales well to large datasets; agglomerative clustering typically costs O(n²) or more in time and memory, so it suits smaller datasets.
- K-Means assumes roughly spherical, similarly sized clusters; hierarchical clustering with an appropriate linkage can capture more varied structures.
- K-Means results depend on random initialization; hierarchical clustering is deterministic for a given linkage and distance metric.
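On well-separated data like our synthetic blobs, the two methods largely agree. A quick sketch of that comparison, using scikit-learn's `AgglomerativeClustering` (the agglomerative approach as an estimator) and the adjusted Rand index to compare label assignments regardless of label numbering:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                       random_state=42)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)

# Adjusted Rand index: 1.0 means identical groupings
# (cluster id numbering may still differ)
score = adjusted_rand_score(km_labels, agg_labels)
```

On messier, overlapping data the two methods diverge more, and the differences listed above start to matter.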

Conclusion

Clustering is a valuable tool for exploring data and uncovering hidden patterns. K-Means is great for large datasets and cases where you have a sense of how many clusters exist. Hierarchical Clustering, on the other hand, is useful for smaller datasets and when you want to visualize the clustering process through dendrograms. Both methods provide important insights into the structure of your data.