HIERARCHICAL CLUSTERING ALGORITHM

Introduction

Hierarchical Clustering is also known as Hierarchical Clustering Analysis (HCA). It is a method in Data Mining and Statistics used for cluster analysis. We build a hierarchy of clusters to perform cluster analysis. There are mainly two categories of hierarchical clustering approaches:

1. Agglomerative
2. Divisive

Generally, a dendrogram is used to represent the results of hierarchical clustering. A dendrogram looks like the one as shown below:

The required data for hierarchical clustering can be distance matrix as well as raw data. If we provide raw data we can modify the algorithm to calculate the distance matrix.

A distance matrix is a matrix which contains the distance of each point to the clusters in the data. It can be represented as:

Agglomerative Hierarchical Clustering

It is a bottom-up clustering method where we divide the clusters into sub-clusters,  further into sub-clusters, and so on.

Agglomerative hierarchical clustering starts with every single object (gene or sample) in a single cluster. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters by satisfying some similarity criteria, until all of the data is in one cluster. Clusters generated in early stages are nested in those generated in later stages.

The process of agglomerative clustering is as follows:

• Firstly, assign each object to a cluster
• Then find the distance matrix
• Find the 2 clusters which have the shortest distance and merge them
• Continue this process until only one big cluster is formed.

The diagram below shows the process of agglomerative hierarchical clustering in an easy manner:

Divisive Hierarchical Clustering

This type of clustering follows a top-down approach. We start at the top with all the data objects in one cluster and follow the algorithm until each object is in its own singleton cluster.

Bottom-up methods make clustering decisions based on local patterns without initially taking into account the global distribution. We cannot undo these early decisions. Top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions.

The diagram below shows the process of divisive hierarchical clustering in an easy manner:

Distance Metric

The distance metric used in most of the real-life problem is Euclidean Distance. One may also choose the distance metric according to the relevance of the problem. But generally, Euclidean Distance is the most efficient metric if no pre-specified constraints are present.

Linkage Criteria refers to the point from where we calculate the distance between the two clusters and merge them.

There are several ways to perform the linkage:

1. Single Linkage- calculated between the two most similar parts of the cluster
2. Complete Linkage-calculated between the least two similar parts of a cluster
3. Mean/Average Linkage-calculated from the center of clusters.

Space and Time Requirements

O(N2 ) space since it uses the distance matrix ( N is the number of points)

O(N3 ) time in many cases

– There are N steps and at each step, the size, N2, distance matrix must be updated and searched

– Complexity can be reduced to O(N2 log(N) ) time by using a heap in some cases

Implementing Agglomerative Hierarchical Clustering in Scikit Learn

We will implement the agglomerative hierarchical clustering using Scikit learn in Python. First, we will import all the libraries that are essential to run the code we will write.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
We will use a dummy dataset which we will create using the numpy array.

X=np.array([[5,3],
…: [10,15],
…: [15,12],
…: [24,10],
…: [30,30],
…: [85,70],
…: [71,80],
…: [60,78],
…: [70,55],
…: [80,91],])
Now, we’ll use the Agglomerative Cluster function that we have imported from sklearn

Here, we have used the Euclidean Distance metric and Ward linkage criteria.

Next, we’ll create a dendrogram to represent the clustering.

labelList=range(1,11)
plt.figure(figsize=(10,7))
orientation=’top’,
labels=labelList,
distance_sort=’descending’,
show_leaf_counts=True)
plt.show()

The dendrogram thus formed is as shown below in the figure:

Performing Supervised and Unsupervised Clustering

Unsupervised Classification

In unsupervised learning or clustering, the aim is to discover groups of similar instances within the data. In this approach, we have no information about the class label of data or how many classes there are. One of the most common unsupervised methods is hierarchical clustering.

Supervised Classification

Supervised classification algorithms are used to assess and compare the performance of these algorithms to determine if supervised classification outperformed unsupervised clustering and if so which algorithms are most effective. Different supervised algorithms can be Naïve Baye’s, K-NN, Logistic Regression, etc.

Pros and Cons of Hierarchical Clustering

The clustering technique can be very handy when it comes to unlabeled data. Since most of the data in the real-world are unlabeled and annotating the data has higher costs, clustering techniques can be used to label unlabeled data.

Pros:

• Dendrograms are easy for visualization
• Provides hierarchical relationships between the clusters
• Shown to be able to capture concentric clusters

Cons:

• Not easy to define levels for clusters
• There are clustering techniques which outperform hierarchical clustering
• Once a decision is made to combine two clusters, it cannot be undone
• Different schemes have problems with one or more of the following:

– Sensitivity to noise and outliers

– Difficulty handling different sized clusters and irregular shapes

– Breaking large clusters