K-means Clustering

Overview

  1. Introduction
  2. Hierarchical and Partition Clustering
  3. What k-means Clustering is?
  4. Algorithm of k-means Clustering
  5. Convergence Criteria
  6. Smart Initialization
  7. How to choose no. of Clusters?
  8. Implementation with sci-kit learn
  9. Advantages and Disadvantages
  10. Summary

Introduction

Before diving deep into the k-means clustering algorithm, would like to give you a brief idea of Clustering. Clustering can be supervised and unsupervised, but here will discuss unsupervised. In unsupervised learning what we are doing is forming clusters without labels. We define clusters by its shape and spread area of it.
Generally, we have two types of a clustering algorithm.
1. Central Clustering
k-means is an example of this type of clustering.
2. Hierarchical Clustering
Divise and agglomerative.
Here, will discuss k-means only.

K-means is an unsupervised learning algorithm to use to solve the clustering problems. A clustering problem is a way of grouping a set of objects in such a way that all the objects with similar properties form a group or cluster. Here objects within a group are more similar than the objects in different groups. There are basically two types of clustering- Hierarchical and Partition Clustering.

Hierarchical and Partition Clustering

In case of hierarchical clustering, two most similar clusters keep combining until all the similar objects are in the same clusters. It generally produces a tree that shows the hierarchy of the cluster. While a partition clustering is a simple clustering process in which we divide a set of the object into the number of non-overlapping clusters.

Hierarchical Clustering is basically of two types:

  • Agglomerative or bottom-up algorithm where we take each unit as a separate cluster and keep combining them to form a larger cluster.
  • Divisive or top-down algorithm where we start with a complete set and then keep dividing them into different smaller clusters successively.

In the case of the partition clustering algorithm, we divide all the clusters at once.  K-means algorithm is a form of partition clustering case.

What K-means clustering is?

K-means clustering is a centroid model in which the number of clusters you required at the end must be mentioned at the very start. InK-means clustering, we will assign observation Xi to cluster K if score under cluster K is higher than the other clusters.

 

Different steps involved in the k-means algorithm are-

  • First thing is to select k data points randomly from the available data as our initial centroids. A centroid is a centre of your cluster and since at start, we have no idea about clusters so we are actually selecting them randomly.
  • Now all the data points that are closer to our centroid give us a cluster. Suppose we are using Euclidean distance as our metric, therefore, a straight line must be drawn between two centroids. After that, a perpendicular bisector divides the data set into clusters.
  • So now we have the new set of clusters, therefore, we also need centroid for them. In this case, the centroid is going to be the mean of all units in our cluster.

We keep repeating the last two steps until the time when our centroids stop moving.

Algorithm of K-means clustering

Step 1: Initialize cluster centres randomly (μ1, μ2,….μk).
Step 2: Now each observation will be assigned to the closest cluster. How we will check that it is the closest cluster?
. Suppose there are data points x1, x2,…. xn. We have to assign each data point xi to cluster c and for ith observation, we will write the cluster as ci.

Here min returns the minimum distance from the cluster centre among all the cluster centre and arg will return the index of that cluster centre.
Step 3: Now, re-assignment of cluster centres will take place by means of assigned observation.

Here summation goes from i:ci=j means summation for all the points belong to that cluster j and nj represents the no. of observation belongs to cluster j.
Step 4: Repeat step 1 and 2 until the converged solution will be found.

Convergence Criteria

We will choose local optima through co-ordinate descent algorithm.
As we are choosing the local optima as our convergence criteria then we will get different clusters some of which are not good enough which might depend on our initialization of centres. So, we will look for the smart initialization of cluster centres next.

##Formula Remaining for Convergence

k++ means(Smart Initialization of clusters)

Step 1: First cluster centre will be chosen randomly from the data points.
Step 2: We will repeat steps 3 and 4 until k clusters will be formed.
Step 3: We will compute the distance of all the data points from their nearest clusters.
Step 4: Now, for choosing new cluster centres we will choose the centre among the data points only but will choose the data point which is farthest from its cluster centre and the probability of distance will be proportional to (dist)2.
Let’s check all these steps with some visualization.

How to choose K (no. of clusters)?

Our k-means clustering objective is to make tighter clusters or you can say more homogenous. So, if we have k=n then our model will be more homogenous but it will not describe our data in a good way. Hence, will choose k having lower heterogeneity but not 0.

Implementation in Python

K-means clustering using sklearn

In [8]:
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[2,1], [4,3], [1, 2],
              [3, 2], [4, 5], [4, 1]])
## with clusters 2
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
## lables to the input variables of clusters
kmeans.labels_
Out[8]:
array([1, 0, 1, 1, 0, 1])
In [9]:
## Prediction for test data
kmeans.predict([[0, 0], [4, 4]])
Out[9]:
array([1, 0])
In [10]:
## Checking the co-ordinates of cluster centers
kmeans.cluster_centers_
Out[10]:
array([[4. , 4. ],
       [2.5, 1.5]])

 

Advantages and Disadvantages of k-means

K-means is fast, robust and easier to understand which is relatively more efficient. If our data set is such that all the units are distinct and well separated from each other. If it holds some advantage it also holds disadvantages too. Like, here we already require the value of k even before the start of the process. I suppose our data set is highly overlapping then it may be difficult for K-means to form clusters. One of the major drawbacks of the process is its dependence of mean. If mean is not defined i.e. in categorical case we can’t apply the k-means algorithm.

Summary

In this blog, we have seen a basic introduction to unsupervised clustering algorithm and also the details of the k-means algorithm. Also with the implementation in scikit-learn.

 

 

 

Leave a Comment

Your email address will not be published. Required fields are marked *