K Means Clustering for Machine Learning
Introduction to clustering
Before diving deep into the k-means clustering algorithm, would like to give you the brief idea of Clustering. Clustering can be supervised and unsupervised, but here will discuss unsupervised. In unsupervised learning what we are doing is forming clusters without labels. We define clusters by its shape and spread area of it.
Generally, we have two types of a clustering algorithm.
1. Central Clustering
k-means is example of this type of clustering.
2. Hierarchical Clustering
Divise and agglomerative.
Here, will discuss k-means only.
Introduction to K-means
InK-means clustering, we will assign observation Xi to cluster K if score under cluster K is higher than the other clusters.
An algorithm of K-means clustering
Step 1: Initialize cluster centers randomly (μ1, μ2,….μk).
Step 2: Now each observation will be assigned to the closest cluster. How we will check that it is the closest cluster?
. Suppose there are data points x1, x2,…. xn. We have to assign each data point xi to cluster c and for ith observation, we will write the cluster as ci.
Here min returns the minimum distance from the cluster center among all the cluster center and arg will return the index of that cluster center.
Step 3: Now, re-assignment of cluster centers will take place by means of assigned observation.
Here summation goes from i:ci=j means summation for all the points belong to that cluster j and nj represents the no. of observation belongs to cluster j.
Step 4: Repeat step 1 and 2 until the converged solution will be found.
We will choose local optima through co-ordinate descent algorithm.
As we are choosing the local optima as our convergence criteria then we will get different clusters some of which are not good enough which might depend on our initialization of centers. So, we will look for the smart initialization of cluster centers next.
k++ means(Smart Initialization of clusters)
Step 1: First cluster center will be chosen randomly from the data points.
Step 2: We will repeat steps 3 and 4 until k clusters will be formed.
Step 3: We will compute the distance of all the data points from their nearest clusters.
Step 4: Now, for choosing new cluster centers we will choose the center among the data points only but will choose the data point which is farthest from its cluster center and the probability of distance will be proportional to (dist)2.
Let’s check all these steps with some visualization.
How to choose K (no. of clusters)?
Our k-means clustering objective is to make tighter clusters or you can say more homogenous. So, if we have k=n then our model will be more homogenous but it will not describe our data in a good way. Hence, will choose k having lower heterogeneity but not 0.
Implementation in Python
K-means clustering using sklearn
from sklearn.cluster import KMeans import numpy as np X = np.array([[2,1], [4,3], [1, 2], [3, 2], [4, 5], [4, 1]]) ## with clusters 2 kmeans = KMeans(n_clusters=2, random_state=0).fit(X) ## lables to the input variables of clusters kmeans.labels_
array([1, 0, 1, 1, 0, 1])
## Prediction for test data kmeans.predict([[0, 0], [4, 4]])
## Checking the co-ordinates of cluster centers kmeans.cluster_centers_
array([[4. , 4. ], [2.5, 1.5]])