K – Nearest Neighbor (k-NN)
The k-Nearest Neighbours algorithm (k-NN) is a method used for classification and regression in pattern recognition or predictive problems.
This algorithm is mainly used for classification purposes. This is because most of the work in the industries is classification based- we need to make a decision whether a given customer is a good credit risk, or a customer is suitable target for a particular set of advertisements, etc.
k-NN is a type of learning, where the function is only approximated locally and all computation is delayed till classification. The k-NN algorithm is the simplest of all algorithms.
Basic Idea Behind k-NN
The basic idea behind k-NN algorithm is that it classifies the given data point according to the type of the data points which are the neighbours of the given data point.
Consider the graph shown below, it has some points depicted as red rhombus and some as green circles. We have to predict the class of the blue star according to the graph.The blue star can either be a red rhombus or a green circle, nothing else. Let k=3, where k is the value of the number of neighbour points which we consider for the prediction of the class of the blue star.The circle encloses the three red rhombus, which signifies that the blue star belongs to the category of red rhombus.
How to Choose the Value of k?
Choosing the value of k in this algorithm plays a very crucial role. It eventually decides the accuracy of the model to predict the class of the test data. The selection of the value of k depends upon the data set.
But generally larger values of k reduce the effect of noise in the data, but at the same time it makes the boundaries smooth, making it difficult to see the class boundaries distinctly.
If we choose a big value for k everything is classified under the most probable class, and if we choose a small value of k it is highly unstable and leads to unstable decision boundaries. In case of small k value if there are small changes to the data set, the classification changes.
When the value of k=1, it is the most accurate prediction. The reason behind it is that when k=1 it shows a single data point i.e. the given point itself. Because the nearest point to a given point is itself.
Generally if you are using k-NN algorithm and you have an even number of classes it is a good idea to choose a K value with an odd number to avoid a tie. And the inverse, use an even number for K when you have an odd number of classes.
Making Predictions with k-NN
we use the training set directly for making predictions in k-NN.
Class type is predicted for the new data point/instance (x) by penetrating through the whole training set for the K most comparable instances (the neighbours). Each instance votes for their type and the type with maximum votes is taken as the predicted value of data point x. If the k-NN algorithm is being used for classification the output variable is the most common class value otherwise in case of regression this might be the mean output variable.
We measure the distance to determine which of the K instances in the training dataset are most similar to a new input.. For real-valued input variables, the most popular distance measure is Euclidean distance.
Euclidean distance is the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across all input attributes j.
Euclidean Distance(x, xi) = sqrt( sum( (xj – xij)^2 ) )
Curse of Dimensionality
KNN works well with a small number of input variables (p), but struggles when the number of inputs is very large.
Each input variable is a dimension of a p-dimensional input space. For example, if you had two input variables x1 and x2, the input space would be 2-dimensional.
As the number of dimensions increases the volume of the input space increases at an exponential rate.
In high dimensions, points that may be similar may have very large distances. All points will be far away from each other and our intuition for distances in simple 2 and 3-dimensional spaces breaks down.
Implementing k-NN using Scikit Learn
First, we will import all the libraries we need as:
|import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
We will use the famous Iris dataset for our classification and load it into the Pandas dataframe.
|url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
# Assign colum names to the dataset
names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘Class’]
# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)
Now, we need to split the dataset into its attributes and labels.
|X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
In order to avoid over-fitting we split our dataset into test set and training set.
|from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
Here, our training set is 80% of the original dataset and our test set is the remaining 20%
Now, we need to scale the features of our dataset. It is important to scale the raw data for the proper functioning of algorithms.
|from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
We will now train out k-NN model and predict the values
|from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
# Predicting the class
y_pred = classifier.predict(X_test)
Advantages and Disadvantages of k-NN
- Simple to implement
- Can do well in practice with enough representative data
- Large search problem to find nearest neighbours
- Storage of data
- Must know we have a meaningful distance function