 # Naive Bayes Classification

Naive Bayes is a supervised machine learning algorithm used for classification problems. It is based on one of the most popular theorems in probability, Bayes theorem. Naive Bayes handles binary as well as multiclass classification and is especially common in NLP tasks such as document classification, disease diagnosis, and so on.

Before starting this blog, I would highly recommend you take a look at Bayes theorem, as it is the backbone of this algorithm; I will also introduce it briefly here. Here is what we will cover:

1. Bayes theorem
2. Intuition
3. Assumptions of NB
4. Working of NB classifier
5. Maximum likelihood estimates
6. Laplace and Lidstone smoothing
7. Log probabilities
8. Different versions of NB
9. Case study on Iris flower classification
10. Python implementation using scikit-learn
11. Conclusion

1. Bayes theorem:

With Bayes theorem, we can find the probability of a class c given that the evidence x has already been observed:

P(c | x) = P(x | c) * P(c) / P(x)

Here P(c | x) is the posterior probability of the class given the evidence, P(x | c) is the likelihood of the evidence given the class, P(c) is the prior probability of the class, and P(x) is the probability of the evidence.

2. Intuition:

To get a high-level idea of what NB actually does, let's consider the following example.

Let's say we want to predict gender (male/female) based on 2 features, hair length and age, so each point lives in a 2-D feature space. Now a new query point (Xq) arrives with new values for both features, and we want to predict whether it is male or female. How do we do that using Bayes theorem?

Here we calculate the posterior probability for each class and assign the new point (Xq) to the class for which the posterior probability is maximum.
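To make this concrete, here is a small sketch of the decision rule for the gender example. The priors and likelihoods below are made-up numbers purely for illustration:

```python
# Made-up numbers: priors P(c) and likelihoods P(Xq | c) for each class.
priors = {"male": 0.5, "female": 0.5}
likelihoods = {"male": 0.02, "female": 0.08}  # assumed for illustration

# The posterior is proportional to likelihood * prior; the evidence P(Xq)
# is identical for both classes, so we can compare unnormalized scores.
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # female, since 0.08 * 0.5 > 0.02 * 0.5
```

Dropping the shared evidence term is what lets us compare the classes directly, as derived in the working section below.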

The decision rule, in simple words, is: assign Xq to the class Ci that maximizes the posterior probability P(Ci | Xq).

3. Assumptions of NB:

The adjective "naive" in Naive Bayes comes from the assumption that the features in the dataset are conditionally independent of each other given the class. Because of this assumption, the classifier is called naive.

This is the biggest assumption of NB: once the class is known, the presence or value of one particular feature does not affect any other feature, which rarely holds in real-world data.
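Under this assumption, the class-conditional likelihood factorizes into a product of per-feature likelihoods. A minimal sketch, with made-up per-feature probabilities:

```python
# Naive assumption: P(x | c) = product of P(x_i | c) over all features.
def class_likelihood(per_feature_likelihoods):
    """Combine per-feature likelihoods P(x_i | c) into P(x | c)."""
    result = 1.0
    for p in per_feature_likelihoods:
        result *= p
    return result

# e.g. P(hair_length | male) = 0.2 and P(age | male) = 0.5 (made-up values)
print(class_likelihood([0.2, 0.5]))  # 0.1
```

This factorization is exactly what makes the classifier cheap: we estimate one small table (or distribution) per feature instead of a joint distribution over all features.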

4. Working of NB classifier:

Let's say we have m-dimensional data (m features), and each point belongs to one of 'k' possible classes.

We can represent a point as X = (x1, x2, …, xm) with label Y ∈ {C1, C2, …, Ck}. Given a new Xq, we need to predict Yq, so we calculate the posterior probability for each class:

P(C1 | Xq), P(C2 | Xq), …, P(Ck | Xq)

These can be read as the probability that the point belongs to class 1 given the feature vector Xq, the probability that it belongs to class 2 given Xq, and so on.

We calculate this probability for all 'k' classes and classify the new point into the class with the maximum value.

Applying Bayes theorem to the above probabilities, we can drop the denominator when comparing them, because it is the same for every class. Thus, in general,

P(Ci | x) ∝ P(x ∩ Ci), where 1 ≤ i ≤ k

and since P(x ∩ Ci) = P(x | Ci) * P(Ci),

∴ P(Ci | x) ∝ P(x | Ci) * P(Ci)

If we applied the chain rule to calculate P(x | Ci) exactly, we would need a huge amount of data and time, since the number of parameter combinations grows exponentially with the number of features. Hence the NB classifier assumes that the features are conditionally independent given the class, which reduces the calculation to

P(x | Ci) = P(x1 | Ci) * P(x2 | Ci) * … * P(xm | Ci)

We calculate the individual probability for each feature and multiply them together to get the posterior probability.

5. Maximum likelihood estimates:

The main objective in the naive Bayes classifier is to maximize the posterior probability given the training data, i.e. pick the class Ci with the largest P(Ci | x), where 1 ≤ i ≤ k.

Let's say we have the following data for +ve and -ve points, and suppose we have a new instance X = BFP that we want to classify as +ve or -ve.

We calculate the posterior probability for the 2 classes by multiplying the per-feature likelihoods with the prior. If any single feature value of X never occurs with the +ve class in the training data, its likelihood is zero, and we can intuitively see that the posterior probability for that class becomes zero as well.

That means that if a new word is encountered at testing time that is not present in the training data, the entire probability becomes zero, which does not make any sense.

One possible workaround is to ignore all the factors with zero probability. But ignoring a factor is equivalent to treating it as 1 when it is actually 0, so ignoring such probabilities is not a good idea.
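The zero-probability problem is easy to demonstrate. In the sketch below the likelihood table is made up: feature "B" never occurred with the +ve class in training, so one zero wipes out the whole product:

```python
# Toy likelihood table for the +ve class (made-up counts);
# "B" was never seen with +ve in training, so its estimate is 0.
likelihood_pos = {"B": 0.0, "F": 0.6, "P": 0.4}
prior_pos = 0.5

posterior_score = prior_pos
for feature in ["B", "F", "P"]:  # the new instance X = BFP
    posterior_score *= likelihood_pos[feature]

print(posterior_score)  # 0.0 -- one unseen feature zeroes out the class
```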

Therefore, we introduce the concept of additive smoothing which means adding something extra to the calculated probabilities.

Suppose the likelihood of a particular feature value x1 given the +ve class is estimated from counts as

P(x1 | +ve) = count(x1, +ve) / count(+ve)

If x1 does not appear in the training data, the numerator is zero. With additive smoothing we instead use

P(x1 | +ve) = (count(x1, +ve) + α) / (count(+ve) + α * m)

where α > 0 is the parameter of additive smoothing (a hyperparameter) and m is the dimensionality of the feature vector X = {x1, x2, …, xm}.

The additive smoothing is applied to all the likelihood probabilities, irrespective of whether they are zero or not.

6. Laplace and Lidstone smoothing:

In additive smoothing, if α = 1 then it is called Laplace smoothing, i.e. add-one smoothing.

If α < 1 then it is called Lidstone smoothing.

Consider the likelihood probability P(x1 | +ve) = (w1 + α) / (n1 + α * m), and let n1 = 100, w1 = 2 and m = 2.

Case 1: If α = 0, there is no smoothing: 2 / 100 = 0.02

Case 2: If α = 1 (Laplace smoothing): (2 + 1) / (100 + 2) = 3 / 102 ≈ 0.029

Case 3: If α = 10: (2 + 10) / (100 + 20) = 12 / 120 = 0.1

Case 4: If α = 100: (2 + 100) / (100 + 200) = 102 / 300 = 0.34

Case 5: If α = 1000: (2 + 1000) / (100 + 2000) = 1002 / 2100 ≈ 0.477

If the numerator and denominator are small numbers, then the impact of increasing α on the likelihood probabilities is high. As α increases, the likelihood probabilities move towards approximately 1/2.

That means we are moving the likelihood probabilities towards a uniform distribution (1/m, which is 1/2 in our case).
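The five cases above can be reproduced in a few lines, using the counts from the text (w1 = 2, n1 = 100, m = 2):

```python
# Additive smoothing: P(x1 | class) = (w1 + alpha) / (n1 + alpha * m)
def smoothed_likelihood(w1, n1, m, alpha):
    return (w1 + alpha) / (n1 + alpha * m)

for alpha in [0, 1, 10, 100, 1000]:
    p = smoothed_likelihood(2, 100, 2, alpha)
    print(f"alpha={alpha}: {p:.3f}")
# As alpha grows, the likelihood approaches 1/m = 0.5 (uniform).
```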

7. Log probabilities:

The probabilities calculated above are products of many very small numbers, which can become too small to represent accurately in floating point. This leads to numerical stability issues such as numerical underflow.

To avoid this problem we use log probabilities: taking the logarithm turns the product of likelihoods into a sum, i.e. log(P(x1 | Ci) * … * P(xm | Ci) * P(Ci)) = Σ log P(xj | Ci) + log P(Ci).

Case 1: If α = 0 (High variance)

In this case, even a very rare word seen in training is associated with some probability, but a word never seen in the training phase gets a likelihood of exactly zero.

This means our model is overfitting, because a very small change in the training data produces a very large change in the model.

So this is a case of high variance.

Case 2: If α is a very large value (High bias)

As we have seen earlier, if α is very large then the likelihood probabilities ≈ 1/2.

If we have balanced data, the priors for each class are also the same. In such a case, the posterior probability for each class is approximately the same.

So this is a case of high bias, because the posterior probabilities come out nearly the same regardless of the input.
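Returning to the numerical motivation for log probabilities, here is a quick sketch showing that multiplying many small likelihoods underflows to 0.0 in floating point, while summing their logs stays well-behaved. The likelihood value 1e-3 is made up for illustration:

```python
import math

# 500 features, each with a small likelihood of 1e-3 (made-up value).
likelihoods = [1e-3] * 500

product = 1.0
for p in likelihoods:
    product *= p
print(product)  # 0.0 -- 1e-1500 is below the double-precision range

log_sum = sum(math.log(p) for p in likelihoods)
print(log_sum)  # about -3453.88 -- perfectly representable
```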

8. Different versions of NB:

• Bernoulli NB: This classifier is suitable for discrete data. Bernoulli NB is designed for binary/boolean features; the binary bag of words (BOW) is a simple example. It is normally used when the features follow a Bernoulli distribution.
• Multinomial NB: Like Bernoulli NB, this classifier is suitable for discrete features (e.g. word counts for text classification). Multinomial NB models a multinomial distribution, which normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
• Gaussian NB: When we are dealing with continuous data, Gaussian NB is useful. It assumes the features follow a Gaussian (normal) distribution.
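In scikit-learn, these three variants map directly onto `BernoulliNB`, `MultinomialNB`, and `GaussianNB`. A tiny sketch, with made-up data of the appropriate shape for each:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Binary/boolean features -> BernoulliNB
X_bin = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
BernoulliNB().fit(X_bin, y)

# Count features (e.g. word counts) -> MultinomialNB
X_cnt = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
MultinomialNB().fit(X_cnt, y)

# Continuous features -> GaussianNB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.5, 3.2]])
model = GaussianNB().fit(X_cont, y)
print(model.predict([[6.3, 3.0]]))  # class 1 -- closest to the class-1 points
```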

9. Case study on Iris flower classification:

The Iris data set is widely used as a beginner's dataset for machine learning. It is included in the machine learning package scikit-learn, so users can access it without having to find a source for it. The objective is to classify a new flower as belonging to one of the 3 classes (virginica, versicolor, setosa) given the 4 features: sepal length, sepal width, petal length, and petal width.

So let's get started with it.

10. Python implementation using scikit-learn:
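A minimal implementation might look like the following, using `GaussianNB` since the four Iris features are continuous. The 70/30 split and `random_state` are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Iris dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit Gaussian Naive Bayes and evaluate on the held-out set.
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```

Gaussian NB typically scores well above 90% accuracy on this dataset, since the three species are largely separable by petal measurements.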
11. Conclusion:

I hope you got the intuition behind naive Bayes classifiers and how they actually work. I encourage you to implement the case study yourself to get a better understanding of NB.

Thanks for reading! If you liked the blog then let me know your thoughts in the comment section.