Logistic Regression

When we face a classification problem in Machine Learning there are many algorithms to choose from, but an ML developer often starts with one of the oldest algorithms in the field, Logistic Regression. I hope you have mastered linear regression by now. In this post, I'll walk you through logistic regression. It is a very easy algorithm to use once you understand how it actually works.

I have tried to explain these concepts in the simplest way possible.

Table of contents

  1. Introduction
  2. Intuition
  3. Assumptions and task in logistic regression
  4. The math behind logistic regression
  5. Overfitting and underfitting
  6. How to avoid overfitting using regularization
  7. Bias-variance tradeoff
  8. Multicollinearity in logistic regression
  9. Case Study on Employee Churn Dataset
  10. Conclusion

1. Introduction

Logistic Regression is one of the most popular classification algorithms. Although its name contains "regression", it is used only for classification. In its basic form it handles binary classification problems like yes/no, 0/1, true/false, etc. The techniques used in logistic regression are similar to those of linear regression, with a few changes. The idea in logistic regression is to find the relationship between the input features (X) and the probability of the outcome (y).

2. Intuition

Before going into the detailed mathematical background of logistic regression, let's try to understand what a classification problem means and, at a high level, how logistic regression helps to solve it.

Let's say we want to predict whether a person is male or female given their hair length and height. If we plot the data in the 2-D plane it will look like the following.

Now if I ask you to draw a line that separates the two genders from each other, you could intuitively draw it as follows:

Some points are wrongly classified, but the line still does a pretty good job.

The general equation of a line can be written as

a + bX1 + cX2 = 0

In Machine Learning we replace the constants a, b, and c by the weights w0, w1, and w2 (the components of the weight vector W). So we rewrite the above equation as follows:

W0 + W1X1 + W2X2 = 0

W1 and W2 are the weights corresponding to the hair length (X1) and height (X2) of the person. These values also tell us the importance of each feature when classifying a particular point. In our case, ideally, the hair length is more important than the height of the person for predicting gender, so the magnitude of W1 should be greater than that of W2. I hope you now have some idea of classification using logistic regression.
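
To make this concrete, here is a tiny Python sketch of how such a line classifies a point by the sign of W0 + W1X1 + W2X2. The weight values below are made up just for illustration; they are not learned from any data.

# Minimal sketch: classify gender from (hair_length, height) using a linear boundary.
# The weights are illustrative guesses, not values learned from real data.

def classify(hair_length_cm, height_cm, w0=-1.0, w1=0.08, w2=-0.004):
    # Signed score of the point with respect to the line w0 + w1*X1 + w2*X2 = 0
    score = w0 + w1 * hair_length_cm + w2 * height_cm
    return "female" if score > 0 else "male"

print(classify(hair_length_cm=40, height_cm=160))  # "female" with these toy weights
print(classify(hair_length_cm=5, height_cm=180))   # "male" with these toy weights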

3. Assumptions and task in logistic regression

Logistic regression assumes that there is no multicollinearity among the independent variables. It also assumes that the data is linearly separable (i.e., it can be separated by a line or a plane), which is rare in real-world problems.

The task in logistic regression is to find the best decision boundary that separates the 2 classes from each other. For 2-D data the decision boundary is a line (the colored line in our example above). For 3-D data it is a plane, and in higher dimensions it is called a hyperplane.

4. The math behind logistic regression

The gender-prediction example we have seen above has only 2 features, i.e., the hair length and the height of the person. But real-world problems can have m features. Let's generalize our model to higher-dimensional data.

Let's say we have features X1, X2, ..., Xm and corresponding weights W1, W2, ..., Wm. The equation of the hyperplane then becomes:

W0 + W1X1 + W2X2 + W3X3 + ... + WmXm = 0

For the mathematical calculations, we make a slight change to the above equation:

W0X0 + W1X1 + W2X2 + W3X3 + ... + WmXm = 0

where X0 = 1.

X and W are both vectors, so we can rewrite the above compactly as WT * X = 0.

The task of the model is to find the optimal values of the weights W0, W1, ..., Wm. It does so using the Gradient Descent algorithm, which is beyond the scope of this article.
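
To see what the vector form looks like in code, here is a minimal NumPy sketch (again with made-up numbers) that evaluates WT * X for a single point with X0 = 1 prepended; the sign of the result tells us on which side of the hyperplane the point lies.

import numpy as np

# Illustrative weights and one data point with X0 = 1 prepended (bias term).
W = np.array([-1.0, 0.08, -0.004])      # [W0, W1, W2]
X = np.array([1.0, 40.0, 160.0])        # [X0 = 1, hair length, height]

score = np.dot(W, X)                    # W^T X
print(score, "-> positive side" if score > 0 else "-> negative side")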

Let's consider the following figure, with the decision boundary (a line) separating the 2 classes.

The signed distance from the line to a query point Xi is given as follows:

di = (WT * Xi) / ||W||

where ||W|| is the length (norm) of the weight vector W. If we take W to be a unit vector, the distance is simply WT * Xi.

If we want to express the binary classification problem in terms of set theory, then the class labels are yi ∈ {-1, +1}.

Depending on the value of WT * Xi for a point, we have the following 2 cases:

  1. If WT * Xi > 0, the point lies on the positive side of the hyperplane and is predicted as +1.
  2. If WT * Xi < 0, the point lies on the negative side of the hyperplane and is predicted as -1.

The correctness of the classification of a point in the test set is determined by the following quantity, where we multiply the actual class label with the raw prediction:

Zi = yi * (WT * Xi)

Depending on the value of yi we have 2 possible cases.

Case 1: If yi = +1

If Zi = yi * (WT * Xi) > 0, then WT * Xi is positive while yi = +1, which means the point is classified correctly because it actually belongs to the positive class.

If Zi = yi * (WT * Xi) < 0, then WT * Xi is negative while yi = +1, which means the point is classified incorrectly because it actually belongs to the positive class.

Case 2: If yi = -1

If Zi = yi * (WT * Xi) > 0, then WT * Xi is negative while yi = -1, which means the point is classified correctly because it actually belongs to the negative class.

If Zi = yi * (WT * Xi) < 0, then WT * Xi is positive while yi = -1, which means the point is classified incorrectly because it actually belongs to the negative class.

 

The above 2 cases might look confusing, so let me summarize them briefly in a way that is easy to understand.

We can say that if Zi > 0 then the point is correctly classified and if Zi < 0 then the point is incorrectly classified.

A good model should reduce misclassification and maximize correct classification. So our task in logistic regression is to find the hyperplane (in higher dimensions) such that Zi is greater than zero for as many points as possible.

This can be mathematically represented as

W* = argmax over W of Σi yi * (WT * Xi), where the sum runs over all training points.
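
The Zi > 0 criterion is easy to check in code. The sketch below, on a tiny made-up dataset, counts how many points a candidate W classifies correctly by testing whether Zi = yi * WT * Xi is positive.

import numpy as np

# Toy data: rows are [X0 = 1, X1, X2]; labels are +1 or -1. Purely illustrative values.
X = np.array([[1.0,  2.0,  1.5],
              [1.0, -1.0, -0.5],
              [1.0,  0.5,  2.0],
              [1.0, -2.0, -1.0]])
y = np.array([+1, -1, +1, -1])

W = np.array([0.0, 1.0, 1.0])            # a candidate weight vector (illustrative)

Z = y * (X @ W)                          # Zi = yi * W^T Xi for every point
print("correctly classified:", np.sum(Z > 0), "out of", len(y))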

The Concept of Squashing

A squashing function in mathematics is used to compress or squash values into a finite interval. As Zi goes from -∞ to +∞, the squashed value of Zi, i.e., f(Zi), goes from A to B as shown below.

Sigmoid function as squashing in logistic regression

The value of the term yi * WT * Xi can be anywhere from -∞ to +∞, but we want the model to predict values in terms of probability, which cannot be less than 0 or greater than 1.

The sigmoid function, σ(z) = 1 / (1 + e^(-z)), helps us achieve this. It is also differentiable, which allows us to apply popular optimization techniques such as gradient descent.
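
A quick sketch of the sigmoid makes the squashing behaviour visible: very negative scores map close to 0, very positive scores map close to 1, and a score of 0 maps to exactly 0.5.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

for z in [-100, -2, 0, 2, 100]:
    print(z, "->", sigmoid(z))
# -100 -> ~0.0, 0 -> 0.5, 100 -> ~1.0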

Again, we can write our 2 cases above in terms of probability as:

If WT*Xi > 0

Then σ(WT * Xi) lies between 0.5 and 1

Else if WT*Xi < 0

Then σ(WT * Xi) lies between 0 and 0.5

Else (i.e., WT * Xi = 0)

σ(WT * Xi) = 0.5

 

So because of the sigmoid function, our equation changes to

W* = argmax over W of Σi σ(yi * WT * Xi)

We have simply applied the sigmoid function σ inside the previous equation.

Now, if you remember your junior college math, you must have studied a theorem about functions: if g(x) is a monotonically increasing function, then argmax f(x) = argmax g(f(x)), where f(x) can be any function. A monotonic function is a function that is either entirely non-increasing or non-decreasing; a function is monotonic if its first derivative does not change sign. We also know that the log function is monotonically increasing. So by using this theorem, we change our equation as follows:

W* = argmax over W of Σi log σ(yi * WT * Xi)

W* = argmax over W of Σi -log(1 + exp(-yi * WT * Xi)) ……….. (since σ(z) = 1 / (1 + e^(-z)) and log 1/x = -log x)

W* = argmin over W of Σi log(1 + exp(-yi * WT * Xi)) ……….. (since max{f(x)} = min{-f(x)})
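
The final objective is exactly the logistic (log) loss. The short NumPy sketch below evaluates it on a made-up dataset and also checks that log(1 + exp(-Zi)) really equals -log σ(Zi), as used in the derivation above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(W, X, y):
    # Sum over points of log(1 + exp(-yi * W^T Xi)); this is what we minimize.
    Z = y * (X @ W)
    return np.sum(np.log(1.0 + np.exp(-Z)))

# Toy data (illustrative): rows are [X0 = 1, X1, X2]; labels are +1 or -1.
X = np.array([[1.0, 2.0, 1.5], [1.0, -1.0, -0.5], [1.0, 0.5, 2.0], [1.0, -2.0, -1.0]])
y = np.array([+1, -1, +1, -1])
W = np.array([0.0, 1.0, 1.0])

Z = y * (X @ W)
print(logistic_loss(W, X, y))
print(np.allclose(np.log(1.0 + np.exp(-Z)), -np.log(sigmoid(Z))))  # True: both forms agree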

Non-linear decision boundaries in logistic regression

We have 2-D data with features X1 and X2 and if we plot data in 2-D space it looks like the following:

Note that if I ask you the same question, to draw a line that separates the 2 classes from each other, it is quite impossible in this case. Here we require a more complex shape in order to separate the 2 classes. The equation of a simple straight line is

W0 + W1X1 + W2X2 = 0

If we add some polynomial features of degree 2, the equation becomes

W0 + W1X1 + W2X2 + W3X1² + W4X2² + W5X1X2 = 0

Let's assume that the weight vector is W = [-1, 0, 0, 1, 1, 0]. Then the equation becomes

-1 + X1² + X2² = 0, i.e., X1² + X2² = 1

Clearly, this is the equation of a circle with unit radius, and if we draw it in the above figure the data becomes easily separable.

So the decision boundary in this case becomes a circle and not a straight line.

Therefore, from the above discussion, we can conclude that data that is not separable in lower dimensions can become separable in higher dimensions, i.e., if we add (polynomial) features to it.
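
Here is a small sketch of the same idea using scikit-learn (assuming it is installed): on synthetic "circle" data, plain logistic regression on the raw features can do no better than the majority class, while adding degree-2 polynomial features lets it learn a circular boundary.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic "circle" data: class 1 inside the unit circle, class 0 outside.
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Logistic regression on the raw features vs. on degree-2 polynomial features.
linear_model = LogisticRegression().fit(X, y)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("raw features accuracy:     ", linear_model.score(X, y))  # about the same as always predicting the majority class
print("degree-2 features accuracy:", poly_model.score(X, y))    # much higher; the learned boundary is (approximately) a circle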

5. Overfitting and underfitting

Overfitting occurs when the model learns every point of the training data, including its noise. Such a model performs very well on the training data but poorly on completely unseen data. As we saw, adding features helps logistic regression fit the data, but if we add too many (for example, high-degree polynomial) features, the model overfits and the values of the weights W tend to become large. Such a hypothesis has high variance.

Underfitting is exactly the opposite of overfitting: the model does not perform well on the training data or on unseen data. An underfit model has poor performance on both the training and the test data.

We need something that lies between overfitting and underfitting; it is often called a good fit or optimal fit. The figure below shows all 3 cases.
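
In practice, a simple way to see where a model sits on this spectrum is to compare its accuracy on the training set with its accuracy on held-out data: low accuracy on both suggests underfitting, while a large gap between the two suggests overfitting. A minimal sketch along these lines, again with scikit-learn and synthetic data, could look like this.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic circular data with some label noise, so a very flexible model can overfit.
rng = np.random.RandomState(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)
flip = rng.rand(len(y)) < 0.15                      # flip 15% of the labels
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LogisticRegression(max_iter=10000))
    model.fit(X_train, y_train)
    print(degree,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
# Typically: degree 1 underfits (mediocre on both sets), degree 2 is a good fit,
# and a very high degree tends to score noticeably higher on train than on test.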

6. How to avoid overfitting using regularization

We do not want the model to memorize the training data; instead, we want it to generalize well to completely unseen data. The regularization term penalizes large weights (coefficients).

Intuitively, regularization adds a penalty against complex models. We don't want the model to learn every small peculiarity of the training data, and regularization helps to achieve this.

The optimization function without regularization is

W* = argmin over W of Σi log(1 + exp(-yi * WT * Xi))

If we add a regularization term (here, the common L2 penalty), it becomes

W* = argmin over W of [ Σi log(1 + exp(-yi * WT * Xi)) + λ * WT * W ]

where lambda (λ) is the regularization parameter.

So this is a kind of tug of war: the 1st term tries to minimize the training error, which can push the weights W towards very large values (even towards infinity), while the 2nd term does not allow W to grow without bound.

7. Bias-variance tradeoff

The effect of the regularization term depends on the value of λ, which we provide to the model. In other words, λ controls the amount of penalty we apply.

  1. When λ = 0

This is the same as completely removing the regularization term, so the model behaves exactly as it would without regularization (and may overfit).

  2. When λ = ∞ or very high

The impact of the regularization term becomes very high, due to which the model underfits. Here we penalize the weights so heavily that they shrink towards zero.

So the idea is to find the right value of λ that balances both effects and works well for the model.

If you are implementing logistic regression with regularization in a library such as scikit-learn, there is another parameter C, which is the inverse of λ, i.e., C = 1 / λ, so the effect of C is reversed: a small C means strong regularization and a large C means weak regularization.
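
As an example of this, the sketch below (using scikit-learn's LogisticRegression on synthetic data) sweeps C and prints the average magnitude of the learned weights.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for C in (0.001, 1.0, 1000.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print("C =", C, "-> mean |weight| =", round(float(np.mean(np.abs(model.coef_))), 3))
# A smaller C (i.e., a larger lambda) shrinks the weights towards zero;
# a larger C lets them grow, weakening the regularization.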

8. Multicollinearity in logistic regression

Multicollinearity occurs in the data when the input features are highly correlated with each other. A correlation heatmap is often used to inspect the correlation between each pair of features. The absolute value of the correlation lies between 0 and 1: 0 indicates that two features are not correlated at all, and 1 indicates that they are perfectly correlated.

Multicollinearity is not uncommon when there are many covariates in the model, and moderate multicollinearity may not be problematic. However, if two independent variables X1 and X2 are highly correlated, we should keep only one of them, because both variables carry essentially the same information; this redundancy can lead to unreliable coefficient estimates in the model.
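
A common way to check this in practice is to look at the pairwise correlation matrix of the features, for example with pandas (and seaborn for the heatmap, if you want a plot). The tiny DataFrame below uses placeholder column names purely for illustration.

import pandas as pd

# df is assumed to be a DataFrame of numeric input features (placeholder columns and values).
df = pd.DataFrame({
    "age":            [25, 32, 47, 51, 38],
    "years_at_firm":  [1, 5, 20, 25, 10],
    "monthly_income": [3000, 4500, 9000, 9800, 6000],
})

corr = df.corr().abs()          # absolute pairwise correlations, values in [0, 1]
print(corr.round(2))

# Optional: visualize as a heatmap (requires seaborn and matplotlib).
# import seaborn as sns, matplotlib.pyplot as plt
# sns.heatmap(corr, annot=True); plt.show()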

Enough Maths!!

I hope you now have a very good understanding of the math behind logistic regression. So let's step out of the theoretical part and get some real-world, hands-on experience with logistic regression.

Employee churn is a costly problem for a company because the cost of replacing an employee is high. Understanding why and when employees are most likely to leave can help the company take actions to improve employee retention, and possibly plan new hiring in advance, which benefits the company a lot in terms of profit.

So the HR department of the company has hired us as data science consultants. They want to retain employees, and hire new ones, with a more proactive approach.

We are given data on employees along with an Attrition label (Yes or No). In this study, our target variable y indicates whether an employee leaves the company, and the model predicts the probability of leaving.

9. Case Study on Employee Churn Dataset
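
As a starting point for the case study, here is a minimal sketch of the workflow. The file name employee_churn.csv and the column name Attrition (with Yes/No values) are assumptions about the dataset, so adapt them to your copy of the data.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed file and column names; adjust them to the actual employee churn dataset.
df = pd.read_csv("employee_churn.csv")
y = (df["Attrition"] == "Yes").astype(int)           # 1 = employee left, 0 = stayed
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]             # predicted probability of leaving
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, proba))

From here we would inspect the learned coefficients to see which features push the probability of attrition up or down, exactly as discussed in the sections above.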

10. Conclusion

In this post, we learned the concepts of logistic regression as well as the math behind it. We also applied logistic regression to a real-life dataset. I hope you are now able to relate the concepts we covered to the case study.

Thanks for reading! If you liked the blog then share it with your friends.


If you loved this post, join our Telegram Community and follow us on Instagram

 
