Logistic Regression for Machine Learning
- Applications of Logistic Regression
- When to use Logistic Regression?
- Decision Boundary and Cost Function
- Weight Update: Choosing the Best Parameters
- Multiclass Classification and Regularization
- Implement our model with scikit-learn
Whenever we are dealing with a classification problem, say a binary classification problem to be more precise, one method we could try is linear regression: map all predicted values less than 0.5 to 0 and all values greater than that to 1. However, there is one problem: classification is not a linear function.
Think about it: here the target variable takes a discrete set of values. When the target variable takes only two values, 0 or 1, the problem is termed a binary classification problem. An effective way to deal with such a problem is Logistic Regression.
Our usual linear regression hypothesis function takes values on the whole real line, while we want a function f(x) such that 0 ≤ f(x) ≤ 1.
This can be done by inserting z = w0 + w1*x1 + ... + wn*xn into the logistic function, also known as the sigmoid function:
f(x) = 1 / (1 + e^(-z))
This function maps any real number to the [0, 1] interval, making it useful for transforming an arbitrary-valued function into one suited for classification. An alternative view of f(x) is that it represents a probability, P(y = 1 | x; w). Logistic regression is a probabilistic statistical classification model used to model a binary response variable.
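The sigmoid mapping just described can be sketched in a few lines of Python (a minimal illustration, not code from the original post):

```python
import math

def sigmoid(z):
    """Map any real number z to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

# The sigmoid squashes large negative inputs toward 0
# and large positive inputs toward 1.
print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```

Any weighted sum z, however large or small, lands strictly between 0 and 1, which is what lets us read the output as a probability.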
Applications of Logistic Regression
When to Use Logistic Over Linear Regression?
You might be wondering why we can't use linear regression for a classification problem. The example described below makes it clearer. Two reasons for not using linear regression for classification are:
1. Categorical data: linear regression does not work well on categorical targets.
2. Continuous output: it gives an arbitrary continuous value for a new input rather than a probability.
The example below follows the one Andrew Ng uses in his Machine Learning course on Coursera.
As Andrew Ng explains it, with linear regression you fit a polynomial through the data; say, as in the example below, we fit a straight line through a dataset of tumours to predict the type of tumour on the basis of its size.
Here, malignant tumours are labelled 1 and non-malignant tumours 0, and the green line in the diagram is our classifying line, or in Andrew Ng's language, the hypothesis line h(x). To make predictions, we can say that for any given tumour size x, if h(x) exceeds 0.5 (the chosen threshold) we predict a malignant tumour; otherwise, we predict benign.
It looks as if this way we could correctly predict every single training sample, but now let's change the data points a little.
Intuitively it's clear that all tumours larger than a certain threshold are malignant. So let's add another sample with a huge tumour size and run linear regression again.
Now the rule "h(x) > 0.5 means malignant" doesn't work anymore. To keep making correct predictions we would need to change our hypothesis.
But we cannot change our hypothesis every time the data points shift a bit.
In this particular example, you would instead use logistic regression, which might learn a hypothesis like the one shown, just for the sake of understanding.
Decision Boundary and Cost Function
A decision boundary is any curve that separates the region where y = 0 from the region where y = 1. To get a 0 or 1 classification, we threshold the output of the hypothesis function: predict 1 when f(x) ≥ 0.5 and 0 otherwise.
Since logistic regression has a different hypothesis function, we cannot use the same cost function as linear regression. If we did, the logistic function would make the output wavy; in short, the cost function would not be convex. The cost function used for logistic regression is the log loss (cross-entropy).
The terms are as follows:
- J: cost function
- m: number of training examples
- y: actual output
- f(x): predicted output from the logistic regression equation
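Written out with these terms (a reconstruction of the standard log-loss formula, not reproduced from the original figure):

```latex
J(w) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log f\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - f\big(x^{(i)}\big)\big) \Big]
```

Each term penalizes a confident wrong prediction heavily and a confident correct one hardly at all, and the sum is convex in the weights.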
The weights are updated using gradient descent in order to minimize the cost function:
w = w - alpha * derivative(J)
These terms are as follows:
- w: weight associated with each attribute
- alpha: learning rate parameter
- derivative(J): derivative of the cost function J with respect to w
In this case, our objective is to find the values of the weights w for which the cost is minimum.
z = weighted sum of the inputs (w0 + w1*x1 + w2*x2 + ...)
Let's see how this sigmoid function is used to calculate the probability.
Suppose our initial weights are w0 = 0, w1 = 1, w2 = -2. We will update these weights later (using the cost function, explained below), since the weight matrix is what determines the accuracy of our model.
Here is how the probability is calculated using the sigmoid.
For data point 1 (x(1)), with x1 = 2, x2 = 1, the probability is calculated as:
Step 1: z(x(1)) = 0 + 1*2 - 2*1 = 0
Step 2: probability(x(1)) = 1/(1 + e^(-0)) = 0.5
For data point 2 (x(2)), with x1 = 0, x2 = 2, the probability is calculated as:
Step 1: z(x(2)) = 0 + 1*0 - 2*2 = -4
Step 2: probability(x(2)) = 1/(1 + e^(-(-4))) ≈ 0.02
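The two worked examples above can be checked with a short script (the weights w0 = 0, w1 = 1, w2 = -2 are the ones used in the steps above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probability(x1, x2, w0=0.0, w1=1.0, w2=-2.0):
    """Step 1: weighted sum z, Step 2: sigmoid of z."""
    z = w0 + w1 * x1 + w2 * x2
    return sigmoid(z)

print(probability(2, 1))  # data point 1: z = 0, probability = 0.5
print(probability(0, 2))  # data point 2: z = -4, probability ≈ 0.018
```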
Next, we apply a threshold to these probabilities to classify each input for binary classification.
Suppose we have set threshold = 0.5:
if (p > 0.5) x(i) belongs to class 1
else to class 0
Hence, in this way we predict the class for a particular input. But you might be wondering where this weight matrix comes from.
We first initialize the weights randomly; then, by checking the accuracy of our model and using gradient descent on the cost function, we update the weight matrix to minimize the cost and obtain a better model for prediction.
Weight Update Using Gradient Descent (Mathematical Part, Optional)
We will use gradient descent to update the weight of each attribute (independent variable).
For the weight update we use the following equations:
∂ℓ/∂wj = ∑i xj(i) * (y(i) - f(x(i)))
∂ℓ/∂wj = ∑i xj(i) * (1[y(i) = +1] - P(y = +1 | x(i), w))
wj = wj + α * ∂ℓ/∂wj
Here ℓ is the log-likelihood, which we want to maximize, so each weight moves in the direction of the gradient; this is equivalent to descending the cost J. We have written the derivative in two forms, but both are the same: the first is the direct one, and in the second we expanded it using the indicator 1[y(i) = +1].
Here 1[y(i) = +1] is 1 if y(i) == +1 and 0 otherwise, so for 0/1 labels it is simply y(i).
Let's understand what this equation implies: it calculates the change in the weight of each feature at step t, considering all the data points one by one. Say the weight of feature 1 at step t is w1(t). Using this equation, we find the change in w1 due to all the data points.
The terms used in this equation are:
- xj(i): the value of feature j for input (data point) i
- P(y = +1 | x(i), w): the probability for data point x(i), calculated above
- ∂ℓ/∂wj: the change in the objective with respect to weight wj at step t
Suppose we update w1 using the equation above and the earlier example. Summing the contribution of each data point gives the gradient, and the update rule then gives the new value of w1:
delta(w1(t)) = 1 + 0 - 0.15 + 0.48 = 1.33
Then, using the update rule w1(t+1) = w1(t) + step_size * delta(w1(t)) with step_size = 0.1:
w1(t+1) = 1 + 0.1 * 1.33 = 1.133 (a small change, from 1 to 1.133)
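The single update step above can be reproduced in code. The per-point contributions 1, 0, -0.15 and 0.48 are taken from the worked example; the original table of intermediate values is assumed:

```python
# Per-data-point contributions x1(i) * (y(i) - P(y=+1 | x(i), w)),
# taken from the worked example above.
contributions = [1, 0, -0.15, 0.48]

step_size = 0.1   # learning rate alpha
w1 = 1.0          # current weight w1(t)

delta = sum(contributions)          # gradient for w1 at step t, approx. 1.33
w1_new = w1 + step_size * delta     # ascent step, approx. 1.133

print(delta)
print(w1_new)
```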
In a similar fashion, we update the weights of the other independent variables in order to improve our model's predictions.
So, in classification through logistic regression, we do the following:
Step 1: Calculate the probability of each data point (input) independently.
Step 2: Using the threshold, classify each point into a class.
Step 3: Since we are training our classifier, we need to choose the best weights for the independent variables so that it performs well on test data.
Step 4: For the weight update, we use the equation above.
Now the question arises: when do we stop updating the weights?
We stop when the derivative with respect to the weights becomes smaller than a chosen threshold value. This is the core idea of logistic regression.
If you use a library, you are freed from choosing the step size or the stopping threshold yourself; for training, the library runs the gradient descent algorithm to find the best weights.
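Putting steps 1 to 4 together, here is a minimal sketch of the full training loop in plain Python. The toy data, learning rate, stopping threshold, and function names (train, predict_proba) are all illustrative choices, not values from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    # z = w0 + w1*x1 + w2*x2 + ... (w[0] is the intercept term)
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return sigmoid(z)

def train(X, y, alpha=0.1, tol=1e-4, max_steps=10000):
    w = [0.0] * (len(X[0]) + 1)  # initial weights (Step 3 starts here)
    for _ in range(max_steps):
        # Gradient of the log-likelihood for each weight (Step 4).
        grad = [sum((yi - predict_proba(w, xi)) * (1 if j == 0 else xi[j - 1])
                    for xi, yi in zip(X, y))
                for j in range(len(w))]
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
        # Stop once every partial derivative is below the threshold.
        if max(abs(g) for g in grad) < tol:
            break
    return w

# Toy data: class 1 when x1 > x2 (illustrative).
X = [(2, 1), (3, 1), (1, 2), (0, 2)]
y = [1, 1, 0, 0]
w = train(X, y)
# Steps 1 and 2: probabilities, then threshold at 0.5 via round().
print([round(predict_proba(w, xi)) for xi in X])
```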
Multi-class Classification and Regularization
In multi-class classification, y, instead of taking the values 0 or 1, is free to take any of k+1 distinct values. An effective way to deal with such a problem is to divide it into k+1 binary classification problems. In each one, we calculate the probability that y belongs to one particular class: we choose one class at a time and lump all the other classes into a second class, then apply binary logistic regression. For prediction, we choose the class that maximizes f(x).
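The one-vs-rest scheme just described can be sketched as follows (the per-class weight vectors here are illustrative placeholders, not trained values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """f(x) for one binary 'class vs. rest' model; w[0] is the intercept."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return sigmoid(z)

def predict(class_weights, x):
    """One binary model per class; predict the class whose f(x) is largest."""
    scores = {label: score(w, x) for label, w in class_weights.items()}
    return max(scores, key=scores.get)

# Three classes, each with its own (illustrative) weight vector.
class_weights = {
    "A": [0.0, 1.0, -1.0],
    "B": [0.0, -1.0, 1.0],
    "C": [0.5, 0.0, 0.0],
}
print(predict(class_weights, (2, 0)))  # "A" scores highest here
```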
When fitting a model to the data, two problems can arise: under-fitting and over-fitting. In under-fitting, our model is a poor fit to the data; we sometimes tackle this by adding extra features to the model. But a basic problem can then arise: instead of learning from the data, the model starts memorizing it, which leads to poor predictions on new data. To deal with this we use a technique called regularization, in which we reduce the magnitude of the parameters by penalizing them.
With regularization, our usual cost function gains an extra penalty term on the weights.
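Written out, the L2-regularized cost takes the standard form below (λ is the regularization parameter; this formula is a standard reconstruction, not taken from the original figure):

```latex
J(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log f\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-f\big(x^{(i)}\big)\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2
```

The larger λ is, the more the weights are shrunk toward zero, trading some fit on the training data for better generalization.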
Logistic Regression using sklearn
Here, we use sklearn's built-in breast cancer dataset, where the task is to classify a tumour as benign or malignant.
We simply fit a Logistic Regression classifier and then measure the accuracy of our model.
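A minimal version of that workflow might look like this (the test split ratio and max_iter value are choices of this sketch, not from the original post):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in breast cancer dataset: classify tumours as malignant or benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the classifier; max_iter raised so the solver converges.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# Accuracy of the model on unseen test data.
print(clf.score(X_test, y_test))
```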
In this blog, we have covered logistic regression for classification problems and the functions it uses. We also dug into the mathematics of the weight update and the gradient descent algorithm.