# Logistic Regression for Machine Learning

This algorithm is named "regression", but don't be confused by the name: it is generally used for classification problems. Beginners might wonder why there are so many different algorithms for classification and clustering. The reason is simple: no single algorithm works well for every dataset or condition.

You might be wondering why we can't use linear regression for classification. The example described below will make this clearer. Some reasons for not using linear regression for classification are:

1. Categorical data - Linear regression does not work well on categorical outputs.
2. Continuous output values - It produces unbounded continuous values for new inputs rather than probabilities.

For the example, we will follow the one Andrew Ng uses in his Coursera course.
As Andrew Ng explains it, with linear regression you fit a polynomial through the data. Say, as in the example below, we fit a straight line through a dataset of tumors to predict the type of tumor on the basis of its size.

Here, malignant tumors are labeled 1 and benign tumors 0, and the green line in the diagram is our classifying line or, in Andrew Ng's language, the hypothesis h(x). To make predictions we can say that for any given tumor size x, if h(x) is bigger than 0.5 (the chosen threshold) we predict a malignant tumor; otherwise, we predict benign.

It looks like this way we could correctly predict every single training sample, but now let's change the data points a bit.

Intuitively it’s clear that all tumors larger than a certain threshold are malignant. So let’s add another sample with a huge tumor size, and run linear regression again:

Now our rule "h(x) > 0.5 means malignant" doesn't work anymore. To keep making correct predictions, we would need to change our hypothesis.

Clearly, we cannot change our hypothesis every time the data changes a bit.

In this particular example, you would instead use logistic regression, which might learn a hypothesis like the one shown.

## Logistic Regression Classifier

The logistic classifier uses the logistic (sigmoid) function. The specialty of logistic regression is this sigmoid, which is basically a link function squeezing any value into the range (0, 1); this will become clearer below.

Logistic function = 1/(1 + e^(-score))

Score: weighted sum of the inputs (w0 + w1*x1 + w2*x2 + ...)

Let’s see how this Sigmoid function is used to calculate the probability.

Suppose our initial weights are as follows: w0 = 0, w1 = 1, w2 = -2. We can update this weight matrix (using a cost function, explained later), since the weight matrix is what determines the accuracy of our model.

Now let's see how the probability is calculated using the sigmoid.

For data point 1 (x(1)): we have x1 = 2, x2 = 1, so the probability is calculated as:

Step 1: score(x(1)) = 0 + 1*2 - 2*1 = 0

Step 2: probability(x(1)) = 1/(1 + e^(-0)) = 0.5

For data point 2 (x(2)): we have x1 = 0, x2 = 2, so the probability is calculated as:

Step 1: score(x(2)) = 0 + 1*0 - 2*2 = -4

Step 2: probability(x(2)) = 1/(1 + e^(-(-4))) ≈ 0.02

Next, we use these probabilities together with a threshold to classify each input into one of the two classes.

Suppose we have set threshold = 0.5:

if (p > 0.5), x(i) belongs to class +1
else, it belongs to class 0

(The sigmoid gives P(y = +1 | x), so probabilities above the threshold map to the positive class.)
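The two worked examples above, plus the thresholding rule, can be sketched in a few lines of Python (a minimal sketch; the weights w0 = 0, w1 = 1, w2 = -2 are taken from the example above, and by the usual convention probabilities above the threshold map to the positive class):

```python
import math

def sigmoid(score):
    # squashes any real-valued score into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

# weights from the worked example: w0 = 0, w1 = 1, w2 = -2
w = [0.0, 1.0, -2.0]

def probability(x):
    # score = w0 + w1*x1 + w2*x2
    score = w[0] + w[1] * x[0] + w[2] * x[1]
    return sigmoid(score)

def classify(x, threshold=0.5):
    # probabilities above the threshold map to the positive class
    return 1 if probability(x) > threshold else 0

print(round(probability([2, 1]), 2))  # 0.5  (score = 0)
print(round(probability([0, 2]), 2))  # 0.02 (score = -4)
print(classify([0, 2]))               # 0
```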

Hence, in this way we predict the class for a particular input. But you might be wondering where we got this weight matrix.

For this, we first initialize the weights randomly; after that, by checking the accuracy of the model and using the cost function, we update the weight matrix.
Also, so far I have only stated the function used in logistic regression directly. There is a derivation behind it, which we will discuss at the end. Before that, let's look at the weight update equation.

## Weight Update Function (Mathematical Part, Optional)

For the weight update we use gradient ascent on the log-likelihood l(w). The partial derivative with respect to the weight wj is:

∂l(w)/∂wj = ∑i xj(i) * (1[y(i) = +1] − P(y = +1 | x(i), w))

and the update rule is:

wj(t+1) = wj(t) + α * ∂l(w(t))/∂wj

Here 1[y(i) = +1] is an indicator: it is 1 if y(i) = +1 and 0 otherwise. So each data point contributes its feature value xj(i), scaled by the gap between the true label and the predicted probability.

Let's understand what this equation implies: it computes the change in the weight of each feature at step t by summing the contributions of all the data points. Say the weight of feature 1 at step t is w1(t); the equation gives the total change in w1 due to all the data points.

Let's discuss the terms in this equation:

xj(i) = the value of feature j for data point i.
P(y = +1 | x(i), w(t)) = the probability that data point i belongs to class +1, calculated as shown above.
∂l(w(t))/∂wj = the resulting change in the weight of feature j at step t.
Suppose we compute the update for w1 using the equation and the example above. Summing the per-point contributions gives:

delta(w1(t)) = 1 + 0 - 0.15 + 0.48 = 1.33

Then, with step_size = 0.1, the update is:

w1(t+1) = w1(t) + step_size*delta(w1(t)) = 1 + 0.1*1.33 = 1.133

so w1 changes slightly, from 1 to 1.133.
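The arithmetic of this single update step can be checked directly (the four per-point contributions 1, 0, -0.15, 0.48 are taken from the example above):

```python
# per-point contributions xj(i) * (indicator - probability) from the example
contributions = [1.0, 0.0, -0.15, 0.48]

delta_w1 = sum(contributions)            # 1.33
step_size = 0.1
w1_old = 1.0
w1_new = w1_old + step_size * delta_w1   # gradient ascent step

print(round(delta_w1, 2))  # 1.33
print(round(w1_new, 3))    # 1.133
```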

## Interpretation of this equation

In a similar fashion, we update the weights of the other features in order to improve the model's predictions.

## Algorithmic Summary

So, in classification through logistic regression, what we are doing is:
Step 1: Calculate the probability of each data point independently.
Step 2: Classify each point by applying a threshold to its probability.
Step 3: While training the classifier, choose the best feature weights so that it performs well on test data.
Step 4: Update the weights using the equation above.
Now the question comes to mind: when do we stop updating the weights?
We stop when the derivatives with respect to the weights become smaller than a chosen threshold value. This is the core concept of logistic regression.
If you are using a library, you are freed from choosing the step size or the stopping threshold yourself. Either way, training uses gradient-based optimization to find the best weights.
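Putting the four steps together, here is a from-scratch sketch of the whole training loop (an illustrative implementation, assuming labels in {0, 1} with 1 playing the role of +1, and made-up toy data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, step_size=0.1, tol=1e-4, max_iter=10000):
    """Gradient ascent on the log-likelihood; stops when every partial
    derivative drops below tol (the stopping threshold discussed above)."""
    w = [0.0] * (len(X[0]) + 1)            # w[0] is the intercept w0
    for _ in range(max_iter):
        grads = [0.0] * len(w)
        for xi, yi in zip(X, y):
            score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            error = yi - sigmoid(score)    # indicator minus probability
            grads[0] += error
            for j, xj in enumerate(xi, start=1):
                grads[j] += xj * error
        w = [wj + step_size * g for wj, g in zip(w, grads)]
        if max(abs(g) for g in grads) < tol:
            break
    return w

def predict(w, x, threshold=0.5):
    score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if sigmoid(score) > threshold else 0

# toy data, separable on (x1 - x2)
X = [[2, 1], [3, 0], [0, 2], [1, 3]]
y = [1, 1, 0, 0]
w = train_logistic(X, y)
print([predict(w, xi) for xi in X])  # [1, 1, 0, 0]
```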

## Implementation in Python

### Logistic Regression Using Scikit learn

In logistic regression we can use regularization, just as we do in linear regression.
Scikit-learn's linear model for logistic regression exposes a number of parameters for this, including both an L1 and an L2 penalty.
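As a quick illustration (a hypothetical configuration, reusing the tiny dataset from the training example below), the penalty is chosen per fit; the liblinear solver supports both options:

```python
from sklearn.linear_model import LogisticRegression

# L1-penalized logistic regression; the liblinear solver supports
# both penalty='l1' and penalty='l2'
clf_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
clf_l1.fit([[1, 0, 2], [0, 1, 3]], [0, 1])
print(clf_l1.coef_.shape)  # (1, 3): one weight per feature
```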

### Training a classifier

In [2]:
```python
from sklearn.linear_model import LogisticRegression
# X -> training inputs
# y -> training outputs
# Here we are training a binary classifier

X = [[1, 0, 2], [0, 1, 3]]
y = [0, 1]

# Logistic regression with an arbitrary C value (the inverse of the
# regularization strength); the default penalty is 'l2'.
# dual=True solves the dual formulation, which requires the liblinear solver.
clf = LogisticRegression(C=0.4, dual=True, solver='liblinear')
clf.fit(X, y)
```
Out[2]:
```
LogisticRegression(C=0.4, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

### Prediction

In [4]:
```python
clf.predict([[3, 3, 2]])
```
Out[4]:
`array([1])`
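Beyond hard class labels, `predict_proba` returns the underlying probabilities themselves (a self-contained sketch re-creating the same toy classifier):

```python
from sklearn.linear_model import LogisticRegression

X = [[1, 0, 2], [0, 1, 3]]
y = [0, 1]

clf = LogisticRegression(C=0.4, dual=True, solver='liblinear')
clf.fit(X, y)

# one column per class, ordered as in clf.classes_; each row sums to 1
probs = clf.predict_proba([[3, 3, 2]])
print(clf.classes_)  # [0 1]
print(probs)
```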