In this tutorial, I will briefly revisit linear regression and the bias-variance problem that we saw in the last blog (link here). You will also see how regularization can be used to address this problem. I will then walk through a case study to make the ideas we discuss concrete.
Table of contents
- Brief about Linear Regression
- Why do we require regularization?
- Lasso Regression – L1 regularization
- Bias-variance tradeoff
- Case Study on Boston House Prediction Dataset
- Python implementation using scikit-learn
LASSO stands for Least Absolute Shrinkage and Selection Operator.
Lasso regression is a type of linear model that uses shrinkage. Shrinkage here means that it reduces the coefficients of the model, thereby simplifying the model. Lasso regression performs L1 regularization. We will see in detail how it performs regularization, but before that we will revise the important concepts of the linear regression model.
2. Brief about Linear regression
Linear regression models the relationship between the output variable (y) and the input variables (X). Its aim is to find the best-fitting line (in 2-D) through the points, that is, the line that minimizes the error. In higher-dimensional data this becomes a regression plane. Here, the best-fit line is the one whose predicted values are as close as possible to the actual values. In other words, it minimizes the difference between the actual values and the predicted values, which is called the error. The graphical representation for this is shown below.
The equation of a line in machine learning terms can be expressed as

$$\hat{y} = W_0 + W_1 X_i$$

where $\hat{y}$ = predicted output
$X_i$ = input
$W_0$, $W_1$ = parameters of the model
If we extend the above equation to m dimensions, it becomes

$$\hat{y} = W_0 f_0 + W_1 f_1 + \dots + W_m f_m = \sum_{i=0}^{m} W_i f_i$$

where $f_i$ = input features, with 1 <= i <= m
$f_0$ = 1
$W_i$ = parameters of the model
m = dimensions of the data
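As a small illustration of the prediction formula, the sketch below computes the weighted sum of the features with NumPy. The function name `predict` and the weight values are made up for this example; they do not come from the case study.

```python
import numpy as np

def predict(weights, features):
    """Return y_hat = W0*f0 + W1*f1 + ... + Wm*fm, with f0 fixed to 1."""
    features = np.asarray(features, dtype=float)
    # Prepend the constant feature f0 = 1 so that W0 acts as the intercept.
    f = np.concatenate(([1.0], features))
    return float(np.dot(weights, f))

# Hypothetical weights: W0 = 2.0 (intercept), W1 = 0.5, W2 = -1.0
weights = np.array([2.0, 0.5, -1.0])

# One data point with two features: f1 = 4.0, f2 = 1.0
y_hat = predict(weights, [4.0, 1.0])
print(y_hat)  # 2.0*1 + 0.5*4.0 + (-1.0)*1.0
```

Treating the intercept as the weight of a constant feature keeps the whole prediction a single dot product.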
Linear regression uses what is called the squared loss. For a single data point it is defined as

$$L_i = (y_i - \hat{y}_i)^2$$

where $\hat{y}_i$ = predicted value
$y_i$ = actual value
The model tries to reduce the squared loss over all the points in the dataset.
If we sum the errors produced by all n data points and take the weight vector W that minimizes this total error, the equation can be written as

$$W^{*} = \arg\min_{W} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
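The minimization described above can be sketched with NumPy's least-squares solver. The toy data, the matrix name `F` and the helper `total_squared_loss` are assumptions made for this illustration only.

```python
import numpy as np

# Toy data that lies exactly on the line y = 1 + 2x (hypothetical values).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Feature matrix with a constant first column (f0 = 1) for the intercept.
F = np.column_stack([np.ones_like(x), x])

def total_squared_loss(W, F, y):
    """Sum of squared differences between actual and predicted values."""
    residuals = y - F @ W
    return float(np.sum(residuals ** 2))

# Weights that minimize the total squared loss.
W_best, *_ = np.linalg.lstsq(F, y, rcond=None)

print("best weights:", W_best)
print("loss at best weights:", total_squared_loss(W_best, F, y))
print("loss at another guess:", total_squared_loss(np.array([0.0, 1.0]), F, y))
```

Any other choice of weights gives a loss at least as large as the one found by the solver.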
This is a brief overview of the linear regression model. If you want more details, please go through my previous blog on linear regression (link here).
3. Why do we require regularization?
Regularization is a technique that penalizes large regression coefficients. In other words, it reduces the values of the coefficients, thereby simplifying the model.
Let’s take a simple example where we have 2 points and we want to fit a regression line through them.
The simplest fit is a first-degree polynomial (a straight line) through the points. However, there are infinitely many higher-order curves (second order, third order, and so on) that also pass through both points. This can be shown in the below figure.
With such a small amount of data, fitting the line is not difficult. But what if a new point arrives that does not fall on the line? A complex curve that passes exactly through the training points can be far off for such a point, which is overfitting. On the other hand, an overly simple model may underfit and perform very poorly on unseen data. The solution is to find the right balance between overfitting and underfitting, which is the whole idea of regularization.
So basically, in regularization we keep the same number of features but reduce the magnitude of the coefficients (W0, W1, ..., Wm).
4. Lasso Regression – L1 regularization
Lasso regression adds the L1 regularization penalty, i.e. it adds the sum of the absolute values of the coefficients to the loss function. With n data points and m features, the loss function becomes

$$\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} |W_j|$$
Lasso regression leads to a sparse model, that is, a model with fewer non-zero coefficients. Some of the coefficients may become exactly zero, and the corresponding features are eliminated. So lasso regression not only helps to avoid overfitting but also performs feature selection.
The lambda (λ) in the above equation is the amount of penalty that we add. The details are discussed in the next section of the blog.
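To make the penalty term concrete, here is a minimal sketch of the lasso loss in NumPy. The function name `lasso_loss` and the numbers are hypothetical, and leaving the intercept W0 out of the penalty is an assumption of this sketch. Note also that scikit-learn's `Lasso` scales the squared-error term by 1/(2n), so its `alpha` is not numerically identical to the λ used here.

```python
import numpy as np

def lasso_loss(W, F, y, lam):
    """Squared loss plus lam times the sum of absolute coefficient values.

    F is the feature matrix whose first column is the constant f0 = 1.
    The intercept W[0] is left out of the penalty (an assumption here).
    """
    W = np.asarray(W, dtype=float)
    squared_error = np.sum((y - F @ W) ** 2)
    l1_penalty = lam * np.sum(np.abs(W[1:]))
    return float(squared_error + l1_penalty)

# Hypothetical data: two points that the weights below fit exactly.
F = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([3.0, 5.0])
W = np.array([1.0, 2.0])

print(lasso_loss(W, F, y, lam=0.0))  # only the squared error remains
print(lasso_loss(W, F, y, lam=0.5))  # squared error plus 0.5 * |2.0|
```

Even a perfect fit pays a price proportional to the size of its coefficients, which is what pushes the model towards smaller weights.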
5. Bias-variance tradeoff
In lasso regression, the parameter λ controls the amount of penalty added to the loss function, so that the right balance between overfitting and underfitting can be found. Depending on the value of λ, there are the following 3 cases.
Case 1: When λ = 0
This case is the same as removing the penalty term entirely. So, in this case, the model reduces to simple linear regression.
Case 2: When λ = ∞
As the value of λ increases, the penalty grows and the model is forced to reduce the magnitude of the coefficients. So as λ increases, more and more coefficients tend to zero and are eventually eliminated. Theoretically, when λ = ∞, all the coefficients are removed.
Case 3: When 0 < λ <∞
This is the case we want, in which we tune the value of λ between 0 and ∞.
Now I hope you have a clear idea of how different values of λ affect the magnitude of the coefficients. In short, as λ increases, bias increases, and as λ decreases, variance increases.
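The three cases can be seen directly in one special situation: when the features are orthonormal, the lasso solution for the loss "squared error plus λ times the sum of absolute coefficients" is obtained by shrinking each unregularized coefficient towards zero by λ/2 and clipping it at zero (soft thresholding). The sketch below assumes that special situation, and the coefficient values are made up.

```python
import numpy as np

def soft_threshold(w_ols, lam):
    """Lasso coefficients for orthonormal features.

    Each unregularized coefficient is moved towards zero by lam / 2 and
    set to exactly zero once its magnitude drops below that amount.
    """
    w_ols = np.asarray(w_ols, dtype=float)
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

# Hypothetical coefficients of the unregularized (lambda = 0) model.
w_ols = np.array([3.0, -1.5, 0.4])

for lam in [0.0, 1.0, 4.0, 100.0]:
    w = soft_threshold(w_ols, lam)
    print(f"lambda={lam:6.1f}  coefficients={w}  non-zero={np.count_nonzero(w)}")
```

With λ = 0 the coefficients are unchanged, with a very large λ they are all zero, and for values in between only the larger coefficients survive.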
6. Case Study on Boston House Prediction Dataset
To get hands-on experience with lasso regression and a better understanding of it, we will take a real dataset and apply the concepts that we have learned.
We will use the housing dataset, which contains information about different houses sold in Boston. The dataset contains 506 data points and 14 attributes, including the target variable. The aim is to predict the house price given the values of the other attributes.
You can download the dataset from here (https://www.kaggle.com/vikrishnan/boston-house-prices).
The following picture describes all the information about the various attributes of the dataset.
I am not going into the data exploration and preprocessing steps here; instead I will focus on the main part, which is what we have learned about regularization. You can find all the code in my GitHub repository (link here).
7. Python implementation using scikit-learn
The code I used to perform lasso regression is below. We will try different values of lambda and see how they affect the model performance.
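The original listing is not reproduced here, so the following is only a minimal sketch of fitting `LinearRegression` and `Lasso` with scikit-learn and comparing their coefficients. It uses synthetic stand-in data from `make_regression` with the same shape as the housing data (506 rows, 13 features), because `load_boston` was removed from scikit-learn in version 1.2. The feature names are placeholders, and the numbers it prints will not match the results discussed in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic stand-in data (NOT the Boston dataset): 506 rows, 13 features.
X, y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=15.0, random_state=42)
feature_names = [f"f{i}" for i in range(1, X.shape[1] + 1)]  # placeholder names

# Plain linear regression versus lasso with the default penalty.
linear = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)  # alpha is scikit-learn's name for lambda

for name, w_lin, w_lasso in zip(feature_names, linear.coef_, lasso.coef_):
    print(f"{name:>4}  linear={w_lin:9.3f}  lasso={w_lasso:9.3f}")

print("coefficients set to zero by lasso:", int(np.sum(lasso.coef_ == 0)))
```

To run this on the real data, replace `X`, `y` and `feature_names` with the attributes and target loaded from the downloaded file.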
The default value of lambda is 1 (in scikit-learn the parameter is called alpha). If we observe the above plot, we can see that the attribute CHAS, which is more important in linear regression, becomes the least important in lasso regression, i.e. lasso completely removes the coefficient of the CHAS attribute.
Let’s try for other values of lambda.
We can see that as we increase the value of lambda, the coefficients approach zero and some become exactly zero. Dropping features in this way is known as feature selection.
Now let’s look at the accuracy of the different values of lambda.
If we observe carefully, we can see that as the value of lambda increases, the number of features with non-zero coefficients decreases. You can also observe the different values of accuracy for the different lambda values.
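A comparison like this can be sketched as a loop over several penalty values. As before, the data is a synthetic stand-in rather than the housing dataset, and the list of alpha values is an arbitrary choice. The score reported is R² on a held-out split, which is what `score` returns for scikit-learn regressors; that it corresponds to the "accuracy" mentioned above is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Boston dataset): 506 rows, 13 features.
X, y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:  # arbitrary grid of penalties
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)
    n_used = int(np.count_nonzero(model.coef_))  # features lasso kept
    r2 = model.score(X_test, y_test)             # R^2 on unseen data
    results.append((alpha, n_used, r2))
    print(f"alpha={alpha:7.2f}  features used={n_used:2d}  test R^2={r2:.3f}")
```

In practice the value of lambda is usually chosen by cross-validation, for example with scikit-learn's `LassoCV`, rather than by reading it off a single train/test split.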
In this post, I gave an overview of regularization using lasso regression. I hope you now have an intuition for regularization and how it actually works. I encourage you to implement a case study yourself to get a better understanding of the technique. In the next blog, we will see another regularization technique.
Thanks for reading! If you liked the blog, please share it with your friends.