In this tutorial, I will give you a brief overview of ridge regression and the bias-variance problem that we saw in the last blog on Logistic Regression. You will also see how regularization, in the form of ridge regression, can be used to address this problem. I will also walk through a case study that will help you get a better idea of what we are going to discuss.
You will also discover the difference between ridge and lasso regression. If you have not gone through lasso regression yet, I would highly recommend taking a look at it before starting with ridge regression. See the Lasso regression post for more information.
Table of contents
- Brief about Linear Regression
- Why do we require regularization?
- Ridge Regression – L2 regularization
- Bias-variance tradeoff
- Difference between ridge and lasso regression
- Case Study on Boston House Prediction Dataset
Ridge regression is a type of linear model that uses shrinkage. Shrinkage here means that it reduces the magnitude of the model's coefficients, thereby simplifying the model. Ridge regression performs L2 regularization. We will see in detail how it performs regularization, but before that we will revise the important concepts from the linear regression model.
2. Brief about Linear regression
Linear Regression is used to understand the relationship between the output variable (y) and the input variables (X). The aim of Linear Regression is to find the best-fitting line (in 2-D) through the points, that is, the line that minimizes the error. For higher-dimensional data we call it a regression plane. Here the best-fit line means the line whose predicted values are as close as possible to the actual values. In other words, it tries to minimize the difference between the actual values and the predicted values. This difference is called the error. The graphical representation for this is shown below.
The equation of a line in terms of Machine Learning can be expressed as,

$$\hat{y}_i = W_0 + W_1 X_i$$

Where $\hat{y}_i$ = Predicted output
$X_i$ = Input
$W_0, W_1$ = Parameters of the model
If we extend the above equation to m dimensions then it changes to

$$\hat{y} = \sum_{i=0}^{m} W_i f_i$$

Where $f_i$ = input features and $1 \le i \le m$
$f_0 = 1$
$W_i$ = Parameters of the model
$m$ = dimensions of the data
Linear regression uses what we call the Squared Loss. For a single point it is defined as follows

$$L_i = (y_i - \hat{y}_i)^2$$

Where $\hat{y}_i$ = Predicted value
$y_i$ = Actual value
The model will try to reduce the squared loss for all the points in the dataset.
If we sum the errors produced by all n points and pick the weights that minimize this total error, then the equation can be written as

$$W^* = \arg\min_{W} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
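To make the minimization above concrete, here is a minimal sketch using NumPy. The toy data (a noisy line with W0 = 2 and W1 = 3) and all variable names are my own assumptions for illustration, not part of any dataset used in this blog.

```python
import numpy as np

# Hypothetical toy data: y = 2 + 3x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix with f0 = 1 so that W0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the W that minimizes the sum of squared errors.
W, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ W
squared_loss = np.sum((y - y_hat) ** 2)
print("W0, W1:", W)
print("sum of squared errors:", squared_loss)
```

The recovered W0 and W1 should land close to the values used to generate the data, and no other pair of weights gives a smaller sum of squared errors on these points.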
This is a brief overview of the linear regression model. If you want more details, kindly go through my previous blog on linear regression.
3. Why do we require regularization?
Regularization is a technique that penalizes large regression coefficients. In other words, it reduces the magnitude of the coefficients, thereby simplifying the model.
Let’s take a simple example where we have 2 points and we want to fit the regression line through these points.
The simplest fit would be the first-degree polynomial (a straight line) through both points. However, there is an infinite number of higher-order curves (second order, third order, and so on) that also pass through both points. This can be shown in the below figure.
For a small amount of data, fitting a curve through every point is not difficult. But what if a new point arrives that does not fall on the curve? A model that is too complex overfits the training points and predicts such a point badly. On the other hand, a model that is too simple may underfit and also perform poorly on unseen data. The solution to this problem is to find the right balance between overfitting and underfitting, which is the whole idea of regularization.
So basically, in regularization we keep the same number of features but reduce the magnitude of the coefficients (W0, W1, ..., Wm).
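The two-point example can be sketched in a few lines of Python. The points (1, 2) and (3, 4), the constant c, and the function names are assumptions I picked for illustration. Both models fit the training points exactly, yet they disagree badly on a new input.

```python
import numpy as np

# Two hypothetical training points.
x_train = np.array([1.0, 3.0])
y_train = np.array([2.0, 4.0])

def line(x):
    # First-degree fit through both points: y = x + 1.
    return x + 1.0

def curve(x, c=2.0):
    # Adding c * (x - 1) * (x - 3) keeps the curve on both training
    # points for any c, so there are infinitely many such curves.
    return x + 1.0 + c * (x - 1.0) * (x - 3.0)

train_err_line = np.sum((y_train - line(x_train)) ** 2)
train_err_curve = np.sum((y_train - curve(x_train)) ** 2)

x_new = 5.0
print(train_err_line, train_err_curve)  # 0.0 0.0
print(line(x_new), curve(x_new))        # 6.0 22.0
```

Expanding the curve gives 2x² - 7x + 7, so its coefficients are much larger in magnitude than those of the line. Large coefficients like these are exactly what regularization pushes back on.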
4. Ridge Regression – L2 regularization
Ridge regression adds the L2 regularization penalty, i.e. it adds the sum of the squared coefficients, scaled by λ, to the loss function. So the loss function changes to the following equation.

$$\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} W_j^2$$

Here n is the number of data points and m is the number of features. The intercept W0 is usually left out of the penalty.
Unlike lasso regression, ridge regression does not lead to a sparse model, that is, a model with fewer coefficients. Some of the coefficients may shrink close to zero, but they do not become exactly zero and hence cannot be eliminated. So ridge regression retains all the features of the data.
The lambda (λ) in the above equation is the amount of penalty that we add. The details for this are discussed in the next section of the blog.
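For readers who want to see the penalty at work, here is a minimal NumPy sketch of ridge regression using its closed-form solution W = (XᵀX + λI)⁻¹Xᵀy. The helper name `ridge_fit`, the toy data, and the choice to leave the intercept unpenalized by centering the data are my own assumptions, not code from the case study.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; the intercept is left unpenalized by centering."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    m = X.shape[1]
    # Solve (Xc^T Xc + lam * I) W = Xc^T yc for the feature weights.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(m), Xc.T @ yc)
    W0 = y_mean - X_mean @ W
    return W0, W

# Hypothetical toy data with three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([4.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

W0_ols, W_ols = ridge_fit(X, y, lam=0.0)     # no penalty
W0_ridge, W_ridge = ridge_fit(X, y, lam=50.0)  # with L2 penalty
print("lambda = 0 :", W_ols)
print("lambda = 50:", W_ridge)
```

With the penalty switched on, every coefficient moves toward zero, but none of them becomes exactly zero, which matches the point about ridge regression keeping all the features.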
5. Bias-variance tradeoff
In ridge regression, the parameter λ controls the amount of penalty added to the loss function, so that we can find the right balance between overfitting and underfitting. Depending on the value of λ, we have the following 3 cases.
Case 1: When λ = 0
This case is the same as getting rid of the penalty term entirely. So, in this case, the model reduces to the simple linear regression model.
Case 2: When λ = ∞
As the value of λ increases, we force the model to penalize the coefficients more heavily, which reduces their magnitude. So as λ grows toward ∞, the coefficients tend toward zero, but for any finite λ they do not become exactly zero.
Case 3: When 0 < λ <∞
This is the practical case, in which we tune the value of λ between 0 and ∞.
Now I hope you have a clear idea of how the different values of λ affect the magnitude of the coefficients. In short, as λ increases, bias increases, and as λ decreases, variance increases.
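The three cases can be observed by sweeping λ on a small problem. The sketch below is only an illustration under my own assumptions: synthetic data with 20 features and just 30 training points, a hypothetical helper `ridge_weights`, and an arbitrary grid of λ values.

```python
import numpy as np

def ridge_weights(X, y, lam):
    # Closed-form ridge solution without an intercept
    # (the toy data below is generated without one).
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(1)
true_W = rng.normal(size=20)

# Few training points relative to the number of features, so variance is high.
X_train = rng.normal(size=(30, 20))
y_train = X_train @ true_W + rng.normal(0, 2.0, size=30)
X_test = rng.normal(size=(500, 20))
y_test = X_test @ true_W + rng.normal(0, 2.0, size=500)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms, train_mse, test_mse = [], [], []
for lam in lambdas:
    W = ridge_weights(X_train, y_train, lam)
    norms.append(np.linalg.norm(W))
    train_mse.append(np.mean((y_train - X_train @ W) ** 2))
    test_mse.append(np.mean((y_test - X_test @ W) ** 2))
    print(f"lambda={lam:7.1f}  ||W||={norms[-1]:6.3f}  "
          f"train MSE={train_mse[-1]:7.3f}  test MSE={test_mse[-1]:7.3f}")
```

The coefficient norm shrinks and the training error rises as λ grows. The test error is the number to watch when tuning: where it bottoms out depends on the data, which is why λ is normally chosen by validation rather than fixed in advance.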
6. Difference between ridge and lasso regression
I hope now you have understood the ridge regression as well as lasso regression. Let’s look at the difference between them.
Let’s say you have a dataset with 50,000 features and you have to apply a regression model to it. Which one would you apply? The following explanations help answer this.
If we apply ridge regression to the data, then the model will retain all 50,000 features but will shrink their coefficients. The problem is that the model still remains complex with 50,000 features, which may lead to poor performance.
On the other hand, if we apply lasso regression, then for a highly correlated pair of features it tends to keep one and remove the other completely by setting its coefficient to zero. By doing this we might lose some useful information from the data.
The solution to this problem is to find the hybrid model by combining the ridge as well as the lasso regression model. This model is known as Elastic Net regression.
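Here is a small sketch of that difference using scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` estimators. The synthetic data (20 features, of which only the first 3 actually matter) and the penalty strengths are assumptions chosen for illustration. Note that scikit-learn calls the penalty strength `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Hypothetical data: only the first 3 of 20 features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_W = np.zeros(20)
true_W[:3] = [5.0, -3.0, 2.0]
y = X @ true_W + rng.normal(0, 1.0, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
# l1_ratio=0.5 mixes the L1 (lasso) and L2 (ridge) penalties equally.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"{name:12s} coefficients exactly zero: {n_zero} of {model.coef_.size}")
```

Ridge keeps every coefficient non-zero, while lasso drops most of the irrelevant features outright. Elastic Net sits between the two, and its `l1_ratio` parameter controls how much it behaves like one or the other.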
7. Case Study on Boston House Prediction Dataset
To get hands-on experience with ridge regression and a better understanding of it, we will take a real dataset and apply the concepts that we have learned.
We will take the housing dataset which contains information about the different houses sold in Boston. The dataset contains 506 data points and 14 attributes including the target variable. So the aim is to predict the house price given the values of various attributes.
You can download the dataset from here (https://www.kaggle.com/vikrishnan/boston-house-prices).
The following picture describes all the information about the various attributes of the dataset.
Now here I am not going into the data exploration and preprocessing step but I will focus on the main part that we have learned about regularization. You can find all the code in my GitHub repository (link here).
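As a rough idea of what the modelling step can look like, here is a hedged sketch with scikit-learn. It is not the code from my repository. Since the file name and column layout of the downloaded data are not given here, the sketch uses a synthetic stand-in with the same shape (506 rows, 13 input attributes); the alpha values and the 80/20 split are also assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in with the same shape as the housing data
# (506 rows, 13 input attributes). Replace X and y with the real columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 13))
y = 22.0 + X @ rng.normal(size=13) + rng.normal(0, 1.0, size=506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for alpha in [0.1, 1.0, 10.0]:
    # Scale features first so the penalty treats every coefficient alike.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    results[alpha] = mean_squared_error(y_test, model.predict(X_test))
    print(f"alpha={alpha:5.1f}  test MSE={results[alpha]:.3f}")
```

The numbers printed here say nothing about the real housing data. On the actual dataset, it is usually better to choose alpha by cross-validation on the training split and keep the test split for the final check.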
In this post, I gave an overview of regularization using ridge regression and the difference between the lasso and ridge regression. I hope you got the intuition of using regularization and how it actually works. I encourage you to implement a case study to get a better understanding of the regularization technique.
Thanks for reading! If you liked the blog then share it with your friends.