In this tutorial, I will give a brief overview of linear regression and the bias-variance problem that we saw in the last blog. I will also walk through a case study that will help you get a better idea of what we are going to discuss.
You will also discover why elastic net is more useful than ridge and lasso regression. If you haven't gone through lasso regression and ridge regression yet, I would highly recommend taking a look at them before starting with elastic net regression.
Table of contents
- Brief about Linear Regression
- Why do we require regularization?
- ElasticNet Regression – L1 + L2 regularization
- Bias-variance tradeoff
- Why elastic net regression and not ridge or lasso?
- Case Study on Boston House Prediction Dataset
- Python implementation using scikit-learn
ElasticNet regression is a type of linear model that uses a combination of the ridge and lasso penalties for shrinkage. Shrinkage here means that it reduces the coefficients of the model, thereby simplifying it. Elastic net regression performs L1 + L2 regularization. We will see in detail how it performs regularization, but before that, let's revise the important concepts from the linear regression model.
2. Brief about Linear Regression
Linear Regression is used to understand the relationship between the output variable (y) and the input variables (X). The whole aim of linear regression is to find the best-fitting line (in 2-D) through all the points, the one that minimizes the error. In higher-dimensional data we call it a regression plane. Here the best-fit line means the line whose predicted values are as close as possible to the actual values. In other words, it tries to minimize the difference between the actual values and the predicted values. This difference is also called the error. The graphical representation for this is shown below.
The equation of a line in terms of machine learning can be expressed as

ŷ = W0 + W1 * Xi

Where ŷ = Predicted output
Xi = input
W0, W1 = Parameters of the model
If we extend the above equation to higher dimensions m, then it changes to

ŷ = W0 * f0 + W1 * f1 + … + Wm * fm = Σ Wi * fi (summing over i = 0 to m)

Where fi = input features and 1 <= i <= m
f0 = 1
Wi = Parameters of the model
m = dimensions of the data
There is a quantity in linear regression called the squared loss. It can be defined as follows:

squared loss = (yi − ŷi)²

Where ŷi = Predicted value
yi = Actual value
The model will try to reduce the squared loss over all the points in the dataset.
If we sum the errors produced by all the points and look for the weight vector that minimizes this total error, the equation can be written as

W* = argmin over W of Σ (yi − ŷi)²
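The minimization above has a closed-form least-squares solution. As a minimal sketch (on hypothetical synthetic data, not the blog's dataset), the snippet below fits W0 and W1 for the 1-D case:

```python
import numpy as np

# Hypothetical 1-D example: fit y ≈ W0 + W1 * x by minimizing the squared loss
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)  # true W0 = 3, W1 = 2, plus noise

# Design matrix with f0 = 1 (the intercept column), as in the equation above
F = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares solution: W* = argmin ||y - F W||^2
W, *_ = np.linalg.lstsq(F, y, rcond=None)
print(W)  # W[0] close to 3, W[1] close to 2
```

With only noise standing between the data and the true line, the recovered parameters land close to the values used to generate the data.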
This is a brief overview of the linear regression model. If you want more details kindly go through my previous blog on linear regression.
3. Why do we require regularization?
Regularization is a technique to penalize high-value regression coefficients. In other words, it reduces the values of the coefficients, thereby simplifying the model.
Let’s take a simple example where we have 2 points and we want to fit the regression line through these points.
The simplest line would be a first-degree polynomial through the points. However, there are infinitely many curves of second order, third order, and so on that also pass through both points. This can be seen in the below figure.
Fitting a line to a small amount of data isn't that difficult. But what if a new point arrives that does not fall on the line? A complex model that passes through every training point will overfit and generalize poorly. On the other hand, a very simple model may underfit and also perform poorly on unseen data. The solution to this problem is to find the right balance between overfitting and underfitting, which is the whole idea of regularization.
So basically, in regularization we keep the same number of features but reduce the magnitudes of the coefficients (W0, W1, ..., Wm).
4. ElasticNet Regression – L1 + L2 regularization
Elastic net regression by default adds both the L1 and the L2 regularization penalties, i.e. it adds the absolute values of the coefficient magnitudes and the squares of the coefficient magnitudes to the loss function, respectively. So the loss function changes to the following equation:

Loss = Σ (yi − ŷi)² + λ1 * Σ |Wj| + λ2 * Σ Wj²

The λ1 and λ2 in the above equation control the amount of penalty that we add. The details are discussed in the next section of the blog.
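To make the penalty concrete, here is a small sketch that computes the elastic net loss directly from its three terms (the helper function and the toy numbers are my own, not from the blog):

```python
import numpy as np

def elastic_net_loss(W, F, y, lam1, lam2):
    """Squared loss plus the L1 (lam1) and L2 (lam2) penalties on the coefficients."""
    residual = y - F @ W
    return (residual ** 2).sum() + lam1 * np.abs(W).sum() + lam2 * (W ** 2).sum()

# Toy check: the penalties add |W| and W^2 terms on top of the squared loss
W = np.array([0.5, -2.0])
F = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(elastic_net_loss(W, F, y, lam1=0.0, lam2=0.0))  # 108.75, the pure squared loss
print(elastic_net_loss(W, F, y, lam1=1.0, lam2=1.0))  # 115.5 = 108.75 + 2.5 + 4.25
```

Note that scikit-learn does not expose λ1 and λ2 directly; it folds them into a single `alpha` (overall penalty strength) and an `l1_ratio` mixing parameter.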
5. Bias-variance tradeoff
In elastic net regression, the parameters λ1 and λ2 control the amount of penalty added to the loss function, so that there can be the right balance between overfitting and underfitting. Depending on the values of the λ's we provide, we have the following 5 cases.
Case 1: When λ1 = 0 and λ2 = 0
This case is the same as getting rid of the penalty term entirely. So, in this case, the model performs like a simple linear regression model.
Case 2: When λ1 = ∞ and λ2 = ∞
As the values of the λ's increase, we force the model to penalize more, reducing the magnitudes of the coefficients. So as the λ's grow, more and more coefficients tend toward zero, and some become exactly zero.
Case 3: When λ1 = 0 and λ2 = ∞
In this case we completely remove the lasso (L1) penalty from the equation, and the model performs the same as ridge regression.
Case 4: When λ1 = ∞ and λ2 = 0
In this case, we completely remove the ridge (L2) penalty from the equation, and the model performs the same as lasso regression.
Case 5: When 0 < λ1, λ2 <∞
This is the practical case, in which we tune the values of the λ's between 0 and ∞.
Now I hope you have a clear idea of how the different values of λ affect the magnitudes of the coefficients. In short, as λ increases, bias increases, and as λ decreases, variance increases.
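The shrinking effect of the penalty can be seen directly in scikit-learn. As a sketch on synthetic data (scikit-learn combines the two λ's into `alpha` and `l1_ratio`), the total coefficient magnitude drops as the penalty grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic regression data; stands in for any real dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Increase the overall penalty and watch the coefficient magnitudes shrink
for alpha in [0.01, 1.0, 100.0]:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000)
    model.fit(X, y)
    print(f"alpha={alpha}: total |coef| = {np.abs(model.coef_).sum():.2f}")
```

This mirrors Case 2 above: pushing the penalty toward infinity drives the coefficients toward zero.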
6. Why elastic net regression and not ridge or lasso?
Let’s say you have a dataset with 50,000 features and you have to apply a regression model to it. Which one would you apply? The answer is given by the following explanations.
If we apply ridge regression to the data, then the model will retain all 50,000 features but will shrink their coefficients. The problem is that the model will still remain complex with 50,000 features, which may lead to poor performance.
On the other hand, if we apply the lasso regression then for the highly correlated pair of features one feature will be removed completely by setting its coefficient to zero. By doing this we might lose some useful information from the data.
For such a situation, elastic net regression comes to the rescue: we form a hybrid model by combining the ridge and lasso penalties.
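The correlated-features behavior described above can be sketched with two nearly duplicate features (the data below is synthetic and the alpha values are my own choices): lasso tends to concentrate the weight on one of the pair, while elastic net spreads it across both.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
f1 = rng.normal(size=n)
f2 = f1 + rng.normal(scale=0.01, size=n)   # almost an exact duplicate of f1
X = np.column_stack([f1, f2])
y = 3 * f1 + 3 * f2 + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("lasso coefficients:", lasso.coef_)        # tends to keep one, suppress the other
print("elastic net coefficients:", enet.coef_)   # spreads the weight across both
```

Because the L2 part of the elastic net penalty prefers similar coefficients for similar features, neither of the correlated features is thrown away entirely.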
7. Case Study on Boston House Prediction Dataset
To get hands-on with elastic net regression and for better understanding, we will take a real dataset and apply the concepts that we have learned.
We will take the housing dataset which contains information about the different houses sold in Boston. The dataset contains 506 data points and 14 attributes including the target variable. So the aim is to predict the house price given the values of various attributes.
You can download the dataset from here (https://www.kaggle.com/vikrishnan/boston-house-prices).
The following picture describes all the information about the various attributes of the dataset.
Now here I am not going into the data exploration and preprocessing step but I will focus on the main part that we have learned about regularization. You can find all the code on Github.
8. Python implementation using scikit-learn
The code I used to perform elastic net regression is shown below. We will try different values of λ and see how they affect the model's performance.
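The full notebook is on Github; as a rough sketch of the workflow, the snippet below uses a synthetic stand-in with the same shape as the Boston data (506 rows, 13 features). The real CSV from the Kaggle link above can be loaded with pandas and dropped in the same way, and the alpha values here are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the Boston data: 506 rows, 13 features
X, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=15.0, random_state=7)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=7)

# Scale the features: the penalty is applied to the coefficients, so the
# features should be on comparable scales
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Try a few penalty strengths and compare test-set R^2
for alpha in [0.001, 0.1, 1.0, 10.0]:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000)
    model.fit(X_train, y_train)
    print(f"alpha={alpha}: R^2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```

With a very small alpha the model behaves close to plain linear regression, and as alpha grows the coefficients are shrunk and the test score eventually degrades, matching the bias-variance cases from section 5.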
In this post, I gave an overview of elastic net regression, which combines the L1 and L2 penalties of lasso and ridge regression, and showed how the values of the λ's control the bias-variance tradeoff. I hope you got the intuition behind regularization and how it actually works. I encourage you to implement the case study to get a better understanding of the regularization technique.
Thanks for reading! If you liked the blog then let me know your thoughts in the comment section.