Lasso Regression for Machine Learning
- Applications of Lasso Regression
- How does the algorithm work?
- How do we choose the best parameters?
- Pseudocode of the algorithm
- Implementing our model with scikit-learn
Lasso Regression is an effective technique for problems involving a large number of features. When a model has many features, it may start to over-fit, and every additional feature makes the calculation more complex. The basic intuition behind this kind of regression is to penalize the magnitude of the feature coefficients while minimizing the error. The term LASSO stands for Least Absolute Shrinkage and Selection Operator. This regression applies an L1 penalty: it adds a multiple of the sum of the absolute values of the coefficients to the original regression objective.
Mathematically, in this case, the cost function can be written as

Cost = SSR + alpha * sum(|beta_j|)

where SSR is the sum of squares of residuals, beta_j are the model coefficients, and alpha >= 0 controls the strength of the penalty.
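The cost above can be computed directly. The sketch below uses a tiny made-up data set (the numbers are illustrative, not from the article) and evaluates SSR plus the L1 penalty for a given coefficient vector:

```python
import numpy as np

# Toy data: y is roughly 2x with a little noise (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

def lasso_cost(coef, intercept, alpha):
    """Lasso objective: sum of squared residuals + alpha * sum(|coef|)."""
    residuals = y - (X @ coef + intercept)
    ssr = np.sum(residuals ** 2)
    return ssr + alpha * np.sum(np.abs(coef))

cost = lasso_cost(np.array([2.0]), 0.0, alpha=1.0)
print(cost)  # SSR of 0.10 plus a penalty of 1.0 * |2.0| = 2.0
```

Larger alpha values make nonzero coefficients more expensive, which is what pushes some of them to exactly zero.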
The intuition behind Lasso Regression
Here, we fit a lasso regression model on the Boston data set.
We define a function lasso_effect that takes a list of alpha values as an argument and returns a data frame of coefficients for each alpha in the list. As we increase the value of alpha, the number of features with a coefficient equal to zero increases.
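A minimal sketch of such a lasso_effect function is below. Note that load_boston has been removed from recent scikit-learn releases, so this version substitutes the bundled diabetes data set; the alpha grid is likewise just an illustration:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

# load_boston is gone from current scikit-learn, so we use the
# diabetes data set (10 features) as a stand-in.
data = load_diabetes()
X, y = data.data, data.target

def lasso_effect(alphas):
    """Fit one Lasso model per alpha; return a frame of coefficients."""
    coefs = {}
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
        coefs[alpha] = model.coef_
    return pd.DataFrame(coefs, index=data.feature_names)

df = lasso_effect([0.1, 1.0, 10.0])
print((df == 0).sum())  # zero-coefficient count per alpha
```

Printing the count of zero coefficients per column shows the pattern the text describes: larger alphas leave fewer nonzero features.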
Lasso v/s Ridge Regression
Let us think of a data set with more than 10,000 features. Ridge regression retains all of them and only shrinks the values of their coefficients; since we are still left with every feature, the computational complexity does not decrease. Lasso regression, on the other hand, not only shrinks the coefficients but also performs feature selection, driving some coefficients exactly to zero and thereby reducing the model's complexity. In the presence of correlated features, Ridge regression performs well because it retains all of them. Lasso, by contrast, tends to select one feature from a group of highly correlated features and set the coefficients of the rest to zero, which can be problematic because it discards information. A technique called Elastic Net combines the properties of Lasso and Ridge and often gives better results in this situation.
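The contrast can be seen on synthetic data. The sketch below (my own construction, not from the article) builds two nearly identical features: Ridge keeps both with split weights, while Lasso typically drives one of the pair toward zero, and Elastic Net sits in between:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Two highly correlated features: the second is x plus tiny noise
X = np.column_stack([x, x + rng.normal(scale=0.01, size=200)])
y = 3 * x + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(ridge.coef_)  # both features keep substantial weight
print(lasso.coef_)  # one of the pair is driven to (near) zero
print(enet.coef_)   # a compromise between the two behaviors
```

This is the information-loss issue the text mentions: Lasso's choice between the two correlated columns is essentially arbitrary.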
Whenever our data contains a large number of features, it is expensive to compute with basic algorithms, and fitting, say, a plain linear regression may cause the model to over-fit. In such cases, we prefer lasso regression. Lasso finds the subset of features that minimizes the prediction error, and it does so through shrinkage and selection; this can give a better fit than a plain linear regression model. But lasso starts to give poor results when there is high correlation between the independent variables: it selects one of the highly correlated features and sets the coefficients of the rest to zero. This is why we do not prefer Lasso for highly correlated data sets.
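On the question of choosing the best parameters from the outline: a common approach (an addition here, not spelled out in the article) is cross-validation over a grid of candidate alphas, which scikit-learn provides as LassoCV. The grid and data set below are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation over an illustrative grid of alphas;
# LassoCV picks the alpha with the lowest average validation error.
model = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5, max_iter=10_000).fit(X, y)
print(model.alpha_)
```

The selected alpha_ can then be used to fit a final Lasso model on the full training set.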