# Stepwise Regression for Machine Learning

**Overview**

- **Introduction**
- **Applications of Stepwise Regression**
- **How does the algorithm work?**
- **How do we choose the best parameters?**
- **Pseudocode of the algorithm**
- **Implementing the model with scikit-learn**
- **Summary**

Stepwise Regression is a statistical method used to determine a useful subset of the predictors (independent variables) while building a regression model. It is a feature-selection process for regression models.

Stepwise Regression is important in machine learning because removing the less relevant features from the model decreases the variance of the predictions.

Let us consider, we have a regression model of the following form:

**y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + … + b_{10}X_{10}**

where, y = dependent variable

X_{1}, X_{2}, X_{3}, …, X_{10} = independent variables

b_{1}, b_{2}, b_{3}, …, b_{10} = coefficients of the independent variables, and b_{0} = intercept

In the above regression model, the independent variables X_{1}, X_{2}, X_{3}, …, X_{10} individually may or may not play a significant role in determining the model. Stepwise regression is used to identify which of these independent variables play a major role in determining the model and which of them can be eliminated from the equation.

There are two ways of implementing Stepwise Regression while determining the model: Forward Stepwise Regression and Backward Stepwise Regression. Both are based on the t-statistic values of the regression model.

Let us learn the stepwise regression process through an example. Let’s consider a dataset in the table below. We implement a stepwise regression model on the following dataset.

(Note: the values are illustrative estimates, not real data.)

| Training set | Selling price of land in millions (y) | Area in sq. ft. (X_{1}) | Size of the road to access the land (X_{2}) | Drinking water facility (X_{3}) | Ground water (X_{4}) | Medical facilities nearby (X_{5}) | Educational institutions nearby (X_{6}) | Land fertility (X_{7}) |
|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 200 | 12 | 90 | 80 | 60 | 60 | 90 |
| 2 | 3 | 195 | 8 | 85 | 75 | 90 | 40 | 50 |
| 3 | 5 | 221 | 12 | 23 | 90 | 24 | 43 | 54 |
| 4 | 3 | 100 | 6 | 89 | 67 | 60 | 68 | 67 |
| 5 | 5 | 312 | 6 | 67 | 78 | 98 | 78 | 67 |
| Mean | 4.4 | 205.6 | 8.8 | 70.8 | 78 | 66.4 | 57.8 | 65.6 |
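For the worked example that follows, the dataset can be held in plain Python lists (values copied from the table above; the variable names are my own choice):

```python
from statistics import mean

# Dataset from the table above, one list per column.
y  = [6, 3, 5, 3, 5]            # selling price of land in millions
x1 = [200, 195, 221, 100, 312]  # area in sq. ft.
x2 = [12, 8, 12, 6, 6]          # size of the access road
x3 = [90, 85, 23, 89, 67]       # drinking water facility
x4 = [80, 75, 90, 67, 78]       # ground water
x5 = [60, 90, 24, 60, 98]       # medical facilities nearby
x6 = [60, 40, 43, 68, 78]       # educational institutions nearby
x7 = [90, 50, 54, 67, 67]       # land fertility

# The column means reproduce the "Mean" row of the table.
print(mean(y), mean(x1), mean(x2))  # 4.4 205.6 8.8
```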

The regression model for the above table is

**y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6} + b_{7}X_{7}**

Not all independent variables play an important role in determining the regression model; some can be eliminated from the above equation. To determine which variables to eliminate and which to keep in the regression model, the t-values of the regressions of the individual independent variables on the dependent variable are calculated.

The t-value is calculated from the means and sample variances of the variables involved,

where,

M_{x} = mean value of X

M_{y} = mean value of y

n = number of training sets

The sample variance S that appears in the calculation is

**S = [Σ(x − M)^{2}] / (n − 1)**

where,

x = individual scores

M = mean

n = number of scores in the group

Initially, let us begin with the equation

**y = b_{0} + b_{1}X ……. (1)**

To determine which variable should stand in for X, we use the t-values calculated from the above-mentioned formula.

Using the above table, we calculate the t-value of each independent variable regressed individually on the dependent variable.

We will work through the calculation of t_{1}; the other t-values are calculated in the same way.

For t_{1},

**S_{x} = [(200 − 205.6)^{2} + (195 − 205.6)^{2} + (221 − 205.6)^{2} + (100 − 205.6)^{2} + (312 − 205.6)^{2}] / (5 − 1)**

**= (31.36 + 112.36 + 237.16 + 11151.36 + 11320.96) / 4**

**= 5713.3**

**S_{y} = [(6 − 4.4)^{2} + (3 − 4.4)^{2} + (5 − 4.4)^{2} + (3 − 4.4)^{2} + (5 − 4.4)^{2}] / (5 − 1)**

**= (2.56 + 1.96 + 0.36 + 1.96 + 0.36) / 4**

**= 1.8**

Now, calculating t_{1},

**t_{1} = 201.2 / 2555.06**

**= 0.0787**
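The two variance terms above can be checked with Python's standard library (`statistics.variance` uses the same n − 1 divisor as the worked calculation):

```python
from statistics import variance

x1 = [200, 195, 221, 100, 312]  # area (X1) from the table
y  = [6, 3, 5, 3, 5]            # selling price (y)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s_x = variance(x1)
s_y = variance(y)

print(round(s_x, 1))  # 5713.3
print(round(s_y, 1))  # 1.8
```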

This is the t-value from regressing X_{1} on y. We similarly regress all the independent variables on y and calculate the respective t-values, which are listed in the table below:

| t_{1} | t_{2} | t_{3} | t_{4} | t_{5} | t_{6} | t_{7} |
|---|---|---|---|---|---|---|
| 0.0787 | 1.05 | 0.186 | 2.23 | 0.072 | 0.454 | 0.56 |

From the above table, we can see that the t-value of the 4^{th} independent variable is the highest. So we now insert the 4^{th} variable, i.e., X_{4}, into the equation, and equation (1) becomes:

**y = b_{0} + b_{1}X_{4} + b_{2}X ……….. (2)**

In the above calculation, we regressed the independent variables on y to calculate the respective t-values. Now that X_{4} has been selected, we recompute the t-values of the remaining independent variables together with X_{4} in the same way.

From the formula of the t-value, after regressing with X_{4}, we get following t-values:

| t_{1} | t_{2} | t_{3} | t_{5} | t_{6} | t_{7} |
|---|---|---|---|---|---|
| -0.05 | 2.11 | 0.02 | 0.03 | 0.2 | 0.11 |

In the above table, we can see that the t-value of the second variable, i.e., X_{2}, is the highest. So the variable X in equation (2) is replaced by X_{2}, and the equation becomes:

**y = b_{0} + b_{1}X_{4} + b_{2}X_{2} + b_{3}X …………. (3)**

Now, to find the next independent variable X in equation (3), we again use the t-value calculation, regressing each remaining independent variable together with X_{4} and X_{2}. The t-values obtained are as follows:

| t_{1} | t_{3} | t_{5} | t_{6} | t_{7} |
|---|---|---|---|---|
| -0.06 | -0.07 | -0.06 | -0.144 | -0.19 |

In the above table, we can see that none of the absolute t-values is significant (i.e., ≥ 1). So we now terminate the equation with the independent variables X_{4} and X_{2}.

Thus, the final equation becomes:

**y = b_{0} + b_{1}X_{4} + b_{2}X_{2} + ɛ …………. (4)**

The equation (4) is our final regression model.
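Once X_{4} and X_{2} are selected, the coefficients b_{0}, b_{1}, and b_{2} of equation (4) can be estimated by ordinary least squares. A minimal sketch with NumPy's `lstsq`, using the table data (the exact coefficient values depend on the illustrative data, so none are claimed here):

```python
import numpy as np

x4 = [80, 75, 90, 67, 78]  # ground water, selected first
x2 = [12, 8, 12, 6, 6]     # road size, selected second
y  = [6, 3, 5, 3, 5]       # selling price in millions

# Design matrix with an intercept column: y = b0 + b1*X4 + b2*X2
A = np.column_stack([np.ones(len(y)), x4, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef
print(f"y = {b0:.2f} + {b1:.2f}*X4 + {b2:.2f}*X2")
```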

**When to use Stepwise Regression?**

Stepwise Regression is useful when there are many candidate predictor variables and we are interested in including only the significant ones in the regression model.

**Types of Stepwise Regression**

There are two types of stepwise regression, depending on the starting model and whether independent variables are added or deleted.

**Forward Stepwise Regression**

In this method of Stepwise Regression, we start with an empty set of predictors, sequentially find the predictor variable that provides the greatest improvement in the performance of the model, and add that variable to the equation, repeating until no candidate improves the model. The example illustrated above is forward stepwise regression.
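The forward procedure can be sketched as a greedy loop: try adding each remaining predictor, keep the one that most improves the fit, and stop when no candidate helps. A sketch using NumPy, with residual sum of squares as the fit score (the `min_improvement` stopping threshold is an illustrative choice, not part of the classical t-statistic rule used above):

```python
import numpy as np

def rss(cols, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + [np.asarray(c, float) for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((np.asarray(y, float) - A @ coef) ** 2))

def forward_select(columns, y, min_improvement=1e-6):
    """Greedy forward stepwise selection over a dict {name: column}."""
    selected, remaining = [], dict(columns)
    best = rss([], y)  # start from the intercept-only model
    while remaining:
        # Score every candidate when added to the current model.
        scores = {name: rss([columns[s] for s in selected] + [col], y)
                  for name, col in remaining.items()}
        name = min(scores, key=scores.get)
        if best - scores[name] < min_improvement:
            break  # no candidate improves the fit enough
        selected.append(name)
        best = scores[name]
        del remaining[name]
    return selected
```

For example, `forward_select({"a": [1, 2, 3, 4, 5], "b": [1, 0, 1, 0, 1]}, [2, 4, 6, 8, 10])` picks only `"a"`, since y is an exact linear function of it.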

**Backward Stepwise Regression**

In this method of Stepwise Regression, we start with the equation containing all predictors, sequentially find the predictor that contributes least to the performance of the model, and remove it from the equation. This process is the opposite of the example given above: we start with all the independent variables, remove the insignificant ones one at a time, and end up with an equation containing only significant variables.
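Both directions are available in scikit-learn through `SequentialFeatureSelector`. Note that it scores candidates by cross-validated model performance rather than the t-statistics used in the worked example above; the synthetic data here is only for demonstration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only features 0 and 3 actually drive y; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=50)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",   # use "backward" for backward stepwise
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```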

**Advantages of Stepwise Regression:**

1. It provides the ability to manage the predictor variables in the regression model by eliminating or inserting them into the model according to their significance.

2. It is faster than the other automatic model-selection methods.

**Disadvantages of Stepwise Regression**

1. A small dataset may result in a poorly fitting regression model.

2. If two predictor variables are correlated, only one gets inserted into the model and the other is eliminated.

3. The estimated regression coefficients tend to be biased.