Overview of Stepwise Regression
 Introduction
 Applications of Stepwise Regression.
 Mathematics for Stepwise Regression.
 Algorithm for Stepwise Regression.
 Implement a Stepwise Regression model with scikit-learn.
 Summary

Introduction

Stepwise regression is a statistical technique used to determine a useful set of predictors (independent variables) while building a regression model. It is a feature-selection process for regression models.
Stepwise regression is important in machine learning because removing less relevant features from the model can decrease the variance of its predictions.
Consider a regression model of the following form:
y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + ... + b_{10}X_{10}
where, y = dependent variable
X_{1}, X_{2}, X_{3}, ..., X_{10} = independent variables
b_{0} = intercept, and b_{1}, b_{2}, b_{3}, ..., b_{10} = coefficients of the independent variables
In the above regression model, the independent variables X_{1}, X_{2}, X_{3}, ..., X_{10} may or may not individually play a significant role in determining the model. Stepwise regression is used to identify which of these independent variables play a major role and which can be eliminated from the equation.
There are two ways of implementing stepwise regression: Forward Stepwise Regression and Backward Stepwise Regression. Both are based on the t-statistic values of the regression model.
Let us learn the stepwise regression process through an example. Consider the dataset in the table below, on which we implement a stepwise regression model.
(Note: The values are illustrative estimates, not real data.)
| Training set | Selling price of land in millions (y) | Area in sq. ft. (X_{1}) | Size of the road to access the land (X_{2}) | Drinking water facility (X_{3}) | Ground water (X_{4}) | Medical facilities nearby (X_{5}) | Educational institutions nearby (X_{6}) | Land fertility (X_{7}) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6 | 200 | 12 | 90 | 80 | 60 | 60 | 90 |
| 2 | 3 | 195 | 8 | 85 | 75 | 90 | 40 | 50 |
| 3 | 5 | 221 | 12 | 23 | 90 | 24 | 43 | 54 |
| 4 | 3 | 100 | 6 | 89 | 67 | 60 | 68 | 67 |
| 5 | 5 | 312 | 6 | 67 | 78 | 98 | 78 | 67 |
| Mean | 4.4 | 205.6 | 8.8 | 70.8 | 78 | 66.4 | 57.8 | 65.6 |
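To make the table easier to work with, the sketch below stores each column as a Python list and recomputes the "Mean" row. The dictionary name `data` and the short keys (`y`, `X1` ... `X7`) are labels chosen for this illustration only.

```python
from statistics import mean

# Columns of the example table: y is the selling price, X1..X7 are the predictors.
data = {
    "y":  [6, 3, 5, 3, 5],
    "X1": [200, 195, 221, 100, 312],
    "X2": [12, 8, 12, 6, 6],
    "X3": [90, 85, 23, 89, 67],
    "X4": [80, 75, 90, 67, 78],
    "X5": [60, 90, 24, 60, 98],
    "X6": [60, 40, 43, 68, 78],
    "X7": [90, 50, 54, 67, 67],
}

# Column means, which should agree with the "Mean" row of the table.
means = {name: mean(values) for name, values in data.items()}
for name, m in means.items():
    print(f"{name}: {m}")
```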
The regression model for the above table is
y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + b_{4}X_{4} + b_{5}X_{5} + b_{6}X_{6} + b_{7} X_{7}
Not all independent variables play an important role in determining the regression model, so some of them can be eliminated from the above equation. To decide which variables to eliminate and which to keep, the t-value of each independent variable against the dependent variable is calculated.
The t-value can be calculated by the formula:
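t = \frac{M_{x} - M_{y}}{\sqrt{\frac{S_{x}^{2}}{n} + \frac{S_{y}^{2}}{n}}}
where,
M_{x} = mean value of X
M_{y} = mean value of y
n = number of training sets
and S_{x} and S_{y} are each computed as
S = \frac{\sum (x - M)^{2}}{n - 1}
where,
x = individual scores,
M = mean
n = number of scores in the group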
Let us begin with the initial equation, i.e.,
y = b_{0} + b_{1}X ... (1)
To determine X, we use the t-values calculated from the above-mentioned formula.
Using the above table, we calculate the t-value of each independent variable against the dependent variable individually.
We will work through the calculation of t_{1}. The other t-values are calculated in the same way.
For t_{1},
S_{x} = [(200 - 205.6)^{2} + (195 - 205.6)^{2} + (221 - 205.6)^{2} + (100 - 205.6)^{2} + (312 - 205.6)^{2}] / (5 - 1)
= (31.36 + 112.36 + 237.16 + 11151.36 + 11320.96) / 4
= 5713.3
S_{y} = [(6 - 4.4)^{2} + (3 - 4.4)^{2} + (5 - 4.4)^{2} + (3 - 4.4)^{2} + (5 - 4.4)^{2}] / (5 - 1)
= (2.56 + 1.96 + 0.36 + 1.96 + 0.36) / 4
= 1.8
Now, calculating t_{1},
t_{1} = (205.6 - 4.4) / \sqrt{(5713.3^{2} + 1.8^{2}) / 5}
= 201.2 / 2555.06
= 0.0787
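The arithmetic for t_{1} can be checked with a few lines of Python. This is a minimal sketch that assumes the denominator is \sqrt{(S_{x}^{2} + S_{y}^{2}) / n}, since that is the expression that reproduces the value 2555.06 used in the calculation. The helper names `spread` and `t_value` are made up for this illustration. Note that \sum (x - M)^{2} / (n - 1) is what is conventionally called the sample variance, so a textbook two-sample t statistic would not square it again. The sketch follows the worked numbers rather than that convention.

```python
from math import sqrt
from statistics import mean

def spread(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def t_value(x, y):
    """t-value as the worked example computes it (assumed formula, see lead-in)."""
    n = len(x)
    s_x, s_y = spread(x), spread(y)
    return (mean(x) - mean(y)) / sqrt((s_x ** 2 + s_y ** 2) / n)

x1 = [200, 195, 221, 100, 312]   # area in sq. ft. (X1)
y = [6, 3, 5, 3, 5]              # selling price (y)

print(round(spread(x1), 1))      # S_x
print(round(spread(y), 1))       # S_y
print(round(t_value(x1, y), 4))  # t_1
```

Calling `t_value` with any other column of the table in place of `x1` gives the corresponding first-round t-value under the same assumption.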
This is the t-value from regressing X_{1} on y. We similarly regress all the other independent variables on y and calculate their respective t-values, which are listed in the table below:
| t_{1} | t_{2} | t_{3} | t_{4} | t_{5} | t_{6} | t_{7} |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0787 | 1.05 | 0.186 | 2.23 | 0.072 | 0.454 | 0.56 |
From the above table, we can see that the t-value of the 4^{th} independent variable is the highest. So we insert the 4^{th} variable, i.e., X_{4}, into the equation, and equation (1) becomes:
y = b_{0} + b_{1}X_{4} + b_{2}X ... (2)
In the above calculation, we regressed the independent variables on y to calculate their respective t-values. Now that X_{4} has been selected, we calculate the next set of t-values by regressing all the other independent variables on X_{4} in a similar way.
Applying the t-value formula after regressing on X_{4}, we get the following t-values:
| t_{1} | t_{2} | t_{3} | t_{5} | t_{6} | t_{7} |
| --- | --- | --- | --- | --- | --- |
| 0.05 | 2.11 | 0.02 | 0.03 | 0.2 | 0.11 |
In the above table, we can see that the t-value of the second variable, i.e., X_{2}, is the highest. So the variable X in equation (2) is replaced by X_{2}, and the equation becomes:
y = b_{0} + b_{1}X_{4} + b_{2}X_{2} + b_{3}X ... (3)
Now, to find the next independent variable X in equation (3), we repeat the t-value calculation, this time regressing each remaining independent variable on X_{4} and X_{2} combined. The t-values obtained are as follows:
| t_{1} | t_{3} | t_{5} | t_{6} | t_{7} |
| --- | --- | --- | --- | --- |
| 0.06 | 0.07 | 0.06 | 0.144 | 0.19 |
In the above table, none of the absolute t-values is significant (i.e., >= 1). So we stop here and keep only the independent variables X_{4} and X_{2}.
Thus, the final equation becomes:
y = b_{0} + b_{1}X_{4} + b_{2}X_{2} + ɛ ... (4)
The equation (4) is our final regression model.
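The selection loop traced in this example can be sketched in a few lines. The sketch takes the t-values of each round exactly as reported in the tables (it does not recompute them) and applies the same rule: pick the variable with the largest absolute t-value, and stop once that value falls below 1. The names `reported_t`, `THRESHOLD` and `forward_select` are invented for this illustration.

```python
# t-values reported in the three tables of the walkthrough, one dict per round.
reported_t = [
    {"X1": 0.0787, "X2": 1.05, "X3": 0.186, "X4": 2.23,
     "X5": 0.072, "X6": 0.454, "X7": 0.56},
    {"X1": 0.05, "X2": 2.11, "X3": 0.02, "X5": 0.03, "X6": 0.2, "X7": 0.11},
    {"X1": 0.06, "X3": 0.07, "X5": 0.06, "X6": 0.144, "X7": 0.19},
]
THRESHOLD = 1.0  # significance cut-off used in the walkthrough (|t| >= 1)

def forward_select(rounds, threshold):
    """Pick the variable with the largest |t| in each round until none is significant."""
    selected = []
    for t_values in rounds:
        best = max(t_values, key=lambda name: abs(t_values[name]))
        if abs(t_values[best]) < threshold:
            break  # no remaining variable is significant, so stop
        selected.append(best)
    return selected

selected = forward_select(reported_t, THRESHOLD)
print(selected)  # ['X4', 'X2']
```

In a full implementation the t-values of each round would be recomputed from the data after every insertion rather than read from a fixed list.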
Summary
When to use Stepwise Regression?
Stepwise regression is useful when the dataset has a large number of candidate independent variables and we want to include only the significant ones in the regression model.
Types of Stepwise Regression
There are two types of stepwise regression, which differ in the starting model and in whether independent variables are added or deleted at each step.
Forward Stepwise Regression
In this method of stepwise regression, we start with an empty set of predictors and sequentially find the predictor variable that provides the maximum improvement in the performance of the model, adding it to the equation at each step. The example illustrated above is a forward stepwise regression.
Backward Stepwise Regression
In this method of stepwise regression, we start with the equation containing all the predictors and sequentially find the predictor that contributes least to the performance of the model, removing it from the equation at each step. This process is the opposite of the example given above: we start with all the independent variables, remove the insignificant ones one at a time, and end up with an equation containing only significant variables.
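Both directions can be tried with scikit-learn's `SequentialFeatureSelector`. The following is a minimal sketch under several assumptions. It uses synthetic data from `make_regression` rather than the land-price table, because five rows are too few for cross-validation to be meaningful. The selector ranks features by cross-validated score (R^{2} for `LinearRegression` by default), not by the t-values used in the worked example. The choice of `n_features_to_select=2` and the helper name `select` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (not the land-price table): 7 predictors, 2 of them informative.
X, y = make_regression(n_samples=200, n_features=7, n_informative=2,
                       noise=5.0, random_state=0)

def select(direction):
    """Return the column indices kept by forward or backward selection."""
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=2,   # illustrative choice, not from the article
        direction=direction,      # "forward" adds features, "backward" removes them
        cv=5,
    )
    selector.fit(X, y)
    return selector.get_support(indices=True)

forward_idx = select("forward")
backward_idx = select("backward")
print("forward: ", forward_idx)
print("backward:", backward_idx)
```

The two directions often agree on data like this, but they are not guaranteed to, because each makes greedy choices from a different starting point.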
Advantages of Stepwise Regression:
1. It provides the ability to manage the predictor variables in the regression model by eliminating or inserting them into the model according to their significance.
2. It is faster than the other automatic model-selection methods.
Disadvantages of Stepwise Regression
1. With only a small number of training samples, the resulting regression model may be unreliable.
2. If two predictor variables are correlated, only one gets inserted into the model and the other is eliminated.
3. Regression coefficients are biased, because the same data is used both to select the variables and to estimate their coefficients.