# Overview of Stepwise Regression

1. Introduction
2. Applications of Stepwise Regression
3. Mathematics for Stepwise Regression
4. Algorithm for Stepwise Regression
5. Implement a Stepwise Regression model with scikit-learn
6. Summary

## 5. Implement a Stepwise Regression model with scikit-learn

Stepwise Regression is a statistical method used to determine a useful set of predictors (independent variables) when building a regression model. In other words, it is a feature-selection process for regression models.

Stepwise Regression is important in machine learning because removing less relevant features from the model decreases the variance of its predictions.

Suppose we have a regression model of the following form:

y = b0 + b1X1 + b2X2 + b3X3 + …… + b10X10

where, y = dependent variable

X1, X2, X3, …, X10 = independent variables

b0 = intercept; b1, b2, b3, …, b10 = coefficients of the independent variables

In the above regression model, the independent variables X1, X2, …, X10 may or may not individually play a significant role in the model. Stepwise regression is used to identify which of these independent variables play a major role in determining the model and which of them can be eliminated from the equation.

There are two ways of implementing Stepwise Regression while determining the model: Forward Stepwise Regression and Backward Stepwise Regression. Both are based on the t-statistic values of the regression model.

Let us learn the stepwise regression process through an example, using the dataset in the table below.

(Note: the data is not real; the values are rough estimates.)

| Training set | Selling price of land in millions (y) | Area in sq. ft. (X1) | Size of the road to access the land (X2) | Drinking water facility (X3) | Ground water (X4) | Medical facilities nearby (X5) | Educational institutions nearby (X6) | Land fertility (X7) |
|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 200 | 12 | 90 | 80 | 60 | 60 | 90 |
| 2 | 3 | 195 | 8 | 85 | 75 | 90 | 40 | 50 |
| 3 | 5 | 221 | 12 | 23 | 90 | 24 | 43 | 54 |
| 4 | 3 | 100 | 6 | 89 | 67 | 60 | 68 | 67 |
| 5 | 5 | 312 | 6 | 67 | 78 | 98 | 78 | 67 |
| Mean | 4.4 | 205.6 | 8.8 | 70.8 | 78 | 66.4 | 57.8 | 65.6 |
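For readers who want to follow along in code, the table can be loaded as NumPy arrays (a sketch; the column order follows the table above):

```python
import numpy as np

# Selling price of land in millions (dependent variable y)
y = np.array([6, 3, 5, 3, 5], dtype=float)

# Independent variables X1..X7, one row per training set
X = np.array([
    [200, 12, 90, 80, 60, 60, 90],   # set 1
    [195,  8, 85, 75, 90, 40, 50],   # set 2
    [221, 12, 23, 90, 24, 43, 54],   # set 3
    [100,  6, 89, 67, 60, 68, 67],   # set 4
    [312,  6, 67, 78, 98, 78, 67],   # set 5
], dtype=float)

print(y.mean())        # 4.4, as in the Mean row
print(X.mean(axis=0))  # column means 205.6, 8.8, 70.8, 78.0, 66.4, 57.8, 65.6
```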

The regression model for the above table is

y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6 + b7 X7

Not all independent variables play an important role in determining the regression model; some can be eliminated from the above equation. To determine which variables to eliminate and which to keep, t-values for regressing the dependent variable on each independent variable are calculated.

The t-value can be calculated by the formula:

t = (Mx − My) / √((Sx² + Sy²) / n)

where,

Mx = mean value of X

My = mean value of Y

n = number of training sets

and Sx, Sy are computed as

S = Σ(x − M)² / (n − 1)

where,

x = individual scores,

M = mean

n = number of scores in group
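The two formulas can be translated directly into Python. This sketch assumes the t-statistic form t = (Mx − My) / √((Sx² + Sy²) / n), which is the form that reproduces the worked t1 value in this article; the function names are mine:

```python
import numpy as np

def s_value(vals):
    """S = sum of squared deviations from the mean, divided by (n - 1)."""
    vals = np.asarray(vals, dtype=float)
    return float(np.sum((vals - vals.mean()) ** 2) / (len(vals) - 1))

def t_value(x, y):
    """t = (Mx - My) / sqrt((Sx^2 + Sy^2) / n)."""
    n = len(x)
    sx, sy = s_value(x), s_value(y)
    return float((np.mean(x) - np.mean(y)) / np.sqrt((sx**2 + sy**2) / n))

# X1 (area in sq. ft.) and y (price) from the table above
x1 = [200, 195, 221, 100, 312]
y = [6, 3, 5, 3, 5]
print(round(t_value(x1, y), 4))  # 0.0787 -- the worked t1 value
```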

Initially, let us begin with the initial equation, i.e.,

y = b0 + b1X …… (1)

To determine X, we compute a table of t-values using the formula above.

Using the dataset table, we calculate the t-value of regressing y on each independent variable individually.

We will show how to calculate the value of t1; the other t-values are calculated in the same way.

For t1,

Sx = [(200 − 205.6)² + (195 − 205.6)² + (221 − 205.6)² + (100 − 205.6)² + (312 − 205.6)²] / (5 − 1)

= (31.36 + 112.36 + 237.16 + 11151.36 + 11320.96) / 4

= 5713.3

Sy = [(6 − 4.4)² + (3 − 4.4)² + (5 − 4.4)² + (3 − 4.4)² + (5 − 4.4)²] / (5 − 1)

= (2.56 + 1.96 + 0.36 + 1.96 + 0.36) / 4

= 1.8

Now, calculating t1,

t1 = (205.6 − 4.4) / √((5713.3² + 1.8²) / 5)

= 201.2 / 2555.06

= 0.0787

This is the t-value for regressing y on X1. We similarly regress y on each of the other independent variables and calculate the respective t-values, listed in the table below:

| t1 | t2 | t3 | t4 | t5 | t6 | t7 |
|---|---|---|---|---|---|---|
| 0.0787 | 1.05 | 0.186 | 2.23 | 0.072 | 0.454 | 0.56 |
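As a cross-check, the same calculation can be looped over all seven predictors. With this formula, t4 and t5 come out somewhat different from the figures in the table (the article notes its numbers are estimates), but X4 still has the largest absolute t-value, so the selection step is unchanged:

```python
import numpy as np

def t_value(x, y):
    # S = sum((v - mean)^2) / (n - 1);  t = (Mx - My) / sqrt((Sx^2 + Sy^2) / n)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sx = np.sum((x - x.mean()) ** 2) / (n - 1)
    sy = np.sum((y - y.mean()) ** 2) / (n - 1)
    return float((x.mean() - y.mean()) / np.sqrt((sx**2 + sy**2) / n))

y = [6, 3, 5, 3, 5]
predictors = {
    "X1": [200, 195, 221, 100, 312],
    "X2": [12, 8, 12, 6, 6],
    "X3": [90, 85, 23, 89, 67],
    "X4": [80, 75, 90, 67, 78],
    "X5": [60, 90, 24, 60, 98],
    "X6": [60, 40, 43, 68, 78],
    "X7": [90, 50, 54, 67, 67],
}
t_values = {name: t_value(x, y) for name, x in predictors.items()}
best = max(t_values, key=lambda k: abs(t_values[k]))
print(best)  # X4 -- the variable that enters the model first
```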

From the above table, we can see that the t-value of the fourth independent variable is the highest. So we insert the fourth variable, X4, into the equation, and equation (1) becomes:

y = b0 + b1X4 + b2X ………..(2)

In the above calculation, we regressed y on each independent variable to calculate the respective t-values. Now, with X4 selected, we regress each of the remaining independent variables on X4 in a similar process.

From the formula of the t-value, after regressing with X4, we get following t-values:

| t1 | t2 | t3 | t5 | t6 | t7 |
|---|---|---|---|---|---|
| -0.05 | 2.11 | 0.02 | 0.03 | 0.2 | 0.11 |

In the above table, we can see that the t-value of the second variable, X2, is the highest. So the variable X in equation (2) is replaced by X2, and the equation becomes:

y = b0 + b1X4 + b2X2 + b3X ………….(3)

Now, to find the next independent variable X in equation (3), we again calculate t-values, regressing each remaining independent variable on X4 and X2 combined. The t-values obtained are as follows:

| t1 | t3 | t5 | t6 | t7 |
|---|---|---|---|---|
| -0.06 | -0.07 | -0.06 | -0.144 | -0.19 |

In the above table, we can see that none of the remaining variables has a significant absolute t-value (i.e., ≥ 1). So we terminate the process, keeping the independent variables X4 and X2.

Thus, the final equation becomes:

y = b0 + b1X4 + b2X2 + ɛ ………….(4)

The equation (4) is our final regression model.
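Regarding the scikit-learn implementation promised in the section title: scikit-learn has no t-statistic-based stepwise routine, but its `SequentialFeatureSelector` performs the same greedy forward (or backward) search using cross-validated model scores instead of t-values. A minimal sketch on this dataset (with so few rows, its choice of two predictors may differ from X4 and X2):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Dataset from the table above; rows = training sets, columns = X1..X7
X = np.array([
    [200, 12, 90, 80, 60, 60, 90],
    [195,  8, 85, 75, 90, 40, 50],
    [221, 12, 23, 90, 24, 43, 54],
    [100,  6, 89, 67, 60, 68, 67],
    [312,  6, 67, 78, 98, 78, 67],
], dtype=float)
y = np.array([6, 3, 5, 3, 5], dtype=float)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,  # stop at two predictors, as in equation (4)
    direction="forward",     # "backward" gives backward stepwise
    cv=2,                    # only 5 samples, so keep the folds small
)
sfs.fit(X, y)
print([f"X{i + 1}" for i in np.flatnonzero(sfs.get_support())])
```

Setting `direction="backward"` instead starts from all seven predictors and removes them one at a time.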

## 6. Summary

When to use Stepwise Regression?

Stepwise Regression is useful when we have a large number of predictor variables and want to include only the significant ones in the regression model.

Types of Stepwise Regression

There are two types of Stepwise Regression, depending on the starting model and whether independent variables are added or deleted.

Forward Stepwise Regression

In this method of Stepwise Regression, we start with an empty set of predictors and sequentially find the predictor variable that provides the maximum improvement in the model's performance, adding it to the equation. The example illustrated above is forward stepwise regression.

Backward Stepwise Regression

In this method of Stepwise Regression, we start with the equation containing all predictors and sequentially find the predictor that contributes least to the model's performance, removing it from the equation. This process is the opposite of the example above: we start with all the independent variables, remove the insignificant ones one by one, and end up with an equation containing only significant variables.
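A minimal backward-elimination sketch, using synthetic data rather than the land example above (which has too few rows to fit all seven predictors at once): drop the predictor with the highest coefficient p-value until everything left is significant. The names, threshold, and data here are illustrative choices, not from the article:

```python
import numpy as np
from scipy import stats

# Synthetic data: only X1 and X4 truly influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=100)

def coef_pvalues(X, y):
    """Two-sided p-values of the OLS slope coefficients (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), dof)[1:]  # skip the intercept

cols = list(range(X.shape[1]))
while cols:
    p = coef_pvalues(X[:, cols], y)
    worst = int(np.argmax(p))
    if p[worst] < 0.05:   # everything left is significant: stop
        break
    cols.pop(worst)       # eliminate the least useful predictor

print([f"X{i + 1}" for i in cols])  # X1 and X4 always survive
```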