# Decision Tree for Machine Learning

**Overview**

- Introduction
- What is a decision tree?
- Explanation with example
- Types of decision tree
- Splits for continuous input
- Best feature to split on
- How to avoid overfitting?
- Implement our model with scikit-learn
- Summary

## Introduction to Decision Trees

Decision trees may not sound exciting at first. How can a tree of decisions help solve machine learning problems? It is going to be very exciting, so let's get started. In real life we are often unsure which path to take. So we sketch a tree in our mind of which choices lead to good results and which lead to bad ones. A decision tree works in a similar fashion in machine learning.

A decision tree is a powerful predictive model on its own. Decision trees are also the building blocks of ensemble methods such as AdaBoost, which combine many trees into a stronger model. Such ensembles have helped win many Kaggle competitions.

It's time to dive into the algorithmic details.

## What is a decision tree?

A decision tree is a supervised learning algorithm used for both classification and regression problems. At each step we split the data into two subsets (a binary tree) or more, and we repeat the process recursively on each subset. A decision tree is a graphical representation of all possible solutions to a decision. The basic intuition is to divide a large dataset into smaller datasets based on certain rules until each dataset is small enough to contain a single label. Each internal node tests a feature, and the branches represent the possible outcomes of that test. The final outcome is given by a leaf node, which has no branches.

In this tutorial we will follow a single example: churn prediction (will a customer stay with the bank or not?).

Some terms are commonly used with decision trees. This tutorial keeps things simple, but you should be aware of these terms:

1. Root Node: the topmost node in the tree.

2. Parent Node: a node that is split further into sub-nodes.

3. Leaf Node: a node where splitting stops.
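
To make these terms concrete, here is a minimal sketch of a tree written as plain Python objects. The class names `Split` and `Leaf`, the `predict` helper, and the thresholds (45,000 for Balance, 40 for Age) are made up for illustration; nothing here is learned from data.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # predicted class; a leaf has no branches


@dataclass
class Split:
    feature: str      # feature tested at this node
    threshold: float  # go left if value < threshold, otherwise right
    left: Union["Split", Leaf]
    right: Union["Split", Leaf]


def predict(node, row):
    # Walk down from the root until a leaf is reached.
    while isinstance(node, Split):
        node = node.left if row[node.feature] < node.threshold else node.right
    return node.label


# Hypothetical hand-built tree: the root tests Balance.
# Its right child is a parent node that tests Age.
root = Split("Balance", 45_000,
             left=Leaf("stay"),
             right=Split("Age", 40, left=Leaf("stay"), right=Leaf("exit")))

print(predict(root, {"Balance": 30_000, "Age": 55}))  # stay
print(predict(root, {"Balance": 60_000, "Age": 55}))  # exit
```

Here `root` is the root node, the inner `Split` on Age is a parent node, and every `Leaf` is a leaf node.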

## Explanation of the Example

In churn prediction, we have data on the customers of a bank. The features include CreditScore, Balance, Age and Gender. On the basis of these features we will build a decision tree that predicts whether a customer will exit or not (the target variable). The dataset (made up just for this example) is shown below:

## Types Of Decision Tree

1. Regression Decision Trees

2. Classification Decision Trees

With decision trees we often deal with a binary classification problem. But we can also build a decision tree for the multi-class case, where the target can take k different labels. Have a look.

The above tree is an example of a binary classification problem where each node has two decision branches. Such cases are easy to deal with, but what if the features are continuous? We can convert a continuous feature into a categorical one, but a continuous variable has infinitely many possible boundaries. How are we going to decide which boundary to choose?

## Splits for continuous inputs

In our example, Balance is a continuous input. That does not mean we should create a branch for every distinct value, as that would cause overfitting.

Instead, one or more threshold splits need to be chosen for the continuous variable Balance, for example Balance < 45k versus Balance > 45k.
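
One common way to get candidate thresholds is to take the midpoints between consecutive distinct values of the feature. The sketch below assumes that approach; the function name `candidate_thresholds` and the balance values are made up for illustration.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


balances = [12_000, 30_000, 30_000, 60_000, 90_000]  # made-up balances
print(candidate_thresholds(balances))  # [21000.0, 45000.0, 75000.0]
```

Each candidate can then be scored with a splitting criterion, and only the best threshold is kept.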

Now that continuous inputs are covered, let's go back to the main decision tree algorithm. The next question is: which feature is best to split on, and how do we decide?

## Best Feature to Split On For Decision Trees

There are different splitting criteria for decision trees. We will discuss them one by one.

**Gini Index**: This metric favours the split whose child nodes are least impure, i.e. have the highest purity. The score used in this tutorial is the sum of the squared class proportions in a node, so a higher value means a purer node. (The standard Gini impurity is 1 minus this sum, so there a lower value is better.) Continuing with the churn prediction example, we can split on two attributes, Gender and Balance. We compute the score for both attributes and split on whichever has the higher value. Let's see the calculation.

Gini index for Balance:

Gini index for child node (< 45k) => (4/22)*(4/22) + (18/22)*(18/22) = 0.70

Gini index for child node (> 45k) => (4/18)*(4/18) + (14/18)*(14/18) = 0.65

Gini index value for Balance attribute is => (22/40)*0.70 + (18/40)*0.65 = 0.68

Gini index for child node Female => (16/20)*(16/20) + (4/20)*(4/20) = 0.68

Gini index for child node Male => (6/20)*(6/20) + (14/20)*(14/20) = 0.58

Gini index for node Gender => (20/40)*0.68 + (20/40)*0.58 = 0.63

We will split on Balance, as its value is higher than that of Gender.
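
The same quantities can be recomputed from the class counts in each child node. This is a small sketch, not a library API: the helper names `purity` and `weighted_purity` are made up, and the counts are the ones used in the calculation above.

```python
def purity(counts):
    """Sum of squared class proportions in one node (1.0 means perfectly pure)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def weighted_purity(children):
    """Average the children's purity, weighted by how many points each holds."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * purity(c) for c in children)


balance_split = [(4, 18), (4, 14)]  # class counts in the < 45k and > 45k nodes
gender_split = [(16, 4), (6, 14)]   # class counts in the Female and Male nodes

print(round(weighted_purity(balance_split), 2))  # 0.68
print(round(weighted_purity(gender_split), 2))   # 0.63
```

Subtracting each result from 1 gives the standard Gini impurity, where the lower value wins; the choice of Balance is the same either way.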

**Information gain**: The attribute that provides the most information is chosen for splitting. One more term is involved here: entropy, which measures how mixed the labels in a node are. Entropy and information gain are inversely related, so once we have the entropy after a split we can get the information gain. The more homogeneous the child nodes of an attribute are, the lower the entropy and the higher the information gain. Let's see the calculation (the figures below use base-10 logarithms). Entropy is calculated as:

Entropy for child node (< 45k) => -(4/22)*log(4/22) - (18/22)*log(18/22) = 0.206

Entropy for child node (> 45k) => -(4/18)*log(4/18) - (14/18)*log(14/18) = 0.230

Entropy value for Balance attribute is => (22/40)*0.206 + (18/40)*0.230 = 0.217

Information Gain(Balance) => 1-0.217 = 0.783

Entropy for child node Female => -(16/20)*log(16/20) - (4/20)*log(4/20) = 0.217

Entropy for child node Male => -(6/20)*log(6/20) - (14/20)*log(14/20) = 0.265

Entropy for node Gender => (20/40)*0.217 + (20/40)*0.265 = 0.241

Information Gain(Gender) => 1-0.241 = 0.759

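Hence, the information gain for Balance is higher, so we select the Balance attribute to split on. Strictly speaking, information gain is the entropy of the parent node minus the weighted entropy of the child nodes. The parent node is the same for both attributes, so using 1 in its place does not change which attribute wins.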
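
The entropy figures can be checked with a few lines of Python. This is a sketch with made-up helper names (`entropy`, `weighted_entropy`); the base-10 logarithm is an assumption chosen to match the figures in this tutorial, while base 2 is the more common convention.

```python
import math


def entropy(counts, base=10):
    """Entropy of one node from its class counts (0.0 means perfectly pure)."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total, base) for c in counts if c > 0)


def weighted_entropy(children, base=10):
    """Average the children's entropy, weighted by how many points each holds."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * entropy(c, base) for c in children)


balance_split = [(4, 18), (4, 14)]  # class counts in the < 45k and > 45k nodes
gender_split = [(16, 4), (6, 14)]   # class counts in the Female and Male nodes

print(round(weighted_entropy(balance_split), 3))  # 0.217
print(round(weighted_entropy(gender_split), 3))   # 0.241
```

The split with the lower weighted entropy has the higher information gain, whichever logarithm base is used consistently.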

**Classification Error**: We calculate the error of each candidate split, and the split with the lowest error is considered the best. Regression trees and classification trees use different cost functions. Here we go into depth on how a classification tree works.

Calculation of the classification error of a particular node:

Classification error = no. of mistakes / total no. of data points

Calculation of the no. of mistakes:

For a split, the data points that don't belong to the majority class of their node are counted as mistakes. An example will make this clearer.

As shown in the above two figures, we are splitting on the basis of Balance and Gender. The classification error when we split on the feature “Balance” is:

Classification Error(balance) = (4+4)/40 = 0.20

The classification error when we split on the feature “Gender” is:

Classification Error(gender) = (4+6)/40 = 0.25

Hence, splitting on Balance is the better choice.
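
The mistake counting can be written as a short function. This is a sketch with a made-up helper name, `classification_error`, applied to the same class counts as before.

```python
def classification_error(children):
    """Fraction of points that are not in the majority class of their node."""
    mistakes = sum(sum(c) - max(c) for c in children)
    total = sum(sum(c) for c in children)
    return mistakes / total


print(classification_error([(4, 18), (4, 14)]))  # Balance split: 0.2
print(classification_error([(16, 4), (6, 14)]))  # Gender split: 0.25
```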

If we keep splitting in this way, our tree will grow deeper and more complex, which in machine learning terms causes overfitting. Therefore, our next topic is how to avoid overfitting.

## How to avoid Overfitting?

There are two techniques in general:

- Early Stopping
- Pruning

Early stopping can be done in two ways:

- Limit the depth of the tree: choose a maximum depth, after which no further splitting occurs.
- Stop if the classification error is not reducing.

Pruning, on the other hand, simply means cutting away a branch of the tree if it is not useful.
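
The two early-stopping rules can be sketched in a tiny tree builder. This is a simplified illustration under stated assumptions, not how any particular library works: it handles a single numeric feature, uses the number of mistakes as its criterion, and the names `build`, `majority`, `mistakes` and the data are all made up. Pruning is not covered here.

```python
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def mistakes(labels):
    # Points that do not belong to the majority class.
    return len(labels) - Counter(labels).most_common(1)[0][1]


def build(xs, ys, depth=0, max_depth=2):
    """Grow a tree on one numeric feature, stopping early when allowed."""
    # Early stop 1: depth limit reached (or the node is already pure).
    if depth >= max_depth or mistakes(ys) == 0:
        return majority(ys)
    best = None
    values = sorted(set(xs))
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2  # candidate threshold between two distinct values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = mistakes(left) + mistakes(right)
        if best is None or err < best[0]:
            best = (err, t)
    # Early stop 2: no split reduces the number of mistakes.
    if best is None or best[0] >= mistakes(ys):
        return majority(ys)
    t = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < t]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return {
        "threshold": t,
        "left": build([x for x, _ in lo], [y for _, y in lo], depth + 1, max_depth),
        "right": build([x for x, _ in hi], [y for _, y in hi], depth + 1, max_depth),
    }


xs = [10, 20, 30, 40, 50, 60]  # made-up balances (in thousands)
ys = ["stay", "stay", "exit", "stay", "exit", "exit"]
print(build(xs, ys, max_depth=1))
print(build(xs, ys, max_depth=3))
```

With this data both calls return the same one-split tree. With `max_depth=1` the depth limit stops the growth, and with `max_depth=3` the second rule stops it, because no further split reduces the number of mistakes.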

## Implementing with Scikit-learn

```
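from sklearn import tree

# Toy training data: two samples with three features each
X = [[1, 2, 3], [6, 5, 4]]
# Class label for each sample
Y = [0, 1]

# Create the decision tree classifier
clf = tree.DecisionTreeClassifier()
# Fit the model on the training data
clf = clf.fit(X, Y)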
```

```
# Predict the class of a new sample
clf.predict([[1, 1, 4]])
```