Table of Contents
- Why the Name Random Forest?
- Important Terms in Random Forests
- Assumptions in Random Forests
- Strengths and Weaknesses of Random Forests
- Difference between Decision Trees and Random Forests
- How Random Forests work
- Attribute Selection in Random Forests
- Model Representation
- Loss function in Random Forests
- Overfitting and Underfitting
- Case Study
Random Forest is one of the most widely used machine learning algorithms and can be used for both classification and regression. Random forests appear in plenty of real-world applications and work like magic in hackathons.
It is an ensemble method which aggregates the predictions from a bunch of decision trees using bagging to make a more robust prediction. In this blog post, we will understand what ensemble methods are, what bagging means, and slowly build our way towards an in-depth understanding of the working of random forest.
Say we are approaching a classification problem and have a dataset D. We randomly sample datasets from D with replacement, giving us datasets D1, D2, …, Dn. On each of these datasets we train a single decision tree, giving us trained trees T1, T2, …, Tn. These trees are fully grown and have low bias but high variance.
Finally, at test time we take the majority vote of the predictions given by our decision trees as the final prediction. In the case of regression, instead of a majority vote we take the average of the trees' outputs. This reduces the variance without affecting the bias, resulting in a much better model.
This, in a nutshell, is how random forests work, and in fact this is how bagging works in general. Random forest is one of the most used bagging algorithms; in theory, we can use bagging with any machine learning algorithm of our choice!
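The bootstrap-and-aggregate loop described above can be sketched in a few lines of plain Python. This is a minimal illustration: the hard-coded tree outputs stand in for real trained decision trees.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) points from data WITH replacement (one dataset Di)."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Aggregate classifier outputs by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
D = list(range(10))                                     # the original dataset D
samples = [bootstrap_sample(D, rng) for _ in range(3)]  # D1, D2, D3

# Imagine T1..T3 are trees trained on those samples; here we just
# hard-code their hypothetical outputs for a single test point.
tree_outputs = ["yes", "no", "yes"]
print(majority_vote(tree_outputs))   # -> "yes"
```

For regression the aggregation step would simply be `sum(outputs) / len(outputs)` instead of the majority vote.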
There are some more heuristics that make random forests perform even better & make them even more “random”. We will see what they are going forward, but first, we must ask one very basic yet important question.
3. Why the Name Random Forest?
We are training a bunch of decision trees to make our predictions; this is where the "forest" part of the name comes from. We also saw that to train the trees we randomly sample points from the dataset D. We will later see that a random forest doesn't just sample data points but also features when building its decision trees.
Hence, due to this random nature and the aggregation of the decision trees, we get the name random forest.
4. Important Terms in Random Forests
Before we dive into the details of how a random forest, and importantly each of its component decision trees, works, we need to learn a few terms that are important for understanding the concept:
- Split – Breaking the data on the basis of a feature is called a split, for example splitting the data on gender or on class.
- Decision Node – A node at which the data is split further. In the gender example, the point where we split the data into Male and Female is a decision node.
- Root Node – This is where all the data is. We begin splitting the data based on the features from this point. The data can be split into 2 or more sets. We can also think of the root node as the first decision node.
- Leaf Node – This is the terminal node where we can’t split the data further.
- Branch – A subset of the entire data set that we get after splitting is called a branch.
- Parent Nodes – The node which is getting split into sub-nodes is called the parent node and the sub-nodes are called child-nodes.
- Entropy – It is a measure of randomness in the dataset.
- Information Gain – It is a measure of how much entropy is reduced after making a split. We can think of this as the information gained after making a split on a feature.
These definitions are more than enough to understand the concepts discussed further. If any of these terms seem vague right now, don't worry; you will understand everything once we move on to the inner workings of random forests and decision trees.
5. Assumptions in Random Forest
The random forest algorithm is made up of many decision trees. Decision trees are non-parametric models, which means that we do not need to make any distributional assumptions about the data. In fact, for tree-based models like decision trees or random forests, we don't even need to scale the data. All of these properties apply to the random forest as well; the only difference is that the random forest uses bagging to aggregate the decision trees.
The only assumption the random forest can be said to make is that the sampling is representative. This is a common assumption whenever we do random sampling.
Say our dataset has a feature with 2 categories, distributed as, say, 70% category 1 and 30% category 2. The representative-sampling assumption is that when we sample the dataset with replacement, the distribution of this categorical feature stays roughly the same (70:30) in our new datasets. If not, our decision trees may misclassify some points based on which category is over-represented.
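A quick sanity check of this assumption: with a reasonably large bootstrap sample, the category proportions of a 70:30 feature stay close to the original. The numbers here are purely illustrative.

```python
import random

rng = random.Random(0)
# A feature with a 70:30 category split, as in the example above.
population = ["cat1"] * 70 + ["cat2"] * 30

# Bootstrap a large sample with replacement and check that the
# category proportions stay close to the originals.
sample = [rng.choice(population) for _ in range(10_000)]
prop_cat1 = sample.count("cat1") / len(sample)
print(round(prop_cat1, 2))   # close to 0.70
```

With small samples the proportions can drift much further from 70:30, which is exactly when the assumption starts to fail.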
6. Strengths and Weaknesses of Random Forest
As we discussed in the above section, the random forest has several strengths:
- It does not make any assumptions about the data.
- It can easily handle high dimensionality in the data and gives higher accuracy than most classifiers.
- It is fairly interpretable, as it computes feature importances internally.
- It allows us to compute the out-of-bag (OOB) error, which shows how the model would perform on unseen (test) data without having to split our data into training and validation sets. In practice we still do the split, but this is an advantage every bagging algorithm gives us. We will look at OOB error in more detail later in this post.

It also has weaknesses:
- Training many fully grown decision trees has a high computational cost.
- Prediction is slow, as we are getting predictions from a bunch of trees and then aggregating them.
- Even though random forests are interpretable, a single decision tree offers a better interpretation.
7. Difference between Decision Trees and Random Forests
Apart from the major difference that a random forest is an ensemble method (a collection of decision trees), another key difference is that a single decision tree considers all the features in the data when splitting, whereas each tree in a random forest splits on a randomly selected subset of the features.
8. How a Decision Tree Works
To understand how a random forest works, we need to first understand how a single decision tree works. Let's take an example to see what a decision tree would look like for a simple dataset.
The task here is to predict whether one should play tennis or not on a given day. To simplify the task, I have not included the Temperature variable in the decision tree.
We begin by deciding how to select the root node. We take each of our features and split the dataset either on some threshold (as for Humidity) or by categories (as for Outlook or Windy).
If we take the feature Outlook, the dataset is split into 3 sets: the first contains all data points with a sunny outlook, the second rain, and the third overcast.
For each of these sets we calculate the entropy, from which we calculate the information gain for the feature Outlook. Don't worry about how this exact calculation is done for now; let's denote it as IG(Outlook).
We similarly split the dataset using each of our features. If a feature is categorical, we split it into its categories; if it is numerical, each distinct value becomes a candidate threshold and we split on that threshold. For each feature, we calculate the information gain.
Finally, we will have IG(Outlook), IG(Humidity), and IG(Windy). We check which feature gives us the highest information gain; in our case that is Outlook, and hence it becomes our root node.
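The split-selection procedure above can be made concrete with the classic play-tennis data. This is a minimal sketch, not an optimized implementation; the entropy and information-gain helpers are written out in full.

```python
from math import log2
from collections import Counter

# Classic "play tennis" rows: (outlook, humidity, windy, play)
data = [
    ("sunny", "high", False, "no"),     ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"), ("rain", "high", False, "yes"),
    ("rain", "normal", False, "yes"),   ("rain", "normal", True, "no"),
    ("overcast", "normal", True, "yes"),("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),  ("rain", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),   ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),("rain", "high", True, "no"),
]

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the label proportions."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_idx):
    """Entropy of the whole set minus the weighted entropy after the split."""
    labels = [r[-1] for r in rows]
    branches = Counter(r[feature_idx] for r in rows)
    weighted = sum(
        (n / len(rows)) * entropy([r[-1] for r in rows if r[feature_idx] == v])
        for v, n in branches.items()
    )
    return entropy(labels) - weighted

gains = {name: information_gain(data, i)
         for i, name in enumerate(["outlook", "humidity", "windy"])}
root = max(gains, key=gains.get)
print(root)   # -> "outlook"
```

Outlook wins with a gain of roughly 0.247, which is why it becomes the root node.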
After deciding that Outlook is our root node, we have three branches. For each branch we repeat the same procedure to decide on our decision nodes.
There are 3 main stopping criteria for a decision tree:
- If we reach a pure node. In our example, the overcast branch only contains ‘yes’ as the target variable so if the Outlook is overcast we can simply predict ‘yes’.
- If we reach the max depth. Max depth is a parameter that we can control. In our example the max depth was 2, so once the tree reaches a depth of 2 we stop growing it.
- If there are not enough samples in the node. The minimum number of samples required to make a split is also a parameter we can control. Say the minimum is 5: after splitting on Humidity we have 2 branches, and for Humidity > 75 we have only 3 samples, so again we stop growing the tree.
If a leaf node is not pure (unlike the overcast branch), we simply make the prediction based on the majority vote in that node.
So after constructing our decision tree from the training data, at test time we just look at a point's features, follow the decision nodes down to a leaf, and make the prediction.
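Putting the splitting rule and the stopping criteria together, a toy recursive decision tree might look like the following. This is a simplified sketch for categorical features only; real implementations also handle numeric thresholds.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    groups = Counter(r[i] for r in rows)
    after = sum(n / len(rows) * entropy([r[-1] for r in rows if r[i] == v])
                for v, n in groups.items())
    return entropy([r[-1] for r in rows]) - after

def build_tree(rows, depth=0, max_depth=2, min_samples=5):
    labels = [r[-1] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, max depth reached, too few samples.
    if len(set(labels)) == 1 or depth >= max_depth or len(rows) < min_samples:
        return majority                       # leaf: predict the majority class
    best = max(range(len(rows[0]) - 1), key=lambda i: info_gain(rows, i))
    return {  # decision node: one branch per category of the best feature
        "feature": best,
        "branches": {v: build_tree([r for r in rows if r[best] == v],
                                   depth + 1, max_depth, min_samples)
                     for v in set(r[best] for r in rows)},
    }

def predict(tree, row):
    while isinstance(tree, dict):             # walk decision nodes to a leaf
        tree = tree["branches"][row[tree["feature"]]]
    return tree

# Tiny made-up example: (outlook, windy) -> play
rows = [("sunny", True, "no"), ("sunny", False, "no"),
        ("overcast", True, "yes"), ("overcast", False, "yes"),
        ("rain", False, "yes"), ("rain", True, "no")]
tree = build_tree(rows, min_samples=2)
print(predict(tree, ("overcast", True)))      # -> "yes"
```

The three `if` conditions in `build_tree` correspond exactly to the three stopping criteria listed above.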
We have understood how a single decision tree works, let us now see how a collection of these trees work.
9. Working of Random Forest
In a nutshell, a random forest can be summarized by the following equation:

Random Forest = decision trees (base learners) + row sampling with replacement (bagging) + column sampling (feature bagging) + aggregation of predictions
Let’s look at each part of this equation now individually.
- First come the base learners. As we know, a random forest is an ensemble method, i.e. it aggregates the results of multiple models, and here those models are decision trees.
- Row Sampling is a concept that comes from bagging. We sample data points from our data with replacement and form new data sets on which we then train our base learners. Understanding what sampling with replacement means here is important.
If the size of our original data set is 1000, then the size of each sampled data set is also going to be 1000, but the data points will not be 100% the same: some points might be repeated and some left out. Say our original dataset contains the points [a, b, c]; then one of the new data sets might be [b, c, b].
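Sampling with replacement is one line of Python; the exact sample you get depends on the random seed.

```python
import random

rng = random.Random(1)
original = ["a", "b", "c"]
# A bootstrap sample has the same size as the original data set,
# but points can repeat and some points can be missing.
sample = [rng.choice(original) for _ in range(len(original))]
print(sample)   # e.g. a list like ["b", "c", "b"]
```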
- Column Sampling, or feature bagging, means that each dataset does not contain the same features. This makes the individual decision trees more diverse (less correlated), so aggregating them gives better results.
To picture this: our complete dataset D is first sampled into 3 datasets D1, D2, and D3. The features are also randomly selected each time (feature bagging); then for each of these datasets we fit a decision tree, and the trees are aggregated to form our final model.
Here is where the concept of out-of-bag error comes in. If we take the first sampled data set, we can treat all the samples that are not in it as out-of-bag samples and use the tree trained on it to get an error metric on those samples. This works like a cross-validation strategy without us having to define a separate validation set.
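Identifying the out-of-bag samples for one tree is just a set difference over row indices. A minimal sketch:

```python
import random

rng = random.Random(0)
indices = list(range(10))                      # row indices of dataset D
boot = [rng.choice(indices) for _ in indices]  # bootstrap sample for one tree
oob = set(indices) - set(boot)                 # rows this tree never saw

# The tree trained on `boot` can be evaluated on the `oob` rows to get
# an out-of-bag error estimate, without a separate validation split.
print(sorted(oob))
```

On average each bootstrap sample leaves out roughly 1/e (about 37%) of the rows, so there are usually plenty of OOB samples to evaluate on.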
10. Attribute Selection in Random Forest
How does the random forest decide which attributes are important? In the case of a decision tree we can just rank the features on the basis of information gain but how do we select the important attributes in case of a random forest where there are many such decision trees?
The answer is quite simple: we calculate the mean decrease in entropy (the mean information gain) for each feature across all the trees. After calculating this average we can rank the features, just as before, and select the important attributes.
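As a sketch, averaging per-tree gains and ranking the features looks like this; the gain numbers here are made up purely for illustration.

```python
# Per-tree information gains for each feature (hypothetical numbers),
# e.g. collected while growing three trees of a forest.
per_tree_gains = [
    {"outlook": 0.25, "humidity": 0.15, "windy": 0.05},
    {"outlook": 0.20, "humidity": 0.18, "windy": 0.04},
    {"outlook": 0.30, "humidity": 0.12, "windy": 0.06},
]

features = per_tree_gains[0]
importance = {f: sum(g[f] for g in per_tree_gains) / len(per_tree_gains)
              for f in features}
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)   # -> ["outlook", "humidity", "windy"]
```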
11. Loss Functions in Random Forest
Let's try to understand how entropy and information gain are calculated.
For a random variable that can take k values with probabilities p1, …, pk, the entropy is given as:

H = −Σi pi · log(pi)

Usually, the base of the logarithm is taken as 2 or e. Now, in the play tennis data set (before any splitting), with 9 'yes' and 5 'no' examples, the entropy is:

H = −(9/14) · log2(9/14) − (5/14) · log2(5/14) ≈ 0.94
Entropy takes values between 0 and 1 (for two classes with log base 2). When the probability of a class is 0.5 (no category dominates the other), the entropy is at its maximum of 1, and when the probability is at the extremes (one category dominates) the entropy is 0.
Now, after calculating the entropy on the entire data set, we split the data on some feature; say we get 2 branches after splitting. We then calculate the entropies of both branches. Let's denote them H1 and H2.
Information gain is given by the formula:

IG = H(D) − Σi (|Di| / |D|) · H(Di)

where |Di| denotes the size of the i-th data set we get after splitting and |D| is the size of the entire data set. This is nothing but the difference between the entropy of the entire data set and the weighted sum of the entropies of the sub-datasets (branches). Information gain simply denotes the reduction in entropy.
Gini impurity is another measure of randomness, just like entropy. For class probabilities p1, …, pk it is:

G = 1 − Σi pi²
Calculating Gini impurity is computationally more efficient because it does not involve a logarithm, so it is often used in place of entropy when computing the gain from a split.
The main difference between entropy and Gini impurity is their scale: the maximum value of Gini impurity is 0.5, so when categories are represented equally, Gini impurity is at its maximum of 0.5 (whereas entropy peaks at 1).
12. Model Representation
Let’s put everything together now to build a random forest in our previous play tennis data set.
First, we choose the number of decision trees we want to train. Let's take n = 3. Then we need to select the number of features to sample; a good method is to take the number of features as √d, where d is the dimensionality of the data set. So here we can take 2 features at a time.
We randomly sample points with replacement from the data set into 3 data sets. Each data set contains 2 features that are randomly sampled without replacement. So the first sampled data set might, for example, contain only the Outlook and Humidity columns.
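The sampling described in these steps might be sketched as follows; the feature names are from the play-tennis example, and row indices stand in for actual rows.

```python
import math
import random

rng = random.Random(0)
features = ["outlook", "temperature", "humidity", "windy"]
n_rows, n_trees = 14, 3

k = round(math.sqrt(len(features)))   # sqrt(d) features per tree -> 2
datasets = []
for _ in range(n_trees):
    # Rows: sampled WITH replacement (bagging).
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Columns: sampled WITHOUT replacement (feature bagging).
    cols = rng.sample(features, k)
    datasets.append((rows, cols))

for rows, cols in datasets:
    print(cols)
```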
Now we train a decision tree on this data set. The tree is grown by reducing the entropy (or Gini impurity) at each split, and it is fully grown, i.e. until all leaf nodes are pure.
Step 3 is repeated for each of the data sets to get our base models.
Here we also calculate the out-of-bag error using the samples that are not in the current data set.
Now we will have 3 models; each has seen a different subset of points and has been trained on a different combination of features, so each model is different from the rest. All of these models have low bias and high variance, and they are aggregated using bagging to form our final random forest model.
13. Overfitting and Underfitting
The random forest algorithm has lower variance than a single decision tree and hence is less prone to overfitting. Usually, as the number of base learners increases, the variance decreases further and the model becomes even less prone to overfitting.
If the data set is very small, then the random forest is prone to overfitting; the best bet in that case is to get more data.
The more features we sample, the higher the chance of overfitting, so we should maintain a reasonable feature-sampling ratio. As mentioned before, √d is a good default, but there are other methods as well.
There are 2 common methods of regularizing random forest to reduce variance:
- In a random forest, we can control the depth of the decision trees. Theoretically, fully grown trees can give better results, but in real-world scenarios the chance of overfitting increases with depth, while insufficient depth can cause underfitting.
- Another parameter that controls overfitting/underfitting is the minimum number of samples required to split a node. If it is too small, we will again tend to overfit.
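If scikit-learn is available, both regularization knobs map directly onto `RandomForestClassifier` parameters. This is a sketch on synthetic data, not a tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; in practice use your own training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The two regularization knobs discussed above: tree depth and the
# minimum number of samples required to split a node.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=4,            # shallower trees -> less variance
    min_samples_split=10,   # larger value -> less overfitting
    max_features="sqrt",    # sample ~sqrt(d) features per split
    oob_score=True,         # out-of-bag estimate of generalization
    random_state=0,
)
model.fit(X, y)
print(round(model.oob_score_, 2))
```

The `oob_score_` attribute gives exactly the out-of-bag estimate discussed earlier, with no separate validation split.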
There are further heuristics to increase the efficiency of a random forest:
- Extremely Randomized Trees (ExtraTrees): When splitting on a numeric (real-valued) feature, a decision tree normally tries each value as a possible threshold and computes the information gain to pick the best one. In this variant, not all values are tried as thresholds; instead, a random sample of thresholds is considered. This reduces the variance even further, at the cost of a slightly increased bias.
- Regularized Random Forest (RRF): In this variant, the information gain of a feature is penalized if it is similar to the information gain already obtained from another feature. Basically, if one feature gives a particular information gain and another feature later gives a similar information gain, the gain from the second feature is penalized.
We have now understood decision trees, bagging, and random forests. Next, let's see how to use them in Python.