Random Forests for Machine Learning
In this blog post about Random Forests we will be covering the topics mentioned below:
- What is an ensemble?
- Introduction to Bagging
- What is Random Forest?
- Difference between a random forest and decision tree
- Bagging and feature selection
- Random Forest kernel
- Implementation with scikit-learn
- Difference between bagging and boosting
- Advantages of Random Forest
Previously, we discussed decision trees for classification problems. Now we come to random forest, a remarkably versatile algorithm that works well in most cases. If you cannot decide which algorithm suits your particular problem, random forest is a solid default choice.
What is an ensemble?
Random Forest is built on the ‘ensemble’ technique. So the question comes to mind: “What is an ensemble?” An ensemble means designing multiple classifiers (such as decision trees) and combining them all into a single concluding classifier. Ensembles are powerful enough that they have been used to win many machine learning competitions.
Ensemble methods come in two main flavours: “Bagging” and “Boosting”. Random forest uses the bagging approach, so this post discusses bagging; in the next post we will discuss boosting, in particular “AdaBoost”, one of the best-known boosting methods.
With so many new words flying around (ensembles, bagging, boosting, AdaBoost), don’t get befuddled. Here is the map:
1. Bagging (the approach random forest uses)
2. Boosting (AdaBoost is a boosting technique)
Introduction to Bagging
Bagging is the process of drawing subsamples from the original sample and building a separate classifier on each, then combining all the classifiers to generate a single classifier. Each classifier may also be given a weight reflecting how much it contributes to the final decision.
The diagram below illustrates this: W1, W2, W3, … are the weights given to each tree (classifier).
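The subsampling step can be sketched with NumPy; this is a minimal illustration using an assumed toy data set of ten points:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a toy data set of 10 points (illustrative assumption)

# Draw three bootstrap subsamples: each is the same size as the original data,
# sampled WITH replacement, so a point can appear more than once in a subsample.
subsamples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]

for i, s in enumerate(subsamples):
    print(f"subsample {i}: {sorted(s)}")
```

In a real bagging ensemble, each of these subsamples would be used to train its own classifier.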
What is Random Forest?
Random forest is a combination of several decision trees, where each tree depends on a random sample drawn from the original data set with replacement. “With replacement” means that on each draw, any point can be selected from the entire data set, irrespective of previous draws. A single decision tree often has high variance, so the random forest algorithm averages many trees to smooth out the extreme values.
Random forest adds extra randomness to the model: instead of searching for the best feature among all features when splitting, each tree searches for the best feature among a random subset of features. This gives the model diversity, and hence it is an improvement over a single decision tree model.
Difference between a random forest and a decision tree
In general, a random forest is nothing but a collection of decision trees, but there are some important differences. If you give your training data set to a single decision tree, it formulates one set of rules for splitting and uses those rules to make predictions. A random forest, on the other hand, randomly selects features and observations to build several decision trees and then averages (or votes on) their results.
A very deep decision tree is prone to over-fitting, while a random forest usually prevents over-fitting because it builds several decision trees of smaller depth. The trade-off is slower computation, depending on how many decision trees your random forest builds.
Bagging and Feature selection
The term bagging stands for “Bootstrap Aggregating”. Suppose we have m data points in our data set; we choose k points from them with replacement, meaning any point can be selected more than once in our subsample of k data points. In the case of random forest, the subsample is typically the same size as the original: with m data points, we draw m points with replacement.
Along with bagging, a second thing happens in the random forest case: feature selection. Rather than using the entire set of features, each decision tree is built on a subset of features chosen at random from all the possible features. So we are introducing randomness in two ways: in which data points are chosen and in which features are used. Say you decide to use k out of n features; then k features are selected randomly for every decision tree in your random forest.
To add more accuracy to your random forest, you can use a variant called extra trees (extremely randomized trees), which adds randomness within each decision tree by deciding randomly, for each feature, which value to split upon.
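Both variants are available in scikit-learn; here is a small sketch (the synthetic data set and parameter values below are illustrative assumptions, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# A synthetic data set for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Extra trees picks split thresholds at random instead of searching for the best one
extra = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("extra-trees test accuracy:", extra.score(X_test, y_test))
```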
Steps involved in random forest, as per the above explanation:
- Select some features from the total number of features; say k features are selected out of m.
- Using these k features, generate n trees.
- After this, all the generated trees are combined together to get a final model.
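The steps above can be sketched from scratch; the following is an illustrative toy implementation (the tree count, the feature count k, and the iris data set are assumptions for the sketch):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees, k = 25, 2  # number of trees, features per tree (assumed values)
trees, feat_sets = [], []
for _ in range(n_trees):
    rows = rng.choice(len(X), size=len(X), replace=True)   # bootstrap the rows
    cols = rng.choice(X.shape[1], size=k, replace=False)   # random feature subset
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feat_sets.append(cols)

# Combine the trees by majority vote over their predictions
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feat_sets)])
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the hand-rolled forest:", (pred == y).mean())
```

A production forest would also evaluate on held-out data and re-draw the feature subset at every split rather than once per tree, as scikit-learn's implementation does.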
Random Forest Kernel
Random forest combines the decisions of several decision trees into a single model decision. When we want to use a random forest as a kernel (often denoted KeRF), we view the leaves of each decision tree as a partition of the data: if some points fall together in the same cell of most of the partitions, it is very likely that they share some common properties. A kernel method is a kind of pattern-analysis technique; the basic function of such techniques is to study the most general types of relationship in a data set.
Random Forest in sklearn
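One way to run the comparison with scikit-learn might look like this (the breast-cancer data set and the parameters are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same training data with a single tree and with a forest of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("decision tree test accuracy:", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```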
If you fit the same data with a single decision tree and with a random forest, you will typically see the testing accuracy improve in the random forest case.
Difference between Bagging and Boosting
Bagging merges classifiers that are trained independently of one another, while boosting builds them sequentially. In bagging, the classifiers’ predictions are combined by voting or averaging (optionally with weights); in boosting, each classifier is built with attention to the errors of the previous classifier, in order to make better predictions each round.
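The contrast can be seen by fitting scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` on the same data; this is a sketch, with the synthetic data set and estimator counts as assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Bagging: trees trained independently on bootstrap samples, then voted
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=1)
# Boosting: weak learners trained sequentially, each focusing on prior errors
boosting = AdaBoostClassifier(n_estimators=50, random_state=1)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```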
Advantages of Random forest
1. If the number of trees is large, the model will not easily overfit.
2. It can handle the missing value problem easily.
In this blog, we discussed a lot of new terms like ensemble, bagging, and boosting. Mainly we discussed Random Forest and its algorithm. You now know how to implement a random forest with scikit-learn, and with the help of the algorithm you can develop a random forest from scratch.