Intro to Machine Learning 5 | Random Forest

Series: Intro to Machine Learning

1. Improving Decision Tree Generality

We have seen that a decision tree can reach high accuracy, but it may overfit like crazy. So the goal for the new model is to keep that high accuracy while improving generality. However, tree building is deterministic: starting from the same root set, we will grow exactly the same tree every time. The two techniques we use for creating independent trees are:

  • Bootstrapping/Bagging: Weaken each tree by training it on a randomly selected subset of the records (sampled with replacement).
  • Amnesia: Forget some features as we create each decision node. Without this, if there is one strongly predictive variable out of the 𝑝 features, all trees would look similar.

Another technique is extremely randomized trees (Extra-Trees), which do not rely on bagging; instead, at each decision node they draw a random split threshold for each candidate feature and keep the most predictive of those random splits. A sketch of both models follows.
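As a minimal sketch of these two flavors of randomization (the synthetic dataset and hyperparameter values below are my own illustrative choices), scikit-learn's RandomForestRegressor combines bagging with max_features-style amnesia, while ExtraTreesRegressor skips bagging by default and randomizes the split thresholds instead:

```python
# Illustrative comparison: bagged random forest vs. extremely randomized trees.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Random forest: bootstrapping (bagging) plus "amnesia" via max_features
rf = RandomForestRegressor(n_estimators=100, max_features=0.5,
                           bootstrap=True, random_state=0)

# Extra-Trees: no bagging by default; randomness comes from random split thresholds
et = ExtraTreesRegressor(n_estimators=100, max_features=0.5, random_state=0)

for name, model in [("random forest", rf), ("extra trees", et)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_valid, y_valid), 3))
```

Either way the point is the same: each tree sees a different slice of the problem, so the ensemble's errors partially cancel.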

2. Amnesia in Sklearn

In scikit-learn, we use the hyperparameter max_features for amnesia. If an int, then max_features features are considered at each split. If a float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.

  • If max_features is too high, we end up with similar trees that all lean on the same strongly predictive variable.
  • If max_features is too low, accuracy can suffer because we are forgetting too many features at each node.

Therefore, we need to perform hyperparameter tuning for max_features.
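Here is a minimal tuning sketch (the candidate values and synthetic data are my own illustrative choices) showing the int, float, and string forms of max_features and how each affects validation accuracy:

```python
# Illustrative max_features sweep on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# int -> that many features per split; float -> fraction of n_features;
# "sqrt"/"log2" -> sqrt(p)/log2(p); None -> all features (no amnesia)
for max_features in [3, 0.3, "sqrt", "log2", None]:
    rf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                random_state=0)
    rf.fit(X_train, y_train)
    print(max_features, round(rf.score(X_valid, y_valid), 3))
```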

3. Properties of Random Forest

  • Accuracy initially improves greatly as we add trees.

This is because each tree sees only about 2/3 of the unique training records due to bagging, so adding more bootstrapped trees increases how much of the training data the ensemble has used.

  • Accuracy asymptotically approaches a plateau (the error approaches a minimum) instead of improving continually.

With enough trees, the ensemble has effectively seen 100% of the training data, so its accuracy levels off, approaching that of an idealized single decision tree trained on all of it.

  • RF does not overfit as more trees are added.

New trees balance each other out: one prediction might be too high, another too low, and because new trees are averaged in, each additional tree has less and less individual effect.

  • RF is relatively robust to 𝑦 outliers.

The 𝑦 outliers get shunted into their own leaves, since doing so reduces the loss, particularly when squared error is used.

  • RF is relatively robust to 𝑋 noise.

Noise variables in X aren't predictive, so they are rarely chosen as split variables.

  • RF is relatively robust to 𝑋 multicollinearity.

When two columns are collinear, the forest simply splits on one or the other at random, so multicollinearity is not a problem for prediction.

  • RF is impacted by falsely-predictive features like sales ID.

Even though RF is robust to noise in X, it can be heavily impacted by falsely-predictive features such as a sales ID. So dropping useless features also often gives a small bump.

  • Bagging helps more, the more unstable the model.

Averaging is a smoothing operator that squeezes predictions toward the centroid. If the model already has low variance, there is no point in bagging.

  • RFs are scale and range insensitive in features and target 𝑦.

This is because each decision node is very simple: it compares a feature value against a threshold but does no arithmetic on it. Likewise, the predictions for the target y are just leaf means (regression) or modes (classification), which also involve no arithmetic on the features. A quick check of this property is sketched below.
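As a quick check of the scale-insensitivity claim (my own toy setup), rescaling a feature by a positive constant should leave a forest fitted with the same random seed making the same predictions, because nodes only compare values against thresholds:

```python
# Rescaling a feature should not change the fitted forest's predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5, random_state=0)
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0          # blow up the range of one feature

rf1 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
rf2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_scaled, y)

print(np.allclose(rf1.predict(X), rf2.predict(X_scaled)))  # expected: True
```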

4. Bootstrapping Vs. Subsampling

  • Bootstrapping: Popular and safer to use.
  • Subsampling: A more recent idea where each tree trains on a sample drawn without replacement; the theory is that roughly the same performance can be obtained from subsamples of size n/2, and taking an even smaller fraction of n can further improve generality. A comparison is sketched below.
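A minimal comparison sketch (illustrative settings of my own; RandomForestRegressor itself does not offer sampling without replacement, so BaggingRegressor around a decision tree is used here as a stand-in for subsampling):

```python
# Bootstrapping (with replacement) vs. n/2 subsampling (without replacement).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=4000, n_features=20, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Bootstrapping: each tree gets n records drawn with replacement
boot = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)

# Subsampling: each tree gets n/2 records drawn without replacement
sub = BaggingRegressor(DecisionTreeRegressor(random_state=0), n_estimators=100,
                       max_samples=0.5, bootstrap=False, random_state=0)

for name, model in [("bootstrap", boot), ("n/2 subsample", sub)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_valid, y_valid), 3))
```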

5. Goal of RF Tuning Strategy

The general goal is to keep making changes that reduce the validation error, and stop when it no longer improves.

  • Start with 20 trees and work upwards until the validation error stops getting better (a loop like this is sketched after this list).
  • Sklearn's classifier uses max_features=sqrt(p) by default, so try dropping this to log2(p); the classic rules of thumb are p/3 for regression and sqrt(p) for classification.
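A minimal sketch of this loop (the data, step size, and stopping rule are my own illustrative choices): keep growing the forest while the validation error keeps improving, and stop as soon as it does not.

```python
# Grow n_estimators until the validation error stops improving.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=20, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

best_err, n_trees = float("inf"), 20
while True:
    rf = RandomForestRegressor(n_estimators=n_trees, max_features="sqrt",
                               random_state=0)
    rf.fit(X_train, y_train)
    err = 1 - rf.score(X_valid, y_valid)      # 1 - R^2 as the validation error
    print(n_trees, round(err, 4))
    if err >= best_err:                       # stopped getting better -> stop
        break
    best_err, n_trees = err, n_trees * 2      # double the forest and try again
```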

6. Out-Of-Bag (OOB) Samples

Bootstrap aggregating is also called bagging, and it means training an ensemble of models on bootstrapped samples. Random forest applies bootstrapping to reduce overfitting, and the trees in a random forest are called bagged trees. For a specific bagged tree, the set of records not used to train that tree is called its out-of-bag (OOB) sample (about 37% of the records).

  • Out-Of-Bag Predictions: for each record, average (regression) or vote (classification) the estimates from only those trees that have that record in their OOB samples.
  • Out-Of-Bag Score: compute the R² score (regression) or accuracy (classification) between the label y and its OOB predictions, skipping any records that never appear in an OOB sample.
  • Out-Of-Bag Error: equals 1 - the OOB score, where scikit-learn exposes the score as oob_score_ (see the sketch below).
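A minimal sketch (toy data of my own) of getting the OOB predictions and score from scikit-learn; oob_score=True requires bootstrap=True:

```python
# Fit with OOB scoring enabled and read back the OOB predictions and score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=15, noise=10, random_state=0)

rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           bootstrap=True, random_state=0)
rf.fit(X, y)

print("OOB R^2 score :", round(rf.oob_score_, 3))      # R^2 on the OOB predictions
print("OOB error     :", round(1 - rf.oob_score_, 3))
print("OOB predictions:", rf.oob_prediction_.shape)    # one prediction per training record
```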

7. OOB Frequently Asked Questions

  • Why does the OOB error slightly overestimate the testing/validation error? (Or: why does the OOB score slightly underestimate the testing/validation score?)

Because each OOB prediction uses only the subset of trees that did not train on that record, whereas the test/validation set is predicted with the whole forest, which presumably has lower noise/variance.

  • Why should OOB error not be used with time-sensitive datasets?

Because the validation set for time-sensitive data can't be split randomly, while OOB samples are effectively a random split across time; consider one-step-ahead (walk-forward) validation on the most recent data instead.
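As a minimal sketch of the difference (the drifting synthetic data below is my own construction), the OOB score mixes rows from all time periods, while a time-ordered split validates only on the future and is usually the harsher, more honest number:

```python
# OOB score vs. a time-ordered validation split on drifting data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 3000
t = np.arange(n, dtype=float)                     # rows are already in time order
X = np.column_stack([t, rng.normal(size=(n, 4))]) # column 0 is "time"
y = 3 * X[:, 1] + 0.02 * t + rng.normal(scale=0.5, size=n)  # target drifts over time

cutoff = int(n * 0.8)                             # train on the past only
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X[:cutoff], y[:cutoff])

print("OOB R^2       :", round(rf.oob_score_, 3))                    # rows interleaved in time
print("Time-split R^2:", round(rf.score(X[cutoff:], y[cutoff:]), 3)) # future-only, typically lower here
```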

  • When is the OOB error lower than the validation/testing error? (Or: when is the OOB score higher than the validation/testing score?)

Consider the following possible reasons:

  • The validation set is drawn from a different distribution than the training set
  • It is a time-sensitive data set
  • We didn’t extract the validation set properly
  • The model is overfitted to the data in the training set, focusing on relationships that are not relevant to the test set (e.g. sales ID case)

8. Common Feature Importance Techniques

  • Spearman's rank correlation (sketched below)
  • Principal component analysis (PCA)
  • Minimal-redundancy maximal-relevance (mRMR)

Note that these techniques should only be used on strong and stable models.
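As a minimal sketch of the first of these techniques (the toy data and the ranking-by-|rho| rule are my own illustrative choices; PCA and mRMR need their own tooling), features can be ranked by the absolute Spearman rank correlation with the target:

```python
# Rank features by the absolute Spearman rank correlation with the target.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=8, n_informative=3, random_state=0)

scores = [abs(spearmanr(X[:, j], y)[0]) for j in range(X.shape[1])]  # |rho| per feature
ranking = np.argsort(scores)[::-1]                                   # best first
print("features ranked by |Spearman rho|:", ranking)
```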

9. Feature Importance For Random Forest

  • sklearn default: Gini/MSE drop (mean decrease in impurity). This common mechanism for computing feature importance is biased: it tends to inflate the importance of continuous or high-cardinality categorical variables.
  • Drop-Column Importance: brute force and easy; however, it is very expensive and slow because it means retraining the model p times for p features.
  • Permutation Importance: easy and requires no retraining of the model (see the sketch below).
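A minimal sketch (toy data of my own) comparing scikit-learn's default impurity-based importances with permutation importance computed on a validation set:

```python
# Default impurity-based importances vs. permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("impurity-based:", rf.feature_importances_.round(3))   # Gini drop, can be biased
perm = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)
print("permutation   :", perm.importances_mean.round(3))     # no retraining needed
```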

10. Codependent Features for Feature Importance

  • Drop-column: tends to show low or zero importance scores for codependent features, because when one is dropped the others pick up the slack.
  • Permutation: pulls down the perceived importance of each codependent feature, resulting in a shared importance across them (see the sketch below).
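As a minimal sketch of the permutation case (my own toy setup): duplicate a predictive column and the two copies end up sharing the credit that the single column would otherwise get on its own.

```python
# Duplicating a predictive column makes permutation importance split the credit.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=2000, n_features=5, n_informative=3,
                       shuffle=False, random_state=0)      # columns 0-2 are informative
X_dup = np.column_stack([X, X[:, 0]])                      # last column copies column 0

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_dup, y)
perm = permutation_importance(rf, X_dup, y, n_repeats=10, random_state=0)

# Column 0 and its copy (column 5) typically split one column's worth of importance
print(perm.importances_mean.round(3))
```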