Boosting algorithms are very popular, and XGBoost has sat at the top of the mountain for a very long time. Let's build some understanding of what boosting algorithms are all about.

## Boosting Algorithms

Before we get to the go-to Kaggle competition algorithms, XGBoost and LightGBM, we need to cover some basics.

## Ensemble Learners

An ensemble is a collection of weak predictors whose outputs are combined into a final prediction. Many different predictors trying to predict the same target variable will tend to perform better than any single predictor alone. Ensemble techniques can be broken into stacking, bagging, and boosting.

### Bagging Techniques

Murder was the case: imagine a courtroom drama where the evidence is reviewed not by one judge but by a panel of judges to reach the final ruling.

Instead of a single judge making the decision, all the judges reach independent decisions that are combined into the final ruling. Bagging techniques work the same way: several independent predictors (the judges) make a combined decision (the panel's final ruling). The final prediction is put together by either a majority vote or a weighted average of all the independent predictors. To make sure the predictors differ from each other, each individual model is trained on a random sub-sample (bootstrap) of the data. This is done while maintaining the same probability for each data point to enter each model.
An example of this technique is the random forest. Advantages:

• Handles over-fitting
• Reduces variance
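The panel-of-judges idea can be sketched in a few lines with scikit-learn (assumed available here): a `BaggingClassifier` trains each tree on a bootstrap sample and majority-votes their predictions. The dataset and hyperparameter values below are arbitrary, for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 25 independent trees (judges), each trained on a bootstrap sample;
# the final prediction is a majority vote across the trees.
panel = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
panel.fit(X_train, y_train)
print(round(panel.score(X_test, y_test), 2))  # held-out accuracy of the panel
```

Setting `bootstrap=True` is what gives every data point the same chance of entering each individual model.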

### Boosting Techniques

A relay race is what comes to mind when I think of boosting techniques. Models are built sequentially, each runner (model) passing on what has been accomplished to the next runner (model), who tries to further improve the result.

Subsequent predictors learn from the mistakes of the previous predictors. To take an example: we have 100 images of fruit and we want to classify which are images of apples and which are not. Assume that our first model predicts with 70% accuracy. The next model will concentrate on the errors, the misclassified images, of the first model. The objective is to minimize these errors, so our second model will try to improve accuracy by getting closer to the actual values. At this juncture, two points need to be addressed:

• Unlike in bagging ensemble techniques, data points are not equally likely to enter the next model.
• If we keep iterating long enough, we are prone to over-fit the data, so a termination criterion must be set to prevent over-fitting.

Now let us say that we stopped at the 4th model with an 83% accuracy. Thus, boosting focuses on examples that are misclassified or have higher errors, and iteratively improves results by reducing bias and variance.

*Table 1: Model Comparison*
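The first bullet above, unequal sampling, can be sketched in a few lines of NumPy. This is an illustrative reweighting rule, not any particular algorithm's exact update: misclassified points get their weights increased so the next model is more likely to see them.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])  # actual labels (apple / not apple)
y_pred = np.array([1, 0, 0, 1, 1, 1])  # first model's predictions

weights = np.full(len(y_true), 1 / len(y_true))  # start uniform, as in bagging
wrong = y_pred != y_true

# Upweight the misclassified points, then renormalize so the weights sum to 1.
weights[wrong] *= 2.0
weights /= weights.sum()

# The misclassified images now carry more weight, so they are more likely
# to enter the training sample of the next model.
print(np.round(weights, 3))
```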

There are several boosting algorithms, such as AdaBoost (Adaptive Boosting), but for now we will concentrate on GBM and build up to XGBoost.

The objective of regression/classification models is to reduce the error, that is, to minimize the loss function. Let's look at how gradient descent works.

Picture the loss function as a curve: with each iteration we take a step towards its minimum. The update rule for this process is as follows:

New prediction = old prediction − (step size) × (slope of the loss function)
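That update rule can be run directly. A toy sketch: minimizing the loss L(x) = (x − 3)², whose slope is 2(x − 3), starting from 0 with a step size of 0.1 (both values chosen arbitrarily for the example).

```python
def slope(x):
    """Derivative of the loss L(x) = (x - 3)**2."""
    return 2 * (x - 3)

prediction = 0.0  # old prediction
step_size = 0.1

for _ in range(100):
    # New prediction = old prediction - (step size) * (slope of the loss)
    prediction = prediction - step_size * slope(prediction)

print(round(prediction, 4))  # → 3.0, the minimum of the loss
```

Each step moves the prediction downhill along the loss curve; a step size that is too large would overshoot the minimum instead.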


The intuition behind GBM is to simply minimize the loss function with each model/iteration. We first model the data and analyze the errors. The next model tries to find patterns in these errors and predict the hard-to-fit data points correctly. This is carried out iteratively until the termination criterion is reached. At the end, a weight is assigned to each predictor, and the weighted predictors are combined to get the final solution.
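That loop, fit the data, take the errors, fit the errors, can be sketched with scikit-learn trees. This is a bare-bones illustration with squared-error loss, where the errors (residuals) are exactly the negative gradient of the loss; the data and hyperparameters are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                     # the "step size" of gradient descent
prediction = np.full_like(y, y.mean())  # model 0: just predict the mean
trees = []

for _ in range(50):
    residuals = y - prediction          # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)              # the next model learns the mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(round(float(np.mean((y - prediction) ** 2)), 4))  # training MSE after 50 rounds
```

Each tree's contribution is scaled by the learning rate, which plays the role of the step size in the gradient descent update above.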

We have explained how gradients work in more detail in the deep learning series here.

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm. It is like an upgrade to GBM, with a much more optimized and faster implementation.

How can you speed up an algorithm that works by running models sequentially?

This is wonderfully put together by Zhanpeng Fang, and you can read more about it here.

The best parts of XGBoost are:

• Regularization: this is a penalty function that ensures the model does not become too complex or over-fit. The objective function we minimize is the sum of the loss function and the penalty function.

If we make too many splits in our decision trees (increasing complexity) and over-fit the data, the regularization penalty will be high.
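In the Chen and Guestrin paper, the penalty for a single tree f is Ω(f) = γT + ½λ‖w‖², where T is the number of leaves and w the vector of leaf weights. A tiny sketch of how the penalty grows with tree complexity (the γ, λ, and leaf-weight values here are arbitrary):

```python
def penalty(leaf_weights, gamma=1.0, lam=1.0):
    """XGBoost-style tree penalty:
    gamma * (number of leaves) + 0.5 * lambda * (sum of squared leaf weights)."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

shallow = penalty([0.4, -0.4])                      # 2 leaves
deep = penalty([0.4, -0.4, 0.9, -0.9, 0.2, -0.2])   # 6 leaves

print(shallow, deep)  # the deeper, more complex tree pays a larger penalty
```

Adding this term to the loss is what pushes the optimizer away from trees with many splits and large leaf weights.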

• Missing values: XGBoost treats missing values automatically. At every tree split there are only two directions to venture into, and the algorithm determines the optimal default direction for missing entries by evaluating the gain of each direction using only the non-missing entries. More information is available in the paper by Tianqi Chen and Carlos Guestrin on pages 4–5.
• Tree pruning: GBM follows a greedy approach and stops splitting at the first negative gain; however, a split with negative gain may be followed deeper in the tree by splits with positive gain. XGBoost does not ignore such cases: while pruning, it removes a split after the first negative information gain only if no positive gains follow it.
• Cross-validation: XGBoost comes with an inbuilt cross-validation routine, making life easier for the user.

In the next post, we will discuss what the respective hyperparameters in XGBoost mean.

Further improvements have been made with LightGBM and CatBoost.