Comparative study of the 3 most commonly used boosting methods

In the era of machine learning, finding powerful algorithms that achieve high accuracy with little training time is the goal of many researchers and practitioners. However, there is always a performance-versus-time trade-off that differentiates algorithms: some perform better than others but require more time to train, and vice versa. Many studies have compared machine learning algorithms and highlighted their performance on different datasets, e.g. Osisanwo F.Y. et al., Jain N. et al., and Jacob G.

In this article, I'll present a comparative study of the top 3 most commonly used gradient boosting algorithms: XGBoost, CatBoost, and LightGBM. For a brief and concise introduction to these algorithms, please refer to this post.
The goal of this study is to compare the classification performance of these algorithms on two different data sets (one big and one small), so that we develop a better understanding of each algorithm's capability in each case. Since CatBoost, initially released in 2017, is the newest of the three algorithms, the benchmarks presented on the official CatBoost documentation website show the advantage of using CatBoost over its competitors (see the figure below).
It is clearly shown that CatBoost performed better on the majority of datasets; it only had a significantly higher training time than LightGBM on the Higgs dataset when using a CPU. Although CatBoost's results are remarkable, we should always keep in mind that benchmarks can be cherry-picked to emphasize the advantage of one method over the others, which is often the case. Moreover, knowing that XGBoost is arguably the top algorithm on the Kaggle competition winning list for structured data suggests that it is always better to check the performance on the current task and then decide which one to use.
That said, this doesn't render the study at hand pointless, because what we're trying to accomplish here is to develop an intuitive sense of what we should try first for a certain type of data set, which could save us time and computational power.

Comparative study plan

In this section, we'll go through the steps executed in this study. But first we'll present the chosen data sets, which are available to everyone on the Kaggle platform. Two data sets were chosen for a binary classification task:
  1. Heart Disease UCI (300 Samples)
  2. Rain in Australia (145k Samples)
The whole study and the results are available on this Kaggle kernel. It is important to note that it takes around 2 hours of runtime.

Data

Heart Disease UCI (HDUCI) 

This dataset is available on Kaggle. It contains 303 instances and 76 attributes (features), although only 14 attributes were used by all published experiments, according to the data publisher. This dataset will be used as an example of a small dataset to compare the performance of the chosen gradient boosting methods.

Here are more details about the 14 attributes used:
  1. age
  2. sex
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0-3) colored by flourosopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
  14. target: integer valued 0 (no presence), and 1 (presence).

Rain in Australia (RIA)

This dataset is also available on Kaggle. This dataset contains about 10 years of daily weather observations from many locations across Australia. It has 23 attributes and 145k instances, and we're going to use this relatively big dataset to compare the performance of the 3 chosen methods.
  1. Date
  2. Location
  3. MinTemp
  4. MaxTemp
  5. Rainfall
  6. Evaporation
  7. Sunshine
  8. WindGustDir
  9. WindGustSpeed
  10. WindDir9am
  11. WindDir3pm
  12. WindSpeed9am
  13. WindSpeed3pm
  14. Humidity9am
  15. Humidity3pm
  16. Pressure9am
  17. Pressure3pm
  18. Cloud9am
  19. Cloud3pm
  20. Temp9am
  21. Temp3pm
  22. RainToday
  23. RainTomorrow: Target

Cleaning & Preprocessing

Data Assessment and Cleaning

This is the first step of the plan, and it is a mandatory step that greatly facilitates the tasks that follow. Both datasets were chosen to be easy to assess for quality and tidiness issues, so that we don't waste unnecessary time on this step, which is not the core of the study. Once the assessment is done, the cleaning process is conducted; three main tasks were performed:
  • Dropping columns with a majority of NULL values.
  • Dropping instances with no target value.
  • Filling missing values with the column average.
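The three cleaning tasks above can be sketched in pandas as follows; the toy frame and the 50% missing-value threshold are illustrative assumptions, not the kernel's actual columns or cutoff:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a raw dataset (column names are hypothetical).
df = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "feature":     [1.0, 2.0, np.nan, 4.0, 5.0],
    "target":      [0.0, 1.0, np.nan, 1.0, 0.0],
})

# 1. Drop columns that are mostly NULL (here: more than 50% missing).
df = df.loc[:, df.isnull().mean() <= 0.5]

# 2. Drop instances with no target value.
df = df.dropna(subset=["target"])

# 3. Fill any remaining missing values with the column average.
df = df.fillna(df.mean())
```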

Preprocessing clean data

The preprocessing step consists of preparing the data to be used as input for the machine learning models. Since the data was cleaned and chosen to have minimal issues regarding its use for training, only one preprocessing task was performed: converting the categorical and date-type data into numerical data. Converting categorical into numerical data is achieved through one-hot encoding, where each category becomes a feature with 0 and 1 values.
Finally, the datasets were split into train and test sets, which concludes the preprocessing stage.
The obtained train and test sets are of the following sizes:
  • The training and test sizes on the HDUCI dataset are: 242 and 61 
  • The training and test sizes on the RIA dataset are: 126708 and 14079
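The encoding and splitting steps can be sketched as below; the column names are borrowed from the RIA feature list for illustration, and the 80/20 split ratio is an assumption that roughly matches the HDUCI split sizes reported above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame with one categorical column.
df = pd.DataFrame({
    "WindGustDir":  ["N", "S", "N", "E"],
    "MinTemp":      [12.0, 15.5, 9.3, 20.1],
    "RainTomorrow": [0, 1, 0, 1],
})

# One-hot encode the categorical column: each category becomes a 0/1 feature.
df = pd.get_dummies(df, columns=["WindGustDir"])

X = df.drop(columns="RainTomorrow")
y = df["RainTomorrow"]

# Hold out a test set (an 80/20 split, close to the HDUCI 242/61 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```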

Training and evaluating

This section represents the core of the study at hand. For each data set we train and evaluate the three chosen boosting methods, with and without hyper-parameter tuning. The training and evaluation process is accomplished through three helper functions, available in this kernel: Train_models, Models_results, and Plot_summary.
  • The Train_models function, as its name suggests, is used to train three models (an XGBoost model, a CatBoost model, and a LightGBM model), with an optional 'gridsearch' parameter used for tuning the models' hyper-parameters.
  • The Models_results helper function is used to display the key results for each trained model: the Training_duration, the Classification_report, and, in case a grid search is conducted, the Gridsearch_duration and the Best_parameters_set.
  • The Plot_summary function is then used to convert the numerical results into visualisations, which facilitate the comparison of the obtained results.
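A minimal sketch of the Train_models/Models_results pattern is shown below. In the actual kernel the three models would be XGBClassifier, CatBoostClassifier, and LGBMClassifier; a scikit-learn booster stands in here so the snippet runs without those packages installed (all four expose the same fit/predict interface), and the function names and synthetic data are illustrative:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_models(models, X_train, y_train):
    """Fit each model and record its training duration in seconds."""
    durations = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        durations[name] = time.perf_counter() - start
    return durations

def models_results(models, durations, X_test, y_test):
    """Print the training duration and classification report per model."""
    for name, model in models.items():
        print(f"{name}: trained in {durations[name]:.2f}s")
        print(classification_report(y_test, model.predict(X_test)))

# Synthetic binary-classification data standing in for HDUCI/RIA.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {"booster": GradientBoostingClassifier(random_state=0)}
durations = train_models(models, X_train, y_train)
models_results(models, durations, X_test, y_test)
```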

Results of the base models on the small data set

The evaluation of the models on the test set shows that the performance results are very similar, with only a minor difference between the three models, as shown in the figure below.
The train duration bar chart (see the figure below) shows a significant difference of 1.2 seconds between the XGBoost and LightGBM models, which is considerable given the size of the test set (61 samples). It is also important to notice that the models' ranking by performance is inverted when it comes to train duration, which supports the performance-versus-time trade-off referred to in the first paragraph.
Finally, the confusion matrices show how close the classification results are on a small data set.

Results of the tuned models on the small data set

The grid search strategy I used to tune each model, without being biased towards any of them, was to tune the top five most important parameters of each classifier, using the same range of values whenever a parameter is common to all. It is important to note that grid search is a very time-consuming method, so the ranges were chosen after several runs to optimize them; moreover, the number of parameters was limited to five for the same reason.
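The shape of such a search can be sketched as follows. The parameter grid here is hypothetical (three parameters shared by all three boosters rather than the kernel's actual top five and ranges), and a scikit-learn booster again stands in so the sketch runs without xgboost/catboost/lightgbm installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid over parameters common to XGBoost, CatBoost, and LightGBM.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [50, 100],
}

X, y = make_classification(n_samples=300, random_state=0)

# GridSearchCV tries every combination with cross-validation and keeps the best.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,       # small fold count to keep the sketch fast
    n_jobs=-1,  # parallelize over cores
)
search.fit(X, y)
print(search.best_params_)
```

Even this tiny grid costs 2 × 2 × 2 × 3 = 24 model fits, which is why the ranges and parameter count were kept small.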

As shown in the figure below, the grid search duration is significantly higher for the CatBoost model, whereas both the XGBoost and LightGBM classifiers showed reasonable grid search durations. The significantly higher grid search duration of the CatBoost model may be due to the fact that it has some default parameters that LightGBM and XGBoost do not have, such as:
  • boosting_type='Plain', which performs better on bigger data sets; it should be set to 'Ordered' for small data sets.
  • leaf_estimation_iterations=1: this parameter requires many features to perform as intended; in the case of a small data set such as HDUCI, it becomes a bottleneck.
Also, CatBoost performs preprocessing of the input data, which becomes a bottleneck when the number of iterations is small (e.g. 50 iterations).

On the other hand, we can clearly see that training the tuned models shows similar results regarding training time as in the previous paragraph.

In terms of performance, both the classification report bar chart and the confusion matrices show that the results are quite similar, with slightly better results for the LightGBM classifier.

Results of the base models on the big data set

The training and evaluation process on the RIA dataset is performed in the same way as on the HDUCI dataset, and the only remarkable thing to notice about the results of the three models, as shown in the figures below, is that LightGBM achieved the same level of accuracy in a significantly shorter training duration. This is a crucial point: the 44-second difference in train duration would certainly be multiplied several times when conducting a grid search, as in the next section.

Results of the tuned models on the big data set

The tuning process for the models is similar to the previous one, except for the class-weight parameters, which were used to remedy the imbalanced class representation in the RIA dataset, where the class labeled 1 is under-represented by a factor of about 1/3.
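One common way to derive such a weight, sketched below with toy labels matching the roughly 3:1 imbalance described for RIA, is the ratio of negative to positive instances; the resulting value can be passed as scale_pos_weight in XGBoost and LightGBM, or via class_weights in CatBoost:

```python
import numpy as np

# Toy labels with the ~3:1 imbalance described for the RIA dataset.
y = np.array([0] * 75 + [1] * 25)

# Ratio of negatives to positives: the weight given to the minority class.
neg, pos = np.bincount(y)
ratio = neg / pos
print(ratio)  # 3.0
```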

The most remarkable result in this last experiment is that the CatBoost model's grid search duration was slightly better than the LightGBM grid search duration, which was not expected given the difference in training duration between the base models. However, this can be explained by the same default parameters referred to for the tuned models on the HDUCI dataset: since RIA is a much bigger dataset, the default boosting_type and leaf_estimation_iterations parameters, along with the internal preprocessing, helped accelerate the process significantly, especially with a large number of iterations as in the conducted grid search.
The performance results are very similar, as shown in the figures below.

Conclusion

In this study, we compared the top 3 most commonly used boosting models on two datasets. The results on both datasets, in terms of performance on the test set, were quite similar; the notable differences were in training and grid search duration. Here are some key takeaways from this study:
  1. The CatBoost model performs very well on both small and large data sets; however, its training and grid search durations are significantly better on large data sets with a high number of iterations.
  2. The CatBoost model has a fairly balanced performance-vs-speed architecture.
  3. The XGBoost model performs very well on both small and large data sets, but it exhibits a very high grid search duration on the large data set.
  4. The XGBoost model is more optimized for performance than for time consumption.
  5. The LightGBM model is the most efficient regarding time consumption on both small and large data sets, and its performance is remarkably high in both cases.
  6. The LightGBM model is clearly the best optimized for speed without sacrificing much of its performance.
The study conducted here is a fairly simple one, owing to the computational power available: the computations were performed using the CPU on Kaggle. A much deeper comparative study could be achieved using more powerful CPUs or GPUs, which would make it possible to tune more parameters and use bigger datasets.

Comparative Study
September 02, 2021