1 Introduction

This chapter considers the XGBoost algorithm, which was detailed in Sect. 3.6. It discusses how to find suitable parameter values and bounds, and how to perform experiments w.r.t. the following nine XGBoost hyperparameters: \(\texttt {nrounds}\), \(\texttt {eta}\), \(\texttt {lambda}\), \(\texttt {alpha}\), \(\texttt {subsample}\), \(\texttt {colsample}\), \(\texttt {gamma}\), \(\texttt {maxdepth}\), and \(\texttt {minchild}\).

2 Data Description

The first step is identical to the step in the \(\texttt {ranger}\) example in Chap. 8, because the Census-Income (KDD) Data Set (CID) will be used. So, the function \(\texttt {getDataCensus}\) is called with the parameters from Table 8.3 to get the CID data from Table 8.2. The complete data set, \((\mathcal {X}, \mathcal {Y})\), contains \(n = 299{,}285\) observations with 41 features on demography and employment.

3 \(\texttt {getMlConfig}\): Experimental Setup and Configuration of the Gradient Boosting Model

Again, a subset with \(n = 10^4\) samples, which defines the subset \((X,Y) \subset (\mathcal {X}, \mathcal {Y})\), is provided. The project setup is also similar to the setup described in Sect. 8.1. Therefore, only the differences will be shown. The full script is available in Sect. 9.10.

The function \(\texttt {getMlConfig}\) is called with the same arguments as in Chap. 8, with one exception: \(\texttt {model}\) is set to \(\texttt {"xgboost"}\). The function \(\texttt {getMlConfig}\) defines the ML task, the model configuration, and the data split (generation of the training and test data sets, i.e., \((X,Y)^{(\text {train})}\) and \((X,Y)^{(\text {test})}\)). To achieve this goal, the functions \(\texttt {getMlrTask}\), \(\texttt {getModelConf}\), and \(\texttt {getMlrResample}\) are executed. As a result, the list \(\texttt {cfg}\) with 13 elements is available, see Table 9.1.
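A minimal sketch of this call is shown below; the argument names follow the setup from Chap. 8, and the values are placeholders standing in for the settings from Table 8.3:

dataCensus <- getDataCensus(target = "age", nobs = 1e4)  # argument values are placeholders
cfg <- getMlConfig(
  target = "age",          # placeholder, cf. Table 8.3
  model = "xgboost",       # the only change w.r.t. Chap. 8
  data = dataCensus,
  task.type = "classif",   # placeholder
  nobs = 1e4,              # subset size n
  nfactors = "high",       # placeholder
  nnumericals = "high",    # placeholder
  cardinality = "high",    # placeholder
  data.seed = 1,
  prop = 2/3)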

Table 9.1 Result from the function \(\texttt {getMlConfig}\): the \(\texttt {cfg}\) list

3.1 \(\texttt {getMlrTask}\): Problem Design and Definition of the Machine Learning Task

The problem design describes the target and the task type, the number of observations, and the number of factor and numerical variables as well as their cardinality. It was described in Sect. 8.3.2.1.

3.2 \(\texttt {getModelConf}\): Algorithm Design—Hyperparameters of the Models

The function \(\texttt {getModelConf}\), which is called from \(\texttt {getMlConfig}\), computes an adequate XGBoost hyperparameter setting. Examples from the literature, shown in Table 3.6 in Sect. 3.6, are used as a guideline. These values were modified as follows:

\(\texttt {nrounds}\)::

An upper value (\(2^5\)), which is similar to the Random Forest (RF) configuration, was chosen. This value is smaller than the value used by Probst et al. (2019a), who used 5,000.

\(\texttt {colsample\_bytree}\)::

The lower value was chosen as 1/\(\texttt {getTaskNFeats}\) \(\texttt {(task)}\). This is a minor deviation from the settings used in Probst et al. (2019a). The reason for this modification is simple: a lower value of zero makes no sense, because at least one feature should be chosen via \(\texttt {colsample}\).

\(\texttt {gamma}\)::

A lower value of \(-10\) was chosen. This value is smaller than the value chosen by Thomas et al. (2018). Accordingly, an upper value (10) larger than the one used by Thomas et al. (2018) was selected.


Hyperparameter transformations are shown in the column \(\texttt {trans}\) in Table 9.2. These transformations are similar to the transformations used by Probst et al. (2019a) and Thomas et al. (2018) with one minor change: \(\texttt {trans\_2pow\_round}\) was applied to the hyperparameter \(\texttt {nrounds}\).
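To illustrate, a minimal sketch of such a power-of-two transformation follows; the function name matches the \(\texttt {trans}\) column of Table 9.2, while the implementation shown here is an assumption:

# Assumed behavior: map the log2-scaled tuning value x to a rounded
# integer number of boosting rounds
trans_2pow_round <- function(x) round(2^x)
trans_2pow_round(5)    # 32 boosting rounds
trans_2pow_round(2.3)  # 5 boosting rounds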

The ML configuration list \(\texttt {cfg}\) contains information about the hyperparameters of the XGBoost model, see Table 9.2.

Table 9.2 XGBoost hyperparameters. \(N_{\text {Feats}}\) denotes the output of \(\texttt {getTaskNFeats(task)}\)

Background: XGBoost Hyperparameters

The complete list of XGBoost hyperparameters can also be shown using the function \(\texttt {getModelConf}\). Note: the hyperparameter \(\texttt {colsample\_bytree}\) is a relative hyperparameter, i.e., it depends on the number of features (\(\texttt {nFeatures}\)), see the discussion in Sect. 3.6. Hence, the value \(\texttt {nFeatures}\) must be determined before the hyperparameter bounds can be computed.
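A hedged sketch of this computation follows; the exact signature of \(\texttt {getModelConf}\) is an assumption:

# nFeatures must be known before the bounds can be computed, because
# the lower bound of colsample_bytree is 1/nFeatures
nFeatures <- getTaskNFeats(task)
cfgXgb <- getModelConf(model = "xgboost", nFeatures = nFeatures)  # signature assumed
cfgXgb$tunepars  # hyperparameter names, cf. Table 9.2
cfgXgb$lower     # lower bounds
cfgXgb$upper     # upper bounds
cfgXgb$trans     # transformations, cf. the trans column in Table 9.2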


The list of hyperparameters is stored as the list element \(\texttt {tunepars}\), see Table 9.2.

Furthermore, all factor features are replaced with their dummy variables. Dummy variables are recommended for XGBoost: internally, a \(\texttt {model.matrix}\) is used; non-factor features are left untouched and passed through to the result. The seed can be set to improve reproducibility. Finally, these settings are compiled into the list \(\texttt {cfg}\).
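The following self-contained base R example illustrates this dummy coding; it is an illustration only, not the internal code:

df <- data.frame(income = c(1, 0, 1),
                 sex = factor(c("m", "f", "m")),
                 age = c(23, 31, 54))
# The factor sex is expanded into dummy columns, while the numeric
# feature age is passed through unchanged
model.matrix(income ~ . - 1, data = df)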

3.3 \(\texttt {getMlrResample}\): Training and Test Data

The full data set is partitioned as described in Sect. 8.3.2.3: \(\texttt {rsample}\) specifies a training data set containing 2/3 of the data and a test data set with the remaining 1/3 of the data.
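The effect of this split can be illustrated with base R; the actual partition is produced by \(\texttt {getMlrResample}\), so the following is an illustration only (\(\texttt {dataCensus}\) is the hypothetical name of the CID subset):

set.seed(1)                                   # seed for a reproducible split
n <- nrow(dataCensus)
trainIdx <- sample(n, size = floor(2/3 * n))  # 2/3 of the observations
train <- dataCensus[trainIdx, ]               # training data
test <- dataCensus[-trainIdx, ]               # remaining 1/3 for testing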

4 Objective Function (Model Performance)

Because the XGBoost method is more complex than RF, an increased computational budget is recommended, e.g., a tuning budget of \(6 \times 3{,}600\) s, i.e., six hours. This increased budget is used in the global study (Chap. 12); for the experiments performed in the current chapter, the budget was not increased.

Before the hyperparameter tuner is called, the objective function is defined: this function receives a configuration for a tuning experiment and returns an objective function to be tuned via \(\texttt {spot}\). A detailed description of the objective function can be found in Sect. 8.4.4.
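In code, this step reduces to a single call; the helper name \(\texttt {getObjf}\) and its arguments are assumptions based on the Chap. 8 setup:

# objf maps a hyperparameter vector x to the validation loss (MMCE);
# timeout caps the evaluation time of a single configuration
objf <- getObjf(config = cfg, timeout = 3600/20)  # timeout value assumed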

5 \(\texttt {spot}\): Experimental Setup for the Hyperparameter Tuner

The R package SPOT is used to perform the actual tuning (optimization). Because the generic Sequential Parameter Optimization Toolbox (SPOT) setup was introduced in Sect. 4.5, this section highlights the modifications of the generic setup that were made for the \(\texttt {xgboost}\) hyperparameter tuning experiments.

The third step of the hyperparameter tuning pipeline as shown in Fig. 8.5 starts the SPOT hyperparameter tuner.
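A hedged sketch of this call follows; the control settings are assumptions based on SPOT's documented interface and the budget discussed above:

library(SPOT)
result <- spot(
  x = NULL,              # no user-supplied start design
  fun = objf,            # objective function from Sect. 9.4
  lower = cfg$lower,     # bounds from getModelConf
  upper = cfg$upper,
  control = list(
    types = cfg$type,    # variable types (numeric, integer, factor)
    maxTime = 60,        # tuning budget in minutes (assumed)
    noise = TRUE))       # resampling makes the evaluations noisy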


The result is written to a file, from which it can be read back later. A minimal sketch, with the file name assumed to follow the run-number convention used in this chapter:
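save(result, file = "xgboost00001.RData")  # file name assumed
load("xgboost00001.RData")                 # restores the object result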


The full R code for running this case study is shown in the Appendix (Sect. 9.10).

6 Tunability

6.1 Progress

The function \(\texttt {prepareProgressPlot}\) generates a data frame that can be used to visualize the hyperparameter tuning progress. The data frame can be passed to \(\texttt {ggplot}\). Figure 9.1 visualizes the progress of the \(\texttt {xgboost}\) hyperparameter tuning process during the \(\texttt {spot}\) tuning procedure.
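A sketch of this step is shown below; the column names of the data frame returned by \(\texttt {prepareProgressPlot}\) are assumptions:

library(ggplot2)
dfProgress <- prepareProgressPlot(result)
ggplot(dfProgress, aes(x = iteration, y = yBest)) +  # column names assumed
  geom_line() +
  xlab("function evaluations") +
  ylab("validation loss (MMCE)")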

Fig. 9.1 XGB: Hyperparameter tuning progress. Validation loss plotted against the number of function evaluations, i.e., the number of evaluated XGBoost models. The red dashed line denotes the best value found by the initial design; the blue dashed line represents the best value from the whole run

After 60 min, 157 \(\texttt {xgboost}\) models were evaluated. Comparing the worst configuration that was observed during the HPT run with the best, a reduction of 66.3743% was obtained. After the initial phase, which comprises 18 evaluations, the smallest Mean Mis-Classification Error (MMCE) is 0.1793641. The red dashed line in Fig. 9.1 illustrates this result. The final best value is 0.1724655, i.e., a reduction of the MMCE by 3.8462%. These values, in combination with the results shown in the progress plot (Fig. 9.1), indicate that even a quick HPT run is able to improve the quality of the \(\texttt {xgboost}\) method. They also indicate that increased run times do not result in a significant further improvement of the MMCE.

! Attention

These results do not replace a sound statistical comparison; they are indicators, not final conclusions.

6.2 \(\texttt {evalParamCensus}\): Comparing Default and Tuned Parameters on Test Data

As a baseline for comparison, XGBoost was run with default hyperparameter values. The corresponding R code for replicating the experiment is available in the code folder. The best (minimum MMCE) result from thirty repeats is reported; the corresponding values are shown in Table 9.3. The function \(\texttt {evalParamCensus}\) performs this comparison; it is called by specifying the ML model, e.g., \(\texttt {"xgboost"}\), and the run number \(\texttt {runNr}\).
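A sketch of the call; the argument names are assumptions:

score <- evalParamCensus(model = "xgboost",  # ML model to evaluate
                         runNr = "00001")    # identifies the tuning run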

Table 9.3 Comparison of the default and tuned hyperparameters of the XGBoost method. \(\texttt {colsample}\) denotes \(\texttt {colsample\_bytree}\). Table shows transformed values. Note: the \(\texttt {alpha}\) and \(\texttt {gamma}\) values are identical. They are computed as \(2^{-10}\), which is the lower bound value, because the theoretical default value 0 is infeasible. See also Table 3.8

The result files can be loaded and the violin plot of the obtained MMCE can be visualized (Fig. 9.2). It can be seen that the tuned solutions provide a better MMCE. Default and tuned results for the \(\texttt {xgboost}\) model are available as \(\texttt {xgboostDefaultEvaluation.RData}\) and \(\texttt {xgboost00001Evaluation.RData}\), respectively.

The scores are stored as a \(\texttt {matrix}\); attributes are used to label the measures. The following measures are calculated for each hyperparameter setting: \(\texttt {accuracy}\), \(\texttt {ce}\), \(\texttt {f1}\), \(\texttt {logLoss}\), \(\texttt {mae}\), \(\texttt {precision}\), \(\texttt {recall}\), and \(\texttt {rmse}\). The comparison is based on the MMCE, which was defined in Eq. (2.2). Hyperparameters of the default and the tuned configurations are shown in Table 9.3. The full procedure for obtaining the default \(\texttt {xgboost}\) hyperparameters, i.e., starting from scratch, is shown in Sect. 9.10.
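For instance, the MMCE values can be extracted from the score matrix as sketched below, assuming the measure labels are available as column names:

mmce <- score[, "ce"]  # ce: classification error, i.e., the MMCE of Eq. (2.2)
summary(mmce)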

Next, the hyperparameters of the tuned \(\texttt {xgboost}\) method are shown.
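They can be inspected directly in SPOT's \(\texttt {result}\) list, e.g.:

result$xbest  # best hyperparameter configuration found by spot
result$ybest  # corresponding MMCE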

Fig. 9.2 Comparison of XGBoost methods with default (D) and tuned (T) hyperparameters. Classification error (MMCE) plotted on the horizontal axis. Vertical lines in the violin figures mark the quantiles (0.25, 0.5, 0.75) of the corresponding distribution. Numerical values are shown in Table 9.3

7 Analyzing the Gradient Boosting Tuning Process

The analysis and the visualizations are based on the transformed values.

To analyze effects and interactions between the hyperparameters of the \(\texttt {xgboost}\) model, a simple regression tree, as shown in Figs. 9.3 and 9.4, can be used.
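Such a tree can be fitted to SPOT's tuning archive, e.g., with the \(\texttt {rpart}\) package; the following sketch assumes that the archive columns are named after the tuned hyperparameters:

library(rpart)
library(rpart.plot)
# result$x: evaluated hyperparameter vectors, result$y: observed MMCE values
dfTree <- data.frame(result$x, y = result$y)
names(dfTree) <- c(cfg$tunepars, "y")  # cf. the tunepars element from Sect. 9.3.2
fitTree <- rpart(y ~ ., data = dfTree)
rpart.plot(fitTree)                    # cf. Figs. 9.3 and 9.4
fitTree$variable.importance            # cf. the importance values below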

Fig. 9.3 Regression tree. Case study II. XGBoost

Fig. 9.4 Regression tree. Case study II. XGBoost. Hyperparameters are transformed values

Figure 9.4 shows the same tree based on the transformed values.

The regression tree supports the observation that the hyperparameters \(\texttt {alpha}\), \(\texttt {lambda}\), \(\texttt {gamma}\), and \(\texttt {nrounds}\) have the largest effect on the MMCE.

The corresponding variable importance values are:

alpha            0.23112227
lambda           0.04431784
gamma            0.04039483
nrounds          0.014028719
subsample        0.012203015
eta              0.009397272
colsample_bytree 0.0028057437

\(\texttt {alpha}\) is the most relevant hyperparameter.

To perform a sensitivity analysis, parallel and sensitivity plots can be used (Figs. 9.5 and 9.6).

Fig. 9.5 Parallel plot: best configurations in green

Fig. 9.6 Sensitivity plot (best). Overly large \(\texttt {alpha}\) values result in poor performance

Results from the \(\texttt {spot}\) run can be passed to the function \(\texttt {plotSensitivity}\), which generates a sensitivity plot as shown in Fig. 9.6. Sensitivity plots were introduced in Sect. 8.6. Contour plots are shown in Fig. 9.7.
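A sketch of the corresponding call; further arguments of \(\texttt {plotSensitivity}\) are assumptions:

plotSensitivity(result)  # sensitivity around the best configuration, cf. Fig. 9.6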

Fig. 9.7 Surface plot: \(x_3\) plotted against \(x_1\). This surface plot indicates that \(\texttt {alpha}\) has a large effect. Overly large \(\texttt {alpha}\) values result in poor performance

Finally, a simple linear regression model can be fitted to the data. Based on the data from SPOT’s \(\texttt {result}\) list, the summary is shown below.
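A sketch of this fit, reusing the data frame \(\texttt {dfTree}\) from the regression tree analysis above:

fitLm <- lm(y ~ ., data = dfTree)  # linear model on the tuning archive
summary(fitLm)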


Although this linear model requires a detailed investigation (a misspecification analysis would be necessary), it is in accordance with the previous observations that the hyperparameters \(\texttt {alpha}\), \(\texttt {lambda}\), \(\texttt {gamma}\), and \(\texttt {nrounds}\) have significant effects on the loss function.

8 Severity: Validating the Results

Now, we utilize hypothesis testing and severity to analyze the statistical significance of the achieved performance improvement. Considering the results from the pre-experimental runs, the difference is \(\bar{x} = 0.0199\). Since this value is positive, let us assume for the moment that the tuned solution is superior. The corresponding standard deviation is \(s_d = 0.0081\). Based on Eq. (5.14), and with \(\alpha = 0.05\), \(\beta = 0.2\), and \(\Delta = 0.01\), let us identify the required number of runs for the full experiment using the \(\texttt {getSampleSize()}\) function.
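A sketch of this call; the argument names of \(\texttt {getSampleSize()}\) are assumptions, and the values follow the text above:

getSampleSize(alpha = 0.05,  # significance level
              beta = 0.2,    # type II error, i.e., power = 0.8
              delta = 0.01,  # relevant difference
              sd = 0.0081)   # standard deviation s_d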

Table 9.4 Case study II: result analysis
Fig. 9.8 Tuning XGB. Severity of rejecting \(H_0\) (red), power (blue), and error (gray). Left: the observed mean \(\bar{x} = 0.0199\) is larger than the cut-off point \(c_{1-\alpha} = 0.0024\). Right: the claim that the true difference is as large as or larger than 0.01 is well supported by severity. However, any difference larger than 0.015 is not supported by severity

For a relevant difference of \(\Delta = 0.01\), approximately eight runs per algorithm are required. Hence, we can proceed to evaluate the severity and analyze the performance improvement achieved through tuning the parameters of \(\texttt {xgboost}\).

The summary statistics of the results are presented in Table 9.4. The decision based on the p-value is to reject the null hypothesis, i.e., the claim that the tuned parameter setup provides a significant performance improvement in terms of MMCE is supported. The effect size suggests that the difference is of a large magnitude. For the chosen \(\Delta = 0.01\), the severity value is 1 and thus strongly supports the decision to reject \(H_0\). The severity plot is shown in Fig. 9.8: severity shows that performance differences smaller than 0.015 are well supported.

9 Summary and Discussion

The analysis indicates that the hyperparameter \(\texttt {alpha}\) has the greatest effect on the algorithm's performance. The recommended value of \(\texttt {alpha}\) is 7.2791, which is much larger than the default value.

This case study demonstrates how functions from the R packages mlr and SPOT can be combined to perform a well-structured hyperparameter tuning and analysis. By specifying the time budget via \(\texttt {maxTime}\), the user can systematically improve hyperparameter settings. Before applying ML algorithms such as XGBoost to complex classification or regression problems, HPT is recommended: wrong hyperparameter settings can be avoided, and insight into the behavior of ML algorithms can be obtained.

10 Program Code


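The full script is not reproduced here; a condensed sketch of the pipeline is given below. The helper functions are those introduced in this chapter, their assumed home is the SPOTMisc package, and all argument values are placeholders:

library(SPOT)
library(SPOTMisc)  # assumed package providing the helper functions

## (1) Data and configuration (Sects. 9.2 and 9.3); values are placeholders
dataCensus <- getDataCensus(target = "age", nobs = 1e4)
cfg <- getMlConfig(target = "age", model = "xgboost", data = dataCensus,
                   task.type = "classif", nobs = 1e4, nfactors = "high",
                   nnumericals = "high", cardinality = "high",
                   data.seed = 1, prop = 2/3)

## (2) Objective function (Sect. 9.4)
objf <- getObjf(config = cfg, timeout = 3600/20)

## (3) Hyperparameter tuning (Sect. 9.5)
result <- spot(x = NULL, fun = objf, lower = cfg$lower, upper = cfg$upper,
               control = list(maxTime = 60, noise = TRUE))

## (4) Persist the result for the analyses in Sects. 9.6-9.8
save(result, file = "xgboost00001.RData")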