1 Introduction

This chapter considers the XGBoost algorithm, which was detailed in Sect. 3.6. It discusses how to find suitable parameter values and bounds, and how to perform experiments w.r.t. the following nine XGBoost hyperparameters: \(\texttt {nrounds}\), \(\texttt {eta}\), \(\texttt {lambda}\), \(\texttt {alpha}\), \(\texttt {subsample}\), \(\texttt {colsample}\), \(\texttt {gamma}\), \(\texttt {maxdepth}\), and \(\texttt {minchild}\).

2 Data Description

The first step is identical to the step in the \(\texttt {ranger}\) example in Chap. 8, because the Census-Income (KDD) Data Set (CID) will be used. So, the function \(\texttt {getDataCensus}\) is called with the parameters from Table 8.3 to get the CID data from Table 8.2. The complete data set, \((\mathcal {X}, \mathcal {Y})\), contains \(n = 299{,}285\) observations with 41 features on demography and employment.

3 \(\texttt {getMlConfig}\): Experimental Setup and Configuration of the Gradient Boosting Model

Again, a subset with \(n = 10^4\) samples, which defines the subset \((X,Y) \subset (\mathcal {X}, \mathcal {Y})\), is provided. The project setup is also similar to the setup described in Sect. 8.1. Therefore, only the differences will be shown. The full script is available in Sect. 9.10.

The function \(\texttt {getMlConfig}\) is called with the same arguments as in Chap. 8, with one exception: \(\texttt {model}\) is set to \(\texttt {"xgboost"}\). The function \(\texttt {getMlConfig}\) defines the ML task, the model configuration, and the data split (generation of the training and test data sets, i.e., \((X,Y)^{(\text {train})}\) and \((X,Y)^{(\text {test})}\)). To achieve this goal, the functions \(\texttt {getMlrTask}\), \(\texttt {getModelConf}\), and \(\texttt {getMlrResample}\) are executed. As a result, the list \(\texttt {cfg}\) with 13 elements is available, see Table 9.1.
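A minimal sketch of this call is shown below; the argument names follow the setup from Chap. 8, and the values are placeholders standing in for the settings from Table 8.3:

dataCensus <- getDataCensus(target = "age", nobs = 1e4)  # argument values are placeholders
cfg <- getMlConfig(
  target = "age",          # placeholder, cf. Table 8.3
  model = "xgboost",       # the only change w.r.t. Chap. 8
  data = dataCensus,
  task.type = "classif",   # placeholder
  nobs = 1e4,              # subset size n
  nfactors = "high",       # placeholder
  nnumericals = "high",    # placeholder
  cardinality = "high",    # placeholder
  data.seed = 1,
  prop = 2/3)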

Table 9.1 Result from the function \(\texttt {getMlConfig}\): the \(\texttt {cfg}\) list

3.1 \(\texttt {getMlrTask}\): Problem Design and Definition of the Machine Learning Task

The problem design describes the target and the task type, the number of observations, and the number of factor and numerical variables as well as their cardinality. It was described in Sect. 8.3.2.1.

3.2 \(\texttt {getModelConf}\): Algorithm Design—Hyperparameters of the Models

The function \(\texttt {getModelConf}\), which is called from \(\texttt {getMlConfig}\), computes an adequate XGBoost hyperparameter setting. Examples from the literature, shown in Table 3.6 in Sect. 3.6, are used as a guideline. These values were modified as follows:

\(\texttt {nrounds}\)::

An upper value (\(2^5\)), which is similar to the Random Forest (RF) configuration, was chosen. This value is smaller than the value used by Probst et al. (2019a), who used 5,000.

\(\texttt {colsample\_bytree}\)::

The lower value was chosen as 1/\(\texttt {getTaskNFeats}\) \(\texttt {(task)}\). This is a minor deviation from the settings used in Probst et al. (2019a). The reason for this modification is simple: a lower value of zero makes no sense, because at least one feature should be chosen via \(\texttt {colsample}\).

\(\texttt {gamma}\)::

A lower value of \(-10\) was chosen. This value is smaller than the value chosen by Thomas et al. (2018). Accordingly, an upper value (10) larger than the one used by Thomas et al. (2018) was selected.


Hyperparameter transformations are shown in the column \(\texttt {trans}\) in Table 9.2. These transformations are similar to the transformations used by Probst et al. (2019a) and Thomas et al. (2018) with one minor change: \(\texttt {trans\_2pow\_round}\) was applied to the hyperparameter \(\texttt {nrounds}\).
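To illustrate, a minimal sketch of such a power-of-two transformation follows; the function name matches the \(\texttt {trans}\) column of Table 9.2, while the implementation shown here is an assumption:

# Assumed behavior: map the log2-scaled tuning value x to a rounded
# integer number of boosting rounds
trans_2pow_round <- function(x) round(2^x)
trans_2pow_round(5)    # 32 boosting rounds
trans_2pow_round(2.3)  # 5 boosting rounds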

The ML configuration list \(\texttt {cfg}\) contains information about the hyperparameters of the XGBoost model, see Table 9.2.

Table 9.2 XGBoost hyperparameters. \(N_{\text {Feats}}\) denotes the output of \(\texttt {getTaskNFeats(task)}\)

Background: XGBoost Hyperparameters

The complete list of XGBoost hyperparameters can also be shown using the function \(\texttt {getModelConf}\). Note: the hyperparameter \(\texttt {colsample\_bytree}\) is a relative hyperparameter, i.e., it depends on the number of features (\(\texttt {nFeatures}\)), see the discussion in Sect. 3.6. Hence, the value \(\texttt {nFeatures}\) must be determined before the hyperparameter bounds can be computed.
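A hedged sketch of this computation follows; the exact signature of \(\texttt {getModelConf}\) is an assumption:

# nFeatures must be known before the bounds can be computed, because
# the lower bound of colsample_bytree is 1/nFeatures
nFeatures <- getTaskNFeats(task)
cfgXgb <- getModelConf(model = "xgboost", nFeatures = nFeatures)  # signature assumed
cfgXgb$tunepars  # hyperparameter names, cf. Table 9.2
cfgXgb$lower     # lower bounds
cfgXgb$upper     # upper bounds
cfgXgb$trans     # transformations, cf. the trans column in Table 9.2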


The list of hyperparameters is stored as the list element \(\texttt {tunepars}\), see Table 9.2.

Furthermore, all factor features are replaced with their dummy variables. Dummy variables are recommended for XGBoost: internally, a \(\texttt {model.matrix}\) is used; non-factor features are left untouched and passed through to the result. The seed can be set to improve reproducibility. Finally, these settings are compiled into the list \(\texttt {cfg}\).
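The following self-contained base R example illustrates this dummy coding; it is an illustration only, not the internal code:

df <- data.frame(income = c(1, 0, 1),
                 sex = factor(c("m", "f", "m")),
                 age = c(23, 31, 54))
# The factor sex is expanded into dummy columns, while the numeric
# feature age is passed through unchanged
model.matrix(income ~ . - 1, data = df)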

3.3 \(\texttt {getMlrResample}\): Training and Test Data

The full data set is partitioned as described in Sect. 8.3.2.3: \(\texttt {rsample}\) specifies a training data set containing 2/3 of the data and a test data set with the remaining 1/3 of the data.
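The effect of this split can be illustrated with base R; the actual partition is produced by \(\texttt {getMlrResample}\), so the following is an illustration only (\(\texttt {dataCensus}\) is the hypothetical name of the CID subset):

set.seed(1)                                   # seed for a reproducible split
n <- nrow(dataCensus)
trainIdx <- sample(n, size = floor(2/3 * n))  # 2/3 of the observations
train <- dataCensus[trainIdx, ]               # training data
test <- dataCensus[-trainIdx, ]               # remaining 1/3 for testing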

4 Objective Function (Model Performance)

Because the XGBoost method is more complex than RF, an increased computational budget is recommended, e.g., a tuning budget of \(6 \times 3{,}600\) s, i.e., six hours. This increased budget is used in the global study (Chap. 12); for the experiments performed in the current chapter, the budget was not increased.

Before the hyperparameter tuner is called, the objective function is defined: this function receives a configuration for a tuning experiment and returns an objective function to be tuned via \(\texttt {spot}\). A detailed description of the objective function can be found in Sect. 8.4.4.
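In code, this step reduces to a single call; the helper name \(\texttt {getObjf}\) and its arguments are assumptions based on the Chap. 8 setup:

# objf maps a hyperparameter vector x to the validation loss (MMCE);
# timeout caps the evaluation time of a single configuration
objf <- getObjf(config = cfg, timeout = 3600/20)  # timeout value assumed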

5 \(\texttt {spot}\): Experimental Setup for the Hyperparameter Tuner

The R package SPOT is used to perform the actual tuning (optimization). Because the generic Sequential Parameter Optimization Toolbox (SPOT) setup was introduced in Sect. 4.5, this section highlights the modifications of the generic setup that were made for the \(\texttt {xgboost}\) hyperparameter tuning experiments.

The third step of the hyperparameter tuning pipeline as shown in Fig. 8.5 starts the SPOT hyperparameter tuner.
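A hedged sketch of this call follows; the control settings are assumptions based on SPOT's documented interface and the budget discussed above:

library(SPOT)
result <- spot(
  x = NULL,              # no user-supplied start design
  fun = objf,            # objective function from Sect. 9.4
  lower = cfg$lower,     # bounds from getModelConf
  upper = cfg$upper,
  control = list(
    types = cfg$type,    # variable types (numeric, integer, factor)
    maxTime = 60,        # tuning budget in minutes (assumed)
    noise = TRUE))       # resampling makes the evaluations noisy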


The result is written to a file, from which it can be read back later. A minimal sketch, with the file name assumed to follow the run-number convention used in this chapter:
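save(result, file = "xgboost00001.RData")  # file name assumed
load("xgboost00001.RData")                 # restores the object result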


The full R code for running this case study is shown in the Appendix (Sect. 9.10).

6 Tunability

6.1 Progress

The function \(\texttt {prepareProgressPlot}\) generates a data frame that can be used to visualize the hyperparameter tuning progress. The data frame can be passed to \(\texttt {ggplot}\). Figure 9.1 visualizes the progress of the \(\texttt {xgboost}\) hyperparameter tuning process during the \(\texttt {spot}\) tuning procedure.
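A sketch of this step is shown below; the column names of the data frame returned by \(\texttt {prepareProgressPlot}\) are assumptions:

library(ggplot2)
dfProgress <- prepareProgressPlot(result)
ggplot(dfProgress, aes(x = iteration, y = yBest)) +  # column names assumed
  geom_line() +
  xlab("function evaluations") +
  ylab("validation loss (MMCE)")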

Fig. 9.1 XGB: Hyperparameter tuning progress. Validation loss plotted against the number of function evaluations, i.e., the number of evaluated XGBoost models. The red dashed line denotes the best value found by the initial design; the blue dashed line represents the best value from the whole run

After 60 min, 157 \(\texttt {xgboost}\) models were evaluated. Comparing the worst configuration that was observed during the HPT run with the best, a reduction of 66.3743% was obtained. After the initial phase, which comprises 18 evaluations, the smallest Mean Mis-Classification Error (MMCE) is 0.1793641. The red dashed line in Fig. 9.1 illustrates this result. The final best value is 0.1724655, i.e., a reduction of the MMCE by 3.8462%. These values, in combination with the results shown in the progress plot (Fig. 9.1), indicate that even a quick HPT run is able to improve the quality of the \(\texttt {xgboost}\) method. They also indicate that increased run times do not result in a significant further improvement of the MMCE.

! Attention

These results do not replace a sound statistical comparison; they are indicators, not final conclusions.

6.2 \(\texttt {evalParamCensus}\): Comparing Default and Tuned Parameters on Test Data

As a baseline for comparison, XGBoost was run with default hyperparameter values. The corresponding R code for replicating the experiment is available in the code folder. The best (minimum MMCE) result from thirty repeats is reported; the corresponding values are shown in Table 9.3. The function \(\texttt {evalParamCensus}\) performs this comparison; it is called by specifying the ML model, e.g., \(\texttt {"xgboost"}\), and the run number \(\texttt {runNr}\).
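A sketch of the call; the argument names are assumptions:

score <- evalParamCensus(model = "xgboost",  # ML model to evaluate
                         runNr = "00001")    # identifies the tuning run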

Table 9.3 Comparison of the default and tuned hyperparameters of the XGBoost method. \(\texttt {colsample}\) denotes \(\texttt {colsample\_bytree}\). Table shows transformed values. Note: the \(\texttt {alpha}\) and \(\texttt {gamma}\) values are identical. They are computed as \(2^{-10}\), which is the lower bound value, because the theoretical default value 0 is infeasible. See also Table 3.8

The result files can be loaded and the violin plot of the obtained MMCE can be visualized (Fig. 9.2). It can be seen that the tuned solutions provide a better MMCE. Default and tuned results for the \(\texttt {xgboost}\) model are available as \(\texttt {xgboostDefaultEvaluation.RData}\) and \(\texttt {xgboost00001Evaluation.RData}\), respectively.

The scores are stored as a \(\texttt {matrix}\); attributes are used to label the measures. The following measures are calculated for each hyperparameter setting: \(\texttt {accuracy}\), \(\texttt {ce}\), \(\texttt {f1}\), \(\texttt {logLoss}\), \(\texttt {mae}\), \(\texttt {precision}\), \(\texttt {recall}\), and \(\texttt {rmse}\). The comparison is based on the MMCE, which was defined in Eq. (2.2). Hyperparameters of the default and the tuned configurations are shown in Table 9.3. The full procedure for obtaining the default \(\texttt {xgboost}\) hyperparameters, i.e., starting from scratch, is shown in Sect. 9.10.
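For instance, the MMCE values can be extracted from the score matrix as sketched below, assuming the measure labels are available as column names:

mmce <- score[, "ce"]  # ce: classification error, i.e., the MMCE of Eq. (2.2)
summary(mmce)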

Next, the hyperparameters of the tuned \(\texttt {xgboost}\) method are shown.
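They can be inspected directly in SPOT's \(\texttt {result}\) list, e.g.:

result$xbest  # best hyperparameter configuration found by spot
result$ybest  # corresponding MMCE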

Fig. 9.2 Comparison of XGBoost methods with default (D) and tuned (T) hyperparameters. Classification error (MMCE) plotted on the horizontal axis. Vertical lines in the violin figures mark the quantiles (0.25, 0.5, 0.75) of the corresponding distribution. Numerical values are shown in Table 9.3

7 Analyzing the Gradient Boosting Tuning Process

The analysis and the visualizations are based on the transformed values.

To analyze effects and interactions between the hyperparameters of the \(\texttt {xgboost}\) model, a simple regression tree, as shown in Figs. 9.3 and 9.4, can be used.
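Such a tree can be fitted to SPOT's tuning archive, e.g., with the \(\texttt {rpart}\) package; the following sketch assumes that the archive columns are named after the tuned hyperparameters:

library(rpart)
library(rpart.plot)
# result$x: evaluated hyperparameter vectors, result$y: observed MMCE values
dfTree <- data.frame(result$x, y = result$y)
names(dfTree) <- c(cfg$tunepars, "y")  # cf. the tunepars element from Sect. 9.3.2
fitTree <- rpart(y ~ ., data = dfTree)
rpart.plot(fitTree)                    # cf. Figs. 9.3 and 9.4
fitTree$variable.importance            # cf. the importance values below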

Fig. 9.3 Regression tree. Case study II. XGBoost

Fig. 9.4 Regression tree. Case study II. XGBoost. Hyperparameters are transformed values

Figure 9.4 shows the same tree based on the transformed values.

The regression tree supports the observation that the hyperparameters \(\texttt {alpha}\), \(\texttt {lambda}\), \(\texttt {gamma}\), and \(\texttt {nrounds}\) have the largest effect on the MMCE.

The corresponding variable importance values are:

alpha            0.23112227
lambda           0.04431784
gamma            0.04039483
nrounds          0.014028719
subsample        0.012203015
eta              0.009397272
colsample_bytree 0.0028057437

\(\texttt {alpha}\) is the most relevant hyperparameter.

To perform a sensitivity analysis, parallel and sensitivity plots can be used (Figs. 9.5 and 9.6).

Fig. 9.5 Parallel plot: best configurations in green

Fig. 9.6 Sensitivity plot (best). Overly large \(\texttt {alpha}\) values result in poor performance

Results from the \(\texttt {spot}\) run can be passed to the function \(\texttt {plotSensitivity}\), which generates a sensitivity plot as shown in Fig. 9.6. Sensitivity plots were introduced in Sect. 8.6. Contour plots are shown in Fig. 9.7.
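A sketch of the corresponding call; further arguments of \(\texttt {plotSensitivity}\) are assumptions:

plotSensitivity(result)  # sensitivity around the best configuration, cf. Fig. 9.6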

Fig. 9.7 Surface plot: \(x_3\) plotted against \(x_1\). This surface plot indicates that \(\texttt {alpha}\) has a large effect. Overly large \(\texttt {alpha}\) values result in poor performance

Finally, a simple linear regression model can be fitted to the data. Based on the data from SPOT’s \(\texttt {result}\) list, the summary is shown below.
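A sketch of this fit, reusing the data frame \(\texttt {dfTree}\) from the regression tree analysis above:

fitLm <- lm(y ~ ., data = dfTree)  # linear model on the tuning archive
summary(fitLm)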


Although this linear model requires a detailed investigation (a misspecification analysis would be necessary), it is in accordance with the previous observations that the hyperparameters \(\texttt {alpha}\), \(\texttt {lambda}\), \(\texttt {gamma}\), and \(\texttt {nrounds}\) have significant effects on the loss function.

8 Severity: Validating the Results

Now, we utilize hypothesis testing and severity to analyze the statistical significance of the achieved performance improvement. Considering the results from the pre-experimental runs, the difference is \(\bar{x} = 0.0199\). Since this value is positive, let us assume for the moment that the tuned solution is superior. The corresponding standard deviation is \(s_d = 0.0081\). Based on Eq. (5.14), and with \(\alpha = 0.05\), \(\beta = 0.2\), and \(\Delta = 0.01\), let us identify the required number of runs for the full experiment using the \(\texttt {getSampleSize()}\) function.
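A sketch of this call; the argument names of \(\texttt {getSampleSize()}\) are assumptions, and the values follow the text above:

getSampleSize(alpha = 0.05,  # significance level
              beta = 0.2,    # type II error, i.e., power = 0.8
              delta = 0.01,  # relevant difference
              sd = 0.0081)   # standard deviation s_d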

Table 9.4 Case study II: result analysis
Fig. 9.8 Tuning XGB. Severity of rejecting \(H_0\) (red), power (blue), and error (gray). Left: the observed mean \(\bar{x} = 0.0199\) is larger than the cut-off point \(c_{1-\alpha} = 0.0024\). Right: the claim that the true difference is as large as or larger than 0.01 is well supported by severity. However, any difference larger than 0.015 is not supported by severity

For a relevant difference of \(\Delta = 0.01\), approximately eight runs per algorithm are required. Hence, we can proceed to evaluate the severity and analyze the performance improvement achieved through tuning the parameters of \(\texttt {xgboost}\).

The summary statistics of the results are presented in Table 9.4. The decision based on the p-value is to reject the null hypothesis, i.e., the claim that the tuned parameter setup provides a significant performance improvement in terms of MMCE is supported. The effect size suggests that the difference is of a large magnitude. For the chosen \(\Delta = 0.01\), the severity value is 1 and thus strongly supports the decision to reject \(H_0\). The severity plot is shown in Fig. 9.8: severity shows that performance differences smaller than 0.015 are well supported.

9 Summary and Discussion

The analysis indicates that the hyperparameter \(\texttt {alpha}\) has the greatest effect on the algorithm's performance. The recommended value of \(\texttt {alpha}\) is 7.2791, which is much larger than the default value.

This case study demonstrates how functions from the R packages mlr and SPOT can be combined to perform a well-structured hyperparameter tuning and analysis. By specifying the time budget via \(\texttt {maxTime}\), the user can systematically improve hyperparameter settings. Before applying ML algorithms such as XGBoost to complex classification or regression problems, HPT is recommended: wrong hyperparameter settings can be avoided, and insight into the behavior of ML algorithms can be obtained.

10 Program Code


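The full script is not reproduced here; a condensed sketch of the pipeline is given below. The helper functions are those introduced in this chapter, their assumed home is the SPOTMisc package, and all argument values are placeholders:

library(SPOT)
library(SPOTMisc)  # assumed package providing the helper functions

## (1) Data and configuration (Sects. 9.2 and 9.3); values are placeholders
dataCensus <- getDataCensus(target = "age", nobs = 1e4)
cfg <- getMlConfig(target = "age", model = "xgboost", data = dataCensus,
                   task.type = "classif", nobs = 1e4, nfactors = "high",
                   nnumericals = "high", cardinality = "high",
                   data.seed = 1, prop = 2/3)

## (2) Objective function (Sect. 9.4)
objf <- getObjf(config = cfg, timeout = 3600/20)

## (3) Hyperparameter tuning (Sect. 9.5)
result <- spot(x = NULL, fun = objf, lower = cfg$lower, upper = cfg$upper,
               control = list(maxTime = 60, noise = TRUE))

## (4) Persist the result for the analyses in Sects. 9.6-9.8
save(result, file = "xgboost00001.RData")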