1 Methods and Hyperparameters

In the following, we provide a survey and description of hyperparameters of ML and DL methods. We emphasize that this is not a complete list of their parameters, but covers parameters that are set quite frequently according to the literature.

Since the specific names and meaning of hyperparameters may depend on the actual implementation used, we have chosen a reference implementation for each model. The implementations chosen are all packages from the statistical programming language R. Thus, we provide a description that is consistent with what users experience, so that they can identify the relevant parameters when tuning ML and DL methods in practice. In particular, we cover the methods shown in Table 3.1.

Table 3.1 Overview: Methods and hyperparameters analyzed in this book

This table presents an overview of these methods, their R packages, and associated hyperparameters. After a short, general description of the specific hyperparameter, the following features will be described for every hyperparameter:  

Type::

Describes the type (e.g., integer) and complexity (e.g., scalar). These data types are described in Sect. 2.7.3. The variable type of the implementation in the R package SPOTMisc, which is used for the experiments in this book, is also listed.

Default::

Default value as specified in \(\texttt {getModelConf}\) from the R package SPOTMisc.

Sensitivity::

  Describes how much the model is affected by changes of the parameter. There is a close relationship between sensitivity and tunability as defined by Probst et al. (2019a), because tunability is the potential for improvement of the parameter in the vicinity of a reference value.

Heuristics::

Describes ways to find good hyperparameter settings.

Range::

Describes feasible values, i.e., lower and upper bounds, constraints, etc.

Transformation::

Transformation as specified in \(\texttt {getModelConf}\).

Bounds::

Lower and upper bounds as specified in \(\texttt {getModelConf}\).

Constraints::

Additional constraints, specific for certain settings or algorithms.

Interactions::

Describes interactions between the parameters.

Each description concludes with a brief survey of examples from the literature that gives hints on how the method was tuned.

Attention: Default Hyperparameters

The default values in this chapter refer to the untransformed values, i.e., the transformations that are also listed in the descriptions were not applied.
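To make the relation between listed defaults and effective values concrete, the following Python sketch mimics the three transformations named in this chapter. The function bodies are assumptions inferred from the transformation names (identity, power of ten, rounded power of two); they are not taken from the SPOTMisc source code.

```python
import math

def trans_id(x):
    """Identity: the value from the tuner is passed on unchanged."""
    return x

def trans_10pow(x):
    """Assumed meaning: the tuner searches the exponent, the model gets 10**x."""
    return 10.0 ** x

def trans_2pow_round(x):
    """Assumed meaning: 2**x, rounded to the nearest integer."""
    return int(round(2.0 ** x))

# Untransformed defaults as listed in this chapter, with their transformation.
defaults = {
    "k (KNN)": (7, trans_id),
    "p (KNN)": (math.log10(2), trans_10pow),
    "thresh (EN)": (-7, trans_10pow),
    "cp (DT)": (-2, trans_10pow),
    "num.trees (RF)": (math.log(500, 2), trans_2pow_round),
}

# Values that the model would receive after applying the transformation.
effective = {name: f(x) for name, (x, f) in defaults.items()}
for name, value in effective.items():
    print(f"{name}: {value}")
```

Under these assumptions, the listed default of \(\texttt {log10(2)}\) for \(\texttt {p}\) corresponds to the Euclidean distance, and \(\texttt {log(500,2)}\) for \(\texttt {num.trees}\) corresponds to 500 trees.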

2 k-Nearest Neighbor

2.1 Description

In the field of statistical discrimination, KNN classification is an established and successful method. Hechenbichler and Schliep (2004) developed an extended KNN version, where the distances of the nearest neighbors can be taken into account. The KNN model determines for each \({x} \) the k neighbors with the smallest distance to \({x} \), e.g., based on the Minkowski distance (Eq. (2.1)). For regression, the mean of the neighbors is used (James et al. 2017). For classification, the prediction of the model is the most frequent class observed in the neighborhood. Two relevant hyperparameters (\(\texttt {k}\), \(\texttt {p}\)) result from this. Additionally, one categorical hyperparameter could be considered: the choice of evaluation algorithm (e.g., choosing between brute force or KD-Tree) (Friedman et al. 1977). However, this mainly influences computational efficiency, rather than actual performance.
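The basic procedure can be sketched in a few lines of Python. This is a minimal, unweighted brute-force variant for illustration only; it assumes the standard Minkowski distance and does not reproduce the distance weighting of the extended KNN version. The function names and the toy data are hypothetical.

```python
from collections import Counter

def minkowski(a, b, p=2.0):
    """Standard Minkowski distance between two numeric vectors."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)

def knn_predict(X, y, query, k=7, p=2.0, task="classification"):
    # Sort training indices by distance to the query and keep the k closest.
    order = sorted(range(len(X)), key=lambda i: minkowski(X[i], query, p))
    neighbors = [y[i] for i in order[:k]]
    if task == "regression":
        return sum(neighbors) / len(neighbors)       # mean of the neighbors
    return Counter(neighbors).most_common(1)[0][0]   # most frequent class

# Toy data: two well separated groups.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [0.0, 0.1, 0.2, 1.0, 1.1, 0.9]

label = knn_predict(X, labels, (0.15, 0.15), k=3)
value = knn_predict(X, targets, (1.0, 1.0), k=3, task="regression")
print(label, value)
```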

We consider the implementation from the R package kknn (Schliep et al. 2016).

2.2 Hyperparameters of k-Nearest Neighbor

KNN Hyperparameter k

The parameter \(\texttt {k}\) determines the number of neighbors that are considered by the model. In case of regression, it affects how smooth the predicted function of the model is. Similarly, it influences the smoothness of the decision boundary in case of classification.

Small values of \(\texttt {k}\) lead to fairly nonlinear predictors (or decision boundaries), while larger values tend toward more linear shapes (James et al. 2017). The error of the model at any training data sample is zero if \(\texttt {k}\) = 1 but this does not allow any conclusions about the generalization error (James et al. 2017). Larger values of \(\texttt {k}\) may help to deal with rather noisy data. Moreover, larger values of \(\texttt {k}\) increase the runtime of the model.  

Type::

integer, scalar.

Default::

\(\texttt {7}\)

Sensitivity::

Determining the size of the neighborhood via \(\texttt {k}\) is a fairly sensitive decision. James et al. (2017) describe this as a drastic effect. However, this is only true as long as the individual classes are hard to separate (in case of classification). If there is a large margin between classes, the shape of the decision boundary becomes less relevant (see Domingos 2012, Fig. 3). Thus, the sensitivity of the hyperparameter depends on the considered problem and data. Probst et al. (2019a) also identify \(\texttt {k}\) as a sensitive (or tunable) hyperparameter.

Heuristics::

As mentioned above, the choice of \(\texttt {k}\) may depend on properties of the data. Hence, no general rule can be provided. In individual cases, determining the distance between and within classes may help to find an approximate value: \(\texttt {k}\) = 1 is better than \(\texttt {k} >1\), if the distance within classes is larger than the distance between classes (Cover and Hart 1967). Another empirical suggestion from the literature is \(\texttt {k} =\sqrt{n}\), where n is the number of data samples (Lall and Sharma 1996; Probst et al. 2019a).

Range::

\(\texttt {k} \ge 1\), \(\texttt {k} \ll n\). Only integer values are valid.

Transformation::

\(\texttt {trans\_id}\)

Bounds::

\(\texttt {lower = 1; upper = 30}\)

Constraints::

none.

Interactions::

We are not aware of any interactions between the hyperparameters. However, both \(\texttt {k}\) and \(\texttt {p}\) change the perceived neighborhood of samples and thus the shape of the decision boundaries. Hence, an interaction between these hyperparameters is likely.
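As a small illustration of the square-root rule mentioned under Heuristics, the following Python sketch computes a start value for \(\texttt {k}\) and clips it to the bounds listed above. Rounding to the nearest integer and the clipping step are assumptions made for this example; the function name is hypothetical.

```python
import math

def k_start_value(n, lower=1, upper=30):
    """Square-root rule for k, clipped to the tuning bounds [lower, upper]."""
    k = int(round(math.sqrt(n)))
    return max(lower, min(upper, k))

for n in (50, 400, 10_000):
    print(n, k_start_value(n))
```

For large data sets, the rule suggests values above the upper bound of 30, so the bound becomes the effective start value.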

 

KNN Hyperparameter p

The hyperparameter \(\texttt {p}\) affects the distance measure that is used to determine the nearest neighbors in KNN. Frequently, this is the Minkowski distance, see Eq. (2.1). Moreover, it has to be considered that other distances could be chosen for non-numerical features of the data set (e.g., Hamming distance for categorical features). The implementation used in the R package kknn transforms categorical variables into numerical variables via dummy-coding and then applies the Minkowski distance to the resulting data. Similar to \(\texttt {k}\), \(\texttt {p}\) changes the observed neighborhood. While \(\texttt {p}\) does not change the number of neighbors, it still affects the choice of neighbors.

Type::

double, scalar.

Default::

\(\texttt {log10(2)}\)

Sensitivity::

It has to be expected that the model is less sensitive to changes in \(\texttt {p}\) than to changes in \(\texttt {k}\), since fairly extreme changes are required to change the neighborhood set of a specific data sample. This explains why many publications do not consider \(\texttt {p}\) during tuning, see Table 3.2. However, the detailed investigation of Alfeilat et al. (2019) showed that changes of the distance measure can have a significant effect on the model accuracy. Alfeilat et al. (2019) only tested special cases of the Minkowski distance (Eq. (2.1)): Manhattan distance (\(\texttt {p} =1\)), Euclidean distance (\(\texttt {p} =2\)) and Chebyshev distance (\(\texttt {p} =\infty \)). They give no indication whether other values may be of interest as well.

Heuristics::

The choice of distance measure (and hence \(\texttt {p}\)) depends on the data; a general recommendation or rule-of-thumb is hard to derive (Alfeilat et al. 2019).

Range::

Often, the interval \(1 \le \texttt {p} \le 2\) is considered. The lower boundary is \(\texttt {p} >0\). Note: The Minkowski distance is not a metric if \(\texttt {p} <1\) (Alfeilat et al. 2019). Theoretically, a value of \(\texttt {p} =\infty \) is possible (resulting in the Chebyshev distance), but this is not possible in the kknn implementation.

Transformation::

\(\texttt {trans\_10pow}\)

Bounds::

\(\texttt {lower = -1; upper = 2}\)

Constraints::

none.

Interactions::

We are not aware of any known interactions between hyperparameters. However, both \(\texttt {k}\) and \(\texttt {p}\) change what is perceived as the neighborhood of samples, and hence the shape of decision boundaries. An interaction between those hyperparameters is likely.
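That \(\texttt {p}\) changes the choice of neighbors without changing their number can be seen in a two-point example. The Python sketch below assumes the standard Minkowski distance; the points are made up for illustration.

```python
def minkowski(a, b, p):
    """Standard Minkowski distance between two numeric vectors."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Limit of the Minkowski distance for p -> infinity."""
    return max(abs(u - v) for u, v in zip(a, b))

query = (0.0, 0.0)
points = {"A": (3.0, 0.0), "B": (2.0, 2.0)}

def nearest(p):
    # Name of the point that is closest to the query for a given p.
    return min(points, key=lambda name: minkowski(points[name], query, p))

nearest_by_p = {p: nearest(p) for p in (1.0, 2.0, 10.0)}
print(nearest_by_p)
```

With the Manhattan distance (\(\texttt {p} =1\)), point A is the nearest neighbor of the query; with the Euclidean distance (\(\texttt {p} =2\)) and larger values, point B is closer.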

 

Table 3.2 provides a brief survey of examples from the literature, where KNN was tuned.

Table 3.2 Survey of examples from the literature, for tuning of KNN

3 Regularized Regression (Elastic Net)

3.1 Description

EN is a regularized regression method (Zou and Hastie 2005). Regularized regression can be employed to fit regression models with a reduced number of model coefficients. Special cases of EN are Lasso and Ridge regression.

Regularization is useful when the number of features k is large, i.e., when data sets are high dimensional (especially but not exclusively if \(k>n\)), or when variables in the data sets are heavily correlated with each other (Zou and Hastie 2005). Less complex models (i.e., with fewer coefficients, see also Definition 2.27) help to reduce overfitting. Overfitting means that the model is extremely well adapted to the training data, but generalizes poorly as a result, i.e., predicts poorly for unseen data. The resulting models are also easier to understand for humans, due to their reduced complexity.

During training, non-regularized regression reduces the model error (e.g., via the least squares method), but not the model complexity. EN also considers a penalty term, which grows with the number of coefficients included in the model (i.e., the number of non-zero coefficients).

As a reference implementation, we use the R package glmnet (Friedman et al. 2020; Simon et al. 2011).

3.2 Hyperparameters of Elastic Net

EN Hyperparameter alpha

The parameter \(\texttt {alpha}\) (\(\alpha \)) weighs the two elements of the penalty term of the loss function in the EN model (Friedman et al. 2010):

$$\begin{aligned} \min _{\beta _0,\beta } \frac{1}{2n}\sum _{i=1}^n ({y} _i-\beta _0 - {x} _i^{\text {T}} \beta )^2 + \lambda P(\alpha ,\beta ). \end{aligned}$$
(3.1)

The penalty term \(P(\alpha ,\beta )\) is (Friedman et al. 2010)

$$\begin{aligned} (1-\alpha ) \frac{1}{2}||\beta ||_2^2+\alpha ||\beta ||_1, \end{aligned}$$
(3.2)

with the vector of p model coefficients \(\beta \in \mathbb {R} ^p\) and the intercept coefficient \(\beta _0 \in \mathbb {R} \). The value \(\texttt {alpha}\) = 0 corresponds to the special case of Ridge regression, \(\texttt {alpha}\) = 1 corresponds to Lasso regression (Friedman et al. 2010).

The parameter \(\texttt {alpha}\) allows finding a compromise or trade-off between Lasso and Ridge regression. This can be advantageous, since the two variants behave differently. Ridge regression shrinks the coefficients of strongly correlated variables toward each other (extreme case: identical variables receive identical coefficients) (Friedman et al. 2010). In contrast, Lasso regression tends to keep a single non-zero coefficient in such a case (the other coefficients being zero) (Friedman et al. 2010).
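The weighting in Eq. (3.2) can be checked numerically. The following Python sketch evaluates the penalty term for a made-up coefficient vector; the function name is hypothetical.

```python
def en_penalty(beta, alpha):
    """P(alpha, beta) from Eq. (3.2): (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1."""
    l2_squared = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return (1.0 - alpha) * 0.5 * l2_squared + alpha * l1

beta = [0.5, -2.0, 0.0]
ridge = en_penalty(beta, alpha=0.0)  # only the squared L2 norm remains
lasso = en_penalty(beta, alpha=1.0)  # only the L1 norm remains
mixed = en_penalty(beta, alpha=0.5)  # equal weight on both parts
print(ridge, lasso, mixed)
```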

Type::

double, scalar.

Default::

\(\texttt {1}\)

Sensitivity::

Empirical results from  Friedman et al. (2010) show that the EN model can be rather sensitive to changes in \(\texttt {alpha}\).

Heuristics::

We are not aware of any heuristics to set this parameter. As described by Friedman et al. (2010), \(\texttt {alpha}\) can be set to a value close to 1, if a model with few coefficients without risk of degeneration is desired.

Range::

\(\texttt {alpha}\) \(\in [0,1]\).

Transformation::

\(\texttt {trans\_id}\)

Bounds::

\(\texttt {lower = 0; upper = 1}\)

Constraints::

none.

Interactions::

\(\texttt {lambda}\) interacts with \(\texttt {alpha}\), see Sect. 3.3.2.

 

EN Hyperparameter lambda

The hyperparameter \(\texttt {lambda}\) influences the impact of the penalty term \(P(\alpha ,\beta )\) in Eq. (3.1). Very large \(\texttt {lambda}\) values lead to many model coefficients (\(\beta \)) being set to zero. Correspondingly, only few model coefficients become zero if \(\texttt {lambda}\) is small (close to zero). Thus, \(\texttt {lambda}\) is often treated differently than other hyperparameters: in many cases, several values of \(\texttt {lambda}\) are of interest, rather than a single value (Simon et al. 2011). There is no singular, optimal solution for \(\texttt {lambda}\), as it controls the trade-off between model quality and complexity (number of coefficients that are not zero). Hence, a whole set of \(\texttt {lambda}\) values will often be suggested to users, who then choose a resulting model that provides a specific trade-off to their liking.  

Type::

double, scalar.

Default::

not implemented, because parameter is not tuned.

Sensitivity::

EN is necessarily sensitive to \(\texttt {lambda}\), since extreme values lead to completely different models, i.e., all coefficients are zero or none are zero. This is also shown in Fig. 1 by Friedman et al. (2010).

Heuristics::

Often, \(\texttt {lambda}\) is determined by a type of grid search, where a sequence of decreasing \(\texttt {lambda}\) values is tested (Friedman et al. 2010; Simon et al. 2011). The sequence starts with a sufficiently large value of \(\texttt {lambda}\), such that \(\beta =0\). The sequence ends when the resulting model starts to approximate the unregularized model (Simon et al. 2011).

Range::

\(\texttt {lambda}\) \(\in (0,\infty )\) (Note: \(\texttt {lambda}\) = 0 is possible, but leads to a simple unregularized model). Using a logarithmic scale seems reasonable, as used in the study by Probst et al. (2019a), to cover a broad spectrum of very small and very large values.

Transformation::

not implemented, because parameter is not tuned.

Bounds::

not implemented, because parameter is not tuned.

Constraints::

none.

Interactions::

\(\texttt {lambda}\) interacts with \(\texttt {alpha}\). Both are central for determining the coefficients \(\beta \) (see also Friedman et al. 2010, Fig. 1).
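The decreasing sequence of \(\texttt {lambda}\) values described under Heuristics, combined with the logarithmic scale suggested under Range, could be generated as in the following Python sketch. It assumes that the largest value (here called lambda_max, for which all coefficients are zero) is already known; the function and argument names are chosen for this illustration only.

```python
import math

def lambda_path(lambda_max, n_lambda=100, min_ratio=1e-4):
    """Decreasing, log-equidistant sequence from lambda_max to min_ratio * lambda_max."""
    log_hi = math.log(lambda_max)
    log_lo = math.log(lambda_max * min_ratio)
    step = (log_hi - log_lo) / (n_lambda - 1)
    return [math.exp(log_hi - i * step) for i in range(n_lambda)]

# Five values spanning four orders of magnitude.
path = lambda_path(lambda_max=2.0, n_lambda=5, min_ratio=1e-4)
print(path)
```

Each value in the sequence yields one model, so the user can inspect the whole path and pick a trade-off between model quality and number of non-zero coefficients.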

 

EN Hyperparameter thresh

The parameter \(\texttt {thresh}\) is a threshold for model convergence (i.e., convergence of the internal coordinate descent). Model training ends when the change after an update of the coefficients drops below this value (Friedman et al. 2020). Unlike parameters like \(\texttt {lambda}\), \(\texttt {thresh}\) is not a regularization parameter, hence there is no clear connection between \(\texttt {thresh}\) and the number of model coefficients.

As a stopping criterion, \(\texttt {thresh}\) influences the duration of model training (larger values of \(\texttt {thresh}\) result in faster training), and the quality of the model (larger values of \(\texttt {thresh}\) may decrease quality).

 

Type::

double, scalar.

Default::

\(\texttt {-7}\)

Sensitivity::

As long as \(\texttt {thresh}\) is in a reasonable range of values, the model will not be sensitive to changes. Extremely large values can lead to fairly poor models, extremely small values may result in significantly larger training times.

Heuristics::

none are known.

Range::

\(\texttt {thresh}\) \(\approx 0\), \(\texttt {thresh}\) \(>0\). It seems reasonable to set \(\texttt {thresh}\) on a log-scale with fairly coarse granularity, since \(\texttt {thresh}\) has a low sensitivity for the most part. Example: \(\texttt {thresh}\) \(= 10^{-20},10^{-18}, \ldots , 10^{-4}\).

Transformation::

\(\texttt {trans\_10pow}\)

Bounds::

\(\texttt {lower = -8; upper = -1}\)

Interactions::

none are known.
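The role of \(\texttt {thresh}\) as a stopping criterion can be illustrated with a small cyclic coordinate descent for the objective in Eq. (3.1). The Python sketch below is a simplified stand-in, not the glmnet algorithm: it omits the intercept and standardization, and it stops when the largest coefficient change in one sweep falls below \(\texttt {thresh}\), which is an assumption made for this example. The data are made up.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used for the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net_cd(X, y, lam, alpha, thresh=1e-7, max_sweeps=10_000):
    """Cyclic coordinate descent for Eq. (3.1) without intercept.

    Returns the coefficients and the number of sweeps over all coordinates.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta for beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    sweep = 0
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        for j in range(p):
            # Covariance of feature j with the partial residual (feature j added back).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < thresh:  # stopping rule controlled by thresh
            break
    return beta, sweep

# Two correlated features; y = 2 * x1 - 1 * x2 without noise.
X = [(1.0, 0.5), (2.0, 1.5), (3.0, 2.0), (4.0, 4.5), (5.0, 4.0)]
y = [2.0 * a - 1.0 * b for a, b in X]

beta_tight, sweeps_tight = elastic_net_cd(X, y, lam=0.0, alpha=1.0, thresh=1e-10)
beta_loose, sweeps_loose = elastic_net_cd(X, y, lam=0.0, alpha=1.0, thresh=1e-2)
beta_sparse, _ = elastic_net_cd(X, y, lam=13.0, alpha=1.0)
print(sweeps_tight, sweeps_loose, beta_sparse)
```

In this example, the loose threshold stops after fewer sweeps but leaves the coefficients further from the least squares solution, which mirrors the trade-off between training time and model quality described above.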

 

In conclusion, Table 3.3 provides a brief survey of examples from the literature, where EN was tuned.

Table 3.3 Survey of examples from the literature, for tuning of EN

4 Decision Trees

4.1 Description

Decision and regression trees are models that divide the data space into individual segments with successive decisions (called splits).

Basically, the procedure of a decision tree is as follows: Starting from a root node (which contains all observations) a first split is carried out. Each split affects a variable (or a feature). This variable is compared with a threshold value. All observations for which the variable is less than the threshold are assigned to a new node. All other observations are assigned to another new node. This procedure is then repeated for each node until a termination criterion is reached or until there is only one observation in each end node. End nodes are also called leaves (following the tree analogy).
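The recursive splitting procedure can be sketched as follows in Python. This is a greedy regression tree written for illustration; it is not the rpart algorithm (no complexity parameter, no pruning). The termination criteria are named after the hyperparameters \(\texttt {minsplit}\), \(\texttt {minbucket}\), and \(\texttt {maxdepth}\) described in this section, and the squared-error split criterion as well as the toy data are assumptions.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, minbucket):
    """Return (feature, threshold) with the lowest summed squared error, or None."""
    best, best_cost = None, sse(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] < t]
            right = [y[i] for i, row in enumerate(X) if row[j] >= t]
            if len(left) < minbucket or len(right) < minbucket:
                continue  # a child node would be too small
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (j, t), cost
    return best

def grow(X, y, minsplit=20, minbucket=7, maxdepth=30, depth=0):
    """Grow a tree; minbucket is assumed to be at least 1."""
    split = None
    if len(y) >= minsplit and depth < maxdepth:
        split = best_split(X, y, minbucket)
    if split is None:
        return {"leaf": True, "value": sum(y) / len(y), "n": len(y)}
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] < t]
    right = [i for i, row in enumerate(X) if row[j] >= t]
    args = (minsplit, minbucket, maxdepth, depth + 1)
    return {
        "leaf": False, "feature": j, "threshold": t,
        "left": grow([X[i] for i in left], [y[i] for i in left], *args),
        "right": grow([X[i] for i in right], [y[i] for i in right], *args),
    }

def count_leaves(node):
    if node["leaf"]:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

X = [(float(i),) for i in range(8)]
y = [0.0, 0.1, 0.0, 0.1, 1.0, 1.1, 1.0, 1.1]

stump = grow(X, y, minsplit=2, minbucket=1, maxdepth=1)
deep = grow(X, y, minsplit=2, minbucket=1, maxdepth=30)
unsplit = grow(X, y, minsplit=9, minbucket=1, maxdepth=30)
print(count_leaves(stump), count_leaves(deep), count_leaves(unsplit))
```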

A detailed description of tree-based models is given by James et al. (2014). An overview of decision tree implementations and algorithms is given by Zharmagambetov et al. (2019). Gomes Mantovani et al. (2018) describe the tuning of hyperparameters of several implementations. As a reference implementation, we refer to the R package rpart  (Therneau and Atkinson 2019; Therneau et al. 2019).

4.2 Hyperparameters of Decision Trees

DT Hyperparameter minsplit

If there are fewer than \(\texttt {minsplit}\) observations in a node of the tree, no further split is carried out at this node. Thus, \(\texttt {minsplit}\) limits the complexity (number of nodes) of the tree. With large \(\texttt {minsplit}\) values, fewer splits are made. A suitable choice of \(\texttt {minsplit}\) can thus avoid overfitting. In addition, the parameter influences the duration of the training of a decision tree (Hastie et al. 2017).

 

Type::

integer, scalar.

Default::

\(\texttt {20}\)

Sensitivity::

Trees can react very sensitively to parameters that influence their complexity. Together with \(\texttt {minbucket}\), \(\texttt {cp}\), and \(\texttt {maxdepth}\), \(\texttt {minsplit}\) is one of the most important hyperparameters (Gomes Mantovani et al. 2018).

Heuristics::

\(\texttt {minsplit}\) is set to three times \(\texttt {minbucket}\) in certain implementations, if this parameter is available (Therneau and Atkinson 2019).

Range::

\(\texttt {minsplit}\) \( \in [1, n]\), where \(\texttt {minsplit}\) \( \ll n\) is recommended, since otherwise trees with extremely few nodes will arise. Only integer values are valid.

Transformation::

\(\texttt {trans\_id}\)

Bounds::

\(\texttt {lower = 1; upper = 300}\)

Constraints::

\(\texttt {minsplit}\) > \(\texttt {minbucket}\). This is a soft constraint, i.e., valid models are created even if violated, but \(\texttt {minsplit}\) would no longer have any effect.

Interactions::

The parameters \(\texttt {minsplit}\), \(\texttt {minbucket}\), \(\texttt {cp}\), and \(\texttt {maxdepth}\) all influence the complexity of the tree. Interactions between these parameters are therefore likely. In addition, \(\texttt {minsplit}\) has no effect for certain values of \(\texttt {minbucket}\) (see Constraints). Similar relationships (depending on the data) are also conceivable for the other parameter combinations.

 

DT Hyperparameter minbucket

\(\texttt {minbucket}\) specifies the minimum number of data points in an end node (leaf) of the tree. The meaning in practice is similar to that of \(\texttt {minsplit}\). With larger values, \(\texttt {minbucket}\) also increasingly limits the number of splits and thus the complexity of the tree. Additional information: \(\texttt {minbucket}\) is set relative to \(\texttt {minsplit}\), i.e., we are using numerical values for \(\texttt {minbucket}\) that represent fractions of \(\texttt {minsplit}\). If \(\texttt {minbucket}\) \(= 1.0\), then \(\texttt {minbucket}\) = \(\texttt {minsplit}\). \(\texttt {minsplit}\) should be greater than or equal to \(\texttt {minbucket}\).

Type::

integer, scalar.

Default::

\(\texttt {1/3}\)

Sensitivity::

see \(\texttt {minsplit}\).

Heuristics::

\(\texttt {minbucket}\) is set to a third of the value of \(\texttt {minsplit}\) in the reference implementation, if this parameter is available (Therneau et al. 2019).

Range::

\(\texttt {minbucket}\) \(\in [1, n]\), where \(\texttt {minbucket}\) \( \ll n\) is recommended, as otherwise trees with extremely few nodes will arise. Only integer values are valid.

Transformation::

\(\texttt {trans\_id}\)

Bounds::

\(\texttt {lower = 0.1; upper = 0.5}\)

Constraints::

\(\texttt {minsplit}\) > \(\texttt {minbucket}\) (this is a soft constraint, i.e., valid models are created even if violated, but \(\texttt {minsplit}\) would no longer have any effect).

Interactions::

see \(\texttt {minsplit}\). Due to the similarity of \(\texttt {minsplit}\) and \(\texttt {minbucket}\), it can make sense to only tune one of the two parameters.

 

DT Hyperparameter cp

The threshold complexity \(\texttt {cp}\) controls the complexity of the model in that split decisions are linked to a minimal improvement. This means that if a split does not improve the tree-based model by at least the factor \(\texttt {cp}\), this split will not be carried out. With larger values, \(\texttt {cp}\) increasingly limits the number of splits and thus the complexity of the tree.

Therneau and Atkinson (2019) describe the \(\texttt {cp}\) parameter as follows:

The complexity parameter \(\texttt {cp}\) is, like \(\texttt {minsplit}\), an advisory parameter, but is considerably more useful. It is specified according to the formula

$$\begin{aligned} R_{\texttt {cp}}(T) \equiv R(T) + \texttt {cp} \times |T| \times R(T_1), \end{aligned}$$
(3.3)

where \(T_1\) is the tree with no splits, |T| is the number of splits for a tree, and R is the risk. This scaled version is much more user-friendly than the original CART formula since it is unit less. A value of \(\texttt {cp} = 1\) will always result in a tree with no splits. For regression models, the scaled \(\texttt {cp}\) has a very direct interpretation: if any split does not increase the overall \(R^2\) of the model by at least \(\texttt {cp}\) (where \(R^2\) is the usual linear models definition) then that split is decreed to be, a priori, not worth pursuing. The program does not split said branch any further and saves considerable computational effort. The default value of 0.01 has been reasonably successful at “pre-pruning” trees so that the cross-validation step only needs to remove one or two layers, but it sometimes over prunes, particularly for large data sets.
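The effect of \(\texttt {cp}\) in Eq. (3.3) can be illustrated by comparing a few candidate trees. In the Python sketch below, the risks and numbers of splits are made-up values for hypothetical nested subtrees; only the formula itself is taken from Eq. (3.3).

```python
def cp_risk(risk, n_splits, root_risk, cp):
    """Eq. (3.3): R_cp(T) = R(T) + cp * |T| * R(T_1)."""
    return risk + cp * n_splits * root_risk

root_risk = 10.0  # R(T_1): risk of the tree without splits (made-up number)
# Hypothetical nested subtrees as (number of splits |T|, risk R(T)).
candidates = [(0, 10.0), (1, 6.0), (2, 5.0), (5, 4.8)]

def preferred_tree(cp):
    """Candidate with the smallest penalized risk for a given cp."""
    return min(candidates, key=lambda t: cp_risk(t[1], t[0], root_risk, cp))

choice_small = preferred_tree(0.001)
choice_default = preferred_tree(0.01)
choice_large = preferred_tree(1.0)
print(choice_small, choice_default, choice_large)
```

With a small \(\texttt {cp}\), the largest tree has the lowest penalized risk; with \(\texttt {cp} = 1\), every split costs as much as the total risk of the root, so the tree without splits is preferred.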

 

Type::

double, scalar.

Default::

\(\texttt {-2}\)

Sensitivity::

see \(\texttt {minsplit}\).

Heuristics::

none known.

Range::

\(\texttt {cp}\) \(\in [0, 1)\).

Transformation::

\(\texttt {trans\_10pow}\)

Bounds::

\(\texttt {lower = -10; upper = 0}\)

Constraints::

none.

Interactions::

see \(\texttt {minsplit}\). Since \(\texttt {cp}\) expresses a relative factor for the improvement of the model, an interaction with the corresponding quality measure is also possible (split parameter).

 

DT Hyperparameter maxdepth

The parameter \(\texttt {maxdepth}\) limits the maximum depth of a leaf in the decision tree. The depth of a leaf is the number of nodes that lie on the path between the root and the leaf. The root node itself is not counted (Therneau and Atkinson 2019).

The meaning in practice is similar to that of \(\texttt {minsplit}\). Both \(\texttt {minsplit}\) and \(\texttt {maxdepth}\) can be used to limit the complexity of the tree. However, smaller values of \(\texttt {maxdepth}\) lead to a lower complexity of the tree. With \(\texttt {minsplit}\) it is the other way round (larger values lead to less complexity).  

Type::

integer, scalar.

Default::

\(\texttt {30}\)

Sensitivity::

see \(\texttt {minsplit}\).

Heuristics::

none known.

Range::

\(\texttt {maxdepth}\) \(\in [0, n]\). Only integer values are valid.

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1; upper = 30}\).

Constraints::

none.

Interactions::

see \(\texttt {minsplit}\).

 

Table 3.4 shows examples from the literature.

Table 3.4 DT: survey of examples from the literature. Tree-based tuning example configurations

5 Random Forest

5.1 Description

The model quality of decision trees can often be improved with ensemble methods. Here, many individual models (i.e., many individual trees) are merged into one overall model (the ensemble). Popular examples are RF and XGBoost methods. This section discusses RF methods; XGBoost methods will be discussed in Sect. 3.6. The RF method creates many decision trees at the same time, and their prediction is then usually made using the mean (in case of regression) or by majority vote (in case of classification).
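The aggregation step can be written down directly. The following Python sketch combines the predictions of individual trees for a single observation; the function name and the example values are hypothetical, and ties in the vote are resolved in favor of the class that was seen first, which is an arbitrary choice.

```python
from collections import Counter

def aggregate(tree_predictions, task):
    """Combine per-tree predictions for one observation."""
    if task == "regression":
        return sum(tree_predictions) / len(tree_predictions)  # mean
    return Counter(tree_predictions).most_common(1)[0][0]     # majority vote

votes = ["yes", "no", "yes", "yes", "no"]
estimates = [2.0, 2.5, 3.0, 2.5]

majority = aggregate(votes, "classification")
mean = aggregate(estimates, "regression")
print(majority, mean)
```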

The variant of RF described by Breiman (2001) uses two important steps to reduce generalization error: first, when creating individual trees, only a random subset of the features is considered for each split. Second, each tree is given a randomly drawn subset of the observations to train. Typically, the approach of bootstrap aggregating or bagging (James et al. 2017) is used. A comprehensive discussion of random forest models is provided by Louppe (2015), who also presents a detailed discussion of hyperparameters. Theoretical results on hyperparameters of RF models are summarized by Scornet (2017). Often, tuning of RF also takes into account parameters for the decision trees themselves as described in Sect. 3.4. Our reference implementation studied in this report is from the R package ranger (Wright 2020; Wright and Ziegler 2017).

5.2 Hyperparameters of Random Forests

RF Hyperparameter num.trees

\(\texttt {num.trees}\) determines the number of trees that are combined in the overall ensemble model. In practice, this influences the quality of the method (more trees improve the quality) and the runtime of the model (more trees lead to longer runtimes for training and prediction).  

Type::

integer, scalar.

Default::

\(\texttt {log(500,2)}\).

Sensitivity::

According to Breiman (2001), the generalization error of the model converges with increasing number of trees toward a lower bound. This means that the model will become less sensitive to changes of \(\texttt {num.trees}\) with increasing values of \(\texttt {num.trees}\). This is also shown in the benchmarks of Louppe (2015). Only for relatively small values (\(\texttt {num.trees}\) \(<50\)) is the model rather sensitive to changes in that parameter. The empirical results of Probst et al. (2019a) also show that the tunability of \(\texttt {num.trees}\) is estimated to be rather low.

Heuristics::

There are theoretical results about the convergence of the model in relation to \(\texttt {num.trees}\)  (Breiman 2001; Scornet 2017). This however does not result in a clear heuristic approach to setting this parameter. One common recommendation is to choose \(\texttt {num.trees}\) sufficiently high (Probst et al. 2019c) (since more trees are usually better), while making sure that the runtime of the model does not become too large.

Range::

\(\texttt {num.trees}\) \(\in [1,\infty )\). Several hundred or thousands of trees are commonly used, see also Table 3.5.

Transformation::

\(\texttt {trans\_2pow\_round}\).

Bounds::

\(\texttt {lower = 0; upper = 11}\).

Constraints::

none.

Interactions::

none are known.

 

RF Hyperparameter mtry

The hyperparameter \(\texttt {mtry}\) determines how many randomly chosen features are considered for each split. Thus, it controls an important aspect, the randomization of individual trees. Values of \(\texttt {mtry}\) \(\ll n\) (where n denotes the number of features in this context) imply that differences between trees will be larger (more randomness). This increases the potential error of individual trees, but the overall ensemble benefits (Breiman 2001; Louppe 2015). As a useful side effect \(\texttt {mtry}\) \(\ll n\) may also reduce the runtime considerably (Louppe 2015). Nevertheless, findings about this parameter largely depend on heuristics and empirical results. According to Scornet (2017), no theoretical results about the randomization of split features are available.

Type::

integer, scalar.

Default::

\(\texttt {floor(sqrt(nFeatures))}\).

Sensitivity::

According to Breiman (2001), RF is relatively insensitive to changes of \(\texttt {mtry}\): “But the procedure is not overly sensitive to the value of F. The average absolute difference between the error rate using F=1 and the higher value of F is less than 1%.” (Breiman 2001) (here: F corresponds to \(\texttt {mtry}\)).

This seems to be at odds with the benchmarks by Louppe (2015), which determine that \(\texttt {mtry}\) may indeed have a considerable impact, especially for low values of \(\texttt {mtry}\). The investigation of tunability by Probst et al. (2019a) also identifies \(\texttt {mtry}\) as an important (i.e., tunable) parameter. This is not necessarily a contradiction to Breiman’s observation, since Probst et al. (2019a) determine RF as the least tunable model in their experimental investigation. So while \(\texttt {mtry}\) might have some impact (compared to other parameters), it may be less sensitive when compared in relation to hyperparameters of other models.

Heuristics::

Breiman (2001) proposes the following heuristic:

$$\begin{aligned} {\texttt {mtry}} =\text {floor}(\log _2(n)+1). \end{aligned}$$

Should categorical features be present, Breiman suggests doubling or tripling that value. No theoretical motivation is given.

Another frequent suggestion is \(\texttt {mtry}\) \(=\sqrt{n}\) (or \(\texttt {mtry}\) \(=\text {floor} (\sqrt{n})\)). While these are used in various implementations of RF, there is no clear theoretical motivation given. For \(n<20\) both heuristics provide very similar values.

Some implementations distinguish between classification (\(\texttt {mtry}\) \(=\sqrt{n}\)) and regression (\(\texttt {mtry}\) \(=n/3\)). Empirical results with these heuristics are described by Probst et al. (2019c).

Range::

\(\texttt {mtry}\) \(\in [1,n]\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1; upper = nFeatures}\).

Constraints::

none.

Interactions::

none are known.
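The heuristics for \(\texttt {mtry}\) listed above are easy to compare numerically. In the Python sketch below, n_features plays the role of n in the formulas; rounding down for the n/3 rule and enforcing at least one feature are assumptions made for this example, and the function names are hypothetical.

```python
import math

def mtry_log2(n_features):
    """Heuristic floor(log2(n) + 1)."""
    return math.floor(math.log2(n_features) + 1)

def mtry_sqrt(n_features):
    """Heuristic floor(sqrt(n))."""
    return math.floor(math.sqrt(n_features))

def mtry_third(n_features):
    """Heuristic n / 3; rounding down and the minimum of 1 are assumptions."""
    return max(1, n_features // 3)

# (log2 rule, square-root rule, one-third rule) for several feature counts.
table = {n: (mtry_log2(n), mtry_sqrt(n), mtry_third(n)) for n in (4, 16, 100)}
for n, values in table.items():
    print(n, values)
```

For small numbers of features, the logarithmic and the square-root rule differ by at most one in these examples, while the one-third rule grows much faster for high-dimensional data.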

 

RF Hyperparameter sample.fraction

The parameter \(\texttt {sample.fraction}\) determines how many observations are randomly drawn to train one specific tree.

Probst et al. (2019c) write that \(\texttt {sample.fraction}\) has a similar effect to \(\texttt {mtry}\). That means, it influences the properties of the trees: With small \(\texttt {sample.fraction}\) (corresponding to small \(\texttt {mtry}\)) individual trees are weaker (in terms of predictive quality), yet the diversity of trees is increased. This improves the ensemble model quality. Smaller values of \(\texttt {sample.fraction}\) reduce the runtime (Probst et al. 2019c) (if all other parameters are equal).

Type::

double, scalar.

Default::

\(\texttt {1}\).

Sensitivity::

\(\texttt {sample.fraction}\) can have a relevant impact on model quality. Scornet (2017) reports: “However, according to empirical results, there is no justification for default values in random forests for sub-sampling or tree depth, since optimizing either leads to better performance.”

Heuristics::

none known.

Range::

\(\texttt {sample.fraction}\) \(\in (0,1]\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 0.1; upper = 1}\).

Constraints::

none.

Interactions::

Potentially, \(\texttt {sample.fraction}\) interacts with parameters that influence the training of individual decision trees (e.g., \(\texttt {maxdepth}\), \(\texttt {minsplit}\), \(\texttt {cp}\)). Scornet (2017) states: “According to the theoretical analysis of median forests, we know that there is no need to optimize both the \(\texttt {subsample}\) size and the tree depth: optimizing only one of these two parameters leads to the same performance as optimizing both of them.” However, this theoretical observation is only valid for the respective median trees and not necessarily for the classical RF model we consider.

 

RF Hyperparameter replace

The parameter \(\texttt {replace}\) specifies whether randomly drawn samples are replaced, i.e., whether individual samples can be drawn multiple times for the training of a tree (\(\texttt {replace}\) = \(\texttt {TRUE}\)) or not (\(\texttt {replace}\) = \(\texttt {FALSE}\)). If \(\texttt {replace}\) = \(\texttt {TRUE}\), the probability that two trees receive the same data sample is reduced. This may further decorrelate trees and improve quality.  

Type::

logical, scalar.

Default::

\(\texttt {2}\) (\(\texttt {TRUE}\)).

Sensitivity::

The sensitivity of \(\texttt {replace}\) is often rather small. Yet, the survey of Probst et al. (2019c) notes a potentially detrimental bias for \(\texttt {replace}\) = \(\texttt {TRUE}\), if categorical variables with a variable number of levels are present.

Heuristics::

Due to the aforementioned bias, the choice could be made depending on the variance of the cardinality in the data features. However, a quantifiable recommendation is not available.

Range::

\(\texttt {replace}\) \(\in \{\texttt {TRUE}, \texttt {FALSE} \}\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1 (\texttt {FALSE}); upper = 2 (\texttt {TRUE})}\).

Constraints::

none.

Interactions::

One obvious interaction occurs with \(\texttt {sample.fraction}\). Both parameters control the random choice of training data for each tree. The setting (\(\texttt {replace}\) = \(\texttt {TRUE}\) \(\wedge \) \(\texttt {sample.fraction}\) \(=1\)) as well as the setting (\(\texttt {replace}\) = \(\texttt {FALSE}\) \(\wedge \) \(\texttt {sample.fraction}\) \(<1\)) imply that individual trees will not see the whole data set.
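The interaction between \(\texttt {replace}\) and \(\texttt {sample.fraction}\) can be made tangible with a small simulation. The following Python sketch is not part of the reference implementation; the function `unique_fraction` is a hypothetical helper that only counts how many distinct observations a single tree would receive.

```python
import random

def unique_fraction(n_obs, sample_fraction, replace, seed=0):
    """Fraction of distinct observations drawn for one tree (illustrative helper)."""
    rng = random.Random(seed)
    n_draw = int(round(sample_fraction * n_obs))
    if replace:
        # Sampling with replacement: the same index may be drawn several times.
        drawn = [rng.randrange(n_obs) for _ in range(n_draw)]
    else:
        # Sampling without replacement: every index is drawn at most once.
        drawn = rng.sample(range(n_obs), n_draw)
    return len(set(drawn)) / n_obs

with_replacement = unique_fraction(10_000, 1.0, replace=True)
without_replacement = unique_fraction(10_000, 0.7, replace=False)
full_data = unique_fraction(10_000, 1.0, replace=False)

print(with_replacement, without_replacement, full_data)
```

With replacement and \(\texttt {sample.fraction}\) \(=1\), roughly \(1 - 1/e \approx 63\%\) of the observations are expected to be seen by a tree. Only the combination \(\texttt {replace}\) = \(\texttt {FALSE}\) and \(\texttt {sample.fraction}\) \(=1\) passes the complete data set to every tree.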

 

RF Hyperparameter respect.unordered.factors

This parameter decides how splits of categorical variables are handled. There are basically three options: \(\texttt {ignore}\), \(\texttt {order}\), or \(\texttt {partition}\), which will briefly be explained in the following. A detailed discussion is given by Wright and König (2019). A standard that is also used by Breiman (2001) is \(\texttt {respect.unordered.factors}\) = \(\texttt {partition}\). In that case, all potential splits of a nominal, categorical variable are considered. This leads to a good model, but the large number of considered splits can lead to an unfavorable runtime.

A naive alternative is \(\texttt {respect.unordered.factors}\) = \(\texttt {ignore}\). Here, the categorical nature of a variable will be ignored. Instead, it is assumed that the variable is ordinal, and splits are chosen just as with numerical variables. This reduces runtime but can decrease model quality.

A better choice should be \(\texttt {respect.unordered.factors}\) = \(\texttt {order}\). Here, each categorical variable is first sorted, depending on the frequency of each level in the first of two classes (in case of classification) or depending on the average dependent variable value (regression). After this sorting, the variable is considered to be numerical. This allows for a runtime similar to that with \(\texttt {respect.unordered.factors}\) = \(\texttt {ignore}\) but with potentially better model quality. This may not be feasible for classification with more than two classes, due to the lack of a clear sorting criterion (Wright and König 2019; Wright 2020).

In specific cases, \(\texttt {respect.unordered.factors}\) = \(\texttt {ignore}\) may work well in practice. This could be the case when the variable is actually ordinal (unknown to the analyst).  
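The sorting step behind the \(\texttt {order}\) option can be sketched for the regression case as follows. This is a minimal illustration of the idea described above, not the code used by the reference implementation; the function name `order_levels_by_mean` and the toy data are assumptions.

```python
from collections import defaultdict

def order_levels_by_mean(levels, y):
    """Map each category level to its rank by mean response (regression case)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(levels, y):
        sums[level] += value
        counts[level] += 1
    means = {level: sums[level] / counts[level] for level in sums}
    # Levels with a smaller average response receive a smaller integer code.
    ordered = sorted(means, key=means.get)
    return {level: rank for rank, level in enumerate(ordered, start=1)}

# Toy data: a nominal feature and a numerical response.
colour = ["red", "blue", "green", "blue", "red", "green"]
price = [10.0, 2.0, 5.0, 4.0, 12.0, 7.0]

codes = order_levels_by_mean(colour, price)
encoded = [codes[c] for c in colour]
print(codes)    # {'blue': 1, 'green': 2, 'red': 3}
print(encoded)  # [3, 1, 2, 1, 3, 2]
```

After this encoding, the feature can be split like a numerical variable, which is why the runtime is comparable to that of the \(\texttt {ignore}\) option.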

Type::

character, scalar.

Default::

\(\texttt {1}\) (\(\texttt {ignore}\)).

Sensitivity::

unknown.

Heuristics::

none.

Range::

\(\texttt {respect.unordered.factors}\) \(\in \{\)ignore, order, partition \(\}\). The parameter \(\texttt {respect.unordered.factors}\) can also be understood as a binary value. Then \(\texttt {TRUE}\) corresponds to order and \(\texttt {FALSE}\) to ignore (Wright 2020).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1}\) (\(\texttt {ignore}\)); \(\texttt {upper = 2}\) (\(\texttt {order}\)).

Constraints::

none.

Interactions::

none are known.

 

In conclusion, Table 3.5 provides a brief survey of examples from the literature, where RF was tuned.

Table 3.5 RF: survey of examples from the literature for tuning of random forest

6 Gradient Boosting (xgboost)

6.1 Description

Boosting is an ensemble process. In contrast to random forests, see Sect. 3.5, the individual models (here: decision trees) are not created and evaluated at the same time, but rather sequentially. The basic idea is that each subsequent model tries to compensate for the weaknesses of the previous models.

For this purpose, a model is created repeatedly. The model is trained with weighted data. At the beginning, all observations receive the same weight. Data that are poorly predicted or recognized by the model are given larger weights in the next step and thus have a greater influence on the next model. All models generated in this way are combined as a linear combination to form an overall model (Freund and Schapire 1997; Drucker and Cortes 1995).

An intuitive description of this approach is slow learning, as no attempt is made to understand the entire data set in a single step; instead, the understanding is improved step by step (James et al. 2014). Gradient Boosting (GB) is a variant of this approach, with one crucial difference: instead of changing the weighting of the data, models are created sequentially that follow the gradient of a loss function. In the case of regression, the models learn with residuals of the sum of all previous models. Each individual model tries to reduce the weaknesses (here: residuals) of the ensemble (Friedman 2001).
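The residual-fitting idea for regression can be illustrated with a few lines of Python. The sketch below is plain gradient boosting with squared error loss and one-split trees (stumps) on a single feature; it is not XGBoost, which adds regularization and second-order information. All function names are hypothetical, and the names `nrounds` and `eta` are borrowed from the hyperparameters discussed in Sect. 6.2.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes the squared error of the residuals."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]  # (threshold, left mean, right mean)

def predict_stump(stump, xi):
    thr, lm, rm = stump
    return lm if xi <= thr else rm

def gradient_boost(x, y, nrounds, eta):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    pred = [sum(y) / len(y)] * len(y)  # start from the mean response
    for _ in range(nrounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + eta * predict_stump(stump, xi) for pi, xi in zip(pred, x)]
    return pred

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]

mse_start = mse(y, [sum(y) / len(y)] * len(y))
mse_few = mse(y, gradient_boost(x, y, nrounds=5, eta=0.3))
mse_many = mse(y, gradient_boost(x, y, nrounds=50, eta=0.3))
print(mse_start, mse_few, mse_many)
```

On this toy data set, the training error decreases with every additional round, which is the "slow learning" behavior described above. The training error alone says nothing about overfitting; that would require separate validation data.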

In the following, we consider the hyperparameters of one version of GB: XGBoost (Chen and Guestrin 2016). In principle, any models can be connected in ensembles via boosting. We apply XGBoost to decision trees. As a reference implementation, we refer to the R package xgboost (Chen et al. 2020). Brownlee (2018) describes some empirical hyperparameter values for tuning XGBoost.

6.2 Hyperparameters of Gradient Boosting

XGBoost Hyperparameter nrounds

The parameter \(\texttt {nrounds}\) specifies the number of boosting steps. Since a tree is created in each individual boosting step, \(\texttt {nrounds}\) also controls the number of trees that are integrated into the ensemble as a whole. Its practical meaning can be described as follows: larger values of \(\texttt {nrounds}\) mean a more complex and possibly more precise model, but also cause a longer running time. The practical meaning is therefore very similar to that of num.trees in random forests. In contrast to num.trees, overfitting is a risk with very large values, depending on other parameters such as \(\texttt {eta}\), \(\texttt {lambda}\), \(\texttt {alpha}\). For example, the empirical results of Friedman (2001) show that with a low \(\texttt {eta}\), even a high value of \(\texttt {nrounds}\) does not lead to overfitting.  

Type::

integer, scalar.

Default::

\(\texttt {0}\).

Sensitivity/robustness::

Similar to the random forests parameter num.trees, \(\texttt {nrounds}\) also has a higher sensitivity, especially with low values (Friedman 2001).

Heuristics::

Heuristics cannot be derived from the literature. Often values of several hundred to several thousand trees are set as the upper limit (Brownlee 2018).

Range::

\(\texttt {nrounds}\) \(\in [1, \infty [\). Only integer values are valid.

Transformation::

\(\texttt {trans\_2pow\_round}\).

Bounds::

\(\texttt {lower = 0; upper = 11}\).

Constraints::

none.

Interactions::

There is a connection between the hyperparameters \(\texttt {eta}\), \(\texttt {nrounds}\), and \(\texttt {subsample}\).
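The bounds of \(\texttt {nrounds}\) are given on the scale seen by the tuner, not on the natural scale of the parameter. The following Python functions are stand-ins for the transformations named in this chapter (\(\texttt {trans\_id}\), \(\texttt {trans\_2pow}\), \(\texttt {trans\_2pow\_round}\), \(\texttt {trans\_10pow}\)). Their bodies follow what the names suggest and are an assumption, not a copy of the R code in SPOTMisc.

```python
import math

def trans_id(x):
    """Identity: the tuner works directly on the natural scale."""
    return x

def trans_2pow(x):
    """The tuner proposes an exponent, the model receives 2^x."""
    return 2.0 ** x

def trans_2pow_round(x):
    """As trans_2pow, but rounded to an integer (e.g., for a number of trees)."""
    return int(round(2.0 ** x))

def trans_10pow(x):
    """The tuner proposes an exponent, the model receives 10^x."""
    return 10.0 ** x

# Bounds of nrounds on the tuner scale: lower = 0, upper = 11.
nrounds_lower = trans_2pow_round(0)
nrounds_upper = trans_2pow_round(11)
nrounds_mid = trans_2pow_round(5.5)
print(nrounds_lower, nrounds_mid, nrounds_upper)  # 1 45 2048

# A default stated as log2(0.3) corresponds to 0.3 on the natural scale.
eta_default = trans_2pow(math.log2(0.3))
print(round(eta_default, 6))  # 0.3
```

Under this reading, the bounds \(\texttt {lower = 0; upper = 11}\) cover 1 to 2048 boosting steps, and the default \(\texttt {0}\) corresponds to a single step.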

 

XGBoost Hyperparameter eta

The parameter \(\texttt {eta}\) is a learning rate and is also called “shrinkage” parameter. It controls the lowering of the weights in each boosting step (Chen and Guestrin 2016; Friedman 2002). It has the following practical meaning: lowering the weights helps to reduce the influence of individual trees on the ensemble. This can also avoid overfitting (Chen and Guestrin 2016).  

Type::

double, scalar.

Default::

\(\texttt {log2(0.3)}\).

Sensitivity::

Empirical results show that XGBoost is more sensitive to \(\texttt {eta}\) when \(\texttt {eta}\) is large (Friedman 2001). Generally speaking, smaller values are better. In an empirical study, Probst et al. (2018) describe \(\texttt {eta}\) as a parameter with comparatively high tunability.

Heuristics::

A heuristic is difficult to formulate due to the dependence on other parameters and the data situation, but Hastie et al. (2017) recommend

\(\ldots \) the best strategy appears to be to set \(\texttt {eta}\) to be very small (\(\texttt {eta}\) \(<0.1\)) and then choose \(\texttt {nrounds}\) by early stopping.

This may lead to correspondingly longer runtimes due to large \(\texttt {nrounds}\). Brownlee (2018) mentions a heuristic, which describes a search range depending on \(\texttt {nrounds}\).

Range::

\(\texttt {eta}\) \(\in [0, 1]\). Using a logarithmic scale seems reasonable (e.g., \(2^{-10}, \ldots , 2^0\)), as used in the studies by Probst et al. (2018) or Sigrist (2020), because values close to zero often show good results.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 0 }\).

Constraints::

none.

Interactions::

There is a connection between \(\texttt {eta}\) and \(\texttt {nrounds}\): If one of the two parameters is increased, the other should be decreased if the error is to remain the same (Friedman 2001; Probst et al. 2019a). This is also demonstrated by Hastie et al. (2017):

Smaller values of \(\texttt {eta}\) lead to larger values of \(\texttt {nrounds}\) for the same training risk, so that there is a trade-off between them.

In addition, Hastie et al. (2017) also point to interactions with the \(\texttt {subsample}\) parameter: In an empirical study, \(\texttt {subsample}\) = 1 and \(\texttt {eta}\) = 1 show significantly worse results than \(\texttt {subsample}\) = 0.5 and \(\texttt {eta}\) = 0.1. If \(\texttt {subsample}\) = 0.5 and \(\texttt {eta}\) = 1, the results are even worse than for \(\texttt {eta}\) = 1 and \(\texttt {subsample}\) = 1. In the best case (\(\texttt {subsample}\) = 0.5 and \(\texttt {eta}\) = 0.1), however, larger values of \(\texttt {nrounds}\) are required to achieve optimal results.
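The trade-off between \(\texttt {eta}\) and \(\texttt {nrounds}\) can be illustrated with a deliberately idealized calculation. We assume (this is not a result from the cited studies) that every boosting step fits the current residual perfectly, so that a step with learning rate \(\texttt {eta}\) removes the fraction \(\texttt {eta}\) of the remaining residual. After \(M\) steps, the fraction \((1-\texttt {eta})^M\) remains. The helper `rounds_needed` is hypothetical.

```python
import math

def rounds_needed(eta, tol):
    """Smallest number of rounds M with (1 - eta)^M <= tol (idealized model)."""
    if eta >= 1.0:
        return 1  # without shrinkage, one perfect step removes the whole residual
    return math.ceil(math.log(tol) / math.log(1.0 - eta))

for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_needed(eta, tol=0.01))
# 0.3 13
# 0.1 44
# 0.01 459
```

Even in this simplified setting, reducing \(\texttt {eta}\) by a factor of ten increases the number of rounds required for the same training error by roughly a factor of ten. Real base learners do not fit the residuals perfectly, so the numbers only indicate the order of magnitude.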

 

XGBoost Hyperparameter lambda

The parameter \(\texttt {lambda}\) is used for the regularization of the model. This parameter influences the complexity of the model (Chen and Guestrin 2016; Chen et al. 2020) (similar to the parameter of the same name in EN). Its practical significance can be described as follows: as a regularization parameter, \(\texttt {lambda}\) helps to prevent overfitting (Chen and Guestrin 2016). With larger values, smoother or simpler models are to be expected.  

Type::

double, scalar.

Default::

\(\texttt {0}\).

Sensitivity::

not known.

Heuristics::

none known.

Range::

\(\texttt {lambda}\) \(\in [0, \infty [\). A logarithmic scale seems to be useful, e.g., \( 2^{-10}, \dots , 2^{10}\), as used in the study by Probst et al. (2019a) to cover a wide range of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 10}\).

Constraints::

none.

Interactions::

Because both \(\texttt {lambda}\) and \(\texttt {alpha}\) control the regularization of the model, an interaction is likely.

 

XGBoost Hyperparameter alpha

The authors of the R package xgboost, Chen and Guestrin (2016), did not mention this parameter. The documentation of the reference implementation does not provide any detailed information on \(\texttt {alpha}\) either. Due to the description as a parameter for the \(l_1\) regularization of the weights (Chen et al. 2020), a highly similar use as for the parameter of the same name in elastic net is to be assumed. Its practical meaning can be described as follows: similar to \(\texttt {lambda}\), \(\texttt {alpha}\) also functions as a regularization parameter.  

Type::

double, scalar.

Default::

\(\texttt {-10}\).

Sensitivity::

unknown.

Heuristics::

No heuristics are known.

Range::

\(\texttt {alpha}\) \(\in [0, \infty [\). A logarithmic scale seems to be useful, e.g., \( 2^{-10}, \ldots , 2^{10}\), as used in the study by Probst et al. (2019a) to cover a wide range of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 10}\).

Constraints::

none.

Interactions::

Since both \(\texttt {lambda}\) and \(\texttt {alpha}\) control the regularization of the model, an interaction is likely.

 

XGBoost Hyperparameter subsample

In each boosting step, the new tree to be created is usually only trained on a subset of the entire data set, similar to random forest (Friedman 2002). The \(\texttt {subsample}\) parameter specifies the portion of the data that is randomly selected in each iteration. Its practical significance can be described as follows: an obvious effect of small \(\texttt {subsample}\) values is a shorter running time for the training of individual trees, which is proportional to \(\texttt {subsample}\)  (Hastie et al. 2017).  

Type::

double, scalar.

Default::

\(\texttt {1}\).

Sensitivity::

The study by Friedman (2002) shows a high sensitivity for very small or large values of \(\texttt {subsample}\). In a relatively large range of values of \(\texttt {subsample}\) (around 0.3 to 0.6), however, hardly any differences in model quality are observed.

Determination heuristics::

Hastie et al. (2017) suggest \(\texttt {subsample}\) = 0.5 as a good starting value, but point out that this value can be reduced if \(\texttt {nrounds}\) increases. With many trees (\(\texttt {nrounds}\) is large) it is sufficient if each individual tree sees a smaller part of the data, since the unseen data is more likely to be taken into account in other trees.

Range::

\(\texttt {subsample}\) \(\in ]0,1]\). Based on the empirical results (Friedman 2002; Hastie et al. 2017), a logarithmic scale is not recommended.

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 0.1; upper = 1}\).

Constraints::

none.

Interactions::

There is a connection between the \(\texttt {eta}\), \(\texttt {nrounds}\), and \(\texttt {subsample}\).

 

XGBoost Hyperparameter colsample_bytree

The parameter \(\texttt {colsample\_bytree}\) has similarities to the mtry parameter in random forests. Here, too, a random number of features is chosen for the splits of a tree. In XGBoost, however, this choice is made only once for each tree that is created, instead of for each split (xgboost developers 2020). Here \(\texttt {colsample\_bytree}\) is a relative factor. The number of selected features is therefore \(\texttt {colsample\_bytree}\) \(\times n\), where \(n\) is the number of features. Its practical meaning is similar to mtry: \(\texttt {colsample\_bytree}\) enables the trees of the ensemble to have a greater diversity. The runtime is also reduced, since a smaller number of splits has to be checked each time (if \(\texttt {colsample\_bytree}\) \(< 1\)).  
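The per-tree feature selection can be sketched as follows. The function `features_for_tree` is a hypothetical helper; in particular, the rounding rule and the guard that keeps at least one feature are assumptions made for this illustration and are not taken from the xgboost source.

```python
import random

def features_for_tree(n_features, colsample_bytree, rng):
    """Draw the feature subset once; every split of this tree is restricted to it."""
    n_selected = max(1, int(round(colsample_bytree * n_features)))
    return sorted(rng.sample(range(n_features), n_selected))

rng = random.Random(42)
n_features = 10

# Three trees, each with its own random subset of 3 out of 10 features.
per_tree = [features_for_tree(n_features, 0.3, rng) for _ in range(3)]
for subset in per_tree:
    print(subset)

# The smallest useful value, 1 / n_features, leaves exactly one feature per tree.
smallest = features_for_tree(n_features, 1 / n_features, rng)
print(smallest)
```

The last call motivates a lower bound of \(\texttt {1/nFeatures}\): smaller values would not leave any feature to split on.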

Type::

double, scalar.

Default::

\(\texttt {1}\).

Sensitivity::

The empirical study by Probst et al. (2019a) shows that the model is particularly sensitive to changes for \(\texttt {colsample\_bytree}\) values close to 1. However, this sensitivity decreases in the vicinity of more suitable values.

Heuristics::

none known.

Range::

\(\texttt {colsample\_bytree}\) \(\in ]0,1]\). Brownlee (2018) mentions search ranges such as \(\texttt {colsample\_bytree}\) = 0.4, 0.6, 0.8, 1, but mostly works with \(\texttt {colsample\_bytree}\) = \(0.1, 0.2, \ldots , 1\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1/nFeatures; upper = 1}\).

Constraints::

none.

Interactions::

none known.

 

XGBoost Hyperparameter gamma

This parameter of a single decision tree is very similar to the parameter \(\texttt {cp}\): Like \(\texttt {cp}\), \(\texttt {gamma}\) controls the number of splits of a tree by requiring a minimal improvement for each split. According to the documentation (Chen et al. 2020):

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.

The main difference between \(\texttt {cp}\) and \(\texttt {gamma}\) is the definition of \(\texttt {cp}\) as a relative factor, while \(\texttt {gamma}\) is defined as an absolute value. This also means that the ranges differ.  

Default::

\(\texttt {-10}\).

Range::

\(\texttt {gamma}\) \( \in [0, \infty [\). A logarithmic scale seems to make sense, e.g., \(2^{-10}, \ldots , 2^{10}\), as, e.g., in the study by Thomas et al. (2018) to cover a wide range of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 10}\).

 

XGBoost Hyperparameter maxdepth

This parameter of a single decision tree is already known as \(\texttt {maxdepth}\).  

Default::

\(\texttt {6}\).

Sensitivity/heuristics::

Hastie et al. (2017) state:

Although in many applications \(J = 2\) will be insufficient, it is unlikely that \(J > 10\) will be required. Experience so far indicates that \( 4 \le J \le 8\) works well in the context of boosting, with results being fairly insensitive to particular choices in this range. [Footnote 4]

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1; upper = 15}\).

 

XGBoost Hyperparameter min_child_weight

Like \(\texttt {gamma}\) and \(\texttt {maxdepth}\), \(\texttt {min\_child\_weight}\) restricts the number of splits of each tree. In the case of \(\texttt {min\_child\_weight}\), this restriction is determined using the Hessian matrix of the loss function (summed over all observations in each new terminal node) (Chen et al. 2020; Sigrist 2020). In experiments by Sigrist (2020), this parameter turns out to be comparatively difficult to tune: the results show that tuning with \(\texttt {min\_child\_weight}\) gives worse results than tuning with a similar parameter (limitation of the number of samples per leaf) (Sigrist 2020).  
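The summed Hessian can be illustrated for two common loss functions. The sketch below uses the squared error loss \(\frac{1}{2}(y-\hat{y})^2\), whose second derivative with respect to the prediction is 1, and the logistic loss, whose second derivative is \(p(1-p)\) with \(p\) being the predicted probability. The function names are hypothetical and the code does not reproduce xgboost internals.

```python
import math

def hessian_squared_error(prediction):
    """Second derivative of 0.5 * (y - prediction)^2 with respect to prediction."""
    return 1.0

def hessian_logistic(margin):
    """Second derivative of the logistic loss with respect to the raw score (margin)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)

def child_weight(hessians):
    """Quantity that is compared against min_child_weight for a candidate node."""
    return sum(hessians)

# A candidate leaf with five observations.
leaf_squared = child_weight(hessian_squared_error(0.0) for _ in range(5))
leaf_logistic_uncertain = child_weight(hessian_logistic(0.0) for _ in range(5))
leaf_logistic_confident = child_weight(hessian_logistic(4.0) for _ in range(5))

print(leaf_squared)             # 5.0
print(leaf_logistic_uncertain)  # 1.25
print(leaf_logistic_confident)
```

For squared error, the summed Hessian is simply the number of observations in the node. For the logistic loss, it depends on the current predictions: a leaf whose observations are already predicted with high confidence has a small weight, even if it contains many observations. This dependence may be one reason why the parameter is harder to tune than a plain limit on the number of samples per leaf.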

Type::

double, scalar.

Default::

\(\texttt {0}\).

Sensitivity::

unknown.

Heuristics::

none known.

Range::

\(\texttt {min\_child\_weight}\) \(\in [0, \infty [\). A logarithmic scale seems to make sense, e.g., \(2^{-10}, \ldots , 2^{10}\), as used in the study by Probst et al. (2019a) to cover a wide range of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = 0; upper = 7}\).

Constraints::

none.

Interactions::

Interactions with parameters such as \(\texttt {gamma}\) and \(\texttt {maxdepth}\) are probable, since all three parameters influence the complexity of the individual trees in the ensemble.

 

Table 3.6 shows XGBoost example parameter settings from the literature.

Table 3.6 Survey: examples from literature about XGBoost tuning

7 Support Vector Machines

7.1 Description

The SVM is a kernel-based method.

Definition 3.1

(Kernel) A kernel is a real-valued, symmetrical function \(\text {k}({x},{x} ')\) (usually positive definite), which often expresses some form of similarity between two observations \({x}, {x} '\).

The usefulness of kernels can be explained by the kernel trick. The kernel trick describes the ability of kernels to transfer data into a higher dimensional feature space. This allows classification with linear decision boundaries (hyperplanes) even in cases where the data in the original feature space are not linearly separable (Schölkopf and Smola 2001).

As reference implementation, we use the R package e1071 [Footnote 5] (Meyer et al. 2020), which is based on libsvm (Chang and Lin 2011).

7.2 Hyperparameters of the SVM

SVM Hyperparameter kernel

The parameter \(\texttt {kernel}\) is central for the SVM model. It describes the choice of the function \(\text {k}({x},{x} ')\). In practice, \(\text {k}({x},{x} ')\) can often be understood to be a measure of similarity. That is, the kernel function describes how similar two observations are to each other, depending on their feature values.  

Type::

character, scalar.

Default::

\(\texttt {1}\) (\(\texttt {radial}\)).

Sensitivity::

The empirical investigation of Probst et al. (2019a) shows: “In svm the biggest gain in performance can be achieved by tuning the kernel, \(\texttt {gamma}\) or \(\texttt {degree}\), while the cost parameter does not seem to be very tunable.” This does not necessarily mean that \(\texttt {cost}\) should not be tuned, as the tunability investigated by Probst et al. (2019a) always considers a reference value (e.g., the default).

Heuristics::

Informally, it is often recommended to use \(\texttt {kernel}\) = radial basis. This also matches well with results and observations from the literature (Probst et al. 2019a; Guenther and Schonlau 2016). With very large numbers of observations and/or features, Hsu et al. (2016) suggest using \(\texttt {kernel}\) = linear. These are not infallible rules; other kernels may perform better depending on the data set. This stresses the necessity of using hyperparameter tuning to choose kernels.

Range::

\(\bullet \) linear: \(\text {k}({x},{x} ') ={x} ^\text {T} {x} '\)

\(\bullet \) polynomial: \(\text {k}({x},{x} ') =(\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})^\texttt {degree} \)

\(\bullet \) radial basis: \(\text {k}({x},{x} ') =\exp (-\texttt {gamma} ~|| {x}-{x} '||^2)\)

\(\bullet \) sigmoid: \(\text {k}({x},{x} ') =\tanh (\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = 1 (radial); upper = 2 (sigmoid)}\).

Constraints::

none.

Interactions::

The kernel functions themselves have parameters (\(\texttt {degree}\), \(\texttt {gamma}\), and \(\texttt {coef0}\)), whose values only matter if the respective function is chosen.
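The four kernel functions listed under Range translate directly into code. The following Python sketch implements the formulas as stated; it is meant for illustration and is not the code of e1071 or libsvm. The function names are hypothetical.

```python
import math

def dot(x, z):
    """Scalar product x^T z of two feature vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, gamma, coef0, degree):
    return (gamma * dot(x, z) + coef0) ** degree

def radial_kernel(x, z, gamma):
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma, coef0):
    return math.tanh(gamma * dot(x, z) + coef0)

x = [1.0, 2.0]
z = [3.0, 0.5]

k_linear = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z, gamma=0.5, coef0=1.0, degree=2)
k_radial = radial_kernel(x, z, gamma=0.5)
k_sigmoid = sigmoid_kernel(x, z, gamma=0.5, coef0=0.0)

print(k_linear)  # 4.0
print(k_poly)    # 9.0
print(k_radial, k_sigmoid)
```

The signatures make the conditional structure visible: \(\texttt {degree}\) only appears in the polynomial kernel, \(\texttt {coef0}\) in the polynomial and sigmoid kernels, and \(\texttt {gamma}\) in all kernels except the linear one.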

 

SVM Hyperparameter degree

The parameter \(\texttt {degree}\) influences the kernel function if a polynomial kernel was selected:

  • polynomial: \(\text {k}({x},{x} ') =(\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})^\texttt {degree} \).

Integer values of \(\texttt {degree}\) determine the degree of the polynomial. Non-integer values are possible, even though not leading to a polynomial in the classical sense. If \(\texttt {degree}\) has a value close to one, the polynomial kernel approximates the linear kernel. Else, the kernel becomes correspondingly nonlinear.  

Type::

double, scalar.

Default::

not implemented, because parameter is not tuned.

Sensitivity::

The empirical investigation of Probst et al. (2019a) shows “In svm the biggest gain in performance can be achieved by tuning the kernel, \(\texttt {gamma}\) or \(\texttt {degree}\), while the cost parameter does not seem to be very tunable.”

Heuristics::

none are known.

Range::

\(\texttt {degree}\) \(\in (0,\infty )\).

Transformation::

not implemented, because parameter is not tuned.

Bounds::

not implemented, because parameter is not tuned.

Constraints::

none.

Interactions::

The parameter only has an impact if \(\texttt {kernel}\) = \(\texttt {polynomial}\).

 

SVM Hyperparameter gamma

The parameter \(\texttt {gamma}\) influences three kernel functions:

  • polynomial: \(\text {k}({x},{x} ') =(\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})^\texttt {degree} \).

  • radial basis: \(\text {k}({x},{x} ') =\exp (-\texttt {gamma} ~|| {x}-{x} '||^2)\).

  • sigmoid: \(\text {k}({x},{x} ') =\tanh (\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})\).

In case of polynomial and sigmoid, \(\texttt {gamma}\) acts as a multiplier for the scalar product of two feature vectors. For radial basis, \(\texttt {gamma}\) acts as a multiplier for the squared distance of two feature vectors.

In practice, \(\texttt {gamma}\) scales how far the impact of a single data sample reaches in terms of influencing the model. With small \(\texttt {gamma}\) values, an individual observation may potentially influence the prediction in a larger vicinity, since with increasing distance between \({x}\) and \({x} '\), their similarity will decrease more slowly (esp. with \(\texttt {kernel}\) = radial basis).  
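For the radial basis kernel, the reach of an observation can be quantified. Setting \(\exp (-\texttt {gamma} ~d^2) = 0.5\) and solving for the distance \(d\) gives \(d = \sqrt{\ln (2)/\texttt {gamma}}\), i.e., the distance at which the similarity has dropped to one half. The Python sketch below evaluates this expression; the function names are hypothetical.

```python
import math

def rbf_similarity(distance, gamma):
    """Radial basis kernel value for two observations at the given distance."""
    return math.exp(-gamma * distance ** 2)

def half_similarity_distance(gamma):
    """Distance at which the radial basis kernel value equals 0.5."""
    return math.sqrt(math.log(2.0) / gamma)

for exponent in (-10, 0, 10):
    gamma = 2.0 ** exponent
    d_half = half_similarity_distance(gamma)
    print(exponent, d_half, rbf_similarity(d_half, gamma))
```

Because the distance scales with \(1/\sqrt{\texttt {gamma}}\), a search range from \(2^{-10}\) to \(2^{10}\) changes the reach of a single observation by a factor of \(2^{10} = 1024\).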

Type::

double, scalar.

Default::

\(\texttt {log(1/nFeatures,2)}\).

Sensitivity::

The empirical investigation of van Rijn and Hutter (2018) shows that \(\texttt {gamma}\) is rather sensitive.

Heuristics::

The reference implementation uses a simple heuristic to determine \(\texttt {gamma}\): \(\texttt {gamma}\) \(=1/n\), where \(n\) is the number of features (Meyer et al. 2020). Another implementation (the sigest function in kernlab [Footnote 6]) first scales all input data, so that each feature has zero mean and unit variance. Afterward, a good interval for \(\texttt {gamma}\) is determined by using the 10 and 90% quantile of the distances between the scaled data samples. By default, 50% randomly chosen samples from the input data are used.

Range::

\(\texttt {gamma}\) \(\in [0,\infty )\). Using a logarithmic scale seems reasonable (e.g., \(2^{-10}, \ldots , 2^{10}\) as used by Probst et al. 2019a), to cover a broad spectrum of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 10}\).

Constraints::

none.

Interactions::

This parameter has no effect when \(\texttt {kernel}\) = linear. In addition, empirical results show a clear interaction with \(\texttt {cost}\)  (van Rijn and Hutter 2018).

 

SVM Hyperparameter coef0

The parameter \(\texttt {coef0}\) influences two kernel functions:

  • polynomial: \(\text {k}({x},{x} ') =(\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})^\texttt {degree} \).

  • sigmoid: \(\text {k}({x},{x} ') =\tanh (\texttt {gamma} ~{x} ^\text {T} {x} ' + \texttt {coef0})\).

In both cases, \(\texttt {coef0}\) is added to the scalar product of two feature vectors.  

Type::

double, scalar.

Default::

\(\texttt {0}\).

Sensitivity::

Empirical results of Zhou et al. (2011) show that \(\texttt {coef0}\) has a strong impact in case of the polynomial kernel (but only for \(\texttt {degree}\) = 2).

Heuristics::

Guenther and Schonlau (2016) suggest leaving this parameter at \(\texttt {coef0}\) = 0.

Range::

\(\texttt {coef0}\) \(\in \mathbb {R}\).

Transformation::

\(\texttt {trans\_id}\).

Bounds::

\(\texttt {lower = -1; upper = 1}\).

Constraints::

none.

Interactions::

This parameter is only active if \(\texttt {kernel}\) = polynomial or \(\texttt {kernel}\) = sigmoid.

 

SVM Hyperparameter cost

The parameter \(\texttt {cost}\) (often written as C) is a constant that weighs constraint violations of the model. C is a typical regularization parameter, which controls the complexity of the model (Cherkassky and Ma 2004), and may help to avoid overfitting or to deal with noisy data. [Footnote 7]  

Type::

double, scalar.

Default::

\(\texttt {0}\).

Sensitivity::

The empirical results of van Rijn and Hutter (2018) show that \(\texttt {cost}\) has a strong impact on the model, while the investigation of Probst et al. (2019a) determines only a minor tunability. This disagreement may be explained, since \(\texttt {cost}\) may have a huge impact in extreme cases, yet good parameter values are found close to the default values.

Heuristics::

Cherkassky and Ma (2004) suggest the following: \(\texttt {cost}\) \(= \max (|\bar{{y}}+3\sigma _{y} |,|\bar{{y}}-3\sigma _{y} |)\). Here, \(\bar{{y}}\) is the mean of the observed \({y}\) values in the training data, and \(\sigma _{y} \) is the standard deviation. They justify this heuristic, by pointing out a connection between \(\texttt {cost}\) and the predicted \({y}\): as a constraint, \(\texttt {cost}\) limits the output values of the SVM model (regression) and should hence be set in a similar order of magnitude as the observed \({y}\)  (Cherkassky and Ma 2004).

Range::

\(\texttt {cost}\) \(\in [0,\infty )\). Using a logarithmic scale seems reasonable (e.g., \(2^{-10}, \ldots , 2^{10}\) as used by Probst et al. (2019a)), to cover a broad spectrum of very small and very large values.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = -10; upper = 10}\).

Constraints::

none.

Interactions::

Empirical results show a clear interaction with \(\texttt {gamma}\)  (van Rijn and Hutter 2018).
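The heuristic of Cherkassky and Ma (2004) for \(\texttt {cost}\), stated under Heuristics above, only requires the mean and the standard deviation of the observed responses. The following Python sketch evaluates it for a toy sample. The function name is hypothetical, and the use of the sample standard deviation (divisor \(n-1\)) is an assumption.

```python
import math
import statistics

def cost_heuristic(y):
    """cost = max(|mean(y) + 3 sd(y)|, |mean(y) - 3 sd(y)|)."""
    y_mean = statistics.mean(y)
    y_sd = statistics.stdev(y)  # sample standard deviation (assumption)
    return max(abs(y_mean + 3 * y_sd), abs(y_mean - 3 * y_sd))

y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
cost = cost_heuristic(y_train)

print(cost)             # about 7.74
print(math.log2(cost))  # the same value on a log2 scale, about 2.95
```

If \(\texttt {cost}\) is tuned on a \(\log _2\) scale, as with \(\texttt {trans\_2pow}\), the heuristic value has to be converted with \(\log _2\) before it can be used as a starting point.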

 

SVM Hyperparameter epsilon

The parameter \(\texttt {epsilon}\) defines a corridor or “ribbon” around predictions. Residuals within that ribbon are tolerated by the model, i.e., are not penalized (Schölkopf and Smola 2001). The parameter is only used for regression with SVM, not for classification. In the experiments in Sect. 12.1, \(\texttt {epsilon}\) is only considered when SVM is used for regression.

Similar to \(\texttt {cost}\), \(\texttt {epsilon}\) is a regularization parameter. With larger values, \(\texttt {epsilon}\) allows for larger errors/residuals. This reduces the number of support vectors (and incidentally, also the runtime). The model becomes smoother (cf. Schölkopf and Smola 2001, Fig. 9.4). This can be useful, e.g., to deal with noisy data and avoid overfitting. However, the model quality may be decreased.  
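The tolerance ribbon corresponds to the \(\varepsilon \)-insensitive loss, which is zero for residuals with an absolute value of at most \(\texttt {epsilon}\) and grows linearly outside the ribbon. The Python sketch below counts how many residuals of a toy example are penalized for different values of \(\texttt {epsilon}\). Function names and data are hypothetical.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Zero inside the ribbon, linear in the excess outside of it."""
    return max(0.0, abs(residual) - epsilon)

def n_penalized(residuals, epsilon):
    """Number of residuals that lie strictly outside the ribbon."""
    return sum(1 for r in residuals if epsilon_insensitive_loss(r, epsilon) > 0.0)

residuals = [0.05, -0.2, 0.5, -0.08, 1.2]
counts = {eps: n_penalized(residuals, eps) for eps in (0.01, 0.1, 1.0)}
print(counts)  # {0.01: 5, 0.1: 3, 1.0: 1}
```

Observations that lie strictly inside the ribbon do not contribute to the solution, which is why a wider ribbon tends to reduce the number of support vectors.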

Type::

double, scalar.

Default::

\(\texttt {-1}\).

Sensitivity::

As described above, \(\texttt {epsilon}\) has a significant impact on the model.

Heuristics::

For SVM regression, Cherkassky and Ma (2004) suggest, based on simplified assumptions and empirical results: \(\texttt {epsilon}\) \(= 3 \sigma \sqrt{\frac{\ln (n)}{n}}\). Here, \(n\) is the number of training observations and \(\sigma ^2\) is the noise variance, which has to be estimated from the data, see, e.g., Eqs. (22), (23), and (24) in Cherkassky and Ma (2004). The noise variance is the remaining variance of the observations \({y}\), which cannot be explained by an ideal model. This ideal model has to be approximated with the nearest neighbor model (Cherkassky and Ma 2004), resulting in additional computational effort.

Range::

\(\texttt {epsilon}\) \(\in (0,\infty )\).

Transformation::

\(\texttt {trans\_10pow}\).

Bounds::

\(\texttt {lower = -8; upper = 0}\).

Constraints::

none.

Interactions::

none are known.
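Once an estimate of the noise standard deviation \(\sigma \) is available, the heuristic of Cherkassky and Ma (2004) for \(\texttt {epsilon}\) is a one-line computation. The Python sketch below assumes that \(\sigma \) is already given; the estimation via a nearest neighbor model is not shown. The function name and the numerical values are hypothetical.

```python
import math

def epsilon_heuristic(noise_sd, n_obs):
    """epsilon = 3 * sigma * sqrt(ln(n) / n) for n training observations."""
    return 3.0 * noise_sd * math.sqrt(math.log(n_obs) / n_obs)

eps_small_sample = epsilon_heuristic(noise_sd=0.5, n_obs=100)
eps_large_sample = epsilon_heuristic(noise_sd=0.5, n_obs=10_000)

print(eps_small_sample, math.log10(eps_small_sample))
print(eps_large_sample, math.log10(eps_large_sample))
```

For a fixed noise level, the suggested ribbon becomes narrower as the number of observations grows. The second printed column shows the value on a \(\log _{10}\) scale, which is the scale used when \(\texttt {epsilon}\) is tuned with \(\texttt {trans\_10pow}\).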

 

In conclusion, Table 3.7 provides a brief survey of examples from the literature, where SVM was tuned.

Table 3.7 Survey of examples from the literature, for tuning of SVM

8 Deep Neural Networks

8.1 Description

While DL describes the methodology, Deep Neural Networks (DNNs) are the models used in DL. DL models require the specification of a set of architecture-level parameters, which are important hyperparameters. Hyperparameters in DL are optimized in the outer loop of the hyperparameter tuning process. They are to be distinguished from the parameters of the DL method that are optimized in the inner loop, e.g., during the training phase of a Neural Network (NN) via backpropagation. Hyperparameter values are determined before the model is executed: they remain constant during model building and execution, whereas parameters are modified. Selecting the method for the parameter optimization is a typical Hyperparameter Tuning (HPT) task. Available optimization methods such as ADAptive Moment estimation algorithm (ADAM) are described in Sect. 3.8.2.

Typical questions regarding hyperparameters in DL models are as follows:

  1. How many layers should be combined?

  2. Which dropout rate prevents overfitting?

  3. How many filters (units) should be used in each layer?

Several empirical studies and benchmarking suites are available, see Sect. 6.2. But to date, there is no comprehensive theory that adequately explains how to answer these questions. Recently, Roberts et al. (2021) presented a first attempt to develop a DL theory.

Besides the hyperparameters discussed in this section, there are additional parameters used to define weight initialization schemes or regularization penalties. Furthermore, it should be noted that hyperparameters in DL methods can be conditionally dependent (this is also true for ML), e.g., on the number of layers as shown in the following example:

Example: Conditionally Dependent Hyperparameters

Mendoza et al. (2019) consider, besides general NN hyperparameters (e.g., batch size, number of layers, learning rate, dropout output rate, and optimizer), hyperparameters conditioned on the solver type (e.g., \(\beta _1\) and \(\beta _2\)), hyperparameters conditioned on the learning-rate policy, and per-layer hyperparameters (e.g., activation function, number of units). For practical reasons, Mendoza et al. (2019) constrained the number of layers to the range between one and six: first, they aimed to keep the training time of a single configuration low; second, each layer adds eight per-layer hyperparameters to the configuration space, so that allowing additional layers would further complicate the configuration process.
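The idea of conditional dependence can be illustrated with a minimal sketch. It is not the configuration space of Mendoza et al. (2019): the function name, the value ranges, and the restriction to two per-layer hyperparameters (instead of eight) are assumptions made for brevity.

```python
import random

MAX_LAYERS = 6  # upper limit on the number of layers, as in the example above


def sample_configuration(rng):
    """Draw one configuration; conditional entries exist only when they are active."""
    config = {
        "layers": rng.randint(1, MAX_LAYERS),
        "optimizer": rng.choice(["sgd", "adam"]),
    }
    # beta_1 and beta_2 only exist if the ADAM solver was selected.
    if config["optimizer"] == "adam":
        config["beta_1"] = rng.uniform(0.8, 0.99)    # illustrative range (assumption)
        config["beta_2"] = rng.uniform(0.9, 0.999)   # illustrative range (assumption)
    # Per-layer entries only exist for layers that are actually present.
    for i in range(1, config["layers"] + 1):
        config[f"units_{i}"] = rng.choice([16, 32, 64])
        config[f"activation_{i}"] = rng.choice(["relu", "tanh"])
    return config


config = sample_configuration(random.Random(1))
print(len(config), "active hyperparameters:", sorted(config))
```

The dimension of the active search space therefore changes with the sampled values of the parent hyperparameters, which is what complicates the configuration process when more layers are allowed.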

8.2 Hyperparameters of Deep Neural Networks

DL Hyperparameter layers

The parameter \(\texttt {layers}\) determines the number of layers of the NN. Only the number of hidden layers is affected, because input and output layers are basic elements of every NN. Larger values mean more complex models, which correspondingly also have more model coefficients, a higher runtime, but possibly also a higher model quality. There is also an increased risk of overfitting if no regularization measures are implemented and no methods such as early stopping are used (Prechelt 2012).

Type::

integer.

Default::

\(\texttt {1}\).

Sensitivity::

The influence of \(\texttt {layers}\) can be extreme. By varying this value, extremely simple (no hidden layer or only one hidden layer with very few neurons) or extremely complex models (thousands of layers and neurons) can be generated. Moreover, the study of Li et al. (2018) shows that network depth has a strong influence on weight optimization. The functional relationship between weights and model quality becomes increasingly nonlinear as network depth increases and contains more local optima. Thus, the difficulty of the weight optimization problem increases. At the same time, this difficulty decreases when more neurons are used per layer (Li et al. 2018). Also, “skip connections” (connections in the network that skip layers) can help reduce the difficulty.

Heuristics::

We are not aware of any quantitative heuristics. Bengio (2012) recommends choosing the number of layers as large as possible, considering the impact on computational resources. Larger networks exhibit better model performance as long as appropriate regularization procedures are applied (Bengio 2012).

Range::

\(\texttt {layers} \in [1,\infty )\). Only integer values are valid.

Transformation::

\(\texttt {identity}\).

Bounds::

\(\texttt {lower = 1; upper = 4}\).

Constraints::

none.

Interactions::

An interaction of \(\texttt {units}\) and \(\texttt {dropout}\) with \(\texttt {layers}\) is expected. These parameters together determine the total number of nodes in the network. This is also shown by the example of Srivastava et al. (2014).

 

DL Hyperparameter units

The parameter \(\texttt {units}\) determines the size of the corresponding network layer (number of neurons in the layer). Only the hidden layers are affected, because the dimension of the input and output layers is pre-determined, i.e., the number of units of the input layer depends on the dimensionality of the data and the number of units of the output layer depends on the task (e.g., binary and multi-class classification or regression). Similar to \(\texttt {layers}\), larger values mean more complex models, which correspondingly also have more model coefficients, a higher runtime, but possibly also a higher model quality. There is also an increased risk of overfitting if no regularization measures are taken and no methods such as early stopping are used (Prechelt 2012).

Type::

integer, vector.

Default::

\(\texttt {5}\).

Sensitivity::

The influence of \(\texttt {units}\) can be extreme. By varying this vector, extremely simple (no hidden layer or only one hidden layer with very few neurons) or extremely complex models (thousands of layers and neurons) can be generated. Moreover, the study of Li et al. (2018) shows that network depth has a strong influence on weight optimization. The functional relationship between weights and model quality becomes increasingly nonlinear as network depth increases and contains more local optima. Thus, the difficulty of the weight optimization problem increases. At the same time, this difficulty decreases when more neurons are used per layer (Li et al. 2018). Also, “skip connections” (connections in the network that skip layers) can help reduce the difficulty.

Heuristics::

We are not aware of any quantitative heuristics. Larger networks exhibit better model performance as long as appropriate regularization procedures are applied (Bengio 2012). In addition, empirical results suggest choosing a first hidden layer that has more neurons than the input layer, i.e., the first element of \(\texttt {units}\) should be larger than n (Bengio 2012).

Range::

\(\texttt {units} _i \in [1,\infty )\), with \(i = \{1,2,\ldots ,\infty \}\). Only integer values are valid.

Transformation::

\(\texttt {trans\_2pow}\).

Bounds::

\(\texttt {lower = 0; upper = 5}\)

Constraints::

none.

Interactions::

An interaction of \(\texttt {layers}\) and \(\texttt {dropout}\) with \(\texttt {units}\) is expected. These parameters together determine the total number of nodes in the network. This is also shown by the example of Srivastava et al. (2014).
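The effect of the transformation on the bounds can be made concrete with a small sketch. It assumes that the transformation maps a value x to 2^x and that the bounds and the default are specified on the untransformed scale; the Python function below only mimics this assumed behaviour and is not the R implementation.

```python
def trans_2pow(x):
    """Assumed behaviour of the transformation: map x to 2**x."""
    return 2 ** x


lower, upper = 0, 5  # bounds for units on the untransformed scale

# Values the tuner can actually pass to the network for the first hidden layer.
candidate_units = [trans_2pow(x) for x in range(lower, upper + 1)]
print(candidate_units)  # [1, 2, 4, 8, 16, 32]
```

Under these assumptions, the tuner searches a small integer interval while the network sees layer sizes on a logarithmic grid, and the default value 5 corresponds to 32 neurons.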

 

DL Hyperparameter activation

The parameter \(\texttt {activation}\) specifies the activation function of the network nodes (neurons). In tensorflow, this parameter is often specified for each layer. This function decides how the input values of each node are translated into an output value.

The choice of activation function can have a strong impact on model performance. Among other things, \(\texttt {activation}\) influences an essential property of the network: the ability to approximate nonlinear functions. Only nonlinear activation functions allow this (Goldberg 2016).  

Type::

character/function, vector. Standard activation functions can be selected via their name, else custom functions can be implemented in tensorflow or keras.

Default::

\(\texttt {relu}\) (parameter is not tuned).

Sensitivity::

unknown.

Heuristics::

A heuristic is not known. A popular choice is \(\texttt {activation}\) = relu (Bengio 2012). However, \(\texttt {activation}\) = tanh also shows success (LeCun et al. 2012). The choice of activation function is often empirically justified (Goldberg 2016), based on empirical data or empirical research for a specific problem. This underscores the need to tune this parameter.

Range::

\(\texttt {activation}\) \(\in \) {tanh, sigmoid, relu, linear, swish, ...}.

Transformation::

not implemented, because parameter is not tuned.

Bounds::

not implemented, because parameter is not tuned.

Constraints::

As a soft constraint, the choice of activation function may affect whether or not GPU-acceleration can be used in tensorflow. That is, some activation functions cannot be used if GPU support is required.

Interactions::

not known.
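Why only nonlinear activation functions allow the approximation of nonlinear functions can be seen in a minimal sketch: two layers with a linear (identity) activation collapse into a single linear layer, whereas a relu between them does not. The weights and the input are arbitrary illustrative values, bias terms are omitted, and the helper functions are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    """Apply max(0, v) to every entry of a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


W1 = [[1.0, -2.0], [0.5, 1.0]]  # weights of the hidden layer (2 inputs, 2 units)
W2 = [[2.0], [-1.0]]            # weights of the output layer (2 units, 1 output)
x = [[1.0, 0.0]]                # one observation with two features

two_linear_layers = matmul(matmul(x, W1), W2)  # identity activation
one_linear_layer = matmul(x, matmul(W1, W2))   # equivalent single layer
with_relu = matmul(relu(matmul(x, W1)), W2)    # nonlinear hidden layer

print(two_linear_layers, one_linear_layer, with_relu)
```

With the identity activation, adding layers does not extend the class of functions the network can represent; the relu variant produces a different output because the negative hidden value is cut off.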

 

DL Hyperparameter dropout

Dropout is a commonly used regularization technique for DNNs: some percentage of the layer’s output features will be randomly set to zero (“dropped out”) during training, i.e., dropout refers to the random removal of nodes (units) in the network (Chollet and Allaire 2018; Srivastava et al. 2014). The parameter \(\texttt {dropout}\) (often also denoted p (Srivastava et al. 2014)) is the probability that any node will be removed. Removing nodes randomly helps to avoid overfitting; \(\texttt {dropout}\) thus acts as a regularization measure (Srivastava et al. 2014). In tensorflow, this parameter is often specified for each layer.

Type::

double, vector.

Default::

\(\texttt {0}\).

Sensitivity::

A NN model’s quality can be very sensitive to \(\texttt {dropout}\). In an example, Srivastava et al. (2014) show that at a constant number of hidden nodes (network structure remains unchanged) the model error on test data for values between \(\texttt {dropout}\) = 0.4 and \(\texttt {dropout}\) = 0.6 is approximately constant. However, the model error increases for larger and smaller values of \(\texttt {dropout}\).

Heuristics::

none known.

Range::

\(\texttt {dropout}\) \(\in [0,1)\). A value of 0 (the default) disables dropout.

Transformation::

\(\texttt {identity}\).

Bounds::

\(\texttt {lower = 0; upper = 0.4}\).

Constraints::

none.

Interactions::

An interaction of \(\texttt {dropout}\) with \(\texttt {units}\) and \(\texttt {layers}\) is expected. These parameters together determine the total number of nodes in the network. This is also illustrated in the example of Srivastava et al. (2014).
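The mechanism can be sketched in a few lines. The function below is a hypothetical illustration, not the tensorflow implementation; the rescaling of the surviving outputs by 1/(1 - rate) is an assumption that follows the common "inverted dropout" convention and is not stated in the text above.

```python
import random


def dropout(outputs, rate, rng):
    """Set each value to zero with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be a probability")
    keep = 1.0 - rate
    # rng.random() lies in [0, 1), so rate = 0 never drops a value.
    return [v / keep if rng.random() >= rate else 0.0 for v in outputs]


rng = random.Random(42)
layer_output = [1.0] * 10
dropped = dropout(layer_output, rate=0.4, rng=rng)
print(dropped)
```

Each call draws a new random mask, so a different subset of nodes is removed in every training step; at prediction time no nodes are removed.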

 

DL Hyperparameter learning_rate

The learning rate (\(\texttt {learning\_rate}\)) is a parameter of the weight optimization algorithm employed in the NN. It can be understood as a multiplier for the gradient in each iteration of the NN training procedure. The result is used to determine new values for the network weights (Bengio 2012).

The learning rate is essential to the model. When the gradient of the weights is determined, the learning rate decides how large a step to take in the direction of the gradient. Very large values can lead to faster progress on the one hand, but on the other hand can lead to instability and thus prevent the convergence of the training.  

Type::

double, scalar/vector. Usually a scalar, but a schedule of different values can also be supplied to most tensorflow optimizers.

Default::

\(\texttt {1e-3}\).

Sensitivity::

Learning rates have a significant impact on the model. According to Bengio (2012), this parameter is often the most important parameter that should always be considered when tuning neural networks.

Heuristics::

LeCun et al. (2012) propose to estimate learning rates individually for each weight, proportional to the square root of the number of inputs to a node. Bengio (2012), on the other hand, states “The optimal learning rate is usually close to (by a factor of 2) the largest learning rate that does not cause divergence of the training criterion.” Heuristics based on this observation require multiple restarts of the network training procedure, for example: start with a large learning rate and divide it stepwise by three until model training starts to converge (Bengio 2012).

Range::

\(\texttt {learning\_rate}\) \(\in (0,\infty )\).

Transformation::

\(\texttt {identity}\).

Bounds::

\(\texttt {lower = 1e-6; upper = 1e-2}\).

Constraints::

none.

Interactions::

An interaction of \(\texttt {batch\_size}\), \(\texttt {epochs}\), and \(\texttt {learning\_rate}\) is expected: Smaller learning rates or batch sizes may result in larger \(\texttt {epochs}\) being required for model convergence.
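The restart heuristic attributed to Bengio (2012) can be sketched on a toy problem. The objective f(w) = w^2, the starting rate of 10, and both function names are assumptions chosen for illustration; real network training replaces the inner loop.

```python
def diverges(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2 and report whether |w| has grown."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                # derivative of w**2
        w = w - learning_rate * gradient  # learning rate scales the gradient step
    return abs(w) > abs(start)


def largest_stable_rate(initial_rate=10.0, factor=3.0):
    """Start with a large rate and divide by `factor` until training no longer diverges."""
    rate = initial_rate
    while diverges(rate):
        rate /= factor  # one restart of the training procedure per division
    return rate


print(largest_stable_rate())
```

Every pass through the `while` loop corresponds to a complete restart of the training, which is why such heuristics are expensive for large networks.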

 

DL Hyperparameter epochs

The parameter \(\texttt {epochs}\) determines the number of iterations (here: epochs), which are executed during the training of the model. An epoch describes the update of the network weights based on the calculated local gradient. Usually, within an epoch, the entire training data set is considered for determining the gradient (Bengio 2012). Each epoch can be subdivided again (depending on \(\texttt {batch\_size}\)) into single steps.

In practice, \(\texttt {epochs}\) is often not a classical tuning parameter, since it mainly affects the runtime of the tuning procedure. Larger values are generally better for the model quality, but detrimental for the required runtime. However, larger runtimes may also increase the risk of overfitting, if no countermeasures are employed.  

Type::

integer, scalar.

Default::

\(\texttt {4}\).

Sensitivity::

For small values of \(\texttt {epochs}\), the NN is sensitive to changes in \(\texttt {epochs}\). It becomes increasingly insensitive to changes as \(\texttt {epochs}\) increases (i.e., as the model increasingly converges).

Heuristics::

None known.

Range::

\(\texttt {epochs}\) \(\in [1,\infty )\). Only integer values are valid.

Transformation::

\(\texttt {trans\_2pow}\)

Bounds::

\(\texttt {lower = 3; upper = 7}\).

Constraints::

none.

Interactions::

See \(\texttt {batch\_size}\) and \(\texttt {learning\_rate}\).

 

DL Hyperparameter optimizer

The parameter \(\texttt {optimizer}\) selects the optimization algorithm used to train the network weights, e.g., Root Mean Square Propagation (RMSProp) (implemented in Keras as \(\texttt {optimizer\_rmsprop}\)) or ADAM (\(\texttt {optimizer\_adam}\)). Choi et al. (2019) considered RMSProp with momentum (Tieleman and Hinton 2012), ADAM (Kingma and Ba 2015), and NADAM (Dozat 2016) and claimed that the following relations hold:

$$\begin{aligned} \begin{aligned} {\textsc {SGD}}&\subseteq {\textsc {Momentum}} \subseteq {\textsc {RMSProp}},\\ {\textsc {SGD}}&\subseteq {\textsc {Momentum}} \subseteq {\textsc {Adam}},\\ {\textsc {SGD}}&\subseteq {\textsc {Nesterov}} \subseteq {\textsc {NAdam}}. \end{aligned} \end{aligned}$$

ADAM can approximately simulate MOMENTUM if a learning-rate schedule that accounts for ADAM’s bias correction is implemented. Choi et al. (2019) demonstrated that these inclusion relationships are meaningful in practice. In the context of HPT and Hyperparameter Optimization (HPO), inclusion relations can significantly reduce the complexity of the experimental design. These inclusion relations justify the selection of a basic set, e.g., RMSProp, ADAM, and Nesterov-accelerated Adaptive Moment Estimation (NADAM).

Type::

factor.

Default::

\(\texttt {5}\).

Sensitivity::

unknown.

Heuristics::

We are not aware of heuristics.

Range::

\(\texttt {optimizer}\) \( \in \{\) \(\texttt {"SGD"}\), \(\texttt {"RMSPROP"}\), \(\texttt {"ADAGRAD"}\), \(\texttt {"ADADELTA"}\), \(\texttt {"ADAM"}\), \(\texttt {"ADAMAX"}\), \(\texttt {"NADAM"}\) \(\}\).

Transformation::

\(\texttt {identity}\).

Bounds::

\(\texttt {lower = 1; upper = 7}\).

Constraints::

none.

Interactions::

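Necessarily, there is an interaction between \(\texttt {optimizer}\) and the parameters of the selected optimization algorithm, e.g., \(\texttt {learning\_rate}\).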
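The simplest of the inclusion relations, SGD ⊆ MOMENTUM, can be checked directly: with the momentum coefficient set to zero, the momentum update reduces to the plain SGD update. The following sketch uses a common textbook formulation of both updates and the toy objective f(w) = w^2; the function names are hypothetical and the code is not taken from Choi et al. (2019).

```python
def sgd_step(w, gradient, learning_rate):
    """Plain gradient step."""
    return w - learning_rate * gradient


def momentum_step(w, velocity, gradient, learning_rate, momentum):
    """Gradient step with a velocity term that accumulates past gradients."""
    velocity = momentum * velocity - learning_rate * gradient
    return w + velocity, velocity


w_sgd, w_mom, velocity = 1.0, 1.0, 0.0
sgd_path, momentum_path = [], []
for _ in range(10):
    w_sgd = sgd_step(w_sgd, 2.0 * w_sgd, learning_rate=0.1)
    w_mom, velocity = momentum_step(w_mom, velocity, 2.0 * w_mom,
                                    learning_rate=0.1, momentum=0.0)
    sgd_path.append(w_sgd)
    momentum_path.append(w_mom)

# With momentum = 0, both optimizers follow the same trajectory.
print(all(abs(a - b) < 1e-12 for a, b in zip(sgd_path, momentum_path)))
```

A tuner that searches over the more general optimizer together with its additional parameters therefore also covers the special case, which is the argument for restricting the experimental design to a basic set.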

 

DL Hyperparameter loss

This parameter determines the loss function that is minimized when training the network (optimizing the weights). The loss function can have a significant influence on the quality of the model (Janocha and Czarnecki 2017). However, it is not a typical tuning parameter, in part because the tuning procedure itself requires a consistent loss function, to identify better configurations of the hyperparameters. The \(\texttt {loss}\) parameter is therefore usually chosen separately by the user before the tuning procedure.  

Type::

character, scalar.

Default::

problem dependent, parameter is not tuned.

Sensitivity::

not known.

Heuristics::

not known.

Range::

several standard loss functions (such as Mean Squared Error (MSE)) are available in tensorflow; custom loss functions can be provided by users.

Transformation::

not implemented, because parameter is not tuned.

Bounds::

not implemented, because parameter is not tuned.

Constraints::

Some loss functions are specific to certain tasks (e.g., classification: crossentropy, regression: MSE).

Interactions::

unknown.

 

DL Hyperparameter batch_size

When determining the gradient of the network weights, either the whole data set can be used for this or only a subset (here: batch). The size of this subset is specified by \(\texttt {batch\_size}\).

The parameter \(\texttt {batch\_size}\) mainly affects the runtime of the training (Bengio 2012). However, \(\texttt {batch\_size}\) also affects the quality of the model. Small batch sizes may introduce a strong random element to weight updates, which can hinder or benefit the learning process. Shallue et al. (2019) and Zhang et al. (2019) have shown empirically that increasing the batch size can increase the gaps between training times for different optimizers.

Type::

integer, scalar.

Default::

\(\texttt {32}\).

Sensitivity::

unknown.

Heuristics::

We are not aware of heuristics; 32 is suggested as a good default value (Bengio 2012). However, in the experience of the authors, this is highly dependent on the data, the computer architecture, and the further configuration of the model. Specifying \(\texttt {batch\_size}\) as a function of n should also be considered.

Range::

\(\texttt {batch\_size}\) \(\in [1,n]\). Only integer values are valid. Common \(\texttt {batch\_size}\) values are between 10 and several hundred (Bengio 2012), but several thousand are also possible (Mendoza et al. 2016).

Transformation::

not implemented, because parameter is not tuned.

Bounds::

not implemented, because parameter is not tuned.

Constraints::

none.

Interactions::

Necessarily, there is an interaction between \(\texttt {batch\_size}\) and \(\texttt {epochs}\), since both together determine the number of steps of the training procedure. In addition, an interaction of \(\texttt {batch\_size}\), \(\texttt {epochs}\), and \(\texttt {learning\_rate}\) is also expected. The interaction between \(\texttt {batch\_size}\) and \(\texttt {learning\_rate}\) is also mentioned by Bengio (2012).
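How \(\texttt {batch\_size}\) and \(\texttt {epochs}\) jointly determine the number of weight updates can be written down in a short sketch. It assumes that n denotes the number of training observations and that an incomplete final batch is still processed; the function name is hypothetical.

```python
import math


def training_steps(n, batch_size, epochs):
    """Number of weight updates for n observations (assumes a partial last batch is used)."""
    steps_per_epoch = math.ceil(n / batch_size)
    return steps_per_epoch * epochs


print(training_steps(n=1000, batch_size=32, epochs=16))    # 512
print(training_steps(n=1000, batch_size=1000, epochs=16))  # 16
```

For a fixed number of epochs, a smaller batch size yields more (but noisier) updates, which is one reason why the three parameters should not be tuned in isolation.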

 

9 Summary and Discussion

On the basis of our literature survey, we recommend tuning the introduced hyperparameters of ML models. In the experiments described in this study, we also investigate five additional parameters:

  • \(\texttt {dropoutfact}\) is a multiplier for \(\texttt {dropout}\), which reduces or increases \(\texttt {dropout}\) in each consecutive layer of the network;

  • \(\texttt {unitsfact}\) performs the same job but for \(\texttt {units}\); and

  • \(\texttt {beta\_1}\), \(\texttt {beta\_2}\), and \(\texttt {epsilon}\) are parameters affecting the \(\texttt {optimizer}\).
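How the two multipliers could translate into per-layer settings is sketched below. The geometric scaling from layer to layer follows the description above, whereas the rounding, the minimum of one unit, and the function name are assumptions; the sketch is not the SPOTMisc implementation.

```python
def per_layer_settings(layers, units, dropout, unitsfact, dropoutfact):
    """Return (units, dropout) for each hidden layer, scaled layer by layer."""
    settings = []
    for _ in range(layers):
        # Rounding and the lower limit of one unit are assumptions of this sketch.
        settings.append((max(1, round(units)), dropout))
        units *= unitsfact      # shrink or grow the next layer
        dropout *= dropoutfact  # reduce or increase dropout in the next layer
    return settings


print(per_layer_settings(layers=3, units=32, dropout=0.2,
                         unitsfact=0.5, dropoutfact=0.5))
```

With this parameterization, the tuner handles two scalar multipliers instead of one \(\texttt {units}\) and one \(\texttt {dropout}\) value per layer, which keeps the dimension of the search space independent of \(\texttt {layers}\).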

Reasonable bounds for all investigated parameters are summarized in Table 3.8.

Table 3.8 Overview of hyperparameters in the experiments. For data type, we employ the signifiers used in R. For categorical parameters, we list categories instead of providing bounds. “Default” refers to the ML default values in mlr and to the DL default values in SPOTMisc