1 Introduction

The DNN hyperparameter study described in this chapter uses the same data and the same HPT process as the ML studies in Chaps. 8 and 9. Section 10.2 describes the data preprocessing. Section 10.3 explains the experimental setup and the configuration of the DL models. The objective function is defined in Sect. 10.4. The hyperparameter tuner, \(\texttt {spot}\), is described in Sect. 10.5. Based on this setup, experimental results are analyzed: After discussing tunability based on the HPT progress in Sect. 10.6, default hyperparameters, \(\lambda _0\), and tuned hyperparameters, \(\lambda ^{\star }\), are compared in Sect. 10.6.2. The DL tuning process is analyzed in Sect. 10.7. Results are validated using severity in Sect. 10.8. A summary in Sect. 10.9 concludes this chapter. The DL hyperparameter tuning pipeline that was used for the experiments is summarized in Table 10.1 and illustrated in Fig. 10.1. The first sections of this chapter highlight the most important steps of this pipeline. The program code for performing the experiments is shown in Sect. 10.10.

keras  is TensorFlow (TF)’s high-level Application Programming Interface (API) designed with a focus on enabling fast experimentation. TF is an open source software library for numerical computations with data flow graphs (Abadi et al. 2016). Mathematical operations are represented as nodes in the graph, and the graph edges represent the multidimensional arrays of data (tensors) (O’Malley et al. 2019). The full TF API can be accessed via the tensorflow package from within the R software environment for statistical computing and graphics (R).

The Appendix contains information on how to set up the required Python software environment for performing HPT with keras, SPOT, and SPOTMisc. Source code for performing the experiments will be included in the R package SPOTMisc. Further information is published on https://www.spotseven.de and, with some delay, on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/package=SPOT). This delay is caused by an intensive code check, which is performed by the CRAN team. It guarantees high-quality open source software and is an important feature for providing reliable software that is not just a flash in the pan.

Table 10.1 Deep-learning hyperparameter pipeline
Fig. 10.1
figure 1

Overview. The HPT pipeline introduced in this chapter comprises the following steps: After the data acquisition (\(\texttt {getDataCensus}\)), the data is split into training, validation, and test sets. These data sets are processed via the function \(\texttt {genericDataPrep}\). keras is configured via the function \(\texttt {getKerasConf}\). The hyperparameter tuner \(\texttt {spot}\) is called and finally, the results are evaluated (\(\texttt {evalParamCensus}\))

2 Data Description

Identically to the ML case studies, the DL case study presented in this chapter uses the Census-Income (KDD) Data Set (CID), which is made available, for example, via the University of California, Irvine (UCI) Machine Learning Repository.Footnote 1,Footnote 2

2.1 \(\texttt {getDataCensus}\): Getting the Data from OpenML

Before training the DNN, the data is preprocessed by reshaping it into the shape the DNN can process. The function \(\texttt {getDataCensus}\) is used to get the Open Machine Learning (OpenML) data (from cache or from server). The same options as in the previous ML studies will be used, i.e., the parameter settings from Table 8.3 will be used.

figure a
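The call can be sketched as follows. This is a minimal sketch, not the book's exact listing; the argument names and values are assumed to mirror the options from Table 8.3.

```r
# Hedged sketch: fetch the CID data from OpenML via SPOTMisc.
# Argument names/values are assumptions based on Table 8.3.
library("SPOTMisc")

dfCensus <- getDataCensus(
  task.type   = "classif",    # binary classification
  nobs        = 1e4,          # number of observations to sample
  nfactors    = "high",       # number of categorical features
  nnumericals = "high",       # number of numerical features
  cardinality = "high",       # cardinality of the categorical features
  data.seed   = 1,            # seed for a reproducible sample
  cachedir    = "oml.cache",  # local cache for the OpenML download
  target      = "age"         # response: age < 40 vs. age >= 40
)
str(dfCensus)                 # 10 000 obs. of 23 variables
```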

2.2 \(\texttt {getGenericTrainValTestData}\): Split Data in Train, Validation, and Test Data

The data frame \(\texttt {dfCensus}\), \((X,Y) \subset (\mathcal {X}, \mathcal {Y})\), with 10 000 observations of 23 variables, is available. Based on \(\texttt {prop}\), the data is split into training, validation, and test data sets, \((X,Y)^{(\text {train})}\), \((X,Y)^{(\text {val})}\), and \((X,Y)^{(\text {test})}\), respectively. If \(\texttt {prop = 2/3}\), the training data set has 4 444 observations, the validation data set has 2 222 observations, and the test data set has the remaining 3 334 observations.

figure b
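A hedged sketch of this splitting step is shown below; the names of the list elements returned by \(\texttt {getGenericTrainValTestData}\) are assumptions.

```r
# Hedged sketch: split dfCensus into training, validation, and test data.
# With prop = 2/3: 4444 / 2222 / 3334 observations.
data <- getGenericTrainValTestData(dfGeneric = dfCensus, prop = 2/3)
str(data, max.level = 1)   # element names (e.g., trainGeneric) are assumed
```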

2.3 \(\texttt {genericDataPrep}\): Spec

The third step of the data preprocessing generates a \(\texttt {specList}\).

figure c
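A minimal sketch of this step is given below; only two arguments are shown, and their names are assumptions (further arguments, e.g., the embedding threshold discussed in Sect. 10.2.3.5, are omitted).

```r
# Hedged sketch: prepare the batched TF data sets and the feature spec.
specList <- genericDataPrep(data = data, batch_size = 32)
```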

The function \(\texttt {genericDataPrep}\) works as described in Sects. 10.2.3.1–10.2.3.5.

2.3.1 The Iterator: Data Frame to Data Set

The helper function \(\texttt {df\_to\_dataset}\) Footnote 3 converts the data frame \(\texttt {dfCensus}\) into a data set. This procedure enables processing of very large Comma Separated Values (CSV) files (so large that they do not fit into memory). The elements of the training data sets are randomly shuffled. Finally, consecutive elements of this data set are combined into batches.

Applying the function \(\texttt {df\_to\_dataset}\) generates a list of tensors. Each tensor represents a single column. The most significant difference to R's data frames is that a TF data set is an iterator.

figure d
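The helper can be sketched as follows, following the TF for R tutorial referenced in the footnote; this is not the exact implementation used in SPOTMisc.

```r
# Hedged sketch of df_to_dataset(): convert a data frame into an
# (optionally shuffled) batched TF data set.
library("tfdatasets")

df_to_dataset <- function(df, shuffle = TRUE, batch_size = 32) {
  ds <- tensor_slices_dataset(df)           # one tensor per column
  if (shuffle) {
    ds <- dataset_shuffle(ds, buffer_size = nrow(df))
  }
  dataset_batch(ds, batch_size)             # combine consecutive elements
}

# the element name 'trainGeneric' is an assumption (see above)
train_ds_generic <- df_to_dataset(data$trainGeneric, batch_size = 32)
```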

Background: Iterators

Each time an iterator is called, it yields a different batch of rows from the data set. The iterator function \(\texttt {iter\_next}\) can be called as follows to display individual batches.

figure e
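A sketch of inspecting a single batch is shown below; \(\texttt {as\_iterator}\) and \(\texttt {iter\_next}\) are provided by the reticulate package.

```r
# Hedged sketch: each call to iter_next() yields the next batch.
library("reticulate")

it    <- as_iterator(train_ds_generic)
batch <- iter_next(it)        # named list of tensors, one per column
str(batch, max.level = 1)
```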

The data set \(\texttt {train\_ds\_generic}\) returns a list of column names (from the data frame) that map to column values from rows in the data frame.

2.3.2 The feature_spec Object: Specifying the Target

TF has built-in methods to perform common input conversions.Footnote 4 The powerful \(\texttt {feature\_column}\) system will be accessed via the user-friendly, high-level interface called \(\texttt {feature\_spec}\). While working with structured data, e.g., CSV data, column transformations and representations can be initialized and specified. A practical benefit of implementing data preprocessing within model \(\mathcal {A}\) is that when \(\mathcal {A}\) is exported, the preprocessing is already included. In this case, new data can be passed directly to \(\mathcal {A}\).

! Attention: Keras Preprocessing Layers

keras and tensorflow are under constant development. The current implementation in SPOTMisc classifies structured data with feature columns. The corresponding TF module was designed for use with TF version 1 estimators. It does fall under compatibility guarantees.Footnote 5 The newly developed keras module uses “preprocessing layers” for building keras-native input processing pipelines. Future versions of SPOTMisc will be based on preprocessing layers. However, because the underlying ideas of both approaches are similar (TF provides a migration guideFootnote 6), the most important preprocessing steps will be presented next.

First the \(\texttt {spec}\) object \(\texttt {specGeneric}\) is defined. The response variable, here: \(\texttt {target}\), can be specified using a formula, see Chambers and Hastie (1992) and the R function \(\texttt {formula}\).

figure f
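A sketch of this initialization step, assuming the response is named \(\texttt {target}\) as in the text:

```r
# Hedged sketch: initialise the feature_spec with target as the response.
library("tfdatasets")
specGeneric <- feature_spec(train_ds_generic, target ~ .)
```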

2.3.3 Adding Steps to the feature_spec Object

The CID data set contains a variety of data types. These mixed data types are converted to a fixed-length vector for the DL model to process. Depending on their feature type, data type, or level, the columns are treated differently. After creating the \(\texttt {feature\_spec}\) object, the step functions from Table 10.2 can be used to add further \(\texttt {steps}\). Depending on the data type, the step functions specify the data transformations. Table 10.3 shows these types.

Table 10.2 Steps: data transformations depending on the data type
Table 10.3 Description of the CID feature and data types that are used in the data set \((X,Y) \subset (\mathcal {X}, \mathcal {Y})\)

The R package tfdatasets provides selectors to select certain variable types and ranges, e.g., \(\texttt {all\_numeric}\) to select all numeric variables, \(\texttt {all\_nominal}\) to select all character variables, or \(\texttt {has\_type("float32")}\) to select variables based on their TF variable type. Based on the feature and data types shown in Table 10.3, the data transformations from Table 10.2 are applied. We will consider feature specs for continuous and categorical data separately.

2.3.4 Feature Spec: Continuous Data

For continuous data, i.e., numerical variables, the function \(\texttt {step\_numeric\_column}\) will be used and all numeric variables will be normalized (scaled). The R package tfdatasets provides the scaler function \(\texttt {scaler\_min\_max}\), which uses the minimum and maximum of the numeric variable, and the function \(\texttt {scaler\_standard}\), which uses the mean and the standard deviation.

2.3.5 Feature Spec: Categorical Data

The DNN model \(\mathcal {A}\) cannot directly process categorical (nominal) data—they must be transformed so that they can be represented as numbers. The representation of categorical variables as a set of one-hot encoded columns is widely used in practice (Chollet and Allaire 2018). There are basically two options for specifying the kind of numeric representation used for categorical variables: indicator columns or embedding columns.

Background: Embedding

Suppose instead of having a factor with a few levels (e.g., three levels such as \(\texttt {red}\), \(\texttt {green}\), or \(\texttt {blue}\)), there are hundreds or even more levels. As the number of levels grows very large, it becomes infeasible to train a DNN using one-hot encodings. In this situation, embedding should be used: instead of representing the data as a very large one-hot vector, the data can be stored as a low-dimensional vector of real numbers. Note, the size of the embedding is a parameter that must be tuned (Abadi et al. 2015).

The implementation in SPOTMisc uses two steps: first, based on the number of \(\texttt {levels}\), i.e., the value of the parameter \(\texttt {minLevelSizeEmbedding}\) in the following code, the set of columns for which embedding should be used is determined. Then, either the function \(\texttt {step\_indicator\_column}\) or the function \(\texttt {step\_embedding\_column}\) is applied.

figure g
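The general pattern can be sketched as follows, using the selector helpers from tfdatasets as in its vignettes; the exact column selection and the embedding dimension used in SPOTMisc may differ.

```r
# Hedged sketch of the added steps: scale numeric columns and one-hot
# encode nominal columns; for factors with >= minLevelSizeEmbedding
# levels, an embedding column would replace the indicator column.
library("tfdatasets")

specGeneric <- specGeneric %>%
  step_numeric_column(all_numeric(),
                      normalizer_fn = scaler_standard()) %>%
  step_categorical_column_with_vocabulary_list(all_nominal()) %>%
  step_indicator_column(all_nominal())
# step_embedding_column(<high-cardinality columns>, dimension = 8)
```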

After adding a \(\texttt {step}\) we need to \(\texttt {fit}\) the \(\texttt {specGeneric}\) object:

figure h
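This fit step can be sketched as:

```r
# Hedged sketch: fitting the spec computes the vocabularies and the
# scaling statistics used by scaler_standard().
library("tfdatasets")
specGeneric_prep <- fit(specGeneric)
```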

Finally, the following data structures are available:

  1.

    \(\texttt {train\_ds\_generic}\) (batched, based on 4444 samples)

  2.

    \(\texttt {val\_ds\_generic}\) (batched, based on 2222 samples)

  3.

    \(\texttt {specGeneric\_prep}\) and

  4.

    \(\texttt {testGeneric}\) (the remaining 3334 samples).

These data are returned as the list \(\texttt {specList}\) from the function \(\texttt {genericDataPrep}\).

figure i

Dense features prepared with TF’s feature columns mechanism can be listed. There are 22 dense features that will be passed to the DNN.

figure j
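A sketch of listing the prepared dense features from the fitted spec:

```r
# Hedged sketch: list the prepared dense feature columns (22 expected).
library("tfdatasets")
dense_features(specList$specGeneric_prep)
length(dense_features(specList$specGeneric_prep))  # 22
```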

3 Experimental Setup and Configuration of the Deep Learning Models

3.1 \(\texttt {getKerasConf}\): keras and Tensorflow Configuration

Setting up the keras configuration from within SPOTMisc is a simple step: the function \(\texttt {getKerasConf}\) is called. The function \(\texttt {getKerasConf}\) passes additional parameters to the \(\texttt {keras}\) function, e.g.,  

\(\texttt {activation:}\):

Activation function in the last Neural Network (NN) layer. Default: \(\texttt {"sigmoid"}\).

\(\texttt {active:}\):

Vector of active variables, e.g., \(\texttt {c(1,10)}\) specifies that only the first and tenth variable will be considered by \(\texttt {spot}\). This mechanism allows shrinking the full set of tunable parameters, say \(\lambda \), to a smaller set, \(\lambda ^{(-)}\), if the user wants to investigate the tunability (or the effect) of one or only a few hyperparameters.

\(\texttt {callbacks:}\):

List of callbacks to be called during training. Default: \(\texttt {list()}\).

\(\texttt {clearSession:}\):

Whether to call \(\texttt {k\_clear\_session}\) or not at the end of keras modeling. Default: \(\texttt {FALSE}\).

\(\texttt {encoding:}\):

Encoding used during data preparation. Default: \(\texttt {"oneHot"}\).

\(\texttt {loss:}\):

Loss function, \(\mathcal {L}\), for the \(\texttt {compile}\) function from the package keras. For example, Binary Cross Entropy (BCE) loss as defined in Eq. (2.3).

Default: \(\texttt {"loss\_binary\_crossentropy"}\).

\(\texttt {metrics:}\):

Metrics function for compile. Default: \(\texttt {"binary\_accuracy"}\).

\(\texttt {model:}\):

Model, \(\mathcal {A}\), as specified via \(\texttt {getModelConf}\). Default: \(\texttt {"dl"}\). Forthcoming versions of SPOTMisc will provide additional DNN model types, e.g., Convolutional Neural Networks (CNNs).

\(\texttt {nClasses:}\):

Number of classes in (multi-class) classification. Specifies the number of units in the last layer (before \(\texttt {softmax}\)). Default: \(\texttt {1}\) (binary classification).

\(\texttt {resDummy:}\):

If \(\texttt {TRUE}\), generate dummy (mock up) result for testing. If \(\texttt {FALSE}\), run keras and tensorflow evaluations. Default: \(\texttt {FALSE}\).

\(\texttt {returnValue:}\):

Return value. Can be one of \(\texttt {"trainingLoss"}\), \(\texttt {"negTrainingAccuracy"}\), \(\texttt {"validationLoss"}\), \(\texttt {"negValidationAccuracy"}\), \(\texttt {"testLoss"}\), or \(\texttt {"negTestAccuracy"}\).

\(\texttt {returnObject:}\):

Return object. Can be one of \(\texttt {"evaluation"}\), \(\texttt {"model"}\), \(\texttt {"pred"}\). Default: \(\texttt {"evaluation"}\).

\(\texttt {shuffle:}\):

Logical (whether to shuffle the training data \((X,Y)^{(\text {train})}\) before each epoch) or string (for “batch”). Used in the function \(\texttt {df\_to\_dataset}\). "batch" is a special option for dealing with the limitations of the Hierarchical Data Format (HDF) version 5 data. It shuffles in batch-sized chunks. Default: \(\texttt {FALSE}\).

\(\texttt {testData:}\):

Test data, \((X,Y)^{(\text {test})}\), on which to evaluate the loss, \(\mathcal {L}\), and any model metrics, \(\psi ^{(\text {test})}\), at the end of the optimization using the function \(\texttt {evaluate}\).

\(\texttt {tfDevice:}\):

Tensorflow device. CPU/GPU allocation. Passed to \(\texttt {tensorflow}\) via \(\texttt {tf\$device(kerasConf\$tfDevice)}\). Default: \(\texttt {"/cpu:0"}\) (use CPU only).

\(\texttt {trainData:}\):

Training data, \((X,Y)^{(\text {train})}\), on which to evaluate the loss and any model metrics at the end of each epoch.

\(\texttt {validationData:}\):

Validation data, \((X,Y)^{(\text {val})}\), on which to evaluate the loss, \(\psi ^{(\text {val})}\), and any model metrics at the end of each epoch.

\(\texttt {validation\_split:}\):

Float between 0 and 1. Fraction of the training data \((X,Y)^{(\text {train})}\) to be used internally by \(\mathcal {A}\) as validation data \((X,Y)^{(\text {valtrain})}\). \(\mathcal {A}\) will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on \((X,Y)^{(\text {valtrain})}\) at the end of each epoch. \((X,Y)^{(\text {valtrain})}\) is selected from the last samples in the \((X,Y)^{(\text {train})}\) data provided, before shuffling. Default: \(\texttt {0.2}\).

\(\texttt {verbose:}\):

Verbosity mode (0 = silent, 1 = progress bar, 2 = one line per epoch). Default: \(\texttt {0}\).

  The default settings are useful for the binary classification task analyzed in this chapter. Only the parameters \(\texttt {kerasConf\$clearSession}\), which is set to \(\texttt {TRUE}\), and \(\texttt {kerasConf\$verbose}\), which is set to \(\texttt {0}\), are modified.

figure k
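This configuration step can be sketched as follows:

```r
# Hedged sketch: default keras configuration with the two changes above.
library("SPOTMisc")
kerasConf <- getKerasConf()
kerasConf$clearSession <- TRUE
kerasConf$verbose      <- 0
```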

3.2 \(\texttt {getModelConf}\): DL Hyperparameters

figure l
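A hedged sketch of querying the hyperparameter configuration is shown below; the argument name \(\texttt {model}\) and the element names (\(\texttt {tunepars}\), \(\texttt {lower}\), \(\texttt {upper}\), \(\texttt {type}\)) are assumptions based on the text and Table 10.4.

```r
# Hedged sketch: names, bounds, and types of the "dl" hyperparameters.
cfg <- getModelConf(model = "dl")
cfg$tunepars   # hyperparameter names (cf. Table 10.4)
cfg$lower      # lower bounds
cfg$upper      # upper bounds
cfg$type       # "numeric", "integer", or "factor"
```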

If the default values from the function \(\texttt {getKerasConf}\) are used, the vector of hyperparameters, \(\lambda \), contains the following elements: the dropout rates (dropout rates of the layers are tuned individually), the number of units (the number of single outputs from a single layer), the learning rate (which controls how much the DNN model is changed in response to the estimated error each time the model weights are updated), the number of training epochs (a training epoch is one forward and backward pass of a complete data set), the optimizer for the inner loop, \(\mathcal {O}_{\text {inner}}\), and its parameters (i.e., \(\beta _1\), \(\beta _2\), and \(\epsilon \)), and the number of layers. These hyperparameters and their ranges are listed in Table 10.4.

Table 10.4 The hyperparameters, \(\lambda \), for the DNN, which implements a fully connected network

To enable compatibility with the ranges of the learning rates of the other optimizers, the learning rate of the optimizer \(\texttt {adadelta}\) is internally mapped to \(\texttt {1-learning\_rate}\). That is, a learning rate of 0 will be mapped to 1 (which is \(\texttt {adadelta}\)'s default learning rate). The learning rate of \(\texttt {adagrad}\) and \(\texttt {sgd}\) is internally mapped to \(\texttt {10 * learning\_rate}\). That is, a learning rate of 0.001 will be mapped to 0.01 (which is \(\texttt {adagrad}\)'s and \(\texttt {sgd}\)'s default). The learning rate of \(\texttt {adamax}\) and \(\texttt {nadam}\) is internally mapped to \(\texttt {2 * learning\_rate}\). That is, a learning rate of 0.001 will be mapped to 0.002 (which is \(\texttt {adamax}\)'s and \(\texttt {nadam}\)'s default).

The hyperparameter \(x_{11}\), which encodes the \(\texttt {optimizer}\), is implemented as a factor. The factor levels, which represent the available optimizers, are listed in Table 10.5.

A discussion of the DNN hyperparameters, \(\lambda \), recommendations for their settings and further information are presented in Sect. 3.8. The R function \(\texttt {getModelConf}\) provides information about hyperparameter names, ranges, and types.

Table 10.5 Optimizers that can be selected via hyperparameter \(x_{11}\). Default optimizer \(\mathcal {O}_{\text {inner}}\) is \(\texttt {adam}\). The function \(\texttt {selectKerasOptimizer}\) from the SPOTMisc package implements the selection. The corresponding R functions have the prefix \(\texttt {optimizer\_}\), e.g., \(\texttt {adamax}\) can be called via \(\texttt {optimizer\_adamax}\)

3.3 The Neural Network

Background: Network Implementation in SPOTMisc

The SPOTMisc function \(\texttt {getModelConf}\) selects a pre-specified, but not pre-trained, DL network \(\mathcal {A}\). This network is called via \(\texttt {funKerasGeneric}\), which is the interface to \(\texttt {spot}\). \(\texttt {funKerasGeneric}\) uses a network that is implemented as follows:

To build the DNN in keras, the function \(\texttt {layer\_dense\_features}\), which processes the feature columns specification, is used (Fig. 10.2). It receives the data set \(\texttt {specGeneric\_prep}\) as input and returns an array of all dense features:

figure m
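A hedged sketch of this layer is shown below; the object name \(\texttt {layer\_generic}\) is hypothetical.

```r
# Hedged sketch: a layer that converts a mixed-type batch into a single
# dense (numeric) feature array, based on the fitted feature_spec.
library("keras")
library("tfdatasets")

layer_generic <- layer_dense_features(
  feature_columns = dense_features(specList$specGeneric_prep))
# Applying layer_generic to a batch (see the iterator call below) yields
# the scaled and encoded feature matrix that feeds the dense layers.
```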

The iterator can be called to take a look at the (scaled) output:

figure n

The NN model can be compiled after the following components are specified using the \(\texttt {compile}\) function from keras: the \(\texttt {loss}\) function, \(\mathcal {L}\), which determines how good the DNN prediction is (based on the \((X,Y)^{(\text {val})}\)); the \(\texttt {optimizer}\), i.e., \(\mathcal {O}_{\text {inner}}\), the update mechanism of \(\mathcal {A}\), which adjusts the weights using backpropagation; and the \(\texttt {metrics}\), which monitor the progress during training and testing.

! Attention: Hyperparameter Values

To improve the readability of the code, evaluated (“forced”) values of the hyperparameters \(\lambda \) are shown in the code snippets below instead of the arguments that are passed from the tuner \(\texttt {spot}\) to the function \(\texttt {funKerasGeneric}\).

figure o
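A hedged sketch of a simple DNN of the kind shown in Fig. 10.2, together with the compile step, is given below; the layer sizes, the dropout rate, and the learning rate are illustrative only and do not reproduce the exact \(\texttt {funKerasGeneric}\) implementation.

```r
# Hedged sketch of a simple DNN (cf. Fig. 10.2): the dense-features layer
# is followed by fully connected layers; the model is then compiled with
# BCE loss, the adam optimizer, and binary accuracy.
library("keras")
library("tfdatasets")

model <- keras_model_sequential() %>%
  layer_dense_features(
    feature_columns = dense_features(specList$specGeneric_prep)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer_adam(learning_rate = 1e-3),  # illustrative value
  metrics   = "binary_accuracy"
)
```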
Fig. 10.2
figure 2

Simple DNN based on the code in this section

The DNN training can be started as follows (using keras's \(\texttt {fit}\) function). The model is trained on the CPU, using the setting \(\texttt {tf\$device("/cpu:0")}\), and monitored on the validation data set:

figure p
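A hedged sketch of the training call is shown below; \(\texttt {dataset\_use\_spec}\) attaches the fitted spec so that the data set yields (features, target) pairs, and the epoch count is illustrative.

```r
# Hedged sketch: train on the CPU and monitor the validation data set
# (keras and tfdatasets are assumed to be loaded, see above).
library("tensorflow")

with(tf$device("/cpu:0"), {
  history <- model %>% fit(
    dataset_use_spec(specList$train_ds_generic,
                     spec = specList$specGeneric_prep),
    validation_data = dataset_use_spec(specList$val_ds_generic,
                                       specList$specGeneric_prep),
    epochs  = 16,        # illustrative; the tuned value is discussed later
    verbose = 0
  )
})
plot(history)            # loss and accuracy per epoch (cf. Fig. 10.3)
```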

The predictions from the DNN model are shown in the following code snippet. The tensor values are the output from the final DNN layer after the sigmoid function was applied. Values are from the interval [0, 1] and represent probabilities: values smaller than 0.5 are interpreted as predictions “\(\texttt {age}\) \(< 40\)”, otherwise “\(\texttt {age}\) \(\ge 40\)”.

figure q
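The thresholding rule described above can be illustrated with a few hypothetical probabilities:

```r
# Hedged sketch: sigmoid outputs in [0, 1] interpreted as class labels
# (the probability values are illustrative only).
prob <- c(0.12, 0.87, 0.49)
ifelse(prob < 0.5, "age < 40", "age >= 40")
```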

Figure 10.3 shows the quantities that are being displayed during training:

  (i)

    the loss of the network over the training and validation data, \(\psi ^{(\text {train})}\) and \(\psi ^{(\text {val})}\), respectively, and

  (ii)

    the accuracy of the network over the training and validation data, \(f_{\text {acc}}^{(\text {train})}\) and \(f_{\text {acc}}^{(\text {val})}\), respectively.

This figure illustrates that an accuracy greater than 80% on the training data, \((X,Y)^{(\text {train})}\), can be reached quickly.

Fig. 10.3
figure 3

DNN training. History of the inner optimization loop

Figure 10.3 can indicate (even if this is only a short fit procedure) whether the modeling is affected by overfitting or not. If this situation occurs, it might be useful to implement dropout layers or use other methods to prevent overfitting.

The effects of HPT and the tunability of \(\mathcal {A}\) will be described in the following sections. Finally, using keras's \(\texttt {evaluate}\) function, the DNN model performance can be checked on \(X^{(\text {test})}\).

figure r
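A hedged sketch of this final evaluation is shown below; converting the test data frame into a batched data set mirrors the training pipeline, and the element name \(\texttt {testGeneric}\) is taken from the text.

```r
# Hedged sketch: evaluate loss and accuracy on the held-out test data.
test_ds_generic <- df_to_dataset(specList$testGeneric, shuffle = FALSE)
model %>% evaluate(
  dataset_use_spec(test_ds_generic, specList$specGeneric_prep),
  verbose = 0
)
```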

The relationship between \(\psi ^{(\text {train})}\), \(\psi ^{(\text {val})}\), and \(\psi ^{(\text {test})}\) as well as between \(f_{\text {acc}}^{(\text {train})}\), \(f_{\text {acc}}^{(\text {val})}\), and \(f_{\text {acc}}^{(\text {test})}\) can be analyzed with Sequential Parameter Optimization Toolbox (SPOT), because it computes and reports these values.

4 \(\texttt {funKerasGeneric}\): The Objective Function

The hyperparameter tuner, e.g., \(\texttt {spot}\), performs model selection during the tuning run: training data \(X^{(\text {train})}\) is used for fitting (training) the models, e.g., the weights of the DNNs. Each trained model \(\mathcal {A} _{\lambda _i}\left( X^{(\text {train})}\right) \) will be evaluated on the validation data \(X^{(\text {val})}\), i.e., the loss is calculated as shown in Eq. (2.9). Based on \((\lambda _i, \psi ^{(\text {val})}_i )\), at each iteration of the outer optimization loop a surrogate model \(\mathcal {S}(t)\) is fitted, e.g., a Bayesian Optimization (BO) (Kriging) model using \(\texttt {spot}\) ’s \(\texttt {buildKriging}\) function.

For each hyperparameter configuration \(\lambda _i\), the objective function \(\texttt {funKerasGeneric}\) reports the following information about the related DNN model \(\mathcal {A} _{\lambda _i}\):

  1.

    training loss, \(\psi ^{(\text {train})}\),

  2.

    training accuracy, \(f_{\text {acc}}^{(\text {train})}\),

  3.

    validation (testing) loss, \(\psi ^{(\text {val})}\), and

  4.

    validation (testing) accuracy, \(f_{\text {acc}}^{(\text {val})}\).

5 \(\texttt {spot}\): Experimental Setup for the Hyperparameter Tuner

The SPOT package for R, which was introduced in Sect. 4.5, will be used for the DL hyperparameter tuning (Bartz-Beielstein et al. 2021). The budget is set to twelve hours, i.e., the run time of DL tuning is larger than the run time of the ML tuning. The budget for the \(\texttt {spot}\) runs was set to this value, because of the complexity of the hyperparameter search space \(\Lambda \) and the relatively long run time of the DNN.

SPOT provides several options for adjusting the HPT parameters, e.g., type of the Surrogate Model Based Optimization (SMBO) model, \(\mathcal {S}\), and optimizer, \(\mathcal {O}\), as well as the size of the initial design, \(n_{\text {init}}\). These parameters can be passed via the \(\texttt {spotControl}\) function to \(\texttt {spot}\). For example, instead of the default surrogate \(\mathcal {S}\), which is BO (implemented as \(\texttt {buildKriging}\)), a Random Forest (RF), (implemented as \(\texttt {buildRanger}\)) can be chosen.
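As a hedged illustration, a control list along these lines could be passed to \(\texttt {spot}\); the element names follow the SPOT documentation, while the concrete values are illustrative only.

```r
# Hedged sketch: override the surrogate, initial design size, and budget.
library("SPOT")

control <- list(
  funEvals      = 100,              # number of DNN evaluations (budget)
  model         = buildRanger,      # RF surrogate instead of BO (Kriging)
  designControl = list(size = 40)   # size of the initial design, n_init
)
# a wall-clock budget (e.g., twelve hours) can be configured as well,
# see spotControl() for the available settings
```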

The general DL HPT data workflow is as follows: first, the training data, \((X,Y)^{(\text {train})}\), are fed to the DNN. The DNN will then learn to associate features and labels. Based on the keras parameter \(\texttt {validation\_split}\), the training data will be partitioned into a (smaller) training data set, \(X^{(\text {train})}\), and a validation data set, \((X,Y)^{(\text {valtrain})}\). The trained DNN then produces predictions on the validation data, \((X,Y)^{(\text {val})}\). The DL HPT data workflow is shown in Fig. 10.4.

Fig. 10.4
figure 4

Overview. The DL HPT data workflow

Similar to the process described in Sect. 8.1 for ML, the hyperparameter tuning for DL can be started as follows:

figure s
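A hedged sketch of launching such a run is given below; the argument names of \(\texttt {startCensusRun}\) and the unit of the time budget are assumptions.

```r
# Hedged sketch: start the DL tuning run (argument names are assumptions).
library("SPOTMisc")

startCensusRun(
  modelList  = list("dl"),  # tune the deep-learning model
  runNr      = "100",       # label used in the result file name
  timebudget = 12 * 3600    # twelve hours, assumed to be given in seconds
)
```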
Table 10.6 SPOT parameters used for deep learning hyperparameter tuning. The \(\texttt {control}\) list contains internally further lists, see Table 10.7
Table 10.7 SPOT list parameters used for deep learning hyperparameter tuning

The \(\texttt {startCensusRun}\) function performs the following steps:

  1.

    Providing the CID data set, \((\mathcal {X}, \mathcal {Y})_{\text {CID}}\), see Sect. 8.2.1.

  2.

    Generating the random sample \((X,Y) \subseteq (\mathcal {X}, \mathcal {Y})_{\text {CID}}\) of size \(\texttt {nobs}\).

  3.

    Defining an experimental design, including performance measures.

  4.

    Configuration of the hyperparameter tuner, \(\mathcal {T}\).

  5.

    Configuration of the DL model, \(\mathcal {A}\).

  6.

    Performing the experiments.

Furthermore, it can be decided whether to use the default hyperparameter setting, \(\lambda _0\), as a starting point or not. Using the parameter specifications from Tables 10.6 and 10.7, we are ready to perform the HPT run: \(\texttt {spot}\) can be started.

6 Tunability

Regarding tunability as defined in Definition 2.26, we are facing a special situation in this chapter, because there is no generally accepted “default” hyperparameter configuration, \(\lambda _0\), for DNNs. This problem is not as obvious in ML, because the corresponding methods have a long history, i.e., there are publications for most of the shallow methods that can give hints on how to select adequate \(\lambda \) values. This information is collected and summarized in Chap. 3. The “default” hyperparameter setting of the DNNs analyzed in this chapter is based on our own experiences, combined with recommendations in the literature. Chollet and Allaire (2018) may be considered as a reference in this field.Footnote 7

The \(\texttt {result}\) list from the \(\texttt {spot}\) run can be loaded. It contains the 14 values shown in Table 4.6, e.g., names of the tuned hyperparameters that were introduced in Table 10.4:

figure t
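A hedged sketch of loading and inspecting such a result is shown below; the file name is hypothetical, whereas \(\texttt {xbest}\) and \(\texttt {ybest}\) are standard elements of a \(\texttt {spot}\) result list.

```r
# Hedged sketch: load a stored spot result and inspect the best point.
library("SPOTMisc")

load("data/dl00100Run.RData")   # hypothetical file; provides `result`
result$ybest                    # best observed validation loss
data.frame(parameter = getModelConf(model = "dl")$tunepars,
           xbest     = as.vector(result$xbest))
```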

The HPT inner optimization loop is shown in Fig. 10.5. The DNN uses the tuned hyperparameters, \(\lambda ^{\star }\) from Table 10.8. The model training supports the result found by the tuner \(\texttt {spot}\) that the number of training epochs should be 32. The reader may compare the inner optimization loop with default and with tuned hyperparameters in Figs. 10.3 and 10.5.

The tuned DNN model has the following structure:

figure u
Fig. 10.5
figure 5

Training DL (inner optimization loop) using the tuned hyperparameter setting \(\lambda ^{\star }\)

Fig. 10.6
figure 6

Progress plot. In contrast to the progress plots used for the ML methods, this plot shows the BCE loss and not the MMCE against the number of iterations (function evaluations of the tuner)

Table 10.8 DNN configurations. “lr” denotes “learning_rate”. The overall mean of the loss, \(\overline{\texttt {y}}\) is 0.3691, its standard deviation is 0.1152, whereas the mean of the best HPT configuration, \(\lambda ^{\star }\), found by Optimal Computing Budget Allocation (OCBA), is 0.3346 with s.d. 0.0343

6.1 Progress

After loading the results from the experiments, the hyperparameter tuning progress can be visually analyzed. First of all, the \(\texttt {result}\) list information will be used to visualize the route to the solution: in Fig. 10.6, loss function values, \(\psi ^{(\text {val})}\), are plotted against the number of iterations, t. Each point represents one evaluation of a DNN model \(\mathcal {A} _{\lambda }(t)\) at time step (\(\texttt {spot}\) iteration) t.

The initial design, which includes the default hyperparameter setting, \(\lambda _0\), results in a loss value of \(\psi ^{(\text {val})}_{\text {init}} = 0.3371\). The best value that was found during the tuning is \(y_{\text {val}}^{(*)} = 0.3285\). These values have to be taken with caution, because they represent only one evaluation of \(\mathcal {A} _{\lambda }\). Based on OCBA, which takes the noise in the model evaluation via the function \(\texttt {funKerasGeneric}\) into consideration, the best function value is \(y_{\text {val}}^{(\text {OCBA}^*)}= 0.3346\).

After 12 h, 914 \(\texttt {dl}\) models were evaluated. Comparing the worst configuration that was observed during the HPT with the best, an 81.773% reduction in the BCE loss was obtained. After the initial phase, which includes 44 evaluations, the smallest BCE reads 0.3370858. The dotted red line in Fig. 10.6 illustrates this result. The final best value reads 0.3285304, i.e., a reduction of the BCE of 2.5381%. These values, in combination with the results shown in the progress plot (Fig. 10.6), indicate that a relatively short HPT run is able to improve the quality of the DNN model. It also indicates that increased run times do not result in a significant improvement of the BCE. The full comparison of the DL and ML algorithm performances with default, \(\lambda _0\), and tuned, \(\lambda ^{\star }\), hyperparameters is shown in Sect. 10.9.

! Attention

These results do not replace a sound statistical comparison, they are only indicators, not final conclusions.

The corresponding code is presented in the Appendix. The related hyperparameters values are shown in Table 10.8.

There is a large variance in the loss as can be seen in Figs. 10.6 and 10.7. The latter of these two plots visualizes the same data as the former, but uses log-log axes instead.

Fig. 10.7
figure 7

Log-log plot

6.2 \(\texttt {evalParamCensus}\): Comparing Default and Tuned Parameters on Test Data

The function \(\texttt {evalParamCensus}\) evaluates ML and DL hyperparameter configurations on the CID data set. It compiles a data frame, which includes performance scores from several hyperparameter configurations and can also process results from default settings. This data frame can be used for a comparison of default and tuned hyperparameters, \(\lambda _0\) and \(\lambda ^{\star }\), respectively. A violin plot of this comparison is shown in Fig. 10.8. It is based on 30 evaluations of \(\lambda _0\) and \(\lambda ^{\star }\) and shows—in contrast to the values in the DNN progress plots—the Mean Mis-Classification Error (MMCE). The MMCE was chosen to enable a comparison of the DL results with the ML results shown in this book. Identical evaluations were done in Chaps. 8, 9, and 12. A global comparison of the six ML and the DL methods from this book will be shown in Sect. 10.9.

Fig. 10.8
figure 8

Comparison of DL algorithms with default (D) and tuned (T) hyperparameters. Mean misclassification error (MMCE) for both configurations. Vertical lines mark quantiles (0.25, 0.5, 0.75) of the corresponding distribution. Numerical values are shown in Table 10.8

7 Analysing the Deep Learning Tuning Process

The values that are used for the analysis in this section are biased, because they do not stem from an experimental design (space filling or factorial). Instead, they are taken from the \(\texttt {spot}\) tuning process, i.e., they are biased by the search strategy (Expected Improvement (EI)) on the surrogate \(\mathcal {S}\).

Identical to the analysis of the ML methods, a simple regression tree as shown in Fig. 10.9 can be used for analysing effects and interactions between hyperparameters \(\lambda \).

Fig. 10.9
figure 9

Regression tree. Deep learning model. Transformed hyperparameter values are shown

Table 10.9 Variable importance of the DL model hyperparameters
Fig. 10.10
figure 10

Best configurations in green

The regression tree supports the observation that units and epochs have the largest effect on the validation loss. The importance of the parameters from the random forest analysis is shown in Table 10.9.

To perform a sensitivity analysis, parallel and sensitivity plots can be used.

The parallel plot (Fig. 10.10) indicates that the hyperparameter \(\texttt {units}\) should be set to a value of 32 (the transformed values range from 1 to 32), the \(\texttt {epochs}\), i.e. \(x_6\), should be set to a value of 32 (the transformed values range from 8 to 128), the \(\texttt {layers}\), i.e. \(x_9\), should be set to a value of 1 (the transformed values range from 1 to 4), and the \(\texttt {optimizer}\), i.e. \(x_{11}\), should be set to a value of 4 (the transformed values range from 1 to 7).

Looking at Fig. 10.11, the following observations can be made: Similar to the results from the parallel plot (Fig. 10.10), the sensitivity plot shows that the \(\texttt {epochs}\), i.e. \(x_6\), and the \(\texttt {optimizer}\), i.e. \(x_{11}\), have the largest effect: the former leads to poor results for larger values, whereas the latter produces poor results for relatively small values. This indicates that the number of training epochs should not be too large (probably to prevent overfitting, see Fig. 10.5) and that the optimizers \(\texttt {adadelta}\) or \(\texttt {adam}\) are recommended (Fig. 10.12).

Fig. 10.11
figure 11

Sensitivity plot (best)

Fig. 10.12
figure 12

Surface plot: epochs \(x_6\) plotted against optimizer \(x_{11}\). This plot indicates that longer training (larger \(\texttt {epochs}\) values) worsens the performance and that the optimizer \(\texttt {adadelta}\) performs well. Note: Plateaus are caused by discrete and factor variables

Finally, a simple linear regression model can be fitted to the data. Based on the data from SPOT’s \(\texttt {res}\) list, this can be done as follows:

figure v
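A hedged sketch of this regression is given below; the hyperparameter names are taken from \(\texttt {getModelConf}\) as above (an assumed element name), and the data come from the \(\texttt {x}\) and \(\texttt {y}\) matrices of the result list.

```r
# Hedged sketch: regress the observed loss on the evaluated
# (transformed) hyperparameter values from the spot run.
df <- as.data.frame(result$x)
names(df) <- getModelConf(model = "dl")$tunepars
df$y <- as.vector(result$y)

fitLm <- lm(y ~ ., data = df)
summary(fitLm)
```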

Although this linear model requires a detailed investigation (a misspecification analysis is recommended, see, e.g., Spanos 1999), it can be used in combination with other Exploratory Data Analysis (EDA) tools and visualizations from this section to discover unexpected and/or interesting effects. It should not be used alone for a final decision. Despite a relatively low adjusted \(R^2\) value, the regression output shows—in correspondence with previous observations—that increasing the number of \(\texttt {epochs}\) worsens the model performance.

8 Severity: Validating the Results

Considering the results of the experimental runs, the difference is \(\bar{x} = \) 0.0054. Since this value is positive, for the moment, let us assume that the tuned solution is superior. The corresponding standard deviation is \(s_d = \) 0.0056. Based on Eq. 5.14, with \(\alpha = \) 0.05, \(\beta = \) 0.2, and \(\Delta = \) 0.006, the required number of runs for the full experiment can be determined using the \(\texttt {getSampleSize}\) function: for a relevant difference of 0.006, approximately 11 completed runs per algorithm are required. Hence, we can directly proceed to evaluate the severity and analyse the performance improvement achieved through tuning the parameters of the DL model.

Result summaries are presented in Table 10.10. The decision based on the p-value is to reject the null hypothesis, i.e., the claim that the tuned parameter setup provides a significant performance improvement in terms of MMCE is supported. The effect size suggests that the difference is of medium magnitude. For the chosen \(\Delta =0.006\), the severity value is at 0.29 and thus it does not support the decision of rejecting \(H_0\). The severity plot is shown in Fig. 10.13. Severity shows that only performance differences smaller than 0.0045 are well supported.

Table 10.10 Case Study III: Result Analysis
Fig. 10.13
figure 13

Tuning DL. Severity of rejecting H0 (red), power (blue), and error (gray). Left: the observed mean \(\bar{x} = \) 0.0054 is larger than the cut-off point \(c_{1-\alpha } = \) 0.0017. Right: The claim that the true difference is as large or larger than 0.006 is not supported by severity. But any difference smaller than 0.0045 is supported by severity

9 Summary and Discussion

A HPT approach based on SMBO was introduced and exemplified in this chapter. It uses functions from the packages keras, SPOT and SPOTMisc from the statistical programming environment R, hence providing a HPT environment that is fully accessible from R. Although HPT can be performed with R functions, an underlying Python environment has to be installed. This installation is explained in the Appendix.

The first three case studies in this book are concluded with a global comparison of the seven methods, i.e., six ML methods and one DL method. The main goal of these studies was to analyze whether a relatively short HPT run, which is performed on a notebook or desktop computer without High Performance Computing (HPC) hardware, can improve the performance. Or, stated differently:

Is it worth doing a short HPT run before doing a longer study?

To illustrate the performance gain (tunability), a final comparison of the seven methods will be presented. The number of repeats will be determined first:

An approximate formula for sample size determination will be used. The reader is referred to Sect. 5.6.5 and to Senn (2021) for details. A sample size of 30 experiments was chosen, i.e., altogether 210 runs were performed.

The list of results from the \(\texttt {spot}\) HPT run stores relevant information about the configuration and the experimental results.

Fig. 10.14
figure 14

Comparison of ML algorithms with default (D) and tuned (T) hyperparameters. Classification error (MMCE). Note: because there is no “default” hyperparameter setting for the deep learning models used in this study, we have chosen a setting based on our experience and recommendations from the literature, see the discussion in Sect. 10.6

Violin plots (Fig. 10.14) can be used. These observations are based on data collected from default and tuned parameter settings. Although the absolute best value was found by Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM) should be considered as well, because the performance is similar while the variance is much lower. This study briefly explained how HPT can be used as a datascope for the optimization of DNN hyperparameters. The results from this brief study only scratch the surface of the HPT set of tools. Especially for DL, SPOT allows recommendations for improvement; it provides tools for comparisons using different losses and measures on different data sets, e.g., \(\psi ^{(\text {train})}\), \(\psi ^{(\text {val})}\), and \(\psi ^{(\text {test})}\).

When discussing the hyperparameter tuning results, it is important to note that HPT does not search for the final, best solution only. For sure, the hyperparameter practitioner is interested in the best solution. But even from this greedy point of view, considering the route to the solution is also of great importance, because analyzing this route enables learning and can be much more efficient in the long run compared to a greedy strategy.

Example: Route to the solution

Consider a classification task that has to be performed several times in a different context with similar data. Instead of blindly (automatically) running the Hyperparameter Optimization (HPO) procedure individually for each classification task (which might also require a significant amount of time and resources, even when it is performed automatically), a few HPT procedures are performed. Insights gained from HPT might help to avoid ill-specified parameter ranges, too short run times, and further pitfalls.

In addition to an effective and efficient way to determine the optimal hyperparameters, SPOT provides means for understanding algorithms’ performance (we will use datascopes similar to microscopes in biology and telescopes in astronomy). Considering the research goals stated in Sect. 4.1, the HPT approach presented in this study provides many tools and solutions.

To conclude this chapter, in addition to the research goals (R-1) to (R-8) from Sect. 4.1, important goals that are specific for HPT in DNN are presented.

The selection of an adequate performance measure is relevant. Kedziora et al. (2020) claimed that “research strands into ML performance evaluation remain arguably disorganized, \([\ldots ]\). Typical ML benchmarks focus on minimizing both loss functions and processing times, which do not necessarily encapsulate the entirety of human requirement.” Furthermore, a sound test problem specification is necessary, i.e., train, validation, and test sets should be clearly specified. Importantly, the initialization (this is similar to the specification of starting points in optimization) procedures should be made transparent. Because DL methods require a large amount of computational resources, the usage of surrogate benchmarks should be considered (this is similar to the use of Computational Fluid Dynamics (CFD) simulations in optimization). Most of the ML and DL methods are noisy. Therefore, repeats should be considered. The power of the test, severity, and related tools which were introduced in Chap. 5 can give hints for choosing adequate values, i.e., how many runs are feasible or necessary. The determination of meaningful differences—with respect to the specification of the loss function or the accuracy—based on tools like severity are of great relevance for the practical application. Remember: scientific relevance is not identical to statistical significance. Furthermore, floor and ceiling effects should be avoided, i.e., the comparison should not be based on too hard (or too easy) problems. We strongly recommend a comparison to baseline (e.g., default settings or Random Search (RS)).

The model \(\mathcal {A}\) must be clearly specified, i.e., the initialization, pre-training (starting points in optimization) should be explained. The hyperparameter (ranges, types) should be clearly specified. If there are any additional (untunable) parameters, then they should be explained. How is reproducibility ensured (and by whom)? Last but not least: open source code and open data should be provided.

The final conclusion from the three case studies (Chaps. 8–10) can be formulated as follows:

HPT provides tools for comparing, analyzing, and selecting an adequate ML or DL method for unknown real-world problems. It requires only moderate computational resources (notebooks or desktop computers) and limited time. Practitioners can start HPT runs at the end of their work day and will find the results ready on their desk the next morning.

10 Program Code


figure w
figure x
figure y
figure z
figure aa
figure ab