Abstract
A surrogate model based Hyperparameter Tuning (HPT) approach for Deep Learning (DL) is presented. This chapter demonstrates how the architecture-level parameters (hyperparameters) of Deep Neural Networks (DNNs) that were implemented in keras/tensorflow can be optimized. The implementation of the tuning procedure is 100% accessible from R, the software environment for statistical computing. The chapter exemplifies how the software packages (keras, tensorflow, and SPOT) can be combined efficiently and effectively. The hyperparameters of a standard DNN are tuned. The performances of the six Machine Learning (ML) methods discussed in this book are compared to the results from the DNN. This study provides valuable insights into the tunability of several methods, which is of great importance for the practitioner.
1 Introduction
The DNN hyperparameter study described in this chapter uses the same data and the same HPT process as the ML studies in Chaps. 8 and 9. Section 10.2 describes the data preprocessing. Section 10.3 explains the experimental setup and the configuration of the DL models. The objective function is defined in Sect. 10.4. The hyperparameter tuner, \(\texttt {spot}\), is described in Sect. 10.5. Based on this setup, experimental results are analyzed: after discussing tunability based on the HPT progress in Sect. 10.6, default hyperparameters, \(\lambda _0\), and tuned hyperparameters, \(\lambda ^{\star }\), are compared in Sect. 10.6.2. The DL tuning process is analyzed in Sect. 10.7. Results are validated using severity in Sect. 10.8. A summary in Sect. 10.9 concludes this chapter. The DL hyperparameter tuning pipeline, which was used for the experiments, is summarized in Table 10.1 and illustrated in Fig. 10.1. The first sections in this chapter highlight the most important steps of this pipeline. The program code for performing the experiments is shown in Sect. 10.10.
keras is TensorFlow (TF)’s high-level Application Programming Interface (API) designed with a focus on enabling fast experimentation. TF is an open source software library for numerical computations with data flow graphs (Abadi et al. 2016). Mathematical operations are represented as nodes in the graph, and the graph edges represent the multidimensional arrays of data (tensors) (O’Malley et al. 2019). The full TF API can be accessed via the tensorflow package from within the R software environment for statistical computing and graphics (R).
The Appendix contains information on how to set up the required Python software environment for performing HPT with keras, SPOT, and SPOTMisc. Source code for performing the experiments will be included in the R package SPOTMisc. Further information is published on https://www.spotseven.de and, with some delay, on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/package=SPOT). This delay is caused by an intensive code check, which is performed by the CRAN team. It guarantees high-quality open source software and is an important feature for providing reliable software that is not just a flash in the pan.
2 Data Description
Identically to the ML case studies, the DL case study presented in this chapter uses the Census-Income (KDD) Data Set (CID), which is made available, for example, via the University of California, Irvine (UCI) Machine Learning Repository.Footnote 1,Footnote 2
2.1 \(\texttt {getDataCensus}\): Getting the Data from OpenML
Before training the DNN, the data is preprocessed by reshaping it into the shape the DNN can process. The function \(\texttt {getDataCensus}\) is used to get the Open Machine Learning (OpenML) data (from cache or from server). The same options as in the previous ML studies will be used, i.e., the parameter settings from Table 8.3 will be used.
2.2 \(\texttt {getGenericTrainValTestData}\): Split Data in Train, Validation, and Test Data
The data frame \(\texttt {dfCensus}\), \((X,Y) \subset (\mathcal {X}, \mathcal {Y})\), with 10 000 observations of 23 variables, is available. Based on \(\texttt {prop}\), the data is split into training, validation, and test data sets, \((X,Y)^{(\text {train})}\), \((X,Y)^{(\text {val})}\), and \((X,Y)^{(\text {test})}\), respectively. If \(\texttt {prop = 2/3}\), the training data set has 4 444 observations, the validation data set has 2 222 observations, and the test data set the remaining 3 334 observations.
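The reported sizes are consistent with applying \(\texttt {prop}\) twice: first to separate test data from the rest, then to separate training from validation data. The following Python sketch reproduces the arithmetic. The nested application and the rounding down are assumptions about \(\texttt {getGenericTrainValTestData}\), and the function name \(\texttt {split\_sizes}\) is hypothetical.

```python
from fractions import Fraction
from math import floor


def split_sizes(n_obs, prop):
    """Return (train, val, test) sizes for a nested split, rounding down.

    Assumption: prop is applied twice, first to (train + val) vs. test,
    then to train vs. val.
    """
    n_train_val = floor(n_obs * prop)   # first split: (train + val) vs. test
    n_test = n_obs - n_train_val
    n_train = floor(n_train_val * prop)  # second split: train vs. val
    n_val = n_train_val - n_train
    return n_train, n_val, n_test


# Fraction avoids floating point rounding issues for prop = 2/3.
sizes = split_sizes(10_000, Fraction(2, 3))
print(sizes)
```

With 10 000 observations and \(\texttt {prop = 2/3}\), this yields the three sizes stated above.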
2.3 \(\texttt {genericDataPrep}\): Spec
The third step of the data preprocessing generates a \(\texttt {specList}\).
The function \(\texttt {genericDataPrep}\) works as described in Sects. 10.2.3.1–10.2.3.5.
2.3.1 The Iterator: Data Frame to Data Set
The helper function \(\texttt {df\_to\_dataset}\) Footnote 3 converts the data frame \(\texttt {dfCensus}\) into a data set. This procedure enables processing of very large Comma Separated Values (CSV) files (so large that they do not fit into memory). The elements of the training data sets are randomly shuffled. Finally, consecutive elements of this data set are combined into batches.
Applying the function \(\texttt {df\_to\_dataset}\) generates a list of tensors. Each tensor represents a single column. The most significant difference to R ’s data frames is that a TF data set is an iterator.
Background: Iterators
Each time an iterator is called it will yield a different batch of rows from the data set. The iterator function \(\texttt {iter\_next}\) can be called as follows, so that batches are shown.
The data set \(\texttt {train\_ds\_generic}\) returns a list of column names (from the data frame) that map to column values from rows in the data frame.
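The iterator idea can be illustrated without TF. The following Python sketch is only an analogy, not the tfdatasets implementation: a generator shuffles the row indices once and then yields consecutive batches, each batch mapping column names to column values. The function name \(\texttt {batch\_iterator}\) and the toy data are hypothetical.

```python
import random


def batch_iterator(columns, batch_size, shuffle=True, seed=0):
    """Yield batches as dicts that map column names to lists of values."""
    n_rows = len(next(iter(columns.values())))
    order = list(range(n_rows))
    if shuffle:
        # Shuffle row indices once, so each row appears in exactly one batch.
        random.Random(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        rows = order[start:start + batch_size]
        yield {name: [values[i] for i in rows] for name, values in columns.items()}


# Toy data with made-up values; the column names are illustrative only.
toy = {"age": [25, 47, 33, 61, 52], "target": [0, 1, 0, 1, 1]}
batches = batch_iterator(toy, batch_size=2)
first_batch = next(batches)   # each call advances the iterator ...
second_batch = next(batches)  # ... and returns a different batch of rows
```

Because the data are produced batch by batch, the complete data set never has to be held in memory at once, which is the property that makes very large CSV files tractable.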
2.3.2 The feature_spec Object: Specifying the Target
TF has built-in methods to perform common input conversions.Footnote 4 The powerful \(\texttt {feature\_column}\) system will be accessed via the user-friendly, high-level interface called \(\texttt {feature\_spec}\). While working with structured data, e.g., CSV data, column transformations and representations can be initialized and specified. A practical benefit of implementing data preprocessing within model \(\mathcal {A}\) is that when \(\mathcal {A}\) is exported, the preprocessing is already included. In this case, new data can be passed directly to \(\mathcal {A}\).
! Attention: Keras Preprocessing Layers
keras and tensorflow are under constant development. The current implementation in SPOTMisc classifies structured data with feature columns. The corresponding TF module was designed for the use with TF version 1 estimators. It does fall under compatibility guarantees.Footnote 5 The newly developed keras module uses “preprocessing layers” for building keras-native input processing pipelines. Future versions of SPOTMisc will be based on preprocessing layers. However, because the underlying ideas of feature columns and preprocessing layers are similar (TF provides a migration guideFootnote 6), the most important preprocessing steps will be presented next.
First the \(\texttt {spec}\) object \(\texttt {specGeneric}\) is defined. The response variable, here: \(\texttt {target}\), can be specified using a formula, see Chambers and Hastie (1992) and the R function \(\texttt {formula}\).
2.3.3 Adding Steps to the feature_spec Object
The CID data set contains a variety of data types. These mixed data types are converted to a fixed-length vector for the DL model to process. Based on their feature type, their data type or level, the columns will be treated differently. After creating the \(\texttt {feature\_spec}\) object the step functions from Table 10.2 can be used to add further \(\texttt {steps}\). Depending on the data type, the step functions specify the data transformations. Table 10.3 shows these types.
The R package tfdatasets provides selectors to select certain variable types and ranges, e.g., \(\texttt {all\_numeric}\) to select all numeric variables, \(\texttt {all\_nominal}\) to select all characters, or \(\texttt {has\_type("float32")}\) to select variables based on their TF variable type. Based on the feature and data type shown in Table 10.3, the data transformations from Table 10.2 are applied. We will consider feature specs for continuous and categorical data separately.
2.3.4 Feature Spec: Continuous Data
For continuous data, i.e., numerical variables, the function \(\texttt {step\_numeric\_column}\) will be used and all numeric variables will be normalized (scaled). The R package tfdatasets provides the scaler function \(\texttt {scaler\_min\_max}\), which uses the minimum and maximum of the numeric variable, and the function \(\texttt {scaler\_standard}\), which uses the mean and the standard deviation.
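The two scaling rules can be sketched in a few lines of Python. This illustrates the underlying formulas only, not the tfdatasets code. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map values linearly to [0, 1] using the column minimum and maximum."""
    lo, hi = min(values), max(values)
    # Assumes a non-constant column, i.e., hi > lo.
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Center by the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Assumes a non-constant column, i.e., sigma > 0.
    return [(v - mu) / sigma for v in values]


column = [10.0, 20.0, 40.0]          # made-up numeric column
scaled = min_max_scale(column)       # smallest value -> 0, largest -> 1
standardized = standard_scale(column)  # mean 0, standard deviation 1
```

Min-max scaling bounds the inputs but is sensitive to outliers, whereas standardization is unbounded but less affected by a single extreme value.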
2.3.5 Feature Spec: Categorical Data
The DNN model \(\mathcal {A}\) cannot directly process categorical (nominal) data—they must be transformed so that they can be represented as numbers. The representation of categorical variables as a set of one-hot encoded columns is widely used in practice (Chollet and Allaire 2018). There are basically two options for specifying the kind of numeric representation used for categorical variables: indicator columns or embedding columns.
Background: Embedding
Suppose instead of having a factor with a few levels (e.g., three levels such as \(\texttt {red}\), \(\texttt {green}\), or \(\texttt {blue}\)), there are hundreds or even more levels. As the number of levels grows very large, it becomes infeasible to train a DNN using one-hot encodings. In this situation, embedding should be used: instead of representing the data as a very large one-hot vector, the data can be stored as a low-dimensional vector of real numbers. Note that the size of the embedding is a parameter that must be tuned (Abadi et al. 2015).
The implementation in SPOTMisc uses two steps: first, based on the number of \(\texttt {levels}\), i.e., the value of the parameter \(\texttt {minLevelSizeEmbedding}\) in the following code, the set of columns where embedding should be used, is determined. Then, either the function \(\texttt {step\_indicator\_column}\) or the function \(\texttt {step\_embedding\_column}\) is applied.
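The two-step logic can be sketched as follows in Python. This is not the SPOTMisc code: the function names are hypothetical, the column \(\texttt {zip\_code}\) is invented for illustration, and it is an assumption that columns with at least \(\texttt {minLevelSizeEmbedding}\) levels are the ones that receive an embedding.

```python
def one_hot(value, levels):
    """Represent a categorical value as an indicator (one-hot) vector."""
    return [1 if value == level else 0 for level in levels]


def choose_encoding(levels_by_column, min_level_size_embedding):
    """Pick 'embedding' for columns with many levels, else 'indicator'.

    Assumption: 'many' means at least min_level_size_embedding levels.
    """
    return {
        column: "embedding" if len(levels) >= min_level_size_embedding else "indicator"
        for column, levels in levels_by_column.items()
    }


levels_by_column = {
    "color": ["red", "green", "blue"],
    "zip_code": [f"z{i}" for i in range(500)],  # hypothetical high-cardinality factor
}
encoding = choose_encoding(levels_by_column, min_level_size_embedding=100)
green = one_hot("green", levels_by_column["color"])
```

A one-hot vector for the 500-level column would have 500 entries per observation, which shows why a low-dimensional embedding is preferable there.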
After adding a \(\texttt {step}\) we need to \(\texttt {fit}\) the \(\texttt {specGeneric}\) object:
Finally, the following data structures are available:
1. \(\texttt {train\_ds\_generic}\) (batched, based on 4444 samples),
2. \(\texttt {val\_ds\_generic}\) (batched, based on 2222 samples),
3. \(\texttt {specGeneric\_prep}\), and
4. \(\texttt {testGeneric}\) (the remaining 3334 samples).
These data are returned as the list \(\texttt {specList}\) from the function \(\texttt {genericDataPrep}\).
Dense features prepared with TF’s feature columns mechanism can be listed. There are 22 dense features that will be passed to the DNN.
3 Experimental Setup and Configuration of the Deep Learning Models
3.1 \(\texttt {getKerasConf}\): keras and Tensorflow Configuration
Setting up the keras configuration from within SPOTMisc is a simple step: the function \(\texttt {getKerasConf}\) is called. The function \(\texttt {getKerasConf}\) passes additional parameters to the \(\texttt {keras}\) function, e.g.,
- \(\texttt {activation}\): Activation function in the last Neural Network (NN) layer. Default: \(\texttt {"sigmoid"}\).
- \(\texttt {active}\): Vector of active variables, e.g., \(\texttt {c(1,10)}\) specifies that only the first and tenth variable will be considered by \(\texttt {spot}\). This mechanism allows shrinking the full set of tunable parameters, say \(\lambda \), to a smaller set, \(\lambda ^{(-)}\), if the user wants to investigate the tunability (or the effect) of one or only a few hyperparameters.
- \(\texttt {callbacks}\): List of callbacks to be called during training. Default: \(\texttt {list()}\).
- \(\texttt {clearSession}\): Whether to call \(\texttt {k\_clear\_session}\) or not at the end of keras modeling. Default: \(\texttt {FALSE}\).
- \(\texttt {encoding}\): Encoding used during data preparation. Default: \(\texttt {"oneHot"}\).
- \(\texttt {loss}\): Loss function, \(\mathcal {L}\), for the \(\texttt {compile}\) function from the package keras, for example Binary Cross Entropy (BCE) loss as defined in Eq. (2.3). Default: \(\texttt {"loss\_binary\_crossentropy"}\).
- \(\texttt {metrics}\): Metrics function for \(\texttt {compile}\). Default: \(\texttt {"binary\_accuracy"}\).
- \(\texttt {model}\): Model, \(\mathcal {A}\), as specified via \(\texttt {getModelConf}\). Default: \(\texttt {"dl"}\). Forthcoming versions of SPOTMisc will provide additional DNN model types, e.g., Convolutional Neural Networks (CNNs).
- \(\texttt {nClasses}\): Number of classes in (multi-class) classification. Specifies the number of units in the last layer (before \(\texttt {softmax}\)). Default: \(\texttt {1}\) (binary classification).
- \(\texttt {resDummy}\): If \(\texttt {TRUE}\), generate dummy (mock up) result for testing. If \(\texttt {FALSE}\), run keras and tensorflow evaluations. Default: \(\texttt {FALSE}\).
- \(\texttt {returnValue}\): Return value. Can be one of \(\texttt {"trainingLoss"}\), \(\texttt {"negTrainingAccuracy"}\), \(\texttt {"validationLoss"}\), \(\texttt {"negValidationAccuracy"}\), \(\texttt {"testLoss"}\), or \(\texttt {"negTestAccuracy"}\).
- \(\texttt {returnObject}\): Return object. Can be one of \(\texttt {"evaluation"}\), \(\texttt {"model"}\), \(\texttt {"pred"}\). Default: \(\texttt {"evaluation"}\).
- \(\texttt {shuffle}\): Logical (whether to shuffle the training data \((X,Y)^{(\text {train})}\) before each epoch) or string (for “batch”). Used in the function \(\texttt {df\_to\_dataset}\). “batch” is a special option for dealing with the limitations of the Hierarchical Data Format (HDF) version 5 data. It shuffles in batch-sized chunks. Default: \(\texttt {FALSE}\).
- \(\texttt {testData}\): Test data, \((X,Y)^{(\text {test})}\), on which to evaluate the loss, \(\mathcal {L}\), and any model metrics, \(\psi ^{(\text {test})}\), at the end of the optimization using the function \(\texttt {evaluate}\).
- \(\texttt {tfDevice}\): Tensorflow device. CPU/GPU allocation. Passed to \(\texttt {tensorflow}\) via \(\texttt {tf\$device(kerasConf\$tfDevice)}\). Default: \(\texttt {"/cpu:0"}\) (use CPU only).
- \(\texttt {trainData}\): Training data, \((X,Y)^{(\text {train})}\), on which to evaluate the loss and any model metrics at the end of each epoch.
- \(\texttt {validationData}\): Validation data, \((X,Y)^{(\text {val})}\), on which to evaluate the loss \(\psi ^{(\text {val})}\) and any model metrics at the end of each epoch.
- \(\texttt {validation\_split}\): Float between 0 and 1. Fraction of the training data \((X,Y)^{(\text {train})}\) to be used internally by \(\mathcal {A}\) as validation data \((X,Y)^{(\text {valtrain})}\). \(\mathcal {A}\) will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on \((X,Y)^{(\text {valtrain})}\) at the end of each epoch. \((X,Y)^{(\text {valtrain})}\) is selected from the last samples in the \((X,Y)^{(\text {train})}\) data provided, before shuffling. Default: \(\texttt {0.2}\).
- \(\texttt {verbose}\): Verbosity mode (0 = silent, 1 = progress bar, 2 = one line per epoch). Default: \(\texttt {0}\).
The default settings are useful for the binary classification task analyzed in this chapter. Only the parameter \(\texttt {kerasConf\$clearSession}\) is set to \(\texttt {TRUE}\) and \(\texttt {kerasConf\$verbose}\) is set to \(\texttt {0}\).
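For readers who want to see what the default loss measures, the following Python sketch computes the BCE in its standard textbook form, which is assumed here to agree with Eq. (2.3). It is not the keras implementation. The function name and the clipping constant \(\texttt {eps}\) are illustrative choices.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 to keep the logarithm finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])    # equals log(2)
confident = binary_cross_entropy([1, 0], [0.95, 0.05])   # much smaller loss
```

A classifier that always predicts 0.5 has a BCE of \(\log 2 \approx 0.693\), which is a useful reference point when reading the loss values reported later in this chapter.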
3.2 \(\texttt {getModelConf}\): DL Hyperparameters
If the default values from the function \(\texttt {getKerasConf}\) are used, the vector of hyperparameters \(\lambda \) contains the following elements: the dropout rates (dropout rates of the layers will be tuned individually), the number of units (the number of single outputs from a single layer), the learning rate (controls how much to change the DNN model in response to the estimated error each time the model weights are updated), the number of training epochs (a training epoch is one forward and backward pass of a complete data set), the optimizer for the inner loop, \(\mathcal {O}_{\text {inner}}\), and its parameters (i.e., \(\beta _1\), \(\beta _2\) as well as \(\epsilon \)), and the number of layers. These hyperparameters and their ranges are listed in Table 10.4.
To enable compatibility with the ranges of the learning rates of the other optimizers, the learning rate of the optimizer \(\texttt {adadelta}\) is internally mapped to \(\texttt {1-learning\_rate}\). That is, a learning rate of 0 will be mapped to 1 (which is \(\texttt {adadelta}\) ’s default learning rate). The learning rate of \(\texttt {adagrad}\) and \(\texttt {sgd}\) is internally mapped to \(\texttt {10 * learning\_rate}\). That is, a learning rate of 0.001 will be mapped to 0.01 (which is \(\texttt {adagrad}\) ’s and \(\texttt {sgd}\) ’s default). The learning rate of \(\texttt {adamax}\) and \(\texttt {nadam}\) is internally mapped to \(\texttt {2 * learning\_rate}\). That is, a learning rate of 0.001 will be mapped to 0.002 (which is \(\texttt {adamax}\) ’s and \(\texttt {nadam}\) ’s default).
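The mapping rules can be summarized in a small Python sketch. The function name \(\texttt {map\_learning\_rate}\) is hypothetical and does not refer to a SPOTMisc function. That all remaining optimizers use the learning rate unchanged is an assumption.

```python
def map_learning_rate(optimizer, learning_rate):
    """Map the tuner's common learning rate to an optimizer-specific value."""
    if optimizer == "adadelta":
        return 1 - learning_rate
    if optimizer in ("adagrad", "sgd"):
        return 10 * learning_rate
    if optimizer in ("adamax", "nadam"):
        return 2 * learning_rate
    # Assumption: all other optimizers use the value as is.
    return learning_rate


for name in ("adadelta", "adagrad", "sgd", "adamax", "nadam", "adam"):
    print(name, map_learning_rate(name, 0.001))
```

The benefit of this design is that the tuner can search a single learning-rate interval, while each optimizer still receives a value on the scale of its own default.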
The hyperparameter \(x_{11}\), which encodes the \(\texttt {optimizer}\), is implemented as a factor. Factor levels, which represent the available optimizers, are listed in Table 10.5.
A discussion of the DNN hyperparameters, \(\lambda \), recommendations for their settings and further information are presented in Sect. 3.8. The R function \(\texttt {getModelConf}\) provides information about hyperparameter names, ranges, and types.
3.3 The Neural Network
Background: Network Implementation in SPOTMisc
The SPOTMisc function \(\texttt {getModelConf}\) selects a pre-specified, but not pre-trained, DL network \(\mathcal {A}\). This network is called via \(\texttt {funKerasGeneric}\), which is the interface to \(\texttt {spot}\). \(\texttt {funKerasGeneric}\) uses a network, that is implemented as follows:
To build the DNN in keras, the function \(\texttt {layer\_dense\_features}\) that processes the feature columns specification is used (Fig. 10.2). It receives the data set \(\texttt {specGeneric\_prep}\) as input and returns an array of all dense features:
The iterator can be called to take a look at the (scaled) output:
The NN model can be compiled after three components have been specified: (i) the \(\texttt {loss}\) function \(\mathcal {L}\), which determines how good the DNN prediction is (based on \((X,Y)^{(\text {val})}\)); (ii) the \(\texttt {optimizer}\), \(\mathcal {O}_{\text {inner}}\), i.e., the update mechanism of \(\mathcal {A}\), which adjusts the weights using backpropagation; and (iii) the \(\texttt {metrics}\), which monitor the progress during training and testing. All three are specified using the \(\texttt {compile}\) function from keras.
! Attention: Hyperparameter Values
To improve the readability of the code, evaluated (“forced”) values of the hyperparameters \(\lambda \) are shown in the code snippets below instead of the arguments that are passed from the tuner \(\texttt {spot}\) to the function \(\texttt {funKerasGeneric}\).
The DNN training can be started as follows (using keras ’ \(\texttt {fit}\) function). The model is trained on the CPU using the setting \(\texttt {tf\$device("/cpu:0")}\), with the validation data set passed for evaluation:
The predictions from the DNN model are shown in the following code snippet. The tensor values are the output from the final DNN layer after the sigmoid function was applied. Values are from the interval [0, 1] and represent probabilities: values smaller than 0.5 are interpreted as predictions “\(\texttt {age}\) \(< 40\)”, otherwise “\(\texttt {age}\) \(\ge 40\)”.
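The decision rule described above can be written as a short Python sketch. The probabilities below are made up for illustration and are not output of the DNN. The function names are hypothetical.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid output into one of the two class labels."""
    return "age >= 40" if probability >= threshold else "age < 40"


def accuracy(probabilities, true_labels, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the label."""
    predicted = [to_label(p, threshold) for p in probabilities]
    hits = sum(p == t for p, t in zip(predicted, true_labels))
    return hits / len(true_labels)


probs = [0.12, 0.50, 0.93, 0.49]          # made-up sigmoid outputs
labels = [to_label(p) for p in probs]
acc = accuracy(probs, ["age < 40", "age < 40", "age >= 40", "age < 40"])
```

Note that a value of exactly 0.5 is assigned to “\(\texttt {age}\) \(\ge 40\)”, following the rule stated in the text.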
Figure 10.3 shows the quantities that are being displayed during training:
(i) the loss of the network over the training and validation data, \(\psi ^{(\text {train})}\) and \(\psi ^{(\text {val})}\), respectively, and
(ii) the accuracy of the network over the training and validation data, \(f_{\text {acc}}^{(\text {train})}\) and \(f_{\text {acc}}^{(\text {val})}\), respectively.
This figure illustrates that an accuracy greater than 80% on the training data, \((X,Y)^{(\text {train})}\), can be reached quickly.
Figure 10.3 can indicate (even if this is only a short fit procedure) whether the modeling is affected by overfitting or not. If this situation occurs, it might be useful to implement dropout layers or use other methods to prevent overfitting.
The effects of HPT and the tunability of \(\mathcal {A}\) will be described in the following sections. Finally, using keras ’ \(\texttt {evaluate}\) function, the DNN model performance can be checked on \(X^{(\text {test})}\).
The relationship between \(\psi ^{(\text {train})}\), \(\psi ^{(\text {val})}\), and \(\psi ^{(\text {test})}\) as well as between \(f_{\text {acc}}^{(\text {train})}\), \(f_{\text {acc}}^{(\text {val})}\), and \(f_{\text {acc}}^{(\text {test})}\) can be analyzed with Sequential Parameter Optimization Toolbox (SPOT), because it computes and reports these values.
4 \(\texttt {funKerasGeneric}\): The Objective Function
The hyperparameter tuner, e.g., \(\texttt {spot}\), performs model selection during the tuning run: training data \(X^{(\text {train})}\) is used for fitting (training) the models, e.g., the weights of the DNNs. Each trained model \(\mathcal {A} _{\lambda _i}\left( X^{(\text {train})}\right) \) will be evaluated on the validation data \(X^{(\text {val})}\), i.e., the loss is calculated as shown in Eq. (2.9). Based on \((\lambda _i, \psi ^{(\text {val})}_i )\), at each iteration of the outer optimization loop a surrogate model \(\mathcal {S}(t)\) is fitted, e.g., a Bayesian Optimization (BO) (Kriging) model using \(\texttt {spot}\) ’s \(\texttt {buildKriging}\) function.
For each hyperparameter configuration \(\lambda _i\), the objective function \(\texttt {funKerasGeneric}\) reports the following information about the related DNN model \(\mathcal {A} _{\lambda _i}\):
1. training loss, \(\psi ^{(\text {train})}\),
2. training accuracy, \(f_{\text {acc}}^{(\text {train})}\),
3. validation (testing) loss, \(\psi ^{(\text {val})}\), and
4. validation (testing) accuracy, \(f_{\text {acc}}^{(\text {val})}\).
5 \(\texttt {spot}\): Experimental Setup for the Hyperparameter Tuner
The SPOT package for R, which was introduced in Sect. 4.5, will be used for the DL hyperparameter tuning (Bartz-Beielstein et al. 2021). The budget is set to twelve hours, i.e., the run time of DL tuning is larger than the run time of the ML tuning. The budget for the \(\texttt {spot}\) runs was set to this value, because of the complexity of the hyperparameter search space \(\Lambda \) and the relatively long run time of the DNN.
SPOT provides several options for adjusting the HPT parameters, e.g., type of the Surrogate Model Based Optimization (SMBO) model, \(\mathcal {S}\), and optimizer, \(\mathcal {O}\), as well as the size of the initial design, \(n_{\text {init}}\). These parameters can be passed via the \(\texttt {spotControl}\) function to \(\texttt {spot}\). For example, instead of the default surrogate \(\mathcal {S}\), which is BO (implemented as \(\texttt {buildKriging}\)), a Random Forest (RF), (implemented as \(\texttt {buildRanger}\)) can be chosen.
The general DL HPT data workflow is as follows: first, the training data, \((X,Y)^{(\text {train})}\), are fed to the DNN. The DNN will then learn to associate features and labels. Based on the keras parameter \(\texttt {validation\_split}\), the training data will be partitioned into a (smaller) training data set, \(X^{(\text {train})}\), and a validation data set, \((X,Y)^{(\text {valtrain})}\). The trained DNN then produces predictions on the validation data, \((X,Y)^{(\text {val})}\). The DL HPT data workflow is shown in Fig. 10.4.
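The effect of \(\texttt {validation\_split}\) can be mimicked in plain Python. The sketch follows the behavior described in Sect. 10.3.1, i.e., the last fraction of the samples is held out before any shuffling takes place. The function name is hypothetical, and the rounding rule is an assumption rather than a statement about the keras source code.

```python
def split_for_validation(samples, validation_split=0.2):
    """Hold out the last fraction of the samples, before any shuffling."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


samples = list(range(10))  # stand-in for ten training observations
train_part, valtrain_part = split_for_validation(samples, 0.2)
print(train_part, valtrain_part)
```

Because the held-out part is always taken from the end, data that are sorted (for example by the target) should be shuffled before they are passed to the model.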
Similar to the process described in Sect. 8.1 for ML, the hyperparameter tuning for DL can be started as follows:
The \(\texttt {startCensusRun}\) function performs the following steps:
1. Providing the CID data set, \((\mathcal {X}, \mathcal {Y})_{\text {CID}}\), see Sect. 8.2.1.
2. Generating the random sample \((X,Y) \subseteq (\mathcal {X}, \mathcal {Y})_{\text {CID}}\) of size \(\texttt {nobs}\).
3. Defining an experimental design, including performance measures.
4. Configuration of the hyperparameter tuner, \(\mathcal {T}\).
5. Configuration of the DL model, \(\mathcal {A}\).
6. Performing the experiments.
Furthermore, it can be decided whether to use the default hyperparameter setting, \(\lambda _0\), as a starting point or not. Using the parameter specifications from Tables 10.6 and 10.7, we are ready to perform the HPT run: \(\texttt {spot}\) can be started.
6 Tunability
Regarding tunability as defined in Definition 2.26, we are facing a special situation in this chapter, because there is no generally accepted “default” hyperparameter configuration, \(\lambda _0\), for DNNs. This problem is not as obvious in ML, because the corresponding methods have a long history, i.e., there are publications for most of the shallow methods that can give hints on how to select adequate \(\lambda \) values. This information is collected and summarized in Chap. 3. The “default” hyperparameter setting of the DNNs analyzed in this chapter is based on our own experiences, combined with recommendations in the literature. Chollet and Allaire (2018) may be considered as a reference in this field.Footnote 7
The \(\texttt {result}\) list from the \(\texttt {spot}\) run can be loaded. It contains the 14 values shown in Table 4.6, e.g., names of the tuned hyperparameters that were introduced in Table 10.4:
The HPT inner optimization loop is shown in Fig. 10.5. The DNN uses the tuned hyperparameters, \(\lambda ^{\star }\) from Table 10.8. The model training supports the result found by the tuner \(\texttt {spot}\) that the number of training epochs should be 32. The reader may compare the inner optimization loop with default and with tuned hyperparameters in Figs. 10.3 and 10.5.
The tuned DNN model has the following structure:
6.1 Progress
After loading the results from the experiments, the hyperparameter tuning progress can be visually analyzed. First of all, the \(\texttt {result}\) list information will be used to visualize the route to the solution: in Fig. 10.6, loss function values, \(\psi ^{(\text {val})}\), are plotted against the number of iterations, t. Each point represents one evaluation of a DNN model \(\mathcal {A} _{\lambda }(t)\) at time step (\(\texttt {spot}\) iteration) t.
The initial design, which includes the default hyperparameter setting, \(\lambda _0\), results in a loss value of \(\psi ^{(\text {val})}_{\text {init}} = 0.3371\). The best value that was found during the tuning is \(y_{\text {val}}^{(*)} = 0.3285\). These values have to be taken with caution, because they represent only one evaluation of \(\mathcal {A} _{\lambda }\). Based on OCBA, which takes the noise in the model evaluation via the function \(\texttt {funKerasGeneric}\) into consideration, the best function value is \(y_{\text {val}}^{(\text {OCBA}^*)}= 0.3346\).
After 12 h, 914 \(\texttt {dl}\) models were evaluated. Comparing the worst configuration that was observed during the HPT with the best, an 81.773% reduction in the BCE loss was obtained. After the initial phase, which includes 44 evaluations, the smallest BCE reads 0.3370858. The dotted red line in Fig. 10.6 illustrates this result. The final best value reads 0.3285304, i.e., a reduction of the BCE of 2.5381%. These values, in combination with the results shown in the progress plot (Fig. 10.6), indicate that a relatively short HPT run is able to improve the quality of the DNN model. They also indicate that increased run times do not result in a significant improvement of the BCE. The full comparison of the DL and ML algorithm performances with default, \(\lambda _0\), and tuned, \(\lambda ^{\star }\), hyperparameters is shown in Sect. 10.9.
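The relative reduction quoted above follows directly from the two reported BCE values. The following Python sketch recomputes it. The function name is illustrative, and small deviations in the last digit can arise because the inputs are the rounded values printed in the text.

```python
def relative_reduction_percent(before, after):
    """Relative reduction of a loss value, in percent of the initial value."""
    return 100.0 * (before - after) / before


best_after_initial_design = 0.3370858  # smallest BCE after 44 evaluations
best_final = 0.3285304                 # smallest BCE at the end of the run
reduction = relative_reduction_percent(best_after_initial_design, best_final)
print(f"{reduction:.2f}%")
```

The result is approximately 2.54%, in line with the value reported in the text.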
! Attention
These results do not replace a sound statistical comparison, they are only indicators, not final conclusions.
The corresponding code is presented in the Appendix. The related hyperparameters values are shown in Table 10.8.
There is a large variance in the loss as can be seen in Figs. 10.6 and 10.7. The latter of these two plots visualizes the same data as the former, but uses log-log axes instead.
6.2 \(\texttt {evalParamCensus}\): Comparing Default and Tuned Parameters on Test Data
The function \(\texttt {evalParamCensus}\) evaluates ML and DL hyperparameter configurations on the CID data set. It compiles a data frame, which includes performance scores from several hyperparameter configurations and can also process results from default settings. This data frame can be used for a comparison of default and tuned hyperparameters, \(\lambda _0\) and \(\lambda ^{\star }\), respectively. A violin plot of this comparison is shown in Fig. 10.8. It is based on 30 evaluations of \(\lambda _0\) and \(\lambda ^{\star }\) and shows—in contrast to the values in the DNN progress plots—the Mean Mis-Classification Error (MMCE). The MMCE was chosen to enable a comparison of the DL results with the ML results shown in this book. Identical evaluations were done in Chaps. 8, 9, and 12. A global comparison of the six ML and the DL methods from this book will be shown in Sect. 10.9.
7 Analysing the Deep Learning Tuning Process
The values that are used for the analysis in this section are biased because they are not using an experimental design (space filling or factorial). Instead, they are using the data from the \(\texttt {spot}\) tuning process, i.e., they are biased by the search strategy (Expected Improvement (EI)) on the surrogate \(\mathcal {S}\).
Identical to the analysis of the ML methods, a simple regression tree as shown in Fig. 10.9 can be used for analysing effects and interactions between hyperparameters \(\lambda \).
The regression tree supports the observation that units and epochs have the largest effect on the validation loss. The importance of the parameters from the random forest analysis is shown in Table 10.9.
To perform a sensitivity analysis, parallel and sensitivity plots can be used.
The parallel plot (Fig. 10.10) indicates that the hyperparameter \(\texttt {units}\) should be set to a value of 32 (the transformed values range from 1 to 32), the \(\texttt {epochs}\), i.e. \(x_6\), should be set to a value of 32 (the transformed values range from 8 to 128), the \(\texttt {layers}\), i.e. \(x_9\), should be set to a value of 1 (the transformed values range from 1 to 4), and the \(\texttt {optimizer}\), i.e. \(x_{11}\), should be set to a value of 4 (the transformed values range from 1 to 7).
Looking at Fig. 10.11, the following observations can be made: Similar to the results from the parallel plot (Fig. 10.10), the sensitivity plot shows that the \(\texttt {epochs}\), i.e. \(x_6\), and the \(\texttt {optimizer}\), i.e. \(x_{11}\), have the largest effect: the former leads to poor results for larger values, whereas the latter produces poor results for relatively small values. This indicates that the number of training epochs should not be too large (probably to prevent overfitting, see Fig. 10.5) and that the optimizers \(\texttt {adadelta}\) or \(\texttt {adam}\) are recommended (Fig. 10.12).
Finally, a simple linear regression model can be fitted to the data. Based on the data from SPOT’s \(\texttt {res}\) list, this can be done as follows:
Although this linear model requires a detailed investigation (a misspecification analysis is recommended, see, e.g., Spanos 1999), it can be used in combination with other Exploratory Data Analysis (EDA) tools and visualizations from this section to discover unexpected and/or interesting effects. It should not be used alone for a final decision. Despite a relatively low adjusted \(R^2\) value, the regression output shows, in correspondence with previous observations, that increasing the number of \(\texttt {epochs}\) worsens the model performance.
8 Severity: Validating the Results
Considering the results of the experimental runs, the difference is \(\bar{x} = \) 0.0054. Since this value is positive, for the moment, let us assume that the tuned solution is superior. The corresponding standard deviation is \(s_d = \) 0.0056. The following calculations are based on Eq. 5.14 with \(\alpha = \) 0.05, \(\beta = \) 0.2, and \(\Delta = \) 0.006.
Next, we will identify the required number of runs for the full experiment using the \(\texttt {getSampleSize}\) function. For a relevant difference of 0.006, approximately 11 completed runs per algorithm are required. Hence, we can directly proceed to evaluate the severity and analyse the performance improvement achieved through tuning the parameters of the DL model.
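The figure of approximately 11 runs can be reproduced with the standard approximate sample size formula for a one-sided comparison of two means, \(n \approx 2 s_d^2 (z_{1-\alpha} + z_{1-\beta})^2 / \Delta^2\). The sketch below assumes that this is the formula behind Eq. 5.14. The function name `approx_sample_size` is hypothetical and is not the implementation of \(\texttt {getSampleSize}\).

```r
# Approximate per-algorithm sample size for detecting a difference delta with
# a one-sided test (normal approximation). Hypothetical helper, not the
# SPOTMisc implementation of getSampleSize.
approx_sample_size <- function(sd, delta, alpha = 0.05, beta = 0.2) {
  2 * sd^2 * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / delta^2
}

# Values taken from the text: s_d = 0.0056, Delta = 0.006.
n <- approx_sample_size(sd = 0.0056, delta = 0.006)
print(n)
print(ceiling(n))  # 11
```

Because the required number grows with \(1/\Delta^2\), halving the relevant difference roughly quadruples the number of runs.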
Result summaries are presented in Table 10.10. The decision based on the p-value is to reject the null hypothesis, i.e., the claim that the tuned parameter setup provides a significant performance improvement in terms of MMCE is supported. The effect size suggests that the difference is of medium magnitude. For the chosen \(\Delta =0.006\), however, the severity value is 0.29 and thus does not support the decision to reject \(H_0\). The severity plot is shown in Fig. 10.13. It shows that only performance differences smaller than 0.0045 are well supported.
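How severity depends on the chosen \(\Delta \) can be sketched with a normal approximation: after rejecting \(H_0\), the severity of the claim that the true improvement exceeds \(\Delta \) is \(\Phi ((\bar{x} - \Delta )/(s_d/\sqrt{n}))\). The number of runs, \(n = 30\), is an assumption made here for illustration, and \(\bar{x}\) and \(s_d\) are the rounded values from the text, so the output need not match Table 10.10 exactly. The function name `severity_reject` is hypothetical.

```r
# Severity of the claim "true improvement > delta" after rejecting H0,
# using a normal approximation.
# Assumptions (not stated in this section): n = 30 runs; xbar and s_d are
# the rounded values quoted in the text.
severity_reject <- function(delta, xbar, sd, n) {
  pnorm((xbar - delta) / (sd / sqrt(n)))
}

xbar <- 0.0054
s_d  <- 0.0056
n    <- 30

deltas <- c(0.003, 0.0045, 0.006)
sev <- severity_reject(deltas, xbar, s_d, n)
print(data.frame(delta = deltas, severity = round(sev, 2)))
```

Severity decreases as \(\Delta \) grows: a claimed improvement larger than the observed mean difference always receives a severity below 0.5, which is why \(\Delta = 0.006\) is not well supported although the test rejects \(H_0\).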
9 Summary and Discussion
An HPT approach based on SMBO was introduced and exemplified in this chapter. It uses functions from the packages keras, SPOT and SPOTMisc from the statistical programming environment R, hence providing an HPT environment that is fully accessible from R. Although HPT can be performed with R functions, an underlying Python environment has to be installed. This installation is explained in the Appendix.
The first three case studies in this book are concluded with a global comparison of the seven methods, i.e., six ML methods and one DL method. The main goal of these studies was to analyze whether a relatively short HPT run, which is performed on a notebook or desktop computer without High Performance Computing (HPC) hardware, can improve the performance. Or, stated differently:
Is it worth doing a short HPT run before doing a longer study?
To illustrate the performance gain (tunability), a final comparison of the seven methods will be presented. The number of repeats will be determined first:
An approximate formula for sample size determination will be used. The reader is referred to Sect. 5.6.5 and to Senn (2021) for details. A sample size of 30 experiments was chosen, i.e., altogether 210 runs were performed.
The list of results from the \(\texttt {spot}\) HPT run stores relevant information about the configuration and the experimental results.
Violin plots (Fig. 10.14) can be used to compare the methods. The following observations are based on data collected from default and tuned parameter settings. Although the absolute best value was found by Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM) should be considered as well, because its performance is similar while its variance is much lower.
This study briefly explained how HPT can be used as a datascope for the optimization of DNN hyperparameters. The results from this brief study only scratch the surface of the HPT set of tools. Especially for DL, SPOT allows recommendations for improvement and provides tools for comparisons using different losses and measures on different data sets, e.g., \(\psi ^{(\text {train})}\), \(\psi ^{(\text {val})}\), and \(\psi ^{(\text {test})}\).
When discussing hyperparameter tuning results, it should be kept in mind that HPT does not only search for the final, best solution. Certainly, the practitioner is interested in the best solution. But even from this greedy point of view, considering the route to the solution is also of great importance, because analyzing this route enables learning and can be much more efficient in the long run than a greedy strategy.
Example: Route to the solution
Consider a classification task that has to be performed several times in different contexts with similar data. Instead of blindly (automatically) running the Hyperparameter Optimization (HPO) procedure individually for each classification task (which might also require a significant amount of time and resources, even when it is performed automatically), a few HPT procedures are performed. Insights gained from HPT might help to avoid ill-specified parameter ranges, too short run times, and further pitfalls.
In addition to an effective and efficient way to determine the optimal hyperparameters, SPOT provides means for understanding algorithms’ performance (we will use datascopes similar to microscopes in biology and telescopes in astronomy). Considering the research goals stated in Sect. 4.1, the HPT approach presented in this study provides many tools and solutions.
To conclude this chapter, in addition to the research goals (R-1) to (R-8) from Sect. 4.1, important goals that are specific to HPT for DNNs are presented.
The selection of an adequate performance measure is relevant. Kedziora et al. (2020) claimed that “research strands into ML performance evaluation remain arguably disorganized, \([\ldots ]\). Typical ML benchmarks focus on minimizing both loss functions and processing times, which do not necessarily encapsulate the entirety of human requirement.” Furthermore, a sound test problem specification is necessary, i.e., train, validation, and test sets should be clearly specified. Importantly, the initialization procedures (this is similar to the specification of starting points in optimization) should be made transparent. Because DL methods require a large amount of computational resources, the usage of surrogate benchmarks should be considered (this is similar to the use of Computational Fluid Dynamics (CFD) simulations in optimization). Most of the ML and DL methods are noisy. Therefore, repeats should be considered. The power of the test, severity, and related tools which were introduced in Chap. 5 can give hints for choosing adequate values, i.e., how many runs are feasible or necessary. The determination of meaningful differences (with respect to the specification of the loss function or the accuracy) based on tools like severity is of great relevance for the practical application. Remember: scientific relevance is not identical to statistical significance. Furthermore, floor and ceiling effects should be avoided, i.e., the comparison should not be based on problems that are too hard or too easy. We strongly recommend a comparison to a baseline (e.g., default settings or Random Search (RS)).
The model \(\mathcal {A}\) must be clearly specified, i.e., the initialization and pre-training (starting points in optimization) should be explained. The hyperparameters (ranges, types) should be clearly specified. If there are any additional (untunable) parameters, then they should be explained. How is reproducibility ensured (and by whom)? Last but not least: open source code and open data should be provided.
The final conclusion from the three case studies (Chaps. 8–10) can be formulated as follows:
HPT provides tools for comparing, analyzing, and selecting an adequate ML or DL method for unknown real-world problems. It requires only moderate computational resources (notebooks or desktop computers) and limited time. Practitioners can start HPT runs at the end of their work day and will find the results ready on their desk the next morning.
10 Program Code
Notes
- 2. The data from CID is historical. It includes wording or categories regarding people which do not represent or reflect any views of the authors and editors.
- 7. An updated version of Chollet and Allaire (2018) is under preparation while we are writing this text. Check the authors’ web-page for more information: https://www.manning.com/books/deep-learning-with-r.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Bartz-Beielstein, T., Chandrasekaran, S., Rehbach, F. (2023). Case Study III: Tuning of Deep Neural Networks. In: Bartz, E., Bartz-Beielstein, T., Zaefferer, M., Mersmann, O. (eds) Hyperparameter Tuning for Machine and Deep Learning with R. Springer, Singapore. https://doi.org/10.1007/978-981-19-5170-1_10
DOI: https://doi.org/10.1007/978-981-19-5170-1_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5169-5
Online ISBN: 978-981-19-5170-1
eBook Packages: Computer Science, Computer Science (R0)