Arti ﬁ cial Neural Networks and Deep Learning for Genomic Prediction of Continuous Outcomes

This chapter provides elements for implementing deep neural networks (deep learning) for continuous outcomes. We give details of the hyperparameters to be tuned in deep neural networks and provide a general guide for doing this task with more probability of success. Then we explain the most popular deep learning frameworks that can be used to implement these models as well as the most popular optimizers available in many software programs for deep learning. Several practical examples with plant breeding data for implementing deep neural networks in the Keras library are outlined. These examples take into account many components in the predictor as well many hyperparameters (hidden layer, number of neurons, learning rate, optimizers, penalization, etc.) for which we also illustrate how the tuning process can be done to increase the probability of a successful application.

other topologies also work well in some circumstances, the application of DL is still at an early stage (experimental stage) with no definite conclusions. Picking the right network topology is more an art than a science and only practice will make a proper neural network architect (Chollet and Allaire 2017). However, with images data the most powerful DNN are convolutional neural networks, see Chap. 13.

Activation Functions
The principal rule for choosing activation functions for output layers depends on the type of response variable at hand. For this reason, for binary data, the best option is the sigmoid activation function, for ordinal or multiclass classification problems, the softmax activation function, and for continuous data, the linear activation function is suggested. However, even though there are no clear rules for choosing the activation functions for hidden layers, there is a lot of evidence that using sigmoid activation functions is not a good choice for hidden layers since sigmoid functions very often discard information due to saturation in both forward and backpropagation. RELU activation functions are the most used for hidden neurons, as can be seen in modern deep learning networks today, but RELU activation functions have the problem that some neurons may never activate across the entire training data set. For this reason, the suggestion is to use the leaky RELU activation function in hidden layers which have no zero gradient for all input values (Patterson and Gibson 2017).

Loss Function
Choosing the right loss function is a key component for the successful implementation of ANN and DL models. Next are given some general rules for selecting the right loss function: (a) when the response variable is continuous, use the sum of squares error loss as the primary option and the sum of absolute percentage error loss as the secondary option; (b) when the response variable is binary, use the hinge loss or logistic (cross-entropy) loss; (c) however, if you are interested in the probability of success of each individual, the logistic loss is the best option; and (d) for categorical responses, the categorical cross-entropy loss is the most used, but when the response variable is a count, a reasonable choice is the Poisson loss or negative binomial loss.

Number of Hidden Layers
Selecting the number of hidden layers is a very important issue in the configuration process of DL models. According to the universal approximation theorem (Cybenko 1989), a network with one hidden layer and a large number of neurons is enough to approximate any arbitrary function to the desired degree of accuracy. However, the reality is that many times the required number of neurons is so large that its implementation is not practical. Also, there is a lot of evidence that more than one hidden layer is required when the data are not separated linearly (Fig. 11.1b). Therefore, if your data are linearly separable (Fig. 11.1a) (which you often unknow by the time you begin coding an ANN or DL model), then you don't need any hidden layers at all. Of course, you don't need an ANN or DL model to resolve your data either, but it will still do the job. One hidden layer is sufficient for most problems.
In general, neural networks with two hidden layers can represent functions with any kind of shape. There are currently no strong theoretical reasons to use neural networks with any more than two hidden layers, but empirical evidence shows that using more than two hidden layers can capture in a better way non-linear patterns and complex interactions (Chollet and Allaire 2017). Further, for many practical problems, there's no reason to use any more than one hidden layer. Problems that require two hidden layers are not very common. Differences between the number of hidden layers are summarized in Table 11.1.  Can approximate well any functions that contain continuous mapping from one finite space to another 2 Represent a decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any degree of accuracy

Number of Neurons in Each Layer
The number of neurons in the network is very important since using too few neurons in the hidden layers will result in underfitting, while using too many neurons in the hidden layers can result in severe overfitting. For the input layer, the number of neurons is predetermined by the number of input features (predictors) of the input data. Also, in the output layer, the number of neurons is determined by the number of response variables (outputs) to be predicted or by the number of classes of the response variable when it is ordinal or categorical. For example, if we want to predict grain yield, a continuous outcome, using information of 3000 markers as input variables, this means that the input layer should have 3000 neurons and the output layer just 1, since we are only interested in one response variable that is continuous. However, if instead of predicting a continuous outcome we are interested in predicting an categorical output with ten categories, in this case the required number of neurons in the output layer should be equal to the number of categories in the categorical response variable. But defining the number of neurons in the hidden layers is challenging, and it is left to the user to decide which number to use prior to training the model. According to the literature review we performed, we found that there is no unique and reliable rule for determining the required number of neurons in the hidden layers, since this number depends on (a) the number of input neurons; (b) the amount of training data; (c) the quality of the training data; and (d) the complexity of the learning task (Lantz 2015). As a general rule, the more complex the network topology (with more neurons, layers, etc.), the more powerful the network for learning more complex problems. That is, the more neurons and layers are used, the better the representation of the training data, which means that the predicted values should be very close to the true values observed in the training set; however, this runs the risk of overfitting and may generalize poorly to new data. We also need to be aware that increasing the complexity of the network also implies that more computational resources are required for its implementation (Lantz 2015). However, as pointed out before, there is empirical evidence that only one hidden layer with a large number of neurons is enough even for complex applications, since increasing the number of layers increases the computational resources required by the complexity of the neural network. For this reason, in general, increasing the number of neurons instead of the number of layers is suggested, for example, if we have a neural network with 100 neurons in the single hidden layer, computationally it is better to increase to 200 neurons in the same hidden layer than to add a second hidden layer with 100 neurons. Of course, this suggestion is not always valid since there are many applications in the fields of pattern recognition and image classification, among others, were to be able to create adequately complex decision boundaries. It is very important to add more hidden layers to better capture the complexity of the phenomenon we want to predict.
Given the difficulties mentioned above, there are two very general approaches for specifying the required number of neurons in the hidden layers. The first one, called the backward approach, suggests starting with a very large number of neurons and evaluating their performance, and then decreasing the number of neurons until there is no more gain in the reduction of the testing error. The second one, called the forward approach, suggests starting with half (50%) of the input neurons and then increasing the number of neurons until no significant gain in the prediction performance of sample data is observed. Since there is no consensus about how to choose the required number of hidden neurons, next we provide some general rules: (a) the number of hidden neurons should be in the range between the size of the input layer and the size of the output layer; (b) the number of hidden neurons should be 2/3 of the input layer size, plus the size of the output layer; (c) the number of hidden neurons should be less than twice the input layer size; (d) Hecht-Nielsen (1987) suggested using 2P +1 hidden neurons (where P is the number of inputs) for a network with one hidden layer to compute an arbitrary function; (e) Lippmann (1987) suggested that the maximum number of neurons required for the single hidden layer should be equal to L Â (P + 1), where L is the number of output neurons; and (f) for a single hidden layer classifier, an upper bound is about 2P + 1 hidden neurons, whichever is larger.
But in practical terms, the optimal size of the hidden layer is usually between the size of the input and size of the output layers. In some problems, one could probably achieve a decent performance (even without a second optimization step) by setting the hidden layer configuration following just two rules: 1. the number of hidden layers is equal to one and 2. the number of neurons in that layer is the mean of the neurons in the input and output layers.
These rules are only starting points that you may want to consider. Ultimately, the selection of the neurons of your neural network will come down to trial and error following a backward or forward approach. Of course, the idea is to use as few neurons as possible that provide good performance in validation data sets. In general, rules (e) and (f) many times are conservative given that fewer neurons are often sufficient, since in a lot of cases, using only a small number of hidden neurons in the artificial neural network offers a tremendous amount of learning ability. Also, we need to recall that according to the universal approximation theorem, an artificial neural network with at least one hidden layer with a lot of neurons is a good universal function approximator (Lantz 2015).

Regularization Type
The main goal of regularization is to avoid overfitting, and the basic idea is to penalize the weights in such a way that the model increases its prediction performance in out-of-sample data. The most common methods are the L2 (Ridge penalization), L1 (Lasso penalization), L2-L1 (Elastic Net penalization), and the dropout method. The L1 regularization forces many weights to be exactly zero, which provides a sparse solution, while with L2 regularization, the larger the weight, the stronger the penalization. The L2-L1 is somewhere between the L2 and L1 regularization, while the dropout method fixes a certain percentage of weights to be exactly zero in the input or hidden neurons. However, it is not easy to suggest one of them, since all of them do a reasonable job in particular contexts, but any of them works well for all types of data. But, in general, the L2 and dropout regularizations perform well in many applications.

Learning Rate
The learning rate modifies weights and thresholds (bias) of the neural network, in order for a given input to the network to produce a favored output. It also determines the speed for convergence of the backpropagation algorithm to an optimal solution. Typically, a value of 0.001 or less is used to initialize the learning rate; however, in general, it is a hard task to choose the appropriate learning rate. Before explaining the problems involved in choosing the learning rate, we will first mention that the backpropagation algorithm can stop because (1) the number of maximum epochs allowed was reached and (2) an optimal solution was found (Baysolow II 2017). Figure 11.2 illustrates how to reach an optimal solution with an optimal learning rate.
Next, we explain the consequences of choosing a learning rate that is too small or too large. When the learning rate is very small, it is more likely to stop the backpropagation algorithm in a non-optimal condition since the first stopping condition is satisfied. Although apparently this can be fixed by increasing the number of epochs, this solution is not optimal since increasing the number of epochs considerably increases the required computation resources. On the other hand, when the learning rate is very large, it is possible not to reach the optimal solution due to the fact that the loss function in each epoch may overcorrect and give updated values for the coefficient that are far too small or far too large (Figs. 11.3 and 11.4). For this reason, the suggestion is to try several learning rate values and see how the algorithm performs across each epoch (Baysolow II 2017).

Number of Epochs and Number of Batches
Before defining an epoch, we will define a sample (individual) as a single row of data that contains inputs that are fed into the algorithm and an output (prediction) that is compared to the true value to calculate an error. The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work across the entire training data set. One epoch means that each sample in the training data set has had an opportunity to update the internal model parameters. An epoch is comprised of one or more batches. For example, assume you have a data set with 600 samples (rows of data) and you choose a batch size of 15 and 500 epochs. This means that the data set will be divided into 40 batches, each with 15 samples. The model weights will be updated after each batch of 15 samples. This also means that one epoch will involve 40 batches or 40 updates of 15 samples to be completed. With 500 epochs, the model will be exposed to or will pass through the whole data set 500 times. That is, a total of 40 Â 500 ¼ 20,000 batches during the entire training process. Therefore, with this example, it is clear that the number of batches in each epoch is equal to the number of samples in the data sets divided by the batch size. To better understand these two hyperparameters, you can think of a for-loop over the number of epochs where each loop proceeds over batches of the training data set. Within this for-loop is another nested for-loop that iterates over each batch of samples, where one batch has the specified "batch size" number of samples. The number of epochs is traditionally large, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized. You may see examples of the number of epochs in the literature and in tutorials set to 10, 100, 500, 1000, and larger. This means that the number of epochs can be set to an integer value between one and a very large value. You can run the algorithm for as long as you like and even stop using other criteria besides a fixed number of epochs, such as a change (or lack of change) in the model error over time. The more complex the features and relationships in your data, the more epochs you'll require for your model to learn, adjust the weights, and minimize the loss function. With regard to the batch size, that is, the number of samples processed before the model is updated, this hyperparameter should be more than or equal to one and less than or equal to the number of samples in the training data set. Popular batch sizes include 32, 64,128 samples and so on.
It is common to create line plots that show epochs along the x-axis as time, and the error or skill of the model on the y-axis. These plots were mentioned in Chap. 10 and were called learning curves under the early stopping rule. These plots can help to diagnose whether the model has over learned, under learned, or is suitably fit to the training data set. Since there are no magic rules on how to configure these parameters, you must try different values and see what works best for your problem.

Normalization Scheme for Input Data
It is generally good practice to scale input variables to have the same scale. When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. This introduces a problem when using weight regularization because the absolute or squared values of the weights must be added for use in the penalty. Also, without normalization, the input variable with greater numeric range has larger impact than the one having less numeric range, and this could, in turn, impact prediction accuracy. This problem can be addressed by the following methods applied to each input variable: Standardizing. Subtract the mean and divide by the standard deviation. Many times it is good practice to only divide each input by the standard deviation without centering.
Min-max normalization. Subtract from each observation the min value of the input and divide this by the range (max value-min value) of the input. The disadvantage of the min-max normalization technique is that it tends to bring the data toward the mean. Details of this method and other methods of scaling were provided in Chap. 2 of this book.

Popular DL Frameworks
Given that the adoption of DL has proceeded at an alarming pace, the maturity of the ecosystem has also shown phenomenal improvement. Thanks to many large tech organizations and open-source initiatives, we now have a plethora of options to choose from. Due to the evolution of the software industry, today it is far easier to develop high-end software than it was a few years back, thanks to the available tools that have automated complex problems in a way that's simple to use.
Given the maturity of software tools available today, we can afford to automate several complexities that happen in the background. These tools are nothing but building blocks for software systems. You technically don't need to start from scratch; you can instead rely on available powerful tools that have matured significantly to take care of several software-building services. Given the level of abstraction a framework provides, we can classify it as a: (a) Low-level DL framework and (b) High-level DL framework A few of the popular low-level frameworks for DL are Theano, Torch, PyTorch, MxNet, TensorFlow, etc. The previously mentioned frameworks can be defined as the first level of abstraction for DL models. You would still need to write fairly long codes and scripts to get your DL model ready, although much less than using just Python or C++. The advantage of using the first-level abstraction is the flexibility it provides in designing a model. However, to simplify the process of DL models, we have frameworks that work on the second level of abstraction called High-Level DL Frameworks. Rather than using the previously mentioned frameworks directly, we can use a new framework on top of an existing framework and thereby simplifying DL model development even further. The most popular high-level DL framework that provides a second-level abstraction to DL model development is Keras. Other frameworks like Gluon, Lasagne, and so on are also available, but Keras has been the most widely adopted one.
Keras is a high-level neural network API written in Python, and it can help you in developing a fully functional DL model with few lines of code. Since it is written in Python, it has a larger community of users and supporters and is extremely easy to get started on. The simplicity of Keras is that it helps users to quickly develop DL models and provides a ton of flexibility while still being a high-level API. This really makes Keras a special framework to work with. By far the most widely adopted usage of Keras is with TensorFlow as a back end (i.e., Keras as a high-level DL API and TensorFlow as its low-level API back end). In a nutshell, the code you write in Keras gets converted to TensorFlow, which then runs on a computing instance. You can read more about Keras and its recent developments here: https://keras.io/.
For this reason, in this book, we use Keras for implementing ANN and DL models (Allaire and Chollet 2019). The examples that follow use this framework.

Optimizers
The optimizer is a key component of the model training process. Up to this point, we are only familiar with the backpropagation algorithm that provides feedback to the model to reach a good trained process. The loss function of ANN is the main element to understand how well or poorly the current set of weights has performed on the training sample. The next step for the model is to reduce the loss. How does it know what steps or updates it should perform on the weights to reduce the loss? As we observed in the previous chapter, the optimizer function is a mathematical algorithm that uses calculus (derivatives and partial derivatives) to estimate how much change the network will need in the loss function to be able to reach the minimum. The change in the loss function, which would be a decrease or increase, helps in finding the direction of the change required in the weight of the connection. The math equations required are specific for each optimizer and are beyond the scope of this book (Géron 2019), but the general issues are those provided in Chap. 10 for the back propagation algorithm.
Batch Gradient Descent (BGD). The implementation of BGD requires training the model with the full data sets at every step. As a result, it is considerably slow on very large training sets. The BGD will eventually converge to the optimal solution for convex functions with a fixed learning rate, but requires a considerable amount of time (Géron 2019). Under BGD, the cost function has the shape of a bowl when the inputs have the same scale, but an elongated bowl with different scales of the features is also possible. Figure 11.5 shows gradient descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has much smaller values than feature 2 (on the right).
Other optimizers based on gradient descent available in Keras are listed next. Stochastic Gradient Descent (SGD). SGD, at the extreme, simply chooses a random instance in the training set at each step and calculates the gradients only in that single instance, that is, SGD uses batch_ size ¼ 1 in the model training function. Fig. 11.5 Gradient descent with and without input (feature) scaling (Géron 2019). On the left are shown the two features on the same scale that reach the target faster, while on the right the two features have different scales and take longer to reach the target Obviously, this makes the algorithm much faster, since in each iteration, it has less data to manipulate ( Fig. 11.6). To reduce high fluctuations in SGD optimizations, a better approach would reduce the number of affected iterations in a mini-batch. Mini-batch gradient descent is a modification of the gradient descent method that splits the training data set into small batches that are used to calculate model prediction error and update model coefficients. This approach has been more successful and results in a more fluid training process. The batch size is usually set to powers of 2 (i.e., 32, 64, 128, etc.) (Géron 2019).
Adaptive Moment Estimation (Adam). This is by far the most popular and widely used optimizer in DL. In most cases, you can blindly choose the Adam optimizer and forget about the optimization alternatives. This optimization technique computes an adaptive learning rate for each parameter. It defines momentum and variance of the gradient of the loss and leverages a combined effect to update the weight parameters. The variance and momentum together help smooth the learning curve and effectively improve the learning process.
Other important optimizers. There are many other popular optimizers for DL models. We will not discuss all of them in this book. We will list only a few of the most popular optimization alternatives used and available within Keras (Adadelta, RMSProp, Adagrad, Adamax, Nadam). There is no perfect optimization method since each of them has its own pros and cons. Each optimizer requires tunable parameters like learning rate, momentum, and decay. The saddle point and vanishing gradient are two problems often faced in DL. These problems can be explored in more detail while choosing the best optimizer for your problem. But Adam always works fine in most cases (Géron 2019).

Illustrative Examples
Example 11.1 MaizeToy In this example, we use the MaizeToy data set. This data set is preloaded in the BMTME library, and for this reason it is important to first install this library and also the Keras library. To see the information of this data set, we can use: rm(list=ls()) library(keras) library(BMTME) ##########Set seed for reproducible results################### use_session_with_seed(45) #################Loading the MaizeToy data sets############# data("MaizeToy") ls() head(phenoMaizeToy) Using this code, we can see in the console that this data set has two objects: "genoMaizeToy" and "phenoMaizeToy." The first contains the genomic relationship matrix (GRM) of the lines, while the second contains the phenotypic information for which the first six observations resulting from using the head() command are shown: Here we can observe that in this data set, the first column contains the names of the lines, the second contains the name of the environments, while columns 3-5 contain the measured traits: grain yield (Yield), anthesis-silking interval (ASI), and plant height (PH).
##############Selecting the response variable###################### Y <-as.matrix(phenoMaizeToy[, -c(1, 2)]) ####Training-testing sets using the BMTME package############### pheno <-data.frame(GID = phenoMaizeToy[, 1], Env = phenoMaizeToy[, 2], The Y matrix contains information of the three traits available in the MaizeToy data sets. The pheno variable is an auxiliary object that contains information of lines, information of environments, and the response variable of only one trait to be able to create a training-testing partition according to a five-fold cross-validation strategy where information of some lines is missing in some environments but not in all. The CrossV$CrossValidation_list argument was used to print the testing observations for each fold. 88 79 59 67 3 36 90 14 23 78 10 27 26 24 74 9 13 82 18 5 71 47 46 49 85 50 25 55 62 22 32 68 1 77 We used the following code to select only one trait to be our response variable (Yield), and we also collected in an X matrix the information of environments, lines, and genotype by environment that will be the input of our artificial deep neural network model: The output of this code shows that the total number of observations is 90, that is, 30 for each environment, and the number of input variables (predictors) is 123, which corresponds to the number of columns in matrix X. For the moment, we will only work with one-fold of the five training-testing sets, and we will also specify the number of epochs and units (neurons)  Before specifying the model, we will explain the pipe operator (%>%) that connects functions or operations together in the Keras library. We will be using the pipe operator (%>%) to add layers to the network. The pipe operator allows us to pass the value on its left as the first argument to the function on its right.
Next, we define the artificial neural network model and compile it in Keras: build_model <-function() { model <-keras_model_sequential() model %>% layer_dense(units =N_Units, activation = "relu", input_shape = c(dim (X_trn)[2])) %>% layer_dropout(rate = 0.0) %>% layer_dense(units = 1, activation = "linear") model %>% compile( loss = "mse", optimizer = "rmsprop", metrics = c("mse")) model} The first part of this code specifies the type and topology of the model. With keras_model_sequential we are using Keras to implement a feedforward (or densely connected) network. The sequential model is composed of a linear stack of layers. Layers consist of the input layer, hidden layers, and an output layer. Sequential models are appropriate when the information available is in two dimensions: one dimension for samples (observations or individuals) and the other for predictors (features), since the available data set is a matrix where the rows correspond to observations and the columns correspond to the features at hand. The first layer_dense is used to specify the first hidden layer and also the required number of neurons (units), and in the input_shape is specified the dimension of the input variables (in this case, it is equal to the number of columns in the training set of the predictors, that is, 123). Also specified here is the linear activation function which gives as output positive values. Next in layer_dropout() we are instructing Keras that any of the weights of the neurons of the first hidden layer will be set to zero, that is, all specified neurons will be used since rate ¼ 0. Then the second layer_dense() is used to specify the output layer, which only used one neuron since the response variable under study is continuous and unique (univariate); in this layer we used the linear activation function, which is a good choice for continuous outputs. An input shape argument is not required in the second or other hidden layers since its input shape is automatically inferred as being the output of the previous layer.
Next, the compilation step is done in which the learning process of the network is configured and where you should specify the optimizer, the loss function(s), and the metric used for monitoring the match between observed and predicted values. In this case, for the optimizer we specified "rmsprop," which is a gradient-descent-based algorithm that combines Adagrad and Adadelta adaptive learning ability. Adagrad is a gradient-descent-based algorithm that accumulates previous costs to do adaptive learning. Adadelta is a gradient-descent-based algorithm that uses Hessian approximation to do adaptive learning. The loss function and metric used in this case is the mean squared error (mse) since the response variable is quantitative. The last two lines of this code provide a summary of the model that will be implemented. From this summary output of the neural model that will be implemented, we can see that there are 33 neurons (since we decided to use 33 neurons) in the hidden layer, and any of their weights will be fixed at zero since the dropout percentage is zero. For this reason, the number of parameters (weights) in the first hidden layer is equal to 123 Â 33 + 33 ¼ 4092, and to 34 in the second layer since there are 34 weights that connect from the hidden layer to the output layer; this corresponds to 33 units plus an intercept (bias) term. For this reason, the total number of parameters to estimate is equal to 4126. We then provide the code for training (fitting) the model and also specify a plot to visualize training and validation metrics by epoch: ##########Internally this part of the code split the training set in training-inner and validation### model_fit <-model %>% fit( The first part of the code is a custom callback function for replacing the default training output with a single dot per epoch, while the second part is for fitting the artificial neural model and storing training stats. Also, in this second part of the code is specified the input and output training information, the number of epochs and batch size, and the size of the inner validation set, which in this case was 0.2 or 20%. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data are selected from the last samples in the X and y data provided, before shuffling (permining the data). The goal is to shuffle the training data before each epoch when TRUE is specified, but not when FALSE is specified. The plot of model_fit is for visualizing the model's training progress using the metrics stored in the model_fit variable. We want to use these data to determine how long to train before the model stops making progress.
In Fig. 11.7, the mean square error is plotted in the bottom panel, and the loss in the top panel. In this plot you can see that after 250 epochs, there is no relevant improvement in the prediction performance of the validation set. Next, we calculate the MSE for the training set; the predictions and MSE of the testing set were obtained using the following code: ############MSE for No_epochs=1000 in the outer testing set########## model_fit$metrics$val_mse [No_Epoch] pred=model%>% predict(X_tst) Pred=c(pred) Obs=y_tst MSE_First=mean((Obs-Pred)^2) MSE_First The first part of the code extracts the MSE for the training set implemented with 1000 epochs, which is equal to 4.5429; the second manually calculates the MSE for the testing set, which is equal to 1.8688. Next, we refit the model but using the early  The first part of this code is a function that will help to implement the early stopping process in Keras, and the function to monitor is the validation loss score. The patience parameter is the number of epochs with no improvement, after which training will be stopped. The second part is exactly the same as the one used before for fitting the model, with the difference that now the training process will stop according to the information provided in the early stopping function. The code length(model_fit_Final$metrics$mse) prints the number of epochs during the stopping process. Finally, the history of the training and validation process is depicted in this plot (Fig. 11.8). In the plot it is clear that the stop occurred in epoch 282, which means that the training process was done with this number of epochs and not with 1000 as before. Finally, the following code was used to obtain the predicted values and MSE for the testing set using the training information with 282 epochs. Two MSE were also printed, one using the original 1000 epochs (MSE_First) and the last one with 282 epochs. From this output it is clear that when the training process was done with the early stopping approach, the MSE of the testing set was better by 3.4% (even though maybe it is not so relevant since the predictions are still quite far from the true values), than when using all the epochs originally used. However, it is important to point out that the early stopping method did not always improve the prediction performance.

Example 11.2 MaizeToy Example with a Tuning Process
The last example illustrated how to use the early stopping rule to train ANN or DL models. However, that example only used one partition of the data sets (one TRN with 80% and one TST with 20%) and one value of neurons. However, if another TRN-TST split and number of neurons are used, the expected MSE will be different. For this reason, next we split the training into five TRN-TST inner subsets and also train the ANN model with a grid of two values for the neurons and two dropout values.
Tuning a DNN model often requires exploring many hyperparameters to find the best combination that produces the best out-of-sample predictions. The best way to approach this in Keras is by defining external flags (a map of the model parameter name and an arrangement of values to try) for key parameters, which you may want to vary instead of progressively changing your source code of the training script each time. In other words, this is a way in Keras for specifying a grid search. The flags() function provides a flexible mechanism for defining flags and varying them across training runs. For this reason, for tuning hyperparameters in Keras, we will now use the tfruns package (Allaire 2018), which uses flags and provides a lot of flexibility for tracking, visualizing, and managing training runs and experiments in R. This package allows tracking the metrics, hyperparameters, source code of every training run, and output, and visualizes the results of individual runs and comparisons between runs. For this reason, first we created the Keras script with flags called Code_Tuning_With_Flags_00.R, as follows: ####a) Declaring the flags for hyperparameters FLAGS = flags( flag_numeric("dropout1", 0.05), flag_integer("units",33), flag_string("activation1", "relu"), flag_integer("batchsize1",56), flag_integer("Epoch1",1000), flag_numeric("learning_rate", 0.01), flag_numeric("val_split",0.2), flag_numeric("reg_l1",0.001)) In part a) of the above code, we declare the flags for each of the candidate hyperparameters to be tuned. It is important to point out that the type of flag that is used depends on the type of hyperparameter, and there are four types of flags: a) flag_integer() for integer values like 1, 2, 3, 4, etc., b) flag_numeric() for floating point values like 2.5, 4.4. À1.23, 1.35, etc., c) flag_boolean() for logical values (e.g., true, false, 1, 0), and d) flag_string(), for example, character values (e.g., "RELU").
Next we incorporate the flag parameters within our model. In part b) of the code, a feedforward neural network is defined, but incorporating the flags for units and hidden activation function allows us to vary the values of these hyperparameters and not only work with the literal values of these hyperparameters. The definition of the DNN model differs from the previous DNN model in the use of flags. In part c), the compilation of the DNN model is done using flags and in this part the only flag used was for the learning rate. Finally, in part d), the model is fitted; flags were also used here but now for epochs, batch_size, and validation_split.
Next we created a grid of hyperparameters and we called the last R code (Code_Tuning_With_Flags_00.R) to be able to vary each flag with a grid of values using the tfruns::tuning_run() function where the grid of values for each of the hyperparameters is specified. The following code performs a tuning process for choosing the optimal hyperparameters. The data used here are the same as before (MaizeToy data sets), but it is important to point out that to be able to use this code, first all the codes that were used in Example 11.1 should be used, until part 0: Part 0: Final X and y before splitting. ################a) Inner cross-validation########################## nCVI=5 ####Number of folds for inner CV Hyperpar=data.frame() for (i in 1:nCVI){ Sam_per=sample(1:nrow(X_trn),nrow(X_trn))

X_trII=X_trn[Sam_per,] y_trII=y_trn[Sam_per]
############b) Grid search using the tuning_run() function of tfruns package######## runs.sp<-tuning_run("Code_Tuning_With_Flags_00.R",runs_dir = '_tuningE1',flags=list(dropout1= c(0,0.05), units = c(33,67), activation1=("relu"), batchsize1=c (28) In part a) of the above code, the number of inner TRN-TST partitions that will be implemented is specified; in this case, there are five. Then a data.frame called Hyperpar is defined in which the results of each combination of hyperparameters resulting from each inner partition will be saved. Then we start the loop for implementing the five inner CV strategies for selecting the best hyperparameters. Then the outer training set is shuffled (permuted; using sample without replacement) in such a way that the TRN-TST partition constructed internally in Keras using the validation_set function does not use the same validation and training set in each inner partition.
Part b) of this code specifies the grid for each hyperparameter. It is important to point out that two values were used, 0 (0%) and 0.05 (5%), for dropout. Also, two values, 33 and 67, were used for units, while for the rest of the hyperparameters only one value was used. For this reason, a total of four hyperparameter combinations should be evaluated. It is important to point out that you can put many more values for each hyperparameter, for example, using a grid of 4 values for each hyperparameter implies a total of 4 7 ¼ 16,384 combinations since 7 hyperparameters can be tuned with the flags specified in the above code, which illustrates that the hyperparameter search space explodes quickly with DL models since there are many hyperparameters that need to be tuned. For this reason, the tuning_run () function, in addition to implementing by default a full Cartesian grid search (with sample ¼ 1), also allows implementing a random grid search by specifying a value between 0 and 1 in the sample command inside the tuning_run() function. For example, by specifying sample ¼ 0.25, only 25% of the total combinations of hyperparameters should be evaluated.
In part c) of the code, first the outputs of the tuning_run function are ordered, that were saved in the runs.sp object. The ordering should be done for all those hyperparameters that have more than one value for evaluation. We did the ordering process only for dropout1 and units (hyperparameter neurons) that have two values each to be evaluated. However, if more hyperparameters contain more than one value to be evaluated, they should also be ordered. Then we added to the runs.sp data frame the unique consecutive number for each unique combination of hyperparameters under study. Then in parameters, in addition to the grid length, the important hyperparameters are saved for each inner fold in the Hyperpar data frame. In part d) the average prediction performance is obtained for each of the four hyperparameters under study in this problem. In e) the best combination of hyperparameters is selected. Then in part f), the model with the optimal hyperparameters is refitted, and finally in part g), the predictions of the secondouter cross-validation are obtained.
Before giving the output of the refitted model, we printed the average prediction performance of the five inner cross-validations of the four combinations of hyperparameters in the grid search for the outer fold cross-validation of the second partition: We can see that the best combination of hyperparameters, in partition 2, of the inner cross-validation is combination 4 (with dropout ¼ 0.05, units ¼ 67, and epochs ¼ 247.6) since it has the lowest MSE (0.971) in the testing set (validation set). Then with the optimal values of dropout, epochs, and neurons, the ANN was refitted, but using the whole second outer training set, and then, with the final trained ANN, the predictions for the outer testing set were obtained. Finally, the plot of the observed and predicted values of the outer testing set was done, as shown in Fig. 11.9. The MSE was calculated and was equal to (0.810), which is lower than the MSE obtained (1.222) using only one split (inner TRN-TST) for partition 2.
Finally, using the code given in Appendix 1, we implemented the above code using the five outer folds and five inner folds, and the MSE resulting from the average of the five outer CV is equal to 0.892; it is slightly larger than when using only fold 2, which means that the variability of fold to fold is considerably large.
Example 11.3 MaizeToy Example with Five-fold CV with Dropout Next, we perform a five-fold CV for the MaizeToy data sets but with different percentages of dropout, a type of regularization. It is important to point out that now, in addition to reporting the MSE by fold, we also provide the MSE for each environment. First, we provide the MSE for each environment, then we average the five-folds, using as input the information of environments + the information of lines (taking into account the marker information). The code used to obtain these results is given in Appendix 2. This code also uses the Code_Tuning_With_Flags_00.R, but now the grid search is performed not only for two hyperparameters since we use two values for the following hyperparameters: units, batchsize, and learning_rate, which means that the total combinations to be evaluated are equal to 2 3 ¼ 8, while the remaining four hyperparameters, validation split, dropout, activation function, and number of epochs, were fixed at 0.20, 0, relu, and 1000, respectively. Next we provide the key part of the Appendix 2 code: runs.sp<-tuning_run("Code_Tuning_With_Flags_00.R",flags=list (dropout1= c(0), units = c(33,67), activation1=("relu"), batchsize1=c(28,56), where it is clear that the % of dropout value was fixed at zero. Of course, when you want to use other dropout percentages, you only need to specify in dropout1 the new value in the above code for tuning and it is modified in the Code_Tuning_With_Flags_00.R the following part of the code layer_dropout(rate = 0.05) %>% This means that when the Code_Tuning_With_Flags_00.R is executed with 5% dropout, in this code, the line above is modified with a rate ¼ 0.05, but if you want to run with 25% dropout, the rate parameter is changed to 0.25. This code in Appendix 2 is run for the following dropout percentages: 0%, 5%, 15%, 25%, and 35%. This code was run once for a grid of the five dropout percentage values, but first without taking into account the genotype by environment interaction and then with the genotype by environment interaction. It is important to point out that by changing the value of the sample argument in the tuning_run() function, we can implement the random search grid that tests only the proportion of hyperparameter combinations that is specified in the sample. For example, if we specify sample ¼ 0.5, this means that only half of the total number of hyperparameter combinations in the grid will be evaluated, while if we specify sample ¼ 0.10, only 10% of the total number of hyperparameter combinations should be evaluated. However, to use the code in Appendix 2 with a value of sample less than one, it is also important to modify the number of inner testing CV to 1 (nCVI ¼ 1, ####Number of folds for inner CV) since with more than two (nCVI ¼ 2, 3, 4, . . .), the R code given in Appendix 2 is not valid. Table 11.2 shows that in general, the prediction performance in terms of MSE and MAAPE improved using different dropout percentages, since the best prediction performance was observed using 15% of dropout (rows in bold). Table 11.3 gives the prediction performance using the same dropout percentages using the code given in Appendix 2, but now taking into account as predictors the information corresponding to the G Â E interaction term which is obtained by replacing, in Appendix 2, X ¼ cbind(Z.E, Z.G) with X ¼ cbind(Z.E, Z.G, Z.EG). This table shows that there is a gain in prediction performance with a dropout percentage larger than zero and the best predictions are obtained with 15% dropout. However, it is clear that the predictions are worse when the G Â E interaction term is ignored, which can be attributed in part to the fact that the number of data sets used is small and that ANN with at least one hidden layer can capture complex interactions by the nonlinear transformations performed in the hidden layer. Perhaps for this reason, in many cases providing as input the information corresponding to the G Â E interaction term is not required. It is possible to increase the prediction accuracy of this model with the G Â E interaction term by increasing the number of neurons in the hidden layer, but this also increases the computational resources required.
Example 11.4 MaizeToy Example with More Than One Hidden Layer with Inner CV The goal of this section is to show how to train DL models with different number of layers and how they impact the prediction performance. Now the grid contains the number of neurons (33, 67), three dropout percentages (0, 5, and 10%) and two batch size values (28, 56), and again the number of epochs is chosen using the early stopping approach. All 12 combinations are evaluated, since they were implemented using the full grid search with sample ¼ 1. It is important to point out that this grid was implemented with an outer 5FCV and inner 5FCV. The code used for implementing this grid with four hidden layers is given in Appendix 4 (this code call the code given in Appendix 3 that contains the flags). It is very similar to the code used in Appendix 2, but one of the most important parts is how the layers are added to this code. Next we provide the part where the layers are added. This part of the code that forms part of Appendix 4 is for a DL model with four hidden layers; the lines that start with layer_dense() specified all the layers, but the first four correspond to the hidden layers and the last one to the output layer. Also, this DL model corresponds to a neural network with five layers. Table 11.4 shows the prediction performance with one to four different hidden layers; the prediction performance is affected by the number of hidden layers used, but in this case, the best predictions across environments were obtained with three hidden layers. Example 11.5 MaizeToy Example with More Than One Hidden Layer Without Inner CV The goal of this example is to implement exactly the same grid used in Example 11.4, but instead of implementing the five inner fold CV, we implemented this with only one inner random partition. The code used here is the one given in Appendix 5 with only one inner cross-validation. Table 11.5 shows that the best predictions are observed with three hidden layers, but it is important to point out that in general the prediction performances using only one inner random partition were not quite similar to performances using five inner fold CV, as can be observed in Table 11.4 in Example 11.4. Example 11.6 MaizeToy Example with One Hidden Layer with Inner CV and Ridge, Lasso, and Elastic Net (Ridge-Lasso) Regularization The code used for implementing a grid search with an outer and inner CV with Ridge regularization (L2) is similar to the one given in Appendix 2, with the difference that now for specifying the model the following part of the code is used: model_Sec %>% layer_dense(units =Units_O , activation =activation_O, kernel_regularizer=regularizer_l2(Reg_l2_O), input_shape = c(dim (X_trn)[2])) %>% layer_dropout(rate =Drop_O) %>% layer_dense(units =1, activation ="linear") The only new part of this code is the part that contains kernel_regularizer ¼ regularizer_l2(Reg_l2_O) that specifies Ridge regularization, and in Ridge_values, the following values were used: 0.001, 0.01, 0.1, 0.5, 1. The grid here is composed of four values due to two unit values (33, 67) and two batch size values (28, 56); we used sample ¼ 1, which means that the four hyperparameters were evaluated. Each of these values will add Ridge_valÂweight_coefficient_value to the total loss of the network. Since this penalty is only added at training time, the loss for this network will be much bigger at training than at testing time. To reproduce these results, you can replace in Appendices A3 and A4 the specification of the model provided above each time you need to modify in kernel_regularizer ¼ regularizer_l2(Reg_l2_O) the corresponding value of the Ridge regularization parameter. Table 11.6 gives the output without taking into account the G Â E interaction using the following L2 regularization values of lambda, 0.001, 0.01, 0.1, 0.5, 1, implemented with the early stopping method. We can see in Table 11.6 that the best performance was obtained using the value of 0.01 for the L2 regularization hyperparameter. Table 11.7 provides the output without taking into account the G Â E interaction for the L1 regularization for the same grid containing the following values of lambda, 0.001, 0.01, 0.1, 0.5, 1, two neuron values (33, 67), two batch size values (28, 56), and 1000 epochs implemented with early stopping. The performance was very similar using different values of L1 for regularization, but the performance under the L1 regularization using a value of 0.01 was slightly better.
In similar fashion, Table 11.8 gives the output without taking into account the G Â E interaction for five regularization parameters (0.001, 0.01, 0.1, 0.5, 1) using the Elastic Net regularization (L1-L2) for two neuron values (33, 67), two batch size values (28,56), and 1000 epochs implemented with early stopping. We can see in Table 11.8 that the best performance was obtained using the value of 0.001 of the L1-L2 regularization parameter across environments under the MAAPE and a value of 0.01 of the L1-L2 regularization parameter under the MSE.  Table 11.9 are given the results of evaluating seven types of optimizers for a five outer and five inner fold cross-validations without taking into account the genotypeÂenvironment interaction. The code given in Appendix 2 was used in this implementation, but with a 0% dropout rate and with only one hidden layer. The neuron grid consists of two neurons (33 and 67), two batch sizes (28, 56) and they were implemented with early stopping to find the optimal number of epochs, which was fixed at 1000. Since the code given in Appendix 2 was used each time, this code was run by replacing in the optimizer each of the seven types of optimizers. The results in Table 11.9 show that the best performance in terms of prediction accuracy was found under the optimizer_adamax and the worst under the optimizer_adagrad.
Finally, with these examples we illustrated how to use DNN models for implementing genomic prediction models for continuous outcomes in Keras. Most of the key elements for implementing DNN models were explained to enable you to implement these models on your own data without any complications. However, as mentioned before, the DNN training process is challenging due to the complexity of tuning hyperparameters, but thanks to functions like the tfruns, this work is not as hard. However, more research and automatization of all these activities are required to be able to implement DNN models more quickly and efficiently since the tuning process still requires a lot of time.