Artificial Neural Networks and Deep Learning for Genomic Prediction of Binary, Ordinal, and Mixed Outcomes

Before starting with the examples, we explain in general terms the process to follow to train DNN for binary outcomes. When training binary outcomes, it is important to denote the values of the response variable as 0 and 1, where 0 denotes the absence of the disease and 1 its presence; any other two types of interest should be denoted as 0 and 1. Below we provide some key elements to train this type of DNNs more efficiently:


Training DNN with Binary Outcomes
Before starting with the examples, we explain in general terms the process to follow to train DNN for binary outcomes. When training binary outcomes, it is important to denote the values of the response variable as 0 and 1, where 0 denotes the absence of the disease and 1 its presence; any other two types of interest should be denoted as 0 and 1. Below we provide some key elements to train this type of DNNs more efficiently: (a) Define the DNN model in Keras. As in the previous chapters, we again focus only on fully connected neural networks that consist of stacking fully connected networks to all neurons (units). We suggest using the RELU activation function or some of the following activation functions (leaky RELU, tanh, exponential linear unit, etc.) for hidden layers, but for the output layer we suggest using a single-unit layer with the sigmoid activation function to guarantee that the output is a probability between 0 and 1. Also, the first layer requires the input shape (features) information; however, this is not required for the following layers since they automatically infer the shape from the previous layer. The construction of a DNN in Keras for binary outcomes again needs to start by initializing a sequential model using the keras_model_sequential() function which allows implementing a series of layer functions that create a linear stacking of layers. The summary () can also be used to print a summary of our DNN model. The number of neurons in the output layer is 1 since we are dealing with a univariate binary outcome with two categories in the outcome variable. (b) Configuring and compiling the model. At this stage of the training process, the loss function, the optimizer, and the metric for evaluating the prediction performance should be defined. Regarding the loss function, most of the time binary_crossentropy is suggested for binary outcomes, while categorical_crossentropy is suggested for categorical or ordinal data. As it was studied in the previous chapter, the mean_squared_error loss is the most popular for continuous outcomes. However, it is also possible to use the mean_squared_error loss for binary and categorical outcomes, but binary_crossentropy and categorical_crossentropy are the best choices when we are dealing with DNN models that output probabilities because they measure the distance between probability distributions, that is, between the ground truth distribution and the predictions obtained. As was mentioned in the previous chapter, the optimizer plays a really important role when updating model parameters (weights and biases). There is no specific optimizer for each type of response variable, and as we studied in the previous chapter, there are at least seven optimizers available in the Keras library. However, we usually use the Adam optimizer since it performs well in many cases. Finally, regarding the type of metric for evaluating the prediction performance for binary and categorical data, we use the accuracy metric which measures the proportion of cases that are correctly classified. (c) Fitting the model. At this stage, we need to specify the number of epochs (i.e., the number of times the algorithm uses the entire training data set) and the batch size (size of the sample to be passed through the entire algorithm in each epoch) because if the training data consist of 1000 observations and we use a batch size ¼ 50, we will need 20 iterations per epoch. Here, we should specify the validation split when you are in the tuning process (the value of the validation split is between 0 and 1) or specify the validation data set that should be used. For example, if you specified a validation_split ¼ 0.3, this means that 30% of the original training data should be used as the validation set, and the remaining 70% of the observations will be used for training the model; the prediction performance of the model is evaluated with the validation set. Also, if you want to use the early stopping method, this should be specified in callbacks exactly as was done in the previous chapter for continuous outcomes. (d) Evaluating the prediction performance. For binary outcomes, we suggest using the predict_classes() function that requires the information of the independent variables of the testing set as input for which the predictions are required. The predict_classes() function gives binary results (0 or 1) as output. However, practitioners can also use the predict() function, which provides probabilities for each category as outputs that need to be converted to 0 and 1 using a threshold value, for example, observations with probabilities larger or equal to 0.5 should be classified as 1 and observations with probabilities smaller than 0.5 should be classified as 0. Next, we provide the first illustrative example for training DNN with binary outcomes.
Example 12.1 Binary outcomes. This toy data set is called EYT and is composed of four environments (Bed5IR, EHT, Flat5IR, and LHT), 40 lines in each environment, and contains four traits (DTHD, DTMT, GY, and Height). Traits DTHD and DTMT are ordinal traits, GY is a continuous trait, and Height is a binary trait. This data set contains a genomic relationship matrix of 40 Â 40 that corresponds to the similarity between lines.
The first eight observations of this data set are given below.
The model is compiled in part c) of the code, and the important things to note are that (a) now the loss function is binary_crossentropy, which is appropriate for binary response variables, (b) the optimizer specified was the Adam optimizer, optimizer_adam(), which is not specific for binary data, and (c) the metrics specified for evaluating the prediction performance is the accuracy that measures the proportion of correctly classified cases. In part d) the model is fitted using the number of epochs, the batch size, and the validation split specified in the flags (part a). In this case, the fitting process was done using the early stopping method.
Then, the above codes named "Code_Tuning_With_Flags_Bin.R" are called in the code given in Appendix 1. The code given in Appendix 1 executes the grid search using the library tfruns (Allaire 2018) with the tuning_run() function. The implemented grid search is shown below: runs.sp<-tuning_run("Code_Tuning_With_Flags_Bin_4HL.R",runs_dir = '_tuningE1', flags=list(dropout1= c(0,0.05), units = c(67,100), activation1=("relu"), batchsize1=c(28), Epoch1=c(1000), learning_rate=c(Learn_val[e]), val_split=c(0.2)),sample=1,confirm =FALSE,echo =F) The grid search is composed of only four combinations of hyperparameters that resulted from using two values of dropout (0, 0.05) and two units (67, 100). We used this small grid search because it is often not possible to do a full cartesian grid search with many values of each hyperparameter due to time and computational constraints. The code given in Appendix 1 was run five times, and each time it was run for a specific value of learning rate. It is important to note that the prediction performance is reported using not only the PCCC but also the Kappa coefficient, the sensitivity, and the specificity. Table 12.1 indicates that the best prediction performance was obtained using a learning rate (learn_val) of 0.01 across environments.

Training DNN with Categorical (Ordinal) Outcomes
When training DNN for categorical or ordinal outcomes, the C levels of the response variable are 0, 1, 2, . . ., C À 1. For example, if the categorical or ordinal response variable has three levels (no infection, middle level of infection, and total level of infection), they should be denoted as 0, 1, and 2, where 0 denotes no infection, 1 middle level of infection, and 2 total level of infection. Another example is that assume you are interested in training a machine for classification with orange, mandarin, tangerine, and lemon as outcomes; you can denote orange with 0, mandarin with 1, tangerine with 2, and lemon with 3. Of course you can choose different values for each fruit, but because there are four categories, you will use 0, 1, 2, and 3 to denote the four fruits even though this is a nominal variable. Next, we provide some key elements to train this type of DNN models more efficiently. Define the DNN model in Keras. The training process is equal to the training of binary response variables, except that in the output layer we suggest using the softmax activation function with a number of units equal to the number of categories; this guarantees that the output of each category is a probability between 0 and 1, and that the sum of these (all categories) probabilities is 1. Before starting the training process, you need to convert to dummy variables the categorical (or ordinal) response variable using the to_categorical() function that needs, as input, the vector of the categorical response variable and the number of classes that the response has. This way of coding the categorical and ordinal responses is called one-hot encoding or categorical encoding. It consists of embedding each level (label) of the categorical response variable as an all-zero vector with 1 in the place of the label index. For example, suppose that your response variable contains the following values: y ¼ (0, 2, 4, 1, 3, 0, 1, 3, 4, 2). Then the vector of response variable is transformed to yf ¼ to _ categorical(y, 5) that produces the following result: This means that the dependent variable is no longer a vector, because it is a matrix of zeros and ones; for this reason, the number of units required for the output layer is equal to the number of classes. In this example, the number of units required for the output layer should be five.
(a) Configuring and compiling the model. The only difference when compiling binary response variables and ordinal (or categorical) outcomes is that now using the categorical_crossentroy loss as the loss function is recommended. (b) Fitting the model. Everything that was explained for binary data also applies to ordinal and categorical outcomes. (c) Evaluating the prediction performance. For ordinal and categorical outcomes, we suggest using the predict_classes() function because it produces, as output, values of the categorical or ordinal data in the scale of the response variable, that is, 0, 1, . . ., C À 1. However, you can also use the predict() function, which will provide you with probabilities for each category and the sum of all of them is equal to 1. These probabilities need to be converted to the original response variable (0, 1, . . ., C À 1). This conversion can be done by assigning each observation to the category with the largest probability (Allaire and Chollet 2019). Next, we provide one illustrative example for a DNN with an ordinal outcome.
Example 12.2 Ordinal outcome. This example uses the same data as Example 12.1 (Toy_EYT data set), but now we use the ordinal trait DTHD as the response variable. For the tuning process, we first created the flags which are given next and should be placed in a file called Code_Tuning_With_Flags_Ordinal_4HL2.R, as it is called in Appendix 2.
####a) Declaring the flags for hyperparameters FLAGS = flags( flag_numeric("dropout1", 0.05), flag_integer("units",33), flag_string("activation1", "relu"), flag_integer("batchsize1",56), flag_integer("Epoch1",1000), flag_numeric("learning_rate", 0.001), flag_numeric("val_split",0.2)) In a) are given the default flag values for some hyperparameters; the DNN is defined in b) and is very similar to the definition of the DNN for binary outcomes, except that in the output layer the softmax activation function is used, which is appropriate for categorical or ordinal data, and now instead of one unit, three are used in the output layer since this is the number of classes of the DTHD ordinal response variable. Another important difference in the definition of the DNN model is that now we used the layer_batch_normalization() function just after specifying each hidden layer. The layer_batch_normalization() function is used to help with a problem called internal covariate shift that consists of changing the distribution of network activations due to the change in network parameters during training. Therefore, the layer_batch_normalization() function improves the training process by reducing the internal covariate shift by fixing the distribution of the layer inputs x as the training progresses (Ioffe and Szegedy 2015;LeCun et al. 1998;Wiesler and Ney 2011), and the internal covariate shift is reduced by linearly transforming the input to have zero means and unit variances and decorrelating the input information. This process is done in the input of each layer to fix the distributions of inputs that would remove the effects of the internal covariate shift (Ioffe and Szegedy 2015).
In part c) of the code, the DNN model is compiled; this is the same as the compilation process of binary outcomes, except that now a categorical_crossentropy loss function should be used because the response variable is ordinal or categorical. The fitting process, part d), is the same as that for binary outcomes.
Before using these flags, the response variable was converted to dummy variables using the to_categorical() function. Next, we show the first eight observations of the testing set of the first partition used in the code given in Appendix 2. Here we observe that the first column corresponds to the original ordinal score of the response variable with levels 0, 1, and 2, that is, nclas ¼ 3, while the remaining three columns are the three dummy variables created using the to_categorical() function since the original response variable has three types. From the output produced using the to_categorical() function, we can see that the first observation in the last column is a 1, while columns 2 and 3 have values of 0, since the original categorical response variable is 2. In the second observation, we can see that since the original categorical score is 0, the 1 appears in the second column, and values of 0 in the remaining columns. In the fifth observation, we can see that the original categorical score is 1; for this reason, in the third column, there is a 1 and in the remaining columns there is a value of 0.
The code given above is called Code_Tuning_With_Flags_Ordinal_4HL2.R in the code given in Appendix 2. The code given in Appendix 2 does the grid search using the library tfruns (Allaire 2018) and the tuning_run() function. The implemented grid search is shown below: runs.sp<-tuning_run("Code_Tuning_With_Flags_Ordinal_4HL2.R", runs_dir = '_tuningE1', flags=list(dropout1= c(0,0.05), units = c(67,100), activation1=("relu"), batchsize1=c(28), Epoch1=c(1000), learning_rate=c(Learn_val[e]), val_split=c(0.2)),sample=1,confirm =FALSE,echo =F) The grid search was used (sample ¼ 1) for the tuning process and, for the four hyperparameter combinations (two values of dropout and two values of units), were evaluated. Table 12.2 gives the results of implementing the code given in Appendix 2 for five different values of learning rate (0.005, 0.01, 0.015, 0.03, and 0.06), where the best predictions were obtained with a learning rate value of 0.01 across environments. These results give evidence that the prediction performance depends considerably on the value of the hyperparameter called learning rate.

Training DNN with Count Outcomes
Remember that count data (0, 1, 2, . . .) are usually modeled with Poisson regression or negative binomial regression in the statistical world. In the world of DNN, only the Poisson DNN model has been available until now in Keras and its key elements imitate those of the generalized linear models of the statistical world. For this reason, its construction in Keras uses as loss function the minus log-likelihood of a Poisson distribution; for the output layer, you can use the exponential activation function (inverse of log link in generalized linear models) that is available in Keras, which guarantees only positive outcomes. Also a RELU activation function can be used to guarantee a positive outcome. In general, the definition of a DNN model for count outcomes in Keras is very similar to what we have studied before for continuous, binary, and categorical data for univariate responses. If we are interested in predicting only one response variable, we only need to specify one unit in the output layer, but if we are interested in predicting five count response variables, we need to specify a unit for each response we wish to predict. Also, the process of adding the hidden layers is exactly the same as was done for continuous, binary, and categorical (or ordinal) data with the same activation functions, for example, RELU for all the hidden layers, since the fitting process of the model is exactly the same as the fitting process of continuous, binary, and ordinal univariate outcomes. However, some of the key differences are in the compilation and prediction process. In the compilation process, we need to specify the "poisson" loss function that was created as the minus log-likelihood of the Poisson distribution, and for the metric, we can still use the mean squared error metric; however, for the prediction process, we should use the predict() function that will always produce positive values only if we specify an appropriate activation function in the output layer like the exponential activation function. Next, we provide an illustrative example for training DNN with count outcomes.
Example 12.3 Count data. This toy data set contains 115 lines, evaluated in three environments (Batan2012, Batan2014, and Chunchi2014) and in each environment two blocks were created. The total number of observations of this data set is 649. The data set is denoted as Data_Count_Toy.RData. The count response variable has a minimum value of 0 and a maximum value of 17.
Modeling and predicting count data is not only important in plant breeding, but also very common in areas such as health, finance, social science, etc. Generalized linear models have been widely used for modeling count response variables, but many times fail to capture complex data patterns. For this reason, nonlinear Poisson regressions under the umbrella of deep artificial neural networks are of paramount importance for modeling count data and improving the prediction accuracy. The details for implementing DNN models for count data are given below.
The tuning process was done by creating flags which are given below; they should be placed in a file named: Code_Tuning_With_Flags_Count_Lasso.R, that is used in Appendix 3. ####a) Declaring the flags for hyperparameters FLAGS = flags( flag_numeric("dropout1", 0.0), flag_integer("units",33), flag_string("activation1", "relu"), flag_integer("batchsize1",56), flag_integer("Epoch1",1000), flag_numeric("learning_rate", 0.001), flag_numeric("val_split",0.2), flag_numeric("Lasso_par",0.001)) Again, in part a) are defined the default flags, in part b) the DNN for count data is defined, where the significant difference is that the exponential activation function is used for the output layer; we should point out that in each hidden layer, the Lasso (L1) regularization is also used in addition to dropout. In part c) the relevant parts are (1) the specification of the Poisson loss function and (2) the use of the mean squared error as a metric for evaluating the prediction performance, while the fitting process is exactly the same as was done for continuous, binary, and categorial or ordinal outcomes.

Training DNN with Multivariate Outcomes
Training models with multivariate outcomes are very important for plant breeders since they are interested in predicting more than one trait. In the context of plant breeding, multivariate models for the prediction of more than one trait simultaneously are called multi-trait models. There is evidence that multi-trait models capture the complex relationships between traits more efficiently and, for this reason, many times they improve the prediction performance when compared with univariate models. Statistical multi-trait models capture the correlation between traits and also the correlation between lines. There is evidence, and not only in genomic selection, that the larger the correlation between traits, the better the prediction performance of multi-trait analysis (Jia and Jannink 2012;Jiang et al. 2015). Authors like He et al. (2016) and Schulthess et al. (2017) found a significant improvement of multi-trait analysis with regard to univariate analysis, while Calus and Veerkamp (2011) 2019) found modest improvement of multivariate analysis when compared to univariate analysis. Also in the context of multi-trait models, it helps to clarify the relationship and the effect of each studied independent variable on the dependent multivariate variables (Castro et al. 2013;Huang et al. 2015).
Despite the positive advantages of statistical multi-trait models mostly for continuous outcomes, it has not been possible to develop efficient models for other types of multivariate response variables (binary, categorical, and count) and efficient models for mixed outcomes (continuous, binary, categorical, and count) are still lacking. There have been some developments in the statistical literature, but most of them are not efficient for large data sets. However, as shown in two publications by Montesinos-López et al. (2018b, c) and Montesinos-López et al. (2019), the deep learning methodology has the power to efficiently implement univariate and multivariate models for each type of response variables, and even for mixed outcomes (Chollet and Alliare 2017). For this reason, in this section, we will show how to implement multi-trait analysis for continuous outcomes, multi-trait binary outcomes, multi-trait categorical outcomes, multi-trait count outcomes, and multi-trait mixed outcomes with a combination of at least two types of response variables (continuous and binary; continuous and categorical; continuous and count; categorical and count; etc.) and even with four types of response variables for continuous, binary, categorical, and count outcomes. Next, we will illustrate the DNN training process with multi-trait outcomes with all traits as continuous.

DNN with Multivariate Continuous Outcomes
The data set used for illustrating how to train multivariate continuous outcomes is called MaizeToy data set. This data set contains 30 lines that were measured in three environments (EBU, KAT, and KTI); for this reason, the total number of observations is 90. Also, three continuous traits were measured for each observation, and these traits were Yield, ASI, and PH. The genomic information is contained in the genomic relationship matrix denoted as genoMaizeToy.R.
Next, we provide the default flags for training continuous outcomes. These flags should be placed in the R (R Core Team 2019) code called Code_Tuning_With_ Flags_MT_normal.R. ####a) Declaring the flags for hyperparameters FLAGS = flags( flag_numeric("dropout1", 0.05), flag_integer("units",33), flag_string("activation1", "relu"), flag_integer("batchsize1",56), flag_integer("Epoch1",1000), flag_numeric("learning_rate", 0.001), flag_numeric("val_split",0.2)) ####b) Defining the multi-trait DNN model input <-layer_input(shape=dim(X_trII)[2],name="covars") # add hidden layers base_model <-input %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) # add output 1 yhat1 <-base_model %>% layer_dense(units=1, name="response_1") # add output 2 yhat2 <-base_model %>% layer_dense(units= 1, name="response_2") # add output 3 yhat3 <-base_model %>% layer_dense(units= 1,name="response_3") #c) Compiling the multi-trait model model <-keras_model(input,list(response_1=yhat1,response_2=yhat2, response_3=yhat3)) %>% compile(optimizer =optimizer_adam (lr=FLAGS$learning_rate), loss=list(response_1="mse",response_2="mse", response_3="mse"), metrics=list(response_1="mse",response_2="mse", response_3="mse"), loss_weights=list(response_1=0.99,response_2=1.8, response_3=0.069)) In part a) the default flags which are defined exactly as for univariate outcomes are defined. In part b) the multi-trait DNN model is defined, with layer_input(), the dimension (number of independent variables) of the input information is then provided and the hidden layers are added. In this case, only three hidden layers were specified and each layer had dropout that was added with layer_dropout(). Then three output layers were added that correspond to each of the three traits of the mult-trait DNN model. The three output layers use one unit and the linear activation function (not necessary to specify which) since the three traits are assumed to be continuous. Next, in part c), the compilation process is done; here, as in univariate DNN models, we need to specify the loss function, the optimizer, and the metrics used to evaluate the prediction performance. The specified optimizer is exactly the same as in the univariate DNN models. However, for the specification of the loss function and metrics we need to specify a loss function and metrics for each trait (outcome), that need to be in agreement with the type of response of each trait. Since in this example we have three continuous traits, we specified as a loss function and metric for each trait the mean_squared error (mse); however, for traits with different types of outcome, the practitioner should specify a loss function and metric appropriate for each type of trait. Two differences in the compilation process of multi-trait DNN models with regard to univariate DNN models are found. The first one is that in the Keras we need to specify the traits under study using a list() where, separated by a comma, the names of the traits under study are provided. The second one is that we need to specify the loss_weights for each trait. One simple approach is to use the same weights for all traits: for example, (1,1,1), if we have three traits or (0.3333,0.3333, 0.3333), this approach is valid when all traits are on the same scale, but when traits are on different scales, we suggest using different weights for each continuous trait. In this example, different weights were used, since each trait has a different scale and the weights were built as follows: (1) first we calculated the median of each trait, (2) then we calculated the 0.25 and 0.75 quantiles for each trait, (3) then we calculated the maximum distance in terms of absolute value between the median and both quantiles, (4) then we used as the weight for the first trait (GY) its calculated distance, and (5) then we used as weight for the second trait the value obtained by dividing the distance of the first trait by the distance of the second trait, and finally the weight for the third trait was also obtained by dividing the distance of the first trait by the distance of the third trait. Finally, the fitting process is done in part d), and it is the same as the fitting process for the univariate DNN model, except that the training set of the response variables is provided as a list. These steps are only suggestions that can work for some data sets, but there is no guarantee that they can work for all data sets.
Next, we called the flag Code_Tuning_With_Flags_MT_normal.R. In Appendix 4, it is used to implement the tuning process for selecting the best combination of hyperparameters, and after selecting the best combination of hyperparameters, the multi-trait DNN with the optimum hyperparameters is refitted, and the prediction performance using cross-validation is evaluated with this refitted model. The grid search implemented in this code is given next: (1000), learning_rate=c(0.001), val_split=c(0.25)), sample=0.5, confirm =FALSE,echo =F) The results of implementing the last random grid search (sample ¼ 0.5) that consists of four combinations of hyperparameters resulting from two dropout values (0, 0.05) and two values of units (56, 97) are given in Table 12.4, which shows that the best predictions were obtained for trait PH in terms of Pearson's correlation and MAAPE.

DNN with Multivariate Binary Outcomes
For illustrating the process of training multivariate binary DNN models, we used the same data set (Data_Toy_EYT.RData) as in Example 12.1 in this chapter. As was explained in Example 12.1, this data set contains four environments (Bed5IR, EHT, Flat5IR, and LHT), 40 lines in each environment, a genomic relationship matrix of order 40 Â 40, four traits (DTHD, DTMT, GY, and Height) of which DTHD and DTMT are ordinal traits, trait GY is continuous, and trait Height is binary. However, for the implementation of the multivariate binary DNN model, we converted ordinal traits DTHD and DTMT into binary traits, making 0 the levels of 1, and 1 the levels of 2 and 3. For this reason, the illustration of the multivariate binary DNN model was done using the following three traits: DTHD, DTMT, and Height.
These flags (Code_Tuning_With_Flags_MT_Binary.R) can be used with Appendix 4 with some small modifications such as those just described above, which are related to (1) the activation function used for the output layers, (2) the loss function used for each trait since now we should use binary_crossentropy, (3) the metrics used for each trait since now we should use accuracy for each trait, (4) the weights used in the compilation process since now we will use the same weight for each trait, and (5) the prediction process since we will replace the following code of Appendix 4: The random grid search was used since sample ¼ 0.5, for the prediction under a multi-trait binary DNN model, as is shown next: The important thing about this random grid search is that the tuning_run() function calls the flags developed above for the training process of multi-trait binary DDN models. Also, the total number of hyperparameters of the grid search is four, since we set a unique value for all the hyperparameters, except for the hyperparameters dropout with two values (0, 0.05) and the units with another two values (56, 97). It is important to point out that of the four total hyperparameter combinations, only two were evaluated, since we implemented a random grid search by specifying sample ¼ 0.5.
In Table 12.5 is given the prediction performance for the average of the five outer fold cross-validations used for training the multi-trait binary DNN model, the prediction performance reported as metrics, the PCCC, the Kappa coefficient, the sensitivity, and the specificity. From these results, it is evident that the best predictions belong to trait DTHD and trait DTMT. For all trait-environment combinations, the PCCC was larger than 0.5, that is, larger than the probability of a correct classification by chance since we are dealing with binary outcomes.

DNN with Multivariate Ordinal Outcomes
To illustrate the training of multivariate ordinal or categorical DNN models, we used the same data set (Data_Toy_EYT.RData) used in this chapter in Example 12.1 and before for training multivariate binary DNN models. However, now the implementation of the multivariate ordinal DNN model was done with traits DTHD, DTMT, and GY, but since GY is a continuous trait, this was converted to an ordinal outcome in the following way: if GY is less than 3.2, then the outcome was set to 0, but if 3.2 < GY < 5.8, the outcome was set to 1, while if 5.8 < GY < 6.2, the ordinal outcome was set to 2, and finally, if GY > 6.2, the outcome was set to 3. This means the training of the multivariate ordinal DNN model was done with three traits (DTHD, DTMT, and GY) where the first two had three levels and the last one had four levels.
The results of implementing the multi-trait ordinal DNN model using the best combination of hyperparameters resulting from the above research grid are given in Table 12.6, where we can see that the metrics used for evaluating the prediction performance were the same as those used for the multi-trait binary outcomes, but now the best predictions were observed in trait DTHD and the worst in trait GY. However, note that by chance alone now for traits DTHD and DTMT the probability is 1/3 while for trait GY this probability is ¼ since for the first two traits there are three levels, while for the third trait there are four response options.

DNN with Multivariate Count Outcomes
To illustrate the training process of the multivariate count DNN model, we used the data called Data_Multi_Count_Toy.RData, which is a modified version of the data called Data_ Count_Toy.RData. The basic modification is that the modified version has two traits instead of one, which are denoted as y1 and y2. For this reason, again this data set contains 115 lines, evaluated in three environments (Batan2012, Batan2014, and Chunchi2014) and the total number of observations of this data set is 298. Next, we provide the default flags that we suggest placing in an R file called Code_Tuning_With_Flags_MT_Count.R. # add output 1 yhat1 <-base_model %>% layer_dense(units=1,activation="exponential", name="response_1") # add output 2 yhat2 <-base_model %>% layer_dense(units=1, activation="exponential",name="response_2") #c) Compiling the multi-trait ordinal DNN model model <-keras_model(input,list(response_1=yhat1,response_2=yhat2)) %>% compile(optimizer =optimizer_adam(lr=FLAGS$learning_rate), loss=list(response_1="poisson",response_2="poisson"), metrics=list(response_1="mse",response_2="mse"), loss_weights=list(response_1=1,response_2=1)) print_dot_callback The default flags for the multi-trait count DNN model are in part a). The process of building the multi-trait count DNN model (part b) is very similar to the building of continuous multivariate outcomes, except that now the exponential activation function should be used. The compilation process (part c) is the same as the compilation of the multi-trait continuous DNN model, except that now the Poisson loss function is used for each trait under study, and the fitting process (part d) is exactly the same as for multi-trait continuous DNN models.
The flags given above for training multivariate count outcomes (Code_Tuning_With_Flags_MT_Count. R) can be used with Appendix 4 with the following modifications: (1) use the exponential activation function for each output layer, (2) use the Poisson loss function for each trait, and (3) use the same weights for each trait.
Next, we give the random grid search used for tuning a multi-trait count DNN model: Again, we used the tuning_run() function to perform the tuning process which now was done with a random grid search since we specified sample ¼ 0.5. Here the total number of hyperparameters is four, since only hyperparameters % of dropout and number of units have two values.
The prediction performance of training the multi-trait count data is given in Table 12.7. Now the metrics used for evaluating the accuracy were the mean square error of prediction and MAAPE. The best predictions across environments belong to

DNN with Multivariate Mixed Outcomes
Finally, in this chapter, we provided key elements for training DNN models for binary, categorical, count, and continuous outcomes. However, with the implementation of these DNN models, it is clear that DNN models are a novel tool for training univariate and multivariate genomic prediction models for binary, categorical, count, and mixed outcomes. The power to train univariate and multivariate nonlinear regression with count data is unique to deep learning models since similar tools in conventional statistical learning are not efficient and only a few are available. But the gain in training multivariate DNN models is only due to the increase in the sample size by modeling more than one trait simultaneously since the DNN models just studied do not take into account a variance-covariance matrix of traits to capture the correlation between traits. However, we also emphasize that the training process of DNN models is very challenging since more time, thought, experimentation, and resources are required for training these models than other statistical machine learning models studied in this book, since for most of the statistical machine learning algorithms, the search space for finding the optimum combination of hyperparameters is small compared to DNN models. For these reasons, we strongly suggest using the random grid search (or other new approaches like Bayesian optimization or genetic algorithms) since the larger the number of hyperparameters, the bigger the number of hyperparameter combinations that need to be evaluated due to the quick explosion in the number of hyperparameter combinations. For this reason, the use of the random grid search that explores only a fraction of the total combination of hyperparameters is more efficient. Next the default flags are given in an R file called Code_Tuning_With_Flags_MT_Mixed.R. ####a) Declaring the flags for hyperparameters FLAGS = flags( flag_numeric("dropout1", 0.05), flag_integer("units",33), flag_string("activation1", "relu"), flag_integer("batchsize1",56), flag_integer("Epoch1",1000), flag_numeric("learning_rate", 0.001), flag_numeric("val_split",0.2)) ####b) Defining the DNN model input <-layer_input(shape=dim(X_trII)[2],name="covars") # add hidden layers base_model <-input %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) %>% layer_dense(units =FLAGS$units, activation=FLAGS$activation1) %>% layer_dropout(rate =FLAGS$dropout1) # add output 1 yhat1 <-base_model %>% layer_dense(units = 3,activation="softmax", name="response_1") # add output 2 yhat2 <-base_model %>% layer_dense(units = 1, activation="exponential",name="response_2") # add output 3 yhat3 <-base_model %>% layer_dense(units = 1, name="response_3") # add output 3 yhat4 <-base_model %>% layer_dense(units =1, activation="sigmoid",name="response_4") #c) Compiling the multi-trait mixed DNN model Model=keras_model(input,list(response_1=yhat1,response_2=yhat2, response_3=yhat3,response_4=yhat4)) %>% compile(optimizer =optimizer_adam(lr=FLAGS$learning_rate), loss=list(response_1="categorical_crossentropy",response_2="mse", response_3="mse",response_4="binary_crossentropy"), metrics=list (response_1="accuracy",response_2="mse",response_3="mse", response_4="accuracy"), loss_weights=list(response_1=1, response_2=1,response_3=1,response_4=1)) print_dot_callback <-callback_lambda( on_epoch_end = function(epoch, logs) { if (epoch %% 20 == 0) cat("\n") cat(".") }) early_stop <-callback_early_stopping(monitor = c("val_loss"), mode='min', patience =50) The default flags for the multi-trait mixed outcome DNN model are in part a). The building process of the multi-trait mixed outcome DNN model (part b) is different only in the specification of the outputs. Since now we have specified an activation function for each response variable, the first response variable is an ordinal output with three categories. For this reason, we specified three neurons and the softmax activation function. The second response variable is a count, and for this reason, we specified one unit (neuron) and the exponential activation function. The third response variable is continuous, and for this reason, we specified only one neuron and the linear activation function (default activation function). Finally, the last response variable is binary, and for this reason, we specified only one neuron and the sigmoid activation function. With regard to the compilation process (part c), we had to specify a mixture loss function due to the fact that we had multivariate mixed outcomes. For this reason, the mixture loss function in this example is composed of four types of losses: "categorical_crossentropy," "mse," "mse," and "binary_crossentropy," which are appropriate for a mixed outcome of ordinal, count, continuous, and binary response variables. The same type of mixture is required for specifying the metrics to evaluate the prediction accuracy to guarantee that they are in agreement with the response variables. For this reason, in this case, the specification of the metrics contains: "accuracy," "mse," "mse," and "accuracy." The first metric is for the ordinal response variable, the second for the count output, the third for the continuous outcome, and the last one for the binary outcome. The fitting process (part d) is exactly the same as the fitting process of the previous multivariate deep learning models with only one type of scale in the output.
The flags given above for training multivariate mixed outcomes (Code_Tuning_With_Flags_MT_Mixed. R) can be used with Appendix 5.
The prediction performance of training multi-trait mixed outcome data is given in Table 12.8. The metrics used for evaluating the accuracy were MAAPE for the continuous and count response variables, and the proportion of cases correctly classified (PCCC) for the ordinal and binary traits. The best predictions for the two traits evaluated with MAAPE were those of trait GY, while for the ordinal and binary traits we can see that the PCCC was better for the trait Height that includes only two categories (Table 12.8). Again, these predictions belong to the best combination of hyperparameters of the above grid. Practitioners should remember that we only reported the MAAPE for the count and continuous traits, since it made no sense for the ordinal and binary outcomes. Similarly, we only reported the PCCC for the binary and ordinal traits since it made no sense for the count and continuous outcomes.