
13.1 The Importance of Convolutional Neural Networks

Convolutional neural networks (CNNs or ConvNets) are a specialized form of deep neural networks for analyzing input data that contain some form of spatial structure (Goodfellow et al. 2016). CNNs are primarily used to solve computer vision problems (such as self-driving cars, robotics, drones, security, medical diagnoses, treatments for the visually impaired, etc.) using images as input, since they take advantage of the grid structure of the data. CNNs work by first deriving low-level representations, such as local edges and points, and then composing higher-level representations, such as overall shapes and contours. The name of these deep neural networks is due to the fact that they apply convolutions, a type of linear mathematical operation. CNNs use convolution in at least one of their layers, instead of the general matrix multiplication used by the feedforward deep neural networks studied in the previous chapters. Compared to other classification algorithms, the preprocessing required by a CNN is considerably lower. For these reasons, CNNs have been used to identify faces, individuals, tumors, objects, street signs, etc. Thanks to successful commercial applications of this type of deep neural network, the term “deep learning” was coined, and it is the most popular name given to artificial neural networks with more than one hidden layer. Their popularity is also attributed in part to the fact that the best CNNs today reach or exceed human-level performance, a feat considered impossible by most experts in computer vision only a few decades ago.

The inspiration for CNNs came from a cat’s visual cortex (Hubel and Wiesel 1959), which has small regions of cells that are sensitive to specific regions in the visual field; this means that when specific areas of the visual field are excited, then those cells in the visual cortex will be activated as well. In addition, it is important to point out that the excited cells depend on the shape and orientation of the objects in the visual field. This implies that horizontal edges excite some neurons, while vertical edges excite other neurons (Hubel and Wiesel 1959).

Successful applications of CNNs are not limited to the examples given above. They have also been used to solve natural language processing problems using digitized text as input, and they are a powerful tool for analyzing words as discrete textual units (Patterson and Gibson 2017). In addition, they have been applied (a) to sound data represented as a spectrogram, (b) to predicting time series (one-dimensional data) using historical information as input, and (c) to sentiment analysis. In general, CNNs are very popular for dealing with input data that have some structure: image data (one, two, three, or more dimensions), video data (three dimensions), genetic data, etc.

CNNs are deep neural networks that most efficiently take advantage of both the extraordinary computing power and very large data sets that are available in many fields nowadays. Also, CNNs bypass the need to explicitly define which independent variables (inputs) should be included or selected for the analysis, since they optimize a complete end-to-end process to map data samples to outputs that are consistent with the large, labeled data sets used for training a deep neural network (Wang et al. 2019). Empirical evidence shows that CNNs are a powerful tool for image analysis using many layers of filters. In addition, the improvements in accuracy have been so extraordinarily large in the last few years that they have changed the research landscape in this area. Low-level image features (e.g., detecting horizontal or vertical edges, bright points, or color variations) are captured by the first filters and subsequent layers capture increasingly complicated combinations of earlier features, and when the training data set is large enough, CNNs most of the time outperform other machine learning algorithms (Wang et al. 2019). In addition, CNNs are expected to outperform feedforward deep neural networks since they exploit part of the correlation between nearby pixel positions. For the classification problem of attempting to label which of 1000 different objects are in an image, results have improved from 84.6% (in 2012) to 96.4% in 2015. For this reason, CNNs are being applied to complex tasks in plant genomics like (a) root and shoot feature identification (Pound et al. 2017), (b) leaf counting (Dobrescu et al. 2017; Giuffrida et al. 2018), (c) classification of biotic and abiotic stresses (Ghosal et al. 2018), (d) counting seeds per pot (Uzal et al. 2018), (e) detection of wheat spikes (Hasan et al. 2018), and (f) estimating plant morphology and developmental stages (Wang et al. 2019). These examples show their great potential for accurately estimating plant phenotypes directly from images.

This chapter provides the basic concepts and definitions related to CNNs, as well as the main elements to implement this type of deep learning models. Finally, some examples of CNNs in the context of genomic prediction are provided.

13.2 Tensors

Nowadays, machine learning algorithms use tensors as their basic data structure. Tensors are of paramount importance in machine learning and are so fundamental that Google’s framework for machine learning is called TensorFlow. For this reason, here we define this concept in detail.

By tensors we understand a generalization of vectors and matrices to an arbitrary number of dimensions.

0D tensors

These tensors are scalars represented by only one number; for this reason, we call them zero-dimensional tensors (0D tensors) (Chollet and Allaire 2017). For example, a 0D tensor is when you measure the temperature on a specific day in the city of Colima, Mexico, which can be 30 °C. Another example can be the tons per hectare produced by a hybrid in a specific year or location, which can be 8 tons per hectare.

1D tensors

These tensors are vectors as defined in linear algebra, that is, a one-dimensional array of numbers; for this reason, they are called one-dimensional tensors (1D tensors) (Chollet and Allaire 2017). This type of tensor has exactly one axis. For example, the grain yield of the same hybrid measured in five environments is a 1D tensor; in this case, it is equal to GY = c(9,7.8,6.4,8.2,5.9). Another example of a 1D tensor is the temperature measured over the last 7 days (1 week) in Colima, Mexico, that is, Tem = c(28, 29, 30, 32, 33, 34, 34.5).

2D tensors

These tensors are arrays of numbers in two dimensions (rows and columns); for this reason, they are called two-dimensional tensors (2D tensors) or matrix arrays. An example of a 2D tensor can be the grain yield measured over 3 years in five environments for the same hybrid (Table 13.1). Now this tensor has two axes.

Table 13.1 Example of a 2D tensor

Another example is the information of an image taken in grayscale, since an image is nothing but a matrix of pixel values. For example, an image in grayscale can be represented in two dimensions (height and width) and assuming that the image’s height = 5 pixels and width = 5 pixels, the 2D tensor is given in Table 13.2.

Table 13.2 Example of a 2D tensor of a grayscale image, 5 pixels in height and 5 pixels in width

Most of the data sets used in this book are 2D tensors, since the samples (or observations) are in the rows (first axis) and the variables measured for each observation are in the columns (second axis).

3D tensors

These tensors are arrays of numbers in three dimensions; for this reason, they are called three-dimensional tensors (3D tensors). An example of a 3D tensor is the previous example extended to three traits, that is, measurements taken in 3 years and five environments for three traits (Fig. 13.1).

Fig. 13.1

A 3D tensor for grain yield measured in five environments, 3 years, and three traits

Other examples are color images, which, in addition to the height and width given in pixels, have a third axis for depth. This can be appreciated in Fig. 13.2. The height and width of an image are easy to understand, but the depth is less obvious; it is defined as the axis that encodes the color channels. For example, Red-Green-Blue (RGB) images have a depth of 3, since they encode the three primary colors along this axis (see Fig. 13.2).

Fig. 13.2

A 3D tensor of a Red-Green-Blue (RGB) image of a dimension of 4 × 4 × 3

4D tensors

A 4D tensor is obtained when we pack 3D tensors into a new array. One example is adding one more dimension to the example of measuring grain yield in 3 years, five environments, and three traits. This extra dimension can be the samples (lines), which means that a 3D tensor is measured for each line. The same idea applies to the 3D tensor of an RGB image when a color (RGB) image is measured on several samples (plants or objects), which turns the 3D tensor into a 4D tensor.
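As a minimal sketch (base R plus the keras package, with purely illustrative random values), a 4D tensor like the one just described can be built directly with array(); note also that array_reshape(), used in the reshaping examples below, comes from the reticulate package and is re-exported by keras, so we assume the package has been loaded.

library(keras)   # provides array_reshape(), used in the examples below

# Hypothetical 4D tensor: 10 lines x 3 years x 5 environments x 3 traits,
# filled with random values only to illustrate the shape
GY_4D <- array(rnorm(10 * 3 * 5 * 3), dim = c(10, 3, 5, 3))
dim(GY_4D)        # 10  3  5  3
GY_4D[1, , , 1]   # 3 x 5 slice (years x environments) for line 1, trait 1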

For practical purposes, in R we should use the array_reshape() function to reshape a tensor. As an illustration, first we will use the data in Table 13.1 corresponding to a 2D tensor and we put this information in the following matrix:

GY = matrix(c(9, 7.8, 6.4, 8.2, 5.9,
              7, 6.8, 6.0, 7.5, 5.5,
              8.0, 7.2, 6.3, 7.8, 5.7), ncol = 5, byrow = T)
> GY
     [,1] [,2] [,3] [,4] [,5]
[1,]    9  7.8  6.4  8.2  5.9
[2,]    7  6.8  6.0  7.5  5.5
[3,]    8  7.2  6.3  7.8  5.7

Now we reshape the GY 2D tensor into a 1D tensor.

GY1 = array_reshape(GY, dim = c(15, 1))

This produces a column vector with 15 elements (a 1D tensor); its transpose is equal to

> t(GY1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,]    9  7.8  6.4  8.2  5.9    7  6.8    6  7.5   5.5     8   7.2   6.3   7.8   5.7

Now the GY 2D tensor will be reshaped to a 3D tensor of dimension 1 × 5 × 3; the R code is

GY2=array_reshape(GY, dim=c(1,5,3))

That output is a 3D tensor

> GY2
, , 1
     [,1] [,2] [,3] [,4] [,5]
[1,]    9  8.2  6.8  5.5  6.3

, , 2
     [,1] [,2] [,3] [,4] [,5]
[1,]  7.8  5.9    6    8  7.8

, , 3
     [,1] [,2] [,3] [,4] [,5]
[1,]  6.4    7  7.5  7.2  5.7

Here , , 1; , , 2; and , , 3 denote depth slices 1, 2, and 3 of the 3D tensor, and for each slice, a matrix of dimension 1 × 5 is given as output.

Next, we reshape GY2, the 3D tensor, into a 2D tensor of dimension 5 × 3.

GY3 = array_reshape(GY2, dim = c(5,3))

That produces as output

> GY3
     [,1] [,2] [,3]
[1,]  9.0  7.8  6.4
[2,]  8.2  5.9  7.0
[3,]  6.8  6.0  7.5
[4,]  5.5  8.0  7.2
[5,]  6.3  7.8  5.7

In similar fashion, you can reshape the last 2D tensor to a 4D tensor of dimension 5 × 1 × 1 × 3, using the following R code:

GY4 = array_reshape(GY3, dim = c(5,1,1,3))

That produces as output each of the components of the 4D tensor.

> GY4
, , 1, 1
     [,1]
[1,]  9.0
[2,]  8.2
[3,]  6.8
[4,]  5.5
[5,]  6.3

, , 1, 2
     [,1]
[1,]  7.8
[2,]  5.9
[3,]  6.0
[4,]  8.0
[5,]  7.8

, , 1, 3
     [,1]
[1,]  6.4
[2,]  7.0
[3,]  7.5
[4,]  7.2
[5,]  5.7

The output is in three blocks, and each block contains a 2D tensor of dimension 5 × 1. The last two numbers in the name of each block refer to the positions in dimensions 3 and 4, respectively. For all blocks, the third position is 1, since that dimension has size 1, while the fourth position takes the values 1, 2, and 3, since that dimension has three components.

13.3 Convolution

Convolution comes from the Latin convolvere, which means “to roll together”; it is a mathematical operation that merges two sets of information. Formally, convolution is an operation performed on two functions to produce a third one that is usually interpreted as a modified (filtered) version of one of the original functions (Berzal 2018). The convolution of the functions f and g, usually denoted by an * or a ⋆, is defined as the integral of the product of the two functions after one of them is reflected and shifted:

$$ \left(f\star g\right)(t)=\underset{-\infty }{\overset{\infty }{\int }}f\left(\tau \right)g\left(t-\tau \right) d\tau =\underset{-\infty }{\overset{\infty }{\int }}f\left(t-\tau \right)g\left(\tau \right) d\tau $$

In digital signal processing, when we use discrete signals, the previous integral becomes a summation (Berzal 2018):

$$ \left(f\star g\right)\left[n\right]=\sum \limits_{m=-\infty}^{\infty }f\left[m\right]g\left[n-m\right]=\sum \limits_{m=-\infty}^{\infty }f\left[n-m\right]g\left[m\right] $$

Normally, one of the convolution operands is the signal that we want to process, x[n], and the other corresponds to the filter, h[n], with which we process the signal. When the filter is finite and defined only in the domain {0, 1, ..., K − 1}, the convolution operation consists of, for each value of the signal, performing K multiplications and K − 1 sums (Berzal 2018):

$$ \left(x\star h\right)\left[n\right]=\sum \limits_{k=0}^{K-1}h\left[k\right]\,x\left[n-k\right] $$

According to Berzal (2018), the convolution operation, which so far we have defined on functions of one variable, can be easily extended to the multidimensional case. For discrete signals defined on two variables, to which a filter of size K1 × K2 is applied, the convolution is calculated using the following expression:

$$ \left(x\star h\right)\left[{n}_1,{n}_2\right]=\sum \limits_{k_1=0}^{K_1-1}\sum \limits_{k_2=0}^{K_2-1}h\left[{k}_1,{k}_2\right]\,x\left[{n}_1-{k}_1,{n}_2-{k}_2\right] $$

In digital image processing, in which the variables [n1, n2] correspond to the coordinates [x, y] of the pixels of an image, the minus sign that appears in the definition of the convolution operation is usually replaced by a plus sign, so that the convolution is calculated as (Berzal 2018):

$$ \left(x\star h\right)\left[x,y\right]=\sum \limits_{k_1=0}^{K_1-1}\sum \limits_{k_2=0}^{K_2-1}h\left[{k}_1,{k}_2\right]\,x\left[x+{k}_1,y+{k}_2\right] $$

Technically, this is not a convolution but a similar operation called cross-correlation. In the discrete case for real signals, the only difference is that the filter h[x, y] appears reflected with respect to the formal definition of convolution (Berzal 2018). Given that the difference is minimal, in many cases we still talk about convolution. The input to a convolution is either raw data or a feature map output by another convolution (Patterson and Gibson 2017). Finally, we can say that the convolution operation is a special kind of multiplication that is key in signal processing.

To better understand the convolution operation, let’s use a simple example. Suppose we have a grayscale image captured by a camera. This image, 4 × 4 pixels, can be represented by a matrix of dimension 4 × 4 (left of Fig. 13.3) to which we apply a filter (middle part of Fig. 13.3) of dimension 2 × 2. A dot product is then computed between each patch (a square matrix of the same size as the filter) of the original image and the filter, producing the output given on the right. The first patch starts in the upper left-hand corner of the underlying image, and we move the filter across the image step by step until it reaches the lower right-hand corner. At each step, the filter is multiplied by the input data values within its bounds, creating a single entry in the output feature map (Patterson and Gibson 2017). We end up with a convolved matrix of dimension 3 × 3, which is smaller than the original matrix. The size of the step by which we move the filter across the image is called the stride; this means that you can move the filter to the right one, two, three, or more positions at a time.

Fig. 13.3

Example of a 2D convolution with a kernel (filter) of dimension 2 × 2 and stride of 1

The convolutional operation in Fig. 13.3 is not performed on one pixel at a time, but on square patches of pixels that are passed through a filter. The filter is a square matrix of smaller dimension than the original image and of the same size as the patch. The filter is called a kernel (a term also used in support vector machines, but with a different meaning) whose job is to find patterns in the pixels of the image. If the patch matrix and the filter have high values in the same positions, the dot product will be large; otherwise, it will be small. You can use a matrix of dimension 5 × 5 or 7 × 7 as a kernel, but most of the time a 3 × 3 matrix is used.

The output matrix resulting from the convolutional operation is called the activation map. The number of columns in this matrix is called the width and depends on the step size with which the filter traverses the underlying image. Larger strides (step sizes) lead to fewer steps and result in a smaller activation map matrix. Part of the power of convolutional neural networks is related to this operation, since the larger the stride and the filter, the smaller the dimension of the activation map produced, which considerably reduces the computational resources required.

Figure 13.3 shows that the input image has height (Lh0 = 4), width (Lw0 = 4), and depth (Ld0 = 1), and the filter has height (Fh0 = 2), width (Fw0 = 2), and depth (Fd0 = 1). Assuming a stride of one (S0 = 1), the output matrix (activation map), after performing the convolutional operation, has height Lh1 = Lh0 − Fh0 + 1 = 4 − 2 + 1 = 3, width Lw1 = Lw0 − Fw0 + 1 = 4 − 2 + 1 = 3, and depth Ld1 = Ld0 = Fd0 = 1 (this did not change). In general, the height, width, and depth of the activation map, with a stride equal to or larger than 1 in the lth layer, can be calculated as Lh(l + 1) = (Lh(l) − Fh(l))/S(l) + 1, Lw(l + 1) = (Lw(l) − Fw(l))/S(l) + 1, and Ld(l + 1) = Ld(l) = Fd(l), respectively.

To completely understand the convolutional operation, below we provide other examples.

The input image in Fig. 13.4 has height (Lh0 = 5), width (Lw0 = 5), and depth (Ld0 = 1), and the filters have height (Fh0 = 3), width (Fw0 = 3), and depth (Fd0 = 1) and since a stride size of 1 will be used, the activation map, after performing the convolutional operation, has height Lh1 = Lh0 − Fh0 + 1 = 5 − 3 + 1 = 3, width Lw1 = Lw0 − Fw0 + 1 = 5 − 3 + 1 = 3, and depth Ld1 = Ld0 = Fd0 = 1 (this did not change).

Fig. 13.4

Example of a convolution operation with a stride of 1

The input image in Fig. 13.5 has height (Lh0 = 5), width (Lw0 = 5), and depth (Ld0 = 1), and the filter has height (Fh0 = 3), width (Fw0 = 3), and depth (Fd0 = 1), but now using a stride of 2 (S0 = 2). Therefore, the activation map, after performing the convolutional operation, has height Lh1 = (Lh0 − Fh0)/S0 + 1 = (5 − 3)/2 + 1 = 2, width Lw1 = (Lw0 − Fw0)/S0 + 1 = (5 − 3)/2 + 1 = 2, and depth Ld1 = Ld0 = Fd0 = 1 (this did not change).

Fig. 13.5

Example of a convolution operation with a stride of 2
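To make the operation concrete, the following sketch (base R only; the image and filter values are hypothetical) implements the cross-correlation described above, sliding a filter over an image with a given stride and taking a dot product at each position; it reproduces the output sizes derived for Figs. 13.4 and 13.5.

# Valid cross-correlation (the "convolution" used in CNNs) with a given stride
conv2d_valid <- function(img, filt, stride = 1) {
  Fh <- nrow(filt); Fw <- ncol(filt)
  out_h <- (nrow(img) - Fh) %/% stride + 1
  out_w <- (ncol(img) - Fw) %/% stride + 1
  out <- matrix(0, out_h, out_w)
  for (i in 1:out_h) {
    for (j in 1:out_w) {
      r <- (i - 1) * stride + 1              # top-left row of the current patch
      c <- (j - 1) * stride + 1              # top-left column of the current patch
      patch <- img[r:(r + Fh - 1), c:(c + Fw - 1)]
      out[i, j] <- sum(patch * filt)         # dot product of patch and filter
    }
  }
  out
}

img  <- matrix(sample(0:9, 25, replace = TRUE), 5, 5)   # toy 5 x 5 "image"
filt <- matrix(c(1, 0, -1, 1, 0, -1, 1, 0, -1), 3, 3)   # hypothetical 3 x 3 filter
dim(conv2d_valid(img, filt, stride = 1))                # 3 3, as in Fig. 13.4
dim(conv2d_valid(img, filt, stride = 2))                # 2 2, as in Fig. 13.5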

13.4 Pooling

Pooling is a mathematical operation that is also required in CNNs. A pooling operation replaces the output of the convolution operation at a certain location with a summary statistic of the nearby outputs. The pooling operation is also called downsampling or subsampling in machine learning. The two most popular pooling operations are max pooling and average pooling and, as in convolution, the process is applied one patch at a time. Max pooling performs dimensional reduction and de-noising, while average pooling mostly performs dimensional reduction. The max pooling operation summarizes the input as the maximum within a rectangular neighborhood and does not introduce any new parameter to the CNN, with the advantage that the total number of parameters in the model is reduced considerably by this operation. For example, assume that Fig. 13.6 is the input (which can be the raw information of the image or the convolved information of an image) and that we apply the max pooling operation with a filter (kernel) of dimension 2 × 2. The max operation takes the largest of the four numbers in the filter area, and the pooling process with the 2 × 2 filter starts in the upper left-hand corner of the input image; since we use a stride of one, we move the filter across the image in steps of one until we reach the lower right-hand corner. At each step, we take the largest of the four elements within the filter bounds, and with these values we create the output matrix. The first output value, for the first four values (in the upper left-hand corner), is 7, since this is the maximum of the four elements within the bounds of the filter (3, 3, 7, 4). The second value (after moving the filter one column to the right) is 5, the maximum of 3, 4, 4, and 5, while the last value is 6, the maximum of 3, 2, 5, and 6 in the last position (lower right-hand corner). In this case, the size of the output matrix using a stride of 1 is 3 × 3, which is smaller than the original input matrix. By pooling after convolution, we throw away some information; since the network does not care about the exact location of a pattern, it is enough to know whether the pattern is present or not.

Fig. 13.6

Max pooling with 2 × 2 filters and stride of 1

In general, the output after applying the max pooling or average pooling operation depends on the size of the small grid region of size Ph(l) × Pw(l) and the stride size S(l). For this reason, the height, width, and depth of the activation map, after applying the pooling operation with a stride larger than 1 in the lth layer, can be calculated as Lh(l + 1) = (Lh(l) − Ph(l))/S(l) + 1, Lw(l + 1) = (Lw(l) − Pw(l))/S(l) + 1, and Ld(l + 1) = Ld(l), respectively. For example, Fig. 13.6 contains an image of height (Lh0 = 4), width (Lw0 = 4), and depth (Ld0 = 1), and using filters with height (Ph0 = 2), width (Pw0 = 2), depth (Pd0 = 1), and a stride (S0 = 1), the output activation map, after performing the max pooling operation, has height Lh1 = (Lh0 − Ph0)/S0 + 1 = 4 − 2 + 1 = 3, width Lw1 = (Lw0 − Pw0)/S0 + 1 = 4 − 2 + 1 = 3, and depth Ld1 = Ld0 = 1 (this did not change).

However, if we apply a stride of 2 to the same input of Fig. 13.6, the output matrix is of dimension 2 × 2, since the output activation map after performing the max pooling operation has height Lh1 = (Lh0 − Ph0)/S0 + 1 = (4 − 2)/2 + 1 = 2, width Lw1 = (Lw0 − Pw0)/S0 + 1 = (4 − 2)/2 + 1 = 2, and depth Ld1 = Ld0 = 1 (this did not change), that is, the dimension of the output matrix after pooling is halved (Fig. 13.7). A stride of 2 means that the filter is moved to the right two columns at a time.

Fig. 13.7

Max pooling with 2 × 2 filters and stride of 2

Figure 13.8 illustrates, for the same input image given in the two previous figures, average pooling with a 2 × 2 filter (kernel) and a stride of 1; the only difference is that now, instead of taking the maximum of the four values, the average is calculated within the bounds of the filter. For the first four values corresponding to the upper left-hand corner of Fig. 13.8, we calculated the average of the four values (3, 3, 7, 4) and obtained 17/4 = 4.25. The averages for the remaining positions of the filter, from the upper left-hand corner to the lower right-hand corner of the input image, were calculated in the same fashion.

Fig. 13.8

Average pooling with 2 × 2 filters and stride of 1

Figure 13.9 illustrates average pooling with a filter of size 2 × 2 and a stride of 2, where the output is part of the output of Fig. 13.8, but without the elements in column 2 and row 2, due to the fact that the stride used in Fig. 13.9 is 2.

Fig. 13.9

Average pooling with 2 × 2 filters and stride of 2

The size of the filter is not restricted to 2 × 2; it can be of any size. As pointed out above, the most common filter size is 2 × 2, but filters of size 6 × 6, 8 × 8, or any other size can be applied. Figure 13.10 illustrates the use of a 3 × 3 filter with the max pooling operation. In this case, for the first nine values that fall within the bounds of the filter, the maximum value is 5. The image has height (Lh0 = 5), width (Lw0 = 5), and depth (Ld0 = 1), and we use a filter with height (Ph0 = 3), width (Pw0 = 3), depth (Pd0 = 1), and a stride (S0 = 1). Therefore, the output activation map after performing the max pooling operation has height Lh1 = (Lh0 − Ph0)/S0 + 1 = 5 − 3 + 1 = 3, width Lw1 = (Lw0 − Pw0)/S0 + 1 = 5 − 3 + 1 = 3, and depth Ld1 = Ld0 = 1 (this did not change).

Fig. 13.10

Max pooling with 3 × 3 filters and stride of 1
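As an illustration only (base R, with toy values), the following sketch applies max or average pooling over square patches with a given stride and reproduces the output sizes derived above.

# Pooling over square patches of size p x p with a given stride
pool2d <- function(img, p = 2, stride = p, fun = max) {
  out_h <- (nrow(img) - p) %/% stride + 1
  out_w <- (ncol(img) - p) %/% stride + 1
  out <- matrix(0, out_h, out_w)
  for (i in 1:out_h) {
    for (j in 1:out_w) {
      r <- (i - 1) * stride + 1
      c <- (j - 1) * stride + 1
      out[i, j] <- fun(img[r:(r + p - 1), c:(c + p - 1)])   # summary of the patch
    }
  }
  out
}

img <- matrix(sample(0:9, 16, replace = TRUE), 4, 4)   # toy 4 x 4 input
pool2d(img, p = 2, stride = 1, fun = max)    # 3 x 3 output, as in Fig. 13.6
pool2d(img, p = 2, stride = 2, fun = max)    # 2 x 2 output, as in Fig. 13.7
pool2d(img, p = 2, stride = 1, fun = mean)   # average pooling, as in Fig. 13.8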

The pooling operation is done at the level of each activation map, whereas the convolutional operation uses all feature maps simultaneously, in combination with a filter, to produce a single feature value; pooling operates independently on each feature map to produce another feature map, which means that pooling does not change the number of feature maps. In simple terms, the depth of the layer created by pooling is the same as the depth of the layer on which the pooling operation was performed. Among the many pooling operations, max pooling is the most popular downsampling operation, and the second most popular is average pooling. Pooling layers, that is, layers in neural networks that apply pooling operations to their input, are commonly inserted between successive convolutional layers. Convolutional layers followed by pooling layers are applied to progressively reduce the spatial size (width and height) of the data representation (Patterson and Gibson 2017). Besides reducing the size of the representation, pooling layers also help to control overfitting. When the input information is in three dimensions (a 3D tensor), the pooling operation works independently on every depth slice of the input, which means that only the width and the height of the tensor are reduced in size but not the depth. Most of the time, the pooling operation is performed on square grid patches of the image. In a 2D-CNN, each pixel within the image is represented by its x and y positions as well as by the depth, which represents the image channels (red, green, and blue). A filter of 2 × 2 pixels, for example, moves over the image both horizontally and vertically.

13.5 Convolutional Operation for 1D Tensor for Sequence Data

The convolutional operation described above is for 2D tensors, but it can also be applied to sequence data in 1D tensors; instead of extracting 2D patches from the image tensor, 1D patches (subsequences) are extracted from the sequences (Chollet and Allaire 2017), as can be observed in Fig. 13.11.

Fig. 13.11

Illustration of how the 1D-convolutional operation works in a sequence of data

This type of 1D-convolutional operation captures temporal relationships; for this reason, it is good at identifying local patterns in a sequence. Since the operation is performed on every patch, a pattern learned at a certain position in a sequence can later be recognized at a different position. The 1D-convolutional operation is useful when you are interested in capturing features from shorter (fixed-length) segments of the data set and when the specific location of the feature within the sequence is not important. It is very practical for dealing with time series data or for analyzing signal data over a fixed-length period. The pooling operation can also be performed on one-dimensional data, in a fashion similar to the convolutional operation illustrated in Fig. 13.11. In addition, 1D-CNNs differ from 2D-CNNs in that they allow larger filter sizes: in a 1D-CNN, a filter of size 7 or 9 contains only 7 or 9 feature vectors, whereas in a 2D-CNN, a filter of size 7 (i.e., 7 × 7) contains 49 feature vectors, making it a very broad selection.
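A minimal base-R sketch of this 1D operation (with a hypothetical sequence and filter) is given next: the filter slides along the sequence and a dot product is taken over each subsequence (patch).

# 1D valid cross-correlation of a sequence with a filter of length K
conv1d_valid <- function(x, filt, stride = 1) {
  K <- length(filt)
  starts <- seq(1, length(x) - K + 1, by = stride)
  sapply(starts, function(s) sum(x[s:(s + K - 1)] * filt))
}

x    <- c(1, 3, 2, 5, 4, 6, 2, 1)   # toy sequence (1D tensor) of length 8
filt <- c(1, 0, -1)                 # hypothetical filter of size 3
conv1d_valid(x, filt)               # output of length 8 - 3 + 1 = 6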

13.6 Motivation of CNN

The feedforward or fully connected networks studied in the two previous chapters use all the information of the image as input. This means that if we have images of 9500 × 8800 pixels, the dimension of the input is 83,600,000 pixels, since this topology uses as input the stacked information of the complete image, which is computationally expensive. By applying convolutions and pooling, CNNs reduce the dimension of the image considerably, without relevant loss of critical information, which makes it easier to process. CNNs are also preferred over feedforward networks for image input because they are not only good at learning features but are also scalable to massive data sets, since CNNs can extract high-level features such as edges from the input image without significant loss of information. The process of how the convolutional and pooling operations are included in the training process of deep learning is illustrated in Fig. 13.12, which shows that, most of the time, a convolutional layer is composed of three stages: first, the convolution operation is applied to the input; this is followed by a nonlinear transformation (like ReLU, hyperbolic tangent, or another activation function); then the pooling operation is applied. With this convolutional layer, we significantly reduce the size of the input without relevant loss of information. The convolutional layer picks up different signals of the image by passing many filters over each image, which is key for reducing the size of the original image (input) without losing critical information; in early convolutional layers we capture the edges in the image. CNNs that apply convolutional layers are scalable and robust because they learn different portions of the feature space. It is important to point out that the input to a CNN is not restricted to tensors in two dimensions; it can have one, two, three, or more dimensions.

Fig. 13.12

Illustration of the application of a convolutional layer that consists of the convolution operation followed by the ReLU nonlinear transformation and the pooling operation to reduce the input data

The output of the first convolutional layer can be followed by more convolutional layers, or its output can be stacked (flattened) directly to be used as input for a feedforward network, as can be observed in Fig. 13.13.

Fig. 13.13

Convolutional neural network

Compared to other machine learning algorithms, CNNs require less preprocessing since the convolutional layers (convolution + nonlinear transformation + pooling) make it possible to filter out noise and capture the spatial and temporal dependencies of the input more efficiently. CNNs also provide more efficient fitting because they reduce the number of parameters that need to be estimated, thanks to the reduction in the size of the input, parameter sharing, and the fact that each input is connected only to some neurons. CNNs also allow reusing weights in many circumstances, which facilitates the training process even with data sets that are not really large. Figure 13.13 indicates that, depending on the complexity of the input (images), more than one convolutional layer may be needed to capture low-level details with more precision, at the cost of increased computational resources. Figure 13.13 also shows that after the convolutional layers have been applied, the output of the last convolutional layer is flattened into a column vector (stacked as one long vector) and fed into a typical feedforward deep neural network for regression or classification purposes; backpropagation is applied at every iteration of training.

13.7 Why Are CNNs Preferred over Feedforward Deep Neural Networks for Processing Images?

To explain why the feedforward networks studied in the previous chapters (Chaps. 10, 11, and 12) are not the best option for processing images, we assume that the input consists of images in RGB format, that is, 3D tensors of 256 × 256 × 3 pixels. This means that we have a matrix of dimension 256 × 256 for each of the three colors (Red, Green, and Blue). Stacking the 3D tensor into a 1D tensor implies that the input consists of 196,608 columns, which requires learning the same number of weights (parameters) for each node in the first hidden layer of a feedforward network, also called a fully connected deep neural network. Assuming that we use 300 neurons in the first hidden layer, we would need to estimate 58,982,400 weights for the first hidden layer alone, which is computationally very demanding. Moreover, the greater the number of parameters that need to be estimated, the larger the training set that should be used to avoid overfitting, which of course increases the computational resources needed for the training process. On the other hand, there is evidence that CNNs require at most half of the parameters needed by a feedforward deep network, improve accuracy, and reduce training time substantially. There is also evidence that feedforward deep neural networks most of the time do not improve accuracy substantially after adding 5 hidden layers, while CNNs can continue improving prediction accuracy by adding more hidden layers; for this reason, there are applications of CNNs that use up to 150 hidden layers. Another disadvantage of feedforward deep neural networks is that they do not take advantage of structural information (correlation between pixels), since these networks transform the 3D tensor format of the images into a linear 1D tensor, which causes a loss of this structural information. Figure 13.14 shows how the information of an image in a 3D tensor is converted to a linear 1D tensor used as input for a feedforward deep neural network. Figure 13.14 also shows that the input image in a 3D tensor format with depth = 3 (three colors) and width and height of 256 pixels produces a total input of size 256 × 256 × 3 = 196,608 pixels.
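The count quoted above can be verified with one line of arithmetic (a sketch in R, biases excluded):

inputs  <- 256 * 256 * 3   # 196,608 stacked pixel values per image
neurons <- 300             # neurons in the first hidden layer
inputs * neurons           # 58,982,400 weights for the first hidden layer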

Fig. 13.14

Image of a 3D tensor converted to a 1D tensor that is used as input for a fully feedforward deep neural network

Since CNNs can process the original 2D image at each depth, they are able to recognize shapes and capture the correlation between pixels using filter matching. Next, we want to develop a machine learning algorithm for the automatic classification of diseased and non-diseased wheat plants using images in a 3D format of dimension 256 × 256 × 3 (Varma and Das 2018). Since we want to develop a classifier for binary outcomes under a linear model like logistic regression, the predictor (pre-activation) should be

$$ {z}_i=\sum \limits_{j=1}^{\mathrm{196,608}}{w}_j{x}_{ij}+b $$

The weights wj, 1 ≤ j ≤ 196,608, can be interpreted as a filter for the category of interest (diseased plants), so that the classification operation can be interpreted as a filter matching the input image, as shown in Fig. 13.15.

Fig. 13.15

Classification using filter matching

This interpretation of linear filtering as filter matching suggests that it is possible to improve the system: instead of using a filter that attempts to match the entire image, we can use smaller filters that try to match objects in local portions of the image. This has the following advantages: (a) smaller filters have fewer weights and, therefore, fewer parameters and (b) even if the object being detected moves around the image, the same filter can still be used, which leads to translation invariance (Varma and Das 2018). This is the main idea behind CNNs, as illustrated in Fig. 13.16. Increasing the number of filters in a particular layer increases the number of feature maps (i.e., the depth) in the next layer, since the number of feature maps in the next layer is determined by the number of filters used for the convolutional operation in the previous layer. For this reason, the number of feature maps may differ across hidden layers. Note that filters in early layers capture primitive shapes (vertical and horizontal edges, corners, points, etc.), while filters in later layers can capture more complex compositions.

Fig. 13.16

Illustrating local filters and feature maps in CNNs

The important issues related to CNNs are explained in parts (a) to (d) of Fig. 13.16.

Part (a) of Fig. 13.16 explains the key differences between filter matching in feedforward deep neural networks and in CNNs. Feedforward deep neural networks use a larger filter than CNNs; CNN filters maintain the depth, but their height and width are smaller than those of the original image. This is illustrated in part (a) of Fig. 13.16, where a filter of size 7 × 7 × 3 is used for an image of size 256 × 256 × 3.

Part (b) of Fig. 13.16 exemplifies the filter matching operation using CNNs, and we can see clearly that the matching process is done locally, that is, in small patches or overlapping patches of an image. Filter matching is done by computing the pre-activation (zi) and activation (yi) values with the following equations. This is done at each position of the filter.

$$ {z}_i=\sum \limits_{j=1}^{147}{w}_j{x}_{ij}+b $$
(13.1)
$$ {y}_i=g\left({z}_i\right) $$
(13.2)

The pixel values in Eqs. (13.1) and (13.2), xij, j = 1, 2, …, 147, are called the local receptive field, which corresponds to the local patch of the image of size 7 × 7 × 3 and changes as the filter slides vertically and horizontally across the whole image. The filter has only 7 × 7 × 3 + 1 = 148 parameters, a lot fewer than the 256 × 256 × 3 + 1 = 196,609 parameters needed for the feedforward deep neural network given in Fig. 13.15. This reduction in the required number of parameters is due to the small local filters and to the fact that, under CNNs, parameters are shared across positions. The local image patch used each time, as well as the filter, are 3D tensor structures, but the multiplication operation of Eq. (13.1) uses a stretched-out, vectorized version of the two tensors (Varma and Das 2018). The convolutional operation results in much sparser neural networks than fully connected networks.
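A small sketch (base R, random values) of Eqs. (13.1) and (13.2) for a single filter position: the 7 × 7 × 3 patch and filter are vectorized, a dot product plus a bias gives the pre-activation, and a ReLU gives the activation.

patch <- array(rnorm(7 * 7 * 3), dim = c(7, 7, 3))   # local receptive field
w     <- array(rnorm(7 * 7 * 3), dim = c(7, 7, 3))   # filter weights (147 values)
b     <- 0.1                                         # bias, the 148th parameter
z     <- sum(w * patch) + b                          # Eq. (13.1), pre-activation
y     <- max(0, z)                                   # Eq. (13.2) with g = ReLU
c(z, y)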

The filter is moved across the image, as can be seen in part (c) of Fig. 13.16, and a new value of zi is computed with Eq. (13.1) at each position, which produces an output matrix of size 250 × 250. This output matrix is called the feature map (or activation map). Convolution is the name given to this operation that computes the dot product at each position in which we slide the filters across the image. All the nodes in the feature map are tuned to detect the same feature in the input layer at different positions of the image since the same filters are used for all nodes in the feature map. This means that CNNs are able to detect objects regardless of their location in the image and, for this reason, CNNs possess the property of translational invariance (Varma and Das 2018).

The examples given above illustrate how CNNs work using a single filter, which is able to capture or detect a single pattern in the input image. For this reason, we need to add many more filters to detect many patterns, each of which produces its own feature map, as can be seen in part (d) of Fig. 13.16. For example, only vertical edges can be detected with feature map 1, while only horizontal edges can be detected with feature map 2. This means that a hidden layer in a CNN consists of a stack of feature maps.

In feedforward deep neural networks, the filter spans the entire input plane, looking for patterns that span the entire image. In contrast, CNNs use smaller filters to detect smaller shapes that are built hierarchically into bigger shapes and objects, capturing the information of the image in more detail. This is practical since real-world images are built from smaller patterns that rarely span the entire image. Thanks to the translation invariance property, any shape can be detected regardless of its location in the image plane.

Figure 13.17 also contrasts the feedforward deep neural network with CNNs. The top part of Fig. 13.17 shows the input image and the neurons in the first hidden layer; when a particular pattern in the image is detected, a neuron is activated, with each neuron looking for a particular pattern. In CNNs, on the other hand, each neuron of the hidden layer is replaced by a feature map with multiple sub-neurons. A local filter is used to compute the activation value at each sub-neuron in an activation map, which looks for the same pattern in different parts of the image. For this reason, CNNs need as many feature maps as feedforward deep neural networks need neurons. Figure 13.17 also shows that the number of weights is considerably reduced in CNNs at the expense of an increase in the number of neurons. The increase in the number of neurons increases the computational resources required, but this is the price we pay for the reduction in the number of parameters that need to be estimated during the training process (Varma and Das 2018).

Fig. 13.17

Relation between CNNs and deep feedforward networks

Figure 13.18 illustrates how to implement more than one hidden layer in a deep neural network using the convolutional operation. In this figure, we can see that the operation is almost the same as the convolution performed between the input layer and the first hidden layer. Figure 13.18 also shows that the second convolutional hidden layer contains ten feature maps, each generated with a filter of size 7 × 7 × 10, whose depth (10) is inherited from the number of feature maps in the first hidden layer; for this reason, the filter depth is not a free parameter. In CNNs, the initial hidden layers detect simple shapes, which allows the later layers to detect more complex shapes. To successfully implement a CNN, we need to tune: (a) the number of feature maps (number of filters) needed in each hidden layer, (b) the size of the filter, and (c) the stride size (Varma and Das 2018).

Fig. 13.18

CNNs with multiple hidden layers

13.8 Illustrative Examples

Maize example with CNNs

This data set contains 101 maize lines, each of which was studied in two environments (FLAT5I and BED2I) and genotyped with 101 markers. This first example will use CNNs in one dimension, which are useful for capturing spatial relationships between genomic markers and also in time series data.

The phenotypic information contains three columns: one for environments (Env), the other for genotypes (GID), and the last one for grain yield (Yield):

> head(Pheno)
         GID    Env    Yield
1 GID7459918 FLAT5I 5.481080
2 GID7461686 FLAT5I 6.395386
3 GID7462121 FLAT5I 6.633570
4 GID7462462 FLAT5I 5.358746
5 GID7624708 FLAT5I 6.281305
6 GID7624898 FLAT5I 6.746165

With command tail(), we can see the last six observations of a data set:

> tail(Pheno)
           GID   Env    Yield
197 GID8057500 BED2I 4.658421
198 GID8057905 BED2I 5.235285
199 GID8058141 BED2I 4.768317
200 GID8058432 BED2I 4.899983
201 GID8059034 BED2I 4.820649
202 GID8059268 BED2I 5.097920

Here we can see that there are 202 observations and 101 genotypes, and that each genotype was measured in two environments, as can be seen next.

> Env = unique(Pheno$Env)
> Env
[1] FLAT5I BED2I
Levels: FLAT5I BED2I

Then, with the following lines of code, the design matrices of environment and genotypes are created:

Z.E <- model.matrix(~0 + as.factor(Pheno$Env))
Z.G <- model.matrix(~0 + as.factor(Pheno$GID))
Z.G = Z.G %*% Markers_Final

Next, with the following lines of code, the training-testing sets are created using the BMTME package:

pheno <- data.frame(GID = Pheno[, 1], Env = Pheno[, 2], Response = Pheno[, 3])
CrossV <- CV.KFold(Pheno, DataSetID = 'GID', K = 5, set_seed = 123)

Then we select the response variable and create the input matrix of information:

y = Pheno[, 3]
X = cbind(Z.E, Z.G)

The whole data set contains 202 observations and the input information contains 103 independent variables.

From this output and input information (y, X), we extracted a training sample of 161 observations (y_trn, X_trn), and the remaining observations were used for testing (y_tst, X_tst). Using this output and input, next we show a basic CNN in one dimension (1D-CNN), which consists of stacking layer_conv_1d() and layer_max_pooling_1d() layers. The basic difference in the implementation of a 1D-CNN is in the specification of the model, which is given next; but first, since layer_conv_1d() expects each sample as a 2D array (steps, channels), the input matrices need an extra channel dimension of size 1, as sketched below.
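A minimal sketch of this reshaping step (the exact code is an assumption; the variable names follow those used in the text):

# Add a channel dimension of size 1 so that each sample has shape (103, 1)
X_trn <- array_reshape(X_trn, dim = c(nrow(X_trn), ncol(X_trn), 1))
X_tst <- array_reshape(X_tst, dim = c(nrow(X_tst), ncol(X_tst), 1))
dim(X_trn)   # 161 103 1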

###### Specification of the 1D-CNN model ##############
model_Sec <- keras_model_sequential()
model_Sec %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu",
                input_shape = c(NULL, dim(X_trn)[2], 1)) %>%
  layer_max_pooling_1d(pool_size = 2) %>%
  layer_conv_1d(filters = 64, kernel_size = 5, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 4, strides = 1) %>%
  layer_conv_1d(filters = 64, kernel_size = 5, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 6, strides = 2) %>%
  layer_flatten() %>%
  layer_dense(units = 500, activation = "relu") %>%
  layer_dropout(rate = 0) %>%
  layer_dense(units = 1)
summary(model_Sec)

As mentioned above, layer_conv_1d() is used to specify hidden layers for CNNs in one dimension. Each sample is an input of two dimensions, where the first dimension is the number of independent variables in the training set (the positions along the sequence), while the second is the depth (number of channels, here 1). Inside this layer, the following need to be specified: the number of filters, the activation function, the kernel size, and the stride, which by default is 1. The number of filters and the kernel size should be specified by the user or tuned. After a layer_conv_1d(), a pooling layer is added with the function layer_max_pooling_1d(), which takes as input the pool size and the stride, which by default is equal to the pool size. The pool size should be specified by the user; values of 3, 5, or 7 are very common. This scheme agrees with the one provided in Fig. 13.12, which illustrates that a convolutional layer consists of a convolutional operation followed by a ReLU nonlinear transformation and a pooling operation to reduce the input data. However, the ReLU and pooling operations after the convolutional operation are not mandatory. Three convolutional layers are specified in this example, and the only differences between them are the number of filters and the pool size and stride used in each convolutional layer. After the third convolutional layer, a flatten layer is specified using the function layer_flatten(), which reshapes the tensor into a vector whose length is equal to the number of elements contained in the tensor, not including the batch dimension. After this layer, the layers are fully connected, like those studied in feedforward neural networks; the first contains 500 units, and the output layer has only one unit since we want to predict a continuous outcome. Note that 1D-CNNs process input patches independently and are therefore not sensitive to the order of the timesteps.

Next is a summary of the 1D-CNN model.

Since the input has 103 positions (time points) and 32 filters were used in the first convolutional layer, the output of this layer has (103 − 5 + 1) = 99 time points and a depth of 32 filters, since by default this convolutional operation has stride = 1. Therefore, for this layer, the required number of parameters is equal to (5 × 1 + 1) × 32 = 192. For the first max pooling operation, the output contains (99 − 2)/2 + 1 = 49 time points (rounding the division down) and a depth of 32; since the depth is not affected by the pooling operation, here and in all max pooling operations there are no parameters to estimate (Fig. 13.19). For the second convolutional operation, the number of time points is equal to (49 − 5) + 1 = 45 with a depth equal to 64, since this was the number of filters specified for this convolutional operation. This second convolutional operation requires 32 × 5 × 64 + 64 = 10,304 parameters that need to be estimated. For the second pooling layer, the output contains the same 64 filters as the depth, but the number of time points now is equal to (45 − 4)/1 + 1 = 42 since now the stride = 1 (Fig. 13.19). The third convolutional layer also has 64 filters since this was the value specified for this layer, but the number of time points of the output now is equal to (42 − 5) + 1 = 38, while the number of parameters that need to be estimated for this operation is equal to 64 × 5 × 64 + 64 = 20,544 (Fig. 13.19). The third pooling operation produces (38 − 6)/2 + 1 = 17 time points since the stride is equal to 2 and the pool size is equal to 6. The flatten layer then stacks the final output of the third pooling layer; for this reason, it produces an output of 17 × 64 = 1088. Since in the first feedforward layer we specified 500 neurons, the number of parameters that need to be estimated is equal to 500 × (1088 + 1) = 544,500. Finally, in the output layer of the feedforward deep neural network, 501 parameters are required, since 500 weights + 1 intercept need to be estimated (Fig. 13.19).

Fig. 13.19

Summary of the 1D-CNN with three convolutional layers and two feedforward neural network layers
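These counts can be reproduced with a couple of lines of arithmetic (a sketch; the formula (kernel size × input depth + 1) × filters is the standard parameter count for a convolutional layer with biases):

conv1d_params <- function(kernel, in_depth, filters) (kernel * in_depth + 1) * filters
conv1d_params(5, 1, 32)    # 192, first convolutional layer
conv1d_params(5, 32, 64)   # 10,304, second convolutional layer
conv1d_params(5, 64, 64)   # 20,544, third convolutional layer
500 * (1088 + 1)           # 544,500, dense layer with 500 units
500 + 1                    # 501, output layer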

Next, we provide the flags needed under a 1D-CNN to implement a grid search to tune the required hyperparameters. The following code creates the flags; this code is called: Code_Tuning_With_Flags_CNN_1D_1HL_2CHL.R

####a) Declaring the flags for hyperparameters
FLAGS = flags(
  flag_numeric("dropout1", 0.05),
  flag_integer("units", 33),
  flag_string("activation1", "relu"),
  flag_integer("batchsize1", 32),
  flag_integer("Epoch1", 500),
  flag_numeric("learning_rate", 0.001),
  flag_integer("filters", 64),
  flag_numeric("KernelS", 3),
  flag_numeric("PoolS", 2),
  flag_numeric("val_split", 0.2))
####b) Defining the DNN model
build_model <- function() {
  model <- keras_model_sequential()
  model %>%
    layer_conv_1d(filters = FLAGS$filters, kernel_size = FLAGS$KernelS,
                  activation = FLAGS$activation1,
                  input_shape = c(dim(X_trII)[2], 1)) %>%
    layer_max_pooling_1d(pool_size = FLAGS$PoolS) %>%
    layer_conv_1d(filters = FLAGS$filters, kernel_size = FLAGS$KernelS,
                  activation = FLAGS$activation1) %>%
    layer_max_pooling_1d(pool_size = FLAGS$PoolS) %>%
    layer_flatten() %>%
    layer_dense(units = FLAGS$units, activation = FLAGS$activation1) %>%
    layer_dropout(rate = FLAGS$dropout1) %>%
    layer_dense(units = 1)
  #####c) Compiling the DNN model
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_adam(lr = FLAGS$learning_rate),
    metrics = c("mse"))
  model}
model <- build_model()
model %>% summary()
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 20 == 0) cat("\n")
    cat(".")})
early_stop <- callback_early_stopping(monitor = "val_loss", mode = 'min', patience = 50)
###########d) Fitting the DNN model#################
model_Final <- build_model()
model_fit_Final <- model_Final %>% fit(
  X_trII, y_trII,
  epochs = FLAGS$Epoch1,
  batch_size = FLAGS$batchsize1,
  shuffle = F,
  validation_split = FLAGS$val_split,
  verbose = 0, callbacks = list(early_stop, print_dot_callback))

In a), the default flag values for the number of filters, kernel size, pool size, etc., are given. In b), the DNN model is defined and the flag parameters are incorporated into the model. Instead of specifying only dense layers (layer_dense) as in feedforward networks, we first specify convolutional layers followed by pooling layers, and the dense layers come at the end. Before the dense layers, a flatten layer is specified that transforms the input from a tensor of two or more dimensions into a vector (a tensor of one dimension). The specification of the dense layers is exactly the same as in all the feedforward networks studied before; the new part is the specification of the convolutional layers. In this code, the convolutional layers are specified only in 1D; for this reason, the keras function used is layer_conv_1d(), while for the pooling process, the layer_max_pooling_1d() function is used. Convolutional layers in 1D take as input 3D tensors with shape (samples, time, features) and return similarly shaped 3D tensors.

The convolution window is a 1D window (patch) on the temporal axis, that is, on axis 1 of the input tensor (Chollet and Allaire 2017). CNNs in 1D can recognize temporal patterns in a sequence, and since the same input transformation is performed on every subsequence (window or patch), a pattern learned at a specific position of the sequence can later be recognized at a different position, making 1D-CNNs translation invariant (for temporal translations). CNNs in 1D are insensitive to the order of the timesteps (beyond the size of the convolutional windows) since they process input patches (subsequences) independently; for this reason, they are a good choice when the global ordering of the sequence is not fundamentally meaningful for recognizing the temporal patterns. In this example, two convolutional layers were specified; for this reason, in the above code (part b), layer_conv_1d() and layer_max_pooling_1d() each appear twice before the flatten layer, which is specified as layer_flatten(). However, the number of convolutional layers to be specified is problem-dependent, and more than one can be used. Also, the number of dense layers after the flatten layer can be more than one. Finally, it is important to mention that in the output layer we specified only one neuron without an activation function, since we want to predict a continuous outcome and, by default, the linear activation function is used. For all other layers, we specified the relu activation function to capture nonlinear patterns. The input_shape for a CNN in one dimension needs an input of two dimensions; for this reason, we specified the first dimension as the number of columns in the design matrix (X) and the second as 1, since we only have data in one dimension. In 1D-CNNs, it is practical to use kernel sizes of 7 or 9.

In part c) of the code, the model is compiled with the mean squared error (MSE) as both loss function and metric. Since the response we want to predict is continuous, we also specified optimizer_adam() as the optimizer. In part d), the model is fitted using the number of epochs, batch size, and validation split specified in the flags (part a). The fitting process in this case was done using the early stopping method.

The code given above called Code_Tuning_With_Flags_CNN_1D_1HL_2CHL.R is in the code given in Appendix 1. The code given in Appendix 1 executes the grid search using the library tfruns (Allaire 2018) and the tuning_run() function. The grid search implemented is shown below.

runs.sp <- tuning_run("Code_Tuning_With_Flags_00_CNN_1D_3HL_3CHL.R",
  runs_dir = '_tuningE1',
  flags = list(dropout1 = c(0, 0.05),
               units = seq(dim(X)[2]/2, 2*dim(X)[2], 40),
               activation1 = ("relu"),
               batchsize1 = c(32),
               Epoch1 = c(500),
               learning_rate = c(0.001),
               filters = c(32, 64),
               KernelS = c(3),
               PoolS = c(1),
               val_split = c(0.2)),
  sample = 1, confirm = FALSE, echo = F)

The grid search is composed of 16 hyperparameter combinations: two values of dropout, four values of units, two values of filters, and one value for each of the remaining hyperparameters. The code given in Appendix 1 was run with five-fold cross-validation, each time using four folds for training and the remaining fold for testing. The prediction performance is reported in terms of MSE and the mean arctangent absolute percentage error (MAAPE), since the outcome we want to predict is continuous. Table 13.3 indicates a similar prediction performance in both environments.

Table 13.3 Prediction performance for different numbers of hidden layers (HL) in the feedforward deep neural network part and different hidden convolutional layers (HCL) with five outer fold cross-validation

13.9 2D Convolution Example

The implementation of 2D-CNNs requires the input for each observation to be in at least a 2D format. These CNNs are useful for capturing spatial relationships in the input data, which is why they are most popular when image information is used as input. In the context of genomic selection, the input data (SNPs) are not in this 2D format; therefore, to use 2D-CNNs, encoding methods were developed to overcome this constraint. We used one-hot encoding, which simply recodes the three SNP genotypes as three 0/1 dummy variables. Using this approach, nonlinear relationships can be modeled using nonlinear activation functions in the first layer. Under one-hot coding, each marker is represented by a three-dimensional vector with a 1 at the index of the observed genotype and 0 elsewhere. To illustrate this, we assume that the genotypes [AA, Aa, aa] are represented as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively (Fig. 13.20). Liu et al. (2019), Bellot et al. (2018), and Pérez-Enciso and Zingaretti (2019) used this approach, but Liu et al. (2019) also used a one-hot coding that takes missing values into account; in that case, the input vector is four-dimensional because the additional dimension is required to accommodate the missing values.

Fig. 13.20 CNN with genotypes that are one-hot encoded. AA denotes the homozygous genotype, aa the reference homozygous genotype, and Aa the heterozygous genotype. Processed features from two CNN layers are then passed to the output processing block, which contains a flatten layer and a fully connected dense layer

Before we explain the basic issues involved in implementing a 2D-CNN, we note that we will use the same data set that was used to illustrate the 1D-CNN, which contains 202 observations and 101 independent variables as input; the difference is that now, after one-hot encoding, a two-dimensional matrix is created for each observation. This means that each observation has a height of 101 and a width of 3, which is equivalent to an image of 101 × 3.

Again, from the output and input information (y, X), we extracted a training set with 161 observations (y_trn, X_trn) and a testing set (y_tst, X_tst) with 41 observations; a minimal sketch of this reshaping and splitting is given below. After it, we show how 2D-CNNs are implemented by stacking layer_conv_2d() and layer_max_pooling_2d() layers.
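The following sketch of this data preparation assumes the hypothetical one_hot_snp() helper shown earlier; the object names (X_all, trn_index) and the random split are illustrative only, not the exact partition used in the chapter.

# Sketch: building the 4D input that layer_conv_2d() expects
# (samples, height = 101, width = 3, depth = 1) and splitting the 202
# observations into 161 for training and 41 for testing. Illustrative only.
X1hot <- one_hot_snp(X)                        # n x 101 x 3 one-hot array
X_all <- array(X1hot, dim = c(dim(X1hot), 1))  # add the depth (channel) dimension

set.seed(1)
trn_index <- sample(seq_len(nrow(X)), 161)
X_trn <- X_all[trn_index, , , , drop = FALSE]
X_tst <- X_all[-trn_index, , , , drop = FALSE]
y_trn <- y[trn_index]
y_tst <- y[-trn_index]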

model_Sec <- keras_model_sequential()
model_Sec %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(dim(X_trn)[2], dim(X_trn)[3], dim(X_trn)[4])) %>%
  layer_max_pooling_2d(pool_size = c(5, 1), strides = 2) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 1), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(3, 1), strides = 3) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 1), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 1)) %>%
  layer_flatten() %>%
  layer_dense(units = 384, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0) %>%
  layer_dense(units = 1)
summary(model_Sec)

As mentioned above, a 2D-CNN consists of stacking layer_conv_2d() and layer_max_pooling_2d() layers. layer_conv_2d() requires at least three arguments: the number of filters, the kernel size, and the activation function; in addition, the input shape must be given for the first convolutional layer. These parameters need to be specified by the user; with regard to the kernel size, the user specifies the height and width of the kernel, which can be different. The activation function can be any of those used in deep learning, but most of the time the ReLU activation function is used in CNNs. With regard to the input_shape for 2D-CNNs, it needs to be three-dimensional: the first dimension is the height (101 in this example), the second the width (3 in this example), and the last one the depth (1 in this example) of the image or input.

For layer_max_pooling_2d(), at least two inputs are required: the first is the height and width of the patch that will be pooled and the second is the stride size. In this example, each layer_conv_2d() is followed by a layer_max_pooling_2d(), and together they form a convolutional layer; therefore, three convolutional layers were implemented, since layer_conv_2d() and layer_max_pooling_2d() each appear three times. After the last convolutional layer, layer_flatten() was used to flatten the output of the last layer_max_pooling_2d(). Then a feedforward (dense) layer was used, for which the user provides the number of neurons. Finally, the output layer, which belongs to the feedforward layers, was specified. Figure 13.21 gives a summary of the output shape of each convolutional and feedforward layer produced by the keras code given above for this 2D-CNN example.

Fig. 13.21 Summary of the 2D-CNN with three convolutional layers and two feedforward neural network layers

Since the 2D input has order 101 × 3 with a depth of 1, and 32 filters with a 3 × 3 kernel were used in the first convolutional layer, the output of this layer has a height equal to 101 − 3 + 1 = 99, a width equal to 3 − 3 + 1 = 1 (since the stride = 1), and a depth of 32 (the number of filters specified by the user). The required number of parameters for this layer is equal to 3 × 3 × 1 × 32 + 32 = 320. Then, for the first max pooling operation, the output has a width of 1, a height equal to (99 − 5)/2 + 1 = 48, and a depth of 32, since the depth is not affected by the pooling operation; here, as in all max pooling operations, there are no parameters to estimate (Fig. 13.21). For the second convolutional operation (kernel 3 × 1, 64 filters), the width is equal to 1, the height is equal to (48 − 3) + 1 = 46, and the depth is 64, the number of filters specified for this operation; it requires 3 × 1 × 32 × 64 + 64 = 6208 parameters to be estimated. For the second pooling layer, the output keeps the same depth of 64 and a width of 1, while the height is equal to (46 − 3)/3 + 1 = 15 since now the stride = 3 (Fig. 13.21). The third convolutional layer has 128 filters, since this was the value specified for this layer; its output has a width of 1 and a height of (15 − 3) + 1 = 13, and the number of parameters that need to be estimated for this operation is 3 × 1 × 64 × 128 + 128 = 24,704 (Fig. 13.21). The third pooling operation also has a width of 1 and a height of (13 − 2)/2 + 1 = 6, since the stride and the pooling size were both equal to 2. The flatten layer stacked the final output of the third pooling layer and produced an output of 6 × 1 × 128 = 768. Since we specified 384 neurons in the first feedforward layer, the number of parameters that need to be estimated is 384 × (768 + 1) = 295,296. Finally, the output layer of the feedforward deep neural network requires 385 parameters, since 384 weights + 1 intercept need to be estimated (Fig. 13.21).
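The arithmetic above can be checked with a few lines of R; the helper functions below simply reproduce the output-size and parameter-count formulas for "valid" convolutions (stride 1) and max pooling used in Fig. 13.21, and are illustrative rather than part of the original analysis.

# Sketch: reproducing the output-size and parameter arithmetic used above for
# "valid" convolutions (stride 1) and max pooling. Purely illustrative.
conv_out  <- function(inp, kernel) inp - kernel + 1
pool_out  <- function(inp, pool, stride) floor((inp - pool) / stride) + 1
conv_pars <- function(kh, kw, depth_in, filters) kh * kw * depth_in * filters + filters

h1 <- conv_out(101, 3)      # 99
h2 <- pool_out(h1, 5, 2)    # 48
h3 <- conv_out(h2, 3)       # 46
h4 <- pool_out(h3, 3, 3)    # 15
h5 <- conv_out(h4, 3)       # 13
h6 <- pool_out(h5, 2, 2)    # 6
c(conv_pars(3, 3, 1, 32),   # 320
  conv_pars(3, 1, 32, 64),  # 6208
  conv_pars(3, 1, 64, 128), # 24704
  384 * (6 * 128 + 1),      # 295296
  384 + 1)                  # 385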

Again, we will create flags for the tuning process. The following code creates the flags for the 2D-CNN and is saved in a file called Code_Tuning_With_Flags_CNN_2D_2HL_1CHL.R. The details of this code are given next:

####a) Declaring the flags for hyperparameters
FLAGS = flags(
  flag_numeric("dropout1", 0.05),
  flag_integer("units", 33),
  flag_string("activation1", "relu"),
  flag_integer("batchsize1", 56),
  flag_integer("Epoch1", 1000),
  flag_numeric("learning_rate", 0.001),
  flag_integer("filters", 5),
  flag_integer("KernelS", 3),
  flag_integer("PoolS", 1),
  flag_numeric("val_split", 0.2))

####b) Defining the DNN model
build_model <- function() {
  model <- keras_model_sequential()
  model %>%
    layer_conv_2d(filters = FLAGS$filters,
                  kernel_size = c(FLAGS$KernelS, FLAGS$KernelS),
                  activation = FLAGS$activation1,
                  input_shape = c(dim(X_trII)[2], dim(X_trII)[3], dim(X_trII)[4])) %>%
    layer_flatten() %>%
    layer_dense(units = FLAGS$units, activation = FLAGS$activation1) %>%
    layer_dropout(rate = FLAGS$dropout1) %>%
    layer_dense(units = FLAGS$units, activation = FLAGS$activation1) %>%
    layer_dropout(rate = FLAGS$dropout1) %>%
    layer_dense(units = 1)
  #####c) Compiling the DNN model
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_adam(lr = FLAGS$learning_rate),
    metrics = c("mse"))
  model}
model <- build_model()
model %>% summary()
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 20 == 0) cat("\n")
    cat(".")})
early_stop <- callback_early_stopping(monitor = "val_loss", mode = 'min', patience = 50)

###########d) Fitting the DNN model#################
model_Final <- build_model()
model_fit_Final <- model_Final %>% fit(
  X_trII, y_trII,
  epochs = FLAGS$Epoch1,
  batch_size = FLAGS$batchsize1,
  shuffle = FALSE,
  validation_split = FLAGS$val_split,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback))

In part a) of the above code, the default flag values are given for the number of filters, kernel size, pool size, etc. In part b) the DNN model with one convolutional layer is defined, and at the end, two dense layers are specified. Before the dense layers, a flatten layer is specified to transform the input from a tensor of two or more dimensions into a vector (a tensor of one dimension). The specification of the dense layers is exactly the same as for the feedforward networks and 1D-CNNs studied previously; the new part is the specification of the convolutional layer on 2D tensors. Now we specify 2D convolutions with the function layer_conv_2d() and, when needed, max pooling with the function layer_max_pooling_2d(). The convolutional layers need input information in three dimensions (height, width, depth), not including the batch dimension, and return a 3D tensor as output (Chollet and Allaire 2017). 2D-CNNs also have the translation invariance property: they recognize local patterns in images, and a pattern learned at a specific position of the 2D tensor can later be recognized anywhere, because the same transformation is performed on all patches extracted from the original image. As mentioned, in this example one convolutional layer was specified (see part b of the code) using layer_conv_2d() before the flatten layer, layer_flatten(); however, the user can implement more than one convolutional layer and more dense layers depending on the problem at hand. In this example, only one neuron is specified in the output layer because we want to predict a continuous outcome. Except for the output layer, we used the ReLU activation function to capture nonlinear patterns. We specified a 3D tensor in the input_shape argument, with the first element representing the height, the second the width, and the last one the depth. In 2D-CNNs, it is very popular to use kernel sizes of 3 × 3 or 5 × 5.

In part c) the model is compiled using the MSE as the loss function and metric since the outcome of interest is continuous, and optimizer_adam() was specified as the optimizer. In part d) the model is fitted using the number of epochs, batch size, and validation split as specified in the flags (part a). The fitting process in this case was done using the early stopping method.

The code given above was put in a file called Code_Tuning_With_Flags_CNN_2D_2HL_2CHL.R, which is used in Appendix 2 to implement a 2D tensor deep neural network with convolutional layers for predicting a continuous response variable. The code given in Appendix 2 executes a grid search using the library tfruns (Allaire 2018) and the tuning_run() function; the grid search that was implemented is shown below.

runs.sp <- tuning_run("Code_Tuning_With_Flags_00_CNN_2D_2HL_1CHL.R",
  runs_dir = '_tuningE1',
  flags = list(
    dropout1 = c(0, 0.05),
    units = seq(dim(X)[2]/2, 2*dim(X)[2], 40),
    activation1 = "relu",
    batchsize1 = c(32),
    Epoch1 = c(500),
    learning_rate = c(0.001),
    filters = c(32, 64),
    KernelS = c(3),
    PoolS = c(1),
    val_split = c(0.2)),
  sample = 1, confirm = FALSE, echo = FALSE)

The grid search is composed of 16 combinations of hyperparameters: two values of dropout, four values of units, two values of filters, and one value for each of the remaining hyperparameters. The code given in Appendix 2 was run with five-fold cross-validation, and each time four folds were used for training and the remaining fold for testing. The prediction performance is reported in terms of MSE and MAAPE because the outcome we want to predict is continuous. Table 13.4 shows a similar prediction performance in both environments.
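For reference, the two metrics reported in Tables 13.3 and 13.4 can be computed from observed and predicted values as sketched below; this is a generic definition of MAAPE, not necessarily the exact implementation used to produce the tables.

# Sketch of the reported metrics for a continuous response: the mean squared
# error (MSE) and the mean arctangent absolute percentage error (MAAPE).
mse   <- function(y, yhat) mean((y - yhat)^2)
maape <- function(y, yhat) mean(atan(abs((y - yhat) / y)))

# Example of use with test set predictions from a fitted model (illustrative):
# yhat_tst <- model_Final %>% predict(X_tst)
# c(MSE = mse(y_tst, yhat_tst), MAAPE = maape(y_tst, yhat_tst))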

Table 13.4 Prediction performance under a 2D-CNN with one hidden convolutional layer (HCL) and 1, 2, and 3 hidden layers (HL) in the fully connected part. Five-fold outer cross-validation was implemented

Table 13.4 indicates that the best predictions are observed with only one hidden layer since with more hidden layers the results are slightly worse. To implement the model with one and three hidden layers, we modified the number of hidden layers in Code_Tuning_With_Flags_00_CNN_2D_2HL_2CHL.R and in the code given in Appendix 2.

Finally, as mentioned earlier, CNNs can capture the spatial structure between genomic markers, and for this reason these deep neural network methods have been implemented in genomic selection by authors such as Liu et al. (2019), Bellot et al. (2018), Pérez-Enciso and Zingaretti (2019), Ma et al. (2018), Zou et al. (2019), and Waldmann et al. (2020). However, the advantage of CNNs over statistical genomic selection models has been modest, and considerable time is needed to tune these CNNs. In genomics, most applications concern functional genomics, with examples that include predicting the sequence specificity of DNA- and RNA-binding proteins, enhancer and cis-regulatory regions, methylation status, gene expression, and the control of splicing (Waldmann et al. 2020). Deep learning has been especially successful when applied to regulatory genomics, using architectures directly adapted from modern computer vision and natural language-processing applications.

13.10 Criticisms of Deep Learning

To finish this chapter, we need to mention some of the problems of deep learning in order to show both sides of the coin. This is important because in the media deep learning is sold as a panacea that will solve all association and prediction problems using data, implying that it is not necessary to learn any other machine learning or statistical learning algorithm, which is totally wrong. Next, we provide some arguments for why it is wrong to think that it is enough to learn only deep learning algorithms and no other statistical machine learning methods:

(A) Most deep learning models are not interpretable, and when inference (association) is the goal, simpler statistical learning methods are usually the better option; in general, deep learning-based solutions lack mathematical elegance and offer very little interpretability of the solution found or understanding of the underlying phenomena.

(B) Many times, simple statistical machine learning algorithms, such as multiple regression or logistic regression, work just fine for the required model accuracy. This is due in part to the "no free lunch" theorem, which states that no single model works for every problem, since the assumptions of a great model for one problem might not hold for another, and also to the size of the training set, since the smaller the training set, the worse the performance of deep learning models. Also, many times the data at hand have a clearly linear pattern and there is no reason to model them with a nonlinear model (e.g., a multilayer perceptron); different problems call for different best methods. Likewise, when the input is not highly structured (e.g., not image-like) but is of high quality, conventional statistical machine learning methods are often enough to obtain sufficiently good prediction performance.

(C) Most conventional statistical machine learning algorithms require neither massive computations that must be run on computer clusters or graphics processing units, nor a complex tuning process that needs sophisticated optimization algorithms with effective initializations and gradual stochastic gradient learning to perform reasonably well.

(D) The many successful applications of deep learning are supported by strong empirical evidence but little theoretical understanding of the underlying paradigm. Moreover, the optimization employed in the learning process is highly non-convex and intractable from a theoretical viewpoint.

For the reasons mentioned above, Rami (2017) stated that deep learning is "alchemy," since many times it is not possible to implement it successfully because the loss function is non-convex and only guarantees a local minimum, and even after a lot of time spent tuning, it is not always possible or easy to arrive at a reasonable solution. This is because the tuning methods used to select hyperparameters rely on brute force and their implementation consists of a trial-and-error process that is mostly art and little science. LeCun, one of the great promoters of deep learning, responded to this "alchemy" criticism in the following way: "In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optic theory, the steam engine preceded thermodynamics, the computer preceded computer science. Why? Because theorists will spontaneously study 'simple' phenomena, and will not be enticed to study complex ones until there is a practical importance to it" (Elad 2017).

However, successful applications of deep learning can be seen in many areas, and the power of this tool, even with the problems mentioned above, will continue reshaping and influencing not only our everyday lives but also many other areas of science. For this reason, it is important to adopt the good elements of this field in our own field, to take advantage of this powerful tool, and to help provide the strong theoretical background that is needed for this field to continue growing, improving, and reshaping our everyday lives and our ways of doing science.

DL methods also have some real advantages: they can efficiently handle natural data in their raw form, which is not possible with most statistical machine learning models (LeCun et al. 2015). DL has also proven to provide models with higher accuracy that are efficient at discovering patterns in high-dimensional data, making them applicable in a variety of domains, and it makes it possible to incorporate all omics data (metabolomics, proteomics, and transcriptomics) into the same model more efficiently. However, it is difficult to implement DL models in genomic data sets, since these usually contain a very large number of variables and a small number of samples; on the other hand, DL offers many opportunities to design specific topologies (deep neural networks) that can deal better with the type of data present in genomic selection.