Every image can be represented by a matrix of pixel values. A colour image can be represented as three channels (or 2D matrices) stacked on top of each other in the RGB colour space, in which red, green and blue are combined in various ways to yield an extensive array of colours. Conversely, a greyscale image is often represented by a single channel with pixel values ranging from 0 to 255, where 0 indicates black and 255 indicates white.
16.2.1 Convolution Operation
Suppose that we are trying to classify the object in Fig. 16.2. Convolutional networks allow the machine to extract features such as paws, small hooded ears, two eyes and so on from the original image. The network then combines all the extracted information to generate a likelihood for each class category. This feature extraction is unique to the CNN and is achieved by introducing a convolution filter, or kernel, defined by a small two-dimensional matrix. The kernel acts as a feature detector by sliding as a window over the high-dimensional input matrices of the image. At each position, it performs a point-wise multiplication with the underlying patch and the products are summed to give one element of the new array. The resulting array of this operation is known as the convolved feature or the feature map. A feature map conveys a distinct feature drawn from the image and activated by the kernel. For our networks to perform well, we often assign a sufficiently large number of kernels in the convolution function so that our model becomes good at recognising patterns in unseen images.
In addition, after every convolution the resolution of the output becomes smaller than that of the input matrix. This is a consequence of the arithmetic of sliding a window of size greater than 1 × 1. As a result, information is summarised at the cost of losing some potentially important data. To control this, we can use zero padding, which appends zero values around the input matrix.
To illustrate, we refer to the example shown below. The dimension of the original output after convolution is a 3 × 3 array. To preserve the original resolution of the 5 × 5 matrix, we can add zeros around the input matrix to make it 7 × 7. It can then be shown that the final output is also a 5 × 5 matrix. This does not affect the quality of the dataset, as adding zeros around the borders neither transforms nor changes the information of the image.
A formula to calculate the dimension of the output from a square input matrix is given as follows (Fig. 16.3),
$$Width_{\text{feature map}} = \frac{Width_{\text{input}} - Width_{\text{kernel}} + 2 \times padding}{stride} + 1$$
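To make the operation concrete, here is a minimal NumPy sketch of a single-channel convolution (strictly, the cross-correlation used in CNNs); the 5 × 5 image and 3 × 3 averaging kernel are purely illustrative, and the output sizes agree with the formula above.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel convolution (cross-correlation) with zero padding."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)   # point-wise multiply, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 input
kernel = np.ones((3, 3)) / 9.0                     # toy 3 x 3 averaging kernel

print(conv2d(image, kernel).shape)             # (3, 3): no padding shrinks the output
print(conv2d(image, kernel, padding=1).shape)  # (5, 5): zero padding preserves resolution
```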
Figure 16.4 shows typical learned filters of a convolutional network. As mentioned previously, the filters in convolutional networks extract features by being activated by patterns in the input matrices. We would like to highlight that the filters in the first few layers of the network are usually smooth; they often pick up lines, curves and edges of the image, as these fundamentally define the elements that are crucial for processing images. In the subsequent layers, the model starts to learn more refined filters that identify the presence of more unique features.
In comparison to a traditional neural network, convolution achieves a better image-learning system by exploiting three main attributes of a convolutional neural network: 1. sparse interactions, 2. parameter sharing and 3. equivariant representation.
Sparse interactions refer to the interactions between the input and the kernel. These are the matrix multiplications described earlier, and 'sparse' refers to the small kernel, since we construct our kernel to be smaller than the input image. The motivation behind choosing a small filter is that machines are able to find small, meaningful features with kernels that occupy only tens or hundreds of pixels. This reduces the number of parameters used, which cuts down the memory required by the model and improves its statistical efficiency.
Furthermore, we apply the same kernel with the same parameters over all positions during the convolution operation. This means that instead of learning a distinct set of parameters for every location, the machine only needs to learn one set of filter weights. As a result, the computation becomes even more efficient. This idea is known as parameter sharing. Combining the two effects of sparse interaction and parameter sharing, Fig. 16.4 shows that they can drastically enhance the efficiency of a linear function for detecting edges in an image.
In addition, this particular form of parameter sharing makes the model equivariant to translation. We say that a function is equivariant if, when the input changes, the output changes in the same way. This allows the network to generalise texture, edge and shape detection to different locations. However, convolution fails to be equivariant to some other transformations, such as rotation and changes in the scale of an image; other mechanisms, such as batch normalisation and pooling, are needed to handle these.
16.2.2 Non-linear Rectifier Unit
After the convolution operation, an activation function is used to select and map information from the current layer. This is sometimes called the detector stage. Very often, we use a non-linear rectifier unit to introduce non-linearity into the computation. This is driven by the effort to simulate the activity of neurons in the human brain, since we usually process information in a non-linear manner. It is also motivated by the observation that data in the real world are mostly non-linear. Hence, it enables better training and fitting of deeper networks to achieve better results. A few commonly used activation functions are listed below.
16.2.2.1 Sigmoid or Logistic Function
A sigmoid function is a real continuous function that maps the input to a value in the range between zero and one. This property makes it ideal for predicting a probabilistic output, since it satisfies the axioms of probability. Moreover, because the output lies between zero and one, it is sometimes used to assess the weighted importance of each feature by assigning a value to each component: a value of zero removes the feature component from the layer, while a value of one keeps all of the information in the layer. The preserved information is used for computing predictions at subsequent steps. This attribute is helpful when we work with data that are sequential in nature, e.g. in RNN and LSTM models.
16.2.2.2 ReLU (Rectified Linear Unit)
ReLU is the most frequently used activation function in deep learning and is reported to be among the most robust in terms of model performance. As we can see in Fig. 16.5, the ReLU function sets the output to zero for every negative input value and otherwise returns the input value unchanged. However, a shortcoming of ReLU is that all negative values become zero immediately, which may affect the capacity of the model to fit the data properly.
16.2.2.3 Hyperbolic Tangent (TanH)
The last activation function on our list is tanh. It is very similar to the sigmoid function but with a range from negative one to positive one, and it is also commonly used for classification. It maps the input with a strong prior: a strongly negative input remains strongly negative, while inputs near zero stay close to zero on the tanh graph.
The activation function is applied element-wise: each individual element of every row and column of the feature map is passed through the function. The derived output has the same dimensionality as the input feature map.
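As a small illustration, the following NumPy sketch applies the three activation functions above element-wise to a hypothetical 2 × 2 feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)           # zeroes out negative values

def tanh(x):
    return np.tanh(x)                   # squashes values into (-1, 1)

feature_map = np.array([[-2.0, 0.5],
                        [ 1.5, -0.3]])  # toy 2 x 2 feature map

# Element-wise application: each output keeps the same 2 x 2 shape.
for f in (sigmoid, relu, tanh):
    print(f.__name__, f(feature_map))
```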
16.2.3 Spatial Pooling
A typical block of a state-of-the-art classifying CNN model consists of three stages. First, a convolution operation finds salient patterns in the image. The output features are then handed over to an activation function in the second stage. In the last stage, we implement a pooling function that trims the dimensionality of each feature map (down-sampling) while keeping the most critical information. This in turn reduces the number of parameters in the network and helps prevent overfitting of our model.
Spatial pooling comes in various forms, and the most frequently used pooling operation is max pooling. To illustrate the process of max pooling, we use a kernel of a definite shape (e.g. size 2 × 2) and slide it over the feature map, keeping the maximum value at each location. A diagram is drawn in Fig. 16.6 to visualize the process; a small sketch of the same operation is also given below.
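The following is a minimal NumPy sketch of 2 × 2 max pooling on a hypothetical 4 × 4 feature map.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling with a (size x size) window; dimensions are assumed divisible."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()    # keep only the strongest activation
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]], dtype=float)   # toy 4 x 4 feature map

print(max_pool2d(fmap))   # 2 x 2 output: [[6., 4.], [7., 9.]]
```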
One of the most important reasons for using pooling is to make the representation invariant to small translations of the input. This means that if we apply a small local translation to Fig. 16.2, max pooling helps to keep most of the output values unchanged. Essentially, we obtain approximately the same output when convolving a cat that sits on top of a tree and the same cat that sleeps under the tree. Hence, pooling largely ignores the location of subjects and places more emphasis on the presence of the features, which in this example is the cat.
16.2.4 Putting Things Together
Until now, we have covered the main operating structures found in most typical CNN models. A CNN block is usually constructed from the components listed below:
1. Convolution
2. Activation Function (ReLU)
3. Pooling (Sub-sampling)
4. Fully connected layer
The last component of a CNN is usually the fully connected layer (Fig. 16.7). This layer connects all the sophisticated features extracted in the final convolution layer with a vector of individual parameters specifying the interactions between the pixels of the feature maps. The parameter weights are learned so as to reduce the inaccuracy of the prediction. This is similar in spirit to a regression model, where we fit the parameter weights with the least squares solution to explain the target outcome; however, our predictors in this case are the flattened vector of the convolved maps. Finally, we use a sigmoid function to generate the likelihood of the classes for the input image in a two-class problem, or a Softmax function in the multi-class case. A minimal sketch that puts these components together is given below.
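To make the checklist concrete, here is a minimal sketch of such a block, assuming PyTorch is available; the layer sizes, input resolution and number of classes are purely illustrative.

```python
import torch
import torch.nn as nn

# A minimal CNN block following the checklist above.
# Illustrative sizes: one-channel 28 x 28 inputs, 10 classes.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # 1. convolution
    nn.ReLU(),                                                            # 2. activation
    nn.MaxPool2d(kernel_size=2),                                          # 3. pooling (28 -> 14)
    nn.Flatten(),                                                         # flatten the feature maps
    nn.Linear(16 * 14 * 14, 10),                                          # 4. fully connected layer
)

x = torch.randn(8, 1, 28, 28)          # a batch of 8 hypothetical greyscale images
logits = model(x)                      # shape: (8, 10)
probs = torch.softmax(logits, dim=1)   # Softmax for the multi-class case
print(probs.shape)
```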
16.2.5 Back-Propagation and Training with Gradient Descent
During backpropagation, we perform supervised learning: we train our model with the gradient descent algorithm to find the best-fitting parameters, i.e. those that give the optimal prediction. Gradient descent is a first-order iterative optimization method with which we can find a local minimum of the loss function. Here, the loss function is an evaluation metric that measures how far off the current model's predictions are from the targets in our dataset; it is also sometimes referred to as the error or cost function. If we can find the local minimum, our job is almost done and we conclude that the model is optimized in that region.
To understand the motivation behind gradient descent, suppose we are learning the parameters of a multiple linear regression, i.e. \(Y = X\beta + \epsilon\). The least squares estimate of β is the minimizer of the squared error \(\mathbb{L}(\beta) = (Y - X\beta)^{\prime}(Y - X\beta)\). The first and second order derivatives of \(\mathbb{L}(\beta)\) with respect to β are given by
$$\frac{\partial \mathbb{L}}{\partial \beta} = -2X^{\prime}(Y - X\beta), \qquad \frac{\partial^{2} \mathbb{L}}{\partial \beta\,\partial \beta^{\prime}} = 2X^{\prime}X$$
Since \(X^{\prime}X\) is positive semi-definite, if we further assume that \(X^{\prime}X\) has full rank, the least squares solution of \(\mathbb{L}(\beta)\) is given by
$$\hat{\beta}_{\text{LS}} = \left(X^{\prime}X\right)^{-1}X^{\prime}Y$$
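As a quick illustration, the closed-form solution can be computed directly; the data dimensions and coefficients below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # hypothetical sample size and number of features
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)  # hypothetical true coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least squares solution (X'X)^{-1} X'y, valid when X'X has full rank.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ls)   # close to beta_true
```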
Suppose now that \(X^{\prime}X\) is not of full rank, e.g. when \(p \gg n\). This is often the case for an image dataset, where the number of features is usually very large. We cannot simply invert the matrix, and it turns out that there is no unique solution in this case. However, we do know that \(\mathbb{L}(\beta)\) is a convex function and that a local minimum is a point where the error is minimized, i.e. a least squares solution. As such, we take another approach, the 'descending stairs' approach, to find our solution.
This approach is an iterative process that begins at a random location, \(x_{0}\), on the convex curve that is not the minimum. Our aim is to find the optimum \(x_{*} = \operatorname{argmin}_{x} F(x)\) that gives the minimum loss, by updating \(x_{i}\) at every \(i\)th iteration. We choose a descent direction \(d\) such that its dot product with the gradient is negative, \(\langle \nabla F(x); d\rangle < 0\), where \(\nabla F(x) = \frac{1}{N}\sum\nolimits_{i = 1}^{N} \nabla_{x} L\left(x, y_{i}\right)\). This ensures that we are moving towards the minimum point, i.e. in a direction along which the loss decreases.
To see this, we refer to the dot product identity
$$\cos (\theta ) = \frac{a \cdot b}{|a||b|}$$
If the vectors a and b are unit vectors, the identity reduces to \(\cos\theta = a \cdot b\), and the cosine of any angle larger than 90° is negative. Since the gradient points in the ascent direction, as shown in Fig. 16.8, any direction at an angle of more than 90° from it is a descent direction, and its dot product with the gradient is negative, i.e. \(\cos\theta < 0\). Hence, a naive descent direction is \(d = -\nabla F(x)\), since
$$\langle \nabla F(x); - \nabla F(x)\rangle = - |\nabla F(x)|^{2} < 0$$
This guarantees a negative value, which indicates a descent direction.
The update step for the new x is then given by
$$x_{n + 1} = x_{n} + \eta_{n} d_{n}$$
or, taking \(d_{n} = -\nabla F(x_{n})\),
$$x_{n + 1} = x_{n} - \eta_{n} \nabla F\left(x_{n}\right)$$
where \(\eta_{n}\) is the learning rate.
The learning rate (or step size) is a hyper-parameter that controls how much we adjust the position \(x_{n}\) along the descent direction; it can be thought of as how far we move in the descending direction of the current loss gradient. Too small a step results in very slow convergence to the local minimum, while too big a step may overshoot the minimum or even cause divergence. Thus, we have to be careful in choosing a suitable learning rate for our model. Thereafter, we iterate through the algorithm and let it move computationally towards the optimum point. In the regression example, the solution approaches the least squares estimate \(\hat{\beta}_{\text{LS}}\).
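A minimal NumPy sketch of gradient descent on the least squares loss above; the data, learning rate and iteration count are illustrative, and the result is compared against the closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                       # same hypothetical regression setup as before
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(p)                  # starting point x_0
eta = 0.001                         # learning rate (step size)

for _ in range(5000):
    grad = -2 * X.T @ (y - X @ beta)   # gradient of (y - X beta)'(y - X beta)
    beta = beta - eta * grad           # move against the gradient

print(beta)                               # gradient descent estimate
print(np.linalg.solve(X.T @ X, X.T @ y))  # closed-form least squares, for comparison
```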
However, this is computationally expensive, as we are aggregating losses over every observed data point, and the cost grows with the volume of the dataset. A more practical algorithm therefore samples a smaller subset of the original dataset and estimates the current gradient of the loss from that subset. Using this random sampling is known as stochastic gradient descent (SGD), and we can show that \({\mathbb{E}}[\nabla \hat{F}(x)] = \nabla F(x)\). In practice, the estimated loss converges to the actual loss when the number of samples is large enough, by the law of large numbers.
With \(n \ll N\), the mini-batch estimate of the gradient is
$$\nabla \hat{F}(x) = \frac{1}{n}\sum\limits_{k = 1}^{n} \nabla_{x} L\left(x, y_{i_{k}}\right)$$
This results in updating x with
$$x_{n + 1} = x_{n} - \eta_{n} \nabla \hat{F}\left(x_{n}\right)$$
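Continuing the same regression example, here is a minimal sketch of mini-batch SGD; the batch size, learning rate and dataset are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 5                    # hypothetical full dataset size
X = rng.normal(size=(N, p))
beta_true = np.arange(1.0, p + 1)
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta = np.zeros(p)
eta, batch_size = 0.001, 32         # illustrative learning rate and mini-batch size n << N

for _ in range(5000):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = -2 * Xb.T @ (yb - Xb @ beta) / batch_size      # unbiased estimate of the gradient
    beta = beta - eta * grad                              # SGD update

print(beta)   # close to beta_true despite never using the full gradient
```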
Lastly, there are a few commonly used loss functions, namely cross entropy, Kullback–Leibler divergence, mean squared error (MSE), and so on. The first two are typically used when training generative models, while MSE is commonly used for discriminative models. Since the performance of the prediction model improves with every parameter update from SGD, we expect the loss to decrease over the iterations. When the loss converges to a sufficiently small value, this indicates that we are ready to make predictions.
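For reference, a small sketch of two of these losses evaluated on hypothetical binary predictions.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])     # hypothetical binary targets
y_prob = np.array([0.9, 0.2, 0.7, 0.6])     # hypothetical predicted probabilities

mse = np.mean((y_true - y_prob) ** 2)
cross_entropy = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(mse, cross_entropy)   # both shrink as predictions approach the targets
```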
16.2.6 Other Useful Convolution Layers
In this section, we discuss some innovations to the convolution layer that handle certain tasks more effectively.
16.2.6.1 Transposed Convolution
Transposed convolution works as an up-sampling method. In cases where we want to map an image from a lower resolution to a higher resolution, we need a function that performs this mapping without distorting the information. This can be done with interpolation methods such as nearest-neighbour interpolation or bi-linear interpolation. However, these are very much like manual feature engineering, and no learning takes place in the network. Hence, if we want the network to learn the up-sampling itself, we can use a transposed convolution, which augments the dimension of the original matrix using learnable parameters.
As shown in Fig. 16.9, suppose we have a 3 × 3 matrix and are interested in obtaining a matrix with 5 × 5 resolution. We choose a transposed convolution with a 3 × 3 kernel and a stride of 2. Here, the stride is defined slightly differently from the convolution operation: with a stride of 2, each pixel is bordered with a row and a column of zeros. We then slide the 3 × 3 kernel over every pixel and carry out the usual pointwise multiplication, which eventually results in a 5 × 5 matrix.
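A minimal sketch of this up-sampling step, assuming PyTorch; the padding value of 1 is an assumption chosen so that the 3 × 3 input grows to 5 × 5 as in the example above.

```python
import torch
import torch.nn as nn

# Transposed convolution up-sampling a 3 x 3 input to 5 x 5.
# padding=1 is an assumption made so the output matches the 5 x 5 case above;
# the kernel weights themselves are learnable parameters.
upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                              kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 1, 3, 3)   # a hypothetical 3 x 3 input (batch and channel dims added)
y = upsample(x)
print(y.shape)                # torch.Size([1, 1, 5, 5])
```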
16.2.6.2 Dilated Convolution
Dilated convolution is an alternative to the conventional pooling method. It is usually used for down-sampling tasks and generally improves performance, for example in image segmentation problems. To illustrate this operation, the input is shown as the bottom matrix in Fig. 16.10 and the top matrix shows the output of a dilated convolution. When we set a 3 × 3 kernel with a dilation rate of two, the kernel is not simply slid two pixels at a time; instead, the dilation rate of two inserts zeros between every row and column of the kernel, so the multiplication effectively involves a 5 × 5 kernel matrix (a larger receptive field with the same computation and memory costs while preserving resolution). Pointwise matrix multiplication is then carried out at every position, and we can show that the final output is a 3 × 3 matrix. The main benefit is that dilated convolutions support exponential growth of the receptive field without loss of resolution or coverage.
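A minimal sketch, assuming PyTorch and a hypothetical 7 × 7 input consistent with the example above.

```python
import torch
import torch.nn as nn

# Dilated convolution: dilation=2 spreads the 3 x 3 kernel over a 5 x 5 receptive field.
dilated = nn.Conv2d(in_channels=1, out_channels=1,
                    kernel_size=3, dilation=2)

x = torch.randn(1, 1, 7, 7)   # hypothetical 7 x 7 input
y = dilated(x)
print(y.shape)                # torch.Size([1, 1, 3, 3])
```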