There are many articles teaching people how to build intelligent applications with frameworks such as TensorFlow and PyTorch. However, apart from highly specialized research papers, very few articles give us a comprehensive understanding of how to develop such frameworks themselves. In this chapter, rather than just “casting spells,” we focus on explaining how to make the magic work in the first place. We will dissect the deep neural network module in Owl and then demonstrate how to assemble different building blocks into a working framework. Owl’s neural network module is a full-featured DNN framework. Thanks to OCaml’s expressiveness, you can define a neural network in a very compact and elegant way, and the DNN applications built on Owl can achieve state-of-the-art performance.

5.1 Module Architecture

To explain in layman’s terms, you can imagine a neural network as a communication network in which data flow from one node to another without loops. The nodes are referred to as neurons. Every time data pass through a neuron, they are processed differently depending on the type of neuron, and the links between neurons represent the nonlinear transformations applied to the data. Neurons can be wired in various ways to form different architectures which specialize in different tasks. During the training phase, data are fed into a neural network to let it form the knowledge of certain patterns. During the inference phase, the neural network applies this previously learned knowledge to new input data.

A DNN framework is built to let us define the network structure and orchestrate its learning and inference tasks. Such a framework is a complicated artifact built from many technologies. From a high-level system perspective, however, there is only a limited number of core functions that a framework must implement. Let us take a look at the key functionalities required by Owl’s neural network module:

  • The neuron module defines the functionality of a specific type of neuron. Even though a deep neural network can be expressed directly and equivalently as a computation graph, such a graph is often very complicated; the resulting definition is difficult to construct manually and hard to maintain. Using high-level neurons that package complicated internal processing logic significantly simplifies the network definition.

  • The network module allows us to define the graph structure of a neural network and provides a set of functions to manipulate the network and orchestrate the training and inference. This module is often the entry point to the system for the majority of users.

  • The optimization module is the driver of training a neural network. The module allows us to configure and orchestrate the process of training, as well as control how data should be fed into the system. We have introduced this module in Chapter 4.

  • The algorithmic differentiation module is the underlying technology for optimization. The module provides powerful machinery to automatically calculate derivatives for any given functions, so that other modules can make use of these derivatives for various optimization purposes. We have introduced this module in detail in Chapter 3.

  • The neural network compiler module optimizes the performance of a neural network by optimizing its underlying computation graphs. The module relies heavily on Owl’s computation graph module, which we will introduce in Chapter 6.

In the rest of this chapter, we will examine the internal mechanisms of these modules. The optimization and algorithmic differentiation modules are not DNN specific; we skip them here, since their implementations have been covered in detail in the previous chapters.

5.2 Neurons

Neurons are implemented as modules, with each type of neuron corresponding to a specific module. These modules share many common functions such as mktag, mkpar, update, etc., but their implementations may differ slightly. Every neuron has its own neuron_typ which specifies, among other things, the shape of the neuron’s input and output.

module Linear : sig
  type neuron_typ =
    { mutable w : t
    ; mutable b : t
    ; mutable init_typ : Init.typ
    ; mutable in_shape : int array
    ; mutable out_shape : int array
    }

  val create : ?inputs:int -> int -> Init.typ -> neuron_typ
  val connect : int array -> neuron_typ -> unit
  val init : neuron_typ -> unit
  val reset : neuron_typ -> unit
  val mktag : int -> neuron_typ -> unit
  val mkpar : neuron_typ -> t array
  val mkpri : neuron_typ -> t array
  val mkadj : neuron_typ -> t array
  val update : neuron_typ -> t array -> unit
  val copy : neuron_typ -> neuron_typ
  val run : t -> neuron_typ -> t
  val to_string : neuron_typ -> string
  val to_name : unit -> string
end

The neuron_typ also includes all the parameters associated with the neuron. For example, the preceding code presents the signature of the Linear neuron, which computes wx + b for input x. As you can see, the fields w and b store the weight and bias of this linear function.

5.2.1 Core Functions

The record type neuron_typ is created by calling the create function when constructing the network.

let create ?inputs o init_typ =
  let in_shape =
    match inputs with
    | Some i -> [| i |]
    | None   -> [| 0 |]
  in
  { w = Mat.empty 0 0
  ; b = Mat.empty 0 0
  ; init_typ
  ; in_shape
  ; out_shape = [| o |]
  }

The parameters are created here, but their values need to be initialized by the init function: the bias is set to zero, while the initialization of the weight depends on init_typ.

let init l =
  let m = l.in_shape.(0) in
  let n = l.out_shape.(0) in
  l.w <- Init.run l.init_typ [| m; n |] l.w;
  l.b <- Mat.zeros 1 n

How the weights are initialized matters a lot in training. If the weights are not initialized properly, training takes much longer; in the worst case, it may fail to converge. The Init module provides various ways to initialize weights, specified by different type constructors. Some initialization methods require extra parameters; for example, to randomize the weights with a Gaussian distribution, we need to specify the mean and variance. As discussed by X. Glorot in [24], the initialization method has a nontrivial impact on model training performance. Besides the methods supported here, users can also use Custom to implement their own.

module Init = struct
  type typ =
    | Uniform       of float * float
    | Gaussian      of float * float
    | Standard
    | Tanh
    | GlorotNormal
    | GlorotUniform
    | LecunNormal
    | Custom        of (int array -> t)
end
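To make the dispatch concrete, here is a hedged sketch of a simplified Init.run that maps each constructor to a random weight matrix (the actual Owl function, as used in init above, also takes the current value as an argument and handles arbitrary ndarray shapes; the fan-based formula follows [24]):

let run typ shape =
  let m, n = shape.(0), shape.(1) in
  let fan_in, fan_out = float_of_int m, float_of_int n in
  match typ with
  | Uniform (a, b)       -> Mat.uniform ~a ~b m n
  | Gaussian (mu, sigma) -> Mat.gaussian ~mu ~sigma m n
  | GlorotUniform        ->
    (* limit = sqrt (6 / (fan_in + fan_out)) *)
    let l = sqrt (6. /. (fan_in +. fan_out)) in
    Mat.uniform ~a:(-.l) ~b:l m n
  | _                    -> failwith "remaining cases omitted in this sketch"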

When a neuron is added to the network, the connect function is called to validate that its input shape is consistent with the output shape of its ancestor node.

let connect out_shape l =
  assert (Array.(length out_shape = length l.in_shape));
  l.in_shape.(0) <- out_shape.(0)

The following functions retrieve the parameters and their corresponding primal and adjoint values; they are mainly used in the training phase. When the parameters need to be updated during training, the optimization engine calls the update function to do the job.

let mkpar l = [| l.w; l.b |]

let mkpri l = [| primal l.w; primal l.b |]

let mkadj l = [| adjval l.w; adjval l.b |]

let update l u =
  l.w <- u.(0) |> primal';
  l.b <- u.(1) |> primal'
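To illustrate how an optimizer consumes these functions, the following is a hedged sketch (not Owl’s actual engine, which also applies momentum, regularization, clipping, etc.) of one plain gradient-descent step on a Linear neuron, with eta a hypothetical learning rate:

let gradient_step l eta =
  let ws = Linear.mkpri l in   (* current primal values of w and b *)
  let gs = Linear.mkadj l in   (* their adjoints from the backward pass *)
  (* descend: w <- w - eta * g for every parameter *)
  let ws' = Array.map2 (fun w g -> Maths.(w - (_f eta * g))) ws gs in
  Linear.update l ws'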

The run function is the most important one in the module. It defines how the input data are processed and is called by the network during evaluation. Let us look at the run function of the linear neuron: it is so simple that it contains only one line of code, which calculates exactly wx + b.

let run x l = Maths.((x *@ l.w) + l.b)

Most neurons’ run functions are just one-liners; this simplicity is possible because Owl implements a very comprehensive set of numerical functions.
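Putting these pieces together, here is a hedged usage sketch (assuming the single-precision Neural.S instance) that builds a standalone 3-to-2 linear neuron and evaluates it on a random batch of four rows; Algodiff.pack_arr lifts a plain matrix into the AD value type t that neurons operate on:

let l = Linear.create ~inputs:3 2 Init.(Uniform (-0.1, 0.1)) in
Linear.init l;   (* sample w, zero b *)
let x = Algodiff.pack_arr (Mat.uniform 4 3) in
let y = Linear.run x l   (* a 4 x 2 batch of outputs *)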

5.2.2 Activation Module

One reason we can stack up even just linear neurons to construct a deep neural network is nonlinearity, which is introduced by activation neurons. Nonlinearity is essential because most real-world data exhibit nonlinear features; without it, any stack of linear neurons collapses into a single matrix multiplication. Activation functions are aggregated in one module called Activation. Like the other neuron modules, Activation has a neuron_typ and many similar functions.

module Activation = struct
  type typ =
    | Elu                  (* Exponential linear unit *)
    | Relu                 (* Rectified linear unit *)
    | Sigmoid              (* Element-wise sigmoid *)
    | HardSigmoid          (* Linear approximation of sigmoid *)
    | Softmax of int       (* Softmax along specified axis *)
    | Softplus             (* Element-wise softplus *)
    | Softsign             (* Element-wise softsign *)
    | Tanh                 (* Element-wise tanh *)
    | Relu6                (* Element-wise relu6 *)
    | LeakyRelu of float   (* Leaky version of a Rectified Linear Unit *)
    | TRelu of float       (* Thresholded Rectified Linear Unit *)
    | Custom of (t -> t)   (* Element-wise customised activation *)
    | None

  type neuron_typ =
    { mutable activation : typ
    ; mutable in_shape : int array
    ; mutable out_shape : int array
    }

  ...
end
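Its run function simply dispatches on the constructor. A hedged sketch consistent with the type above might look as follows (the actual Owl implementation covers every case):

let run x l =
  match l.activation with
  | Relu      -> Maths.relu x
  | Sigmoid   -> Maths.sigmoid x
  | Tanh      -> Maths.tanh x
  | Softmax a -> Maths.softmax ~axis:a x
  | Custom f  -> f x
  | None      -> x
  | _         -> failwith "remaining cases omitted in this sketch"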

Currently, there are 12 activation functions implemented in the module. A proper activation function must be chosen for the output layer based on the type of prediction problem to be solved, while ReLU is the most common choice for hidden layers. To visualize what we have introduced so far, Figure 5-1 illustrates a linear layer followed by an activation layer, a common combination in many neural network architectures.

Figure 5-1
Illustration of a linear layer and an activation layer in a neural network: the input data pass through the linear layer, whose output then passes through the activation layer

5.3 Networks

Essentially, a neural network is a computation graph whose nodes represent neurons, aggregating more complicated data processing logic than the nodes of a vanilla computation graph. The following code presents the type definition of a node. A node can have a name for reference. The prev and next fields link to ancestors and descendants, respectively. The output field stores the output of the computation.

The node per se does not contain any data processing logic; rather, it refers to the neuron which implements the actual numerical operations on the data. The motivation of this design is to separate the mechanics of the network from the logic of the neurons. The network field refers to the network that the current node belongs to. Note that the network structure is not necessarily the same in the training and inference phases: some nodes, such as dropout, may be dropped during inference. The train field specifies whether a node is for training purposes only.

type node =
  { mutable name : string
  ; mutable prev : node array
  ; mutable next : node array
  ; mutable neuron : neuron
  ; mutable output : t option
  ; mutable network : network
  ; mutable train : bool
  }

A network can be constructed by wiring up nodes. Looking at the type definition, we can see that a neural network is identified by a unique id nnid. The size field tracks the number of nodes in the network. The roots field refers to the data inputs, while outputs refers to the outputs of the network. The topo field is a list of all nodes sorted in topological order; it is used for iterating over nodes when evaluating the network.

and network =
  { mutable nnid : string
  ; mutable size : int
  ; mutable roots : node array
  ; mutable outputs : node array
  ; mutable topo : node array
  }

As we can see, these type definitions are similar to computation graphs. Even though they contain some specific neural network–related fields, the type definitions are not more complicated than a general-purpose computation graph.

To build up networks, most of the time we use functions that build a node and connect it to an existing node stack. For example:

let conv2d ?name ?(padding = SAME) ?(init_typ = Init.Tanh)
    ?act_typ kernel stride input_node =
  let neuron = Conv2D (Conv2D.create padding kernel stride init_typ) in
  let nn = get_network input_node in
  let n = make_node ?name [||] [||] neuron None nn in
  add_node ?act_typ nn [| input_node |] n

This function first creates a Conv2D neuron with various parameters and wraps it into a node n using the make_node function. Then we connect n to its parent nodes using the add_node function; this step uses the neuron’s connect function and also updates the child’s input and output shapes during connection. With the network graph APIs, we can write concise code to build up a network, such as:

open Owl
open Neural.S
open Neural.S.Graph

let make_network input_shape =
  input input_shape
  |> conv2d [|1;1;1;6|] [|1;1|] ~act_typ:Activation.Relu
  |> max_pool2d [|2;2|] [|2;2|]
  |> conv2d [|5;5;6;16|] [|1;1|] ~act_typ:Activation.Relu
  |> max_pool2d [|2;2|] [|2;2|]
  |> fully_connected 120 ~act_typ:Activation.Relu
  |> linear 84 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.(Softmax 1)
  |> get_network

The network definition always starts with an input layer and ends with the get_network function, which finalizes and returns the constructed network. Note that we only need to provide the data shape in the input node: once input_shape is determined, the shapes of the data and the parameters in all other nodes are inferred automatically.
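As a usage sketch, assuming the single input channel above targets 28 × 28 grayscale images such as MNIST, we can instantiate the network and inspect the inferred shapes:

let nn = make_network [|28; 28; 1|]
let () = Graph.print nn   (* prints a per-node summary with the inferred shapes *)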

We have now covered most of the elements needed to build up a neural network. For example, Figure 5-2 shows the structure of a basic LeNet-like neural network, combining convolution, pooling, linear, and activation layers. This network is simple yet powerful, perfectly capable of performing the handwritten digit recognition task accurately. But to do that, the network must first be trained.

Figure 5-2
Structure of a basic neural network: the input passes through convolution and pooling layers before reaching the fully connected part

5.4 Training

Training is a complicated, time-consuming, and computation-intensive process, and many parameters are available to configure the different components of a neural network framework and control this process. The following functor definition gives a good overview of what needs to be configured. Fortunately, the Optimise module does all the heavy lifting; it implements several engines for different optimization tasks.

module Flatten (Graph : Owl_neural_graph_sig.Sig) = struct
  module Graph = Graph
  ...
  module Params = Graph.Neuron.Optimise.Params
  module Batch = Graph.Neuron.Optimise.Batch
  module Learning_Rate = Graph.Neuron.Optimise.Learning_Rate
  module Loss = Graph.Neuron.Optimise.Loss
  module Gradient = Graph.Neuron.Optimise.Gradient
  module Momentum = Graph.Neuron.Optimise.Momentum
  module Regularisation = Graph.Neuron.Optimise.Regularisation
  module Clipping = Graph.Neuron.Optimise.Clipping
  module Stopping = Graph.Neuron.Optimise.Stopping
  module Checkpoint = Graph.Neuron.Optimise.Checkpoint
end

We have already introduced these optimization modules in Chapter 4. The main logic of training is encoded in the train_generic function. In addition to the network to be trained nn, the inputs x, and the labeled data y, the function accepts optional parameters such as state, which allows a previous training session to be resumed.

let train_generic ?state ?params ?(init_model = true) nn x y =
  if init_model = true then init nn;
  let f = forward nn in
  let b = backward nn in
  let u = update nn in
  let s = save nn in
  let p =
    match params with
    | Some p -> p
    | None   -> Optimise.Params.default ()
  in
  Optimise.minimise_network ?state p f b u s x y

We need to specify four important functions: one for forward evaluation, one for backward propagation, one for updating the weights, and one for saving the network. These four functions are passed as parameters to the minimise_network function, the engine dedicated to optimizing neural networks as functions. We introduced minimise_fun in Chapter 4 and used it to find an optimal x that minimizes f(x). The minimise_network function works and is implemented similarly, with one subtle difference: instead of searching over the input x, it aims to find the optimal θ that minimizes fθ(x) for a given input x. In the case of a neural network, θ denotes the weight parameters.
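In its simplest gradient-descent form (a hedged illustration; the actual update is shaped by the configured Learning_Rate, Momentum, Regularisation, and Clipping modules), each iteration performs the update θ ← θ − α∇θ ℓ(fθ(x), y), where ℓ is the configured loss function and α the learning rate.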

5.4.1 Forward and Backward Pass

Evaluating a neural network can be done in two directions. The direction from inputs to outputs is referred to as the forward pass, while the opposite direction from outputs to inputs is the backward pass. Inference only requires a forward pass, but training requires both a forward and a backward pass in each of many iterations. The forward function has only two steps. The first, the call to mktag, is required by algorithmic differentiation so that AD can calculate derivatives in the subsequent backward pass; for inference, this step is unnecessary. The run function then pushes the data x through network nn, iterating over all the neurons’ computations, and mkpar nn returns all the parameters of the network, for example, the weights.

let forward nn x =
  mktag (tag ()) nn;
  run x nn, mkpar nn

The core logic of the run function iterates over all the neurons in topological order. For each neuron, the inputs are first collected from its ancestors’ outputs; then the neuron’s run function is triggered to process the inputs, and the output is saved in the neuron’s hosting node. Finally, the output of the whole network is collected and returned.

let run x nn =
  Array.iter
    (fun n ->
      (* collect the inputs from parents' output *)
      let input =
        match n.neuron with
        | Input _ -> [| x |]
        | _       -> collect_output n.prev
      in
      (* process the current neuron, save output *)
      let output = run input n.neuron in
      n.output <- Some output)
    nn.topo;
  (* collect the final output from the tail *)
  let sink = [| nn.topo.(Array.length nn.topo - 1) |] in
  (collect_output sink).(0)
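The collect_output helper is not shown in the listing; a hedged sketch consistent with the node type, whose output field is a t option, could be:

let collect_output nodes =
  Array.map
    (fun n ->
      match n.output with
      | Some o -> o
      | None   -> failwith "collect_output: node has not been evaluated")
    nodes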

A backward pass is much more complicated than a forward pass, even though the code of the backward function looks as simple as that of forward. The actual complexity is hidden in reverse_prop, the core function of the AD module. The purpose of the backward pass is to propagate the errors backward from the outputs to the inputs. By doing so, the neurons along the path can use this error information to adjust their parameters and, hopefully, minimize future errors.

Derivatives can also be calculated in the forward pass, for example, using dual numbers, so why does the implementation use backward propagation? The reason is that a typical neural network has far more parameters than outputs: reverse mode obtains the gradient of a scalar loss with respect to all parameters in a single backward pass, whereas forward mode would require one pass per parameter, so backward propagation requires much less computation in this scenario.

let backward nn y =
  reverse_prop (_f 1.) y;
  mkpri nn, mkadj nn

Here, mkpri and mkadj return the primal and adjoint values of all the parameters.

5.5 Neural Network Compiler

If we consider a neural network as a complicated function, then training the network is an iterative process of optimizing that function. In every iteration, the optimization engine first runs the forward pass of the function, then calls the algorithmic differentiation module to obtain the derivative function, and finally runs the backward pass to update the weights. As we can see, two computation graphs are created dynamically in each iteration, one for the forward pass and the other for the backward pass. In fact, given a fixed neural network, the structure of both computation graphs is identical across iterations. This observation serves as the basis for further optimizing the training process. The optimization consists of three parts:

  • We dry run the neural network to derive the computation graphs for both the forward and the backward pass and reuse these graphs in the following iterations rather than regenerating them.

  • We optimize the graph structure to improve both computation performance and memory usage.

  • We replace eager evaluation with lazy evaluation when evaluating a computation graph, which can further optimize the performance.

The Owl_neural_compiler functor is designed to automate these steps. Compared to using the optimization engine directly, a neural network compiled by Owl_neural_compiler can be trained much faster and with a much smaller memory footprint. Let us create a VGG-like convolutional neural network to illustrate. The VGG network targets image classification tasks, for example, on the CIFAR10 dataset. The network structure is defined by the following code:

let make_network input_shape =
  input input_shape
  |> normalisation ~decay:0.9
  |> conv2d [|3;3;3;32|] [|1;1|] ~act_typ:Activation.Relu
  |> conv2d [|3;3;32;32|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
  |> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
  |> dropout 0.1
  |> conv2d [|3;3;32;64|] [|1;1|] ~act_typ:Activation.Relu
  |> conv2d [|3;3;64;64|] [|1;1|] ~act_typ:Activation.Relu ~padding:VALID
  |> max_pool2d [|2;2|] [|2;2|] ~padding:VALID
  |> dropout 0.1
  |> fully_connected 512 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.(Softmax 1)
  |> get_network

The following function first creates the network, then configures the training process, and finally trains the network by calling Graph.train. The Graph.train function calls the train_generic function introduced in the previous section, which passes the neural network along with the configurations directly to the optimization engine to kick off the optimization process.

let train () =
  let x, _, y = Dataset.load_cifar_train_data 1 in
  let network = make_network [|32;32;3|] in
  Graph.print network;
  let params =
    Params.config
      ~batch:(Batch.Mini 100)
      ~learning_rate:(Learning_Rate.Adagrad 0.005)
      ~checkpoint:(Checkpoint.Epoch 1.)
      ~stopping:(Stopping.Const 1e-6)
      10.
  in
  Graph.train ~params network x y

However, from a programmer’s perspective, if we use the neural compiler, the only thing that needs to be changed is the train function. The network definition remains exactly the same.

let train network =
  let x, _, y = Dataset.load_cifar_train_data 1 in
  let x = CGCompiler.Engine.pack_arr x |> Algodiff.pack_arr in
  let y = CGCompiler.Engine.pack_arr y |> Algodiff.pack_arr in
  let params =
    Params.config
      ~batch:(Batch.Mini 100)
      ~learning_rate:(Learning_Rate.Adagrad 0.005)
      10.
  in
  CGCompiler.train ~params network x y

Apart from the mundane packing and unpacking of parameters, the most noticeable change is that we now use CGCompiler.train to train the network. CGCompiler.train is implemented in the Owl_neural_compiler functor. So what does this function contain? Let us have a look at its implementation.

let train ?state ?params network x y =
  let params =
    match params with
    | Some p -> p
    | None   -> Params.default ()
  in
  let network_name = Graph.get_network_name network in
  Owl_log.info "compile network %s into static graph ..." network_name;
  (* compile network into static graph *)
  let x_size = (unpack_arr x |> Engine.shape).(0) in
  let loss, xt, yt, cgraph = compile_deep params network x_size in
  let eval = make_eval_fun loss xt yt cgraph in
  let update = make_update_fun cgraph in
  let save _fname = () in
  (* optimise graph structure *)
  Engine.save_graph cgraph (network_name ^ "_raw.cgd");
  Engine.optimise cgraph;
  Engine.save_graph cgraph (network_name ^ "_opt.cgd");
  Owl_log.info "start training %s ..." network_name;
  Optimise.minimise_compiled_network ?state params eval update save x y

Compared to the train_generic function, CGCompiler.train seems more complicated, but its logic is straightforward. The implementation consists of four steps. First, the function retrieves the training parameters, creating default values if they are not provided. Second, the neural network is compiled into a static graph, and two higher-order functions, eval and update, are created to evaluate the computation graph and update the network weights. Third, the graph structure is optimized using the functions defined in the computation graph module. Fourth, the optimization engine starts the iterative process to minimize the loss function.

The core of CGCompiler.train is compile_deep, which is where the magic happens. However, compile_deep is a rather lengthy and complicated function. Instead of studying it as a whole, let us divide it into smaller parts and examine them separately:

let loss_fun = Loss.run params.loss in
let grad_fun = Gradient.run params.gradient in
let rate_fun = Learning_Rate.run params.learning_rate in
let regl_fun = Regularisation.run params.regularisation in
let momt_fun = Momentum.run params.momentum in
let upch_fun = Learning_Rate.update_ch params.learning_rate in
let clip_fun = Clipping.run params.clipping in
...

The first part simply creates some higher-order functions from the network configuration, in order to simplify the code that follows:

let batch =
  match params.batch with
  | Full       -> full_size
  | Mini n     -> n
  | Sample n   -> n
  | Stochastic -> 1
in
let network_shape = Graph.input_shape network in
let input_shape = Array.append [| batch |] network_shape in
...

Because compile_deep needs to dry run the network, it needs to know the shape of the input, which depends on how the training is configured. For a small dataset, we can feed the whole dataset in each iteration, so the batch dimension equals the full size. For a larger dataset, we may want to select a batch of data as input, or even just one sample per iteration. The batch size is calculated from the params.batch parameter.
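For example (a hedged illustration), training on CIFAR10 images with Batch.Mini 100 yields:

let network_shape = [|32; 32; 3|] in   (* from Graph.input_shape *)
let input_shape = Array.append [|100|] network_shape
(* input_shape = [|100; 32; 32; 3|] *)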

(* initialise the network weight *)
Graph.init network;
Graph.mkpar network
|> Owl_utils.aarr_map (fun v ->
       let v = Algodiff.unpack_arr v in
       Engine.eval_arr [| v |];
       let u = Engine.var_arr "" ~shape:(Engine.shape v) in
       Engine.(assign_arr u (unpack_arr v));
       Algodiff.pack_arr u)
|> Graph.update network;
...

Then the neural network is initialized, and its weights are evaluated and assigned to computation graph variables. After this step, all the preparation for the dry run is done.

(* derive the computation graph in forward mode *)
let x = Engine.var_arr "x" ~shape:input_shape |> pack_arr in
let y' = Graph.forward network x |> fst in
let output_shape = unpack_arr y' |> Engine.shape in
let y = Engine.var_arr "y" ~shape:output_shape |> pack_arr in
let loss = loss_fun y y' in
let loss = Maths.(loss / _f (Mat.row_num y |> float_of_int)) in
...

The most critical step is deriving the computation graph of the backward pass. Before we can do that, we must first run the forward pass. The outcome of the forward pass y' and the ground truth placeholder y are fed into the loss function loss_fun; the resulting loss value contains the computation graph of the forward pass.

let ws = Owl_utils_array.flatten (Graph.mkpri network) in
let reg =
  match params.regularisation <> Regularisation.None with
  | true  -> Array.fold_left (fun a w -> Maths.(a + regl_fun w)) (_f 0.) ws
  | false -> _f 0.
in
let loss = Maths.(loss + reg) in
(* assign loss variable name *)
Owl_graph.set_name (unpack_elt loss |> Engine.elt_to_node) "loss";
...

Then we further adjust the loss value by adding the regularization term if necessary and assign it a proper name.

(* derive the computation graph in reverse mode *)
let z = Graph.(backward network loss) in
let ws = Owl_utils_array.flatten (fst z) in
let gs' = Owl_utils_array.flatten (snd z) in
...

The Graph.backward function creates the computation graph of the backward pass, contained in z; this graph is also the derivative of the loss function of the network. We then separate out the weights ws and their adjoint values gs' from z. After this step, lengthy code follows to further calculate and adjust the gradients with clipping, momentum, etc.

(* construct a computation graph with inputs and outputs *)
let network_name = Graph.get_network_name network in
...
let output =
  Array.append param_o [| unpack_elt loss |> Engine.elt_to_node |]
in
let cgraph = Engine.make_graph ~input:param_i ~output network_name in
...
loss, x, y, cgraph

The final computation graph is returned along with the loss function, input, and output.

5.6 Case Study: Object Detection

As a case study demonstrating how the Neural module in Owl is applied in real-world applications, this final section shows how to perform instance segmentation using the Mask R-CNN network, which offers both a complex network structure and an interesting application. Image classification is one of the most common DNN tasks: it takes an image of a certain object as input and infers what that object is. However, a classifier easily gets confused when applied to an image containing multiple objects. For that reason, object detection is another classical computer vision task: given an image that contains multiple objects, it classifies the individual objects and localizes each one with a bounding box. Similarly, the semantic segmentation task classifies the pixels of an image into different categories: each segment is recognized by a “mask” covering the whole object. All possible objects are shown using different masks, but this does not categorize what those objects are. The Mask R-CNN (Mask Region-based Convolutional Neural Network) architecture was proposed in 2017 to address all the previous problems. With sufficient training, it can solve them at once: detecting the objects in an image, labeling each of them, and providing a binary mask determining which pixels belong to which object. This task is called instance segmentation.

As a preliminary example and for visual motivation, Figure 5-3 shows what this network generates. Here, an ordinary street view is processed by the pretrained Mask R-CNN (MRCNN) network, and the objects (people, sheep, bags, cars, a bus, etc.) are segmented from the input image and recognized with a probability represented by a number between zero and one. Image segmentation has many important applications, including medical imaging (locating tumors, detecting cancer cells, etc.), traffic control systems, and locating objects in satellite images. In the rest of this section, we explain how this complex network can be built in OCaml using the Neural module. The full code is provided in the GitHub repository.

Figure 5-3
Example of image segmentation: Oxford street view

5.6.1 Object Detection Network Architectures

Before explaining the details of Mask R-CNN, we briefly introduce, as background, how deep network architectures for object detection and instance segmentation developed.

5.6.1.1 R-CNN Architecture

The idea of using a CNN to enhance the object detection task was first proposed in [23]. This paper describes a “Regions with CNN features” (R-CNN) object detection system, which is divided into several phases. The first phase localizes possible objects of interest in the input image. Instead of using a sliding window, R-CNN takes a different approach based on “region proposals”: for each input image, it first generates a number of (e.g., 2000) region proposals that are independent of the object categories. These are rectangular regions of the image of different aspect ratios and sizes, and the content of each is later checked to see if it contains any object of interest. Each region proposal is then processed by a CNN to produce a 4096-dimensional feature vector. This CNN takes input of a fixed size of 227 × 227, so each region, regardless of its shape, is warped to this fixed size before being processed. Finally, the output feature vector is classified by a trained SVM model.

5.6.1.2 Fast R-CNN Architecture

Compared to the previous state of the art, R-CNN improves the mean average precision by more than 30%. However, it has several problems. For one, the R-CNN pipeline consists of several parts, so training takes multiple stages, including training the CNN, the SVM models, etc. Moreover, training is expensive in both space and time. Finally, since the CNN inference pass must be performed on every region proposal, the whole object detection process is slow.

To mitigate these challenges, Girshick et al. proposed the Fast R-CNN approach. As in R-CNN, region proposals are first generated from the input image. But to reduce training costs, Fast R-CNN consists of a single CNN that can be trained in one stage. Furthermore, it does not need to perform a forward pass for each region proposal. Instead, it first computes a convolutional feature map for the whole input image and then projects each region of interest (RoI) from the input image onto this feature map, deciding which part of it should be extracted. Such a region on the feature map is pooled by an “RoI pooling” layer into a smaller feature map of fixed size, which is then turned into a feature vector by several fully connected layers.

Next, each feature vector is fed into two sibling output branches. One branch outputs the classification of the object in that region, together with the confidence of that classification. The other specifies the rectangular location of the object, encoded by four real-valued numbers; its output contains such a four-number tuple for every object category in the task. Compared to R-CNN, this method does not require a lot of space for feature caching, and it proves to be about 9 times faster in training and 213 times faster in inference.

5.6.1.3 Faster R-CNN Architecture

However, Fast R-CNN is still not perfect. Region proposals are generated from the input image, while object detection is performed using both the convolutional feature map and the region proposals. Yet the feature map, which abstractly represents the features of the input image, may already contain sufficient information not only to perform object detection but also to find the regions where objects may be. This important observation led to the development of the Faster R-CNN method.

This approach builds on the Fast R-CNN network and introduces a new structure: the Region Proposal Network (RPN). This extra network operates on the feature maps generated by the CNN in Fast R-CNN, producing region proposals that are passed to the RoI pooling step for the subsequent detections. The RPN slides a window over the feature map. At each sliding-window location, nine proposals centered around that location are checked for their coordinates (represented by four real-valued numbers) and for the probability that an object exists in the given region; each such proposal is called an anchor. In Faster R-CNN, the authors trained the RPN, used the generated proposals to train Fast R-CNN, then used the trained network to initialize the RPN, and so forth. The two parts are thus trained iteratively, so there is no need for external methods to produce region proposals: everything is performed in a unified network for the object detection task.

5.6.1.4 Mask R-CNN Architecture

Building on this existing work, Mask R-CNN was proposed to perform both object detection and semantic segmentation. It keeps the architecture of Faster R-CNN, adding only one extra branch in the final stage after the RoI feature layer. Where the outputs previously included the object classification and location, a third branch now contains the mask of the detected object in the RoI. Therefore, for any RoI, Mask R-CNN can retrieve the rectangular bound, the classification result, the classification probability, and the mask of the object in a single pass. In the next section, we introduce the Mask R-CNN architecture in detail.

5.6.2 Implementation of Mask R-CNN

This section outlines the main parts of the Mask R-CNN architecture, explaining how it differs from its predecessors. For a more detailed explanation, please refer to the original paper [27]. The OCaml implementation of the inference model is available in the code repository.

5.6.2.1 Building Mask R-CNN

After this quick introduction to MRCNN and its development, let’s look at the code to understand how the network is constructed:

open Owl
module N = Dense.Ndarray.S

open CGraph
open Graph
open AD

module RPN = RegionProposalNetwork
module PL = ProposalLayer
module FPN = FeaturePyramidNetwork
module DL = DetectionLayer
module C = Configuration

let image_shape = C.get_image_shape () in
let inps =
  inputs
    ~names:[|"input_image"; "input_image_meta"; "input_anchors"|]
    [|image_shape; [|C.image_meta_size|]; [|num_anchors; 4|]|]
in
let input_image = inps.(0)
and input_image_meta = inps.(1)
and input_anchors = inps.(2) in

The network accepts three inputs, representing the image, the image metadata, and the anchors (the candidate rectangular regions). The Configuration module contains a list of constants used in building the network.

5.6.2.2 Feature Extractor

The picture is first fed to a convolutional neural network to extract features. The first few layers detect low-level features of the image, such as edges and basic shapes. Deeper into the network, these simple features are assembled into higher-level features such as “people” and “cars.” Five of these layers (called “feature maps”) of various sizes, covering both low and high levels, are then passed on to the next parts. This implementation uses Microsoft’s ResNet101 network as the feature extractor.

let tdps = C.top_down_pyramid_size in
let str = [|1; 1|] in
let p5 =
  conv2d [|1; 1; 2048; tdps|] str ~padding:VALID ~name:"fpn_c5p5" c5
in
let p4 =
  add ~name:"fpn_p4add"
    [|upsampling2d [|2; 2|] ~name:"fpn_p5upsampled" p5;
      conv2d [|1; 1; 1024; tdps|] str ~padding:VALID ~name:"fpn_c4p4" c4|]
in
let p3 =
  add ~name:"fpn_p3add"
    [|upsampling2d [|2; 2|] ~name:"fpn_p4upsampled" p4;
      conv2d [|1; 1; 512; tdps|] str ~padding:VALID ~name:"fpn_c3p3" c3|]
in
let p2 =
  add ~name:"fpn_p2add"
    [|upsampling2d [|2; 2|] ~name:"fpn_p3upsampled" p3;
      conv2d [|1; 1; 256; tdps|] str ~padding:VALID ~name:"fpn_c2p2" c2|]
in
let conv_args = [|3; 3; tdps; tdps|] in
let p2 = conv2d conv_args str ~padding:SAME ~name:"fpn_p2" p2 in
let p3 = conv2d conv_args str ~padding:SAME ~name:"fpn_p3" p3 in
let p4 = conv2d conv_args str ~padding:SAME ~name:"fpn_p4" p4 in
let p5 = conv2d conv_args str ~padding:SAME ~name:"fpn_p5" p5 in
let p6 = max_pool2d [|1; 1|] [|2; 2|] ~padding:VALID ~name:"fpn_p6" p5 in
let rpn_feature_maps = [|p2; p3; p4; p5; p6|] in
let mrcnn_feature_maps = [|p2; p3; p4; p5|]

The features are extracted by combining ResNet101 with a Feature Pyramid Network: ResNet extracts features of the image (early layers extract low-level features, later layers high-level ones), while the Feature Pyramid Network creates a second pyramid of feature maps from top to bottom so that every map has access to both high- and low-level features. This combination achieves excellent gains in both accuracy and speed.

5.6.2.3 Proposal Generation

To locate the objects, about 250,000 overlapping rectangular regions, or anchors, are generated.

let nb_ratios = Array.length C.rpn_anchor_ratios in
let rpns =
  Array.init 5 (fun i ->
      RPN.rpn_graph rpn_feature_maps.(i)
        nb_ratios C.rpn_anchor_stride
        ("_p" ^ string_of_int (i + 2)))
in
let rpn_class =
  concatenate 1 ~name:"rpn_class" (Array.init 5 (fun i -> rpns.(i).(0)))
in
let rpn_bbox =
  concatenate 1 ~name:"rpn_bbox" (Array.init 5 (fun i -> rpns.(i).(1)))

A separate RPN graph is applied to each feature map in rpn_feature_maps, and the results of these networks are concatenated. For each bounding box on the image, the RPN returns the likelihood that it contains an object, called its objectness, together with a refinement of the anchor; both are represented by rank-3 ndarrays.

Next, in the proposal layer, the 1000 best anchors are selected according to their objectness. Anchors that overlap too much with each other are eliminated to avoid detecting the same object multiple times. Each selected anchor is also refined in case it was not perfectly centered on the object.

let rpn_rois =
  let prop_f = PL.proposal_layer C.post_nms_rois C.rpn_nms_threshold in
  MrcnnUtil.delay_lambda_array [|C.post_nms_rois; 4|] prop_f ~name:"ROI"
    [|rpn_class; rpn_bbox; input_anchors|]
in

In rpn_rois, the proposal layer picks the top anchors from the RPN output, based on non-maximum suppression and the anchor scores.

5.6.2.4 Classification

All anchor proposals from the previous layer are resized to a fixed size and fed into a ten-layer neural network, which assigns each of them a probability of belonging to each class. The network is pretrained on a fixed set of classes; changing that set requires retraining the whole network. Note that this step does not take as much time per anchor as a full-fledged image classifier such as Inception would, because it reuses the feature maps precomputed by the Feature Pyramid Network; there is no need to go back to the original picture. For each proposal, the class with the highest probability is chosen, and thanks to the class predictions, the anchor proposals are refined even further. Proposals classified as background are deleted. Eventually, only the proposals with an objectness above some threshold are kept, and we have our final detections, each with a bounding box and a label. This process is described by the following code:

let mrcnn_class, mrcnn_bbox =
  FPN.fpn_classifier_graph rpn_rois mrcnn_feature_maps input_image_meta
    C.pool_size C.num_classes C.fpn_classif_fc_layers_size
in
let detections =
  MrcnnUtil.delay_lambda_array [|C.detection_max_instances; 6|]
    (DL.detection_layer ()) ~name:"mrcnn_detection"
    [|rpn_rois; mrcnn_class; mrcnn_bbox; input_image_meta|]
in
let detection_boxes =
  lambda_array [|C.detection_max_instances; 4|]
    (fun t -> Maths.get_slice [[]; []; [0; 3]] t.(0))
    [|detections|]

A Feature Pyramid Network classifier associates a class with each proposal and further refines the bounding box for that class. The only thing left is to generate a binary mask for each object. This is handled by a small convolutional neural network which produces, for each detected bounding box, a small square of values between 0 and 1. The square is resized to the original size of the bounding box with bilinear interpolation, and pixels with a value over 0.5 are tagged as being part of the object (see the post-processing sketch after the next listing).

let mrcnn_mask =
  FPN.build_fpn_mask_graph detection_boxes mrcnn_feature_maps
    input_image_meta C.mask_pool_size C.num_classes
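The final thresholding step described above happens outside the network graph. A hedged post-processing sketch for one resized mask, a single-precision ndarray with values in [0, 1], using the N alias defined earlier, might be:

(* tag pixels with a value over 0.5 as part of the object *)
let binarise mask = N.map (fun v -> if v > 0.5 then 1. else 0.) mask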

Finally, the output contains detection results and masks from the previous steps.

outputs ~name:C.name [|detections; mrcnn_mask|]

Now that we know the internals of the MRCNN architecture, we can run the code to see it at work. The core code is listed as follows:

open Mrcnn

let src = "image.png" in
let fun_detect = Model.detect () in
let Model.({rois; class_ids; scores; masks}) = fun_detect src in
let img_arr = Image.img_to_ndarray src in
let filename = Filename.basename src in
let format = Images.guess_format src in
let out_loc = out ^ filename in
Visualise.display_masks img_arr rois masks class_ids;
let camlimg = Image.img_of_ndarray img_arr in
Visualise.display_labels camlimg rois class_ids scores;
Image.save out_loc format camlimg;
Visualise.print_results class_ids rois scores

A key step is applying the Model.detect function to the input image; it returns the regions of interest, the classification result ID of the object in each region, the classification certainty scores, and the masks showing the outline of each object in its region. With this information, the Visualise module makes three passes over the original image: the first adds bounding boxes and object masks, the second adds the labels and scores next to the bounding boxes, and the third prints out the results of the previous two steps. In this example, pretrained weights for 80 classes of common objects are provided, converted from the TensorFlow implementation mentioned earlier. As for execution speed, processing one image of 1024 × 1024 pixels takes between 10 and 15 seconds on a moderate laptop.

5.7 Summary

In this chapter, we provided insight into the neural network module in Owl. Benefiting from the solid implementation of algorithmic differentiation and optimization, the Neural module is concise and expressive. We explained the neuron and network components of this module and then showed how network training is performed in Owl. The chapter also covered how we implement a neural network compiler to automatically optimize the network structure and memory usage. Finally, we introduced in detail a DNN application, instance segmentation, which drives the development of the computation graph module in Owl.