Mathematical optimization is the process of searching for the best values of a set of parameters according to a certain metric. It can be formalized as follows:

$$ {\displaystyle \begin{array}{ll}\textrm{minimize} & f(x)\\ {}\textrm{subject}\ \textrm{to} & {g}_i(x)\ \leq\ {b}_i,\quad i=1,2,\dots, n.\end{array}} $$
(4.1)

Here, the vector x = (x1, x2, …, xn) contains the optimization variables, and the function \( f:{\mathcal{R}}^n\to \mathcal{R} \) is the target function. The functions \( {g}_i:{\mathcal{R}}^n\to \mathcal{R},i=1,2,\dots, n \) are the constraints, with the constants bi being the constraint boundaries. Solving the optimization problem means finding the point x∗ that minimizes f.

An optimization problem aims to find a solution that minimizes (or maximizes) some quantity; it therefore arises in a wide range of disciplines, such as finance, engineering, and computer science. For example, in portfolio management in the finance industry, an optimal solution is required to divide a given total capital into n types of investments, where xi is the amount of capital invested in financial asset i. The target might be to maximize the expected return or to minimize the risk, and the constraints might require that the smallest return be larger than a predefined value.

An optimization problem can be categorized into multiple types. The general form in Eq. 4.1 contains several constraints. If there are no constraints, the problem is called an unconstrained optimization problem; otherwise, it is a constrained optimization problem. From another perspective, some optimization problems aim to find the global minimum (e.g., minimize f(x) = x2), while others only need to find an optimum within a certain range (e.g., minimize f(x) = sin(x) in the range [0, 2π]). In this chapter, and in the corresponding module in Owl, we focus on unconstrained, local optimization problems. Specifically, we have implemented one of the most important families of optimization methods: gradient descent.

4.1 Gradient Descent

The gradient descent method belongs to one of the most commonly used families of iterative optimization methods. Its basic idea is to start from an initial value and repeatedly move along a search direction by a certain step size to decrease the function value, until it converges to a local minimum. We can thus describe the nth iteration of a descent method as follows:

1. Calculate a descent direction d.

2. Choose a step size μ.

3. Update the location: xn + 1 = xn + μ d.

Repeat this process until a stopping condition is met, such as the update being smaller than a threshold. Among descent methods, gradient descent is one of the most widely used algorithms to perform optimization and the most common way to train neural networks. Based on the preceding descent process, a gradient descent method uses the function's gradient to decide its direction d and can be described as follows:

1. Calculate the descent direction −∇f(xn).

2. Choose a step size μ.

3. Update the location: xn + 1 = xn − μ ∇f(xn).

Here, ∇ denotes the gradient. The step size μ along the search direction is also called the learning rate of this iteration. When searching for the minimum, gradient descent always moves against the gradient, that is, along the negative gradient direction. The gradient can be calculated using the algorithmic differentiation module we introduced in Chapter 3. That's why the whole Optimise module is built on Algodiff.

Implementing gradient descent directly from this definition is straightforward. For example, for a certain differentiable function f with a single global minimum, and given a reference x to the current point, a step size alpha, and an iteration count n, the following simple Owl code would do:

module N = Dense.Ndarray.D
open Algodiff.D

let _ =
  for i = 1 to n - 1 do
    let u = grad f (Arr !x) |> unpack_arr in
    x := N.(sub !x (scalar_mul alpha u))
  done;;

It is basically a line-by-line translation of the process described before. You should already be familiar with the functions from the AD module, such as grad for calculating gradients and unpack_arr for converting an AD ndarray into a normal one. However, many details must be attended to in a robust gradient descent implementation, such as how the learning rate should change over time and how variant methods should be incorporated. Next, we introduce several building blocks of this method and the structure of the Optimise module in Owl.
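To make this concrete, here is a minimal, self-contained sketch of the same loop applied to a simple univariate target; the function f(x) = (x − 3)², the step size alpha, and the iteration count n are illustrative assumptions rather than part of the Owl API.

module N = Dense.Ndarray.D
open Algodiff.D

(* illustrative target: f(x) = (x - 3)^2, stored as a 1-element ndarray *)
let f x = Maths.(sum' ((x - F 3.) * (x - F 3.)))

let _ =
  let alpha = 0.1 in
  let n = 100 in
  let x = ref (N.of_array [| 0. |] [| 1 |]) in
  for _i = 1 to n - 1 do
    let u = grad f (Arr !x) |> unpack_arr in
    x := N.(sub !x (scalar_mul alpha u))
  done;
  (* x should now be close to 3. *)
  !x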

4.2 Components

The core of the Optimise module in Owl abstracts several aspects of the gradient descent method: the learning rate, the gradient method, momentum, etc. Each of them is represented as a submodule. All computation in these submodules relies on the AD module. The following code shows an outline of the optimization module. It is designed as a functor parameterized by the AD module. In this section, we introduce a selection of these submodules and how they implement different methods.

module Make (Algodiff : Owl_algodiff_generic_sig.Sig) = struct
  module Algodiff = Algodiff
  open Algodiff

  module Learning_Rate = struct
    ...
  end

  module Gradient = struct
    ...
  end

  ...
end

4.2.1 Learning Rate

When training a machine learning model, the learning rate is arguably the most important hyperparameter that affects the training speed and quality. It specifies how much the model weight should be changed given the estimated error in each training round. A large learning rate may lead to suboptimal solutions and unstable training processes, whereas a small rate may result in a long training time. That's why choosing a proper learning rate is both crucial and challenging in model training. There exist various methods to decide the learning rate, and we have incorporated them in the Learning_Rate module, as follows:

module Learning_Rate = struct
  type typ =
    | Adagrad of float
    | Const   of float
    | Decay   of float * float
    | RMSprop of float * float

  let run = function
    | Adagrad a ->
      fun _ _ c -> Maths.(_f a / sqrt (c.(0) + _f 1e-32))
    | Const a -> fun _ _ _ -> _f a
    | Decay (a, k) ->
      fun i _ _ -> Maths.(_f a / (_f 1. + (_f k * _f (float_of_int i))))
    | RMSprop (a, _) ->
      fun _ _ c -> Maths.(_f a / sqrt (c.(0) + _f 1e-32))

  let default = function
    | Adagrad _ -> Adagrad 0.01
    | Const _   -> Const 0.001
    | Decay _   -> Decay (0.1, 0.1)
    | RMSprop _ -> RMSprop (0.001, 0.9)

  let update_ch typ g c =
    match typ with
    | Adagrad _      -> [| Maths.(c.(0) + (g * g)); c.(1) |]
    | RMSprop (_, k) ->
      [| Maths.((_f k * c.(0)) + ((_f 1. - _f k) * g * g)); c.(1) |]
    | _              -> c

  let to_string = function
    | Adagrad a      -> Printf.sprintf "adagrad %g" a
    | Const a        -> Printf.sprintf "constant %g" a
    | Decay (a, k)   -> Printf.sprintf "decay (%g, %g)" a k
    | RMSprop (a, k) -> Printf.sprintf "rmsprop (%g, %g)" a k
end

This module consists of the type definitions of learning rate methods and the functions that can be applied on them. Learning_Rate.typ consists of four different types of algorithms. Each type specifies the parameters it requires.

Let's look at how these methods are implemented to better understand the code. The Const method is the most straightforward: it uses a constant learning rate throughout the whole training process. In typ, its only parameter is this learning rate as a float number. Next, the run function takes a learning rate type as input and returns a function that accepts three arguments: the iteration number i, the gradient g, and the parameters used in this method c (an array of AD values). The returned function specifies how the learning rate should change. In the case of the Const method, the rate does not change, so it simply returns the constant learning rate itself. Recall from Chapter 3 that _f wraps a float number into an AD scalar. The default and to_string functions are helpers. The first generates a learning rate method with default parameters, and the second prints the parameter information of a given method.

The Adagrad method is a bit more complex. As the name suggests, it adapts the learning rate: it uses larger update steps for parameters associated with infrequent features and smaller steps otherwise. Its parameter update at the tth iteration follows this rule:

$$ {\theta}_{t+1}={\theta}_t-\mu\ \frac{g_t}{\sqrt{G_t+\epsilon }} $$
(4.2)

Here, Gt is the sum of the squares of the gradients up to time step t. This equation consists of two parts. The first is how the learning rate should be updated; it is specified in the run function. The following code:

fun _ _ c ->
  Maths.(_f a / sqrt (c.(0) + _f 1e-32))

corresponds to \( \frac{\mu }{\sqrt{G+\epsilon }} \). The c array contains parameters that are utilized in updating μ, which in this case is G. The second part is how to update this parameter. It is specified in the update_ch function. In this case, the rule is \( {G}_t=\sum \limits_{i=1}^t\ {g}_i^2 \), or

$$ {G}_t={G}_{t-1}+{g}_t^2. $$

Therefore, the code is

[| Maths.(c.(0) + (g * g)); c.(1) |]

at each iteration. The second element in this array is not used, so it remains the same.

The RMSprop method is an adaptive learning rate method proposed by Geoff Hinton. It is an extension of Adagrad and follows the update rule in Eq. 4.2, except that here

$$ {G}_t=k{G}_{t-1}+\left(1-k\right){g}_t^2 $$
(4.3)

Here, k is a decay factor that is normally set to 0.9. The run function therefore stays the same as that of Adagrad, while the update_ch function for RMSprop becomes

(_f k * c.(0)) + ((_f 1. - _f k) * g * g)

Compared with Adagrad, by using a decaying moving average of previous gradients, RMSprop enables forgetting early gradients and focuses on the most recently observed gradients.

To demonstrate how this seemingly simple framework can accommodate more complex methods, let's consider the implementation of the Adam optimizer [33]. Its parameters are two decay rates β1 and β2. Updating its learning rate requires two values: the estimated mean mt and the uncentered variance vt of the gradients. They are updated according to the following rules:

$$ {\displaystyle \begin{array}{rcl}{m}_t& =& {\beta}_1\ {m}_{t-1}+\left(1-{\beta}_1\right){g}_t\\ {}{v}_t& =& {\beta}_2\ {v}_{t-1}+\left(1-{\beta}_2\right){g}_t^2\end{array}} $$

Accordingly, its update_ch can be implemented as

let update_ch typ g c =
  match typ with
  | Adam (_, b1, b2) ->
    let m = Maths.((_f b1 * c.(0)) + ((_f 1. - _f b1) * g)) in
    let v = Maths.((_f b2 * c.(1)) + ((_f 1. - _f b2) * g * g)) in
    [| m; v |]
  | ...

Note that the meaning of c here differs from that in the Adagrad and RMSprop methods.

The next thing is to specify how to update the learning rate. Adam’s update rule is

$$ {\theta}_t={\theta}_{t-1}-\mu \frac{{\overline{m}}_t}{\sqrt{{\overline{v}}_t}+\epsilon }, $$
(4.4)

where

$$ {\overline{m}}_t=\frac{m_t}{1-{\beta}_1^t},{\overline{v}}_t=\frac{v_t}{1-{\beta}_2^t}. $$

Therefore, the run function of Adam returns a function that utilizes all three parameters:

let run = function
  | Adam (a, b1, b2) ->
    fun i g c ->
      Maths.(
        let m = c.(0) / (_f 1. - (_f b1 ** _f (float_of_int i))) in
        let v = c.(1) / (_f 1. - (_f b2 ** _f (float_of_int i))) in
        (_f a) * m / (sqrt v + _f 1e-8)
        / (g + _f 1e-32))
  | ...

Note the final term / (g + _f 1e-32). You might notice that this term does not appear in Eq. 4.4. The reason we include it here is that our framework follows this update pattern:

$$ {\theta}_t={\theta}_{t-1}-\textrm{run}\left(\mu, \dots \right)g. $$

But Eq. 4.4 does not end with a multiplication by g; that's why we divide it back out in the run function.
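As a sanity check (with ε′ denoting the small constant 1e-32 that is added to g), substituting Adam's run function into this update pattern recovers Eq. 4.4 up to the ε′ term:

$$ \textrm{run}\left(\mu, i,g,c\right)\ g=\mu\ \frac{{\overline{m}}_t}{\sqrt{{\overline{v}}_t}+\epsilon}\cdot \frac{g}{g+{\epsilon}^{\prime }}\approx \mu\ \frac{{\overline{m}}_t}{\sqrt{{\overline{v}}_t}+\epsilon }. $$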

So far, we have introduced multiple aspects of a learning rate method, most notably run and update_ch, but we have not yet explained how they are used in an optimization process. We will show that in Section 4.3. For now, let's move on to another aspect of optimization: the gradient method.

4.2.1.1 Gradient

We outlined the framework of gradient methods in Section 4.1. However, there exist many variants of gradient descent algorithms. They are included in the Gradient module, whose code is shown as follows. Its structure is similar to that of Learning_Rate. The typ contains all the supported gradient methods; these methods do not carry type parameters. The to_string function prints helper information for each method.

module Gradient = struct
  type typ =
    | GD          (* classic gradient descent *)
    | CG          (* Hestenes and Stiefel 1952 *)
    | CD          (* Fletcher 1987 *)
    | NonlinearCG (* Fletcher and Reeves 1964 *)
    | DaiYuanCG   (* Dai and Yuan 1999 *)
    | NewtonCG    (* Newton conjugate gradient *)
    | Newton

  let run = function
    | GD -> fun _ _ _ _ g' -> Maths.neg g'
    | CG ->
      fun _ _ g p g' ->
        let y = Maths.(g' - g) in
        let b = Maths.(sum' (g' * y) / (sum' (p * y) + _f 1e-32)) in
        Maths.(neg g' + (b * p))
    | CD ->
      fun _ _ g p g' ->
        let b = Maths.(l2norm_sqr' g' / sum' (neg p * g)) in
        Maths.(neg g' + (b * p))
    | NonlinearCG ->
      fun _ _ g p g' ->
        let b = Maths.(l2norm_sqr' g' / l2norm_sqr' g) in
        Maths.(neg g' + (b * p))
    | DaiYuanCG ->
      fun _ _ g p g' ->
        let y = Maths.(g' - g) in
        let b = Maths.(l2norm_sqr' g' / sum' (p * y)) in
        Maths.(neg g' + (b * p))
    | NewtonCG ->
      fun f w _ p g' ->
        let hv = hessianv f w p |> Maths.transpose in
        let b = Maths.(hv *@ g' / (hv *@ p)) in
        Maths.(neg g' + (p *@ b))
    | Newton ->
      fun f w _ _ _ ->
        let g', h' = gradhessian f w in
        Maths.(neg (g' *@ inv h'))

  let to_string = function
    | GD          -> "gradient descent"
    | CG          -> "conjugate gradient"
    | CD          -> "conjugate descent"
    | NonlinearCG -> "nonlinear conjugate gradient"
    | DaiYuanCG   -> "dai & yuan conjugate gradient"
    | NewtonCG    -> "newton conjugate gradient"
    | Newton      -> "newton"
end

The key component is the run function. Remember that the descent optimization method is all about the process:

$$ {x}_{n+1}={x}_n+\mu\ {d}_n, $$

which gives the exploration direction at the next step. The run function specifies the form of dn. Take the classic gradient descent method as an example: the direction is just the opposite of the gradient. Therefore, dn = −gn, and thus the run function returns another function:

fun _f _w _g _p g' -> Maths.neg g'

This function takes five parameters as inputs, and the last one is the current gradient, which is the only parameter used in this case.

Conjugate gradient descent: A problem with gradient descent is that it may perform badly on certain types of functions. For example, if a function is steep and narrow, gradient descent may take many very small steps to reach the minimum, bouncing back and forth, even if the function is quadratic. This can be fixed by the conjugate gradient (CG) method, which was first proposed by Hestenes and Stiefel [29].

The CG method is similar to gradient descent, but the new direction at each step does not completely follow the new gradient; instead, it is conjugated to the old gradients and to all previous directions traversed. If both methods start from the same position, gradient descent follows the direction of steepest descent, which can be a blunt choice on a steep and narrow function, whereas the conjugate method also follows the previous momentum a little bit. As a result, the conjugate method takes a direction in between and finds the minimum more efficiently than gradient descent.

Conjugate gradient descent is itself a family of optimization methods. Instead of using the opposite of the gradient −∇f(xn) as the direction, they follow this procedure in the nth iteration:

1. Calculate the steepest direction, that is, the negative gradient −gn = −∇f(xn).

2. Calculate a certain parameter βn; this parameter varies among different conjugate gradient methods.

3. Apply the conjugate direction dn = −gn + βn dn − 1.

4. Update the optimization process xn = xn − 1 + μn dn, where μn is the learning rate of iteration n.

Based on this framework, we can take a look at the five parameters in the returned function:

fun f w g p g' -> ...

Here, g and p are gradient and direction vectors from the previous round; g' is the gradient of the current round. f is the function to be optimized itself, with input data w. The CG process can thus be implemented as

fun _f _w g p g' ->
  let b = ... in
  Maths.(neg g' + (b * p))

In the classic CG method:

$$ {\beta}_n=\frac{g_n^T\left({g}_n-{g}_{n-1}\right)}{-{d}_{n-1}^T\left({g}_n-{g}_{n-1}\right)}. $$
(4.5)

Here, gn, dn, etc. are assumed to be vectors. Note how this parameter and the CG method framework utilize information such as gradient and direction from the previous iteration (gn − 1 and dn − 1). We can implement Eq. 4.5 as

let b =
  let y = Maths.(g' - g) in
  Maths.(sum' (g' * y) / (sum' (p * y) + _f 1e-32))

It uses elementwise multiplication followed by the sum' function to compute the vector inner products, and the extra epsilon value 1e-32 ensures that the denominator is not zero.

In the nonlinear conjugate method (NonlinearCG) [21]

$$ {\beta}_n=\frac{g_n^T\ {g}_n}{g_{n-1}^T\ {g}_{n-1}}. $$

It can thus be implemented as

let b = Maths.(l2norm_sqr' g' / l2norm_sqr' g)

Here, l2norm_sqr' g calculates the square of l2 norm (or Euclidean norm) of all elements in g, which is gTg.

Similarly, in the conjugate gradient method proposed by Dai and Yuan (DaiYuanCG) in [16]

$$ {\beta}_n=\frac{g_n^T{g}_n}{d_{n-1}^T\left({g}_n-{g}_{n-1}\right)}. $$

The corresponding code is

let b =
  let y = Maths.(g' - g) in
  Maths.(l2norm_sqr' g' / sum' (p * y))

Newton’s method is another iterative descent method to find optimal values. It follows the update sequence as shown in the following:

$$ {x}_{n+1}={x}_n-\alpha {\textbf{H}}^{-1}\nabla f\left({x}_n\right) $$
(4.6)

Here, H is the Hessian of f, that is, the matrix of second-order derivatives of f. The code is a direct translation of this equation:

fun f w _ _ _ ->
  let g', h' = gradhessian f w in
  Maths.(neg (g' *@ inv h'))

Here, the Algodiff.gradhessian function returns both the gradient and the Hessian of function f at point w. The *@ operator is an alias for matrix multiplication, and inv computes the inverse of a matrix. Rather than moving only in the direction of the gradient, Newton's method combines the gradient with second-order information, starting from an initial position and iterating until convergence. Newton's method achieves quadratic convergence provided that f is strongly convex with Lipschitz-continuous Hessian and that the initial point x0 is close enough to the optimum x∗. Note that this method is not to be confused with Newton's method for finding the root of a function.
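To illustrate, here is a minimal standalone sketch of Newton's method on a strictly convex quadratic, using only the Algodiff.D functions referenced above (gradhessian, inv, *@); the quadratic f, the starting point, and the fixed ten iterations are illustrative assumptions, not part of the Optimise module.

module N = Dense.Ndarray.D
open Algodiff.D

(* f(x) = 0.5 * x A x^T with a positive definite A; x is a 1x2 row vector *)
let f x =
  let a = pack_arr (N.of_array [| 3.; 1.; 1.; 2. |] [| 2; 2 |]) in
  Maths.(sum' ((x *@ a) * x) * F 0.5)

let _ =
  let x = ref (pack_arr (N.of_array [| 5.; -3. |] [| 1; 2 |])) in
  for _i = 1 to 10 do
    let g, h = gradhessian f !x in
    (* full Newton step: x <- x - g H^(-1) *)
    x := Maths.(!x - (g *@ inv h)) |> primal'
  done;
  (* the minimum of this quadratic is at the origin *)
  unpack_arr !x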

4.2.2 Momentum

The basic gradient descent process can be further enhanced by the momentum mechanism. It allows some "inertia" in choosing the search direction by utilizing information about previous directions, which helps to reduce the noisy bouncing of the search direction during gradient descent. The code of the Momentum module is listed as follows; its key component is the run function.

module Momentum = struct
  type typ =
    | Standard of float
    | Nesterov of float
    | None

  let run = function
    | Standard m -> fun u u' -> Maths.((_f m * u) + u')
    | Nesterov m ->
      fun u u' -> Maths.((_f m * _f m * u) + ((_f m + _f 1.) * u'))
    | None       -> fun _ u' -> u'

  let default = function
    | Standard _ -> Standard 0.9
    | Nesterov _ -> Nesterov 0.9
    | None       -> None

  let to_string = function
    | Standard m -> Printf.sprintf "standard %g" m
    | Nesterov m -> Printf.sprintf "nesterov %g" m
    | None       -> Printf.sprintf "none"
end

Recall that in the basic structure of gradient descent, the change of the value x at the nth iteration is

$$ {d}_n=-\mu \nabla f\left({x}_n\right) $$

With momentum, this process is revised to be

$$ {d}_n=-\mu \nabla f\left({x}_n\right)+m{d}_{n-1}. $$
(4.7)

The float number m is the momentum parameter that indicates the impact of direction information in the previous iteration.

The run function in this module returns a function that takes two inputs: the previous direction u and the current direction u' (calculated using any combination of learning rate and gradient methods). Therefore, the momentum method described earlier can be simply implemented as Maths.((_f m * u) + u'). This is the standard momentum method. If we decide not to use any momentum (None), it simply returns the current direction u'.

This module also supports the Nesterov Accelerated Gradient (Nesterov) method [40]. It makes a simple change to the standard momentum in Eq. 4.7: it first applies the momentum term to the parameter itself and then calculates the gradient ∇f at that shifted point, before adding the momentum term again:

$$ {d}_n=-\mu \nabla f\left({x}_n+{d}_{n-1}m\right)+m{d}_{n-1}. $$

In this module, we have implemented the equivalent reformulation by Bengio et al., which is what the Nesterov branch of run computes.
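As an illustration, the standard momentum rule of Eq. 4.7 can be bolted onto the naive descent loop from Section 4.1. The following self-contained sketch reuses the same illustrative target f(x) = (x − 3)²; the values of alpha, m, and n are arbitrary assumptions.

module N = Dense.Ndarray.D
open Algodiff.D

let f x = Maths.(sum' ((x - F 3.) * (x - F 3.)))

let _ =
  let alpha = 0.1 and m = 0.9 and n = 100 in
  let x = ref (N.of_array [| 0. |] [| 1 |]) in
  (* u accumulates the update direction across iterations *)
  let u = ref (N.zeros [| 1 |]) in
  for _i = 1 to n - 1 do
    let g = grad f (Arr !x) |> unpack_arr in
    (* u <- m * u - alpha * g, then x <- x + u *)
    u := N.(sub (scalar_mul m !u) (scalar_mul alpha g));
    x := N.add !x !u
  done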

4.2.3 Batch

There is one more submodule we need to mention: the Batch module. It is about how the input data are divided into chunks and fed into the training process. From the previous introduction to gradient descent, you might assume that the function accepts a scalar as input. However, in many applications, we need to apply optimization to a vector x, which means that in calculating the gradients we need to consider a group of data points instead of only one.

From the perspective of calculation, there is not much difference, and we can still use all the data in calculating the gradients. However, one big application field of such optimization methods is regression or, more broadly, machine learning, where there can be millions of data points. We will talk about regression in Section 4.4. In practice, running optimization on such large quantities of input data can be infeasible due to hardware limits such as the memory size of the computer. Therefore, optimization for such problems is often repeated for several consecutive rounds, each round called an epoch. In each epoch, the given input data are split into batches according to a batch strategy, as the run function in the following code shows:

module Batch = struct
  type typ =
    | Full
    | Mini       of int
    | Sample     of int
    | Stochastic

  let run typ x y i =
    match typ with
    | Full       -> x, y
    | Mini c     -> Utils.get_chunk x y i c
    | Sample c   -> Utils.draw_samples x y c
    | Stochastic -> Utils.draw_samples x y 1

  let batches typ x =
    match typ with
    | Full       -> 1
    | Mini c     -> Utils.sample_num x / c
    | Sample c   -> Utils.sample_num x / c
    | Stochastic -> Utils.sample_num x

  let to_string = function
    | Full       -> Printf.sprintf "%s" "full"
    | Mini c     -> Printf.sprintf "mini of %i" c
    | Sample c   -> Printf.sprintf "sample of %i" c
    | Stochastic -> Printf.sprintf "%s" "stochastic"
end

The Full strategy uses all the provided data in each iteration. The Mini c and Sample c strategies both take c data points each time; the former chooses the data sequentially, while the latter does so randomly. Finally, the Stochastic strategy selects only one random data point at a time.
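To give a feel for what the Mini strategy does, here is a simplified, hypothetical version of a sequential chunking helper (the real work is delegated to the internal Utils module, whose implementation may differ): it returns the ith chunk of c consecutive rows of x and y, wrapping around the data.

module N = Dense.Ndarray.D

(* hypothetical sketch: the i-th mini-batch of c consecutive samples *)
let get_chunk x y i c =
  let n = (N.shape x).(0) in
  let a = i * c mod n in
  let b = min (a + c - 1) (n - 1) in
  N.get_slice [ [ a; b ] ] x, N.get_slice [ [ a; b ] ] y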

4.2.4 Checkpoint

So far, we have introduced how the learning rate, gradient, and momentum modules return functions that utilize information such as the gradient or direction of the previous iteration. But where is this information stored? The answer lies in the Checkpoint module. It stores all this information during optimization for later use and, if necessary, saves it to files on disk. Its code is shown as follows:

module Checkpoint = struct
  type state = {
    mutable current_batch : int;
    mutable batches_per_epoch : int;
    mutable epochs : float;
    mutable batches : int;
    mutable gs : t array array;
    mutable ps : t array array;
    mutable us : t array array;
    mutable ch : t array array array;
    mutable stop : bool;
  }

  type typ =
    | Batch  of int             (* checkpoint at every specified batch interval *)
    | Epoch  of float           (* checkpoint at every specified epoch interval *)
    | Custom of (state -> unit) (* customised checkpoint called at every batch *)
    | None

  ...
end

The state type includes the fields we have introduced so far. The ch field is used by the learning rate module and contains the parameters to be updated from the previous iteration. The gs field holds the gradient of the previous iteration, and ps the direction of the previous iteration; both are used by the gradient methods. The us field represents the direction update of the previous iteration and is the parameter used by the momentum methods. Besides this information, there is also the stop boolean flag, which, when set to true, stops the optimization. The state also contains bookkeeping information, including the current batch index, the number of batches in each epoch, and the total number of epochs to run.

The typ decides at what point a checkpoint should be executed. Batch checkpoints at every specified batch interval, and Epoch at every specified epoch interval. Besides these two, the user can also provide a customized function that takes a state as input and decides when it is most appropriate to checkpoint for a specific application, as in the sketch below.
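For example, a hypothetical custom checkpoint could use the mutable stop flag shown above to terminate the optimization early once a fixed number of batches has run; the threshold of 1000 batches is an arbitrary assumption.

(* hypothetical: set the stop flag after 1000 batches *)
let early_stop =
  Owl_optimise.D.Checkpoint.(
    Custom (fun state ->
      if state.current_batch > 1000 then state.stop <- true))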

...
  let init_state batches_per_epoch epochs =
    let batches =
      float_of_int batches_per_epoch *. epochs |> int_of_float
    in
    { current_batch = 1
    ; batches_per_epoch
    ; epochs
    ; batches
    ; stop = false
    ; gs = [| [| _f 0. |] |]
    ; ps = [| [| _f 0. |] |]
    ; us = [| [| _f 0. |] |]
    ; ch = [| [| [| _f 0.; _f 0. |] |] |]
    }

  let default_checkpoint_fun save_fun =
    let file_name =
      Printf.sprintf "%s/%s.%i" (Sys.getcwd ()) "model"
        (Unix.time () |> int_of_float)
    in
    Owl_log.info "checkpoint => %s" file_name;
    save_fun file_name

  let to_string = function
    | Batch i  -> Printf.sprintf "per %i batches" i
    | Epoch i  -> Printf.sprintf "per %g epochs" i
    | Custom _ -> Printf.sprintf "customised f"
    | None     -> Printf.sprintf "none"
  ...

The init_state function returns initial values for the different fields of a state; users need to specify the number of epochs and the number of input data batches in one epoch. The default_checkpoint_fun executes a function to save certain content to a file; this save function is provided by the user. Similar to the previous modules, the to_string method provides a convenient print function to show the configuration of this module. Finally, the run function decides a suitable checkpoint interval and executes the checkpoint function, either the default one or the customized one provided by the user:

  let run typ save_fun current_batch current_loss state =
    state.loss.(current_batch) <- primal' current_loss;
    state.stop <- state.current_batch >= state.batches;
    let interval =
      match typ with
      | Batch i  -> i
      | Epoch i  -> i *. float_of_int state.batches_per_epoch |> int_of_float
      | Custom _ -> 1
      | None     -> max_int
    in
    if state.current_batch mod interval = 0
       && state.current_batch < state.batches
    then (
      match typ with
      | Custom f -> f state
      | _        -> default_checkpoint_fun save_fun)
end

4.2.4.1 Params

The Params submodule is what brings all the other submodules together. It provides an entry point for users to access various aspects of optimization. The code is shown as follows:

module Params = struct
  type typ =
    { mutable epochs : float
    ; mutable batch : Batch.typ
    ; mutable gradient : Gradient.typ
    ; mutable learning_rate : Learning_Rate.typ
    ; mutable momentum : Momentum.typ
    ; mutable checkpoint : Checkpoint.typ
    ; mutable verbosity : bool
    }
  ...

The Params type collects the types of the other submodules, such as Gradient.typ. It also includes other fields such as the number of epochs and a verbosity flag that indicates whether the full parameter information should be printed out during optimization.

  let default () =
    { epochs = 1.
    ; batch = Batch.Sample 100
    ; gradient = Gradient.GD
    ; learning_rate = Learning_Rate.(default (Const 0.))
    ; momentum = Momentum.None
    ; checkpoint = Checkpoint.None
    ; verbosity = true
    }

  let config ?batch ?gradient ?learning_rate
      ?momentum ?checkpoint ?verbosity epochs =
    let p = default () in
    (match batch with
    | Some x -> p.batch <- x
    | None   -> ());
    (match gradient with
    | Some x -> p.gradient <- x
    | None   -> ());
    (match learning_rate with
    | Some x -> p.learning_rate <- x
    | None   -> ());
    (match momentum with
    | Some x -> p.momentum <- x
    | None   -> ());
    (match checkpoint with
    | Some x -> p.checkpoint <- x
    | None   -> ());
    (match verbosity with
    | Some x -> p.verbosity <- x
    | None   -> ());
    p.epochs <- epochs;
    p

  let to_string p =
    Printf.sprintf "--- Training config\n"
    ^ Printf.sprintf "epochs: %g\n" p.epochs
    ^ Printf.sprintf "batch: %s\n" (Batch.to_string p.batch)
    ^ Printf.sprintf "method: %s\n" (Gradient.to_string p.gradient)
    ^ Printf.sprintf "learning rate: %s\n"
        (Learning_Rate.to_string p.learning_rate)
    ^ Printf.sprintf "momentum: %s\n" (Momentum.to_string p.momentum)
    ^ Printf.sprintf "checkpoint: %s\n" (Checkpoint.to_string p.checkpoint)
    ^ Printf.sprintf "verbosity: %s\n"
        (if p.verbosity then "true" else "false")
    ^ "---"
end

The other three functions are straightforward: default assigns default values to each parameter, config sets parameter values from the given optional arguments, and to_string prints the existing values.
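For instance, a hedged usage sketch of config, using only the options shown in the signature above (assuming the module is instantiated as Owl_optimise.D, as in the examples later in this chapter): 50 epochs of mini-batch gradient descent with an RMSprop learning rate, Nesterov momentum, and a per-epoch checkpoint.

let params =
  let open Owl_optimise.D in
  Params.config
    ~batch:(Batch.Mini 128)
    ~gradient:Gradient.GD
    ~learning_rate:(Learning_Rate.RMSprop (0.001, 0.9))
    ~momentum:(Momentum.Nesterov 0.9)
    ~checkpoint:(Checkpoint.Epoch 1.)
    50.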

4.3 Gradient Descent Implementation

Putting all the aforementioned submodules together, we can now turn to the implementation of a robust gradient descent optimization method. In the Optimise module, this is the minimise_fun function, whose code we explain as follows:

  let minimise_fun params f x =
    let open Params in
    let grad_fun = Gradient.run params.gradient in
    let rate_fun = Learning_Rate.run params.learning_rate in
    let momt_fun = Momentum.run params.momentum in
    let upch_fun = Learning_Rate.update_ch params.learning_rate in
    let chkp_fun = Checkpoint.run params.checkpoint in
    let optz_fun = f in
    ...

The function accepts three inputs: the optimization parameters params, the function f, and the initial input x. It starts by retrieving the run functions of the various submodules.

    let iterate xi =
      let loss, g = grad' optz_fun xi in
      loss |> primal', g, optz_fun
    in
    ...

iterate defines operations in the ith iteration. It utilizes the Algodiff module to compute the primal value loss of evaluating optz_fun at point xi and the corresponding gradient g at that point.

    let state = Checkpoint.init_state 1 params.epochs in
    let x = ref x in
    while Checkpoint.(state.stop = false) do
      ...
    done;
    state, !x

The preceding code shows the outline of the optimization procedure. First, it initializes a new state of the optimization process. Here, we set it to one batch per epoch. Next, the code keeps updating the state during the body of the while loop until the stop status is set to true. The optimization result x and state are finally returned. The state contains various historical information as we have explained. Each iteration of the while loop contains the following steps:

      let loss', g', optz' = iterate !x in
      chkp_fun (fun _ -> ())
        Checkpoint.(state.current_batch) loss' state;

First, we execute iterate to get the loss and gradients. We then checkpoint the current progress; here, we provide an empty save function, which means the current state is not saved to a file.

      let p' = Checkpoint.(grad_fun optz' !x
        state.gs.(0).(0) state.ps.(0).(0) g') in
      Checkpoint.(state.ch.(0).(0) <- upch_fun g' state.ch.(0).(0));

Next, we calculate the gradient descent direction p' using grad_fun, based on gradient g'. Also, the learning rate parameter ch should be updated.

      let u' =
        Checkpoint.(Maths.(p' * rate_fun
          state.current_batch g' state.ch.(0).(0)))
      in
      let u' = momt_fun Checkpoint.(state.us.(0).(0)) u' in

Then, the optimization direction is adjusted, first based on the learning rate and then on momentum.

      x := Maths.(!x + u') |> primal';

The optimal value is then updated according to the direction u'.

      (if params.momentum <> Momentum.None then
        Checkpoint.(state.us.(0).(0) <- u'));
      Checkpoint.(state.gs.(0).(0) <- g');
      Checkpoint.(state.ps.(0).(0) <- p');
      Checkpoint.(state.current_batch <- state.current_batch + 1);

Finally, the values calculated in this iteration, such as the gradient and direction, are saved in the state for future use. That's all for one iteration. Let's look at an example of optimization using gradient descent. Here, we use Himmelblau's function, which is often used as a performance test for optimization algorithms. The function takes two inputs and is defined in Eq. 4.8.

$$ f\left(x,y\right)={\left({x}^2+y-11\right)}^2+{\left(x+{y}^2-7\right)}^2. $$
(4.8)

Its definition can be expressed using Owl code as

open Algodiff.D
module N = Dense.Ndarray.D

let himmelblau a =
  let x = Mat.get a 0 0 in
  let y = Mat.get a 0 1 in
  Maths.(
    ((x ** F 2.) + y - F 11.) ** F 2.
    + ((x + (y ** F 2.) - F 7.) ** F 2.)
    |> sum')

First, let’s look at what the code would look like without using the Optimise module. Let’s apply the gradient descent method according to its definition in Section 4.1.

let v = N.of_array [| -2.; 0. |] [| 1; 2 |]
let traj = ref (N.copy v)
let a = ref v
let eta = 0.0001
let n = 2000;;

Here, we use an initial starting point [-2., 0.]. The step size eta is set to 0.0001, and the iteration number is 2000. Then we can perform the iterative descent process.

let _ =
  for i = 1 to n - 1 do
    let u = grad himmelblau (Arr !a) |> unpack_arr in
    a := N.(sub !a (scalar_mul eta u));
    traj := N.concatenate [| !traj; N.copy !a |]
  done;;

We apply the grad function in the Algodiff module to the Himmelblau function iteratively, and the updated point a is appended to the traj array. Utilizing the Plot module in Owl, we can visualize this function and the optimization trajectory using the following code:

let plot () =
  let a, b = Dense.Matrix.D.meshgrid (-4.) 4. (-4.) 4. 50 50 in
  let c = N.(add
    (pow_scalar (sub_scalar (add (pow_scalar a 2.) b) 11.) 2.)
    (pow_scalar (sub_scalar (add a (pow_scalar b 2.)) 7.) 2.))
  in
  let h = Plot.create ~m:1 ~n:2 "plot_himm.pdf" in
  Plot.subplot h 0 0;
  Plot.(mesh ~h ~spec:[ NoMagColor ] a b c);
  Plot.subplot h 0 1;
  Plot.contour ~h a b c;
  let vx = N.get_slice [ []; [ 0 ] ] !traj in
  let vy = N.get_slice [ []; [ 1 ] ] !traj in
  Plot.plot ~h vx vy;
  Plot.output h

And the generated figures are shown in Figure 4-1.

Figure 4-1 Optimization process of gradient descent on the multimodal Himmelblau function: (a) 3D mesh of the function surface; (b) contour plot with the optimization trajectory

To solve the same problem, we can also use the minimise_fun function introduced in the previous section. First, we set up the parameters:

let p = Owl_optimise.D.Params.default ()
let _ = p.epochs <- 10.
let _ = p.gradient <- Owl_optimise.D.Gradient.GD;;

It suffices to set the iteration limit epochs to something small, such as 10 or 20. We then set the gradient method to the classic gradient descent and execute the code, starting from the same initial values:

let init_value = N.of_array [| -2.; 0. |] [| 1; 2 |] |> pack_arr
let _ = Owl_optimise.D.minimise_fun p himmelblau init_value;;

This function outputs execution logs to track the intermediate results, looking in part like the following. It shows how the function value, starting from about 2026 in the first iteration, is quickly reduced to about 2.6 within only 10 steps of gradient descent, demonstrating the efficiency of the method in finding optima.

...
10:46:49.805 INFO : T: 00s | E: 1.0/10 | B: 1/10 | L: 2026.000
10:46:49.806 INFO : T: 00s | E: 2.0/10 | B: 2/10 | L: 476.1010
10:46:49.807 INFO : T: 00s | E: 3.0/10 | B: 3/10 | L: 63.83614
10:46:49.807 INFO : T: 00s | E: 4.0/10 | B: 4/10 | L: 37.77679
10:46:49.808 INFO : T: 00s | E: 5.0/10 | B: 5/10 | L: 21.39686
10:46:49.809 INFO : T: 00s | E: 6.0/10 | B: 6/10 | L: 11.74234
10:46:49.809 INFO : T: 00s | E: 7.0/10 | B: 7/10 | L: 6.567733
10:46:49.809 INFO : T: 00s | E: 8.0/10 | B: 8/10 | L: 4.085909
10:46:49.810 INFO : T: 00s | E: 9.0/10 | B: 9/10 | L: 3.016714
10:46:49.810 INFO : T: 00s | E: 10.0/10 | B: 10/10 | L: 2.5943
...

4.4 Regression

In this section, we introduce a broad area that heavily relies on optimization: regression. Regression is an important topic in statistical modeling and machine learning. It’s about modeling problems which include one or more variables (also called “features” or “predictors”) and require us to make predictions of another variable (“output variable”) based on previous values of the predictors. Regression analysis includes a wide range of models, from linear regression to isotonic regression, each with different theoretical backgrounds and applications. In this section, we use the most widely used linear regression as an example to demonstrate how optimization plays a key part in solving regression problems.

4.4.1 Linear Regression

Linear regression models the relationship between input features and the output variable with a linear model. It is the most widely used regression model. Without loss of generality, let’s look at an example with a single variable in the model. Such a linear regression problem can be informally stated as follows. Suppose we have a series of (x, y) data points:

----------------------------
|x| 5.16 | 7.51 | 6.53 | ...
----------------------------
|y| 0.36 | 5.84 | 16.9 | ...
----------------------------

Given that the relationship between these two quantities is y ≈ hθ(x), where hθ(x) = θ0 + θ1 x, can we find the θ0 and θ1 values that fit the observed data points as closely as possible? This problem can be formalized as follows. Denote the lists of x's and y's as two vectors x and y. Suppose we have a function C that measures the distance between the model predictions and y: Cθ(x, y). The target is to find suitable parameters θ that minimize this distance. That's where optimization comes in to help.

4.4.2 Loss

So the next question is: How do we represent this distance mathematically? One good choice is the Euclidean distance. That means the target is to minimize the function:

$$ {C}_{\boldsymbol{\uptheta}}\left(\textbf{x},\textbf{y}\right)=\frac{1}{2n}\sum \limits_{i=1}^n{\left({h}_{\boldsymbol{\uptheta}}\left({x}^{(i)}\right)-{y}^{(i)}\right)}^2 $$
(4.9)

Here, x(i) indicates the ith element in the vector x. The factor \( \frac{1}{2n} \) is used to normalize the distance. Other forms of distance can also be applied here. Due to its importance, this distance is called the loss and abstracted as the Loss submodule in the optimization module. Its code is shown as follows:

  module Loss = struct
    type typ =
      | Hinge
      | L1norm
      | L2norm
      | Quadratic
      | Cross_entropy
      | Custom of (t -> t -> t)

    let run typ y y' =
      match typ with
      | Hinge         -> Maths.(sum' (max2 (_f 0.) (_f 1. - (y * y'))))
      | L1norm        -> Maths.(l1norm' (y - y'))
      | L2norm        -> Maths.(l2norm' (y - y'))
      | Quadratic     -> Maths.(l2norm_sqr' (y - y'))
      | Cross_entropy -> Maths.(cross_entropy y y')
      | Custom f      -> f y y'
    ...
  end

It contains several methods to calculate the distance, or loss, between two values y and y'. What we have described is the Quadratic method. It also supports the l1 or l2 norm: \( \sum \limits_i\left|{x}^{(i)}-{y}^{(i)}\right| \) and \( \sqrt{\sum \limits_i{\left({x}^{(i)}-{y}^{(i)}\right)}^2} \). The cross-entropy measures the performance of a classification model, the output of which is a probability value between 0 and 1. It is calculated as \( -\sum \limits_i\ {x}^{(i)}\mathit{\log}\left({y}^{(i)}\right) \). The cross-entropy loss is most commonly used in training neural networks, as we will show in Chapter 5.
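As a quick, hedged usage sketch (assuming the module is instantiated as Owl_optimise.D, as in the earlier examples), the quadratic loss between a target y and a prediction y' can be evaluated directly; the data values below are arbitrary.

open Algodiff.D
module N = Dense.Ndarray.D

let y  = pack_arr (N.of_array [| 1.; 2.; 3. |] [| 3; 1 |])
let y' = pack_arr (N.of_array [| 1.5; 1.5; 3. |] [| 3; 1 |])

(* l2norm_sqr' (y - y') = 0.25 + 0.25 + 0. = 0.5 *)
let l = Owl_optimise.D.Loss.(run Quadratic y y')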

4.4.3 Implementation of Linear Regression

In the source code of Owl, the owl_regression_generic.ml file lists several regression methods, and they are all based on a core linear regression implementation. This function optimizes the parameters w and b of a general regression problem: minimize the loss l(xᵀw + b, y) over w and b. Here, each data point x can be a vector, since there can be more features than shown in Eq. 4.9. The code is listed as follows:

module Make (Optimise : Owl_optimise_generic_sig.Sig) = struct
  module Optimise = Optimise
  open Optimise
  open Optimise.Algodiff

  let _linear_reg bias params x y =
    let s = A.shape x in
    let l, m = s.(0), s.(1) in
    let n = A.col_num y in
    let o = if bias = true then m + 1 else m in
    let x =
      if bias = true
      then A.concatenate ~axis:1 [| x; A.ones [| l; 1 |] |]
      else x
    in
    let r = 1. /. float_of_int o in
    let p = Arr A.(uniform
      ~a:(float_to_elt (-.r)) ~b:(float_to_elt r) [| o; n |]) in
    ...
end

The regression module is a functor parameterized by the Optimise module. The _linear_reg function takes the x and y values and the optimization parameters as input. The argument bias is a boolean flag that indicates whether the b parameter should be trained; this bias is the θ0 parameter we have seen, which is not multiplied by x. If it is included in the optimization, the shape of the parameters changes accordingly, as shown in the code. Here, p is a randomly generated initial parameter matrix.

    let f w x =
      let w = Mat.reshape o n w in
      Maths.(x *@ w)
    in

f is the function to minimize. It represents xw + b using a single matrix multiplication.

    let w =
      minimise_weight params f (Maths.flatten p) (Arr x) (Arr y)
      |> snd
      |> Mat.reshape o n
      |> unpack_arr
    in
    match bias with
    | true  -> A.split ~axis:0 [| m; 1 |] w
    | false -> [| w |]
  ...
end

The core step of this regression function is to apply optimization on the function f using the given parameters, with proper shape manipulation. If the bias is included in the optimization target, the returned result is split into two parts, first being w and second being b.

Note that we introduced minimise_fun for optimization, but here the code uses minimise_weight. These two functions are very similar in implementation, with one key difference. minimise_fun f x keeps calculating gradients with respect to the input x and changes x accordingly until it reaches a point that minimizes f(x). minimise_weight f w x, in contrast, keeps calculating gradients with respect to the function's own parameters w and changes them until they reach a point that minimizes fw(x); the input data x stays the same throughout the optimization.

Based on this function, the linear regression can be implemented by choosing suitable optimization parameters:

  let ols ?(i = false) x y =
    let params =
      Params.config
        ~batch:Batch.Full
        ~learning_rate:(Learning_Rate.Adagrad 1.)
        ~gradient:Gradient.GD
        ~loss:Loss.Quadratic
        ~verbosity:false
        ~stopping:(Stopping.Const 1e-16)
        100.
    in
    _linear_reg i params x y

In linear regression, we utilize all the input data in each iteration or epoch (Full batch mode). We use the Adagrad learning rate, classic gradient descent, and the Euclidean distance as the loss function. The optimization runs for at most 100 epochs, or until the loss falls below 1e-16. Stopping is a helper module in Optimise that accepts such a threshold, so that the optimization process can exit early.
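As a hedged usage sketch (assuming Owl is opened so that this functor is available as the instantiated Regression.D module), ols can recover the coefficients of a simple synthetic dataset; the data generation below is purely illustrative.

module M = Dense.Matrix.D

let _ =
  (* synthetic data: y = 2x + 1 plus small Gaussian noise *)
  let x = M.uniform 100 1 in
  let noise = M.gaussian ~sigma:0.05 100 1 in
  let y = M.((x *$ 2.) +$ 1. + noise) in
  (* with ~i:true the result array is [| w; b |]; w ~ 2, b ~ 1 *)
  Regression.D.ols ~i:true x y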

4.4.4 Other Types of Regression

Even though linear regression is powerful and widely used, the linear model cannot fit all problems; a lot of data follow patterns other than a linear one. For example, sometimes the relationship between the feature x and the output variable can be modeled as an nth-degree polynomial in x:

$$ {h}_{\boldsymbol{\uptheta}}(x)={\theta}_0+{\theta}_1x+{\theta}_2{x}^2+{\theta}_3{x}^3\dots $$
(4.10)

This is called polynomial regression. Owl provides a function poly in the Regression module to obtain the model parameters. The first two arguments are still x and y, and the third limits the order of the polynomial model. Its implementation can also be concisely expressed with _linear_reg:

  let poly x y n =
    let z =
      Array.init (n + 1) (fun i ->
        A.(pow_scalar x (float_of_int i |> float_to_elt)))
    in
    let x = A.concatenate ~axis:1 z in
    let params =
      Params.config
        ~batch:Batch.Full
        ~learning_rate:(Learning_Rate.Const 1.)
        ~gradient:Gradient.Newton
        ~loss:Loss.Quadratic
        ~verbosity:false
        ~stopping:(Stopping.Const 1e-16)
        100.
    in
    (_linear_reg false params x y).(0)

The key is to first process the data so that each data point x is projected to a series of new features z, with zi = xi, that is, the ith new feature is x raised to the ith power. Eq. 4.10 then becomes a multivariable linear regression:

$$ {h}_{\boldsymbol{\uptheta}}(z)={\theta}_0+{\theta}_1{z}_1+{\theta}_2{z}_2+{\theta}_3{z}_3\dots $$
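As a hedged usage sketch (again assuming the instantiated Regression.D module), poly can recover the coefficients of a noiseless quadratic; the synthetic data below are illustrative.

module M = Dense.Matrix.D

let _ =
  let x = M.uniform ~a:(-3.) ~b:3. 100 1 in
  (* y = 1 - 2x + x^2 *)
  let y = M.(add_scalar (sub (pow_scalar x 2.) (mul_scalar x 2.)) 1.) in
  (* returns the parameters ordered from theta_0 to theta_2 *)
  Regression.D.poly x y 2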

Another important type of regression is logistic regression, where the data y contain integers that indicate different classes, instead of real numbers. It is therefore most suitable for classification tasks, such as predicting "age group," "nationality," etc. Logistic regression replaces the target optimization function with

$$ {C}_{\boldsymbol{\uptheta}}\left(\textbf{x},\textbf{y}\right)=\frac{1}{m}\sum \limits_{i=1}^mg\left({h}_{\boldsymbol{\uptheta}}\left({x}^{(i)}\right),{y}^{(i)}\right), $$
(4.11)

where m is the total number of data points in input data x and y; the function g is defined as

$$ g\left({h}_{\boldsymbol{\uptheta}}(x),y\right)=\left\{\begin{array}{ll}-\mathit{\log}\left({h}_{\boldsymbol{\uptheta}}(x)\right),& \textrm{if}\ y=1\\ {}-\mathit{\log}\left(1-{h}_{\boldsymbol{\uptheta}}(x)\right),& \textrm{if}\ y=0\end{array}\right. $$
(4.12)

Logistic regression can be implemented by using the cross-entropy loss function:

  let logistic ?(i = false) x y =
    let params =
      Params.config
        ~batch:Batch.Full
        ~learning_rate:(Learning_Rate.Adagrad 1.)
        ~gradient:Gradient.GD
        ~loss:Loss.Cross_entropy
        ~verbosity:false
        ~stopping:(Stopping.Const 1e-16)
        1000.
    in
    _linear_reg i params x y

4.4.5 Regularization

There is one thing we need to understand: regression is more than just optimization after all. Its purpose is to create a model that fits the given data, and all too often this model will be used to predict the output for future input. Therefore, if a model fits the given data too well, it may lose generality for future data. That's where the idea of regularization comes in. This technique prevents a model from being tuned too closely to a particular dataset, which would make it fail to predict future observations well.

Think about polynomial regression. The regularization technique favors simple, low-order models. It modifies the optimization target function to penalize large parameter values, so that they lead to a higher cost. Therefore, by minimizing the target function, we keep the unwanted parameters relatively small. This is implemented by adding an extra term to the original target function, as shown below.
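For example, with the L2 penalty described in the next paragraphs, the modified target takes the form

$$ {C}^{\prime}_{\boldsymbol{\uptheta}}\left(\textbf{x},\textbf{y}\right)={C}_{\boldsymbol{\uptheta}}\left(\textbf{x},\textbf{y}\right)+\lambda \sum \limits_j{\theta}_j^2, $$

where λ controls the strength of the penalty.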

Owl supports multiple types of such regularization terms in the Regularisation submodule, which also belongs to the Optimise module. Its core function run is shown as follows:

let run typ x =
  match typ with
  | L1norm a -> Maths.(_f a * l1norm' x)
  | L2norm a -> Maths.(_f a * l2norm' x)
  | Elastic_net (a, b) ->
    Maths.((_f a * l1norm' x) + (_f b * l2norm' x))
  | None -> _f 0.

The L2norm regularization adds the L2 norm of θ as the penalty term: λ ∑ θ2. The L1norm cost function is similar, adding the L1 norm, that is, the absolute values of the parameters, as the penalty: λ ∑ |θ|. This difference means that L1norm permits coefficients to be exactly zero, which is very useful for feature selection. Regressions using these two regularization techniques are known as Ridge and Lasso regression, respectively. The Elastic_net method combines the penalties of the previous two:

$$ \lambda \left(\frac{1-a}{2}\sum {\theta}^2+a\sum \left|\theta \right|\right), $$

where a is a hyperparameter balancing between the former two. This method aims to make feature selection less dependent on the input data.

We can create a new polynomial regression with regularization by simply changing the optimization parameters to the following values:

Params.config
  ~batch:Batch.Full
  ~learning_rate:(Learning_Rate.Const 1.)
  ~gradient:Gradient.Newton
  ~loss:Loss.Quadratic
  ~regularisation:(Regularisation.L2norm 0.5)
  ~verbosity:false
  ~stopping:(Stopping.Const 1e-16)
  100.

4.5 Summary

In this chapter, we introduced optimization and its implementation in Owl. Focusing on gradient descent, one of the most widely used optimization methods, we introduced various aspects, such as the gradient method, learning rate, momentum, etc. Together they provide a powerful and robust implementation. As an important example, we further introduced regression, a machine learning technique that heavily relies on optimization. We showed how various regression methods can be built efficiently using the optimization module.