Differentiation is key to numerous scientific applications including maximizing or minimizing functions, solving systems of ODEs, physical simulation, etc. Of existing methods, algorithmic differentiation, or AD, is a computer-friendly technique for performing differentiation that is both efficient and accurate. AD is a central component of the architecture design of Owl. In this chapter, we will show, with hands-on examples, how the AD engine is designed and implemented in Owl. AD will be used in some of the other chapters to show its application in optimization and machine learning.

3.1 Introduction

Assume an object moves a distance of Δs in a time Δt. The average velocity of this object during this period is the ratio of Δs to Δt. As Δt gets smaller and smaller, this ratio approaches the instantaneous velocity:

$$ v=\underset{\varDelta t\to 0}{\lim}\frac{\varDelta s}{\varDelta t}=\frac{ds}{dt} $$

The term \( \frac{ds}{dt} \) is referred to as “the derivative of s with respect to t.”

Differentiation is the process of finding a derivative in mathematics. It studies the functional relationship between variables, that is, how much one variable changes when the value of another variable changes. Differentiation has many important applications, for example, finding minimum and maximum values of a function, finding the rate of change of a quantity, computing linear approximations to functions, and solving systems of differential equations. Because it is central to these mathematical areas, differentiation is used across many disciplines; one example is calculating marginal cost and revenue in economics.

In computer science, differentiation also plays a key role. Machine learning techniques, such as deep neural networks, have been gaining more and more momentum. The most important step in training a deep neural network model is called “backpropagation,” which in essence is calculating the gradient of a very complicated function.

3.1.1 Three Ways of Differentiating

If we are to support differentiation at a scale as large as a deep neural network, manual calculation based on the chain rule in Calculus 101 is far from being enough. To do that, we need the power of automation, which is what a computer is good at. Currently, there are three ways to calculate differentiation: numerical differentiation, symbolic differentiation, and algorithmic differentiation.

The first method is numerical differentiation. Derived directly from the definition, numerical differentiation uses a small step δ to compute an approximate value toward the limit, as shown in Eq. 3.2.

$$ {f}^{\prime }(x)=\underset{\delta \to 0}{\lim}\frac{f\left(x+\delta \right)-f(x)}{\delta }. $$

By treating the function f as a black box, this method is straightforward to implement as long as f can be evaluated. However, this method is unfortunately subject to multiple types of errors, such as the round-off error, which is caused by representing numbers with only finite precision during the numerical computation. The round-off error can be so large that the computer treats f (x + δ) and f (x) as identical.
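The round-off problem is easy to reproduce. The following is a minimal sketch in plain OCaml; finite_diff is a hypothetical helper written for this illustration and is not part of Owl. It applies the difference quotient from the definition to sin at x = 1, once with a moderate step and once with a step smaller than double precision can resolve.

```ocaml
(* Forward finite difference; a hypothetical helper, not an Owl function. *)
let finite_diff ?(delta = 1e-6) f x = (f (x +. delta) -. f x) /. delta

let () =
  let exact = cos 1.0 in
  (* A moderate step gives a close approximation of d/dx sin x at x = 1. *)
  Printf.printf "delta = 1e-6 : error = %g\n"
    (abs_float (finite_diff sin 1.0 -. exact));
  (* With a step below machine precision, x + delta rounds back to x,
     so the numerator is exactly zero and the derivative is lost. *)
  Printf.printf "delta = 1e-17: value = %g\n"
    (finite_diff ~delta:1e-17 sin 1.0)
```

Choosing the step is therefore a trade-off: a large δ increases the truncation error of the approximation, while a very small δ lets round-off dominate.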

The second is symbolic differentiation. By manipulating the underlying mathematical expressions, symbolic differentiation obtains analytical results without numerical approximation, using mathematical derivative rules. For example, consider the function f (x0, x1, x2) = x0 ∗ x1 ∗ x2. Computing ∇f symbolically, we end up with

$$ \nabla f=\left(\frac{\partial f}{\partial {x}_0},\frac{\partial f}{\partial {x}_1},\frac{\partial f}{\partial {x}_2}\right)=\left({x}_1\ast {x}_2,{x}_0\ast {x}_2,{x}_0\ast {x}_1\right) $$

This process completely eliminates the impact of numerical errors, but the complexity of symbolic manipulation quickly grows as expressions become more complex. Just imagine computing the derivative of a simple calculation \( f(x)=\prod \limits_{i=0}^{n-1}{x}_i \): the result would be terribly long, if not that complex. As a result, symbolic differentiation can easily consume a huge amount of computing resources and become impractically slow in the end. Besides, unlike in numerical differentiation, we must know how a function is constructed to use symbolic differentiation.
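The growth in expression size can be observed with a toy symbolic differentiator. This is only a sketch under simplifying assumptions: the expr type and the functions diff, size, and product are hypothetical names invented for this illustration, and no algebraic simplification is performed.

```ocaml
(* A tiny expression language; all names here are illustrative only. *)
type expr =
  | Const of float
  | Var of int
  | Add of expr * expr
  | Mul of expr * expr

(* Differentiate with respect to variable i using the sum and product rules. *)
let rec diff i = function
  | Const _ -> Const 0.
  | Var j -> Const (if i = j then 1. else 0.)
  | Add (a, b) -> Add (diff i a, diff i b)
  | Mul (a, b) -> Add (Mul (diff i a, b), Mul (a, diff i b))

(* Number of nodes in an expression tree. *)
let rec size = function
  | Const _ | Var _ -> 1
  | Add (a, b) | Mul (a, b) -> 1 + size a + size b

(* Build x0 * x1 * ... * x(n-1). *)
let product n =
  let rec go acc i = if i >= n then acc else go (Mul (acc, Var i)) (i + 1) in
  go (Var 0) 1

let () =
  List.iter
    (fun n ->
      let f = product n in
      Printf.printf "n = %d: size f = %d, size df/dx0 = %d\n"
        n (size f) (size (diff 0 f)))
    [ 3; 10; 100 ]
```

In this sketch the input grows linearly with n while each unsimplified partial derivative grows much faster, and a full gradient needs n such derivatives.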

Finally, there is algorithmic differentiation (AD). It is a chain rule–based technique for calculating derivatives with respect to input variables of functions defined in a computer program. Algorithmic differentiation is also known as automatic differentiation, though strictly speaking it does not fully automate differentiation and can sometimes lead to inefficient code. In general, AD combines the best of both worlds: on one hand, it efficiently generates exact results and so is highly applicable in many real-world applications; on the other hand, it does not rely on listing all the intermediate results, and its computing process can be efficient. Therefore, it is the mainstream implementation of many numerical computing tools and libraries, such as JuliaDiff in Julia, ad in Python, ADMAT, etc. The rest of this chapter focuses mainly on algorithmic differentiation.

3.1.2 Architecture of Algodiff Module

The Algodiff module plays a pivotal role in the whole Owl library stack. It unifies various fundamental data types in Owl and is the most important functionality that serves the advanced analytics such as regression, neural networks, etc. Its design is elegant thanks to OCaml’s powerful module system.

In this chapter, we assume you are familiar with how differentiation works mathematically, so we can focus on the design and implementation details of the AD module in Owl. But first, let’s take a look at a simple example to see how the AD module is used in Owl. In this example, we simply calculate the first-order and second-order derivatives of the function tanh.

module AD = Algodiff.D
let f x = AD.Maths.(tanh x);;
let f1 = AD.diff f;;
let f2 = AD.diff f1;;
let eval_flt h x = AD.(pack_flt x |> h |> unpack_flt);;
let r1 = eval_flt f1 1.
let r2 = eval_flt f2 1.

That’s all it takes. We define the function and apply diff on it to acquire its first-order derivative, on which the diff function can be directly applied to get the second-order derivative. We then evaluate and get the function value at point x = 1 on these two derivative functions.
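As a sanity check on r1 and r2, the derivatives of tanh are known in closed form. These are standard calculus identities, not output taken from Owl:

$$ \frac{d}{dx}\tanh x=1-{\tanh}^2x,\quad \frac{d^2}{dx^2}\tanh x=-2\tanh x\left(1-{\tanh}^2x\right) $$

At x = 1, where tanh 1 ≈ 0.7616, these evaluate to approximately 0.4200 and −0.6397, so r1 and r2 should be close to these two values.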

Figure 3-1 shows the various components in the AD module. Let us inspect how they fit into the example code. First, we cannot directly use the basic data types, such as ndarray and float number. Instead, they need to be first “packed” into a type that the AD module understands. In this example, pack_flt is used to wrap a normal float number into an AD type float. After calculation finishes, assuming we still get an AD type float as output, it should be unpacked into a normal float number using the function unpack_flt. The type system is the most fundamental building block in AD.

Second, to construct a computation in the AD system, we need to use operators, such as tanh used in this example. AD provides a rich set of operators that are generated from the op_builder module. After constructing a graph by stacking the operators, the AD engine starts to let the input data “flow,” or “propagate,” twice in this graph, once forward and once backward. The key function that is in charge of this process is the reverse function.

Based on the aforementioned process, we can calculate derivatives of various sorts. To simplify coding, a series of high-level APIs are constructed. The diff function used in this example is one such API. It applies differentiation on a function that accepts a float number as input and outputs a float number. These high-level APIs lead to extremely elegant code. As shown in this example, we can simply apply the differentiation function on the original tanh function iteratively to get its first-order, second-order, and any other higher-order derivatives. In the next several sections, we will explain these building blocks in detail and how these different pieces are assembled into a powerful AD module.

Figure 3-1. Architecture of the AD module. The top layer is the high-level API; the middle layer contains the operators, op_builder, and reverse; the bottom layer is the types.

3.2 Types

We start with type definition. The data type in AD is defined in the owl_algodiff_types.ml file, as shown in the following. Even if you are familiar with the type system in OCaml, it may still seem a bit confusing. The essence of AD type is to express the forward and reverse differentiation modes. So first, we use an example to demonstrate how these two AD modes work.

module Make (A : Owl_types_ndarray_algodiff.Sig) = struct
  type t =
    | F   of A.elt
    | Arr of A.arr
    | DF  of t * t * int
    | DR  of t * t ref * op * int ref * int * int ref
  and adjoint = t -> t ref -> (t * t) list -> (t * t) list
  and register = t list -> t list
  and label = string * t list
  and op = adjoint * register * label
end

3.2.1 Forward and Reverse Modes

In this part, we illustrate the two modes of differentiation with an example, which is based on this simple function:

$$ y\left({x}_0,{x}_1\right)=\sin \left({x}_0{x}_1\right) $$

This function takes two inputs, and our aim is to compute \( \nabla y=\left(\frac{\partial y}{\partial {x}_0},\frac{\partial y}{\partial {x}_1}\right) \). Computations can be represented as a graph shown in Figure 3-2. Each node represents either an input/output or intermediate variables generated by the corresponding mathematical function. Each node is named vi. Herein, the inputs are v0 = x0 and v1 = x1, the intermediate node is v2 = v0v1, and the output is y = v3 = sin(v2).

Figure 3-2. Computation graph in the example calculation. The inputs x0 and x1 feed into mul, whose result feeds into sin, which gives the output.

Both the forward and reverse modes rely on basic rules to calculate differentiation. On one hand, there are the basic forms of derivative equations, such as \( \frac{d}{dx}\sin (x)=\cos (x) \), \( \frac{d}{dx}u(x)v(x)={u}^{\prime }(x)v(x)+u(x){v}^{\prime }(x) \), etc. On the other is the chain rule. It states that, suppose we have two functions f and g that can be composed to create a function F(x) = f(g(x)), then the derivative of F can be calculated as

$$ {F}^{\prime }(x)={f}^{\prime}\left(g(x)\right){g}^{\prime }(x) $$

The question is how to implement them in a differentiation system.

Forward Mode

Let’s look at the first way, namely, the “forward” mode, to calculate derivatives. We ultimately wish to calculate \( \frac{\partial y}{\partial {x}_0} \) (and \( \frac{\partial y}{\partial {x}_1} \), which can be calculated in a similar way). We begin by calculating some intermediate results that will prove to be useful. Using the labels vi to refer to the intermediate computations, we have \( \frac{\partial {v}_0}{\partial {x}_0}=1 \) and \( \frac{\partial {v}_1}{\partial {x}_0}=0 \) immediately, since v0 = x0 and v1 = x1 actually.

Next, consider \( \frac{\partial {v}_2}{\partial {x}_0} \), which requires us to use the derivative rule on multiplication. It is a bit trickier and requires the use of the chain rule:

$$ \frac{\partial {v}_2}{\partial {x}_0}=\frac{\partial \left({x}_0{x}_1\right)}{\partial {x}_0}={x}_1\frac{\partial \left({x}_0\right)}{\partial {x}_0}+{x}_0\frac{\partial \left({x}_1\right)}{\partial {x}_0}={x}_1 $$

After calculating \( \frac{\partial {v}_2}{\partial {x}_0} \), we proceed to compute the partial derivative of v3, which is the final result \( \frac{\partial y}{\partial {x}_0} \) we are looking for. This process starts with the input variables and ends with the output variables, and that’s where the name “forward differentiation” comes from. We can simplify the notation by letting \( {\dot{v}}_i=\frac{\partial {v}_i}{\partial {x}_0} \). The \( {\dot{v}}_i \) is called the tangent of function vi(x0, x1, …, xn) regarding the input variable x0, and the results of evaluating the function at each intermediate point are called the primal value.

Let’s calculate \( \dot{y} \) when setting x0 = 2 and x1 = 2. The full forward differentiation calculation process is shown in Table 3-1 where two simultaneous computation processes take place in the two computation columns: the primal just performs computation following the computation graph; the tangent gives the derivative for each intermediate variable with regard to x0.

Table 3-1 The Computation Process of Forward Differentiation, Shown to Three Significant Figures
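The two columns that Table 3-1 describes can be reproduced with a few lines of plain OCaml. The following is a minimal sketch using dual numbers; the dual record and the helpers mul and sin_ are hypothetical names chosen for this illustration and are not Owl's implementation.

```ocaml
(* A dual number carries a primal value p and its tangent t. *)
type dual = { p : float; t : float }

(* Product rule: (uv)' = u'v + uv'. *)
let mul a b = { p = a.p *. b.p; t = (a.t *. b.p) +. (a.p *. b.t) }

(* Chain rule for sine: (sin u)' = u' * cos u. *)
let sin_ a = { p = sin a.p; t = a.t *. cos a.p }

let y x0 x1 = sin_ (mul x0 x1)

let () =
  (* Seed the tangent of x0 with 1 and that of x1 with 0 to get dy/dx0. *)
  let v0 = { p = 2.; t = 1. } and v1 = { p = 2.; t = 0. } in
  let v3 = y v0 v1 in
  Printf.printf "y = %.3f, dy/dx0 = %.3f\n" v3.p v3.t
```

Each operator only reads the primal and tangent of its direct inputs, which is the locality property discussed next. Obtaining the derivative with respect to x1 requires a second run with the seeds swapped.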

Two things need to be noted in this calculation process. The first is that in algorithmic differentiation, unlike symbolic differentiation, the computation is performed step by step, instead of after the whole computation is unwrapped into one big formula following the chain rule. Second, in each step, we only need to keep two values: primal and tangent. Besides, each step only needs to have access to its “parents,” using graph theory’s term. For example, to compute v2 and \( {\dot{v}}_2 \), we need to know the primal and tangent of v0 and v1; to compute that of v3, we need to know the primal and tangent of v2; etc. These observations are key to our implementation.

Reverse Mode

Now let’s rethink this problem from the other direction: from outputs to inputs. The problem remains the same, that is, to calculate \( \frac{\partial y}{\partial {x}_0} \). We still follow the same “step-by-step” procedure as in the previous forward mode. The only difference is that this time we calculate it backward. For example, in our example \( y={v}_3=\sin \left({v}_2\right) \), so if only we know \( \frac{\partial y}{\partial {v}_2} \), we would move a step closer to our target solution.

We first observe that \( \frac{\partial y}{\partial {v}_3}=1 \), since y and v3 are the same. We then compute \( \frac{\partial y}{\partial {v}_2} \) by applying the chain rule:

$$ \frac{\partial y}{\partial {v}_2}=\frac{\partial y}{\partial {v}_3}\ast \frac{\partial {v}_3}{\partial {v}_2}=1\ast \frac{\partial \sin \left({v}_2\right)}{\partial {v}_2}=\cos \left({v}_2\right). $$

We can simplify it by applying a substitution:

$$ {\overline{v}}_i=\frac{\partial y}{\partial {v}_i} $$

for the derivative of output variable y with regard to intermediate node vi. \( {\overline{v}}_i \) is called the “adjoint of variable vi with respect to the output variable y.” Using this notation, Eq. 3.5 can be rewritten as

$$ {\overline{v}}_2={\overline{v}}_3\ast \frac{\partial {v}_3}{\partial {v}_2}=1\ast \cos \left({v}_2\right) $$

Note the difference between tangent and adjoint. In forward mode, we know \( {\dot{v}}_0 \) and \( {\dot{v}}_1 \) and then calculate \( {\dot{v}}_2 \), \( {\dot{v}}_3 \), … until we get the target. In reverse mode, we start with \( {\overline{v}}_n=1 \) and calculate \( {\overline{v}}_{n-1} \), \( {\overline{v}}_{n-2} \), … until we have our target \( {\overline{v}}_0=\frac{\partial y}{\partial {v}_0}=\frac{\partial y}{\partial {x}_0} \). \( {\dot{v}}_3={\overline{v}}_0 \) in this example, given that we are talking about derivatives with respect to x0 when we use \( {\dot{v}}_3 \). As a result, the reverse mode is also called the adjoint mode.

Following this procedure, we can now perform the complete reverse mode differentiation. Note one major difference compared to the forward mode. In Table 3-1, we can compute the primal and tangent in one pass, since computing one of them does not require the other. However, as shown in the previous analysis, it is possible to require the value of v2 and possibly other previous primal values to compute \( {\overline{v}}_2 \). Therefore, a forward pass is first required, as shown in Table 3-2, to compute the required intermediate values. They are actually identical to those in the Primal Computation column of Table 3-1. We put it here again to stress our point about this stand-alone forward computing pass.

Table 3-2 The Forward Pass in Reverse Mode

Table 3-3 shows the backward pass in the reverse differentiation process, starting from the very end, and calculates all the way up to the beginning. A short summary: To compute differentiation using reverse mode, we need a forward pass to compute primal and next a backward pass to compute adjoint.

Table 3-3 The Backward Pass in Reverse Mode
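The two passes can be written out by hand for the example function. This is a minimal sketch in plain OCaml, not Owl code; reverse_grad is a hypothetical name, and the variables follow the vi and adjoint notation used in the text.

```ocaml
(* Reverse mode for y = sin (x0 * x1), written out step by step. *)
let reverse_grad x0 x1 =
  (* Forward pass: compute and keep the primal values. *)
  let v0 = x0 and v1 = x1 in
  let v2 = v0 *. v1 in
  let v3 = sin v2 in
  (* Backward pass: start from the output adjoint and walk back. *)
  let v3_bar = 1. in
  let v2_bar = v3_bar *. cos v2 in
  let v1_bar = v2_bar *. v0 in
  let v0_bar = v2_bar *. v1 in
  v3, (v0_bar, v1_bar)

let () =
  let y, (dx0, dx1) = reverse_grad 2. 2. in
  Printf.printf "y = %.3f, dy/dx0 = %.3f, dy/dx1 = %.3f\n" y dx0 dx1
```

Note that the backward pass reuses v0, v1, and v2 from the forward pass, and that a single backward sweep yields the adjoints of both inputs.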

The forward and reverse modes are equivalent in computing differentiation. So you might wonder, since the forward mode looks more straightforward, why don’t we just stick with it all along? Note that in the reverse mode, we obtained \( \frac{\partial y}{\partial {x}_1} \) “for free” while calculating \( \frac{\partial y}{\partial {x}_0} \). But in the forward mode, to calculate the derivative regarding another input, we have to calculate all the intermediate results again. So here lies one of the most significant strengths of the reverse mode: no matter how many inputs there are, a single reverse pass gives us the derivatives with respect to all the inputs.

This property is extremely useful in neural networks. The computation graph constructed in neural networks tends to be quite complex, often with more than one input. The target of using AD is to find the derivative of the output, probably a scalar value of a loss function, regarding inputs. Thus, using the reverse mode AD is more efficient.

3.2.2 Data Types

Now that we understand the basic elements in computing a derivative, let’s turn to the data type used in the AD system. It is built upon two basic types: scalar number F and ndarray Arr. They are of type A.elt and A.arr. Here, A represents an interface that mostly resembles that of an ndarray module. It means that their specific types, such as single or double precision, C implementation or base implementation, etc., all depend on this A ndarray module. Therefore, the AD module does not need to deal with all the lower-level details. We will talk about how the AD module interacts with the other modules later in this chapter. For now, it suffices to simply understand them as, for example, single-precision float number and ndarray with single-precision float as elements, so as to better grasp the core ideas in AD.

module Make (A : Owl_types_ndarray_algodiff.Sig) = struct
  type t =
    | F   of A.elt
    | Arr of A.arr
    (* primal, tangent, tag *)
    | DF  of t * t * int
    (* primal, adj accu, op, fanout, tag, tracker *)
    | DR  of t * t ref * op * int ref * int * int ref
  and adjoint = t -> t ref -> (t * t) list -> (t * t) list
  and register = t list -> t list
  and label = string * t list
  and op = adjoint * register * label
end

The other two types are compound types, each representing one differentiation mode. The DF type contains three parts, and the most important ones are the first two: primal and tangent. The DR type contains six parts, and the most important ones are the first, primal, and the third, op. op itself consists of three parts: adjoint, register, and label, of which adjoint is the most important component. The DR type also contains an adjoint accumulator (the second parameter), a fanout flag, and a tracker flag. The accumulator is of reference type since it needs to be updated during the propagation process. Both DF and DR types contain a tag of integer type. Later, we will discuss how these extra parts work in an AD engine. To focus on the core idea in AD, for now we introduce the most important elements: primal, tangent, and adjoint.

In essence, the computation graph in AD is constructed by building a list. Each element of this list contains two elements: the partial derivative computation and the original type t data. In the data type, the adjoint is a function. For each t type data, it specifies how to construct this list. Though the derivative computation rule of different operators varies, the adjoint generally falls into several patterns. For example, here is what the adjoint function looks like for an operation/function that takes one input and produces one output, such as sin, exp, etc.

let r a =
  let adjoint cp ca t = (dr (primal a) cp ca, a) :: t in
  let register t = a :: t in
  let label = S.label, [ a ] in
  adjoint, register, label

Here, the r function returns an op type, which consists of the adjoint function, the register function, and the label tuple. First, let’s look at the adjoint function. The first two variables cp and ca will be used in the derivative function dr. We will talk about it later in Section 3.3. For now, we only need to know that the reverse derivative computation dr calculates something; we put it together with the original input operator a into a tuple and add them to the existing list t, which is the third argument. The other two components are supplementary. The register function actually is an adjoint function without really calculating adjoints; it only stacks a list of original operators. The third one, label, puts together a string such as “sin” or “exp” to the input operator.

Next, let’s see another example in an operator that takes multiple inputs, such as add, mul (multiplication), etc. It’s a bit more complex:

let r_d_d a b =
  let adjoint cp ca_ref t =
    let abar, bbar = dr_ab (primal a) (primal b) cp ca_ref in
    (abar, a) :: (bbar, b) :: t
  in
  let register t = a :: b :: t in
  let label = S.label ^ "_d_d", [ a; b ] in
  adjoint, register, label

The difference is that one such operator needs to push two items into the list. So here dr_ab is still a function that calculates derivatives in reverse mode, and it returns the derivatives on its two parents, denoted by abar and bbar, which are both pushed to the adjoint list. The register and label follow a similar pattern. In fact, for an operator that takes multiple inputs, we should also consider the case in which one of the inputs is just a constant element. In that case, only one element should be put into the list:

let r_d_c a b =
  let adjoint cp ca_ref t = (S.dr_a (primal a) b cp ca_ref, a) :: t in
  let register t = a :: t in
  let label = S.label ^ "_d_c", [ a; b ] in
  adjoint, register, label
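To see how such adjoint functions can drive a backward pass, consider the following self-contained sketch. It is a deliberately reduced stand-in for Owl's types, under several assumptions: values are plain floats, there are no tags, fanout counters, register, or label, and every contribution is propagated separately instead of being accumulated first. The names make_input, sin_, mul, propagate, and grad2 are hypothetical.

```ocaml
(* Simplified stand-in for type t: floats only, no tag or fanout. *)
type t =
  | F of float
  | DR of float * float ref * adjoint (* primal, adjoint accumulator, op *)

(* Given the adjoint of a node, push (contribution, parent) pairs. *)
and adjoint = float ref -> (float * t) list -> (float * t) list

let primal = function F x -> x | DR (x, _, _) -> x

(* An input has no parents, so its adjoint function pushes nothing. *)
let make_input x = DR (x, ref 0., (fun _ t -> t))

(* Single-input operator: one pair is pushed. *)
let sin_ a =
  let adjoint ca t = (!ca *. cos (primal a), a) :: t in
  DR (sin (primal a), ref 0., adjoint)

(* Two-input operator: one pair per parent is pushed. *)
let mul a b =
  let adjoint ca t = (!ca *. primal b, a) :: (!ca *. primal a, b) :: t in
  DR (primal a *. primal b, ref 0., adjoint)

(* Pop a pair, add it to the accumulator of the node, then expand the node. *)
let rec propagate = function
  | [] -> ()
  | (_, F _) :: rest -> propagate rest
  | (v, DR (_, acc, adjoint)) :: rest ->
    acc := !acc +. v;
    propagate (adjoint (ref v) rest)

let grad2 f x0 x1 =
  let a = make_input x0 and b = make_input x1 in
  propagate [ (1., f a b) ];
  let adj = function DR (_, acc, _) -> !acc | F _ -> 0. in
  adj a, adj b

let () =
  let dx0, dx1 = grad2 (fun a b -> sin_ (mul a b)) 2. 3. in
  Printf.printf "dy/dx0 = %g, dy/dx1 = %g\n" dx0 dx1
```

Propagating each contribution on its own is correct by linearity but can revisit shared nodes many times; the fanout and tracker fields in the real DR type exist to avoid that cost.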

3.2.3 Operations on AD Type

After understanding the data type defined in AD, let’s take a look at what sorts of operations can be applied to them. They are defined in the owl_algodiff_core.ml file. The most notable ones are the “get” functions that retrieve certain information from an AD type data, such as its primal, tangent, and adjoint values. In the following code, the primal' is a “deep” function that recursively finds the primal value as float or ndarray format.

let primal = function
  | DF (ap, _, _)          -> ap
  | DR (ap, _, _, _, _, _) -> ap
  | ap                     -> ap
let rec primal' = function
  | DF (ap, _, _)          -> primal' ap
  | DR (ap, _, _, _, _, _) -> primal' ap
  | ap                     -> ap
let tangent = function
  | DF (_, at, _) -> at
  | DR _          -> failwith "error: no tangent for DR"
  | ap            -> zero ap
let adjval = function
  | DF _                   -> failwith "error: no adjval for DF"
  | DR (_, at, _, _, _, _) -> !at
  | ap                     -> zero ap
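The difference between the shallow primal and the deep primal' only shows up on nested values, which arise, for instance, when differentiating more than once. The sketch below uses a reduced type with floats and DF only; it is an illustration under that assumption, not the definition used in Owl.

```ocaml
(* Reduced AD type: a float, or a forward-mode value. *)
type t =
  | F of float
  | DF of t * t * int (* primal, tangent, tag *)

let primal = function DF (ap, _, _) -> ap | ap -> ap
let rec primal' = function DF (ap, _, _) -> primal' ap | ap -> ap

(* A value wrapped twice, as nested differentiation would produce. *)
let x = DF (DF (F 3., F 1., 1), F 1., 2)

let () =
  (match primal x with
   | DF _ -> print_endline "primal : one layer removed, still a DF value"
   | F _ -> print_endline "primal : a float");
  match primal' x with
  | F v -> Printf.printf "primal': F %g\n" v
  | DF _ -> print_endline "primal': unexpected DF"
```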

And the zero function resets all elements to the zero status:

let rec zero = function
  | F _                    -> F A.(float_to_elt 0.)
  | Arr ap                 -> Arr A.(zeros (shape ap))
  | DF (ap, _, _)          -> ap |> primal' |> zero
  | DR (ap, _, _, _, _, _) -> ap |> primal' |> zero

Another group of important operations are those that convert the AD type to and from ordinary types such as float and ndarray:

let pack_elt x = F x
let unpack_elt x =
  match primal x with
  | F x -> x
  | _   -> failwith "error: AD.unpack_elt"
let pack_flt x = F A.(float_to_elt x)
let _f x = F A.(float_to_elt x)
let pack_arr x = Arr x
let unpack_arr x =
  match primal x with
  | Arr x -> x
  | _     -> failwith "error: AD.unpack_arr"

There are also operations that provide helpful utilities. One of them is the zero we have seen, and also some functions show type information:

let shape x =
  match primal' x with
  | F _    -> [||]
  | Arr ap -> A.shape ap
  | _      -> failwith "error: AD.shape"

3.3 Operators

The graph is constructed with a series of operators that can be used to process AD type data as well as building up a computation graph that is differentiable. They are divided into submodules: Maths is the most important component, and it contains a full set of mathematical functions to enable constructing various computation graphs; Linalg contains a subset of linear algebra functions; NN contains functions used in neural networks, such as two-dimensional convolution, dropout, etc.; Mat is specifically for matrix operations, such as eye that generates an identity matrix; and Arr provides functions such as shape and numel for ndarrays.

As shown in Figure 3-1, the implementation of an operation can be abstracted into two parts: (a) what the derivative and calculation rules of it are and (b) how these rules are applied into the AD system. The first part is defined in the owl_algodiff_ops.ml, and the latter is in owl_algodiff_ops_builder.ml.

3.3.1 Calculation Rules

Let’s look at some examples from the first to see what these calculation rules are and how they are expressed in OCaml. We can use the sine function as an example. It takes an input and computes its sine value as output. This module specifies four computing rules, each corresponding to one type of AD data. Here, module A is the underlying “normal” ndarray module that implements functions for ndarray and scalar values. It can be single precision or double precision, implemented using OCaml or C. For the F scalar type, ff_f specifies using the sin function from the Scalar submodule of A. If the data is an AD ndarray, ff_arr states that the sine function should be applied on all of its elements by using the A.sin function. Next, if the data is of type DF, the df function is used. As shown in the example in Table 3-1, it computes tangent (at) * derivative of primal (ap). In the case of the sine function, it computes at * cos ap. Finally, the dr computes what we have shown in Table 3-3. It computes adjoint (ca) * derivative of primal (a). Therefore, here it computes !ca * cos a. The dereference operator ! is needed because the adjoint value in the DR type is a reference that can be updated.

module struct
  let label = "sin"
  let ff_f a = F A.Scalar.(sin a)
  let ff_arr a = Arr A.(sin a)
  let df _cp ap at = at * cos ap
  let dr a _cp ca = !ca * cos a
end

A similar template can be applied to other operators that take one input and produce one output, such as the square root (sqrt), as shown in the next module. The derivative rule for the square root is \( {\left(\sqrt{x}\right)}^{\prime }=\frac{1}{2\sqrt{x}} \).

module struct
  let label = "sqrt"
  let ff_f a = F A.Scalar.(sqrt a)
  let ff_arr a = Arr A.(sqrt a)
  let df cp _ap at = at / (pack_flt 2. * cp)
  let dr _a cp ca = !ca / (pack_flt 2. * cp)
end

However, things get more complicated once an operator needs to deal with more than one input. The problem is that for each of these four computation rules, we need to consider multiple possible cases. Take the divide operation as an example. For a simple primal value computation, we need to consider four cases: both inputs are scalar, both are ndarray, and one of them is ndarray and the other is scalar (in either order). These correspond to four rules: ff_aa, ff_bb, ff_ab, and ff_ba. For the forward computation of tangent regarding \( \frac{a}{b} \), we also need to consider three cases:

  • df_da corresponds to the derivative when the second input is constant:

$$ {\left(\frac{a(x)}{b}\right)}^{\prime }=\frac{a^{\prime }(x)}{b} $$

    In code, it is at / bp. Here, at is the tangent of the first input a(x), and bp is the primal value of the second input b.

  • df_db corresponds to the derivative when the first input is constant:

$$ {\left(\frac{a}{b(x)}\right)}^{\prime }=-\frac{a}{b{(x)}^2}{b}^{\prime }(x)=-{b}^{\prime }(x)\frac{a}{b(x)}\frac{1}{b(x)}, $$

    And thus, it can be represented by neg bt*cp/bp. Here, neg is the negative operator, and cp is the primal value of the output, \( \frac{a}{b(x)} \).

  • df_dab is for the case that both inputs are of nonconstant AD type, that is, DF or DR. It thus calculates

$$ {\left(\frac{a(x)}{b(x)}\right)}^{\prime }=\frac{a^{\prime }(x)-\frac{a(x)}{b(x)}{b}^{\prime }(x)}{b(x)}, $$

    And the corresponding code is (at-(bt*cp))/bp.

Expressing the rules for the reverse mode is more straightforward. If both inputs a and b are nonconstant, then the function dr_ab computes \( \overline{y}\frac{\partial\ y}{\partial\ a} \) and \( \overline{y}\frac{\partial\ y}{\partial\ b} \), where \( y=\frac{a}{b} \) and \( \overline{y} \) is the adjoint of the output (ca in the code). Thus, dr_ab returns two values; the first is \( \overline{y}/b \) (!ca / b), and the second is \( -\overline{y}\frac{a}{b^2} \) (!ca * (neg a / (b * b))). In the code, _squeeze_broadcast x s is an internal helper function that squeezes array x so that it has shape s. If one of the inputs is constant, then we can just omit the corresponding result, as shown in dr_a and dr_b.

module struct
  let label = "div"
  let ff_aa a b = F A.Scalar.(div a b)
  let ff_ab a b = Arr A.(scalar_div a b)
  let ff_ba a b = Arr A.(div_scalar a b)
  let ff_bb a b = Arr A.(div a b)
  let df_da _cp _ap at bp = at / bp
  let df_db cp _ap bp bt = neg bt * cp / bp
  let df_dab cp _ap at bp bt = (at - (bt * cp)) / bp
  let dr_ab a b _cp ca =
    ( _squeeze_broadcast (!ca / b) (shape a)
    , _squeeze_broadcast (!ca * (neg a / (b * b))) (shape b) )
  let dr_a a b _cp ca = _squeeze_broadcast (!ca / b) (shape a)
  let dr_b a b _cp ca =
    _squeeze_broadcast (!ca * (neg a / (b * b))) (shape b)
end
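The forward and reverse rules for division can be checked against each other on plain floats. The sketch below restates the rules with float arithmetic, dropping the unused arguments and the broadcasting helper; these simplified signatures are assumptions made for the illustration and differ from those in the module above. The test inputs a(x) = x² and b(x) = x + 1 are also chosen only for this example.

```ocaml
(* Float-only versions of the div rules; cp is the primal output a / b. *)
let df_da at bp = at /. bp
let df_db cp bp bt = -.bt *. cp /. bp
let df_dab cp at bp bt = (at -. (bt *. cp)) /. bp

(* Reverse rule: adjoint contributions for a and b, given output adjoint ca. *)
let dr_ab a b ca = ca /. b, ca *. (-.a /. (b *. b))

let () =
  (* a(x) = x^2 and b(x) = x + 1 at x = 2, so a = 4, a' = 4, b = 3, b' = 1. *)
  let ap, at, bp, bt = 4., 4., 3., 1. in
  let cp = ap /. bp in
  let forward = df_dab cp at bp bt in
  (* Reverse mode: seed the output adjoint with 1, then apply the chain rule. *)
  let abar, bbar = dr_ab ap bp 1. in
  let reverse = (abar *. at) +. (bbar *. bt) in
  Printf.printf "forward = %g, reverse = %g\n" forward reverse
```

Both values should agree with the analytical derivative (x² + 2x)/(x + 1)² evaluated at x = 2, which is 8/9.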

A similar example is the operator pow, which computes \( {a}^b \). It implements calculation rules that are similar to those of div.

module struct
  let label = "pow"
  let ff_aa a b = F A.Scalar.(pow a b)
  let ff_ab a b = Arr A.(scalar_pow a b)
  let ff_ba a b = Arr A.(pow_scalar a b)
  let ff_bb a b = Arr A.(pow a b)
  let df_da _cp ap at bp = at * (ap ** (bp - pack_flt 1.)) * bp
  let df_db cp ap _bp bt = bt * cp * log ap
  let df_dab cp ap at bp bt =
    ((ap ** (bp - pack_flt 1.)) * (at * bp)) + (cp * bt * log ap)
  let dr_ab a b cp ca =
    ( _squeeze_broadcast (!ca * (a ** (b - pack_flt 1.)) * b) (shape a)
    , _squeeze_broadcast (!ca * cp * log a) (shape b) )
  let dr_a a b _cp ca =
    _squeeze_broadcast (!ca * (a ** (b - pack_flt 1.)) * b) (shape a)
  let dr_b a b cp ca =
    _squeeze_broadcast (!ca * cp * log a) (shape b)
end

3.3.2 Generalize Rules into Builder Template

So far, we have talked about the calculation rules, but there is still a question: how to utilize these rules to build an operator of type t that we have described in Section 3.2. To do that, we need to use the power of functor in OCaml. In the AD module in Owl, the operators are categorized according to the number of inputs and outputs, each with its own template. Let’s take the “single-input-single-output” (SISO) operators such as sine as an example. This template takes a module of type Siso as input, as shown in the following. Notice that the calculation rules of the sine function shown in the previous section form exactly such a module.

module type Siso = sig
  val label : string
  val ff_f : A.elt -> t
  val ff_arr : A.arr -> t
  val df : t -> t -> t -> t
  val dr : t -> t -> t ref -> t
end

In the end, we need to build a sin : t -> t operator, which accepts a value of AD type t and returns an output of type t. This function is what we need:

let op_siso ~ff ~fd ~df ~r a =
  match a with
  | DF (ap, at, ai) ->
    let cp = fd ap in
    DF (cp, df cp ap at, ai)
  | DR (ap, _, _, _, ai, _) ->
    let cp = fd ap in
    DR (cp, ref (zero cp), r a, ref 0, ai, ref 0)
  | ap -> ff ap

These names may seem enigmatic. Here, the fd x function calculates the primal value of x. ff x performs forward computation on the two basic types: scalar and ndarray. The df cp ap at function computes the tangents in forward mode. Finally, the function r computes the op part in the type, which “remembers” how to build up the graph in the form of a list. To put them together, the basic logic of this function goes like this:

  • If the input is a DF type, produce a new DF type after calculating the primal and tangent in forward mode.

  • If the input is a DR type, produce a new DR type, with its knowledge about how to compute adjoints and how to build up the list.

  • Otherwise, it’s the basic type, scalar or ndarray; perform simple forward computation on it.
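The dispatch logic in the list above can be tried out in isolation. The following is a minimal sketch, assuming a reduced type with only the F and DF constructors (the DR case is left out on purpose); the helpers mul, cos_const, and sin_ad are hypothetical names introduced for this illustration and are not part of Owl.

```ocaml
(* A simplified AD value: a float constant or a forward-mode number.
   The DR case of the real type is deliberately omitted. *)
type t =
  | F of float
  | DF of t * t * int (* primal, tangent, tag *)

(* Generic single-input-single-output template, forward mode only. *)
let op_siso ~ff ~fd ~df a =
  match a with
  | DF (ap, at, ai) ->
    let cp = fd ap in
    DF (cp, df cp ap at, ai)
  | ap -> ff ap

(* Helpers on constants, just enough for the tangent rule below. *)
let mul a b =
  match a, b with
  | F x, F y -> F (x *. y)
  | _ -> failwith "mul: constants only in this sketch"

let cos_const = function
  | F x -> F (Stdlib.cos x)
  | _ -> failwith "cos_const: constants only in this sketch"

(* The sine operator, assembled from its calculation rules. *)
let rec sin_ad a =
  let ff = function
    | F x -> F (Stdlib.sin x)
    | _ -> failwith "sin_ad: unexpected input"
  in
  let fd x = sin_ad x in
  let df _cp ap at = mul at (cos_const ap) in
  op_siso ~ff ~fd ~df a

let () =
  match sin_ad (DF (F 0., F 1., 1)) with
  | DF (F p, F t, _) -> Printf.printf "primal = %g, tangent = %g\n" p t
  | _ -> print_endline "unexpected result"
```

For the input DF (F 0., F 1., 1), the primal is sin 0 = 0 and the tangent is cos 0 = 1.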

Note that in the newly constructed DR value, only the primal value and the op are updated, and the tag is carried over from the input. The remaining fields, such as the adjoint accumulator and the counters, are all set to zero. That is because a computation graph is constructed in the forward pass, and the calculation of adjoints does not happen in this step.

So the next question is: for the sine function, how can we get fd, ff, and so on? Luckily, the previous Siso module that specifies the various calculation rules already provides all the ingredients required. Assume we have named this Siso sine module S; then we have the forward computation on the two basic types:

let ff = function   | F a   -> S.ff_f a   | Arr a -> S.ff_arr a   | _     -> error_uniop label a

And the r function looks like what we have introduced in Section 3.2, using the dr function from module S to specify how to construct the list.

let r a =   let adjoint cp ca t = (S.dr (primal a) cp ca, a) :: t in   let register t = a :: t in   let label = S.label, [ a ] in   adjoint, register, label

So now we have the function:

let rec f a =   let open S in   (* define ff and r as stated above *)   let fd a = f a in   op_siso ~ff ~fd ~df:S.df ~r a

Put them together, and here is the final function that accepts a module and builds an operator:

let build_siso =   (* define op_siso *)   fun (module S : Siso) ->     (* define f *)     f

To build a sin operator, we use the following code:

let sin = build_siso (   module struct     let label = "sin"     let ff_f a = F A.Scalar.(sin a)     let ff_arr a = Arr A.(sin a)     let df _cp ap at = at * cos ap     let dr a _cp ca = !ca * cos a   end : Siso)

The code is concise, easy to read, and less prone to coding errors. To build another SISO operator, such as the natural logarithm, we only need to change the rules:

let log = build_siso (
  module struct
    let label = "log"
    let ff_f a = F A.Scalar.(log a)
    let ff_arr a = Arr A.(log a)
    let df _cp ap at = at / ap
    let dr a _cp ca = !ca / a
  end : Siso)
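The idea of passing a module of rules to a builder can be reproduced without Owl. The sketch below is an illustration under simplifying assumptions: it works on plain dual numbers (a record named dual, introduced here) in forward mode only, and its Siso signature is a reduced, hypothetical version of the one shown earlier.

```ocaml
(* Dual number: primal value and tangent. *)
type dual = { p : float; t : float }

(* Reduced rule set for a single-input-single-output operator. *)
module type Siso = sig
  val label : string
  val ff : float -> float                     (* primal computation *)
  val df : float -> float -> float -> float   (* df cp ap at: tangent of the result *)
end

(* Turn a first-class module of rules into an operator on dual numbers. *)
let build_siso (module S : Siso) =
  fun a ->
    let cp = S.ff a.p in
    { p = cp; t = S.df cp a.p a.t }

let sin_d = build_siso (module struct
  let label = "sin"
  let ff = sin
  let df _cp ap at = at *. cos ap
end : Siso)

let log_d = build_siso (module struct
  let label = "log"
  let ff = log
  let df _cp ap at = at /. ap
end : Siso)

let () =
  let y = log_d { p = 2.; t = 1. } in
  Printf.printf "log: primal = %g, tangent = %g\n" y.p y.t
```

Only the rule modules differ between sin_d and log_d; the builder itself is shared, which is the point of the template design.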

Here, we only use the most simple SISO type builder template as an example. We also include the other templates:

  • SIPO: Single input and pair outputs, such as the linear algebra operation qr for QR factorization

  • SITO: Single input and three outputs, such as the SVD factorization

  • SIAO: Single input and array outputs, such as the split function that splits input ndarray into multiple ones

  • PISO: Pair inputs and single output, such as add and mul

  • AISO: Array input and single output, such as concatenate, the inverse operation of split

These templates can become quite complex. For example, in building the add function, to choose from different combinations of possible input types, the builder function can be as complex as

op_piso ~ff ~fd ~df_da ~df_db ~df_dab ~r_d_d ~r_d_c ~r_c_d a b

But the principles are the same.
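To make the pair-input case more concrete, here is a minimal forward-mode sketch on a reduced type with only F and DF. It is not Owl's op_piso: the reverse-mode arguments (r_d_d, r_d_c, r_c_d) are dropped, and it assumes that two DF inputs always carry the same tag, which the real implementation does not assume.

```ocaml
type t =
  | F of float
  | DF of t * t * int (* primal, tangent, tag *)

(* Forward-mode template for pair-input operators. Which tangent rule is
   used depends on which of the two inputs carries a tangent. *)
let op_piso ~ff ~fd ~df_da ~df_db ~df_dab a b =
  match a, b with
  | DF (ap, at, ai), DF (bp, bt, _) ->
    let cp = fd ap bp in
    DF (cp, df_dab cp ap at bp bt, ai)
  | DF (ap, at, ai), _ ->
    let cp = fd ap b in
    DF (cp, df_da cp ap at b, ai)
  | _, DF (bp, bt, bi) ->
    let cp = fd a bp in
    DF (cp, df_db cp a bp bt, bi)
  | _ -> ff a b

let rec add a b =
  let ff a b =
    match a, b with
    | F x, F y -> F (x +. y)
    | _ -> failwith "add: unexpected input"
  in
  op_piso ~ff ~fd:add
    ~df_da:(fun _cp _ap at _bp -> at)
    ~df_db:(fun _cp _ap _bp bt -> bt)
    ~df_dab:(fun _cp _ap at _bp bt -> add at bt)
    a b

let rec mul a b =
  let ff a b =
    match a, b with
    | F x, F y -> F (x *. y)
    | _ -> failwith "mul: unexpected input"
  in
  op_piso ~ff ~fd:mul
    ~df_da:(fun _cp _ap at bp -> mul at bp)
    ~df_db:(fun _cp ap _bp bt -> mul ap bt)
    ~df_dab:(fun _cp ap at bp bt -> add (mul at bp) (mul ap bt))
    a b

(* d(x * x)/dx at x = 3 is 6; d(2 * x)/dx is 2. *)
let x = DF (F 3., F 1., 1)
let square = mul x x
let double = mul x (F 2.)
```

The three tangent rules correspond to the cases where only a, only b, or both inputs are being differentiated.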

3.4 API

The previous section introduces AD operators, the building blocks to construct an AD computation graph. The next thing we need is an “engine” that begins the differentiation process. For that purpose, we first introduce several low-level APIs provided by the AD module and explain how they are used to build up user-friendly advanced APIs such as diff and grad.

3.4.1 Low-Level APIs

We differentiate between the two differentiation modes: forward mode and backward mode. As explained in the previous section, if an input x is of type DF, then by applying operations such as sin x, a computation graph is constructed, and the primal and tangent values are also computed during this process. All we need to do is to retrieve the required value once this process is finished. To start a forward mode differentiation, we need to create a value of type DF as the initial input, using the primal value, the initial tangent (equal to 1), and an integer tag as arguments:

let make_forward p t i = DF (p, t, i)

For example, if we are to calculate the derivative of \( f(x)=\sin \left({x}^2\right) \) at x = 2, we can first create an initial point as

let x = make_forward (pack_flt 2.) (pack_flt 1.) 1 let y = Maths.(pow x (pack_flt 2.) |> sin) let t = tangent y

That’s it. Once the computation y is constructed, we can directly retrieve the tangent value using the tangent function.
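The result can be checked against the analytic derivative \( 2x\cos \left({x}^2\right) \), which equals \( 4\cos 4\approx -2.6146 \) at x = 2. The following self-contained sketch mimics the same computation with plain dual numbers; the record type and the helpers pow_const and sin_d are assumptions made for this illustration, not Owl functions.

```ocaml
(* Dual number: primal value and tangent. *)
type dual = { p : float; t : float }

let make_forward p t = { p; t }
let tangent d = d.t

(* x ** n for a constant exponent n, with its forward-mode tangent. *)
let pow_const x n = { p = x.p ** n; t = x.t *. n *. (x.p ** (n -. 1.)) }

(* sin x with its forward-mode tangent. *)
let sin_d x = { p = sin x.p; t = x.t *. cos x.p }

let x = make_forward 2. 1.
let y = pow_const x 2. |> sin_d

let () =
  Printf.printf "AD: %.6f, analytic: %.6f\n" (tangent y) (4. *. cos 4.)
```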

The backward mode is a bit more complex. Remember that it consists of two passes: one forward and one backward. From the previous section, we know that once the graph is constructed, the primal data are calculated, but the adjoint values are all set to zero. Therefore, we need some extra mechanism to pump the computation flow backward to calculate adjoint values. Here is an example to use low-level APIs to compute derivatives in the reverse mode:

open AD
let f x = Maths.(pow x (pack_flt 2.) |> sin)
let x = F 2.
let x' = make_reverse x (tag ())
let y = f x'
let _ = reverse_prop (F 1.) y
let y' = adjval x'

The problem to solve is still the same: calculate the derivative of \( f(x)=\sin \left({x}^2\right) \) at x = 2; the only difference is that we use the reverse mode this time. Let’s explain this example line by line. First, we still need to build an initial operator with make_reverse.

let make_reverse p i =   let adjoint _cp _ca t = t in   let register t = t in   let label = "Noop", [] in   DR (p, ref (zero p), (adjoint, register, label), ref 0, i, ref 0)

The make_reverse function constructs a value of type DR with a given primal value. The rest of its fields are all set to zero. It does two things: first, it wraps input x into a value of type t for Algodiff to process; second, together with tag (), it attaches a unique tag to the input so that differentiation can be nested. Next, calling f x' constructs the computation graph of f, capturing the primal values and the knowledge about how to calculate adjoints in the DR value y.

Next, reverse_prop propagates the error back to the inputs:

let reverse_prop v x =     reverse_reset x;     reverse_push v x

It consists of two steps: first, reset all values in this graph to initial status (reverse_reset); second, perform backward propagation (reverse_push). Both follow a recursive process.

  let reverse_reset x =     let rec reset xs =       match xs with       | []     -> ()       | x :: t ->         (match x with         | DR (_cp, aa, (_, register, _), af, _ai, tracker) ->           aa := reset_zero !aa;           af := !af + 1;           tracker := succ !tracker;           if !af = 1 && !tracker = 1 then reset (register t) else reset t         | _ -> reset t)     in     reset [ x ]

The next function, reverse_push, is the core engine that drives the backward propagation process. Its core idea is simple. It maintains a stack t of (adjoint value, AD value) pairs. At each iteration, the push function takes one pair off the top of the stack. The adjoint value v is added to the adjoint accumulator aa in the DR type node x. The node also specifies an adjoint function that knows how to calculate the adjoint values of its parents, in the form of one or more (adjoint value, AD value) pairs. This process starts with only one pair, which is the output DR type value of the whole computation. It finishes when the stack t is empty.

let reverse_push =     let rec push xs =       match xs with       | []          -> ()       | (v, x) :: t ->         (match x with         | DR (cp, aa, (adjoint, _, _), af, _ai, tracker) ->           aa := reverse_add !aa v;           (af := Stdlib.(!af - 1));           if !af = 0 && !tracker = 1           then push (adjoint cp aa t)           else (             tracker := pred !tracker;             push t)         | _ -> push t)     in     fun v x -> push [ v, x ]

After this step, the gradient of f is stored in the adjoint value of x', and we can retrieve the value using the adjval function.
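The stack-based backward pass can be illustrated with a much smaller model. The sketch below is a simplification and not Owl's algorithm: each node stores the local partial derivatives with respect to its parents, and every adjoint contribution is pushed onward immediately instead of being held back by fan-out counters. The result is the same, but shared nodes may be visited more than once. The names node, leaf, sin_r, and mul_r are hypothetical.

```ocaml
(* A node keeps its primal value, an adjoint accumulator, and for each
   parent the local partial derivative of this node w.r.t. that parent. *)
type node = {
  p : float;
  adj : float ref;
  parents : (float * node) list;
}

let leaf p = { p; adj = ref 0.; parents = [] }
let sin_r a = { p = sin a.p; adj = ref 0.; parents = [ (cos a.p, a) ] }
let mul_r a b = { p = a.p *. b.p; adj = ref 0.; parents = [ (b.p, a); (a.p, b) ] }

(* Backward pass driven by a stack of (adjoint contribution, node) pairs. *)
let reverse_push v y =
  let rec push = function
    | [] -> ()
    | (v, x) :: rest ->
      (* accumulate the incoming contribution *)
      x.adj := !(x.adj) +. v;
      (* hand a scaled contribution to every parent *)
      let contribs = List.map (fun (w, parent) -> (v *. w, parent)) x.parents in
      push (contribs @ rest)
  in
  push [ (v, y) ]

(* Derivative of sin (x * x) at x = 2, which is 4 * cos 4. *)
let grad_at_2 =
  let x = leaf 2. in
  let y = sin_r (mul_r x x) in
  reverse_push 1. y;
  !(x.adj)

let () = Printf.printf "adjoint of x: %.6f\n" grad_at_2
```

Since x feeds into the multiplication twice, it receives two contributions of \( 2\cos 4 \) each, which add up in its accumulator.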

3.4.2 High-Level APIs

Based on the basic low-level APIs, we are able to build more high-level and easy-to-access differentiation functions. The most commonly used function for differentiating is diff in the AD module. Given a function f that maps one scalar value to another, we can calculate its derivative at point x by diff f x. For example, given the hyperbolic function tanh, we can easily calculate its derivative at position x = 0.1, as follows:

open Algodiff.D let f x = Maths.(tanh x);; let d = diff f (F 0.1);;

Its implementation using the forward mode low-level API is quite simple:

let diff' f x =     if not (is_float x) then       failwith "input must be a scalar";     let x = make_forward x (pack_flt 1.) (tag ()) in     let y = f x in     primal y, tangent y let diff f x = diff' f x |> snd

Next, we can generalize derivatives of scalar functions to gradients of multivariate functions. For a function that maps a vector input to a scalar, the grad function calculates its gradient at a given point. For example, in a three-dimensional space, the gradient at each point on a surface consists of three elements representing the partial derivative along the x, y, and z axes. This vector indicates the direction in which the function has the largest magnitude change. Its implementation uses the standard reverse mode:

let grad' f x =     let x = make_reverse x (tag ()) in     let y = f x in     assert (is_float y);     reverse_reset y;     reverse_push (pack_flt 1.) y;     primal y, x |> adjval let grad f x = grad' f x |> snd

One important application of gradient is in gradient descent, a widely used technique for finding the minimum value of a function. We will discuss it in more detail in Chapter 4.

Just as gradient generalizes derivatives from scalars to vectors, the Jacobian function generalizes gradient from vectors to matrices. In other words, grad is applied to functions mapping vectors to scalars, while jacobian is applied to functions that map vectors to vectors. If we assume the function f takes an input vector of length n and produces an output vector of length m, then the Jacobian is defined as

$$ \textbf{J}(y)=\left[\begin{array}{cccc}\frac{\partial {y}_0}{\partial {x}_0}& \frac{\partial {y}_0}{\partial {x}_1}& \dots & \frac{\partial {y}_0}{\partial {x}_{n-1}}\\ {}\frac{\partial {y}_1}{\partial {x}_0}& \frac{\partial {y}_1}{\partial {x}_1}& \dots & \frac{\partial {y}_1}{\partial {x}_{n-1}}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}\frac{\partial {y}_{m-1}}{\partial {x}_0}& \frac{\partial {y}_{m-1}}{\partial {x}_1}& \dots & \frac{\partial {y}_{m-1}}{\partial {x}_{n-1}}\end{array}\right] $$

The intuition behind the Jacobian is similar to that of the gradient. At a particular point in the domain of the target function, the Jacobian shows how the output vector changes given a small change in the input vector. The code below shows the closely related jacobianv function, which uses the forward mode to compute the product of the Jacobian at x and a vector v:

let jacobianv' f x v =     if shape x <> shape v     then failwith "jacobianv': vector not the same dimension as input";     let x = make_forward x v (tag ()) in     let y = f x in     primal y, tangent y let jacobianv f x v = jacobianv' f x v |> snd
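A Jacobian-vector product can be sketched on plain float arrays to see what the forward pass delivers. The code below is an illustration with dual numbers, not Owl's implementation; the dual record and the example function f are assumptions of this sketch. Seeding the tangent with a unit vector extracts one column of the Jacobian.

```ocaml
(* Dual number: primal value and tangent. *)
type dual = { p : float; t : float }

let add a b = { p = a.p +. b.p; t = a.t +. b.t }
let mul a b = { p = a.p *. b.p; t = (a.t *. b.p) +. (a.p *. b.t) }

(* Jacobian-vector product J(x) v, computed in a single forward pass. *)
let jacobianv f x v =
  if Array.length x <> Array.length v then
    invalid_arg "jacobianv: vector not the same dimension as input";
  let xd = Array.mapi (fun i xi -> { p = xi; t = v.(i) }) x in
  Array.map (fun y -> y.t) (f xd)

(* Example: f (x0, x1) = (x0 * x1, x0 + x1), so J = [[x1, x0], [1, 1]]. *)
let f x = [| mul x.(0) x.(1); add x.(0) x.(1) |]

(* First and second columns of J at (2, 3). *)
let col0 = jacobianv f [| 2.; 3. |] [| 1.; 0. |]
let col1 = jacobianv f [| 2.; 3. |] [| 0.; 1. |]
```

At the point (2, 3), col0 is [| 3.; 1. |] and col1 is [| 2.; 1. |], so n forward passes are enough to assemble a full Jacobian with n columns.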

The advanced APIs support convenient composition and can be used to build more complex ones. For example, the second-order derivative of function f can be implemented as g = f |> diff |> diff. Another example is the hessian API. Given a multivariate function that maps n input variables to a scalar, this function calculates its second-order derivatives as a matrix. Its implementation is based on Jacobian:

let hessian f x = (f |> grad |> jacobian) x

In most applications, we use these high-level APIs to support more advanced applications, such as optimization, regression, neural networks, etc. One good example is Newton’s method for finding the minimum value of a function. Rather than moving only in the direction of the gradient, Newton’s method combines the gradient with the second-order derivatives of a function, \( \frac{\nabla f\left({x}_n\right)}{\nabla^2f\left({x}_n\right)} \). Starting from a random position, it iterates until convergence according to Eq. 3.9, where H is the Hessian of f at \( {x}_n \) and α is the step size:

$$ {x}_{n+1}={x}_n-\alpha {\textbf{H}}^{-1}\nabla f\left({x}_n\right) $$

Its implementation in OCaml is shown as follows:

open Algodiff.D let rec newton ?(eta=F 0.01) ?(eps=1e-6) f x =   let g = grad f x in   let h = hessian f x in   if (Maths.l2norm' g |> unpack_flt) < eps then x   else newton ~eta ~eps f Maths.(x - eta * g *@ (inv h))

As an example, we can apply this method to a two-dimensional trigonometric function, starting from a random initial point, to find a local minimum. Note that the function f passed to newton takes a vector as input and outputs a scalar.

let _ =   let f x = Maths.(cos x |> sum') in   newton f (Mat.uniform 1 2)
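The same iteration can be followed by hand in one dimension, where the Hessian reduces to the second derivative. The sketch below is a simplified analogue that does not use AD: the first and second derivatives are supplied explicitly through the hypothetical arguments df and d2f, and alpha plays the role of the step size.

```ocaml
(* One-dimensional Newton iteration: x <- x - alpha * f'(x) / f''(x).
   It stops once the magnitude of the derivative falls below eps. *)
let rec newton ?(alpha = 1.0) ?(eps = 1e-6) ~df ~d2f x =
  let g = df x in
  if abs_float g < eps then x
  else newton ~alpha ~eps ~df ~d2f (x -. (alpha *. g /. d2f x))

(* Minimize f(x) = cos x, with f'(x) = -sin x and f''(x) = -cos x,
   starting close to the local minimum at pi. *)
let x_min = newton ~df:(fun x -> -.sin x) ~d2f:(fun x -> -.cos x) 3.0

let () = Printf.printf "local minimum near x = %.6f\n" x_min
```

Starting from 3.0, the iterates approach π, where cos attains a local minimum. As with the multivariate version, the method only finds a stationary point near the starting position.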

Besides what we have mentioned, Owl also implements other high-level differentiation functions, such as laplacian, which calculates the Laplacian operator \( {\nabla}^2f \), that is, the trace of the Hessian matrix. In essence, they are all built upon a concise set of low-level APIs or other high-level APIs. We will see more applications of the AD module in later chapters.

3.5 More Implementation Details

Besides the main structure we have mentioned so far, there are some other details that should be mentioned to build an industry-grade AD module. We introduce them in this section.

3.5.1 Perturbation Confusion and Tag

We have explained some of the fields in the DR type. But one of them is not covered yet: the tag of type int, which is used to solve a particular problem when calculating higher-order derivatives with nested forward and backward modes. This problem is referred to as perturbation confusion. An AD engine must handle this problem in order to function properly. Here, we only scratch the surface of it. Let’s look at an example. Suppose we want to compute the derivative of

$$ f(x)=x\frac{d\left(x+y\right)}{dy} $$

that is, a function that contains another derivative function. It initially seems straightforward, and we don’t even need a computer’s help: as \( \frac{d\left(x+y\right)}{dy}=1 \), we have \( f(x)=x \) and thus \( {f}^{\prime }(x)=1 \). Unfortunately, applying a simple implementation without tags leads to a wrong answer.

# let diff f x =
    match x with
    | DF (_, _)    ->
      f x |> tangent
    | DR (_, _, _) ->
      let r = f x in
      reverse_push [(1., r)];
      !(adjoint x);;
val diff : (t -> t) -> t -> float = <fun>
# let f x =
    let g = diff (fun y -> add_ad x y) in
    mul_ad x (make_forward (g (make_forward 2. 1.)) 1.);;
val f : t -> t = <fun>
# diff f (make_forward 2. 1.);;
- : float = 4.

The result is 4 at the point (2, 2), but as calculated previously, the result should be 1 at any point. What has gone wrong? The answer is a bit tricky. Note that x=DF(2,1). The tangent value equals 1, which means that \( \frac{dx}{dx}=1 \). Now if we continue to use this same x value in function g, whose variable is y, the same x=DF(2,1) can be incorrectly interpreted by the AD engine as \( \frac{dx}{dy}=1 \). Therefore, when used within function g, x should actually be treated as DF(2,0). That’s where tagging comes to help. It solves the nested derivative problem by distinguishing derivative calculations and their associated parameters with a unique tag for each usage of the derivative operator.
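To see tagging at work, here is a minimal forward-mode sketch with tagged, nestable dual numbers. It is not Owl's implementation. In particular, the rule used when two tags differ (the value with the smaller tag is treated as a constant with respect to the larger tag) is an assumption of this sketch, and only add and mul are defined.

```ocaml
type t =
  | F of float
  | DF of t * t * int (* primal, tangent, tag *)

(* A fresh tag for every use of the derivative operator. *)
let tag =
  let counter = ref 0 in
  fun () -> incr counter; !counter

(* When tags differ, the value with the smaller tag is treated as a
   constant with respect to the perturbation carrying the larger tag. *)
let rec add a b =
  match a, b with
  | F x, F y -> F (x +. y)
  | DF (ap, at, ai), DF (bp, bt, bi) when ai = bi -> DF (add ap bp, add at bt, ai)
  | DF (ap, at, ai), DF (_, _, bi) when ai > bi -> DF (add ap b, at, ai)
  | _, DF (bp, bt, bi) -> DF (add a bp, bt, bi)
  | DF (ap, at, ai), F _ -> DF (add ap b, at, ai)

let rec mul a b =
  match a, b with
  | F x, F y -> F (x *. y)
  | DF (ap, at, ai), DF (bp, bt, bi) when ai = bi ->
    DF (mul ap bp, add (mul at bp) (mul ap bt), ai)
  | DF (ap, at, ai), DF (_, _, bi) when ai > bi -> DF (mul ap b, mul at b, ai)
  | _, DF (bp, bt, bi) -> DF (mul a bp, mul a bt, bi)
  | DF (ap, at, ai), F _ -> DF (mul ap b, mul at b, ai)

(* Only a tangent carrying this call's own tag counts as the derivative. *)
let diff f x =
  let i = tag () in
  match f (DF (x, F 1., i)) with
  | DF (_, t, j) when j = i -> t
  | _ -> F 0.

(* f(x) = x * d(x + y)/dy, evaluated with the inner derivative at y = 2. *)
let f x = mul x (diff (fun y -> add x y) (F 2.))
let result = diff f (F 2.)
```

Inside the inner diff, x carries the outer tag, so its tangent is not mixed into the derivative with respect to y. The inner derivative is therefore 1, and result evaluates to F 1. as expected.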

3.5.2 Lazy Evaluation

We have seen how separating building template and operation definitions makes it convenient to add new operations, simplifying code and improving productivity. But it comes with a price: efficiency. Imagine a large calculation that contains thousands of operations, with one operation occurring many times. Such situations are actually quite common when using AD with neural networks where large computation graphs are created that use functions such as add and mul many hundreds of times. With the Builder approach described earlier, the operation will be recreated every time it is used, which is rather inefficient. Fortunately, we can simply use OCaml’s lazy evaluation mechanism to perform caching.

val lazy: 'a -> 'a lazy_t module Lazy : sig   type 'a t = 'a lazy_t   val force : 'a t -> 'a end

OCaml provides a built-in keyword lazy that behaves like a function of type 'a -> 'a lazy_t: it accepts an expression of type 'a and returns a value of type 'a lazy_t, where the computation of the value of type 'a has been delayed. This lazy expression won’t be evaluated until it is forced by Lazy.force. The first time it is forced, the expression is evaluated and the result is cached. Subsequent applications of Lazy.force simply return the cached result without further reevaluation. Here is an example of lazy evaluation in OCaml:

# let x = Printf.printf "hello world!"; 42
hello world!
val x : int = 42
# let lazy_x = lazy (Printf.printf "hello world!"; 42)
val lazy_x : int lazy_t = <lazy>
# let _ = Stdlib.Lazy.force lazy_x
hello world!
- : int = 42
# let _ = Stdlib.Lazy.force lazy_x
- : int = 42
# let _ = Stdlib.Lazy.force lazy_x
- : int = 42

In this example, we can see that building lazy_x does not evaluate the content, which is delayed until the first Lazy.force. After that, every time force is called, only the cached value is returned; the expression itself, including the printf call, is not evaluated again. Now come back to the AD module in Owl. Imagine that we need to add support for the sin operation. The definition of sin remains the same:

open Algodiff.D module Sin = struct   let label = "sin"   let ff_f a = F A.Scalar.(sin a)   let ff_arr a = Arr A.(sin a)   let df _cp ap at = Maths.(at * cos ap)   let dr a _cp ca = Maths.(!ca * cos (primal a)) end

However, we can instead use lazy evaluation to actually build the implementation and benefit from the efficiency gain of the caching it provides.

let _sin_ad = lazy (Builder.build_siso (module Sin : Builder.Siso));;
let new_sin_ad = Lazy.force _sin_ad;;

In this way, regardless of how many times this sin function is called in a massive computation graph, the Builder.build_siso process is only evaluated once.
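The caching effect can be observed with a counter. In the following sketch, build_op is a hypothetical stand-in for an expensive builder such as build_siso; it only counts how often it runs.

```ocaml
let build_count = ref 0

(* Hypothetical stand-in for an expensive builder. *)
let build_op f =
  incr build_count;
  fun x -> f x

(* The operator is built on first use and cached afterward. *)
let lazy_sin = lazy (build_op sin)
let sin_op x = (Lazy.force lazy_sin) x

(* Use the operator several times. *)
let total = List.fold_left (fun acc x -> acc +. sin_op x) 0. [ 0.1; 0.2; 0.3 ]

let () = Printf.printf "builder ran %d time(s)\n" !build_count
```

Although sin_op is applied three times, the builder runs only once.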

3.5.3 Extending AD

A significant benefit of the module design described earlier is that it can be easily extended by providing modules representing new functions. For example, suppose that the AD system did not support the sine function, \( \sin x \), whose derivative is \( {\sin}^{\prime }x=\cos x \). Including this function is a simple matter of defining the necessary functions for calculating primal, tangent, and adjoint values in a module and applying the relevant function from the Builder module, in this case build_siso for building “single input, single output” functions.

open Algodiff.D module Sin = struct   let label = "sin"   let ff_f a = F A.Scalar.(sin a)   let ff_arr a = Arr A.(sin a)   let df _cp ap at = Maths.(at * cos ap)   let dr a _cp ca = Maths.(!ca * cos (primal a)) end let new_sin_ad = Builder.build_siso (module Sin : Builder.Siso)

We can directly use this new operator as if it is a native operation in the AD module. For example:

# let f x1 x2 =
    let x1 = F x1 in
    let x2 = F x2 in
    Maths.(div (cos x1) (new_sin_ad x2));;
val f : float -> float -> t = <fun>

3.5.4 Graph Utility

Though not core functions, various utility functions provide convenience to users, for example, tools to visualize the computation graph built up by AD. They come in handy when we are trying to debug or understand how AD works. The core of the visualization function is a recursive traverse routine:

let _traverse_trace x =     let nodes = Hashtbl.create 512 in     let index = ref 0 in     (* local function to traverse the nodes *)     let rec push tlist =       match tlist with       | []       -> ()       | hd :: tl ->         if Hashtbl.mem nodes hd = false         then (           let op, prev =             match hd with             | DR (_ap, _aa, (_, _, label), _af, _ai, _) -> label             | F _a -> Printf.sprintf "Const", []             | Arr _a -> Printf.sprintf "Const", []             | DF (_, _, _) -> Printf.sprintf "DF", []           in           (* check if the node has been visited before *)           Hashtbl.add nodes hd (!index, op, prev);           index := !index + 1;           push (prev @ tl))         else push tl     in     (* iterate the graph then return the hash table *)     push x;     nodes

The _traverse_trace and its related functions are used to convert the computation graph generated in backward mode into human-readable format. It initializes variables for tracking nodes and indices, then iterates the graph and puts required information into a hash table. With some extra code, the parsed information can be displayed on a terminal or be converted into other formats that are suitable for visualization, such as the dot format by Graphviz.
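As an illustration of the last step, the sketch below walks a small graph and emits Graphviz dot text. It is not the code used in Owl: the node record with an explicit integer id and the function to_dot are assumptions made to keep the example self-contained.

```ocaml
(* A reduced graph node: an integer id, a label, and its input nodes. *)
type node = { id : int; label : string; prev : node list }

(* Traverse the graph from the output node and emit dot text. *)
let to_dot root =
  let visited = Hashtbl.create 16 in
  let buf = Buffer.create 256 in
  Buffer.add_string buf "digraph AD {\n";
  let rec push = function
    | [] -> ()
    | hd :: tl ->
      if Hashtbl.mem visited hd.id then push tl
      else begin
        (* first visit: declare the node and the edges from its inputs *)
        Hashtbl.add visited hd.id ();
        Buffer.add_string buf
          (Printf.sprintf "  n%d [label=\"%s\"];\n" hd.id hd.label);
        List.iter
          (fun p ->
            Buffer.add_string buf (Printf.sprintf "  n%d -> n%d;\n" p.id hd.id))
          hd.prev;
        push (hd.prev @ tl)
      end
  in
  push [ root ];
  Buffer.add_string buf "}\n";
  Buffer.contents buf

(* The graph of sin (x * x). *)
let x = { id = 0; label = "x"; prev = [] }
let sq = { id = 1; label = "mul"; prev = [ x; x ] }
let y = { id = 2; label = "sin"; prev = [ sq ] }
let dot = to_dot y

let () = print_string dot
```

The printed text can be saved to a file and rendered with the Graphviz dot tool.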

3.6 How AD Is Built upon Ndarray

We have been saying how the AD does not need to deal with the details of computation implementation and thus can focus on the logic of differentiation. In previous examples, we assume the A module to be any Ndarray module. In the final section of this chapter, we will explain how the AD module is built upon the Ndarray modules. We hope to illustrate the power of the functor system in promoting a modular style system design.

First, the Ndarray module used in the AD module is not purely the Ndarray introduced in Chapter 2, but also contains several other modules, including the scalar functions and the ones that are specific to matrices and linear algebra. Together, they are called the Owl_algodiff_primal_ops module. Based on the precision of the modules included, it is also divided into the S and D submodules. For example, here are the components of Owl_algodiff_primal_ops.S:

module S = struct   include Owl_dense_ndarray.S   module Scalar = Owl_maths   module Mat = struct     let eye  = Owl_dense_matrix_s.eye     let tril = Owl_dense_matrix_s.tril     let triu = Owl_dense_matrix_s.triu     ...   end   module Linalg = struct     include Owl_linalg_s     let qr a =       let q, r, _ = qr a in       q, r       ...   end end

By replacing the single-precision modules used in it, such as Owl_dense_ndarray.S and Owl_dense_matrix_s, with their double-precision counterparts, we can get the Owl_algodiff_primal_ops.D module. Moreover, by replacing them with the base Ndarray modules, such as Owl_base_dense_ndarray.S, we can acquire AD modules whose computation is implemented in pure OCaml. Actually, the implementation is not limited to these types. The interface of the Owl_algodiff_primal_ops module is specified in the Owl_types_ndarray_algodiff.Sig module. As long as a module implements all the required functions and modules, it can be plugged into AD. For example, we can utilize the computation graph module in Owl and build a symbolic Ndarray module. An AD module built on this module can provide powerful symbolic computation functionality. It means that the execution of both forward and reverse differentiation modes can benefit from various optimization opportunities, such as graph and memory optimizations. We will explain the computation graph module in Chapter 6.

To build an AD module, we use code similar to the following:

module S = Owl_algodiff_generic.Make (Owl_algodiff_primal_ops.S) module D = Owl_algodiff_generic.Make (Owl_algodiff_primal_ops.D)

So next, let’s take a look at the Owl_algodiff_generic.Make functor. It includes all the existing submodules we have introduced so far: the core module, the operators, and the differential API functions such as make_forward and diff, as follows:

module Make (A : Owl_types_ndarray_algodiff.Sig) = struct   module Core = Owl_algodiff_core.Make (A)   include Core   module Ops = Owl_algodiff_ops.Make (Core)   include Ops   let make_forward p t i = DF (p, t, i)   let make_reverse p i = ...   let diff f x = ...   let grad f x = ... end
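The same layering can be demonstrated at a small scale. The sketch below is a heavily reduced analogue, not Owl code: Backend is a hypothetical stand-in for the numerical signature, Float_backend implements it with native floats, and the Make functor contains forward-mode logic that only talks to its parameter A.

```ocaml
(* Hypothetical, heavily reduced stand-in for the numerical backend. *)
module type Backend = sig
  type elt
  val of_float : float -> elt
  val to_float : elt -> float
  val mul : elt -> elt -> elt
  val sin : elt -> elt
  val cos : elt -> elt
end

(* A backend based on native OCaml floats. *)
module Float_backend : Backend with type elt = float = struct
  type elt = float
  let of_float x = x
  let to_float x = x
  let mul = ( *. )
  let sin = Stdlib.sin
  let cos = Stdlib.cos
end

(* The AD logic is written against A, never against a concrete number type. *)
module Make (A : Backend) = struct
  type t = { primal : A.elt; tangent : A.elt }

  let make_forward p t = { primal = p; tangent = t }

  let sin x = { primal = A.sin x.primal; tangent = A.mul x.tangent (A.cos x.primal) }

  let diff f x =
    let y = f (make_forward x (A.of_float 1.)) in
    y.tangent
end

module AD = Make (Float_backend)

(* Derivative of sin at 0, which is cos 0 = 1. *)
let d = AD.diff AD.sin 0.
```

Swapping Float_backend for another module that satisfies Backend would change the numerical engine without touching the differentiation logic.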

These components all rely on the fundamental computation module A. The Core module itself is built using a functor, with the ndarray module as the parameter. Its interface is specified in Owl_algodiff_core_sig.ml, as follows. It includes the basic type definitions and the operations that can be applied to them.

module type Sig = sig   module A : Owl_types_ndarray_algodiff.Sig   include Owl_algodiff_types_sig.Sig     with type elt := A.elt and type arr := A.arr   val primal  : t -> t   val tangent : t -> t   val adjref  : t -> t ref   ... end

Next, the operators such as sin are built using the Core module as a parameter. As we have explained, first the Builder module works as a factory that assembles various operators by providing different templates such as siso, including a type definition of the template and the function to build operators.

module Make (Core : Owl_algodiff_core_sig.Sig) = struct   open Core   module type Siso = sig     val label : string     val ff_f : A.elt -> t     val ff_arr : A.arr -> t     val df : t -> t -> t -> t     val dr : t -> t -> t ref -> t   end   let build_siso =     ...

Then the operator module, built on Core and Builder, contains all the operators constructed with the builder functions. They are categorized into different modules such as Maths and Linalg.

module Make (Core : Owl_algodiff_core_sig.Sig) = struct   open Core   module Builder = Owl_algodiff_ops_builder.Make (Core)   open Builder   module Maths = struct     let cos = (build_siso       (module struct         let label = "cos"         let ff_f a = F A.Scalar.(cos a)         let ff_arr a = Arr A.(cos a)         let df _cp ap at = neg (at * sin ap)         let dr a _cp ca = !ca * neg (sin a)       end : Siso))     and sin = (build_siso ...)     ...   end   module Linalg = struct   ...   end   module NN = struct   ...   end   ... end

3.7 Summary

In this chapter, we discussed the design of one of the core modules in Owl: the algorithmic differentiation module. We started from its basic theory and the differences among the three ways of differentiating. Then we presented the overall architecture of the AD module in Owl. We explained several parts in detail in the following sections: the definition of types in this system, the operators, and the APIs built on existing mechanisms. We also discussed more subtle issues that require attention when building an industry-level AD engine, such as avoiding perturbation confusion, using lazy evaluation to improve performance, and graph visualization. Finally, we explained how the AD system is built upon the Ndarray module in Owl.