Abstract
Differentiation is key to numerous scientific applications including maximizing or minimizing functions, solving systems of ODEs, physical simulation, etc. Of existing methods, algorithmic differentiation, or AD, is a computer-friendly technique for performing differentiation that is both efficient and accurate. AD is a central component of the architecture design of Owl. In this chapter, we will show, with hands-on examples, how the AD engine is designed and implemented in Owl. AD will be used in some of the other chapters to show its application in optimization and machine learning.
3.1 Introduction
Assume an object moves a distance Δs in a time Δt. The average velocity of this object during this period is the ratio of Δs to Δt. As both values get smaller and smaller, we obtain the instantaneous velocity:
\[ v = \lim_{\Delta t \to 0} \frac{\Delta s}{\Delta t} = \frac{ds}{dt} \]
The term \( \frac{ds}{dt} \) is referred to as “the derivative of s with respect to t.”
Differentiation is the process of finding a derivative in mathematics. It studies the functional relationship between variables, that is, how much one variable changes when the value of another variable changes. Differentiation has many important applications, for example, finding minimum and maximum values of a function, finding the rate of change of a quantity, computing linear approximations to functions, and solving systems of differential equations. Because it is central to these areas of mathematics, it is widely used in other disciplines as well. One example is calculating marginal cost and revenue in economics.
In computer science, differentiation also plays a key role. Machine learning techniques, such as deep neural networks, have been gaining more and more momentum. The most important step in training a deep neural network module is called “backpropagation,” which in essence is calculating the gradient of a very complicated function.
3.1.1 Three Ways of Differentiating
If we are to support differentiation at a scale as large as a deep neural network, manual calculation based on the chain rule in Calculus 101 is far from being enough. To do that, we need the power of automation, which is what a computer is good at. Currently, there are three ways to calculate differentiation: numerical differentiation, symbolic differentiation, and algorithmic differentiation.
The first method is numerical differentiation. Derived directly from the definition, numerical differentiation uses a small step δ to compute an approximate value toward the limit, as shown in Eq. 3.2.
\[ f^{\prime}(x) = \lim_{\delta \to 0} \frac{f(x+\delta) - f(x)}{\delta} \tag{3.2} \]
By treating the function f as a black box, this method is straightforward to implement as long as f can be evaluated. However, this method is unfortunately subject to multiple types of errors, such as the roundoff error, which is caused by representing numbers with only finite precision during the numerical computation. The roundoff error can be so large that the computer treats f(x + δ) and f(x) as identical.
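The roundoff problem is easy to reproduce. The following is a minimal sketch in plain OCaml (no Owl required); the helper name numdiff and the step sizes are chosen here for illustration only. It applies the forward-difference formula to sin at x = 1, once with a moderate step and once with a step so small that x + δ rounds back to x.

```ocaml
(* Forward-difference approximation of f'(x) with step [delta]. *)
let numdiff ?(delta = 1e-6) f x = (f (x +. delta) -. f x) /. delta

(* A moderate step gives a usable approximation of cos 1. *)
let approx = numdiff sin 1.0

(* With a tiny step, 1. +. 1e-20 rounds back to 1., so the difference is lost. *)
let collapsed = numdiff ~delta:1e-20 sin 1.0

let () =
  Printf.printf "approx = %.8f, exact = %.8f, collapsed = %.8f\n"
    approx (cos 1.0) collapsed
```

Choosing δ is therefore a trade-off: a large step increases the truncation error of the approximation, while a small step amplifies the roundoff error.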
The second is symbolic differentiation. By manipulating the underlying mathematical expressions, symbolic differentiation obtains analytical results without numerical approximation, using mathematical derivative rules. For example, consider the function f(x_{0}, x_{1}, x_{2}) = x_{0} ∗ x_{1} ∗ x_{2}. Computing ∇f symbolically, we end up with
\[ \nabla f = \left(\frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = \left(x_1 x_2,\; x_0 x_2,\; x_0 x_1\right) \]
This process completely eliminates the impact of numerical errors, but the complexity of symbolic manipulation quickly grows as expressions become more complex. Just imagine computing the derivative of a seemingly simple calculation \( f(x)=\prod \limits_{i=0}^{n-1}{x}_i \): the result would be terribly long, even if not particularly complex. As a result, symbolic differentiation can easily consume a huge amount of computing resources and become impractically slow in the end. Besides, unlike in numerical differentiation, we must know how a function is constructed to use symbolic differentiation.
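The growth in expression size can be made concrete with a toy symbolic differentiator. This is only a sketch: the expr type and the functions diff, size, eval, and product are hypothetical names introduced here, and the differentiator deliberately performs no simplification, which is the worst case for expression swell.

```ocaml
type expr =
  | Const of float
  | Var of int                 (* Var i stands for x_i *)
  | Add of expr * expr
  | Mul of expr * expr

(* Symbolic derivative with respect to x_i, without any simplification. *)
let rec diff i = function
  | Const _ -> Const 0.
  | Var j -> Const (if i = j then 1. else 0.)
  | Add (a, b) -> Add (diff i a, diff i b)
  | Mul (a, b) -> Add (Mul (diff i a, b), Mul (a, diff i b))

(* Number of nodes in an expression tree. *)
let rec size = function
  | Const _ | Var _ -> 1
  | Add (a, b) | Mul (a, b) -> 1 + size a + size b

(* Evaluate an expression given the values of x_0, x_1, ... in [env]. *)
let rec eval env = function
  | Const c -> c
  | Var i -> env.(i)
  | Add (a, b) -> eval env a +. eval env b
  | Mul (a, b) -> eval env a *. eval env b

(* The product x_0 * x_1 * ... * x_(n-1), for n >= 1. *)
let rec product n =
  if n <= 1 then Var 0 else Mul (product (n - 1), Var (n - 1))

let () =
  List.iter
    (fun n ->
      Printf.printf "n = %2d: size f = %3d, size df/dx0 = %4d\n"
        n (size (product n)) (size (diff 0 (product n))))
    [ 2; 4; 8; 16 ]
```

The derivative is still correct when evaluated, but its tree grows noticeably faster than the original expression; a practical symbolic system has to spend additional effort on simplification to keep this under control.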
Finally, there is algorithmic differentiation (AD). It is a chain rule–based technique for calculating derivatives with respect to input variables of functions defined in a computer program. Algorithmic differentiation is also known as automatic differentiation, though strictly speaking it does not fully automate differentiation and can sometimes lead to inefficient code. In general, AD combines the best of both worlds: on one hand, it efficiently generates exact results and so is highly applicable in many real-world applications; on the other hand, it does not rely on listing all the intermediate results, and its computing process can be efficient. Therefore, it is the mainstream implementation of many numerical computing tools and libraries, such as JuliaDiff in Julia, ad in Python, ADMAT, etc. The rest of this chapter focuses mainly on algorithmic differentiation.
3.1.2 Architecture of Algodiff Module
The Algodiff module plays a pivotal role in the whole Owl library stack. It unifies various fundamental data types in Owl and is the most important functionality that serves the advanced analytics such as regression, neural networks, etc. Its design is elegant thanks to OCaml’s powerful module system.
In this chapter, we assume you are familiar with how differentiation works mathematically, so we can focus on the design and implementation details of the AD module in Owl. But first, let’s take a look at a simple example to see how the AD module is used in Owl. In this example, we simply calculate the first-order and second-order derivatives of the function tanh.
(* requires the Owl library *)
open Owl

module AD = Algodiff.D

let f x = AD.Maths.(tanh x);;
let f1 = AD.diff f;;
let f2 = AD.diff f1;;

let eval_flt h x = AD.(pack_flt x |> h |> unpack_flt);;

let r1 = eval_flt f1 1.
let r2 = eval_flt f2 1.
That’s all it takes. We define the function and apply diff on it to acquire its first-order derivative, on which the diff function can be directly applied to get the second-order derivative. We then evaluate these two derivative functions at the point x = 1.
Figure 3-1 shows the various components in the AD module. Let us inspect how they fit into the example code. First, we cannot directly use the basic data types, such as ndarray and float number. Instead, they need to be first “packed” into a type that the AD module understands. In this example, pack_flt is used to wrap a normal float number into an AD type float. After calculation finishes, assuming we still get an AD type float as output, it should be unpacked into a normal float number using the function unpack_flt. The type system is the most fundamental building block in AD.
Second, to construct a computation in the AD system, we need to use operators, such as tanh used in this example. AD provides a rich set of operators that are generated from the op_builder module. After constructing a graph by stacking the operators, the AD engine starts to let the input data “flow,” or “propagate,” twice in this graph, once forward and once backward. The key function that is in charge of this process is the reverse function. Based on the aforementioned process, we can calculate derivatives of various sorts.
To simplify coding, a series of high-level APIs are constructed. The diff function used in this example is one such API. It applies differentiation on a function that accepts a float number as input and outputs a float number. These high-level APIs lead to extremely elegant code. As shown in this example, we can simply apply the differentiation function on the original tanh function iteratively to get its first-order, second-order, and any other higher-order derivatives. In the next several sections, we will explain these building blocks in detail and how these different pieces are assembled into a powerful AD module.
3.2 Types
We start with type definition. The data type in AD is defined in the owl_algodiff_types.ml file, as shown in the following. Even if you are familiar with the type system in OCaml, it may still seem a bit confusing. The essence of AD type is to express the forward and reverse differentiation modes. So first, we use an example to demonstrate how these two AD modes work.
module Make (A : Owl_types_ndarray_algodiff.Sig) = struct
  type t =
    | F   of A.elt
    | Arr of A.arr
    | DF  of t * t * int
    | DR  of t * t ref * op * int ref * int * int ref

  and adjoint = t -> t ref -> (t * t) list -> (t * t) list
  and register = t list -> t list
  and label = string * t list
  and op = adjoint * register * label
end
3.2.1 Forward and Reverse Modes
In this part, we illustrate the two modes of differentiation with an example, which is based on this simple function:
\[ y(x_0, x_1) = \sin (x_0 x_1) \]
This function takes two inputs, and our aim is to compute \( \nabla y=\left(\frac{\partial y}{\partial {x}_0},\frac{\partial y}{\partial {x}_1}\right) \). Computations can be represented as a graph shown in Figure 3-2. Each node represents either an input/output or intermediate variables generated by the corresponding mathematical function. Each node is named v_{i}. Herein, the input v_{0} = x_{0} and v_{1} = x_{1}. The output y = v_{3}.
Both the forward and reverse modes rely on basic rules to calculate differentiation. On one hand, there are the basic forms of derivative equations, such as \( \frac{d}{dx}\sin (x)=\cos (x) \), \( \frac{d}{dx}u(x)v(x)={u}^{\prime }(x)v(x)+u(x){v}^{\prime }(x) \), etc. On the other is the chain rule. It states that if two functions f and g are composed to create a function F(x) = f(g(x)), then the derivative of F can be calculated as
\[ F^{\prime}(x) = f^{\prime}(g(x))\, g^{\prime}(x) \]
The question is how to implement them in a differentiation system.
3.2.1.1 Forward Mode
Let’s look at the first way, namely, the “forward” mode, to calculate derivatives. We ultimately wish to calculate \( \frac{\partial y}{\partial {x}_0} \) (and \( \frac{\partial y}{\partial {x}_1} \), which can be calculated in a similar way). We begin by calculating some intermediate results that will prove to be useful. Using the labels v_{i} to refer to the intermediate computations, we have \( \frac{\partial {v}_0}{\partial {x}_0}=1 \) and \( \frac{\partial {v}_1}{\partial {x}_0}=0 \) immediately, since v_{0} = x_{0} and v_{1} = x_{1} actually.
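Next, consider \( \frac{\partial {v}_2}{\partial {x}_0} \), which requires us to use the derivative rule on multiplication. It is a bit trickier and requires the use of the chain rule:
\[ \frac{\partial v_2}{\partial x_0} = \frac{\partial (x_0 x_1)}{\partial x_0} = x_1\frac{\partial x_0}{\partial x_0} + x_0\frac{\partial x_1}{\partial x_0} = x_1 \]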
After calculating \( \frac{\partial {v}_2}{\partial {x}_0} \), we proceed to compute the partial derivative of v_{3}, which is the final result \( \frac{\partial y}{\partial {x}_0} \) we are looking for. This process starts with the input variables and ends with the output variables, and that’s where the name “forward differentiation” comes from. We can simplify the notation by letting \( {\dot{v}}_i=\frac{\partial {v}_i}{\partial {x}_0} \). The \( {\dot{v}}_i \) is called the tangent of function v_{i}(x_{0}, x_{1}, …, x_{n}) regarding the input variable x_{0}, and the results of evaluating the function at each intermediate point are called the primal value.
Let’s calculate \( \dot{y} \) when setting x_{0} = 2 and x_{1} = 2. The full forward differentiation calculation process is shown in Table 3-1 where two simultaneous computation processes take place in the two computation columns: the primal just performs computation following the computation graph; the tangent gives the derivative for each intermediate variable with regard to x_{0}.
Two things need to be noted in this calculation process. The first is that in algorithmic differentiation, unlike symbolic differentiation, the computation is performed step by step, instead of after the whole computation is unwrapped into one big formula following the chain rule. Second, in each step, we only need to keep two values: primal and tangent. Besides, each step only needs to have access to its “parents,” using graph theory’s term. For example, to compute v_{2} and \( {\dot{v}}_2 \), we need to know the primal and tangent of v_{0} and v_{1}; to compute that of v_{3}, we need to know the primal and tangent of v_{2}; etc. These observations are key to our implementation.
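These two observations translate almost directly into code. The sketch below is not Owl's implementation: it uses a hypothetical record type dual that stores exactly the primal and the tangent, and it assumes the running example is y = sin(x_0 x_1), which is what the product step for v_2 and the later identity y = v_3 = sin(v_2) suggest.

```ocaml
(* A dual number carries a primal value [p] and its tangent [t]. *)
type dual = { p : float; t : float }

(* Each operator only needs the primal and tangent of its parents. *)
let mul a b = { p = a.p *. b.p; t = (a.t *. b.p) +. (a.p *. b.t) }
let sin_d a = { p = sin a.p; t = a.t *. cos a.p }

(* y = sin (x0 * x1), differentiated with respect to x0. *)
let dy_dx0 x0 x1 =
  let v0 = { p = x0; t = 1. } in   (* seed: dv0/dx0 = 1 *)
  let v1 = { p = x1; t = 0. } in   (* dv1/dx0 = 0 *)
  let v2 = mul v0 v1 in
  let v3 = sin_d v2 in
  v3

let y = dy_dx0 2. 2.
let () = Printf.printf "primal = %.6f, tangent = %.6f\n" y.p y.t
```

To obtain the derivative with respect to x_1 instead, the seeds of v0 and v1 would be swapped and the whole computation run again.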
3.2.1.2 Reverse Mode
Now let’s rethink this problem from the other direction: from outputs to inputs. The problem remains the same, that is, to calculate \( \frac{\partial y}{\partial {x}_0} \). We still follow the same “step-by-step” procedure as in the previous forward mode. The only difference is that this time we calculate it backward. For instance, in our example \( y={v}_3=\sin \left({v}_2\right) \), so if only we knew \( \frac{\partial y}{\partial {v}_2} \), we would move a step closer to our target solution.
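We first observe that \( \frac{\partial y}{\partial {v}_3}=1 \), since y and v_{3} are the same. We then compute \( \frac{\partial y}{\partial {v}_2} \) by applying the chain rule:
\[ \frac{\partial y}{\partial v_2} = \frac{\partial y}{\partial v_3}\,\frac{\partial v_3}{\partial v_2} = \frac{\partial y}{\partial v_3}\cos (v_2) \tag{3.5} \]
We can simplify it by applying a substitution:
\[ \overline{v}_i = \frac{\partial y}{\partial v_i} \]
for the derivative of output variable y with regard to intermediate node v_{i}. \( {\overline{v}}_i \) is called the “adjoint of variable v_{i} with respect to the output variable y.” Using this notation, Eq. 3.5 can be rewritten as
\[ \overline{v}_2 = \overline{v}_3\,\frac{\partial v_3}{\partial v_2} = \overline{v}_3\cos (v_2) \]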
Note the difference between tangent and adjoint. In forward mode, we know \( {\dot{v}}_0 \) and \( {\dot{v}}_1 \) and then calculate \( {\dot{v}}_2 \), \( {\dot{v}}_3 \), … until we get the target. In reverse mode, we start with \( {\overline{v}}_n=1 \) and calculate \( {\overline{v}}_{n-1} \), \( {\overline{v}}_{n-2} \), … until we have our target \( {\overline{v}}_0=\frac{\partial y}{\partial {v}_0}=\frac{\partial y}{\partial {x}_0} \). \( {\dot{v}}_3={\overline{v}}_0 \) in this example, given that we are talking about derivatives with respect to x_{0} when we use \( {\dot{v}}_3 \). As a result, the reverse mode is also called the adjoint mode.
Following this procedure, we can now perform the complete reverse mode differentiation. Note one major difference compared to the forward mode. In Table 3-1, we can compute the primal and tangent in one pass, since computing one of them does not require the other. However, as shown in the previous analysis, computing \( {\overline{v}}_2 \) may require the value of v_{2} and possibly other previous primal values. Therefore, a forward pass^{Footnote 1} is first required, as shown in Table 3-2, to compute the required intermediate values. They are actually identical to those in the Primal Computation column of Table 3-1. We put it here again to stress our point about this stand-alone forward computing pass.
Table 3-3 shows the backward pass in the reverse differentiation process, starting from the very end, and calculates all the way up to the beginning. A short summary: To compute differentiation using reverse mode, we need a forward pass to compute the primals and then a backward pass to compute the adjoints.
The forward and reverse modes are equivalent in the derivatives they compute. So you might wonder, since the forward mode looks more straightforward, why don’t we just stick with it all along? Note that in the reverse pass we obtained \( \frac{\partial y}{\partial {x}_1} \) “for free” while calculating \( \frac{\partial y}{\partial {x}_0} \). In the forward mode, by contrast, to calculate the derivative regarding another input, we have to calculate all the intermediate results again. So here lies one of the most significant strengths of the reverse mode: no matter how many inputs there are, a single reverse pass gives us the derivatives with respect to all the inputs.
This property is extremely useful in neural networks. The computation graph constructed in a neural network tends to be quite complex, often with more than one input. The target of using AD is to find the derivative of the output – probably a scalar value of a loss function – regarding the inputs. Thus, using the reverse mode AD is more efficient.
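The forward and backward passes can also be written out by hand. The following sketch is plain OCaml rather than Owl code, with a hypothetical function name, and it again assumes the running example y = sin(x_0 x_1) implied by y = v_3 = sin(v_2). One backward pass yields both partial derivatives.

```ocaml
(* Reverse mode for y = sin (x0 * x1), written out step by step. *)
let reverse_sin_mul x0 x1 =
  (* forward pass: compute and keep the primal values *)
  let v0 = x0 and v1 = x1 in
  let v2 = v0 *. v1 in
  let v3 = sin v2 in
  (* backward pass: start from the output adjoint and walk back *)
  let v3_bar = 1. in
  let v2_bar = v3_bar *. cos v2 in   (* needs the primal v2 *)
  let v0_bar = v2_bar *. v1 in       (* dv2/dv0 = v1 *)
  let v1_bar = v2_bar *. v0 in       (* dv2/dv1 = v0 *)
  v3, (v0_bar, v1_bar)

let y, (dy_dx0, dy_dx1) = reverse_sin_mul 2. 3.

let () =
  Printf.printf "y = %.6f, dy/dx0 = %.6f, dy/dx1 = %.6f\n" y dy_dx0 dy_dx1
```

Note that the backward pass reuses v2, v1, and v0 from the forward pass, which is why the primal values have to be computed and stored first.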
3.2.2 Data Types
Now that we understand the basic elements in computing a derivative, let’s turn to the data type used in the AD system. It is built upon two basic types: scalar number F and ndarray Arr. They are of type A.elt and A.arr. Here, A represents an interface that mostly resembles that of an ndarray module. It means that their specific types, such as single or double precision, C implementation or base implementation, etc., all depend on this A ndarray module. Therefore, the AD module does not need to deal with all the lower-level details. We will talk about how the AD module interacts with the other modules later in this chapter. For now, it suffices to simply understand them as, for example, single-precision float number and ndarray with single-precision float as elements, so as to better grasp the core ideas in AD.
module Make (A : Owl_types_ndarray_algodiff.Sig) = struct
  type t =
    | F   of A.elt
    | Arr of A.arr
    (* primal, tangent, tag *)
    | DF  of t * t * int
    (* primal, adj accu, op, fanout, tag, tracker *)
    | DR  of t * t ref * op * int ref * int * int ref

  and adjoint = t -> t ref -> (t * t) list -> (t * t) list
  and register = t list -> t list
  and label = string * t list
  and op = adjoint * register * label
end
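To see why parameterizing over A keeps precision and backend details out of the AD code, consider the following reduced sketch. The signature Ndarray_sig and the module Float_backend are hypothetical stand-ins introduced here; the real Owl_types_ndarray_algodiff.Sig is far larger, and only the two basic constructors are modeled.

```ocaml
(* Hypothetical, heavily reduced stand-in for the ndarray signature. *)
module type Ndarray_sig = sig
  type elt
  type arr
  val float_to_elt : float -> elt
  val elt_to_float : elt -> float
end

(* The AD type only mentions A.elt and A.arr, never a concrete precision. *)
module Make (A : Ndarray_sig) = struct
  type t =
    | F of A.elt
    | Arr of A.arr

  let pack_flt x = F (A.float_to_elt x)

  let unpack_flt = function
    | F x -> A.elt_to_float x
    | Arr _ -> failwith "unpack_flt: not a scalar"
end

(* One possible backend: plain OCaml floats and float arrays. *)
module Float_backend = struct
  type elt = float
  type arr = float array
  let float_to_elt x = x
  let elt_to_float x = x
end

module AD = Make (Float_backend)

let x = AD.unpack_flt (AD.pack_flt 1.5)
```

Swapping in a different backend module would change the concrete element type without touching anything inside Make.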
The other two types are compound types, each representing one differentiation mode. The DF type contains three parts, and the most important ones are the first two: primal and tangent. The DR type contains six parts, and the most important ones are the first, primal, and the third, op. op itself consists of three parts: adjoint, register, and label, of which adjoint is the most important component. The DR type also contains an adjoint accumulator (the second parameter), a fanout flag, and a tracker flag. The accumulator is of reference type since it needs to be updated during the propagation process. Both DF and DR types contain a tag of integer type. Later, we will discuss how these extra parts work in an AD engine. To focus on the core idea in AD, for now we introduce the most important elements: primal, tangent, and adjoint.
In essence, the computation graph in AD is constructed by building a list. Each element of this list contains two elements: the partial derivative computation and the original type t data. In the data type, the adjoint is a function. For each t type data, it specifies how to construct this list. Though the derivative computation rule of different operators varies, the adjoint generally falls into several patterns. For example, here is what the adjoint function looks like for an operation/function that takes one input and produces one output, such as sin, exp, etc.
let r a =
  let adjoint cp ca t = (dr (primal a) cp ca, a) :: t in
  let register t = a :: t in
  let label = S.label, [ a ] in
  adjoint, register, label
Here, the r function returns an op type, which consists of the adjoint function, the register function, and the label tuple. First, let’s look at the adjoint function. The first two variables cp and ca will be used in the derivative function dr. We will talk about it later in Section 3.3. For now, we only need to know that the reverse derivative computation dr calculates something; we put it together with the original input operator a into a tuple and add them to the existing list t, which is the third argument. The other two components are supplementary. The register function actually is an adjoint function without really calculating adjoints; it only stacks a list of original operators. The third one, label, puts together a string such as “sin” or “exp” to the input operator.
Next, let’s see another example in an operator that takes multiple inputs, such as add, mul (multiplication), etc. It’s a bit more complex:
let r_d_d a b =
  let adjoint cp ca_ref t =
    let abar, bbar = dr_ab (primal a) (primal b) cp ca_ref in
    (abar, a) :: (bbar, b) :: t
  in
  let register t = a :: b :: t in
  let label = S.label ^ "_d_d", [ a; b ] in
  adjoint, register, label
The difference is that such an operator needs to push two items into the list. So here dr_ab is still a function that calculates derivatives in reverse, and it returns the derivatives on its two parents, denoted by abar and bbar, which are both pushed to the adjoint list. The register and label follow a similar pattern. In fact, for an operator that takes multiple inputs, we should also consider the case in which one of the inputs is just a constant. In that case, only one element should be put into the list:
let r_d_c a b =
  let adjoint cp ca_ref t = (S.dr_a (primal a) b cp ca_ref, a) :: t in
  let register t = a :: t in
  let label = S.label ^ "_d_c", [ a; b ] in
  adjoint, register, label
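How such a list of (adjoint contribution, node) pairs drives the backward pass can be illustrated with a small stand-alone engine. This is a simplified sketch, not Owl's reverse function: it handles scalars only, the adjoint function takes the incoming adjoint as a plain float, and every contribution is propagated separately instead of being scheduled with the fanout counter. All names are hypothetical.

```ocaml
type t =
  | F of float                          (* constant *)
  | DR of float * float ref * adjoint   (* primal, adjoint accumulator, rule *)

(* Given the incoming adjoint, push (contribution, parent) pairs on the list. *)
and adjoint = float -> (float * t) list -> (float * t) list

let primal = function F x -> x | DR (p, _, _) -> p
let adjval = function F _ -> 0. | DR (_, acc, _) -> !acc

(* An input variable has no parents, so it pushes nothing. *)
let var x = DR (x, ref 0., (fun _ca t -> t))

(* Two parents: two items are pushed, as in r_d_d. *)
let mul a b =
  let adjoint ca t = (ca *. primal b, a) :: (ca *. primal a, b) :: t in
  DR (primal a *. primal b, ref 0., adjoint)

(* One parent: one item is pushed. *)
let sin_r a =
  let adjoint ca t = (ca *. cos (primal a), a) :: t in
  DR (sin (primal a), ref 0., adjoint)

(* Pop one (contribution, node) pair at a time until the list is empty. *)
let rec propagate = function
  | [] -> ()
  | (_, F _) :: t -> propagate t
  | (v, DR (_, acc, adjoint)) :: t ->
    acc := !acc +. v;
    propagate (adjoint v t)

let x0 = var 2.
let x1 = var 3.
let y = sin_r (mul x0 x1)
let () = propagate [ (1., y) ]   (* seed the output adjoint with 1 *)
let () = Printf.printf "dy/dx0 = %.6f, dy/dx1 = %.6f\n" (adjval x0) (adjval x1)
```

Propagating each contribution on its own is correct because adjoints add up linearly, but it can revisit shared nodes many times; tracking how many children a node has, as the fanout field does, avoids that repeated work.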
3.2.3 Operations on AD Type
After understanding the data type defined in AD, let’s take a look at what sorts of operations can be applied to them. They are defined in the owl_algodiff_core.ml file. The most notable ones are the “get” functions that retrieve certain information from an AD type data, such as its primal, tangent, and adjoint values. In the following code, primal' is a “deep” function that recursively finds the primal value in float or ndarray format.
let primal = function
  | DF (ap, _, _) -> ap
  | DR (ap, _, _, _, _, _) -> ap
  | ap -> ap

let rec primal' = function
  | DF (ap, _, _) -> primal' ap
  | DR (ap, _, _, _, _, _) -> primal' ap
  | ap -> ap

let tangent = function
  | DF (_, at, _) -> at
  | DR _ -> failwith "error: no tangent for DR"
  | ap -> zero ap

let adjval = function
  | DF _ -> failwith "error: no adjval for DF"
  | DR (_, at, _, _, _, _) -> !at
  | ap -> zero ap
And the zero function resets all elements to the zero status:
let rec zero = function
  | F _ -> F A.(float_to_elt 0.)
  | Arr ap -> Arr A.(zeros (shape ap))
  | DF (ap, _, _) -> ap |> primal' |> zero
  | DR (ap, _, _, _, _, _) -> ap |> primal' |> zero
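The difference between the shallow primal and the deep primal' only shows up on nested values, which arise when a derivative is differentiated again. The sketch below uses a reduced, float-only copy of the type (an assumption made here so the code runs without Owl) to make that visible.

```ocaml
(* Reduced copy of the AD type: scalars and forward-mode values only. *)
type t =
  | F of float
  | DF of t * t * int   (* primal, tangent, tag *)

let primal = function DF (ap, _, _) -> ap | ap -> ap
let rec primal' = function DF (ap, _, _) -> primal' ap | ap -> ap

let rec zero = function
  | F _ -> F 0.
  | DF (ap, _, _) -> ap |> primal' |> zero

let tangent = function DF (_, at, _) -> at | ap -> zero ap

(* A value nested twice, as produced when diff is applied to a derivative. *)
let inner = DF (F 3., F 1., 1)
let outer = DF (inner, F 0., 2)

let shallow = primal outer    (* strips one layer only *)
let deep = primal' outer      (* strips every layer *)
```

Here shallow is still a DF value, whereas deep is the underlying scalar; zero and shape rely on the deep version for exactly this reason.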
Another group of important operations are those that convert the AD type to and from ordinary types such as float and ndarray:
let pack_elt x = F x

let unpack_elt x =
  match primal x with
  | F x -> x
  | _ -> failwith "error: AD.unpack_elt"

let pack_flt x = F A.(float_to_elt x)

let _f x = F A.(float_to_elt x)

let pack_arr x = Arr x

let unpack_arr x =
  match primal x with
  | Arr x -> x
  | _ -> failwith "error: AD.unpack_arr"
There are also operations that provide helpful utilities. One of them is the zero function we have seen; others report type information, such as shape:
let shape x =
  match primal' x with
  | F _ -> [||]
  | Arr ap -> A.shape ap
  | _ -> failwith "error: AD.shape"
3.3 Operators
The graph is constructed with a series of operators that process AD type data and, in doing so, build up a differentiable computation graph. They are divided into submodules: Maths is the most important component, and it contains a full set of mathematical functions to enable constructing various computation graphs; Linalg contains a subset of linear algebra functions; NN contains functions used in neural networks, such as two-dimensional convolution, dropout, etc.; Mat is specifically for matrix operations, such as eye that generates an identity matrix; and Arr provides functions such as shape and numel for ndarrays.
As shown in Figure 3-1, the implementation of an operation can be abstracted into two parts: (a) what its derivative and calculation rules are and (b) how these rules are applied in the AD system. The first part is defined in owl_algodiff_ops.ml, and the latter is in owl_algodiff_ops_builder.ml.
3.3.1 Calculation Rules
Let’s look at some examples from the first to see what these calculation rules are and how they are expressed in OCaml. We can use the sine function as an example. It takes an input and computes its sine value as output. This module specifies four computing rules, each corresponding to one type of AD data. Here, module A is the underlying “normal” ndarray module that implements functions for ndarray and scalar values. It can be single precision or double precision, implemented using OCaml or C. For the F scalar type, ff_f specifies using the sin function from the Scalar submodule of A. If the data is an AD ndarray, ff_arr states that the sine function should be applied to all of its elements by using the A.sin function. Next, if the data is of type DF, the df function is used. As shown in the example in Table 3-1, it computes tangent (at) * derivative of primal (ap). In the case of the sine function, it computes at * cos ap. Finally, dr computes what we have shown in Table 3-3. It computes adjoint (ca) * derivative of primal (a). Therefore, here it computes !ca * cos a. The dereference operator ! is needed because the adjoint value in the DR type is a reference that can be updated.
module struct
  let label = "sin"
  let ff_f a = F A.Scalar.(sin a)
  let ff_arr a = Arr A.(sin a)
  let df _cp ap at = at * cos ap
  let dr a _cp ca = !ca * cos a
end
A similar template can be applied to other operators that take one input and produce one output, such as the square root (sqrt), as shown in the next module. The derivative rule for the square root is \( {\left(\sqrt{x}\right)}^{\prime }=\frac{1}{2\sqrt{x}} \). Since cp already holds the primal of the result, \( \sqrt{x} \), both rules reuse it instead of recomputing the square root.
module struct
  let label = "sqrt"
  let ff_f a = F A.Scalar.(sqrt a)
  let ff_arr a = Arr A.(sqrt a)
  let df cp _ap at = at / (pack_flt 2. * cp)
  let dr _a cp ca = !ca / (pack_flt 2. * cp)
end
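One way to gain confidence in such rules is to compare them with an independent numerical estimate. The sketch below restates the two df rules on plain floats (so it runs without Owl) and checks them against a central difference; the names sin_df, sqrt_df, central, and check are hypothetical helpers introduced for this purpose.

```ocaml
(* Scalar-only versions of the two df rules; cp is the primal of the result. *)
let sin_df _cp ap at = at *. cos ap
let sqrt_df cp _ap at = at /. (2. *. cp)

(* Central difference, used here only as an independent reference. *)
let central f x = (f (x +. 1e-6) -. f (x -. 1e-6)) /. 2e-6

(* Evaluate a rule at x with the tangent seeded to 1 and print both values. *)
let check name f df x =
  let cp = f x in
  let ad = df cp x 1. in
  Printf.printf "%s: rule = %.8f, finite difference = %.8f\n"
    name ad (central f x);
  ad

let d_sin = check "sin" sin sin_df 0.5
let d_sqrt = check "sqrt" sqrt sqrt_df 4.
```

The same kind of check extends to any single-input rule, which makes it a convenient sanity test when adding a new operator.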
However, things get more complicated once an operator needs to deal with more than one input. The problem is that for each of these four computation rules, we need to consider multiple possible cases. Take the divide operation as an example. For a simple primal value computation, we need to consider four cases: both inputs are scalars, both are ndarrays, the first is a scalar and the second an ndarray, and vice versa. These correspond to four rules: ff_aa, ff_bb, ff_ab, and ff_ba. For the forward computation of the tangent of \( \frac{a}{b} \), we also need to consider three cases:

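df_da corresponds to the derivative when the second input is constant:
\[ {\left(\frac{a(x)}{b}\right)}^{\prime} = \frac{a^{\prime}(x)}{b} \]
In code, it is at / bp. Here, at is the tangent of the first input, a^{′}(x), and bp is the primal value of the second input b.

df_db corresponds to the derivative when the first input is constant:
\[ {\left(\frac{a}{b(x)}\right)}^{\prime} = -\frac{a\,b^{\prime}(x)}{b(x)^2} = -\,b^{\prime}(x)\,\frac{a}{b(x)}\,\frac{1}{b(x)} \]
And thus, it can be represented by neg bt * cp / bp. Here, neg is the negation operator, bt is the tangent b^{′}(x), and cp is the primal value of the result, \( \frac{a}{b(x)} \).

df_dab is for the case that both inputs are of nonconstant AD type, that is, DF or DR. It thus calculates
\[ {\left(\frac{a(x)}{b(x)}\right)}^{\prime} = \frac{a^{\prime}(x)\,b(x) - a(x)\,b^{\prime}(x)}{b(x)^2} = \frac{a^{\prime}(x) - b^{\prime}(x)\,\frac{a(x)}{b(x)}}{b(x)} \]
And the corresponding code is (at - (bt * cp)) / bp.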
Expressing the rules for the reverse mode is more straightforward. If both inputs a and b are nonconstant, then the function dr_ab computes \( \overline{y}\frac{\partial\ y}{\partial\ a} \) and \( \overline{y}\frac{\partial\ y}{\partial\ b} \), where \( y=\frac{a}{b} \) and \( \overline{y} \) is the adjoint of the output, held in the reference ca. Thus, dr_ab returns two values; the first is \( \overline{y}/b \) (!ca / b), and the second is \( -\overline{y}\frac{a}{b^2} \) (!ca * (neg a / (b * b))). In the code, _squeeze_broadcast x s is an internal helper function that squeezes array x so that it has shape s. If one of the inputs is constant, then we can just omit the corresponding result, as shown in dr_a and dr_b.
module struct
  let label = "div"
  let ff_aa a b = F A.Scalar.(div a b)
  let ff_ab a b = Arr A.(scalar_div a b)
  let ff_ba a b = Arr A.(div_scalar a b)
  let ff_bb a b = Arr A.(div a b)
  let df_da _cp _ap at bp = at / bp
  let df_db cp _ap bp bt = neg bt * cp / bp
  let df_dab cp _ap at bp bt = (at - (bt * cp)) / bp

  let dr_ab a b _cp ca =
    ( _squeeze_broadcast (!ca / b) (shape a)
    , _squeeze_broadcast (!ca * (neg a / (b * b))) (shape b) )

  let dr_a a b _cp ca = _squeeze_broadcast (!ca / b) (shape a)
  let dr_b a b _cp ca = _squeeze_broadcast (!ca * (neg a / (b * b))) (shape b)
end
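Because the forward and reverse rules describe the same derivative, they can be checked against each other: with the output adjoint seeded to 1, the tangent from df_dab should equal the sum of each returned adjoint times the corresponding input tangent. The following sketch does this on plain floats; the scalar copies of the rules and the chosen input values are assumptions made for illustration.

```ocaml
(* Scalar-only copies of two of the div rules. *)
let df_dab cp _ap at bp bt = (at -. (bt *. cp)) /. bp
let dr_ab a b _cp ca = ca /. b, ca *. (-.a /. (b *. b))

let a, b = 3., 2.
let at, bt = 1., 1.              (* arbitrary tangents of the two inputs *)
let cp = a /. b                  (* primal of the result *)

(* Forward mode: tangent of a / b. *)
let forward = df_dab cp a at b bt

(* Reverse mode: adjoints of a and b with the output adjoint seeded to 1. *)
let abar, bbar = dr_ab a b cp 1.
let from_reverse = (abar *. at) +. (bbar *. bt)

let () = Printf.printf "forward = %g, from reverse = %g\n" forward from_reverse
```

Both routes give the same directional derivative, which is the equivalence of the two modes seen at the level of a single operator.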
A similar example is the operator pow that performs a^{b} calculation. It implements calculation rules that are similar to those of div.
module struct
  let label = "pow"
  let ff_aa a b = F A.Scalar.(pow a b)
  let ff_ab a b = Arr A.(scalar_pow a b)
  let ff_ba a b = Arr A.(pow_scalar a b)
  let ff_bb a b = Arr A.(pow a b)
  let df_da _cp ap at bp = at * (ap ** (bp - pack_flt 1.)) * bp
  let df_db cp ap _bp bt = bt * cp * log ap

  let df_dab cp ap at bp bt =
    ((ap ** (bp - pack_flt 1.)) * (at * bp)) + (cp * bt * log ap)

  let dr_ab a b cp ca =
    ( _squeeze_broadcast (!ca * (a ** (b - pack_flt 1.)) * b) (shape a)
    , _squeeze_broadcast (!ca * cp * log a) (shape b) )

  let dr_a a b _cp ca =
    _squeeze_broadcast (!ca * (a ** (b - pack_flt 1.)) * b) (shape a)

  let dr_b a b cp ca = _squeeze_broadcast (!ca * cp * log a) (shape b)
end
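The rules in this module follow from the two partial derivatives of the power function. As a brief worked derivation, writing c = a^{b} for the primal of the result (cp in the code), \( \dot{a},\dot{b} \) for the input tangents (at, bt), and \( \overline{c} \) for the output adjoint (!ca):

```latex
\begin{aligned}
\frac{\partial}{\partial a}\, a^{b} &= b\, a^{b-1},
&\qquad
\frac{\partial}{\partial b}\, a^{b} &= a^{b} \ln a = c \ln a, \\[4pt]
\dot{c} &= a^{b-1}\, b\, \dot{a} + c\, \dot{b}\, \ln a
&& \text{(df\_dab; df\_da and df\_db keep one term each)}, \\[4pt]
\overline{a} &= \overline{c}\; a^{b-1}\, b,
&\qquad
\overline{b} &= \overline{c}\; c\, \ln a
&& \text{(dr\_ab; dr\_a and dr\_b keep one component each)}.
\end{aligned}
```

As with the square root, the second partial derivative reuses the already computed result c rather than evaluating the power again.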
3.3.2 Generalize Rules into Builder Template
So far, we have talked about the calculation rules, but there is still a question: how to utilize these rules to build an operator of type t that we have described in Section 3.2. To do that, we need to use the power of functors in OCaml. In the AD module in Owl, the operators are categorized according to the number of inputs and outputs, each with its own template. Let’s take the “single-input-single-output” (SISO) operators such as sine as an example. This template takes a module of type Siso as input, as shown in the following. Notice that the calculation rules of the sine function shown in the previous section form exactly such a module.
module type Siso = sig
  val label : string
  val ff_f : A.elt -> t
  val ff_arr : A.arr -> t
  val df : t -> t -> t -> t
  val dr : t -> t -> t ref -> t
end
In the end, we need to build a sin : t -> t operator, which accepts data of AD type t and returns output of type t. This function is what we need:
let op_siso ~ff ~fd ~df ~r a =
  match a with
  | DF (ap, at, ai) ->
    let cp = fd ap in
    DF (cp, df cp ap at, ai)
  | DR (ap, _, _, _, ai, _) ->
    let cp = fd ap in
    DR (cp, ref (zero cp), r a, ref 0, ai, ref 0)
  | ap -> ff ap
These names may seem enigmatic. Here, the fd x function calculates the primal value of x. ff x performs forward computation on the two basic types: scalar and ndarray. The df cp ap at function computes the tangents in forward mode. Finally, the function r computes the op part in the type, which “remembers” how to build up the graph in the form of a list. To put them together, the basic logic of this function goes like this:

If the input is a DF type, produce a new DF type after calculating the primal and tangent in forward mode.

If the input is a DR type, produce a new DR type, with its knowledge about how to compute adjoints and how to build up the list.

Otherwise, it’s the basic type, scalar or ndarray; perform simple forward computation on it.
Note that in the newly constructed DR value, only the primal value and op are filled in; the tag is copied from the input, and the remaining fields, namely, the adjoint accumulator, the fanout, and the tracker, are all set to zero. That is because the computation graph is constructed in the forward pass, and the calculation of adjoints does not happen in this step.
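The forward-mode half of this logic is small enough to run on its own. The sketch below is a simplified, scalar-only reconstruction and not Owl's code: the DR branch is omitted, the tag comparison in add and mul is an assumption about how nested derivatives are kept apart, and the names ad_sin, ad_cos, diff, and unpack are hypothetical. It shows how fd recursing on the primal lets the same operator produce higher-order derivatives.

```ocaml
type t =
  | F of float
  | DF of t * t * int   (* primal, tangent, tag *)

(* Forward-mode half of op_siso: the DR branch is left out of this sketch. *)
let op_siso ~ff ~fd ~df a =
  match a with
  | DF (ap, at, ai) ->
    let cp = fd ap in
    DF (cp, df cp ap at, ai)
  | ap -> ff ap

(* Arithmetic on t; a value with a smaller tag is a constant for a larger one. *)
let rec add a b =
  match a, b with
  | F x, F y -> F (x +. y)
  | DF (ap, at, ai), DF (bp, bt, bi) when ai = bi -> DF (add ap bp, add at bt, ai)
  | DF (ap, at, ai), DF (_, _, bi) when ai > bi -> DF (add ap b, at, ai)
  | DF (ap, at, ai), F _ -> DF (add ap b, at, ai)
  | _, DF (bp, bt, bi) -> DF (add a bp, bt, bi)

let rec mul a b =
  match a, b with
  | F x, F y -> F (x *. y)
  | DF (ap, at, ai), DF (bp, bt, bi) when ai = bi ->
    DF (mul ap bp, add (mul at bp) (mul ap bt), ai)
  | DF (ap, at, ai), DF (_, _, bi) when ai > bi -> DF (mul ap b, mul at b, ai)
  | DF (ap, at, ai), F _ -> DF (mul ap b, mul at b, ai)
  | _, DF (bp, bt, bi) -> DF (mul a bp, mul a bt, bi)

let neg a = mul (F (-1.)) a
let scalar f = function F x -> F (f x) | DF _ -> invalid_arg "scalar"

(* sin and cos refer to each other through their df rules. *)
let rec ad_sin a =
  op_siso ~ff:(scalar sin) ~fd:ad_sin ~df:(fun _cp ap at -> mul at (ad_cos ap)) a

and ad_cos a =
  op_siso ~ff:(scalar cos) ~fd:ad_cos
    ~df:(fun _cp ap at -> neg (mul at (ad_sin ap))) a

(* diff seeds a fresh tag, runs f, and reads back the tangent for that tag. *)
let tag = ref 0

let diff f x =
  incr tag;
  let i = !tag in
  match f (DF (x, F 1., i)) with
  | DF (_, t, j) when j = i -> t
  | _ -> F 0.

let unpack = function F x -> x | DF _ -> invalid_arg "unpack"

let d1 = unpack (diff ad_sin (F 1.))          (* first derivative at 1 *)
let d2 = unpack (diff (diff ad_sin) (F 1.))   (* second derivative at 1 *)
```

In the second call, the inner diff receives a value that is already a DF, so its primal and tangent are themselves DF values; the outer diff then reads the derivative of the derivative from the tangent of that result.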
So the next question is: for the sine function, how can we get the fd, ff, etc.? Luckily, the previous Siso module that specifies the various calculation rules already provides all the ingredients required. Assume we have named this Siso sine module S, then we have the forward computation on the two basic types:
let ff = function
  | F a -> S.ff_f a
  | Arr a -> S.ff_arr a
  | _ -> error_uniop label a
And the r function looks like what we have introduced in Section 3.2, using the dr function from module S to specify how to construct the list.
let r a =
  let adjoint cp ca t = (S.dr (primal a) cp ca, a) :: t in
  let register t = a :: t in
  let label = S.label, [ a ] in
  adjoint, register, label
So now we have the function:
let rec f a =
  let open S in
  (* define ff and r as stated above *)
  let fd a = f a in
  op_siso ~ff ~fd ~df:S.df ~r a
Put them together, and here is the final function that accepts a module and builds an operator:
let build_siso =
  (* define op_siso *)
  fun (module S : Siso) ->
    (* define f *)
    f
To build a sin operator, we use the following code:
let sin =
  build_siso
    (module struct
      let label = "sin"
      let ff_f a = F A.Scalar.(sin a)
      let ff_arr a = Arr A.(sin a)
      let df _cp ap at = at * cos ap
      let dr a _cp ca = !ca * cos a
    end : Siso)
The code is concise, easy to read, and less prone to coding errors. To build another “siso” operator, such as the natural logarithm, we only need to change the rules:
let log =
  build_siso
    (module struct
      let label = "log"
      let ff_f a = F A.Scalar.(log a)
      let ff_arr a = Arr A.(log a)
      let df _cp ap at = at / ap
      let dr a _cp ca = !ca / a
    end : Siso)
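The builder pattern can be tried out without Owl. The following is only a minimal sketch under simplifying assumptions: the type t, the reduced Siso signature (a single ff and df, no ndarray or reverse-mode rules), and the names sin_op and log_op are hypothetical stand-ins, not Owl's definitions.

```ocaml
(* Simplified AD value: a constant or a forward-mode dual number. *)
type t =
  | F of float            (* plain scalar *)
  | DF of float * float   (* primal, tangent *)

(* Hypothetical, reduced counterpart of the Siso signature. *)
module type Siso = sig
  val label : string
  val ff : float -> float
  (* df cp ap at: output primal, input primal, input tangent *)
  val df : float -> float -> float -> float
end

(* Turn a module of rules into an operator on t. *)
let build_siso (module S : Siso) =
  fun a ->
    match a with
    | F x -> F (S.ff x)
    | DF (ap, at) ->
      let cp = S.ff ap in
      DF (cp, S.df cp ap at)

let sin_op =
  build_siso
    (module struct
      let label = "sin"
      let ff = sin
      let df _cp ap at = at *. cos ap
    end : Siso)

let log_op =
  build_siso
    (module struct
      let label = "log"
      let ff = log
      let df _cp ap at = at /. ap
    end : Siso)

let tangent = function
  | F _ -> 0.
  | DF (_, t) -> t

(* d/dx log (sin x) at x = 1, which should equal cos 1 /. sin 1 *)
let d = tangent (log_op (sin_op (DF (1., 1.))))
```

Each new operator costs only a small module of rules; the dispatch on the input type is written once in build_siso.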
Here, we only use the simplest builder template, SISO, as an example. We also include the other templates:

SIPO: Single input and pair outputs, such as the linear algebra operation qr for QR factorization

SITO: Single input and three outputs, such as the SVD factorization

SIAO: Single input and array outputs, such as the split function that splits input ndarray into multiple ones

PISO: Pair inputs and single output, such as add and mul

AISO: Array input and single output, such as concatenate, the inverse operation of split
These templates can become quite complex. For example, in building the add function, to choose from different combinations of possible input types, the builder function can be as complex as
op_piso ~ff ~fd ~df_da ~df_db ~df_dab ~r_d_d ~r_d_c ~r_c_d a b
But the principles are the same.
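To make the role of the separate derivative rules concrete, here is a reduced sketch of a pair-input builder restricted to forward mode on scalars. The type t and this cut-down op_piso (without fd and the r_* reverse-mode arguments) are assumptions made for illustration, not Owl's implementation.

```ocaml
type t =
  | F of float            (* constant *)
  | DF of float * float   (* primal, tangent *)

(* One tangent rule per combination of differentiable inputs. *)
let op_piso ~ff ~df_da ~df_db ~df_dab a b =
  match a, b with
  | F x, F y -> F (ff x y)
  | DF (ap, at), F y ->
    let cp = ff ap y in
    DF (cp, df_da cp ap at y)
  | F x, DF (bp, bt) ->
    let cp = ff x bp in
    DF (cp, df_db cp x bp bt)
  | DF (ap, at), DF (bp, bt) ->
    let cp = ff ap bp in
    DF (cp, df_dab cp ap at bp bt)

(* Multiplication: the product rule, specialized to each case. *)
let mul =
  op_piso
    ~ff:( *. )
    ~df_da:(fun _cp _ap at b -> at *. b)
    ~df_db:(fun _cp a _bp bt -> a *. bt)
    ~df_dab:(fun _cp ap at bp bt -> (at *. bp) +. (ap *. bt))

let x = DF (3., 1.)
let scaled = mul x (F 2.)   (* only the first input carries a tangent *)
let squared = mul x x       (* both inputs carry a tangent *)
```

Here scaled has tangent 2 and squared has tangent 6, matching d(2x)/dx and d(x²)/dx at x = 3.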
3.4 API
The previous section introduces AD operators, the building blocks to construct an AD computation graph. The next thing we need is an “engine” that begins the differentiation process. For that purpose, we first introduce several low-level APIs provided by the AD module and explain how they are used to build up user-friendly advanced APIs such as diff and grad.
3.4.1 Low-Level APIs
We differentiate between the two differentiation modes: forward mode and backward mode. As explained in the previous section, if an input x is of type DF, then by applying operations such as sin x, a computation graph is constructed, and the primal and tangent values are also computed during this process. All we need to do is to retrieve the required value once this process is finished. To start a forward mode differentiation, we need to create a value of type DF as the initial input, using the primal value, the initial tangent (equal to 1), and an integer tag as arguments:
let make_forward p t i = DF (p, t, i)
For example, if we are to calculate the derivative of \( f=\sin \left({x}^2\right) \) at x = 2, we can first create an initial point as
let x = make_forward (pack_flt 2.) (pack_flt 1.) 1
let y = Maths.(pow x (pack_flt 2.) |> sin)
let t = tangent y
That’s it. Once the computation y is constructed, we can directly retrieve the tangent value using the tangent function.
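As a sanity check that does not depend on Owl, the expected tangent can be computed by hand: by the chain rule, the derivative of sin(x²) is 2x cos(x²), which is 4 cos 4 ≈ -2.6146 at x = 2. The sketch below compares this closed form with a central finite difference; the helper central_diff and its step size are choices made here for illustration.

```ocaml
(* f(x) = sin (x^2), the function differentiated in the text. *)
let f x = sin (x *. x)

(* Closed-form derivative by the chain rule: f'(x) = 2x cos (x^2). *)
let f' x = 2. *. x *. cos (x *. x)

(* Central finite difference, used only as an independent check. *)
let central_diff ?(h = 1e-5) g x = (g (x +. h) -. g (x -. h)) /. (2. *. h)

let exact = f' 2.
let approx = central_diff f 2.
```

The two values agree to several digits, and exact is the number that the tangent retrieved from the forward pass is expected to match.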
The backward mode is a bit more complex. Remember that it consists of two passes: one forward and one backward. From the previous section, we know that once the graph is constructed, the primal data are calculated, but the adjoint values are all set to zero. Therefore, we need some extra mechanism to pump the computation flow backward to calculate adjoint values. Here is an example that uses the low-level APIs to compute derivatives in the reverse mode:
open AD

let f x = Maths.(pow x (pack_flt 2.) |> sin)
let x = pack_flt 2.
let x' = make_reverse x (tag ())
let y = f x'
let _ = reverse_prop (F 1.) y
let y' = adjval x'
The problem to solve is still the same: calculate the derivative of \( f=\sin \left({x}^2\right) \) at x = 2; the only difference is that we use the reverse mode this time. Let’s explain this example line by line. First, we still need to build an initial operator with make_reverse.
let make_reverse p i =
  let adjoint _cp _ca t = t in
  let register t = t in
  let label = "Noop", [] in
  DR (p, ref (zero p), (adjoint, register, label), ref 0, i, ref 0)
The make_reverse function constructs a DR type data with a given primal value. The rest of its fields are all set to zero. It does two things: first, it wraps input x into a value of type t for Algodiff to process; second, it generates a unique tag for the input so that input numbers can be nested. Next, calling f x' constructs the computation graph of f, capturing the primal values and knowledge about how to calculate adjoints all in the DR type data y.
Next, reverse_prop propagates the error back to the inputs:
let reverse_prop v x = reverse_reset x; reverse_push v x
It consists of two steps: first, reset all values in this graph to initial status (reverse_reset); second, perform backward propagation (reverse_push). Both follow a recursive process.
let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t ->
      (match x with
      | DR (_cp, aa, (_, register, _), af, _ai, tracker) ->
        aa := reset_zero !aa;
        af := !af + 1;
        tracker := succ !tracker;
        if !af = 1 && !tracker = 1 then reset (register t) else reset t
      | _ -> reset t)
  in
  reset [ x ]
The next function is reverse_push that is the core engine that drives the backward propagation process. Its core idea is simple. It maintains a stack t of (adjoint value, AD value) pairs. At each iteration, the push function takes one pair out of the head of stack. The adjoint value v is added to the adjoint accumulator aa in the DR type node x. The node also specifies an adjoint function that knows how to calculate adjoint values of its parents, in the form of one or more (adjoint value, AD value) pairs. This process starts with only one pair, which is the output DR type value of a whole computation. It finishes when stack t is empty.
let reverse_push =
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t ->
      (match x with
      | DR (cp, aa, (adjoint, _, _), af, _ai, tracker) ->
        aa := reverse_add !aa v;
        af := Stdlib.(!af - 1);
        if !af = 0 && !tracker = 1
        then push (adjoint cp aa t)
        else (
          tracker := pred !tracker;
          push t)
      | _ -> push t)
  in
  fun v x -> push [ v, x ]
After this step, the derivative of f is stored in the adjoint value of x', and we can retrieve the value using the adjval function.
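The stack-driven propagation can be illustrated with a stripped-down, self-contained sketch. Everything here is an assumption made for illustration: the node record, the operators mul and sin_, and in particular the omission of the fan-out counter and tracker. Each contribution is forwarded to the parents immediately, which yields the same adjoints by linearity but may revisit shared nodes, whereas the code above waits until all contributions to a node have arrived.

```ocaml
(* A node stores its primal value, an adjoint accumulator, and a rule that
   maps an incoming adjoint to (adjoint, parent) contributions. *)
type node = {
  primal : float;
  adj : float ref;
  back : float -> (float * node) list;
}

let var p = { primal = p; adj = ref 0.; back = (fun _ -> []) }

let mul a b =
  { primal = a.primal *. b.primal;
    adj = ref 0.;
    back = (fun v -> [ (v *. b.primal, a); (v *. a.primal, b) ]) }

let sin_ a =
  { primal = sin a.primal;
    adj = ref 0.;
    back = (fun v -> [ (v *. cos a.primal, a) ]) }

(* Pop (adjoint, node) pairs until the stack is empty. *)
let reverse_push v x =
  let rec push = function
    | [] -> ()
    | (v, n) :: t ->
      n.adj := !(n.adj) +. v;
      push (n.back v @ t)
  in
  push [ (v, x) ]

(* Forward pass builds the graph of sin (x * x) at x = 2. *)
let x = var 2.
let y = sin_ (mul x x)

(* Backward pass, seeded with adjoint 1 on the output. *)
let () = reverse_push 1. y
let dx = !(x.adj)
```

After the backward pass, dx holds 4 cos 4, the same value the forward mode produces for this function.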
3.4.2 High-Level APIs
Based on the basic low-level APIs, we are able to build more high-level and easy-to-access differentiation functions. The most commonly used function for differentiating is diff in the AD module. Given a function f that maps one scalar value to another, we can calculate its derivative at point x by diff f x. For example, given the hyperbolic function tanh, we can easily calculate its derivative at position x = 0.1, as follows:
open Algodiff.D

let f x = Maths.(tanh x);;
let d = diff f (F 0.1);;
Its implementation using the forward mode low-level API is quite simple:
let diff' f x =
  if not (is_float x) then failwith "input must be a scalar";
  let x = make_forward x (pack_flt 1.) (tag ()) in
  let y = f x in
  primal y, tangent y

let diff f x = diff' f x |> snd
Next, we can generalize derivatives of scalar functions to gradients of multivariate functions. For a function that maps a vector input to a scalar, the grad function calculates its gradient at a given point. For example, in a three-dimensional space, the gradient at each point on a surface consists of three elements representing the partial derivative along the x, y, and z axes. This vector indicates the direction in which the function has the largest magnitude change. Its implementation uses the standard reverse mode:
let grad' f x =
  let x = make_reverse x (tag ()) in
  let y = f x in
  assert (is_float y);
  reverse_reset y;
  reverse_push (pack_flt 1.) y;
  primal y, x |> adjval

let grad f x = grad' f x |> snd
One important application of gradient is in gradient descent, a widely used technique for finding the minimum value of a function. We will discuss it in more detail in Chapter 4.
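Just as the gradient generalizes derivatives from scalars to vectors, the Jacobian generalizes the gradient from vectors to matrices. In other words, grad is applied to functions mapping vectors to scalars, while jacobian is applied to functions that map vectors to vectors. If we assume the function f takes an input vector x of length n and produces an output vector y = f(x) of length m, then the Jacobian is defined as the m × n matrix of partial derivatives
\[ \mathbf{J}=\left[\begin{array}{ccc}\frac{\partial {y}_1}{\partial {x}_1}& \cdots & \frac{\partial {y}_1}{\partial {x}_n}\\ {}\vdots & \ddots & \vdots \\ {}\frac{\partial {y}_m}{\partial {x}_1}& \cdots & \frac{\partial {y}_m}{\partial {x}_n}\end{array}\right] \]
The intuition behind the Jacobian is similar to that of the gradient. At a particular point in the domain of the target function, the Jacobian shows how the output vector changes given a small change in the input vector. The code below shows jacobianv, which uses the forward mode to evaluate the product of the Jacobian with a given vector v: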
let jacobianv' f x v =
  if shape x <> shape v
  then failwith "jacobianv': vector not the same dimension as input";
  let x = make_forward x v (tag ()) in
  let y = f x in
  primal y, tangent y

let jacobianv f x v = jacobianv' f x v |> snd
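The idea of seeding the input tangent with a vector v can be reproduced in a few lines of plain OCaml. This is a sketch under simplifying assumptions: the dual record, the two operators, and the array-based jacobianv below are hypothetical stand-ins that work on float arrays rather than on Owl ndarrays.

```ocaml
(* Dual number: primal and tangent. *)
type dual = { p : float; t : float }

let mul a b = { p = a.p *. b.p; t = (a.t *. b.p) +. (a.p *. b.t) }
let sin_ a = { p = sin a.p; t = a.t *. cos a.p }

(* Jacobian-vector product: seed the input tangents with v, run f once,
   and read the output tangents. *)
let jacobianv f x v =
  if Array.length x <> Array.length v
  then invalid_arg "jacobianv: vector not the same dimension as input";
  let inputs = Array.map2 (fun p t -> { p; t }) x v in
  Array.map (fun d -> d.t) (f inputs)

(* f (x0, x1) = (x0 * x1, sin x0); its Jacobian is
   [ x1      x0 ]
   [ cos x0  0  ] *)
let f x = [| mul x.(0) x.(1); sin_ x.(0) |]

(* Unit vectors as v extract one column of the Jacobian per forward pass. *)
let col0 = jacobianv f [| 1.; 2. |] [| 1.; 0. |]
let col1 = jacobianv f [| 1.; 2. |] [| 0.; 1. |]
```

At the point (1, 2), col0 is the first column (2, cos 1) and col1 is the second column (1, 0), so a full Jacobian of a function with n inputs can be assembled from n such passes.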
The advanced APIs support convenient composition and can be used to build more complex ones. For example, the second-order derivative of function f can be implemented as g = f |> diff |> diff. Another example is the hessian API. Given a multivariate function that maps n input variables to a scalar, this function calculates its second-order derivatives as a matrix. Its implementation is based on Jacobian:
let hessian f x = (f |> grad |> jacobian) x
In most applications, we use these high-level APIs to support more advanced applications, such as optimization, regression, neural networks, etc. One good example is to implement Newton’s method for finding the minimum value of a function. Rather than moving only in the direction of the gradient, Newton’s method combines the gradient with the second-order gradients of a function, \( \frac{\nabla f\left({x}_n\right)}{\nabla^2f\left({x}_n\right)} \), starting from a random position and iterating until convergence according to Eq. 3.9. Its implementation in OCaml is shown as follows:
open Algodiff.D

let rec newton ?(eta = F 0.01) ?(eps = 1e-6) f x =
  let g = grad f x in
  let h = hessian f x in
  if (Maths.l2norm' g |> unpack_flt) < eps
  then x
  else newton ~eta ~eps f Maths.(x - eta * g *@ (inv h))
As an example, we can apply this method to a two-dimensional trigonometric function, starting from a random initial point, to find a local minimum. Note that the function f passed to newton takes a vector as input and outputs a scalar.
let _ =
  let f x = Maths.(cos x |> sum') in
  newton f (Mat.uniform 1 2)
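The same iteration can be traced in one dimension with hand-written derivatives, which makes the update rule easy to verify. This is only a sketch: the labeled arguments ~df and ~d2f are hypothetical replacements for the calls to grad and hessian, and the default step size eta = 1 (a full Newton step) differs from the damped default used above.

```ocaml
(* One-dimensional Newton iteration for minimization:
   x <- x - eta * f'(x) / f''(x), stopping once |f'(x)| < eps. *)
let rec newton ?(eta = 1.) ?(eps = 1e-6) ~df ~d2f x =
  let g = df x in
  if abs_float g < eps
  then x
  else newton ~eta ~eps ~df ~d2f (x -. (eta *. g /. d2f x))

(* Minimize cos x: f'(x) = -sin x and f''(x) = -cos x. *)
let x_min =
  newton ~df:(fun x -> -.sin x) ~d2f:(fun x -> -.cos x) 3.
```

Starting from x = 3, the iteration settles at the local minimum of cos x at π after a couple of steps.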
Besides what we have mentioned, Owl also implements other high-level differentiation functions, such as laplacian, which calculates the Laplacian operator \( {\nabla}^2f \), that is, the trace of the Hessian matrix. In essence, they are all built upon a concise set of low-level APIs or other high-level APIs. We will see more applications of the AD module in later chapters.
3.5 More Implementation Details
Besides the main structure we have mentioned so far, there are some other details that should be mentioned to build an industry-grade AD module. We introduce them in this section.
3.5.1 Perturbation Confusion and Tag
We have explained some of the fields in the DR type. But one of them is not covered yet: tag of type int, which is used to solve a particular problem when calculating higher-order derivatives with nested forward and backward modes. This problem is referred to as perturbation confusion. An AD engine must handle this problem to function properly. Here, we only scratch the surface of it. Let’s look at an example. Suppose we want to compute the derivative of
\[ f(x)=x\frac{d\left(x+y\right)}{dy}, \]
that is, a function that contains another derivative function. It initially seems straightforward, and we don’t even need a computer’s help: since \( \frac{d\left(x+y\right)}{dy}=1 \), we have \( {f}^{\prime }(x)={x}^{\prime }=1 \). Unfortunately, applying the simple implementation without a tag leads to a wrong answer.
# let diff f x =
    match x with
    | DF (_, _) -> f x |> tangent
    | DR (_, _, _) ->
      let r = f x in
      reverse_push [ (1., r) ];
      !(adjoint x);;
val diff : (t -> t) -> t -> float = <fun>
# let f x =
    let g = diff (fun y -> add_ad x y) in
    mul_ad x (make_forward (g (make_forward 2. 1.)) 1.);;
val f : t -> t = <fun>
# diff f (make_forward 2. 1.);;
- : float = 4.
The result is 4 at point (2, 2), but according to our previous calculation, the result should be 1 at any point. What has gone wrong? The answer is a bit tricky. Note that x=DF(2,1). The tangent value equals 1, which means that \( \frac{dx}{dx}=1 \). Now if we continue to use this same x value in function g, whose variable is y, the same x=DF(2,1) can be incorrectly interpreted by the AD engine as \( \frac{dx}{dy}=1 \). Therefore, when used within function g, x should actually be treated as DF(2,0). That’s where tagging comes to help. It solves the nested derivative problem by distinguishing derivative calculations and their associated parameters with a unique tag for each usage of the derivative operator.
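A minimal sketch of how tags keep nested derivatives apart is given below. It assumes a simplified forward-only value type with an integer tag, a global counter for fresh tags, and the convention that the larger (more recently created) tag is always the outermost layer of a nested value. All names are hypothetical and reverse mode is left out.

```ocaml
type t =
  | F of float
  | DF of t * t * int   (* primal, tangent, tag *)

let counter = ref 0
let fresh_tag () = incr counter; !counter

let rec add a b =
  match a, b with
  | F x, F y -> F (x +. y)
  | DF (ap, at, ai), F _ -> DF (add ap b, at, ai)
  | F _, DF (bp, bt, bi) -> DF (add a bp, bt, bi)
  | DF (ap, at, ai), DF (bp, bt, bi) ->
    if ai = bi then DF (add ap bp, add at bt, ai)
    else if ai > bi then DF (add ap b, at, ai)   (* b is constant w.r.t. tag ai *)
    else DF (add a bp, bt, bi)                   (* a is constant w.r.t. tag bi *)

let rec mul a b =
  match a, b with
  | F x, F y -> F (x *. y)
  | DF (ap, at, ai), F _ -> DF (mul ap b, mul at b, ai)
  | F _, DF (bp, bt, bi) -> DF (mul a bp, mul a bt, bi)
  | DF (ap, at, ai), DF (bp, bt, bi) ->
    if ai = bi then DF (mul ap bp, add (mul at bp) (mul ap bt), ai)
    else if ai > bi then DF (mul ap b, mul at b, ai)
    else DF (mul a bp, mul a bt, bi)

(* Only a tangent that carries the tag of this diff call counts. *)
let tangent_of i = function
  | DF (_, t, j) when j = i -> t
  | _ -> F 0.

let diff f x =
  let i = fresh_tag () in
  f (DF (x, F 1., i)) |> tangent_of i

let rec to_float = function
  | F x -> x
  | DF (p, _, _) -> to_float p

(* f x = x * d(x + y)/dy, with the inner derivative taken at y = 2. *)
let f x = mul x (diff (fun y -> add x y) (F 2.))
let result = to_float (diff f (F 2.))
```

Inside the inner diff, x carries the outer tag, so its tangent is not mistaken for a perturbation of y; result is therefore 1 rather than 4.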
3.5.2 Lazy Evaluation
We have seen how separating building template and operation definitions makes it convenient to add new operations, simplifying code and improving productivity. But it comes with a price: efficiency. Imagine a large calculation that contains thousands of operations, with one operation occurring many times. Such situations are actually quite common when using AD with neural networks where large computation graphs are created that use functions such as add and mul many hundreds of times. With the Builder approach described earlier, the operation will be recreated every time it is used, which is rather inefficient. Fortunately, we can simply use OCaml’s lazy evaluation mechanism to perform caching.
val lazy: 'a -> 'a lazy_t

module Lazy : sig
  type 'a t = 'a lazy_t
  val force : 'a t -> 'a
end
OCaml provides the built-in lazy keyword, which accepts an expression of type 'a and returns a value of type 'a lazy_t in which the computation of the expression has been delayed. This lazy expression won’t be evaluated until it is called by Lazy.force, and the first time it is called, the expression is evaluated and the result is cached. Subsequent applications of Lazy.force will simply return the cached result without further re-evaluation. Here is an example of lazy evaluation in OCaml:
# let x = Printf.printf "hello world!"; 42
hello world!
val x : int = 42
# let lazy_x = lazy (Printf.printf "hello world!"; 42)
val lazy_x : int lazy_t = <lazy>
# let _ = Stdlib.Lazy.force lazy_x
hello world!
- : int = 42
# let _ = Stdlib.Lazy.force lazy_x
- : int = 42
# let _ = Stdlib.Lazy.force lazy_x
- : int = 42
In this example, we can see that building lazy_x does not evaluate the content, which is delayed to the first Lazy.force. After that, every time force is called, only the cached value is returned; the expression itself, including the call to printf, is not evaluated again. Now come back to the AD module in Owl. Imagine that we need to add support for the sin operation. The definition of sin remains the same:
open Algodiff.D

module Sin = struct
  let label = "sin"
  let ff_f a = F A.Scalar.(sin a)
  let ff_arr a = Arr A.(sin a)
  let df _cp ap at = Maths.(at * cos ap)
  let dr a _cp ca = Maths.(!ca * cos (primal a))
end
However, we can instead use lazy evaluation to actually build the implementation and benefit from the efficiency gain of the caching it provides.
let _sin_ad = lazy (Builder.build_siso (module Sin : Builder.Siso));;
let new_sin_ad = Lazy.force _sin_ad;;
In this way, regardless of how many times this sin function is called in a massive computation graph, the Builder.build_siso process is only evaluated once.
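The effect of the caching can be observed with a small experiment that uses only the standard library. In this sketch, build_sin is a hypothetical stand-in for an expensive builder call such as Builder.build_siso, and the counter exists only to make the number of builds visible.

```ocaml
(* Counts how many times the (hypothetical) expensive builder runs. *)
let build_count = ref 0

let build_sin () =
  incr build_count;
  fun x -> sin x

(* Naive: rebuild the operator on every call. *)
let sin_rebuilt x = (build_sin ()) x

(* Cached: the builder runs only when the lazy value is first forced. *)
let sin_cached_op = lazy (build_sin ())
let sin_cached x = (Lazy.force sin_cached_op) x

let () =
  ignore (sin_rebuilt 1.);
  ignore (sin_rebuilt 2.);
  ignore (sin_rebuilt 3.)

let rebuilt_builds = !build_count

let () =
  build_count := 0;
  ignore (sin_cached 1.);
  ignore (sin_cached 2.);
  ignore (sin_cached 3.)

let cached_builds = !build_count
```

Three calls trigger three builds in the naive version (rebuilt_builds is 3) but only one in the lazy version (cached_builds is 1).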
3.5.3 Extending AD
A significant benefit of the module design described earlier is that it can be easily extended by providing modules representing new functions. For example, suppose that the AD system did not support the sine function, \( \sin x \), whose derivative is \( {\sin}^{\prime }x=\cos x \). Including this function is a simple matter of defining the necessary functions for calculating primal, tangent, and adjoint values in a module and applying the relevant function from the Builder module: in this case, build_siso for building “single input, single output” functions.
open Algodiff.D

module Sin = struct
  let label = "sin"
  let ff_f a = F A.Scalar.(sin a)
  let ff_arr a = Arr A.(sin a)
  let df _cp ap at = Maths.(at * cos ap)
  let dr a _cp ca = Maths.(!ca * cos (primal a))
end

let new_sin_ad = Builder.build_siso (module Sin : Builder.Siso)
We can directly use this new operator as if it were a native operation in the AD module. For example:
# let f x1 x2 =
    let x1 = F x1 in
    let x2 = F x2 in
    Maths.(div (cos x1) (new_sin_ad x2));;
val f : float -> float -> t = <fun>
3.5.4 Graph Utility
Though not core functions, various utility functions provide convenience to users, for example, tools to visualize the computation graph built up by AD. They come in handy when we are trying to debug or understand how AD works. The core of the visualization function is a recursive traverse routine:
let _traverse_trace x =
  let nodes = Hashtbl.create 512 in
  let index = ref 0 in
  (* local function to traverse the nodes *)
  let rec push tlist =
    match tlist with
    | [] -> ()
    | hd :: tl ->
      if Hashtbl.mem nodes hd = false
      then (
        let op, prev =
          match hd with
          | DR (_ap, _aa, (_, _, label), _af, _ai, _) -> label
          | F _a -> Printf.sprintf "Const", []
          | Arr _a -> Printf.sprintf "Const", []
          | DF (_, _, _) -> Printf.sprintf "DF", []
        in
        (* check if the node has been visited before *)
        Hashtbl.add nodes hd (!index, op, prev);
        index := !index + 1;
        push (prev @ tl))
      else push tl
  in
  (* iterate the graph then return the hash table *)
  push x;
  nodes
The _traverse_trace and its related functions are used to convert the computation graph generated in backward mode into human-readable format. It initializes variables for tracking nodes and indices, then iterates the graph and puts required information into a hash table. With some extra code, the parsed information can be displayed on a terminal or be converted into other formats that are suitable for visualization, such as the dot format by Graphviz.
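As an illustration of the final conversion step, the sketch below walks a small hand-built graph with the same visited-table traversal and emits Graphviz dot text. The node record with explicit integer ids and the function to_dot are assumptions made for this example; Owl's own graph utilities operate on the DR values directly.

```ocaml
(* A simplified graph node: an id, an operator label, and its input nodes. *)
type node = { id : int; label : string; prev : node list }

let to_dot root =
  let visited = Hashtbl.create 16 in
  let buf = Buffer.create 256 in
  Buffer.add_string buf "digraph trace {\n";
  let rec push = function
    | [] -> ()
    | hd :: tl ->
      if Hashtbl.mem visited hd.id
      then push tl
      else begin
        Hashtbl.add visited hd.id ();
        (* one line per node, then one line per incoming edge *)
        Buffer.add_string buf
          (Printf.sprintf "  n%d [label=\"%s\"];\n" hd.id hd.label);
        List.iter
          (fun p ->
            Buffer.add_string buf (Printf.sprintf "  n%d -> n%d;\n" p.id hd.id))
          hd.prev;
        push (hd.prev @ tl)
      end
  in
  push [ root ];
  Buffer.add_string buf "}\n";
  Buffer.contents buf

(* The graph of sin (pow x 2): x -> pow -> sin *)
let x = { id = 0; label = "x"; prev = [] }
let sq = { id = 1; label = "pow"; prev = [ x ] }
let y = { id = 2; label = "sin"; prev = [ sq ] }
let dot = to_dot y
```

The resulting string can be written to a file and rendered with the Graphviz dot tool.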
3.6 How AD Is Built upon Ndarray
We have been saying how the AD does not need to deal with the details of computation implementation and thus can focus on the logic of differentiation. In previous examples, we assume the A module to be any Ndarray module. In the final section of this chapter, we will explain how the AD module is built upon the Ndarray modules. We hope to illustrate the power of the functor system in promoting a modular style system design.
First, the Ndarray module used in the AD module is not purely the Ndarray introduced in Chapter 2; it also contains several other modules, including the scalar functions and the functions that are specific to matrices and linear algebra. Together, they are called the Owl_algodiff_primal_ops module. Based on the precision of the modules included, it is also divided into the S and D submodules. For example, here are the components of Owl_algodiff_primal_ops.S:
module S = struct
  include Owl_dense_ndarray.S
  module Scalar = Owl_maths

  module Mat = struct
    let eye = Owl_dense_matrix_s.eye
    let tril = Owl_dense_matrix_s.tril
    let triu = Owl_dense_matrix_s.triu
    ...
  end

  module Linalg = struct
    include Owl_linalg_s

    let qr a =
      let q, r, _ = qr a in
      q, r
    ...
  end
end
By replacing the single-precision modules used in it, such as Owl_dense_ndarray.S and Owl_dense_matrix_s, with their double-precision counterparts, we can get the Owl_algodiff_primal_ops.D module. Moreover, by replacing them with the base Ndarray modules, such as Owl_base_dense_ndarray.S, we can acquire AD modules the calculation of which is based on pure OCaml. Actually, the implementation is not limited to these types. The interface of the Owl_algodiff_primal_ops module is specified in the Owl_types_ndarray_algodiff.Sig module. As long as a module implements all the required functions and modules, it can be plugged into AD. For example, we can utilize the computation graph module in Owl and build a symbolic Ndarray module. An AD module built on this module can provide powerful symbolic computation functionality. It means that the execution of both forward and reverse differentiation modes can benefit from various optimization opportunities, such as graph and memory optimizations. We will explain the computation graph module in Chapter 6.
To build an AD module, we use code similar to the following:
module S = Owl_algodiff_generic.Make (Owl_algodiff_primal_ops.S)
module D = Owl_algodiff_generic.Make (Owl_algodiff_primal_ops.D)
So next, let’s take a look at the Owl_algodiff_generic.Make functor. It includes all the existing submodules we have introduced so far: the core module, the operators, and the differential API functions such as make_forward and diff, as follows:
module Make (A : Owl_types_ndarray_algodiff.Sig) = struct
  module Core = Owl_algodiff_core.Make (A)
  include Core
  module Ops = Owl_algodiff_ops.Make (Core)
  include Ops

  let make_forward p t i = DF (p, t, i)
  let make_reverse p i = ...
  let diff f x = ...
  let grad f x = ...
end
These components all rely on the fundamental computation module A. The Core module itself is built using a functor, with the ndarray module as the parameter. Its interface is specified in Owl_algodiff_core_sig.ml, as follows. It includes the basic type definitions and the operations that can be applied to them.
module type Sig = sig
  module A : Owl_types_ndarray_algodiff.Sig

  include Owl_algodiff_types_sig.Sig with type elt := A.elt and type arr := A.arr

  val primal : t -> t
  val tangent : t -> t
  val adjref : t -> t ref
  ...
end
Next, the operators such as sin are built using the Core module as a parameter. As we have explained, first the Builder module works as a factory that assembles various operators by providing different templates such as siso, including a type definition of the template and the function to build operators.
module Make (Core : Owl_algodiff_core_sig.Sig) = struct
  open Core

  module type Siso = sig
    val label : string
    val ff_f : A.elt -> t
    val ff_arr : A.arr -> t
    val df : t -> t -> t -> t
    val dr : t -> t -> t ref -> t
  end

  let build_siso = ...
The operator module is then built on Core and Builder. It contains all the operators, which are constructed with the builder functions and categorized into submodules such as Maths and Linalg.
module Make (Core : Owl_algodiff_core_sig.Sig) = struct
  open Core
  module Builder = Owl_algodiff_ops_builder.Make (Core)
  open Builder

  module Maths = struct
    let cos =
      (build_siso
         (module struct
           let label = "cos"
           let ff_f a = F A.Scalar.(cos a)
           let ff_arr a = Arr A.(cos a)
           let df _cp ap at = neg (at * sin ap)
           let dr a _cp ca = !ca * neg (sin a)
         end : Siso))

    and sin = (build_siso ...)
    ...
  end

  module Linalg = struct ... end
  module NN = struct ... end
  ...
end
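The layering described in this section can be sketched in miniature with a single functor. The signature Backend, the functor Make, and the module Float_backend below are hypothetical and far smaller than Owl's real interfaces; the point is only that the differentiation logic refers to the backend through its signature.

```ocaml
(* Hypothetical, minimal stand-in for the numeric backend signature. *)
module type Backend = sig
  type elt
  val one : elt
  val mul : elt -> elt -> elt
  val sin : elt -> elt
  val cos : elt -> elt
end

(* The AD logic only sees the signature, never a concrete backend. *)
module Make (A : Backend) = struct
  type t = { p : A.elt; t : A.elt }   (* primal, tangent *)

  let sin a = { p = A.sin a.p; t = A.mul a.t (A.cos a.p) }

  (* Seed the tangent with one, apply f, and read the tangent back. *)
  let diff f x = (f { p = x; t = A.one }).t
end

(* One possible backend: OCaml's native floats. *)
module Float_backend = struct
  type elt = float
  let one = 1.
  let mul = ( *. )
  let sin = sin
  let cos = cos
end

module AD = Make (Float_backend)

(* Derivative of sin at 0, which is cos 0 *)
let d = AD.diff AD.sin 0.
```

Swapping Float_backend for another module that satisfies Backend changes how the numbers are computed without touching the code inside Make.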
3.7 Summary
In this chapter, we discussed the design of one of the core modules in Owl: the algorithmic differentiation module. We started from its basic theory and the differences among the three types of differentiation. Then we presented the overall architecture of the AD module in Owl. We explained several parts in detail in the following sections: the definition of types in this system, the operators, and the APIs built on existing mechanisms. We also discussed more subtle issues that deserve attention when building an industry-level AD engine, such as avoiding the perturbation confusion issue, using lazy evaluation to improve performance, graph visualization, etc. Finally, we explained how the AD system is built upon the Ndarray module in Owl.
Notes
 1.
Not to be confused with the “forward differentiation mode” introduced before.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Wang, L., Zhao, J. (2023). Algorithmic Differentiation. In: Architecture of Advanced Numerical Analysis Systems. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-8853-5_3
DOI: https://doi.org/10.1007/978-1-4842-8853-5_3
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-8852-8
Online ISBN: 978-1-4842-8853-5
eBook Packages: Professional and Applied Computing; Professional and Applied Computing (R0); Apress Access Books