Reverse AD at Higher Types: Pure, Principled and Denotationally Correct

We show how to define forward- and reverse-mode automatic differentiation as source-code transformations on a standard higher-order functional language. The transformations generate purely functional code, and they are principled in the sense that their definition arises from a categorical universal property. We give a semantic proof of correctness of the transformations. In their most elegant formulation, the transformations generate code with linear types. However, we demonstrate how the transformations can be implemented in a standard functional language without sacrificing correctness. To do so, we make use of abstract data types to represent the required linear types, e.g. through the use of a basic module system.


INTRODUCTION
Automatic Differentiation (AD) is a technique for transforming code that implements a function f into code that computes f's derivative, essentially by using the chain rule for derivatives. Due to its efficiency and numerical stability, AD is the technique of choice whenever derivatives of functions implemented as programs need to be computed, particularly in high-dimensional settings. Optimization and Monte-Carlo integration algorithms, such as gradient descent and Hamiltonian Monte-Carlo methods, rely crucially on the calculation of derivatives. These algorithms are used in virtually every machine learning and computational statistics application, and the calculation of derivatives is usually the computational bottleneck. These applications explain the recent surge of interest in AD, which has resulted in the proliferation of popular AD systems such as TensorFlow (Abadi et al. 2016), PyTorch (Paszke et al. 2017), and Stan Math (Carpenter et al. 2015).
AD, roughly speaking, comes in two modes: forward-mode and reverse-mode. When differentiating a function R^n → R^m, forward-mode tends to be more efficient if m ≫ n, while reverse-mode generally is more performant if n ≫ m. As most applications reduce to optimization or Monte-Carlo integration of an objective function R^n → R with n very large (today, on the order of 10^4 to 10^7), reverse-mode AD is in many ways the more interesting algorithm.
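Before the formal development, forward mode can be sketched with the familiar dual-number construction. The following is a minimal illustration of ours (the names `Dual`, `deriv`, etc. are not from this paper), pairing each value with its tangent and propagating both through every primitive via the chain rule:

```haskell
-- A value paired with its tangent (a "dual number").
data Dual = Dual { primal :: Double, tangent :: Double }

-- Each primitive carries its derivative, per the chain rule.
addD :: Dual -> Dual -> Dual
addD (Dual x dx) (Dual y dy) = Dual (x + y) (dx + dy)

mulD :: Dual -> Dual -> Dual
mulD (Dual x dx) (Dual y dy) = Dual (x * y) (x * dy + y * dx)

sigmoidD :: Dual -> Dual
sigmoidD (Dual x dx) = Dual s (s * (1 - s) * dx)
  where s = 1 / (1 + exp (negate x))

-- Differentiate f : R -> R at a point by seeding the tangent with 1.
deriv :: (Dual -> Dual) -> Double -> Double
deriv f x = tangent (f (Dual x 1))
```

For f(x) = x·x, `deriv (\x -> mulD x x) 3` yields `6.0`, matching the mathematical derivative 2x.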
However, it is also much more complicated to understand and implement than forward AD. Forward AD can be straightforwardly implemented as a structure-preserving program transformation, even on languages with complex features (Shaikhha et al. 2019). As such, it admits an elegant proof of correctness (Huot et al. 2020). By contrast, reverse AD is only well-understood as a source-code transformation (also called define-then-run style AD) on limited programming languages. Typically, its implementations on more expressive languages that have features such as higher-order functions make use of define-by-run approaches. These approaches first build a computation graph during runtime, effectively evaluating the program until a straight-line first-order program is left, and then they evaluate this new program (Carpenter et al. 2015; Paszke et al. 2017). Such approaches have the severe downside that the differentiated code cannot benefit from existing optimizing compiler architectures. As such, these AD libraries need to be implemented using carefully, manually optimized code that, for example, does not contain any common subexpressions.
This implementation process is precarious and labour-intensive. Further, some whole-program optimizations that a compiler would detect go entirely unused in such systems.
Similarly, correctness proofs of reverse AD have taken a define-by-run approach and have relied on non-standard operational semantics, using forms of symbolic execution (Abadi and Plotkin 2020; Brunel et al. 2020; Mak and Ong 2020). Most work that treats reverse AD as a source-code transformation does so by making use of complex transformations which introduce mutable state and/or non-local control flow (Pearlmutter and Siskind 2008; Wang et al. 2019). As a result, we are not sure whether and why such techniques are correct. Another approach has been to compile high-level languages to a low-level imperative representation first, and then to perform AD at that level (Innes 2018), using mutation and jumps. This approach has the downside that we might lose important opportunities for compiler optimizations, such as map-fusion and embarrassingly parallel maps, which we can exploit if we perform define-then-run AD on a high-level representation.
A notable exception to these define-by-run and non-functional approaches to AD is (Elliott 2018), which presents an elegant, purely functional, define-then-run version of reverse AD. Unfortunately, their techniques are limited to first-order programs over tuples of real numbers. This paper extends the work of (Elliott 2018) to apply to higher-order programs over (primitive) arrays of reals:
• It defines purely functional define-then-run reverse-mode AD on a higher-order language.
• It shows how the resulting, mysterious-looking program transformation arises from a universal property if we phrase the problem in a suitable categorical language. Consequently, the transformations automatically respect equational reasoning principles.
• It explains, from this categorical setting, precisely in what sense reverse AD is the "mirror image" of forward AD.
• It presents an elegant proof of semantic correctness of the AD transformations, based on a semantic logical relations argument, demonstrating that the transformations calculate the derivatives of the program in the usual mathematical sense.
• It shows that the AD definitions and correctness proof are extensible to higher-order primitives such as a map-operation over our primitive arrays.
• It discusses how our techniques are readily implementable in standard functional languages to give purely functional, principled, semantically correct, define-then-run reverse-mode AD.

KEY IDEAS
Consider a very simple programming language. Types are statically sized arrays real^n for some n, and programs are obtained from a collection of (unary) primitive operations x : real^n ⊢ op(x) : real^m (intended to implement differentiable functions like linear algebra operations such as addition and products, and sigmoid functions) through sequencing.
Observe that we can straightforwardly implement both forward-mode AD −→D and reverse-mode AD ←−D on this language as source-code translations to the larger language of a simply typed λ-calculus over the ground types real^n that includes at least the same operations. We translate a type τ to a pair of types (D(τ)_1, D(τ)_2): the former for holding function values (also called primals in the AD literature), the latter for holding derivative values (also called tangents or adjoints/cotangents in the AD literature, depending on whether one is considering forward or reverse AD). Terms x : τ ⊢ t : σ can then be translated to pairs of terms, for forward AD and reverse AD respectively. Indeed, we define these by induction on the syntax, where we assume that we have chosen suitable terms x : real^n ⊢ (Dop)(x) : real^n → real^m and x : real^n ⊢ (Dop)^t(x) : real^m → real^n to represent the derivative and transposed derivative, respectively, of the primitive operation op : real^n → real^m. While this technique works well for performing AD on the limited first-order language we described, it is far from satisfying. Notably, it has the following two shortcomings: (1) it does not tell us how to perform AD on programs that involve tuples or operations of multiple arguments; (2) it does not tell us how to perform AD on higher-order programs, that is, programs involving λ-abstractions and applications. The key contributions of this paper are its extension of this transformation (see §7) to apply to a full simply typed λ-calculus (of §3), and its proof that this transformation is correct (see §8).
Shortcoming (1) seems easy to address, at first sight. Indeed, as the (co)tangent vectors to a product of spaces are simply tuples of (co)tangent vectors, one would expect the obvious pointwise definitions. Indeed, this technique straightforwardly applies to forward-mode AD. For reverse-mode AD, however, tuples already present challenges. Indeed, we would like to use the definitions below, but they require terms ⊢ 0 : τ and t + s : τ for any two t, s : τ, for each type τ. These formulae capture the well-known issue of fan-out translating to addition in reverse-mode AD, caused by the contravariance of reverse AD in its second component (Pearlmutter and Siskind 2008). Such 0 and + could indeed be defined by induction on the structure of types, using 0 and + at real^n. However, more problematically, the pairing and projection constructs −, −, fst − and snd − represent explicit uses of the structural rules of contraction and weakening at types τ, which, in a λ-calculus, can also be used implicitly in the typing context Γ. Thus, we should also make these implicit uses explicit to account for their presence in the code.
Then, we can appropriately translate them into their "mirror image": we map the contraction-weakening comonoids to the monoid structures (+, 0). Here, we see insight (1): in define-then-run reverse AD, we need to make use of explicit structural rules and "mirror them", which we can do by translating our language into combinators.
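As a toy illustration of this phenomenon (our own, not the paper's transformation), the transposed derivative of the duplication map x ↦ (x, x) merges the two incoming cotangents by adding them, mirroring the contraction comonoid with the monoid (+, 0):

```haskell
-- Transposed derivative of fan-out x |-> (x, x): it *adds* the two
-- incoming cotangents.
dupT :: (Double, Double) -> Double
dupT (ct1, ct2) = ct1 + ct2

-- Reverse derivative of squaring, written via fan-out into (*): the
-- cotangent 1 flows back through both uses of x, and dupT sums the
-- two contributions x*1 and x*1 into the gradient 2*x.
gradSquare :: Double -> Double
gradSquare x = dupT (x * ct, x * ct) where ct = 1
```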
Put differently: we define AD on the syntactic category Syn, which has types τ as objects and (α)βη-equivalence classes of programs x : τ ⊢ t : σ as morphisms τ → σ.
Yet the question remains: why should this translation for tuples be correct? What is even less clear is how to address shortcoming (2). What should the spaces of tangents −→D(τ → σ)_2 and adjoints ←−D(τ → σ)_2 look like? This is not something we are taught in Calculus 1.01. Instead, we again employ category theory, which leads us to insight (2): follow where the categorical structure of the syntax leads you, as doing so produces principled definitions that are easy to prove correct.
With the aim of categorical compositionality in mind, we can note that our translations compose. By the following trick, these equations are functoriality laws. Given a Cartesian closed category (C, 1, ×, ⇒), define categories

Both have identities and composition given in terms of those of C and D(−), where we write Λ for categorical currying and π_2 for the second projection,
where we work in the internal language of C. Then, we can see that we have defined two functors, where we write Syn_1 for the syntactic category of our restrictive first-order language, and we write Syn for that of the full λ-calculus. We would like to extend these to functors on all of Syn. The target turns out to be a category with finite products. (The reason this does not suffice turns out to be that not all functions are linear in the sense of respecting 0 and +.) Therefore, the categorical structure does not immediately give us guidance on how to extend our translation to all of Syn. Now, we reach key insight (3): linear types can help. By using a more fine-grained type system, we can capture the linearity of the derivative. As a result, we can phrase AD on our full language simply as the unique structure-preserving functor that extends the uncontroversial definitions given so far.
To implement this insight, we extend our λ-calculus to a language LSyn with limited linear types (in §4): linear function types ⊸ and a kind of multiplicative conjunction !(−) ⊗ (−), in the sense of the enriched effect calculus (Egger et al. 2009). The algebraic effect giving rise to these linear types, in this instance, is that of the theory of commutative monoids. As we have seen, such monoids are intimately related to reverse AD. Consequently, we demand that every f with a linear function type τ ⊸ σ is indeed linear, in the sense that f 0 = 0 and f (t + s) = (f t) + (f s). For the categorically inclined reader: that is, we enrich LSyn over the category of commutative monoids. Now, we can give more precise types to our derivatives, as we know they are linear functions: for x : τ ⊢ t : σ, the (transposed) derivative is linear in its tangent (cotangent) argument. Therefore, given any model C of our linear type theory, we generalise our previous construction of the categories

(Definition of AD, §7). Once we fix the interpretation of the primitive operations op to their respective derivatives and transposed derivatives, we obtain unique structure-preserving forward and reverse AD functors. In particular, the following definitions are forced on us by the theory, producing key insight (4). With these definitions in place, we turn to the correctness of the source-code transformations.
To phrase correctness, we first need to construct a suitable denotational semantics with an uncontroversial notion of semantic differentiation. A technical challenge arises, as the usual calculus setting of Euclidean spaces (or manifolds) and smooth functions cannot interpret higher-order functions. To solve this problem, we work with a conservative extension of this standard calculus setting (see §5): the category Diff of diffeological spaces. We model our types as diffeological spaces, and programs as smooth functions. By keeping track of a commutative monoid structure on these spaces, we are also able to interpret the required linear types. We write Diff_CM for this "linear" category of commutative diffeological monoids and smooth monoid homomorphisms.
By the universal properties of the syntax, we obtain canonical, structure-preserving functors ⟦−⟧ : LSyn → Diff_CM and ⟦−⟧ : Syn → Diff once we fix interpretations R^n of real^n and well-typed interpretations ⟦op⟧ for each operation op. These functors define a semantics for our language.
Having constructed the semantics, we can turn to the correctness proof (of §8). The proof consists of a logical relations argument over the semantics, which we phrase categorically, giving key insight (5): once we show that the derivatives of primitive operations op are correctly implemented, correctness of derivatives of other programs follows from a standard logical relations construction over the semantics that relates a curve to its (co)tangent curve. To show correctness of forward AD, we construct a category of curves paired with their tangent curves, where we write Df(x)(v) for the usual multivariate calculus derivative of f at a point x evaluated at a tangent vector v. By an application of the chain rule for multivariate differentiation, we see that every op respects this predicate, as long as ⟦Dop⟧ = D⟦op⟧. The commuting of our diagram then virtually establishes the correctness of forward AD. The only remaining step in the argument is to note that any tangent vector at ⟦τ⟧ ≅ R^N, for first-order τ, can be represented by a curve R → ⟦τ⟧. For reverse AD, the same construction works, provided ⟦Dop^t⟧ = (D⟦op⟧)^t, with transposed predicates, where we write A^t for the matrix transpose of A. We now obtain our main theorem. Crucially, note that this theorem holds even for t that involve higher-order subprograms.
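As a down-to-earth illustration of the proof obligation that Dop = D⟦op⟧ for each primitive op, one can compare a hand-written derivative against a finite-difference approximation. The sketch below (our own, with sigmoid as the example primitive) is merely a numerical sanity check, not part of the formal argument:

```haskell
-- Central finite difference: a standard numerical check on a
-- claimed derivative implementation.
numDeriv :: (Double -> Double) -> Double -> Double
numDeriv f x = (f (x + h) - f (x - h)) / (2 * h) where h = 1e-6

sigmoid :: Double -> Double
sigmoid x = 1 / (1 + exp (negate x))

-- The claimed derivative Dop for op = sigmoid.
dSigmoid :: Double -> Double
dSigmoid x = s * (1 - s) where s = sigmoid x
```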

Theorem (Correctness of AD, Thm. 8.1). For any typed term x : τ ⊢ t : σ in Syn between first-order types τ, σ, we have that the AD transformations compute the (transposed) derivative of ⟦t⟧.
Next, we address the practicality of our method (in §9). The code transformations we employ are not too daunting to implement. We can mechanically translate λ-calculus and functional languages into a (categorical) combinatory form (Curien 1986). However, the implementation of the required linear types presents a challenge. Indeed, types like !(−) ⊗ (−) and (−) ⊸ (−) are absent from functional languages such as Haskell and O'Caml. Luckily, in this instance, we can implement them using abstract data types by making use of a (basic) module system, which is key insight (6): under the hood, !τ ⊗ σ can consist of a list of values of type τ ∗ σ. Its API ensures that the list order and the difference between xs ++ [(t, s), (t, s′)] ++ ys and xs ++ [(t, s + s′)] ++ ys cannot be observed: as such, it is a quotient type. Meanwhile, τ ⊸ σ can be implemented as a standard function type τ → σ with a limited API that enforces that we can only ever construct linear functions: as such, it is a subtype.
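A hedged sketch of key insight (6) in Haskell, with hypothetical names (`Tensor`, `LFun`, etc.): in a real library, the constructors would be hidden behind a module signature, exporting only the functions below.

```haskell
-- !tau (x) sigma: a bag of (tau, sigma) pairs.  Because consumers can
-- only eliminate it with a fold into a commutative monoid, neither the
-- list order nor the difference between [(t,s),(t,s')] and [(t,s+s')]
-- is observable: it behaves as a quotient type.
newtype Tensor a b = Tensor [(a, b)]

tensorZero :: Tensor a b
tensorZero = Tensor []

singleton :: a -> b -> Tensor a b
singleton a b = Tensor [(a, b)]

tensorPlus :: Tensor a b -> Tensor a b -> Tensor a b
tensorPlus (Tensor xs) (Tensor ys) = Tensor (xs ++ ys)

-- Elimination into any commutative monoid, insensitive to the hidden
-- list structure.
foldTensor :: Monoid m => (a -> b -> m) -> Tensor a b -> m
foldTensor f (Tensor xs) = mconcat [ f a b | (a, b) <- xs ]

-- tau -o sigma: an ordinary function, but the restricted API only ever
-- constructs monoid homomorphisms: it behaves as a subtype of tau -> sigma.
newtype LFun a b = LFun (a -> b)

lid :: LFun a a
lid = LFun id

lcomp :: LFun a b -> LFun b c -> LFun a c
lcomp (LFun f) (LFun g) = LFun (g . f)

lapp :: LFun a b -> a -> b
lapp (LFun f) = f

lplus :: Monoid b => LFun a b -> LFun a b -> LFun a b
lplus (LFun f) (LFun g) = LFun (\x -> f x <> g x)
```

Hiding `Tensor` and `LFun` behind an export list is what makes the quotient and subtype readings sound: clients cannot pattern-match on the representations.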
We next phrase the correctness proof of the AD transformations in elementary terms, such that it holds in the applied setting where we use abstract types to implement linear types. Then, we show that our correctness results are meaningful, as they make use of a denotational semantics that is adequate with respect to the standard operational semantics. Finally, to stress the applicability of our method, we sketch its extension to higher-order (primitive) operations, such as map.

λ-CALCULUS AS A SOURCE LANGUAGE FOR AUTOMATIC DIFFERENTIATION
As a source language for our AD translations, we can begin with a standard, simply typed λ-calculus which has ground types real^n of statically sized arrays of n real numbers, for all n ∈ N, and sets Op^m_{n_1,...,n_k} of primitive operations op for all k, m, n_1, . . ., n_k ∈ N. These operations will be interpreted as smooth functions R^{n_1} × · · · × R^{n_k} → R^m. Examples include:
• constants c, for which we slightly abuse notation and write c() as c;
• elementwise addition and product (+), (∗) ∈ Op^n_{n,n} and matrix-vector product (⋆) ∈ Op^n_{n·m,m};
• operations for summing all the elements in an array: sum ∈ Op^1_n;
• some non-linear functions like the sigmoid function ς ∈ Op^1_1.
We intentionally present operations in a schematic way, as primitive operations tend to form a collection that is added to in a by-need fashion as an AD library develops.
The precise operations needed will depend on the applications but, in statistics and machine learning applications, Op tends to include a mix of multi-dimensional linear algebra operations and mostly one-dimensional non-linear functions. A typical library for use in machine learning would work with multi-dimensional arrays (sometimes called "tensors"). We focus here on one-dimensional arrays, as the issues of how precisely to represent the arrays are orthogonal to the concerns of our development.
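As a purely illustrative reading of these schematic operations, one might realise them on flat Haskell lists as follows; the names and the row-major matrix representation are our own choices, not the paper's:

```haskell
-- Elementwise addition and product on R^n (zipWith truncates on a
-- length mismatch; a real library would enforce static sizes).
ewAdd, ewMul :: [Double] -> [Double] -> [Double]
ewAdd = zipWith (+)
ewMul = zipWith (*)

-- Matrix-vector product: an n*m matrix stored row-major as a flat
-- list of length n*m, applied to a vector of length m.
matVec :: Int -> Int -> [Double] -> [Double] -> [Double]
matVec n m a x =
  [ sum [ a !! (i * m + j) * x !! j | j <- [0 .. m - 1] ]
  | i <- [0 .. n - 1] ]

-- Summing all elements of an array (sum in Op^1_n).
sumArr :: [Double] -> Double
sumArr = sum

-- A non-linear primitive: the sigmoid.
sigma :: Double -> Double
sigma x = 1 / (1 + exp (negate x))
```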
Syn has the following universal property: for any Cartesian closed category (C, 1, ×, ⇒), we obtain a unique Cartesian closed functor F : Syn → C, once we choose objects F real^n of C, as well as make well-typed choices of C-morphisms, for each op ∈ Op^m_{n_1,...,n_k}:

A λ-CALCULUS WITH LINEAR TYPES AS AN IDEALISED AD TARGET LANGUAGE
As a target language for our AD source-code transformations, we consider a language that extends the language of §3 with limited linear types. We could opt to work with a full linear logic as in (Benton 1994) or (Barber and Plotkin 1996). Instead, however, we will only include the bare minimum of linear type formers that we actually need to phrase the AD transformations. The resulting language is closely related to, but more minimal than, the Enriched Effect Calculus of (Egger et al. 2009). We limit our language in this way because we want to stress that the resulting code transformations can easily be implemented in existing functional languages such as Haskell or O'Caml. As we discuss in §9, the idea will be to make use of a module system to implement the required linear types as abstract data types.
In our idealised target language, we consider linear types (also called computation types) τ, σ, ρ, in addition to the Cartesian types (also called value types) τ, σ, ρ that we have considered so far. We think of Cartesian types as denoting spaces and linear types as denoting spaces equipped with an algebraic structure. As we are interested in studying differentiation, the relevant space structure in this instance is a geometric structure that suffices to define differentiability. Meanwhile, the

to indicate that the variables x_1, . . ., x_n need to be fresh in the left-hand side. Equations hold on pairs of terms of the same type. As usual, we only distinguish terms up to α-renaming of bound variables.
relevant algebraic structure on linear types turns out to be that of a commutative monoid, as this algebraic structure is needed to phrase Automatic Differentiation algorithms. Indeed, we will use the linear types to denote spaces of (co)tangent vectors to the spaces of primals denoted by Cartesian types. These spaces of (co)tangents form a commutative monoid under addition.
Concretely, we extend the types and terms of our language as follows. These operations can include, e.g., dense and sparse matrix-vector multiplications.
Their purpose is to serve as primitives that we can use to implement the derivatives Dop(x; v) and (Dop)^t(x; v) of the operations op from the source language as terms that are linear in v.
In addition to the judgement Γ ⊢ t : τ, which we encountered in §3, we now consider an additional judgement Γ; x : τ ⊢ t : σ. While we think of the former as denoting a (structure-preserving) function between spaces, we think of the latter as a (structure-preserving) function from the space which Γ denotes to the space of (structure-preserving) monoid homomorphisms from the denotation of τ to that of σ. In this instance, "structure-preserving" will mean differentiable.
Fig. 3 displays the typing rules of our language. We consider the terms of this language up to the βη+-equational theory of Fig. 4. It includes βη-rules as well as monoid and homomorphism laws.

Preliminaries
Category theory. We assume familiarity with categories C, D, functors F, G : C → D, natural transformations α, β : F → G, and their theory of (co)limits and adjunctions. We write:
• unary, binary, and I-ary products as 1, X_1 × X_2, and ∏_{i∈I} X_i, writing π_i for the projections and (), (x_1, x_2), and (x_i)_{i∈I} for the tupling maps;
• unary, binary, and I-ary coproducts as 0, X_1 + X_2, and ∐_{i∈I} X_i, writing ι_i for the injections and [], [x_1, x_2], and [x_i]_{i∈I} for the cotupling maps;
• exponentials as Y ⇒ X, writing Λ and ev for the currying and evaluation maps.
Typing rules for the idealised AD target language with linear types.
Fig. 4. Equational rules for the idealised, linear AD language, which we use on top of the rules of Fig. 2. In addition to standard βη-rules for !(−) ⊗ (−)- and ⊸-types, we add rules making (0, +) into a commutative monoid on the terms of each linear type, as well as rules which say that terms of linear types are homomorphisms in their linear variable. Equations hold on pairs of terms of the same type.
Example 5.1. The real numbers R form a commutative monoid with 0 and + equal to the number 0 and addition of numbers.
Example 5.2. Given commutative monoids (X_i)_{i∈I}, we can form the product monoid ∏_{i∈I} X_i. Ex. 5.2 gives the categorical product in CMon. We can, for example, construct a commutative monoid structure on any Euclidean space R^k by combining the one on R with the product monoid structure.
Example 5.3. Given a set S, we can form the free commutative monoid !S on S. |!S| is defined as the set of functions f : S → N to the natural numbers N that have finite support, in the sense that f(s) ≠ 0 for only finitely many s. (That is, |!S| is the set of finite multisets of elements of S.) We define 0_{!S} to be the function that is constantly 0, and (f + g)(s) = f(s) + g(s). This follows as we can uniquely extend such an f bilinearly to a map X ⊗ Y → Z. Finally, a category C is called CMon-enriched if we have a commutative monoid structure on each homset C(C, C′) and function composition gives monoid homomorphisms. Finite products in a category C are well-known to be biproducts (i.e. simultaneously products and coproducts) if and only if C is CMon-enriched (see e.g. (Fiore 2007)).
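The free commutative monoid !S of Example 5.3 can be sketched concretely as finite multisets; the encoding below (ours) represents a finite-support function S → N as a sorted association list:

```haskell
-- A finite-support function S -> N as a sorted association list with
-- strictly positive counts, i.e. a finite multiset.
type FreeCM s = [(s, Int)]  -- invariant: keys strictly increasing, counts > 0

zeroF :: FreeCM s           -- the constantly-zero function 0_!S
zeroF = []

unitF :: s -> FreeCM s      -- the singleton multiset {s}
unitF s = [(s, 1)]

-- Pointwise addition (f + g)(s) = f(s) + g(s), merging sorted keys so
-- that equal multisets have equal representations.
plusF :: Ord s => FreeCM s -> FreeCM s -> FreeCM s
plusF [] ys = ys
plusF xs [] = xs
plusF ((a, m) : xs) ((b, n) : ys)
  | a < b     = (a, m) : plusF xs ((b, n) : ys)
  | a > b     = (b, n) : plusF ((a, m) : xs) ys
  | otherwise = (a, m + n) : plusF xs ys
```

The sorted invariant makes commutativity visible on representations: adding the same elements in any order yields the same list.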

Abstract Semantics
The language of §3 has a canonical interpretation in any Cartesian closed category (C, 1, ×, ⇒), once we fix C-objects ⟦real^n⟧ to interpret real^n and C-morphisms ⟦op⟧ to interpret the operations op. We interpret (Cartesian) types τ and contexts Γ as C-objects ⟦τ⟧ and ⟦Γ⟧. We interpret terms Γ ⊢ t : τ as morphisms ⟦t⟧ in C(⟦Γ⟧, ⟦τ⟧). This is an instance of the universal property of Syn mentioned in §3.
We discuss how to extend ⟦−⟧ to apply to the full target language of §4. Suppose that D : C^op → Cat is a locally indexed category, i.e. a (strict) contravariant functor from C to the category Cat of categories, satisfying suitable conditions on objects. We call a D satisfying all these conditions a categorical model of the language of §4. If we choose D-objects ⟦real^n⟧ to interpret real^n and compatible D-morphisms ⟦lop⟧ to interpret the linear operations lop, then we can interpret linear types τ as objects ⟦τ⟧ of D. We can interpret τ ⊸ σ as the C-object ⟦τ ⊸ σ⟧ def= ⟦τ⟧ ⊸ ⟦σ⟧. Finally, we can interpret terms Γ ⊢ t : τ as morphisms ⟦t⟧ in C(⟦Γ⟧, ⟦τ⟧) and terms Γ; x : τ ⊢ t : σ as ⟦t⟧ in D(⟦Γ⟧)(⟦τ⟧, ⟦σ⟧). Observe that we interpret 0 and + using the biproduct structure of D.
• Objects of LSyn(τ) are linear types σ of our target language.
• Composition of x : τ; v : σ_1 ⊢ t : σ_2 and x : τ; w : σ_2 ⊢ s : σ_3 in LSyn(τ) is defined by the capture-avoiding substitution x : τ; v : σ_1 ⊢ s[t/w] : σ_3.
• All type formers are interpreted as would be expected based on their notation, using their introduction and elimination rules for the required structural isomorphisms.

Concrete Semantics
Diffeological Spaces. Throughout this paper, we will have an instance of the abstract semantics of our languages in mind, as we intend to interpret real^n as the usual Euclidean space R^n and to interpret each program x_1 : real^{n_1}, . . ., x_k : real^{n_k} ⊢ t : real^m as a smooth function. One challenge is that the usual settings for multivariate calculus and differential geometry do not form Cartesian closed categories, making the interpretation of higher types impossible (see (Huot et al. 2020, Appx. A)). One solution, recently employed by (Huot et al. 2020), is to work with diffeological spaces (Iglesias-Zemmour 2013; Souriau 1980), which generalise the usual notions of differentiability from Euclidean spaces and smooth manifolds to apply to higher types (as well as a range of other types, such as sum and inductive types). We will also follow this route and use such spaces to construct our concrete semantics. Other valid options for a concrete semantics exist: convenient vector spaces (Blute et al. 2012; Frolicher 1988), Frölicher spaces (Frölicher 1982), or synthetic differential geometry (Kock 2006), to name a few. We choose to work with diffeological spaces mostly because they seem to us to provide the simplest way to define and analyse the semantics of a rich class of language features. Diffeological spaces formalise the important intuition that a higher-order function is smooth if it sends smooth functions to smooth functions, meaning that we can never use it to build non-smooth first-order functions. This intuition is reminiscent of a logical relation, and it is realised by directly axiomatising smooth maps into the space, rather than treating smoothness as a derived property.
We think of plots as the maps that are axiomatically deemed "smooth". We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P_U^X, we have that p; f ∈ P_U^Y. We write Diff(X, Y) for the set of smooth maps from X to Y. Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions. We give some examples of such spaces.
Example 5.7 (Manifold diffeology). Given any open subset X of a Euclidean space R^n (or, more generally, a smooth manifold X), we can take the set of smooth (C^∞) functions U → X in the traditional sense as P_U^X. Given another such space X′, Diff(X, X′) coincides precisely with the set of smooth functions X → X′ in the traditional sense of calculus and differential geometry.
Put differently, the categories CartSp of Euclidean spaces and Man of smooth manifolds with smooth functions form full subcategories of Diff.
Example 5.8 (Product diffeology). Given diffeological spaces (X_i)_{i∈I}, we can equip ∏_{i∈I} |X_i| with the product diffeology.
Example 5.9 (Functional diffeology). Given diffeological spaces X, Y, we can equip Diff(X, Y) with the functional diffeology. Examples 5.8 and 5.9 give us the categorical product and exponential objects, respectively, in Diff. The embeddings of CartSp and Man into Diff preserve products (and coproducts).
We work with the concrete semantics, where we fix C = Diff as the target for interpreting Cartesian types and their terms. That is, by choosing the interpretation ⟦real^n⟧ def= R^n, and by interpreting each op ∈ Op^m_{n_1,...,n_k} as the smooth function ⟦op⟧ : R^{n_1} × . . . × R^{n_k} → R^m that it is intended to represent, we obtain a unique interpretation ⟦−⟧ : Syn → Diff.
(Commutative) Diffeological Monoids. To interpret linear types and their terms, we need a semantic setting D that is both compatible with Diff and enriched over the category of commutative monoids. We choose to work with commutative diffeological monoids, that is, commutative monoids internal to the category Diff.
We write Diff_CM for the category whose objects are commutative diffeological monoids and whose morphisms are smooth monoid homomorphisms. Given that Diff_CM is CMon-enriched, finite products are biproducts.
The real numbers R form a commutative diffeological monoid R by combining the standard diffeology with the usual commutative monoid structure (0, +). Similarly, N ∈ Diff_CM by equipping N with (0, +) and the discrete diffeology, in which plots are locally constant functions.
Example 5.12. We form the (categorical) product in Diff_CM of (X_i)_{i∈I} by equipping ∏_{i∈I} |X_i| with the product diffeology and product monoid structure.
Example 5.15. Given commutative diffeological monoids X and Y, we can define a commutative diffeological monoid X ⊸ Y of smooth monoid homomorphisms. In this paper, we will primarily be interested in X ⊸ Y as a diffeological space, and we will mostly disregard its monoid structure, until §9.3.
However, we do not need such a rich type system. For us, the following suffices. Define Diff_CM(X), for X ∈ ob Diff, to have the objects of Diff_CM; with suitable homsets, this defines a locally indexed category. By taking C = Diff and D(−) = Diff_CM(−), we obtain a concrete instance of our abstract semantics. Indeed, we have the required natural isomorphisms. The prime motivating examples of morphisms in this category are derivatives. Recall that the derivative at x, Df(x), and transposed derivative at x, (Df)^t(x), of a smooth function f : R^n → R^m are defined as the unique linear functions satisfying f(x + v) = f(x) + Df(x)(v) + o(‖v‖) and ⟨Df(x)(v), w⟩ = ⟨v, (Df)^t(x)(w)⟩, respectively. Indeed, derivatives Df(x) of f at x are linear functions, as are transposed derivatives (Df)^t(x). Both depend smoothly on x in case f is C^∞-smooth. Note that the derivatives are not merely linear in the sense of preserving 0 and +.
They are also multiplicative, in the sense that Df(x)(c · v) = c · Df(x)(v). We could have captured this property by working with vector spaces internal to Diff. However, we will not need this property to phrase or establish correctness of AD.
Therefore, we restrict our attention to the more straightforward structure of monoids.
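To make the pair of a derivative and its transpose concrete, consider the example function f(x, y) = (x·y, x + y) (our own choice): Df(p) is multiplication by the Jacobian J at p, (Df)^t(p) is multiplication by J^t, and the two are adjoint with respect to the inner product:

```haskell
-- Derivative of f(x,y) = (x*y, x+y) at p = (x,y): the linear map
-- v |-> J v, where J = [[y, x], [1, 1]] is the Jacobian at p.
df :: (Double, Double) -> (Double, Double) -> (Double, Double)
df (x, y) (v1, v2) = (y * v1 + x * v2, v1 + v2)

-- Transposed derivative at p: the linear map w |-> J^t w.
dft :: (Double, Double) -> (Double, Double) -> (Double, Double)
dft (x, y) (w1, w2) = (y * w1 + w2, x * w1 + w2)

-- Inner product on R^2, for checking <Df(p) v, w> = <v, (Df)^t(p) w>.
dot :: (Double, Double) -> (Double, Double) -> Double
dot (a, b) (c, d) = a * c + b * d
```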
By interpreting ⟦real^n⟧ def= R^n and by interpreting each operation lop ∈ LOp as the smooth function ⟦lop⟧ it is intended to represent, we obtain a canonical interpretation of our target language in Diff_CM.

PAIRING PRIMALS WITH THEIR TANGENTS/ADJOINTS, CATEGORICALLY
In this section, we show that any categorical model D : C^op → Cat of our target language gives rise to two Cartesian closed categories Σ_C D and Σ_C D^op. We believe these observations of Cartesian closure are novel. Surprisingly, they are highly relevant for obtaining a principled understanding of AD on a higher-order language: the former for forward AD, and the latter for reverse AD. Applying these constructions to the syntactic category LSyn : Syn^op → Cat of our language, we produce a canonical definition of the AD macros, as the canonical interpretation of the λ-calculus in the Cartesian closed categories Σ_Syn LSyn and Σ_Syn LSyn^op. In addition, when we apply this construction to the denotational semantics Diff_CM : Diff^op → Cat and invoke a categorical logical relations technique, known as subsconing, we find an elegant correctness proof of the source-code transformations. The abstract construction delineated in this section is in many ways the theoretical crux of this paper.

Grothendieck Constructions on Strictly Indexed Categories
Recall that for any strictly indexed category, i.e. a (strict) functor D : C^op → Cat, we can consider its total category (or Grothendieck construction) Σ_C D, which is a fibred category over C (see (Johnstone 2002, sections A1.1.7, B1.3.1)). We can view it as a Σ-type of categories, which generalizes the Cartesian product. Concretely, its objects are pairs of an object of C and an object of the corresponding fibre. We examine the categorical structure present in Σ_C D and Σ_C D^op. As this structure is of such importance in our development, we discuss it in detail.
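Concretely, the homsets of the two total categories admit the following standard description (reconstructed from the general theory of the Grothendieck construction, so stated here only as a guide):

```latex
% Objects of both \Sigma_C D and \Sigma_C D^{op}: pairs (A, X) with
% A \in \mathrm{ob}\, C and X \in \mathrm{ob}\, D(A).
\Sigma_C D\big((A,X),(B,Y)\big)
  \;=\; \{\, (f, g) \mid f \in C(A,B),\; g \in D(A)\big(X,\, D(f)(Y)\big) \,\}
\\[4pt]
\Sigma_C D^{op}\big((A,X),(B,Y)\big)
  \;=\; \{\, (f, g) \mid f \in C(A,B),\; g \in D(A)\big(D(f)(Y),\, X\big) \,\}
\\[4pt]
% Composition reindexes the second component along the first,
% e.g. in \Sigma_C D:
(f, g);(f', g') \;=\; \big(f ; f',\; g \,;\, D(f)(g')\big)
```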

Proposition (Cartesian closure of Σ C D). We have natural bijections exhibiting finite products and exponentials in Σ C D. We observe that we need D to have biproducts (equivalently: to be CMon-enriched) in order to show Cartesian closure. Further, we need linear ⇒-types and Cartesian ⊸-types.

Proposition (Cartesian closure of Σ C D op). We have natural bijections exhibiting finite products and exponentials in Σ C D op. Observe that we need the biproduct structure of D to construct finite products in Σ C D op. Further, we need Cartesian ⊸-types and !(−) ⊗ (−)-types to construct exponentials, but not biproducts. For the AD transformations to be correct, it is important that these derivatives of language primitives are implemented correctly, in the sense that they denote the mathematical derivatives of the operations they differentiate. In practice, AD library developers tend to assume the subtle task of correctly implementing such derivatives Dop(x; v) and Dop t (x; v) whenever a new primitive operation op is added to the library. The extension of the AD macros − → D and ← − D to the full source language is now canonically determined, as the unique Cartesian closed functors that extend the previous definitions, following the categorical structure described in §6. Because of the counter-intuitive nature of the Cartesian closed structures on Σ Syn LSyn and Σ Syn LSyn op, we list the full macros explicitly in Appx. A.

PROVING REVERSE AND FORWARD AD DENOTATIONALLY CORRECT
In this section, we will show that the source-code transformations described in §7 correctly implement mathematical derivatives. We make correctness precise as the statement that, for programs x : τ ⊢ t : σ between first-order types τ and σ, i.e. types not containing any function type constructors, the semantics of the generated code equals the (transposed) derivative of the semantics of t. The proof consists of logical relations arguments over the semantics in Σ Diff Diff CM and Σ Diff Diff CM op. This logical relations proof can be phrased in elementary terms, but the resulting argument is very technical and would be hard to discover. Instead, we prefer to phrase it in terms of a categorical subsconing construction, a more abstract and elegant perspective on logical relations. We discovered the proof by taking this categorical perspective, and, while we have verified the elementary argument (see Appx. B), we would not otherwise have come up with it.

Preliminaries
Subsconing. Logical relations arguments provide a powerful proof technique for demonstrating properties of typed programs. The arguments proceed by induction on the structure of types. Here, we briefly review the basics of categorical logical relations arguments, or subsconing constructions.
We restrict to the level of generality that we need here, but we would like to point out that the theory applies much more generally.
Consider a Cartesian closed category (C, 1, ×, ⇒). Suppose that we are given a functor F : C → Set to the category Set of sets and functions which preserves finite products in the sense that F(1) ≅ 1 and F(C × C′) ≅ F(C) × F(C′). Then, we can form the subscone of F, or category of logical relations over F, which is Cartesian closed (Johnstone et al. 2007): • objects are pairs (C, P) of an object C of C and a predicate P ⊆ F(C); • morphisms (C, P) → (C′, P′) are C-morphisms f : C → C′ which respect the predicates in the sense that F(f)(P) ⊆ P′; • identities and composition are as in C; • (1, F(1)) is the terminal object, and products and exponentials are computed as in C on the first component, with the predicate on a product defined componentwise and the predicate on an exponential consisting of those elements whose evaluation maps related arguments to related results. Forgetting about the predicates gives a faithful Cartesian closed functor π 1 from the subscone to C. In typical applications, C can be the syntactic category of a language (like Syn), the codomain of a denotational semantics ⟦−⟧ (like Diff), or a product of the above, if we want to consider n-ary logical relations instead of logical predicates. Typically, F tends to be a hom-functor (which always preserves products), like C(1, −) or C(C 0 , −), for some important object C 0. When applied to the syntactic category Syn and F = Syn(1, −), the formulae for products and exponentials in the subscone clearly reproduce the usual recipes in traditional, syntactic logical relations arguments. As such, subsconing generalises standard logical relations methods.
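For reference, the componentwise formulas for the subscone's finite products and exponentials (standard, and not specific to this paper) can be sketched as follows, silently applying the product-preservation isomorphisms of F:

```latex
(C,P) \times (C',P') = \bigl(C \times C',\ \{\, z \in F(C \times C') \mid \pi_1\,z \in P \text{ and } \pi_2\,z \in P' \,\}\bigr)
\qquad
(C,P) \Rightarrow (C',P') = \bigl(C \Rightarrow C',\ \{\, f \in F(C \Rightarrow C') \mid \forall p \in P.\ F(\mathrm{ev})\langle f, p\rangle \in P' \,\}\bigr)
```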

Subsconing for Correctness of AD
We will apply the subsconing construction above to the semantics of our languages, where we write D f for the semantic derivative of f (see §5). We need to verify that the interpretations of real n and of all operations op respect the logical relations. This follows immediately from the chain rule for multivariate differentiation, as long as we have implemented our derivatives correctly for the basic operations op: ⟦Dop(x; v)⟧ = D⟦op⟧ and ⟦(Dop) t (x; v)⟧ = (D⟦op⟧) t.
Since derivatives of tuple-valued functions are computed component-wise, the logical relations at a product real n 1 * ... * real n k decompose as products of the relations at each real n i. (In fact, the corresponding facts hold more generally for any first-order type, as an iterated product of real.) Suppose that (f, (g, h)) ∈ P f real n 1 * ... * real n k, i.e. g = f and h = D f. Then, using the chain rule in the last step, composition with ⟦op⟧ preserves the relation; similarly, for the reverse relations, the chain rule and basic linear algebra show that composition with ⟦op⟧ preserves the relation at real m. Consequently, we obtain our Cartesian closed functors ⟦−⟧ f and ⟦−⟧ r.
Further, observe that we have a Cartesian closed functor Σ ⟦−⟧ ⟦−⟧ : Σ Syn LSyn → Σ Diff Diff CM. Similarly, we get a Cartesian closed functor Σ ⟦−⟧ ⟦−⟧ op : Σ Syn LSyn op → Σ Diff Diff CM op. As a consequence, the two squares below commute. Indeed, going around the squares in both directions defines Cartesian closed functors that agree in their action on real n and all operations op. Therefore, by the universal property of Syn, they must coincide. In particular, (⟦t⟧, ⟦ − → D (t)⟧) is a morphism in − −−−− → SScone and therefore respects the logical relations P f for any well-typed term t of the source language of §3. Similarly, (⟦t⟧, ⟦ ← − D (t)⟧) is a morphism in ← −−−− − SScone and therefore respects the logical relations P r. Most of the work is now in place to show correctness of AD. We finish the proof below. To ease notation, we work with terms in a context with a single type. Doing so is not a restriction, as our language has products, and the theorem holds for arbitrary terms between first-order types. Theorem 8.1 (Correctness of AD). For programs x : τ ⊢ t : σ between first-order types τ and σ, we have ⟦ − → D (t) 1 ⟧ = ⟦t⟧ and ⟦ − → D (t) 2 ⟧ = D⟦t⟧, as well as ⟦ ← − D (t) 1 ⟧ = ⟦t⟧ and ⟦ ← − D (t) 2 ⟧(x) = (D⟦t⟧(x)) t, where we write D for the usual calculus derivative and (−) t for the matrix transpose.

Proof. First, we consider forward AD. As τ is a first-order type, ⟦τ⟧ ≅ R N (for some N). Given x, v ∈ R N, there is a smooth curve γ : R → ⟦τ⟧ such that γ(0) = x and Dγ(0)(1) = v. Clearly, (γ, (γ, Dγ)) ∈ P f, where we use the definition of composition in Diff × Σ Diff Diff CM. Therefore, γ; ⟦t⟧ = γ; ⟦ − → D (t) 1 ⟧ and, by the chain rule, the corresponding equation holds for the tangent components. Evaluating the former at 0 gives ⟦t⟧(x) = ⟦ − → D (t) 1 ⟧(x). Similarly, evaluating the latter at 0 and 1 gives D⟦t⟧(x)(v) = ⟦ − → D (t) 2 ⟧(x)(v). Next, we turn to reverse AD, again with ⟦τ⟧ ≅ R N (for some N). Let γ i : R → ⟦τ⟧ be a smooth curve such that γ i (0) = x and Dγ i (0)(1) = e i, where we write e i for the i-th standard basis vector of R N. Then (γ i, (γ i, . . .)) ∈ P r and, by the chain rule, evaluating at 0 gives ⟦t⟧(x) = ⟦ ← − D (t) 1 ⟧(x). Similarly, evaluating at 0 and v gives us e i • D⟦t⟧(x) t (v) = e i • ⟦ ← − D (t) 2 ⟧(x)(v). As this equation holds for all basis vectors e i, we find that D⟦t⟧(x) t (v) = ⟦ ← − D (t) 2 ⟧(x)(v).

PRACTICAL RELEVANCE AND IMPLEMENTATION IN FUNCTIONAL LANGUAGES
Most popular functional languages, such as Haskell and O'Caml, do not natively support linear types.As such, the transformations described in this paper may seem hard to implement.However, as we will argue in this section, we can easily implement the limited linear types necessary for phrasing the transformations as abstract data types by using merely a basic module system.Specifically, we explain how to implement !(−) ⊗ (−)-and Cartesian (−) ⊸ (−)-types.We first convey some intuitions, and then we discuss the required API, the AD transformations, their semantics and correctness, and, finally, we explain how the API can be implemented.
Based on the denotational semantics, τ ⊸ σ-types should hold (representations of) functions f from τ to σ that are homomorphisms of the monoid structures on τ and σ. We will see that these types can be implemented using an abstract data type that holds certain basic linear functions (extensible as the library evolves) and is closed under identity, composition, argument swapping, and currying (to be discussed later). Again, based on the semantics, !τ ⊗ σ should contain (representations of) finite multisets Σ n i=1 δ (t i ,s i ) of pairs (t i, s i), where t i is of type τ and s i is of type σ, and where we identify xs + δ (t,s) + δ (t,s′) and xs + δ (t,s+s′).

9.1 An Alternative, Applied Target Language for AD Based on Abstract Data Types

Next, we discuss an extension of the source language of §3 with two abstract data type formers LFun and Tens, as it can serve as an alternative, applied target language for our transformation.
This language is essentially equivalent to that of §4, but it no longer distinguishes between linear and Cartesian types. To be precise, we extend the source language with types and terms for linear functions and linear pairing, which are typed according to the rules of Fig. 5. We can use this extension of the source language as an alternative target language for our AD transformations. In fact, we could define a translation (−) † from our linear target language to this language that relates the AD macros on both languages and is semantics preserving. To do so, we define the translation on the linear type formers and extend (−) † structurally recursively, letting it preserve all other type formers. We believe an interested reader can fill in the details. Instead of deriving correctness of AD on the applied target language via this translation, we will give an explicit logical relations proof, in Appx. B, as it will be a useful tool for further extensions to the language, such as the extension with higher-order primitive operations that we consider in §9.6.
9.2 AD Macros Targeting the Applied Language with Abstract Types

Assume that we have chosen suitable terms x : Dom(op) ⊢ Dop(x) : LFun(Dom(op), real m ) and x : Dom(op) ⊢ Dop t (x) : LFun(real m , Dom(op)) for representing the forward and reverse derivatives of operations op ∈ Op m n 1 ,...,n k. For forward AD, we translate each type τ into a pair of types and each term x : τ ⊢ t : σ into a pair of terms. For reverse AD, we likewise translate each type τ into a pair of types and each term x : τ ⊢ t : σ into a pair of terms. We emphasise that this generated code is intended to be compiled by an optimizing compiler. Indeed, leveraging such existing compiler toolchains is one of the prime motivations for this work.
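For intuition, the forward-mode translation on a tiny first-order fragment can be mimicked by pairing each primal value with a tangent propagated by the chain rule, as in the following Python sketch (the function names are ours and purely illustrative):

```python
import math

# Each translated term returns a pair (primal, tangent), mirroring the
# pair of terms (D(t)_1, D(t)_2) produced by the forward-mode macro.

def fwd_sin(x, dx):
    # Dsin(x; v) = cos(x) * v
    return math.sin(x), math.cos(x) * dx

def fwd_mul(x, dx, y, dy):
    # product rule: D(*)((x, y); (v, w)) = x * w + y * v
    return x * y, x * dy + y * dx

# d/dx [sin(x) * x] at x = 2 is sin(2) + 2 * cos(2)
x, dx = 2.0, 1.0
s, ds = fwd_sin(x, dx)
p, dp = fwd_mul(s, ds, x, dx)
assert abs(dp - (math.sin(2.0) + 2.0 * math.cos(2.0))) < 1e-12
```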
Here, we use the commutative monoid structure on the homomorphism spaces {|τ|} ⊸ {|σ|}, which we described in Ex. 5.14. We extend the semantics of Syn's terms to the applied target language (noting that the interpretation {|−|} of Syn's terms as Diff-morphisms can also serve as a well-typed interpretation in Diff non-lin CM, given our chosen interpretation of objects). The interpretation of lcur −1 t is well-defined, for two reasons: first, {|t|} is linear in its last argument by its type; second, + is commutative and associative.
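The commutative monoid structure on these homomorphism spaces is simply the pointwise one; a minimal illustrative Python sketch (names ours, not the paper's):

```python
# Zero map and pointwise sum: the commutative monoid structure on a
# space of (assumed-linear) functions, as used to accumulate adjoints.
def lzero():
    return lambda x: 0.0

def lplus(f, g):
    # the pointwise sum of two linear maps is again linear
    return lambda x: f(x) + g(x)

h = lplus(lambda x: 2.0 * x, lambda x: 3.0 * x)
assert h(4.0) == 20.0
assert lzero()(7.0) == 0.0
```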

A Correctness Proof of AD for the Applied Target Language
With a semantics in place, we can again give a correctness proof of AD. This time, we write out the logical relations proof by hand. It is essentially the unraveling of the categorical subsconing argument of §8. Appx. B contains the full proof. Here, we outline the structure.
Correctness of Forward AD. By induction on the structure of types, we construct a logical relation. Then, we establish the following fundamental lemma.
The proof goes via induction on the typing derivation of t. Next, the correctness theorem follows by exactly the argument in the proof of Thm. 8.1. Correctness of Reverse AD. We define, by induction on the structure of types, a logical relation. Then, we establish the following fundamental lemma.
The proof goes via induction on the typing derivation of t. Again, the correctness theorem then follows by exactly the argument in the proof of Thm. 8.1.
Theorem 9.4 (Correctness of Reverse AD). For any typed term x : τ ⊢ t : σ in Syn, where τ and σ are first-order types, we have that {| ← − D (t) 1 |} = {|t|} and {| ← − D (t) 2 |}(x) = (D{|t|}(x)) t.

How to Implement the API of the Applied Target Language
We observe that we can implement the API of our applied target language, as follows, in a language that extends the source language with types List(τ) of lists of elements of type τ and a mechanism for creating abstract types, such as a basic module system as found in Haskell (or, a fortiori, O'Caml). Indeed, we implement LFun(τ, σ) under the hood, for example, as τ → σ and Tens(τ, σ) as List(τ * σ). The idea is that LFun(τ, σ), which arose as a right adjoint in our linear language, is essentially a subtype of τ → σ. On the other hand, Tens(τ, σ), which arose as a left adjoint, is a quotient type of List(τ * σ). We achieve the desired subtyping and quotient typing by exposing only the API of Fig. 5 and hiding the implementation. We can then implement this interface as follows. Here, we write [ ] for the empty list, t :: s for the list consisting of s with t prepended on the front, and fold t over x in s from acc = init for (right) folding an operation t over a list s, starting from init. Further, the implementer of the AD library can determine which linear operations lop to include within the implementation of LFun. We expect these linear operations to include various forms of dense and sparse matrix-vector multiplication as well as code for computing Jacobian-vector and Jacobian-adjoint products for the operations op that avoids having to compute the full Jacobian.
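The encapsulation described above can be mimicked in any language with some hiding mechanism. The following Python sketch (all names are ours, not the paper's) keeps the representations private by convention and exposes only constructors and eliminators:

```python
# A minimal sketch of the applied target language's API:
# LFun(tau, sigma) implemented as tau -> sigma, Tens(tau, sigma) as a
# list of pairs, both hidden behind an abstraction boundary.

class LFun:
    """Opaque wrapper for a function *assumed* to be linear, i.e. a
    homomorphism of the commutative-monoid structures."""
    def __init__(self, f):
        self._f = f                       # hidden representation: a plain function

    def apply(self, x):                   # linear application
        return self._f(x)

    def compose(self, other):             # composition self ; other is linear
        return LFun(lambda x: other._f(self._f(x)))

lid = LFun(lambda x: x)                   # the identity is linear

class Tens:
    """Opaque wrapper for !tau (x) sigma: a formal sum of pairs,
    represented under the hood as List(tau * sigma)."""
    def __init__(self, pairs):
        self._pairs = list(pairs)

def lpair(t, s):
    """Inject a single pair as a singleton formal sum."""
    return Tens([(t, s)])

def tens_fold(f, tens, zero, plus):
    """Eliminate !tau (x) sigma by folding f over the formal sum,
    combining with the monoid (zero, plus). Well-definedness on the
    quotient relies on plus being commutative and associative and on
    f being linear in its second argument."""
    acc = zero
    for (t, s) in tens._pairs:
        acc = plus(acc, f(t, s))
    return acc
```

For example, the quotient identifying xs + δ(t,s) + δ(t,s′) with xs + δ(t,s+s′) is respected by any such fold: folding t*s over [(1.0, 2.0), (1.0, 3.0)] and over the single pair (1.0, 5.0) yields the same result.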
This implementation shows that the applied target language is pure and terminating, as is standard for a λ-calculus extended with lists and some total primitive operations. For completeness, we describe, in Appx. C, the implied big-step operational semantics and prove its adequacy with respect to the denotational semantics {|−|}.
In a principled approach to building a define-then-run AD library, we would shield this implementation using the abstract data types Tens(τ, σ) and LFun(τ, σ) as we describe, both for reasons of type safety and because it conveys the intuition behind the algorithm and its correctness. However, nothing stops library implementers from exposing the full implementation. In fact, this seems to be the approach (Vytiniotis et al. 2019) have taken. A downside of that "exposed" approach is that the transformations then no longer respect equational reasoning principles.

9.6 Is this practically relevant? Why exclude map, fold, etc. from your source language?

The aim of this paper is to answer the foundational question of how to perform (reverse) AD at higher types. The problem of how to perform AD of evaluation and currying is highly challenging. For this reason, we have devoted this paper to explaining a solution to that problem in detail, working with a toy language with ground types of black-box, sized arrays real n with some first-order operations op. However, many of the interesting applications only arise once we can use higher-order operations such as map and fold on real n.
Our definitions and correctness proofs extend to this setting with higher-order primitives. We plan to discuss and implement them in detail in an applied follow-up paper. For example, if we add higher-order operations map ∈ Syn((real → real) * real n , real n ) to the source language, to "map" functions over the black-box arrays, we can define their forward and reverse derivatives making use of the standard functional programming idioms zip and zipWith. We assume that we are working internal to the module defining LFun(τ, σ) and Tens(τ, σ) as we are implementing derivatives of language primitives. As such, we can operate directly on their internal representations, which we simply assume to be plain functions and lists of pairs. For a correctness proof, see Appx. D. Applications frequently require AD of higher-order primitives such as differential and algebraic equation solvers, e.g. for use in pharmacological modelling in Stan (Tsiros et al. 2019). Currently, derivatives of such primitives are derived using the calculus of variations (and implemented with define-by-run AD) (Betancourt et al. 2020; Hannemann-Tamas et al. 2015). Our proof method provides a more light-weight and formal method for calculating, and establishing the correctness of, derivatives for such higher-order primitives. Indeed, most formalizations of the calculus of variations use infinite-dimensional vector spaces and are technically involved (Kriegl and Michor 1997).
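As a sketch of what such derivatives of map can look like when operating directly on the internal representations (plain functions and lists), consider the following Python fragment; for simplicity it tracks only the tangent/adjoint flowing through the array argument, and all names are ours rather than the paper's:

```python
def map_primal(f, xs):
    return [f(x) for x in xs]

def d_map(df, xs, dxs):
    # forward: the Jacobian of map f at xs is diagonal, so the
    # pushforward is a zipWith of the scalar derivative df with the
    # tangent array dxs
    return [df(x) * dx for x, dx in zip(xs, dxs)]

def d_map_t(df, xs, ws):
    # reverse: the same diagonal Jacobian makes the pullback a zipWith;
    # the adjoint contribution for the mapped function itself would be
    # the list of pairs (x_i, w_i), i.e. an element of a Tens type
    return [df(x) * w for x, w in zip(xs, ws)], list(zip(xs, ws))

square, dsquare = (lambda x: x * x), (lambda x: 2.0 * x)
assert map_primal(square, [1.0, 2.0]) == [1.0, 4.0]
assert d_map(dsquare, [1.0, 2.0], [1.0, 1.0]) == [2.0, 4.0]
```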

RELATED WORK
This work is closely related to (Huot et al. 2020), which introduced a similar semantic correctness proof for a version of forward-mode AD, using a subsconing construction. A major difference is that this paper also phrases and proves correctness of reverse-mode AD on a λ-calculus and relates reverse-mode to forward-mode AD. Using a syntactic logical relations proof instead, (Barthe et al. 2020) also proves correctness of forward-mode AD. Again, it does not address reverse AD.
(Cockett et al. 2020) proposes a similar construction to that of §6, and it relates this construction to the differential λ-calculus.
This paper develops sophisticated axiomatics for semantic reverse differentiation. However, it neither relates the semantics to a source-code transformation, nor discusses differentiation of higher-order functions.
Importantly, (Elliott 2018) describes and implements what are essentially our source-code transformations, though they were restricted to first-order functions and scalars. (Vytiniotis et al. 2019) sketches an extension of the reverse-mode transformation to higher-order functions in essentially the same way as proposed in this paper. It does not motivate or derive the algorithm or show its correctness. Nevertheless, this short paper discusses important practical considerations for implementing the algorithm, and it discusses a dependently typed variant of the algorithm.
Next, there are various lines of work relating to correctness of reverse-mode AD, which we consider less similar to our work. For example, (Mak and Ong 2020) define and prove correct a formulation of reverse-mode AD on a higher-order language that depends on a non-standard operational semantics, essentially a form of symbolic execution. (Abadi and Plotkin 2020) does something similar for reverse-mode AD on a first-order language extended with conditionals and iteration. (Brunel et al. 2020) defines an AD algorithm in a simply typed λ-calculus with linear negation and proves it correct using operational techniques. Further, they show that this algorithm corresponds to reverse-mode AD if one uses a non-standard operational semantics. These formulations of reverse-mode AD all depend on non-standard run-times and hence fall into the category of "define-by-run" formulations of reverse-mode AD. Meanwhile, we are concerned with "define-then-run" formulations: source-code transformations producing differentiated code at compile-time, which can then be optimized during compilation with existing compiler tool-chains.
Finally, there is a very long history of work on reverse-mode AD, though almost none of it applies the technique to higher-order functions. A notable exception is (Pearlmutter and Siskind 2008), which gives an impressive implementation of reverse AD as a source-code transformation in Scheme. While very efficient, this implementation crucially uses mutation. Moreover, the transformation is complex and correctness is not considered. More recently, (Wang et al. 2019) describes a much simpler implementation of a reverse AD code transformation, again very performant. However, the transformation is quite different from the one considered in this paper, as it relies on a combination of delimited continuations and mutable state. Correctness is not considered, perhaps because of the semantic complexities introduced by impurity.
Our work adds to the existing literature by presenting (to our knowledge) the first principled and pure define-then-run reverse AD algorithm for a higher-order language, by arguing its practical applicability, and by proving semantic correctness of the algorithm.

A DEFINING THE CORE ALGORITHMS: AD SOURCE-CODE TRANSFORMATIONS
In particular, Σ Syn LSyn and Σ Syn LSyn op are both Cartesian closed categories. Hence, by the universal property of Syn, we obtain unique structure-preserving macros − → D (−) : Syn → Σ Syn LSyn (forward AD) and ← − D (−) : Syn → Σ Syn LSyn op (reverse AD) once we fix a compatible definition on basic types real n and on basic operations op. That is, we need to choose suitable terms Dop(x; v) and Dop t (x; v) below to represent the forward- and reverse-mode derivatives of the basic operations op ∈ Op m n 1 ,...,n k. We choose these representations of derivatives as they allow for efficient Jacobian-vector and Jacobian-adjoint products, which are known to be important to achieve performant AD implementations.
For the AD transformations to be correct, it is important that these derivatives of language primitives are implemented correctly, in the sense that they denote the mathematical derivative D⟦op⟧ and its transpose (D⟦op⟧) t. The implementation of such derivatives for language primitives is a subtle task that is constantly undertaken in practice by AD library developers, whenever a new primitive operation is added to the library. The extension of the AD macros − → D and ← − D to the full source language is now determined canonically as the unique Cartesian closed functors extending the previous definitions. However, because of the counter-intuitive nature of the Cartesian closed structures on Σ Syn LSyn and Σ Syn LSyn op, we still consider it worthwhile to list the resulting definitions here, particularly as these transformations lend themselves well to implementation and are highly practically relevant.
B A MANUAL CORRECTNESS PROOF OF AD THROUGH SEMANTIC LOGICAL RELATIONS

Correctness of Forward AD. By induction on the structure of types, we construct a logical relation. Then, we establish the following fundamental lemma.

Proof. We prove this by induction on the typing derivation of well-typed terms. We start with the cases of ev and Λ(t), as they are by far the most interesting. Consider ev ∈ Syn((τ → σ) * τ, σ).
Proc. ACM Program. Lang., Vol. 1, No. CONF, Article 1. Publication date: January 2018.
Consider the unique map into the unit type in Syn(τ, 1). Suppose that (f, (g, h)) ∈ P τ. Then, we need to show that ((), ((), ())) ∈ P 1, but that holds by definition of P 1. Consider id ∈ Syn(τ, τ). Suppose that (f, (g, h)) ∈ P τ. Then, we need to show that (f, (g, h)) ∈ P τ, but that holds by assumption. Consider composition: suppose that t ∈ Syn(τ, σ) and s ∈ Syn(σ, ρ) both respect the logical relation. Then, t; s ∈ Syn(τ, ρ). Suppose that (f, (g, h)) ∈ P τ. We need to show that the composite again lies in the relation, and that follows from the fact that t and then s respect the logical relation. The base cases of operations use the crucial assumption that the derivatives of primitive operations are implemented correctly. Then, let (f, (g, h)) ∈ P real n 1 * ... * real n k. That is, (f, (g, h)) = ((f 1, . . ., f k ), ((g 1, . . ., g k ), x → r → (h 1 (x)(r), . . ., h k (x)(r)))), for (f i, (g i, h i )) ∈ P real n i, for 1 ≤ i ≤ k. We want to show that the image under op lies in P real m. By the assumption that (f i, (g i, h i )) ∈ P real n i, we have that g i = f i and h i = D f i. Using the chain rule for multivariate differentiation (and a little bit of linear algebra), the claim follows. Therefore, the fundamental lemma follows.
Next, the correctness theorem follows by exactly the argument in the proof of Thm. 8.1.

Theorem B.2 (Correctness of Forward AD). For any typed term x : τ ⊢ t : σ in Syn, where τ and σ are first-order types, we have that {| − → D (t) 1 |} = {|t|} and {| − → D (t) 2 |} = D{|t|}.

Correctness of Reverse AD. We define, by induction on the structure of types, a logical relation. Then, we establish the following fundamental lemma.
The proof goes by induction on the typing derivation of well-typed terms t ∈ Syn. Indeed, we first consider the cases of evaluation and currying, as they are the most interesting. Consider ev ∈ Syn((τ → σ) * τ, σ). Suppose that (f, (g, h)) ∈ P τ. We want to show that the image under ev lies in the relation. That is, we want to establish that for all (f′, (g′, h′)) ∈ P σ, the required identity holds, which is what we wanted to show! Next, we turn to product projections. We consider fst; the other projection is analogous. We have that fst ∈ Syn(τ * σ, τ). By linearity of h 2 in its second argument, which holds by virtue of its type, it is enough to show an equation which is true by assumption.
By linearity of h in its second argument, it is enough to show the corresponding equation. Next, we consider the unique map into the unit type in Syn(τ, 1). Given any (f, (g, h)) ∈ P τ, the required property follows as h is linear in its second argument by virtue of its type. Consider identities: id ∈ Syn(τ, τ). Suppose that (f, (g, h)) ∈ P τ. Then, we need to show that (f; {|id|}, (g; {|id|}, x → v → h(x)(v))) ∈ P τ. That is, (f, (g, x → v → h(x)(v))) ∈ P τ, which is true by assumption.
Consider composition: t ∈ Syn(τ, σ) and s ∈ Syn(σ, ρ), which both respect the logical relation in the sense of the fundamental lemma. Suppose that (f, (g, h)) ∈ P τ. We want to show that the composite lies in P ρ. Now, as t respects the logical relation, by our induction hypothesis, the image of (f, (g, h)) lies in P σ. Therefore, as s also respects the logical relation, by our induction hypothesis, the composite lies in P ρ. The base cases of operations hold by the chain rule, where we use the crucial assumption that the derivatives of primitive operations are implemented correctly. Indeed, consider op ∈ Syn(real n 1 * ... * real n k , real m ) and let (f, (g, h)) ∈ P real n 1 * ... * real n k. By the assumption that (f i, (g i, h i )) ∈ P real n i, we have that g i = f i and h i = (D f i ) t. Therefore, we need to show that ((f 1, . . ., f k ); {|op|}, ((f 1, . . ., f k ); {|op|}, . . .)) ∈ P real m. Using the chain rule for multivariate differentiation (and a little bit of linear algebra), this is equivalent to ((f 1, . . ., f k ); {|op|}, ((f 1, . . ., f k ); {|op|}, (D((f 1, . . ., f k ); {|op|})) t )) ∈ P real m.
Again, the correctness theorem then follows by exactly the argument in the proof of Thm. 8.1.
Theorem B.4 (Correctness of Reverse AD). For any typed term x : τ ⊢ t : σ in Syn, where τ and σ are first-order types, we have that {| ← − D (t) 1 |} = {|t|} and {| ← − D (t) 2 |}(x) = (D{|t|}(x)) t.

C OPERATIONAL SEMANTICS AND ADEQUACY FOR THE APPLIED TARGET LANGUAGE

C.1 Big-Step Semantics
For completeness, we describe the big-step operational semantics for the applied target language which is implied by our suggested implementation. Because of purity, the precise evaluation strategy is unimportant. (We use call-by-name evaluation.) We write t ⇓ N to indicate that a term t evaluates to normal form N. If no rule applies to a term t, we intend it to be a normal form (i.e. t ⇓ t). As normal forms are unique, we will write ⇓ t for the unique N such that t ⇓ N.
For example, the rule for primitive operations reads: if t ⇓ c and op ∈ Op, then op(t) ⇓ {|op|}(c). This is proved by induction on the structure of terms.
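As an illustration, a minimal call-by-name big-step evaluator in this style might look as follows in Python; the term encoding and the operation table are our own, not the paper's:

```python
# Terms are nested tuples; literals and lambdas are normal forms (t ⇓ t).
OPS = {'neg': lambda c: -c, 'sqr': lambda c: c * c}   # example primitive ops

def subst(t, x, s):
    """Capture-naive substitution t[s/x] (fine for closed arguments)."""
    tag = t[0]
    if tag == 'var':
        return s if t[1] == x else t
    if tag == 'lit':
        return t
    if tag == 'op':
        return ('op', t[1], subst(t[2], x, s))
    if tag == 'lam':
        return t if t[1] == x else ('lam', t[1], subst(t[2], x, s))
    return ('app', subst(t[1], x, s), subst(t[2], x, s))

def evaluate(t):
    """t ⇓ N, by structural recursion on t."""
    tag = t[0]
    if tag == 'op':                       # if t ⇓ c then op(t) ⇓ [[op]](c)
        c = evaluate(t[2])
        return ('lit', OPS[t[1]](c[1]))
    if tag == 'app':                      # call-by-name: substitute unevaluated
        f = evaluate(t[1])
        return evaluate(subst(f[2], f[1], t[2]))
    return t                              # lit, lam, var: normal forms
```

For example, applying λx. sqr(x) to neg(3) evaluates to the literal 9.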
This is proved by induction on the definition of ⇓: note that every operational rule is also an equation in the semantics.
Then, adequacy follows. In particular, it follows that the AD correctness proofs of this paper apply to this particular implementation technique.

D REVERSE AD OF HIGHER-ORDER OPERATIONS SUCH AS MAP
So far, we have considered our arrays of reals to be primitive objects which can only be operated on by first-order operations.Next, we show that our framework also lends itself to treating higher-order operations on these arrays.
We show correctness of this implementation again by extending the proof of our fundamental lemma with the inductive case for map. The correctness theorem then follows as before once the fundamental lemma has been extended.

The construction extends to apply to an extension of Syn 1 with tuples by extending the functor in the unique structure-preserving way. Unlike Syn 1, however, the extended language supports function types, and we now work with linear functions in the second component. Unlike before, both − → D [C] and ← − D [C] are now Cartesian closed (by §6)! Thus, we find the following corollary, by the universal property of Syn. This property states that any well-typed choice of interpretations F(op) of the primitive operations in a Cartesian closed category C extends to a unique Cartesian closed functor F : Syn → C. It gives a principled definition of AD and explains in what sense reverse AD is the "mirror image" of forward AD.

For reverse AD, an adjoint at function type τ → σ needs to keep track of the incoming adjoints of type ← − D (σ) 2 for each primal x of type ← − D (τ) 1 on which we call the function. We store these pairs (x, v) in the type ! ← − D (τ) 1 ⊗ ← − D (σ) 2 (which we will see is essentially a quotient of the type of lists of pairs of type ← − D (τ) 1 * ← − D (σ) 2 ). Less surprisingly, for forward AD, a tangent at function type τ → σ consists of a function sending each argument primal of type − → D (τ) 1 to the outgoing tangent of type − → D (σ) 2. This is a standard category of logical relations (or subscone), and it is widely known to inherit the Cartesian closure of Diff × − → D [Diff CM ]. It also comes equipped with a Cartesian closed functor − −−−− → SScone → Diff × − → D [Diff CM ]. Therefore, once we fix predicates P f real n on (⟦−⟧, − → D [⟦−⟧])(real n ) and show that all operations op respect these predicates, it follows that our denotational semantics lifts to give a unique structure-preserving functor on Syn such that the left diagram below commutes (by the universal property of Syn).

Fig. 2. Standard βη-laws for products and functions. We write #x 1 , ..., x n for the indicated tuple patterns. The interpretation ⟦−⟧ of the language of §4 in categorical models is both sound and complete with respect to the βη+-equational theory: t = βη+ s iff ⟦t⟧ = ⟦s⟧ in each such model. Soundness follows by case analysis on the βη+-rules. Completeness follows by the construction of the syntactic model LSyn : Syn op → Cat, which we describe next.
Further, given a strictly indexed category D : C op → Cat, we can consider its fibrewise dual category D op : C op → Cat, which is defined as the composition C op → Cat op → Cat of D with the dualization functor (−) op. Thus, we can apply the same construction to D op to obtain a category Σ C D op.

6.2 Categorical Structure of Σ C D and Σ C D op for Locally Indexed Categories

§6.1 applies, in particular, to the locally indexed categories of §5. In this case, we will analyze the categorical structure of Σ C D and Σ C D op. For reference, we first give a concrete description.

7 DEFINING THE CORE ALGORITHMS: AD SOURCE-CODE TRANSFORMATIONS

As Σ Syn LSyn and Σ Syn LSyn op are both Cartesian closed categories by §6, the universal property of Syn yields unique structure-preserving macros − → D (−) : Syn → Σ Syn LSyn (forward AD) and ← − D (−) : Syn → Σ Syn LSyn op (reverse AD), once we fix a compatible definition for the macros on real n and basic operations op. By definition of equality in Syn, Σ Syn LSyn and Σ Syn LSyn op, these macros automatically respect equational reasoning principles, in the sense that t = βη+ s implies − → D (t) = − → D (s) and ← − D (t) = ← − D (s). We need to choose suitable terms Dop(x; v) and Dop t (x; v) to represent the forward- and reverse-mode derivatives of the basic operations op ∈ Op m n 1 ,...,n k. For example, for elementwise multiplication ( * ) ∈ Op n n,n, we can define D( * )(x; v) = (fst x) * (snd v) + (snd x) * (fst v) and D( * ) t (x; v) = ⟨(snd x) * v, (fst x) * v⟩, where we use (linear) elementwise multiplication ( * ) ∈ LOp n n;n. We represent derivatives as linear functions. This representation allows for efficient Jacobian-vector/adjoint product implementations, which avoid first calculating a full Jacobian and then taking a product. Such implementations are known to be important to achieve performant AD systems. The derivative terms are typed as x : real n 1 * . . . * real n k ; v : real n 1 * . . . * real n k ⊢ Dop(x; v) : real m and x : real n 1 * . . . * real n k ; v : real m ⊢ Dop t (x; v) : real n 1 * . . . * real n k. (Here, ⟦−⟧ is the semantics of §5; the correctness proof mainly consists of logical relations arguments over the semantics in Σ Diff Diff CM and Σ Diff Diff CM op.)
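The defining property of the transposed derivative — ⟨w, D( * )(x; v)⟩ = ⟨D( * ) t (x; w), v⟩ — can be checked numerically. Here is a small self-contained Python check of the elementwise-multiplication derivatives above (plain lists stand in for real n; all function names are ours):

```python
# D(*)(x; v) = (fst x) * (snd v) + (snd x) * (fst v), as in the paper,
# and its transpose D(*)^t(x; w) = ((snd x) * w, (fst x) * w).
def d_mul(x, v):
    (a, b), (va, vb) = x, v
    return [ai * vbi + bi * vai for ai, bi, vai, vbi in zip(a, b, va, vb)]

def d_mul_t(x, w):
    a, b = x
    return [bi * wi for bi, wi in zip(b, w)], [ai * wi for ai, wi in zip(a, w)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = [1.0, 2.0], [3.0, 4.0]
va, vb = [0.5, -1.0], [2.0, 0.0]
w = [1.0, -2.0]
jvp = d_mul((a, b), (va, vb))
wa, wb = d_mul_t((a, b), w)
# defining property of the transpose: <w, Jv> = <J^t w, v>
assert abs(dot(w, jvp) - (dot(wa, va) + dot(wb, vb))) < 1e-12
```

Note that neither function materializes the (diagonal) Jacobian, illustrating why the linear-function representation supports efficient Jacobian-vector and Jacobian-adjoint products.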
Similarly for reverse AD, where we note that Diff, Σ Diff Diff CM, and Σ Diff Diff CM op are Cartesian closed (given the arguments of §5 and §6) and that the product of Cartesian closed categories is again Cartesian closed. Let us write − −−−− → SScone and ← −−−− − SScone for the resulting subscones. As these are Cartesian closed, we obtain unique Cartesian closed functors ⟦−⟧ f : Syn → − −−−− → SScone and ⟦−⟧ r : Syn → ← −−−− − SScone once we fix an interpretation of real n and all operations op. We write P f τ and P r τ, respectively, for the relations π 2 ⟦τ⟧ f and π 2 ⟦τ⟧ r. For reverse AD, we interpret real n via the relation {(f, (g, h)) | g = f and h = (D f) t }.
As (⟦t⟧, (⟦ − → D (t) 1 ⟧, ⟦ − → D (t) 2 ⟧)) respects the logical relation P f, we have the correctness equations for forward AD.

Fig. 5. Typing rules for the alternative target language.

9.3 Denotational Semantics for the Applied Target Language

Let us write Diff non-lin CM for the category whose objects are commutative diffeological monoids X, and whose morphisms X → Y are functions |X| → |Y| that are diffeological space morphisms, but that may fail to be monoid homomorphisms. We can give a denotational semantics {|−|} to the applied target language in this category by interpreting types τ as objects {|τ|} in Diff non-lin CM and terms Γ ⊢ t : τ as morphisms {|t|} in Diff non-lin CM ({|Γ|}, {|τ|}). We interpret types by making use of the categorical constructions on objects in Diff CM described in §5. For any typed term x : τ ⊢ t : σ in Syn, where τ and σ are first-order types, we have that {| − → D (t) 1 |} = {|t|} and {| − → D (t) 2 |} = D{|t|}.