Correctness of Automatic Differentiation via Diffeologies and Categorical Gluing

We present semantic correctness proofs of Automatic Differentiation (AD). We consider a forward-mode AD method on a higher order language with algebraic data types, and we characterise it as the unique structure preserving macro given a choice of derivatives for basic operations. We describe a rich semantics for differentiable programming, based on diffeological spaces. We show that it interprets our language, and we phrase what it means for the AD method to be correct with respect to this semantics. We show that our characterisation of AD gives rise to an elegant semantic proof of its correctness based on a gluing construction on diffeological spaces. We explain how this is, in essence, a logical relations argument. Finally, we sketch how the analysis extends to other AD methods by considering a continuation-based method.


Introduction
Automatic differentiation (AD), loosely speaking, is the process of taking a program describing a function, and building the derivative of that function by applying the chain rule across the program code.As gradients play a central role in many aspects of machine learning, so too do automatic differentiation systems such as TensorFlow [1] or Stan [4].Differentiation has a well developed mathematical theory in terms of differential geometry.The aim of this paper is to formalize this connection between differential geometry and the syntactic operations of AD.In this way we achieve two things: (1) a compositional, denotational understanding of differentiable programming and AD; (2) an explanation of the correctness of AD.
This intuitive correspondence (summarized in Fig. 1) is in fact rather complicated.In this paper we focus on resolving the following problem: higher order functions play a key role in programming, and yet they have no counterpart in traditional differential geometry.Moreover, we resolve this problem while retaining the compositionality of denotational semantics.
Higher order functions and differentiation.A major application of higher order functions is to support disciplined code reuse.Code reuse is particularly acute in machine learning.For example, a multi-layer neural network might be built of millions of near-identical neurons, as follows.) We can use these functions to build a network as follows (see also Fig. 2): comp layer m (neuron k ), comp layer n (neuron m ), neuron n : real k × P → real (1) Fig. 2. The network in (1) with k inputs and two hidden layers.
If we β-reduce all the λ's, we end up with a very long function expression just built from the sigmoid function and linear algebra.We can then find a program for calculating its derivative by applying the chain rule.However, automatic differentiation can also be expressed without first β-reducing, in a compositional way, by explaining how higher order functions like (layer) and (comp) propagate derivatives.This paper is a semantic analysis of this compositional approach.
The general idea of denotational semantics is to interpret types as spaces and programs as functions between the spaces.In this paper, we propose to use diffeological spaces and smooth functions [26,12] to this end.These satisfy the following three desiderata: -R is a space, and the smooth functions R → R are exactly the functions that are infinitely differentiable; -The set of smooth functions X → Y between spaces again forms a space, so we can interpret function types.-The disjoint union of a sequence of spaces again forms a space, so we can interpret variant types and inductive types.
We emphasise that the most standard formulation of differential geometry, using manifolds, does not support spaces of functions.Diffeological spaces seem to us the simplest notion of space that satisfies these conditions, but there are other candidates [2,27].A diffeological space is in particular a set X equipped with a chosen set of curves C X ⊆ X R and a smooth map f : X → Y must be such that if γ ∈ C X then γ; f ∈ C Y .This is remiscent of the method of logical relations.
From smoothness to automatic derivatives at higher types.Our denotational semantics in diffeological spaces guarantees that all definable functions are smooth.But we need more than just to know that a definable function happens to have a mathematical derivative: we need to be able to find that derivative.
In this paper we focus on a simple, forward mode automatic differentiation method, which is a macro translation on syntax (called We are able to show that it is correct, using our denotational semantics. Here there is one subtle point that is central to our development.Although differential geometry provides established derivatives for first order functions (such as neuron above), there is no canonical notion of derivative for higher order functions (such as layer and comp) in the theory of diffeological spaces (e.g.[5]).We propose a new way to resolve this, by interpreting types as triples (X, X ′ , S) where, intuitively, X is a space of inhabitants of the type, X ′ is a space serving as a chosen bundle of tangents over X, and S ⊆ X R × X ′R is a binary relation between curves, informally relating curves in X with their tangent curves in X ′ .This new model gives a denotational semantics for automatic differentiation.
In §3 we boil this new approach down to a straightforward and elementary logical relations argument for the correctness of automatic differentiation.The approach is explained in detail in §5.
Related work and context.AD has a long history and has many implementations.AD was perhaps first phrased in a functional setting in [20], and there are now a number of teams working on AD in the functional setting (e.g.[28,25,9]), some providing efficient implementations.Although that work does not involve formal semantics, it is inspired by intuitions from differential geometry and category theory.
Recently there have been promising efforts to give operational semantic accounts of AD (e.g.[22]).In independent recent work, an elaborate operational approach has been used to establish correctness at higher types [3].In the first order setting, there are also accounts of denotational semantics in manifolds [10] and synthetic differential geometry [6].We understand that the authors of [6] are working on higher types at present.
The differential λ-calculus [8] appears related to AD, though the connections have not yet been made precise as far as we are aware.One difference is that the differential λ-calculus allows addition on terms at all types, and hence vector space models are suitable, but this appears peculiar with the variant and inductive types that we consider here.
Finally we emphasise that we have chosen the neural network (1) as our running example mainly for its simplicity.There are many other examples of AD outside the neural networks literature: AD is useful whenever derivatives need to be calculated on high dimensional spaces.This includes optimization problems more generally, where the derivative is passed to a gradient descent method (e.g.[24,14,23,15,7,17]).Other applications of AD are in advanced integration methods, since derivatives play a role in Hamiltonian Monte Carlo [19,11] and variational inference [16].

Summary of contributions.
We have provided a semantic analysis of automatic differentiation.Our syntactic starting point is a well-known forward-mode AD macro on a typed higher order language (e.g.[25,28]).We recall this in §2 for function types, and in §4 we extend it to inductive types and variants.The main contributions of this paper are as follows.
-We give a denotational semantics for the language in diffeological spaces, showing that every definable expression is smooth ( §3).-We show correctness of the AD macro by a logical relations argument (Th.1).
-We give a categorical analysis of this correctness argument with two parts: canonicity of the macro in terms of syntactic categories, and a new notion of glued space that abstracts the logical relation ( §5).-We then use this analysis to state and prove a correctness argument at all first order types (Th.2).-We show that our method is not specific to one particular AD macro, by also considering a continuation-based AD method ( §6).

A simple forward-mode AD translation
Rudiments of differentiation and dual numbers.Recall that the derivative of a function f : R → R, if it exists, is a function ∇f : R → R such that ∇f (x 0 ) = df (x) dx (x 0 ) is the gradient of f at x 0 .To find ∇f in a compositional way, two generalizations are reasonable: -We need both f and ∇f when calculating ∇(f ; g) of a composition f ; g, using the chain rule, so we are really interested in the pair (f, ∇f ) : R → R × R; -In building f we will need to consider functions of multiple arguments, such as + : R 2 → R, and these functions should propagate derivatives.Thus we are more generally interested in transforming a function g : R n → R into a function h : (R×R) n → R×R in such a way that for any f 1 . . .f n : R → R, ( An intuition for h is often given in terms of dual numbers.The transformed function operates on pairs of numbers, (x, x ′ ), and it is common to think of such a pair as x + x ′ ǫ for an 'infinitesimal' ǫ.But while this is a helpful intuition, the formalization of infinitesimals can be intricate, and the development in this paper is focussed on the elementary formulation in (2).
In summary, the idea of differentiation with dual numbers is to transform a differentiable function g : R n → R to a function h : R 2n → R 2 which captures g and all its partial derivatives.The idea of forward mode automatic differentiation is to perform this transformation at the source code level.
A simple language of smooth functions.We consider a standard higher order typed language a first order type real of real numbers.The types (τ, σ) and terms (t, s) are as follows.3. We have included a minimal set of operations for the sake of illustration, but it is not difficult to add further operations.We add some simple syntactic sugar t − u def = t + (−1) * u.We intend ς to stand for the sigmoid function, ς(x) def = 1 1+e −x .We further include syntactic sugar let x = t in s for (λx.s) t and λ x 1 , . . ., x n .tfor λx.case x of x 1 , . . ., x n → t.
Syntactic automatic differentiation: a functorial macro.The aim of forward mode AD is to find the dual numbers representation of a function by syntactic manipulations.For our simple language, we implement this as the following inductively defined macro − → D on both types and terms (see also [28,25]): We extend Example 1 (Inner products).Let us write τ n for the n-fold product (τ * . . .* τ ).Then, given Γ ⊢ t, s : real n we can define their inner product Example 2 (Neural networks).In our introduction (1), we provided a program in our language to build a neural network out of expressions neuron, layer, comp; this program makes use of the inner product of Ex. 1.We can similarly calculate − → D of such deep neural nets by mechanically applying the macro.

Semantics of differentiation
Consider for a moment the first order fragment of the language in § 2, with only one type, real, and no λ's or pairs.This has a simple semantics in the category of cartesian spaces and smooth maps.Indeed, a term x 1 . . .x n : real ⊢ t : real has a natural reading as a function t : R n → R by interpreting our operation symbols by the well-known operations on R n → R with the corresponding name.In fact, the functions that are definable in this first order fragment are smooth, which means that they are continuous, differentiable, and their derivatives are continuous, differentiable, and so on.Let us write CartSp for this category of cartesian spaces (R n for some n) and smooth functions.
The category CartSp has cartesian products, and so we can also interpret product types, tupling and pattern matching, giving us a useful syntax for constructing functions into and out of products of R. For example, the interpretation of (neuron n ) in (1) becomes where • n , + and ς are the usual inner product, addition and the sigmoid function on R, respectively.
Inside this category, we can straightforwardly study the first order language without λ's, and automatic differentiation.In fact, we can prove the following by plain induction on the syntax: The interpretation of the (syntactic) forward AD − → D (t) of a first-order term t equals the usual (semantic) derivative of the interpretation of t as a smooth function.
However, as is well known, the category CartSp does not support function spaces.To see this, notice that we have polynomial terms x n y n : real → real for each d, and so if we could interpret (real → real) as a Euclidean space R p then, by interpreting these polynomial expressions, we would be able to find continuous injections R d → R p for every d, which is topologically impossible for any p, for example as a consequence of the Borsuk-Ulam theorem (see Appendix A).
This means that we cannot interpret the functions (layer) and (comp) from (1) in CartSp, as they are higher order functions, even though they are very useful and innocent building blocks for differential programming!Clearly, we could define neural nets such as (1) directly as smooth functions without any higher order subcomponents, though that would quickly become cumbersome for deep networks.A problematic consequence of the lack of a semantics for higher order differential programs is that we have no obvious way of establishing compositional semantic correctness of − → D for the given implementation of (1).
Diffeological spaces.This motivates us to turn to a more general notion of differential geometry for our semantics, based on diffeological spaces [12].
The key idea will be that a higher order function is called smooth if it sends smooth functions to smooth functions, meaning that we can never use it to build first order functions that are not smooth.For example, (comp) in (1) has this property.Definition 1.A diffeological space (X, P X ) consists of a set X together with, for each n and each open subset U of R n , a set P U X ⊆ [U → X] of functions, called plots, such that -all constant functions are plots; We call a function f : X → Y between diffeological spaces smooth if, for all plots p ∈ P U X , we have that p; f ∈ P U Y .We write Diff (X, Y ) for the set of smooth maps from X to Y .Smooth functions compose, and so we have a category Diff of diffeological spaces and smooth functions.
A diffeological space is thus a set equipped with structure.Many constructions of sets carry over straightforwardly to diffeological spaces.

Example 3 (Cartesian diffeologies).
Each open subset U of R n can be given the structure of a diffeological space by taking all the smooth functions V → U as P V U .It is easily seen that smooth functions from V → U in the traditional sense coincide with smooth functions in the sense of diffeological spaces.Thus diffeological spaces have a profound relationship with ordinary calculus.
In categorical terms, this gives a full embedding of CartSp in Diff .
Example 4 (Product diffeologies).Given a family (X i ) i∈I of diffeological spaces, we can equip the product i∈I X i of sets with the product diffeology in which U -plots are precisely the functions of the form (p i ) i∈I for p i ∈ P U Xi .This gives us the categorical product in Diff .
Example 5 (Functional diffeology).We can equip the set Diff (X, Y ) of smooth functions between diffeological spaces with the functional diffeology in which Uplots consist of functions f : This specifies the categorical function object in Diff .
Semantics and correctness of AD.We can now give a denotational semantics to our language from § 2. We interpret each type τ as a set τ equipped with the relevant diffeology, by induction on the structure of types: . Now well typed terms Γ ⊢ t : τ are interpreted as smooth functions t : Γ → τ , giving a meaning for t for every valuation of the context.This is routinely defined by induction on the structure of typing derivations.Constants c : real are interpreted as constant functions; and the first order operations (+, * , ς) are interpreted by composing with the corresponding functions, which are smooth.For example, ς(t) (ρ) def = ς( t (ρ)), where ρ ∈ Γ .Variables are interpreted as x i (ρ) def = ρ i .The remaining constructs are interpreted as follows, and it is straightforward to show that smoothness is preserved.
Notice that a term x 1 : real, . . ., x n : real ⊢ t : real is interpreted as a smooth function t : R n → R, even if t involves higher order functions (like (1)).Moreover the macro differentiation Thus we are in a position to state a limited version of our main correctness theorem: Theorem 1 (Semantic correctness of − → D (limited)).For any term x 1 : real, . . ., x n : real ⊢ t : real, the function − → D (t) is the dual numbers representation (2) of t .In detail: for any smooth functions f 1 . . .f n : R → R, Proof.We prove this by logical relations.Although the following proof is elementary, we found it by using the categorical methods in § 5.
For each type τ , we define a binary relation S τ between curves in τ and curves in , by induction on τ : Then, we establish the following 'fundamental lemma': If x 1 :τ 1 , ..., x n :τ n ⊢ t : σ and, for all 1≤i≤n, y 1 ...y m : real ⊢ s i : τ i is such that ((f 1 , . . ., f m ); s i , (f 1 , ∇f 1 ), ..., f m , ∇f m ); This is proved routinely by induction on the typing derivation of t.The case for * relies on the precise definition of − → D (t * s), and similarly for +, ς.We conclude the theorem from the fundamental lemma by considering the case where τ i = σ = real, m = n and s i = y i .
⊓ ⊔ 4 Extending the language: variant and inductive types In this section, we show that the definition of forward AD and the semantics generalize if we extend the language of §2 with variants and inductive types.As an example of inductive types, we consider lists.This specific choice is only for expository purposes and the whole development works at the level of generality of arbitrary algebraic data types generated as initial algebras of (polynomial) type constructors formed by finite products and variants.Similarly, our choice of operations is for expository purposes.More generally, assume given a family of operations (Op n ) n∈N indexed by their arity n.Further assume that each op ∈ Op n has type real n → real.We then ask for a certain closure of these operations under differentiation, that is we define ) is some chosen term in the language, involving free variables from x 1 , . . ., x n , which we think of as implementing the partial derivative of op with respect to its i-th argument.For constructing the semantics, every op must be interpreted by some smooth function, and, to establish correctness, the semantics of ∂ i op(x 1 , . . ., x n ) must be the semantic i-th partial derivative of the semantics of op(x 1 , . . ., x n ).
Language.We additionally consider the following types and terms: τ, σ, ρ :: We extend the type system according to: We can then extend − → D to our new types and terms by To demonstrate the practical use of expressive type systems for differential programming, we consider the following two examples.
Example 6 (Lists of inputs for neural nets).Usually, we run a neural network on a large data set, the size of which might be determined at runtime.To evaluate a neural network on multiple inputs, in practice, one often sums the outcomes.This can be coded in our extended language as follows.Suppose that we have a network f : (real n * P ) → real that operates on single input vectors.We can construct one that operates on lists of inputs as follows: g def = λ l, w .fold(x 1 , x 2 ).f x 1 , w + x 2 over l from 0 : (list(real n ) * P ) → real Example 7 (Missing data).In practically every application of statistics and machine learning, we face the problem of missing data: for some observations, only partial information is available.In an expressive typed programming language like we consider, we can model missing data conveniently using the data type maybe(τ ) = {Nothing ( ) Just τ }.In the context of a neural network, one might use it as follows.First, define some helper functions Given a neural network f : (list(real k ) * P ) → real, we can build a new one that operates on on a data set for which some covariates (features) are missing, by passing in default values to replace the missing covariates: λ l, m, w .fmap (fromMaybe k m) l, w : (list((maybe(real)) k ) * (real k * P )) → real Then, given a data set l with missing covariates, we can perform automatic differentiation on this network to optimize, simultaneously, the ordinary network parameters w and the default values for missing covariates m.
Semantics.In § 3 we gave a denotational semantics for the simple language in diffeological spaces.This extends to the language in this section, as follows.As before, each type τ is interpreted as a diffeological space, which is a set equipped with a family of plots: .
-A list type list(τ ) is interpreted as the set of lists, list(τ The constructors and destructors for variants and lists are interpreted as in the usual set theoretic semantics.It is routine to show inductively that these interpretations are smooth.Thus every term Γ ⊢ t : τ in the extended language is interpreted as a smooth function t : Γ → τ between diffeological spaces. (In this section we focused on a language with lists, but other inductive types are easily interpreted in the category of diffeological spaces in much the same way; the categorically minded reader may regard this as a consequence of Diff being a concrete Grothendieck quasitopos, e.g.[2].)

Categorical analysis of forward AD and its correctness
This section has three parts.First, we give a categorical account of the functoriality of AD (Ex.8).Then we introduce our gluing construction, and relate it to the correctness of AD (dgm.( 3)).Finally, we state and prove a correctness theorem for all first order types by considering a category of manifolds (Th.2).Syntactic categories.Our language induces a syntactic category as follows.
Definition 2. Let Syn be the category whose objects are types, and where a morphism τ → σ is a term in context x : τ ⊢ t : σ modulo the βη-laws (Fig. 4).Composition is by substitution.
For simplicity, we do not impose arithmetic identities such as x + y = y + x in Syn.As is standard, this category has the following universal property.
Lemma 2 (e.g.[21]).For every bicartesian closed category C with list objects, and every object F (real) ∈ C and morphisms there is a unique functor F : Syn → C respecting the interpretation and preserving the bicartesian closed structure as well as list objects.
Proof (notes).The functor F : Syn → C is a canonical denotational semantics for the language, interpreting types as objects of C and terms as morphisms.For instance, F (τ → σ) def = (F τ → F σ), the function space in the category C, and F (t s) def = is the composite (F t, F s); eval .When C = Diff , the denotational semantics of the language in diffeological spaces ( §3,4) can be understood as the unique cartesian closed functor − : Syn → Diff satisfying real = R, ς = ς and so on.Categorical gluing and logical relations.Gluing is a method for building new categorical models which has been used for many purposes, including logical relations and realizability [18].Our logical relations argument in the proof of Th. 1 can be understood in this setting.In this subsection, for the categorically minded, we explain this, and in doing so we quickly recover a correctness result for the more general language in § 4 and for arbitrary first order types.
We define a category Gl U whose objects are triples (X, X ′ , S) where X and X ′ are diffeological spaces and S ⊆ P U X × P U X ′ is a relation between their U -plots.A morphism (X, X ′ , S) → (Y, Y ′ , T ) is a pair of smooth functions f : X → Y , f ′ : X ′ → Y ′ , such that if (g, g ′ ) ∈ S then (g; f, g ′ ; f ′ ) ∈ T .The idea is that this is a semantic domain where we can simultaneously interpret the language and its automatic derivatives.
Proposition 1.The category Gl U is bicartesian closed, has list objects, and the projection functor proj : Gl U → Diff × Diff preserves this structure.
Proof (notes).The category Gl U is a full subcategory of the comma category id Set ↓ Diff (U, −) × Diff (U, −).The result thus follows by the general theory of categorical gluing (e.g.[13,Lemma 15]).
We give a semantics − = ( − 0 , − 1 , S − ) for the language in Gl R , interpreting types τ as objects ( τ 0 , τ 1 , S τ ), and terms as morphisms.We let R → R smooth}).We interpret the constants c as pairs c 0 def = c and c 1 def = (c, 0), and we interpret +, ×, ς in the standard way (meaning, like − ) in − 0 , but according to the derivatives in − 1 , for instance, At this point one must check that these interpretations are indeed morphisms in Gl R , and the condition on morphisms amounts to checking that these interpretations are indeed dual numbers representations in the sense of (2).The remaining constructions of the language are interpreted using the categorical structure of Gl R , following Lemma 2.
Notice that the following diagram commutes.This is easily checked by hand, but it follows from initiality of Syn (L.2): all the functors preserve all the structure.
We thus arrive at a restatement of the correctness theorem (Th.1), which holds even for the extended language with variants and lists, because for any term x 1 ... Correctness at all first order types, via manifolds.We now generalize the correctness theorem (Th. 1) to hold at all first order types, not just the reals.To do this, we need to define the derivative of a smooth map between the interpretations of first order types.We do this by recalling the well known theory of manifolds and tangent bundles.
For our purposes, a smooth manifold M is a second-countable Hausdorff topological space together with a smooth atlas: an open cover U together with homeomorphisms φ smooth for all charts φ U and ψ V of M and N , respectively.Let us write Man for this category.
(Our manifolds are slightly unusual because different charts in an atlas may have different finite dimension n(U ).Thus we consider manifolds with dimensions that are potentially unbounded, albeit locally finite.This does not affect the theory of differential geometry as far as we need it here.) Each open subset of R n can be regarded as a manifold.We use this fact to regard the category of manifolds Man as a full subcategory of the category of diffeological spaces.We consider a manifold (X, {φ U } U ) as a diffeological space with the same carrier set X and where the plots P U X are the smooth functions in Man(U, X).A function X → Y is smooth in the sense of manifolds if and only if it is smooth in the sense of diffeological spaces [12].For the categorically minded reader, this means that we have a full embedding of Man into Diff .Moreover, the natural interpretation of the first order fragment of our language in Man coincides with that in Diff .That is, the embedding of Man into Diff preserves finite products and countable coproducts (hence initial algebras of polynomial endofunctors).
Proposition 2. Suppose that a type τ is first order, i.e. it is just built from reals, products, variants, and lists (or, again, arbitrary inductive types), and not function types.Then the diffeological space τ is a manifold.Proof (notes).This is proved by induction on the structure of types.In fact, one may show that every such τ is isomorphic to a manifold of the form n i=1 R di where the bound n is either finite or ∞, but this isomorphism is typically not an identity function.

⊓ ⊔
The constraint to first order types is necessary because, e.g. the diffeological space real → real is not a manifold, because of a Borsuk-Ulam argument (see Appendix A).We recall that the derivative of any morphism f : M → N of manifolds is given as follows.As is standard, we can understand the tangent bundle of a composite space in terms of that of its parts.Lemma 3.There are canonical isomorphisms T ( We define a canonical isomorphism φ with the tangent bundle functor, modulo these canonical isomorphisms: Proof (notes).For any curve γ ∈ Man(R, M ), let γ ∈ Man(R, T (M )) be the tangent curve, given by γ(x) = (γ(x), [t → γ(x + t)]).First, we note that a smooth map g : T (M ) → T (N ) is of the form T (h) for some h : M → N if for all smooth curves γ : R → M we have γ; g = (γ; h) : R → T (N ).This generalizes (2).Second, for any first order type τ , S τ = {(f, g) | g; φ This is shown by induction on the structure of types.We conclude the theorem from diagram (3), by putting these two observations together.

A continuation-based AD algorithm
We now illustrate the flexibility of our framework by briefly describing an alternative syntactic translation ← − D ρ .This alternative translation uses aspects of continuation passing style, inspired by recent developments in reverse mode AD [28,3].In brief, . Thus instead of using a pair of a number and its tangent, we use a pair of a number and a continuation.The answer type ρ = real k needs to have the structure of a vector space, and the continuations that we consider will turn out to be linear maps.Because we work in continuation passing style, the chain rule is applied contravariantly.If the reader is familiar with reverse-mode AD algorithms, they may think of the dimension k as the number of memory cells used to store the result.
Computing the whole gradient of a term x 1 : real, ..., x k : real ⊢ t : real at once is then achieved by running Here, we use sugar x : real k , y : real k ⊢ x + y def = case x of x 1 , . . ., x k → case y of y 1 , . . ., y k → x 1 + y 1 , . . ., x k + y k .(We could easily expand this definition by making ← − D k preserve all other term and type formers, as we did for − → D .)Note that the corresponding scheme for an arbitrary n-ary operation op would be (c.f. the scheme for forward AD in §4) is a higher order function that simultaneously computes t (the forward pass) and defines as a continuation the reverse pass which computes the gradient.In order to actually run the algorithm, we need two auxiliary definitions lamR D k preserve all type formers, we can extend these definitions to all first order types τ : z : . We can think of lamR k τ (z) as encoding k tangent vectors z : − → D k (τ ) as a closure, so it is suitable for running ← − D k (t) on, and evR k τ (z) as actually evaluating the reverse pass defined by z : ← − D k (τ ) and returning the result as k tangent vectors.The idea is that given some x : τ ⊢ t : σ between first order types τ, σ, we run our continuation-based AD by running evR k σ ( / x ]).The correctness proof closely follows that for forward AD.In particular, one defines a binary logical relation real Lemma 4. For all first order types τ (i.e.types not involving function types), we have that evR k τ (lamR k τ (t)) = t .
Proof (notes).This follows by an induction on the structure of τ .The idea is that lamR k τ embeds reals into function spaces as linear maps, which is undone by evR k τ by evaluating the linear maps at 1.

⊓ ⊔
To phrase correctness, in this setting, however, we need a few definitions.Keeping in mind the canonical projection T (M ) → M , we define T k (M ) as the k-fold categorical pullback (fibre product) T (M ) × M . . .× M T (M ).To be explicit, T k x M consists of k-tuples of tangent vectors at the base point x.Again, T k extends to a functor Man → Man by defining ).As T k preserves countable coproducts and finite products (like T ), it follows that the isomorphisms φ x:real v) / x , (lamR 2 y:real w) / y ])[ V, 1,0 / v , W, 0,1 / w ] = t ( V , W ), ∂ 1 t ( V , W ), ∂ 2 t ( V , W ) .

Discussion and future work
Summary.We have shown that diffeological spaces provide a denotational semantics for a higher order language with variants and inductive types ( §3,4).We have used this to show correctness of a simple AD translation (Thm.1, Thm. 2).But the method is not tied to this specific translation, as we illustrated in Section 6.
The structure of our elementary correctness argument for Theorem 1 is a typical logical relations proof.As explained in Section 5, this can equivalently be understood as a denotational semantics in a new kind of space obtained by categorical gluing.
Overall, then, there are two logical relations at play.One is in diffeological spaces, which ensures that all definable functions are smooth.The other is in the correctness proof (equivalently in the categorical gluing), which explicitly tracks the derivative of each function, and tracks the syntactic AD even at higher types.
Connection to the state of the art in AD implementation.As is common in denotational semantics research, we have here focused on an idealized language and simple translations to illustrate the main aspects of the method.There are a number of points where our approach is simplistic compared to the advanced current practice, as we now explain.

Representation of vectors.
In our examples we have treated n-vectors as tuples of length n.This style of programming does not scale to large n.A better solution would be to use array types, following [25].We envisage that our categorical semantics straightforwardly extends to cover them, in a similar way to our treatment of lists.
Efficient forward-mode AD.For AD to be useful, it must be fast.The syntactic translation − → D that we use is the basis of an efficient AD library [25].However, numerous optimizations are needed, ranging from algebraic manipulations, to partial evaluations, to the use of an optimizing C compiler.A topic for future work would be to validate some of these manipulations using our semantics.The resulting implementation is performant in experiments [25].
Efficient reverse-mode AD.Our sketch of continuation-based AD is primarily intended to emphasise that our denotational approach is not tied to any specific translation − → D .Nonetheless, it is worth noting that this algorithm shares similarities with advanced reverse-mode implementations: (1) it calculates derivatives in a (contravariant) "reverse pass" in which derivatives of operations are evaluated in the reverse order compared to their order in calculating the function value; (2) it can be used to calculate the full gradient of a function R n → R in a single reverse pass (while n passes of fwd AD would be necessary).However, speed is the main attraction of reverse-mode AD and its implementations tend to rely on mutable state, control operators and/or complex intermediate representations [20,4,28,3], which we have not considered here.
Other language features.The idealized languages that we considered so far do not touch on several useful language constructs.For example: the use of functions that are partial (such as division) or partly-smooth (such as RelU); phenomena such as iteration, recursion; and probabilities.There are suggestions that the denotational approach using diffeological spaces can be adapted to these features using standard categorical methods.We leave this for future work.
x n : real ⊢ t : real, the interpretations ( t , − → D (t) ) are in the image of the projection Gl R → Diff × Diff , and hence − → D (t) is a dual numbers encoding of t .
For each point x in a manifold M , define the tangent space T x M to be the set {γ ∈ Man(R, M ) | γ(0) = x}/ ∼ of equivalence classes [γ] of smooth curves γ in M based at x, where we identify γ 1 ∼ γ 2 iff ∇(γ 1 ; f )(0) = ∇(γ 2 ; f )(0) for all smooth f : M → R. The tangent bundle of M is the set T (M ) def = x∈M T x (M ).The charts of M equip T (M ) with a canonical manifold structure.Then for smooth f : M → N , the derivative T (f ) : T (M ) → T (N ) is defined as T (f )(x, [γ]) def = (f (x), [γ; f ]).All told, the derivative is a functor T : Man → Man.
) → T ( τ ) for every type τ , by induction on the structure of types.We let φ (x,x ′ ) def = (x, [t → x + x ′ t]).For the other types, we use Lemma 3. We can now phrase correctness at all first order types.Theorem 2 (Semantic correctness of − → D (full)).For any ground τ , any first order context Γ and any term Γ ⊢ t : τ , the syntactic translation − → D coincides t) on a k-tuple of basis vectors for real k .We define the continuation-based AD macro ← − D k on types and terms as the unique structure preserving functor Syn → Syn with ← − D k (real) = (real * (real → real k )) and k is a macro on types (and terms) exactly the same inductive definition as − → D except for the base case − → D k (real) = (real * real k ).By noting that both − → D k and ← − y) and ς(x) × ← − D k (ς(x)) respect this logical relation.It follows that this relation extends to a functor − r,k : Syn → Gl R k such that id × ← − D k factors over − r,k , implying the correctness of the continuation-based AD by the following lemma.