A Class of Non-Parametric Statistical Manifolds modelled on Sobolev Space

We construct a family of non-parametric (infinite-dimensional) manifolds of finite measures on $\mathbb{R}^d$. The manifolds are modelled on a variety of weighted Sobolev spaces, including Hilbert-Sobolev spaces and mixed-norm spaces. Each supports the Fisher-Rao metric as a weak Riemannian metric. Densities are expressed in terms of a deformed exponential function having linear growth. Unusually for the Sobolev context, and as a consequence of its linear growth, this "lifts" to a nonlinear superposition (Nemytskii) operator that acts continuously on a particular class of mixed-norm model spaces, and on the fixed-norm space $W^{2,1}$; i.e. it maps each of these spaces continuously into itself. It also maps continuously between other fixed-norm spaces with a loss of Lebesgue exponent that increases with the number of derivatives. Some of the results make essential use of a log-Sobolev embedding theorem. Each manifold contains a smoothly embedded submanifold of probability measures. Applications to the stochastic partial differential equations of nonlinear filtering (and hence to the Fokker-Planck equation) are outlined.


Introduction
In recent years there has been rapid progress in the theory of information geometry, and its application to a variety of fields including asymptotic statistics, machine learning, signal processing and statistical mechanics. (See, for example, [29,30].) Beginning with C.R. Rao's observation that the Fisher information can be interpreted as a Riemannian metric [33], information geometry has exploited the formalism of manifold theory in problems of statistical estimation. The finite-dimensional (parametric) theory is now mature, and is treated pedagogically in [1,3,9,14,21]. The archetypal example is the finite-dimensional exponential model, which is based on a finite set of real-valued random variables defined on an underlying probability space (X, X , µ). Affine combinations of these are exponentiated to yield probability density functions with respect to the reference measure µ. This construction induces a topology on the resulting set of probability measures, that is compatible with the statistical divergences of estimation theory, derivatives of which can be used to define the Fisher-Rao metric and covariant derivatives having various statistical interpretations.
The first successful extension of these ideas to the non-parametric setting appeared in [32], and was further developed in [13,31,8]. These papers follow the formalism of the exponential model by using the log of the density as a chart. This approach requires a model space with a strong topology: the exponential Orlicz space. It has been extended in a number of ways. In [18], the exponential function is replaced by the so-called q-deformed exponential, which has an important interpretation in statistical mechanics. (See chapter 7 in [22].) The model space used there is L ∞ (µ). A more general class of deformed exponential functions is used in [36] to construct families of probability measures dubbed ϕ-families. The model spaces used are Musielak-Orlicz spaces.
One of the most important statistical divergences is the Kullback-Leibler (KL) divergence. For probability measures P and Q having densities p and q with respect to µ, this is defined as follows: $D(P\,|\,Q) = \int p \log(p/q)\, d\mu$. (1) The KL divergence can be given the bilinear representation $\langle p, \log p - \log q \rangle$, in which probability densities and their logs take values in dual function spaces (for example, the Lebesgue spaces $L^{\lambda}(\mu)$ and $L^{\lambda/(\lambda-1)}(\mu)$ for some 1 < λ < ∞). Loosely speaking, in order for the KL divergence to be smooth on an infinite-dimensional manifold, the charts of the latter must "control" both the density p and its log, and this provides one explanation of the need for strong topologies on the model spaces of non-parametric exponential models. This observation led to the construction in [24] of an infinite-dimensional statistical manifold modelled on Hilbert space. This employs a "balanced chart" (the sum of the density and its log), which directly controls both. This chart was later used in [26] in the development of Banach manifolds modelled on the Lebesgue spaces $L^{\lambda}(\mu)$, for λ ∈ [2, ∞). These give increasing degrees of smoothness to statistical divergences. An ambient manifold of finite measures was also defined in [26], and used in the construction of α-parallel transport on the embedded statistical manifold. These manifolds make no reference to any topology that the underlying sample space X may possess. Statistical divergences measure dependency between abstract random variables (those taking values in measurable spaces) without reference to any other structures that these spaces may have. Nevertheless, topologies, metrics and linear structures on X play important roles in many applications. For example, the Fokker-Planck and Boltzmann equations both quantify the evolution of probability density functions on $\mathbb{R}^d$, making direct reference to the latter's topology through differential operators.
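As a concrete illustration of the KL divergence (1) and its bilinear representation, the following minimal sketch evaluates both on a finite sample space. The three-point reference measure `mu` and the densities `p` and `q` are hypothetical values chosen for illustration only; on a finite space the integral in (1) reduces to a weighted sum.

```python
import math

def kl_divergence(p, q, mu):
    """D(P|Q) = sum_x p(x) log(p(x)/q(x)) mu(x), with p, q densities w.r.t. mu."""
    return sum(pi * math.log(pi / qi) * mi for pi, qi, mi in zip(p, q, mu))

def pairing(f, g, mu):
    """Bilinear pairing <f, g> = sum_x f(x) g(x) mu(x)."""
    return sum(fi * gi * mi for fi, gi, mi in zip(f, g, mu))

# Hypothetical reference probability measure on a three-point sample space.
mu = [0.25, 0.25, 0.5]
# Hypothetical densities w.r.t. mu; each integrates to 1 against mu.
p = [2.0, 1.0, 0.5]
q = [1.0, 1.0, 1.0]

d_pq = kl_divergence(p, q, mu)
# The same quantity written as the pairing <p, log p - log q>.
log_ratio = [math.log(pi) - math.log(qi) for pi, qi in zip(p, q)]
bilinear = pairing(p, log_ratio, mu)
```

The pairing form makes the role of the two dual spaces visible: the density `p` sits in one slot and the difference of logs in the other, so a chart has to control both for the divergence to be smooth.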
A natural direction for research in infinite-dimensional information geometry is to adapt the manifolds outlined above to such problems by incorporating the topology of the sample space in the model space. One way of achieving this is to use model spaces of Sobolev type. This is carried out in the context of the exponential Orlicz manifold in [17], where it is applied to the spatially homogeneous Boltzmann equation. Manifolds modelled on the Banach spaces $C^k_b(B; \mathbb{R})$, where B is an open subset of an underlying (Banach) sample space, are developed in [28], and manifolds modelled on Fréchet spaces of smooth densities are developed in [4,7] and [28].
The aim of this paper is to develop Sobolev variants of the Lebesgue $L^{\lambda}(\mu)$ manifolds of [24,26] when the sample space X is $\mathbb{R}^d$. Our construction includes, as a special case, a class of Hilbert-Sobolev manifolds. In developing these, the author was motivated by applications in nonlinear filtering. The equations of nonlinear filtering for diffusion processes generalise the Fokker-Planck equation by adding a term that accounts for partial observations of the diffusion. Let $(X_t, Y_t, t \ge 0)$ be a (d+1)-vector Markov diffusion process defined on a probability space (Ω, F, P), and satisfying the Itô stochastic differential equation (2), where $Y_0 = 0$, $(V_t, t \ge 0)$ is a (d+1)-vector standard Brownian motion, independent of $X_0$, and $f : \mathbb{R}^d \to \mathbb{R}^d$, $g : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ and $h : \mathbb{R}^d \to \mathbb{R}$ are suitably regular functions. The nonlinear filter for X computes, at each time t, the conditional probability distribution of $X_t$ given the history of the observations process $(Y_s, 0 \le s \le t)$. Since X and Y are jointly Markov, the nonlinear filter can be expressed in a time-recursive manner. Under suitable technical conditions, the observation-conditional distribution of $X_t$ admits a density, $p_t$ (with respect to Lebesgue measure), satisfying the Kushner-Stratonovich stochastic partial differential equation (3) [11], where A is the Kolmogorov forward (Fokker-Planck) operator for X. The exponential Orlicz manifold was proposed as an ambient manifold for partial differential equations of this type in [6] (and the earlier references therein), and methods of projection onto submanifolds were developed. Applications of the Hilbert manifold of [24] to nonlinear filtering were developed in [25,27], and information-theoretic properties were investigated.
It was argued in [26,27] that statistical divergences such as the KL divergence are natural measures of error for approximations to Bayesian conditional distributions such as those of nonlinear filtering. This is particularly so when the approximation constructed is used to estimate a number of statistics of the process X, or when the dynamics of X are significantly nonlinear. We summarise these ideas here since they motivate the developments that follow; details can be found in [27]. If our purpose is to estimate a single real-valued variate $v(X_t) \in L^2(\mu)$, then the estimate with the minimum mean-square error is the conditional mean $\bar{v}_t$, where E is expectation with respect to P, and $\Pi_t$ is the conditional distribution of $X_t$. If the estimate is based on a $(Y_s, 0 \le s \le t)$-measurable approximation to $\Pi_t$, $\hat{\Pi}_t$, then the mean-square error admits the orthogonal decomposition The first term on the right-hand side here is the statistical error, and is associated with the limitations of the observation Y; the second term is the approximation error resulting from the use of $\hat{\Pi}_t$ instead of $\Pi_t$. When comparing different approximations, it is appropriate to measure the second term relative to the first; if $\bar{v}_t$ is a poor estimate of $v(X_t)$ then there is no point in approximating it with great accuracy. Maximising these relative errors over all square-integrable variates leads to the (extreme) multi-objective measure of mean-square approximation errors Although extreme, it illustrates an important feature of multi-objective measures of error: they require probabilities of events that are small to be approximated with greater absolute accuracy than those that are large. A less extreme multi-objective measure of mean-square errors is developed in [27]. This constrains the functions v of (5) to have exponential moments. The resulting measure of errors is shown to be of class $C^1$ on the Hilbert manifold of [24], and so has this same property on the manifolds developed here.
See [27] for further discussion of these ideas. The paper is structured as follows. Section 2 provides the technical background in mixed-norm weighted Sobolev spaces, where the $L^{\lambda}$ spaces are based on a probability measure. Section 3 constructs (M, G, φ), a manifold of finite measures modelled on the general Sobolev space of section 2. It outlines the properties of mixture and exponential representations of measures on the manifold, as well as those of the KL divergence. In doing so, it defines the Fisher-Rao metric and Amari-Chentsov tensor. Section 4.1 shows that a particular choice of mixed-norm Sobolev space is especially suited to the manifold in the sense that the density of any P ∈ M also belongs to the model space, and the associated nonlinear superposition operator is continuous, a rare property in the Sobolev context [35]. Section 4.2 shows that this property does not hold for fixed-norm spaces, except in the special case $G = W^{2,1}$. It also develops a general class of fixed-norm spaces, for which the continuity property can be retained if the Lebesgue exponent in the range space is suitably reduced. Section 5 develops an embedded submanifold of probability measures $(M_0, G_0, \varphi_0)$, in which the charts are centred versions of φ. Section 6 outlines applications to the problem of nonlinear filtering for a diffusion process, as defined in (2) and (3). Finally, section 7 makes some concluding remarks, discussing, in particular, a variant of the results that uses the Kaniadakis deformed logarithm as a chart.

The Model Spaces
For some t ∈ (0, 2], let θ t : [0, ∞) → [0, ∞) be a strictly increasing function that is twice continuously differentiable on (0, ∞), such that lim z↓0 θ ′ t (z) < ∞, and If t ∈ (1, 2] then we also require θ t and − √ θ t to be convex. and Here, β t z t is the unique solution in the interval (0, π) of the equation For some d ∈ N, let X be the σ-algebra of Lebesgue measurable subsets of R d , and let µ t be the following product probability measure on (R d , X ): (9) and C t ∈ R is such that exp(C t − θ t (|z|))dz = 1. In what follows, we shall suppress the subscript t, and so l t , r t and µ t will become l, r and µ, etc.
For any 1 ≤ λ < ∞, let L λ (µ) be the Banach space of (equivalence classes of) measurable functions u : be the space of continuous functions with continuous partial derivatives of all orders, and let C ∞ 0 (R d ; R) be the subspace of those functions having compact support.
For k ∈ N, let $S := \{0, \ldots, k\}^d$ be the set of d-tuples of integers in the range $0 \le s_i \le k$. For s ∈ S, we define $|s| = \sum_i s_i$, and denote by 0 the d-tuple for which |s| = 0. For any 0 ≤ j ≤ k, $S_j := \{s \in S : j \le |s| \le k\}$ is the set of d-tuples of weight at least j and at most k. Let $\Lambda = (\lambda_0, \lambda_1, \ldots, \lambda_k)$, where $1 \le \lambda_k \le \lambda_{k-1} \le \cdots \le \lambda_0 < \infty$, and let $W^{k,\Lambda}(\mu)$ be the mixed-norm, weighted Sobolev space comprising functions $a \in L^{\lambda_0}(\mu)$ that have weak partial derivatives $D^s a \in L^{\lambda_{|s|}}(\mu)$, for all $s \in S_1$. For $a \in W^{k,\Lambda}(\mu)$ we define The following theorem is a variant of a standard result in the theory of fixed-norm, unweighted Sobolev spaces.
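Proof. That $\|\cdot\|_{W^{k,\Lambda}(\mu)}$ satisfies the axioms of a norm is easily verified. Suppose that $(a_n \in W^{k,\Lambda}(\mu))$ is a Cauchy sequence in this norm; then, since the spaces $L^{\lambda_j}(\mu)$, 0 ≤ j ≤ k, are all complete, there exist functions $v_s \in L^{\lambda_{|s|}}(\mu)$, $s \in S_0$, such that $D^s a_n \to v_s$ in $L^{\lambda_{|s|}}(\mu)$. For any $s \in S_0$ and any $\varphi \in C^\infty_0(\mathbb{R}^d; \mathbb{R})$, $\left| \int (D^s a_n - v_s)\varphi \, dx \right| \le \sup_{x \in \mathrm{supp}(\varphi)} (|\varphi|/r) \, \|D^s a_n - v_s\|_{L^1(\mu)} \to 0$, and so $\int v_s \varphi \, dx = \lim_n \int D^s a_n \varphi \, dx = (-1)^{|s|} \lim_n \int a_n D^s \varphi \, dx = (-1)^{|s|} \int v_0 D^s \varphi \, dx$. Hence $v_0$ admits weak derivatives up to order k, and $D^s v_0 = v_s$. So $W^{k,\Lambda}(\mu)$ is complete.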
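The index sets S and $S_j$ used in the definition of $W^{k,\Lambda}(\mu)$ can be enumerated directly, which may help in reading the mixed-norm conditions. The sketch below is illustrative only: the function names and the sample values d = 2, k = 2, Λ = (4, 2, 1) are assumptions, not taken from the paper.

```python
from itertools import product

def index_set(d, k):
    """S = {0, ..., k}^d: all d-tuples of integers with 0 <= s_i <= k."""
    return list(product(range(k + 1), repeat=d))

def weight(s):
    """|s| = s_1 + ... + s_d."""
    return sum(s)

def index_subset(d, k, j):
    """S_j = {s in S : j <= |s| <= k}."""
    return [s for s in index_set(d, k) if j <= weight(s) <= k]

def exponent_for(s, exponents):
    """Lebesgue exponent lambda_{|s|} required of D^s a, given (lambda_0, ..., lambda_k)."""
    return exponents[weight(s)]

# Hypothetical example: d = 2, k = 2, Lambda = (4, 2, 1).
S = index_set(2, 2)
S0 = index_subset(2, 2, 0)   # the function itself and all derivatives up to order 2
S1 = index_subset(2, 2, 1)   # proper derivatives only
Lam = (4, 2, 1)
mixed_second_derivative_exponent = exponent_for((1, 1), Lam)
```

Note that S also contains tuples such as (2, 2) whose weight exceeds k; these are excluded from every $S_j$, so only derivatives of total order at most k enter the definition of the space.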
The following developments show that functions in W k,Λ (µ) can be approximated by particular functions in ) be a function having the following properties: (i) supp(J) = B 1 ; (ii) $\int J \, dx = 1$. For any 0 < ε < 1, let $J_\varepsilon(x) = \varepsilon^{-d} J(x/\varepsilon)$; then $J_\varepsilon$ also has unit integral, but is supported on $B_\varepsilon$. Since l is bounded on bounded sets, any u ∈ L 1 (µ) is also in L 1 loc (dx), and we can define the mollified version $J_\varepsilon * u$ ∈ C ∞ (R d ; R) as follows: For any m ∈ N, let U m ⊂ L 1 (µ) comprise those functions that take the value zero on the complement of B m . If u ∈ U m then $J_\varepsilon * u$ ∈ C ∞ 0 (B m+1 ; R). Lemma 1. (i) For any λ ∈ [1, ∞) and any u ∈ U m ∩ L λ (µ), there exists an ε > 0 such that (ii) For any a ∈ W k,Λ (µ), ε > 0 and s ∈ S 1 , $D^s(J_\varepsilon * a) = J_\varepsilon * (D^s a)$.
For a and s as in part (ii), and any ϕ ∈ C ∞ 0 (R d ; R), where we have used integration by parts |s| times in the third step. This completes the proof of part (ii).
For ease of notation in what follows, we shall abbreviate $J_\varepsilon * u$ to Ju, where it is understood that ε has been chosen as in part (i). With this convention, we can express part (ii) as $D^s(Ja) = J(D^s a)$, where it is understood that ε has been chosen to satisfy (13) for both a and $D^s a$.
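The scaling $J_\varepsilon(x) = \varepsilon^{-d} J(x/\varepsilon)$ and the mollification $J_\varepsilon * u$ can be checked numerically in one dimension. The sketch below rests on assumptions: it takes d = 1, uses the standard bump function $\exp(-1/(1-x^2))$ as one admissible choice of J (the text only requires support in the unit ball and unit integral), takes the mollification to be the usual convolution $\int J_\varepsilon(x-y)u(y)\,dy$, and approximates integrals with a midpoint rule.

```python
import math

def bump(x):
    """Unnormalised smooth bump supported on the open unit interval (-1, 1)."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def integrate(f, a, b, n=4000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

NORM = integrate(bump, -1.0, 1.0)

def J(x):
    """Mollifier kernel with (numerically) unit integral, supported on B_1."""
    return bump(x) / NORM

def J_eps(x, eps):
    """Rescaled kernel J_eps(x) = eps^{-d} J(x / eps) with d = 1, supported on B_eps."""
    return J(x / eps) / eps

def mollify(u, x, eps):
    """(J_eps * u)(x); the integrand vanishes outside |x - y| < eps."""
    return integrate(lambda y: J_eps(x - y, eps) * u(y), x - eps, x + eps)

def indicator(y):
    """Hypothetical test function: the indicator of [-1, 1], which vanishes off B_1."""
    return 1.0 if abs(y) <= 1.0 else 0.0

mass = integrate(lambda x: J_eps(x, 0.1), -0.1, 0.1)
inside = mollify(indicator, 0.0, 0.1)
edge = mollify(indicator, 1.0, 0.1)
outside = mollify(indicator, 2.0, 0.1)
```

The mollified indicator equals 1 well inside the interval, 0 well outside, and passes smoothly through 1/2 at the edge, which illustrates why $J_\varepsilon * u$ is supported in the slightly larger ball $B_{m+1}$ when u vanishes off $B_m$.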
For any a ∈ W k,Λ (µ) and m ∈ N, let a m ( (13) is satisfied for all u = D s a m with s ∈ S 0 and λ = λ |s| . According to the Leibniz rule, and so |D s a m | ≤ K σ |D σ a| ∈ L λ |s| (µ). Since D s a m → D s a for all x, it follows from the dominated convergence theorem that it also converges in L λ |s| (µ). Lemma 1 completes the proof.

The Manifolds of Finite Measures
In this section, we construct manifolds of finite measures on (R d , X ) modelled on the Sobolev spaces of section 2. The charts of the manifolds are based on the "deformed logarithm" log d : (0, ∞) → R, defined by Now inf y log d y = −∞, sup y log d y = +∞, and log d ∈ C ∞ ((0, ∞); R) with strictly positive first derivative 1 + y −1 , and so, according to the inverse function theorem, log d is a diffeomorphism from (0, ∞) onto R. Let ψ be its inverse. This can be thought of as a "deformed exponential" function [22]. We use ψ (n) to denote its n-th derivative and, for convenience, set ψ (0) := ψ.
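A numerical sketch of the deformed logarithm and its inverse may make these properties concrete. The derivative $1 + y^{-1}$ stated above fixes $\log_d$ only up to an additive constant; the code assumes the normalisation $\log_d y = y - 1 + \log y$ (so that $\log_d 1 = 0$), which is an assumption of this sketch. The inverse ψ is computed by bisection, and the formula $\psi^{(1)} = \psi/(1+\psi)$ follows from the inverse function theorem and does not depend on the constant.

```python
import math

def log_d(y):
    """Deformed logarithm on (0, inf); the constant -1 is an assumed normalisation."""
    return y - 1.0 + math.log(y)

def psi(a, iters=200):
    """Deformed exponential: the inverse of log_d, found by bisection."""
    if a > 0.0:
        # For a > 0 the root y lies in [1, a + 1].
        lo, hi = 1.0, a + 1.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if log_d(mid) < a:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    # For a <= 0 solve exp(z) - 1 + z = a for z = log(y) in [a, 0].
    lo, hi = a, 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.exp(mid) - 1.0 + mid < a:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

def psi_prime(a):
    """First derivative of psi, equal to psi / (1 + psi), which lies in (0, 1)."""
    y = psi(a)
    return y / (1.0 + y)

sample_points = [-5.0, -1.0, 0.0, 0.5, 3.0, 50.0]
round_trip_errors = [abs(log_d(psi(a)) - a) for a in sample_points]
slopes = [psi_prime(a) for a in sample_points]
```

The bound $0 < \psi^{(1)} < 1$ is what gives ψ its linear growth: for a > 0 the bracket used in the bisection already shows $\psi(a) \le 1 + a$ under the assumed normalisation.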
Proof. That ψ (1) and ψ (2) are as stated is verified by a straightforward computation. Both (19) and (20) then follow by induction arguments.
Let G := W k,Λ (µ) be the general mixed-norm space of section 2, and let M be the set of finite measures on (R d , X ) satisfying the following: (M1) P is mutually absolutely continuous with respect to µ; (We denote measures on (R d , X ) by the upper-case letters P , Q, . . . , and their densities with respect to µ by the corresponding lower case letters, p, q, . . . ) In order to control both the density p and its log, we employ the "balanced" chart of [24] and [26], φ : M → G. This is defined by: Proof. It follows from (M2) that, for any P ∈ M, φ(P ) ∈ G. Suppose, conversely, that a ∈ G; then since ψ (1) is bounded, ψ(a) ∈ L 1 (µ), and so defines a finite measure P (dx) = ψ(a(x))µ(dx). Since ψ is strictly positive, P satisfies (M1). That it also satisfies (M2) follows from the fact that log d ψ(a) = a ∈ G.
We have thus shown that P ∈ M and clearly φ(P ) = a.
The inverse map φ −1 : G → M takes the form In [24,26], tangent vectors were defined as equivalence classes of differentiable curves passing through a given base point, and having the same first derivative at this point. This allowed them to be interpreted as linear operators acting on differentiable maps. Here, we use a different definition that is closer to that of membership of M. For any P ∈ M, let $\tilde{P}_a$ be the finite measure on We define a tangent vector U at P to be a signed measure on (R d , X ) that is absolutely continuous with respect to $\tilde{P}_a$, with Radon-Nikodym derivative $dU/d\tilde{P}_a \in G$. The tangent space at P is the linear space of all such measures, and the tangent bundle is the disjoint union T M := ∪ P ∈M (P, T P M). This is globally trivialised by the chart The derivative of a (Fréchet) differentiable, Banach-space-valued map f : M → Y (at P and in the "direction" U) is defined in the obvious way: Clearly u = Uφ. We shall also need a weaker notion of differentiability due to Leslie [15,16]. Let A : G → Y be a continuous linear map and, for fixed If f is Leslie differentiable at all P ∈ M then we say that it is Leslie differentiable. This is a slightly stronger property than the "d-differentiability" used in [24], which essentially demands continuity of R in the first argument only.
The construction above defines an infinite-dimensional manifold of finite measures, (M, G, φ), with atlas comprising the single chart φ. M is a subset of an instance of the manifold constructed in [26] (that in which the measurable space X of [26] is R d ), but has a stronger topology than the associated relative topology. Results in [26] concerning the smoothness of maps defined on the model space L λ 0 (µ) are true a fortiori when the latter is replaced by G; in fact, stronger results can be obtained under the following hypothesis: For (ii) If λ 0 /β ∈ N and (E1) does not hold, then the highest Fréchet derivative, Ψ (iii) Ψ β satisfies global Lipschitz continuity and linear growth conditions, and all its derivatives (including that in (28)) are globally bounded.
Proof. According to the mean value theorem, for any a, b ∈ G, and so the Lipschitz continuity and linear growth of Ψ β follow from the boundedness of ψ (1) . Let (a n ∈ G \ {a}) be a sequence converging to a in G.
For any 1 ≤ j ≤ N let According to the mean-value theorem ∆ n = δ n (a n − a), where Hölder's inequality shows that, for all u 1 , . . . , u j in the unit ball of G, In order to prove part (i), it thus suffices to show that If ν < λ 0 (eg. if (E1) does not hold) then Hölder's inequality shows that where ζ := λ 0 ν/(λ 0 − ν). Now δ n and Γ n are bounded and converge to zero in probability, and so the bounded convergence theorem establishes (31). If ν = λ 0 then (E1) holds. Suppose first that ν > 1, and let f m ∈ C ∞ (R d ; R) be a sequence converging in G to some b ∈ G. For some 1 ≤ i ≤ d and a weakly differentiable g : R d → R, let g ′ := ∂g/∂x i ; then where h ∈ C(R; R) is defined by h(y) = ν|y| ν−1 sgn(y), , K < ∞ and we have used Hölder's inequality in the bounds on R m and T m . With a slight abuse of notation, let f m be a subsequence that converges to b almost surely; As in the proof of Theorem 1, this shows that |b| ν is weakly differentiable with respect to x i , with derivative This enables the use of a log-Sobolev inequality. Let α := (t − 1)/t, and let (See, for example, [34].) F α is equivalent to any Young functionF α , for which Similarly, G α is equivalent to any Young functionG α , for whichG α (z) = exp(z 1/α ) for z ≥ 2. We denote the associated Orlicz spaces L 1 log α L(µ) and exp L 1/α (µ), respectively. L 1 log α L(µ) is equal (modulo equivalent norms) to the Lorentz-Zygmund space L 1,1;α (µ), which in the context of the product probability space (R d , X , µ) is a rearrangement-invariant space. (See section 3 in [10].) It follows from Theorem 7.12 in [10], together with (32), that This is clearly also true if ν = 1. In the light of the generalised Hölder inequality, in order to prove (31) it now suffices to show that the sequences |δ n | ν and |Γ n | ν converge to zero in exp L 1/α (µ), but this follows from their boundedness and convergence to zero in probability. This completes the proof of part (i).
With the hypotheses of part (ii), let (t n ∈ R \ {0}) and (v n ∈ G) be sequences converging to 0 and u N +1 , respectively, and let a n := a + t n v n . Substituting this sequence into (30), we obtain t −1 n ∆ n = δ n v n = δ n (v n − u N +1 ) + δ n u N +1 . Both terms on the right-hand side here converge to zero in L λ 0 (µ) since δ n is bounded and converges to zero in probability. This completes the proof of part (ii). Part (iii) follows from (29) and the boundedness of the ψ (j) .
where ı : G → L β (µ) is the inclusion map. These are injective and share the smoothness properties of Ψ β developed in Lemma 4. In particular, where a = φ(P ), and the derivatives are Leslie derivatives if β = λ 0 , and (E1) does not hold. The maps m β and e β can be used to investigate the regularity of statistical divergences on M. The usual extension of the KL divergence to sets of finite measures, such as M, is [1]: where E µ is expectation (integration) with respect to µ. This clearly requires λ 0 ≥ 2. Its smoothness is investigated in [26]; D admits mixed second partial derivatives (in the sense of Leslie if λ 0 = 2 and (E1) does not hold). So we can use Eguchi's characterisation of the Fisher-Rao metric on T P M [12]: for any U, V ∈ T P M, It follows that $\langle V, U \rangle_P = \langle U, V \rangle_P$ and that $\langle yU, V \rangle_P = \langle U, yV \rangle_P = y \langle U, V \rangle_P$ for any y ∈ R; furthermore, and $\langle U, U \rangle_P = 0$ if and only if Uφ = 0. So the metric is positive definite and dominated by the chart-induced norm on T P M. However, the Fisher-Rao metric and chart-induced norm are not equivalent, even when the model space is $L^2(\mu)$ [24]. In the general, infinite-dimensional case $(T_P M, \langle \cdot, \cdot \rangle_P)$ is not a Hilbert space; the Fisher-Rao metric is a weak Riemannian metric. If λ 0 ≥ 3 then M also admits the Amari-Chentsov tensor. This is the symmetric covariant 3-tensor field defined by The regularity of the Fisher-Rao metric and higher-order covariant tensors can be derived from that of Ψ β , as developed in Lemma 4. They become smoother with increasing values of λ 0 . Log-Sobolev embedding enhances this gain for particular integer values of λ 0 . Suppose, for example, that λ 0 = 2. If (E1) holds then the metric is a continuous covariant 2-tensor on M; however if (E1) does not hold then, although the composite map $M \ni P \mapsto \langle U(P), V(P) \rangle_P \in \mathbb{R}$ is continuous for all continuous vector fields U, V, the metric is not continuous in the sense of the operator norm.
If λ 0 ≥ 2 the variables m 2 and e 2 are bi-orthogonal representations of measures in M. This can be seen in the following generalised cosine rule: Setting R = P and using the fact that m 2 + e 2 = ı • φ, where ı : G → L 2 (µ) is the inclusion map, we obtain the global bound

Special Model Spaces
The construction of M and T M in the previous section is valid for any of the weighted mixed-norm spaces developed in section 2, including the fixed norm space G f := W k,(λ,...,λ) (µ). However, certain spaces are particularly suited to the deformed exponential function ψ; these are introduced next. A special class of mixed-norm spaces, on which the nonlinear superposition operators associated with ψ act continuously, is developed in section 4.1. Section 4.2 investigates fixed-norm spaces and shows that, with the exception of the cases k = 1, λ ∈ [1, ∞) and k = 2, λ = 1, they do not share this property.

A Family of Mixed Norm Spaces
This section develops the mixed-norm space G m := W k,Λ (µ) with λ 0 ≥ λ 1 ≥ k and λ j = λ 1 /j for 2 ≤ j ≤ k. Lemma 4 can be augmented as follows.
Proof. A partition of s ∈ S 1 is a set π = {σ 1 , . . . , σ n ∈ S 1 } such that $\sum_i \sigma_i = s$. Let Π(s) denote the set of distinct partitions of s and, for any π ∈ Π(s), let |π| denote the number of d-tuples in π. According to the Faà di Bruno formula, for any s ∈ S 1 and any f ∈ C ∞ (R d ; R), where the K π < ∞ are combinatoric constants. D s ψ(f ) ∈ C ∞ (R d ; R) since the derivatives of ψ are bounded and D σ f ∈ C ∞ (R d ; R) for all σ ∈ π. We set F 0 := ψ, and extend the domain of F s to G m in the obvious way. Let (f n ∈ C ∞ (R d ; R)) be a sequence converging in the sense of G m to a. Since the first derivative of ψ is bounded, the mean value theorem shows that ψ(f n ) → ψ(a) = F 0 (a) in the sense of L λ 0 (µ). Furthermore, for any s ∈ S 1 , (|D τ f n | + |D τ a|) . Now $\sum_{\sigma \in \pi} |\sigma| = |s|$, and so it follows from Hölder's inequality that which, together with the boundedness of the derivatives of ψ, shows that the first term on the right-hand side of (43) converges to zero in the sense of L λ/|s| (µ). The second term converges to zero in probability and is dominated by the function $C \prod_{\sigma \in \pi} |D^\sigma a| \in L^{\lambda/|s|}(\mu)$ for some C < ∞, and so it also converges to zero in the sense of L λ/|s| (µ). We have thus shown that, for any s ∈ S 0 , D s ψ(f n ) converges to F s (a) in the sense of L λ/|s| (µ). In particular, F s (a) ∈ L λ/|s| (µ). That ψ(a) is weakly differentiable with derivatives D s ψ(a) = F s (a), for all s ∈ S 1 , follows from arguments similar to those in (11) with f n playing the role of a n , and this completes the proof of part (i).
Let (a n ∈ G m ) be a sequence converging to a in the sense of G m . The above arguments, with a n replacing f n , show that, for any s ∈ S 0 , F s (a n ) → F s (a) in the sense of L λ/|s| (µ), and this completes the proof of part (ii).
For any P 0 , P 1 ∈ M and any y ∈ (0, 1), let P y := (1 − y)P 0 + yP 1 . Clearly p y ∈ G m ; we must show that log p y ∈ G m . Let f : (0, ∞) → R be defined by then $|\log z|^\lambda \le f(z)$, and f is of class $C^2$ with non-negative second derivative, and so is convex. It follows from Jensen's inequality that A further application of the Faà di Bruno formula shows that, for any s ∈ S 1 , Since ψ (n) /ψ is bounded, the arguments used above to show that D s ψ(a i ) ∈ L λ/|s| (µ) can be used to show that D σ ψ(a i )/ψ(a i ) ∈ L λ/|σ| (µ). Hölder's inequality then shows that D s log p y ∈ L λ/|s| (µ). We have thus shown that log p y ∈ G m . So P y ∈ M, and this completes the proof of part (iii).

Fixed Norm Spaces
Proposition 2 shows that the function ψ defines a superposition operator that "acts continuously" on the mixed norm Sobolev space G m . The question naturally arises whether or not it has this property with respect to any fixed norm spaces (other than W 1,1 (µ)). Since, for k ≥ 2 and λ ≥ λ 0 , the space G f = W k,(λ,...,λ) (µ) is a subset of G m and has a topology stronger than the relative topology, it is clear that ψ(G f ) ⊂ ψ(G m ) ⊂ G m , and that the restriction, Ψ : G f → G m , is continuous. However, except in one specific case, ψ(G f ) is not a subset of G f , as the following proposition shows.
Proposition 3. If λ > 1 and k ≥ 2 then there exists an a ∈ G f for which ψ(a) / ∈ G f .
Since ψ is not a polynomial, its k-th derivative ψ (k) is not identically zero, and we can choose −∞ < ζ 1 < ζ 2 < ζ 1 + 1 such that |ψ (k) (z)| ≥ ε for all z ∈ [ζ 1 , ζ 2 ] and some ε > 0. Finally, let a : R d → R be defined by the sum where α = exp(2/((k + 1)λ − 1)) and m > z t + 1. (The support of the n-th term in the sum here is a subset of S n , and so a is well defined and of class C ∞ .) We claim that a ∈ G f ; in fact, for any s ∈ S 1 with |s| = j, and a similar bound can be found for $E_\mu |a - \zeta_1|^\lambda$. It now suffices to show that $D^s \psi(a) \notin L^\lambda(\mu)$, where s = (k, 0, . . . , 0). Let then, for any x ∈ T n , a(x) = ζ 1 + nα n (x − σ n ) 1 ∈ [ζ 1 , ζ 2 ], and so where |T n | is the Lebesgue measure of T n , and this completes the proof.
Proof. As in the proof of Proposition 2, it suffices to show that, for any a ∈ G f , any sequence (a n ∈ G f ) converging to a in G f , and any s ∈ S 0 , F s (a n ) → F s (a) in L ν (µ), where F s is as defined in (42). For any s with |s| < k this can be accomplished by means of Hölder's inequality, as in the proof of Proposition 2. Furthermore, even if |s| = k, all terms in the sum on the right-hand side of (42) for which |π| < k can be treated in the same way. (There are no more than k − 1 factors in the product, each of which is in L λ (µ), and λ/(k − 1) ≥ ν.) This leaves the terms for which |π| = |s| = k; in order to show that these converge in L ν (µ) it suffices to show that, for any 1 ≤ i ≤ d, the sequence (|ψ (k) (a n )(a ′ n ) k | ν ) is uniformly integrable, where, for any weakly differentiable g : R d → R, g ′ := ∂g/∂x i .
where K 1 and K 2 depend only on the function ψ, and R(f ), S(f ) and T (f ) are as follows: In (49), we have used the boundedness of ψ (k) in the first step, Lemma 3(ii) in the second step and integration by parts with respect to x i in the fourth step. (If t = 1 in Example 1(i), then θ t (| · |) is not differentiable at 0 and the integration by parts has to be accomplished separately on the two sub-intervals (−∞, 0) and (0, ∞).) In (50), we have used (48). Let a m,n := a n (x)ρ(x/m) ∈ U m ; then, with J as defined in section 2, where we have used the definition of F in the first step, Lemma 1(ii) in the second step, Jensen's inequality in the third step, Lemma 1(i) in the fourth step and (16) in the final step. Similar bounds can be found for S(Ja m,n ) and, if t ∈ (0, 1] (so that l ′ is bounded), T (Ja m,n ).
If we want all derivatives of ψ(a) to be continuous maps from G f to L ν (µ) (for some ν ≥ 1) then the fixed norm space G f should have Lebesgue exponent λ = max{2, νk − 1}. (The resulting manifold will not have a strong enough topology for global information geometry unless λ ≥ 2.) The mixed norm space G m requires λ 1 = νk, λ 2 = νk/2, . . . , λ k = ν. This places a slightly higher integrability constraint on the first derivative, but lower constraints on all other derivatives (significantly lower if k ≥ 3). Furthermore, if G f is used as a model space, then ψ(a) and its first partial derivatives actually belong to L λ (µ), and so the true range of the superposition operator in this context is a mixed norm space, whether or not we choose to think about it in this way.
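The comparison of integrability requirements in the preceding paragraph can be tabulated with a few lines of code. This is only a sketch of the two formulas quoted there, $\lambda = \max\{2, \nu k - 1\}$ for the fixed-norm space and $\lambda_j = \nu k / j$ for the mixed-norm space; the function names and the sample values ν = 2, k = 3 are hypothetical.

```python
from fractions import Fraction

def fixed_norm_exponent(nu, k):
    """Single Lebesgue exponent for G_f quoted in the text: max{2, nu*k - 1}."""
    return max(Fraction(2), Fraction(nu) * k - 1)

def mixed_norm_exponents(nu, k):
    """Exponents (lambda_1, ..., lambda_k) for G_m quoted in the text: nu*k / j."""
    return [Fraction(nu) * k / j for j in range(1, k + 1)]

# Hypothetical example: derivatives of psi(a) required in L^2, with k = 3.
nu, k = 2, 3
fixed = fixed_norm_exponent(nu, k)
mixed = mixed_norm_exponents(nu, k)
# Per-derivative-order saving of the mixed-norm space over the fixed-norm space.
savings = [fixed - lam for lam in mixed]
```

For these values the fixed-norm space asks for exponent 5 at every order, while the mixed-norm space asks for 6, 3 and 2 at orders 1, 2 and 3: a slightly higher demand on first derivatives and a lower one on all others, as stated above.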
The case in which λ = 1 is of particular interest. Proposition 4 then shows that ψ defines a nonlinear superposition operator that acts continuously on G s := W 2,(1,1,1) (µ). The use of such a low Lebesgue exponent precludes the results in section 3 concerning the smoothness of the KL-divergence. In particular, we cannot expect to retain global geometric constructs such as the Fisher-Rao metric. However, D(µ| · ) : M s → [0, ∞) is still continuous for all t ∈ (0, 2], and D( · |µ) is finite if t = 2. Since ψ (1) is bounded, there is no difficulty in extending these results as follows.
Remark 2. When the model space, G, is G m , G s or G ms , then condition (M2) can be replaced by: (M2') p, log p ∈ G.

The Manifolds of Probability Measures
In this section we shall assume that λ 0 > 1, or that λ 0 = 1 and the embedding hypothesis (E1) holds. Let M 0 ⊂ M be the subset of the general manifold of section 3 (that modelled on G := W k,Λ (µ)), whose members are probability measures. These satisfy the additional hypothesis: The co-dimension 1 subspaces of L λ (µ) and G, whose members, a, satisfy E µ a = 0 will be denoted L λ 0 (µ), and G 0 . Let φ 0 : M 0 → G 0 be defined by Proposition 5. (i) φ 0 is a bijection onto G 0 . Its inverse takes the form where Z ∈ C N (G 0 ; R) is an (implicitly defined) normalisation function, and N = N(λ 0 , λ 1 , 1, t) is as defined in (26).
(ii) The first (and if N ≥ 2 second) derivative of Z is as follows: where $P_a := \tilde{P}_a / \tilde{P}_a(\mathbb{R}^d)$ and $\tilde{P}_a$ is the finite measure defined before (23).
(iv) Z and any derivatives it admits are bounded on bounded sets.
where $\Psi_\beta$ is as in Lemma 4. It follows from Lemma 4 that $\Upsilon$ is of class $C^N$ and that, for any $u \in G_0$, Since $\psi$ is convex, furthermore, the monotone convergence theorem shows that So $\Upsilon(a, \cdot)$ is a bijection with strictly positive derivative, and the inverse function theorem shows that it is a $C^N$-isomorphism. The implicit mapping theorem shows that $Z : G_0 \to R$, defined by $Z(a) = \Upsilon(a, \cdot)^{-1}(1)$, is of class $C^N$. For any $a \in G_0$, let $P$ be the probability measure on $X$ with density $p = \psi(a + Z(a))$; then $\phi_0(P) = a$ and $P \in M_0$, which completes the proof of part (i).
That the first derivative of $Z$ is as in (54) follows from (56). Since $E_\mu \psi^{(1)}(a + Z(a)) > 0$, parts (ii) and (iii) follow from Lemma 4 and the chain and quotient rules of differentiation (which hold for Leslie derivatives). Part (iv) is proved in Proposition 4.1 in [26].
Expressed in charts, the inclusion map $\imath : M_0 \to M$ is as follows: $\rho(a) := \phi \circ \imath \circ \phi_0^{-1}(a) = a + Z(a)$, and has the same smoothness properties as $Z$. The following goes further.
Proof. Let $\eta : G \to G_0$ be the superposition operator defined by $\eta(a)(x) = a(x) - E_\mu a$; then $\eta$ is of class $C^\infty$, has first derivative $\eta^{(1)}_a u = u - E_\mu u$, and zero higher derivatives. Now $\eta \circ \rho$ is the identity map of $G_0$, which shows that $\rho$ is homeomorphic onto its image, $\rho(G_0)$, endowed with the relative topology. Furthermore, for any $u \in G_0$, and so $\rho^{(1)}_a$ is a toplinear isomorphism, and its image, $\rho^{(1)}_a G_0$, is a closed linear subspace of $G$. Let $E_a$ be the one-dimensional subspace of $G$ defined by $E_a = \{y\psi^{(1)}(\rho(a)) : y \in R\}$. If $u \in E_a$ and $v \in \rho^{(1)}_a G_0$ then there exist $y \in R$ and $w \in G_0$ such that We have thus shown that $\rho$ is a $C^N$-immersion, and this completes the proof.
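Because $Z$ is only implicitly defined, a small numerical sketch may help to fix ideas. The Python code below is an illustration under stated assumptions, not the construction used in the proof: the reference measure is replaced by a hypothetical discrete measure with weights `mu`, the deformed logarithm of (21) is assumed to be $\log_d y = y - 1 + \log y$ (so that $\psi^{(1)} = \psi/(1+\psi)$), and the escort measure is taken to have $\mu$-density proportional to $\psi^{(1)}(a + Z(a))$, which is what implicit differentiation of $E_\mu \psi(a + Z(a)) = 1$ gives. The bracket $[-\max a, 0]$ used for bisection follows from Jensen's inequality and the monotonicity of $\psi$ when $E_\mu a = 0$.

```python
import math


def psi(z):
    """Inverse of the ASSUMED deformed logarithm log_d(y) = y - 1 + log(y)."""
    # Newton's method in w = log(y) on g(w) = exp(w) - 1 + w - z (convex, increasing).
    w = z if z <= 0.0 else math.log1p(z)
    for _ in range(100):
        step = (math.exp(w) - 1.0 + w - z) / (math.exp(w) + 1.0)
        w -= step
        if abs(step) < 1e-15:
            break
    return math.exp(w)


def psi1(z):
    """First derivative of psi under the same assumption: psi / (1 + psi)."""
    y = psi(z)
    return y / (1.0 + y)


def centre(a, mu):
    """Subtract the mu-mean, so that the result has E_mu a = 0."""
    mean = sum(m * ai for ai, m in zip(a, mu))
    return [ai - mean for ai in a]


def total_mass(a, mu, z):
    """E_mu psi(a + z) for a discrete reference measure with weights mu."""
    return sum(m * psi(ai + z) for ai, m in zip(a, mu))


def Z(a, mu):
    """Solve E_mu psi(a + z) = 1 for z by bisection; a must be mu-centred."""
    lo, hi = -max(a), 0.0  # total_mass <= 1 at lo and >= 1 at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total_mass(a, mu, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dZ(a, mu, u):
    """Directional derivative of Z at a along u: minus the escort mean of u."""
    z = Z(a, mu)
    w = [m * psi1(ai + z) for ai, m in zip(a, mu)]
    return -sum(wi * ui for wi, ui in zip(w, u)) / sum(w)
```

A central finite difference of `Z` along a centred direction `u` can be compared with `dZ(a, mu, u)` as a sanity check on the escort-mean formula.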
For any $P \in M_0$, the tangent space $T_P M_0$ is a subspace of $T_P M$ of codimension 1; in fact, as shown in the proof of Proposition 6, Let $\Phi_0 : TM_0 \to G_0 \times G_0$ be defined as follows: a u). For any $(P, U) \in TM_0$, $U\phi = \rho^{(1)}_a u = u - E_{P_a} u$, and so tangent vectors in $T_P M_0$ are distinguished from those merely in $T_P M$ by the fact that their total mass is zero:
The map $Z$ of (53) is (the negative of) the additive normalisation function, $\alpha$, associated with the interpretation of $M_0$ as a generalised exponential model with deformed exponential function $\psi$. (See Chapter 10 in [22]. We use the symbol $Z$ rather than $-\alpha$ for reasons of consistency with [24,26].) In this context, the probability measure $P_a$ of (54) is called the escort measure to $P$.
In [19], the authors considered local charts on the Hilbert manifold of [24]. In the present context, these take the form $\phi_P : M_0 \to G_P$, where $G_P$ is the subspace of $G$ whose members, $b$, satisfy $E_{P_a} b = 0$. This amounts to re-defining the origin of $G$ as $\phi(P)$, and using the co-dimension 1 subspace that is tangential to the image $\phi(M_0)$ at this new origin as the model space. This local chart is normal at $P$ for the Riemannian metric and Levi-Civita parallel transport induced by the global chart $\phi$ on $M$. However, the metric differs from the Fisher-Rao metric on all fibres of the tangent bundle other than that at $\mu$.
Their properties are developed in [24], and follow from those of m β and e β .

Application to Nonlinear Filtering
We sketch here an application of the manifolds of sections 3 and 5 to the nonlinear filtering problem discussed in section 1. An abstract filtering problem (in which $X$ is a Markov process evolving on a measurable space) was investigated in [25]. Under suitable technical conditions, it was shown that the $(Y_s, 0 \le s \le t)$-conditional distribution of $X_t$, $\Pi_t$, satisfies an infinite-dimensional stochastic differential equation on the Hilbert manifold of [24], and this representation was used to study the filter's information-theoretic properties. This equation involves the normalisation constant $Z$, which is difficult to use since it is implicitly defined, and so it is of interest to use a manifold of finite measures not involving $Z$, such as $M$ of section 3. Because of its special connection with the function $\psi$, the mixed norm model space $G_m$ of section 4.1 is of particular interest, although the fixed norm spaces of section 4.2 could also be used.
If the conditional distribution $\Pi_t$ has a density with respect to Lebesgue measure, $p_t$, satisfying the Kushner-Stratonovich equation (3), then its density with respect to $\mu$, $\pi_t = p_t/r$, also satisfies (3), but with the transformed forward operator: where $\Gamma = gg^*$ and we have used the Einstein summation convention. The density $\pi_t$ also satisfies where, for appropriate densities $p$, $\bar{h}(p) := (E_\mu p)^{-1} E_\mu p h$. Unlike (3), this equation is homogeneous, in the sense that if $\pi_t$ is a solution then so is $\alpha\pi_t$, for any $\alpha > 0$. A straightforward formal calculation shows that $\log_d \pi_t$ satisfies the following stochastic partial differential equation where $v(x, a) = (1 + \psi(a(x)))(h(x) - \bar{h}(\psi(a)))$, and In order to make sense of (64) and (65), we need further hypotheses. The following are used for illustration purposes, and are not intended to be sharp.
(F2) The functions $f$, $g$ and $h$ are of class $C^\infty(R^d)$.
(F3) The functions f and h, and all their derivatives, satisfy polynomial growth conditions in |x|.
(F4) The function g and all its derivatives are bounded.
Parts (ii) and (iii) can be proved by applying Hölder's inequality to the weak derivatives of the various components of $u(\,\cdot\,, a)$ and $v(\,\cdot\,, a)$. The quadratic term in $u$ is the most difficult to treat, and so we give a detailed proof for this. We begin by noting that $(1+\psi(a))^{-1}\,\partial a/\partial x_i = \partial(a - \psi(a))/\partial x_i$. For any $|s| \le k - 2$ According to Proposition 2, the nonlinear superposition operator $\Psi_{\sigma,i} : G_m \to L^{\lambda_1/(|\sigma|+1)}(\mu)$ defined by $\Psi_{\sigma,i}(a) = D^\sigma(\partial\psi(a)/\partial x_i)$ is continuous, and so it follows from Hölder's inequality that the same is true of $\Upsilon_{s,i,j} : G_m \to L^{\lambda_1/(|s|+2)}(\mu)$ defined by the right-hand side of (67). Together with (F4), this shows that $\Upsilon : G_m \to H^{k-2}(\mu)$ defined by $\Upsilon(a) = \Gamma^{ij}(\partial\psi(a)/\partial x_i)(\partial\psi(a)/\partial x_j)$ is continuous. The other components of $u(\,\cdot\,, a)$ and the only component of $v(\,\cdot\,, a)$ can be shown to have the stated continuity by similar arguments. These make use of (66), Proposition 2 and part (i) here.
One application of Proposition 7 is in the development of projective approximations, as proposed in the context of the exponential Orlicz manifold in [6] and the earlier references therein. As a particular instance, suppose that $k \ge 2$ and $\lambda_1 \ge 2k$; let $(\eta_i \in C^k(R^d) \cap G_m,\ 1 \le i \le m)$ be linearly independent, and define $G_{m,\eta} := \mathrm{span}\{\eta_1, \ldots, \eta_m\}$. This is an $m$-dimensional linear subspace of both $G_m$ and $H^{k-2}(\mu)$. We can use the inner product of $H^{k-2}(\mu)$ to project members of $H^{k-2}(\mu)$ onto $G_{m,\eta}$. In particular, we can project $U(a)$ and $V(a)$ onto $G_{m,\eta}$ for any $a \in G_{m,\eta}$ to obtain continuous vector fields of the finite-dimensional submanifold of $M$ defined by $M_\eta = \phi^{-1}(G_{m,\eta})$. Since the model space norms of $H^{k-2}(\mu)$ dominate the Fisher-Rao metric on every fibre of the tangent bundle (38), the projection takes account of the information theoretic cost of approximation, as well as controlling the derivatives of the conditional density $\pi_t$. $M_\eta$ is a finite-dimensional deformed exponential model, and is trivially a $C^\infty$-embedded submanifold of $M$.
Many other classes of finite-dimensional manifold also have this property. For example, since $\Psi(G_m)$ is convex, certain finite-dimensional mixture manifolds modelled on the space $G_{m,\eta}$, where $\eta_i \in \Psi(G_m)$, are also $C^\infty$-embedded submanifolds of $M$. This is also true of particular finite-dimensional exponential models.
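The projection onto the $m$-dimensional subspace spanned by the $\eta_i$ that underlies the projective approximation above amounts to solving a Gram system. The Python sketch below illustrates this in the simplest case $k = 2$, where $H^{k-2}(\mu) = L^2(\mu)$. Everything concrete in it is an assumption made for illustration: the one-dimensional grid, the weights proportional to $\exp(-|x|)$ standing in for $\mu$, and the basis functions $x$ and $x^2$.

```python
import math


def inner(f, g, w):
    """Discrete stand-in for the L^2(mu) inner product: sum of w_i f_i g_i."""
    return sum(wi * fi * gi for wi, fi, gi in zip(w, f, g))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def project(u, basis, w):
    """Coefficients c with sum_i c_i eta_i the orthogonal projection of u."""
    gram = [[inner(ei, ej, w) for ej in basis] for ei in basis]
    rhs = [inner(ei, u, w) for ei in basis]
    return solve(gram, rhs)


# Hypothetical data: grid on [-6, 6], weights proportional to exp(-|x|).
xs = [-6.0 + 0.01 * i for i in range(1201)]
raw = [math.exp(-abs(x)) for x in xs]
w = [r / sum(raw) for r in raw]
basis = [list(xs), [x * x for x in xs]]  # eta_1(x) = x, eta_2(x) = x^2
```

Projecting a vector field evaluated on the grid then reduces to `project(values, basis, w)`; the residual is orthogonal to each basis function in the discrete inner product.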

Concluding Remarks
This paper has developed a class of infinite-dimensional statistical manifolds that use the balanced chart of [24,26] in conjunction with a variety of probability spaces of Sobolev type. It has shown that the mixed-norm space of section 4.1 is especially suited to the balanced chart (and any other chart with similar properties), in the sense that densities then also belong to this space and vary continuously on the manifolds. It has shown that this property is also true of a particular fixed norm space involving two derivatives, but can be retained for fixed norm spaces with more than two derivatives only with a loss of Lebesgue exponent. The paper has outlined an application of the manifolds to nonlinear filtering (and hence to the Fokker-Planck equation). Although motivated by problems of this type, the manifolds are clearly applicable in other domains, the Boltzmann equation of statistical mechanics being an obvious candidate.
The deformed exponential function used in the construction of $M$ has linear growth, a feature that has recently been shown to be advantageous in quantum information geometry [23]. The linear growth arises from the deformed logarithm of (21), which is dominated by the density, $p$, when the latter is large. As recently pointed out in [20], this property is shared by other deformed exponentials, notably the Kaniadakis 1-exponential $\psi_K(z) = z + \sqrt{1 + z^2}$. The corresponding deformed logarithm is $\log_K(y) = (y^2 - 1)/(2y)$, and so the density is controlled (when close to zero) by the term $-1/p$ rather than $\log p$, as used here. In the non-parametric setting, the need for both $p$ and $1/p$ to be in $L^{\lambda_0}(\mu)$ places significant restrictions on membership of the manifold. If, for example, the reference measure of Example 1(i) is used, and $t = 1$, then the measure having density $C\exp(-\alpha|x|)$ (with respect to Lebesgue measure) belongs to the manifold only if $|\alpha - 1| < 1/\lambda_0$.
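The two Kaniadakis formulas quoted above can be checked against each other numerically. The following Python sketch is only a sanity check: the function names are hypothetical, and the branch used for negative arguments is an implementation choice that avoids cancellation in $z + \sqrt{1+z^2}$.

```python
import math


def psi_K(z):
    """Kaniadakis 1-exponential z + sqrt(1 + z^2), evaluated stably."""
    s = math.hypot(1.0, z)
    # For z < 0 use the equivalent form 1 / (sqrt(1 + z^2) - z).
    return z + s if z >= 0.0 else 1.0 / (s - z)


def log_K(y):
    """Kaniadakis 1-logarithm (y^2 - 1) / (2 y), defined for y > 0."""
    return (y * y - 1.0) / (2.0 * y)


def round_trip_error(z):
    """Absolute error of log_K(psi_K(z)) as an approximation to z."""
    return abs(log_K(psi_K(z)) - z)
```

Evaluating `round_trip_error` at a few positive and negative arguments confirms that `log_K` inverts `psi_K`, and the ratio `psi_K(z) / z` for large `z` exhibits the linear growth, with limiting slope 2.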
The Kaniadakis 1-exponential shares the properties of ψ used in this paper; these are summarised in Lemma 5, which is easily proved by induction.
and so $\psi_K$ is strictly increasing and convex.
(ii) For any $n \ge 2$, where $Q_{3(n-2)}$ is a polynomial of degree no more than $3(n - 2)$. In particular, $\psi$
We can therefore construct a manifold of finite measures $M_K$, as in section 3, substituting the chart of (21) by $\phi_K : M_K \to G$, defined by $\phi_K(P) = \log_K p$. The only properties of $\psi$ used in section 3 are its strict positivity, and the boundedness of its derivatives, properties shared by $\psi_K$. The results in section 4 carry over to $M_K$ with the exception of Proposition 2(iii). Most of these depend only on the boundedness of the derivatives of $\psi$; however, the integration by parts in (49) uses (20), which can be substituted by (70) in the case of $M_K$. The results of section 5 all carry over to $M_K$. Proposition 5(v) depends on the strict positivity of $\inf_{a \in B} E_\mu \psi^{(1)}(a)$ for bounded sets $B \subset G$. This is also true of $\psi_K$, since where we have used Jensen's inequality in both steps.
$M_K$ is a subset of $M$. Let $\tau : R \to R$ be the "transition function" $\tau(z) = \log_d \psi_K(z)$. All derivatives of $\tau$ are bounded, which explains why the regularity of the KL-divergence on $M$ carries over to $M_K$. Furthermore, it follows from arguments similar to those used in the proof of Proposition 2 that the superposition operator $T_m : G_m \to G_m$ defined by $T_m(a)(x) = \tau(a(x))$ is continuous for any of the mixed norm model spaces of section 4.1.
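The boundedness of the first derivative of the transition function can be illustrated numerically. The Python sketch below assumes that the deformed logarithm of (21) is $\log_d y = y - 1 + \log y$; under that assumption the chain rule gives $\tau^{(1)}(z) = (1 + \psi_K(z))/\sqrt{1 + z^2}$, which elementary calculus shows to lie in $(0, 1 + \sqrt{2}]$, with the maximum at $z = 1$. Only the first derivative is examined here; the function names are hypothetical.

```python
import math


def psi_K(z):
    """Kaniadakis 1-exponential z + sqrt(1 + z^2), evaluated stably."""
    s = math.hypot(1.0, z)
    return z + s if z >= 0.0 else 1.0 / (s - z)


def tau(z):
    """Transition function log_d(psi_K(z)), with log_d(y) = y - 1 + log(y) ASSUMED."""
    y = psi_K(z)
    return y - 1.0 + math.log(y)


def tau1(z):
    """First derivative of tau under the same assumption."""
    return (1.0 + psi_K(z)) / math.hypot(1.0, z)


def sup_tau1_on_grid(lo=-200.0, hi=200.0, n=40001):
    """Largest value of tau1 on a uniform grid; a numerical stand-in for the sup."""
    step = (hi - lo) / (n - 1)
    return max(tau1(lo + i * step) for i in range(n))
```

The grid maximum returned by `sup_tau1_on_grid()` can be compared with the value $1 + \sqrt{2}$ predicted by the closed form.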
The deformed logarithm of (21) was chosen in [24] because the resulting manifold is highly inclusive, and suited to the Shannon-Fisher-Rao information geometry. In this context, it yields the global bound (41). The Kaniadakis 1-logarithm is less suited to this geometry, but more to that generated by the Kaniadakis 1-entropy, which is of interest in statistical mechanics. The full development of this geometry is beyond the scope of this article.
Condition (6) (on the reference measure µ) has to be considered in the context of (M2), which places upper and lower bounds on the rate at which the densities of measures in M can decrease as |x| becomes large. For example, if all nonsingular Gaussian measures are to belong to M, then (M2) requires r to decay more slowly than a Gaussian density, but more rapidly than a Cauchy density. Variants of the reference measure µ with t ∈ [1, 2) may be good choices for such applications.