1 Introduction

Great progress has been made during the last five decades on the theory of information geometry, and its application in many scientific fields. The fundamental parametric theory is well developed, and is treated pedagogically in a number of texts (See, for example, [1, 3, 5, 9, 12]). The non-parametric theory, on the other hand, is largely to be found in a series of research papers. A notable exception is the text [2], which treats parametric and non-parametric theories in a unified way. The step from the parametric to the non-parametric setting is not an easy one, since it introduces the infinite-dimensional spaces of Functional Analysis.

The parametric exponential model is arguably the nucleus of the subject. Its extension to the non-parametric setting was accomplished by G. Pistone and his co-workers in the fundamental series of papers [4, 7, 18, 19]. The manifolds constructed there are “maximally inclusive” in a precise sense, and various statistical divergences, including the Amari \(\alpha \)-divergences, for \(\alpha \) in the interval \([-1,1]\), are smooth on them. As with parametric exponential manifolds, the log of the density is used as a chart. This requires a model space with a particularly strong topology: the exponential Orlicz space, which has a number of disadvantages. Since the publication of these papers, several variations of the exponential Orlicz manifold have been developed. In [11], the exponential function was replaced by the Tsallis q-deformed exponential, which has an important interpretation in statistical mechanics (see [20], and Chapter 7 in [13]). The model space used is \(L^\infty \), which significantly restricts membership of the manifold constructed. A large class of deformed exponential functions (the “\(\varphi \)-functions”) was used in [21, 22] to construct inclusive manifolds of probability measures, in which the model spaces are Musielak-Orlicz spaces.

The constructions in these references begin with the tangent space at a generic point, P, of a set of measures. A representation of tangent vectors, derived from the (deformed) logarithm, is then used to construct a local chart, which naturally maps to a model space defined in terms of P. However, the model spaces required in this approach can be difficult to use in practice. A different approach was taken in [14, 15], where a global chart was used with the specific deformed logarithm \(\log _d y=y-1+\log y\) to construct an inclusive manifold modelled on Lebesgue \(L^p\) spaces (including the Hilbert space \(L^2\)). The corresponding deformed exponential has bounded derivatives of all orders, a property that has a number of advantages. Both the probability density and its (non-deformed) log (objects of central importance to information geometry) belong to the model space and, considered as superposition operators mapping into this space, are continuous.

The sample space in all these manifolds is an abstract probability space: a set of “outcomes”, a class of measurable subsets of these outcomes, and a probability measure attaching a number between 0 and 1 to each subset. This has the advantage of generality: the sample space could be \(\mathbb {R}^d\), or the path space of a stochastic process, and so on. However, topologies, metrics and linear structures on the sample space play important roles in most applications, including the theory of partial differential equations. A natural direction for research in the non-parametric theory is to specialise the manifolds outlined above to such problems by incorporating the topology of the sample space in the manifolds. One way of achieving this is to use model spaces of Sobolev type. This was carried out in the context of the exponential Orlicz manifold in [10], where it was applied to the spatially homogeneous Boltzmann equation. It was carried out in the context of the \(L^p\) manifolds in [17]. The resulting fusion of information and sample space topologies is mutually beneficial. For example, it was shown in [17] that log-Sobolev embedding strengthens the topology of the raw \(L^p\) manifolds in a useful way.

This paper takes the approach of [14, 15, 17] further, by constructing a new class of non-parametric manifolds based on a two-parameter deformed exponential, dubbed the \(\eta \)-exponential. The paper has two primary aims: (i) to provide manifolds on which a wider class of divergences and entropy functions can be accommodated; (ii) to refine the Sobolev space methods of [17] in order to increase the degree of smoothness they confer on these quantities. Regarding the first aim, there is a vast literature on the importance of different divergences and entropies to particular branches of science, a full review of which is beyond the scope of this article. Let us mention, however, the special volume (13) of Entropy on applications of the Tsallis q-divergences and entropies. The review article by Tsallis [20], in particular, cites many applications in which the q parameter should be strictly greater than or strictly less than the value \(q=1\) of the Boltzmann-Gibbs theory. In the context of Amari’s \(\alpha \)-divergences, this translates to values of \(\alpha \) both greater than and less than 1. The author was motivated, in particular, by the study of multi-objective measures of error in nonlinear filtering that lead naturally to divergences with \(\alpha <-1\), [16].

The two-parameter \(\eta \)-exponential we use corresponds to a deformed exponential introduced in [8], but is reparametrised into the Amari setting. The manifolds accommodate \(\alpha \)-divergences and entropies over a range of \(\alpha \) values, but are especially suited to those for which \(\alpha \in [\eta _-,\eta _+]\), for the chosen parameters \(-\infty<\eta _-<1\le \eta _+<\infty \). The parameter values \(\pm 1\) yield the linear-growth deformed exponential of [14, 15, 17]; values of \(\eta _-\) other than \(-1\) yield deformed exponentials with power law or sublinear growth. For a more detailed account of deformed exponentials, and their use in Statistical Mechanics, the reader is referred to [13].

The paper is structured as follows. Section 2 introduces the model Sobolev spaces, expanding considerably on the material in [17]. It also introduces a new class of “subordinate” Sobolev spaces, which are later used in the analysis of superposition operators derived from Amari’s \(\alpha \)-embedding maps. (The latter can be used in the analysis of divergences and entropies.) Section 3 introduces the two-parameter deformed exponential and uses it to construct manifolds of finite measures. Section 4 then shows that the subsets of probability measures are smoothly embedded submanifolds of those of Sect. 3.

2 The Sobolev spaces

The manifolds are modelled on mixed-norm, weighted Sobolev spaces generalising those defined in [17]. The spaces are based on a reference probability measure \(\mu \) on the sample space \(\mathbb {R}^d\). This takes the form

$$\begin{aligned} \mu (dx) = r(x)dx = \exp (l_r(x))dx, \end{aligned}$$
(1)

where \(l_r:\mathbb {R}^d\rightarrow \mathbb {R}\) is a continuous function such that \(\mu (\mathbb {R}^d)=1\). Stronger results can be obtained with the following additional hypothesis on \(l_r\).

  1. (E)

    The log-density \(l_r\) is constructed as follows. Let \(\theta :[0,\infty )\rightarrow [0,\infty )\) be a strictly increasing, convex function that is twice continuously differentiable on \((0,\infty )\), is such that \(\lim _{z\downarrow 0}\theta ^\prime (z)<\infty \), \(-\sqrt{\theta }\) is convex and, for some \(t\in (1,2]\),

    $$\begin{aligned} \theta (z) = \left\{ \begin{array}{ll} 0 &{} \quad \mathrm{if\ }z=0 \\ c + z^t &{} \quad \mathrm{if\ }z\ge z_0 \end{array}\right\} , \quad \mathrm{where\ }z_0\ge 0,\mathrm{\ and\ }c\in \mathbb {R}. \end{aligned}$$
    (2)

    \(l_r\) then takes the special form

    $$\begin{aligned} l_r(x) := \textstyle \sum _i (C-\theta (|x_i|)), \end{aligned}$$
    (3)

    where \(C\in \mathbb {R}\) is such that \(\mu (\mathbb {R}^d)=1\). (Some examples are given in [17] including the Gaussian case, in which \(c=z_0=0\) and \(t=2\).)
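
The normalising constant C in (3) can be checked numerically in simple cases. The following is a minimal sketch for the Gaussian case mentioned above (\(c=z_0=0\), \(t=2\), so that \(\theta (z)=z^2\)); the function names, the truncation interval and the quadrature rule are illustrative choices, not part of the construction. Since \(\mu \) is a product measure under (E), it suffices to normalise a single coordinate.

```python
import math

def theta(z: float) -> float:
    """Gaussian case of hypothesis (E): c = z_0 = 0 and t = 2, so theta(z) = z**2."""
    return z * z

def normalising_constant(theta_fn, half_width=12.0, n=20_000):
    """Per-coordinate C such that the integral of exp(C - theta(|z|)) over R is 1.

    Uses the trapezoidal rule on [-half_width, half_width]; the mass outside
    this interval is assumed negligible (reasonable when theta grows like z**t, t > 1).
    """
    h = 2.0 * half_width / n
    total = 0.0
    for j in range(n + 1):
        z = -half_width + j * h
        w = 0.5 if j in (0, n) else 1.0  # trapezoidal end-point weights
        total += w * math.exp(-theta_fn(abs(z)))
    return -math.log(total * h)

C = normalising_constant(theta)
print(C, -0.5 * math.log(math.pi))  # the two values should agree closely
```

In this case the exact value is \(C=-\tfrac{1}{2}\log \pi \), since \(\int \exp (-z^2)\,dz=\sqrt{\pi }\).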

The model spaces used in the construction of the manifolds comprise measurable functions defined on \(\mathbb {R}^d\) having weak derivatives of various orders, that belong to the Lebesgue spaces \({L^\lambda }=L^\lambda (\mu )\) for various exponents \(\lambda \). (\(f\in {L^\lambda }\) if and only if \({{\textrm{E}}_\mu }|f|^\lambda :=\int |f(x)|^\lambda \mu (dx)<\infty \).) Under (E), \(\mu \) is a product measure and the model spaces admit a log-Sobolev embedding result.

Let \(C^\infty (\mathbb {R}^d;\mathbb {R})\) be the space of continuous functions with continuous partial derivatives of all orders, and let \({C_0^\infty }(\mathbb {R}^d;\mathbb {R})\) be the subspace of those functions having compact support. For any \(\lambda \in [2,\infty )\), and any \(0\le k\le \lambda \), the space \({W^{k,\lambda }}\) is the mixed-norm Sobolev space comprising measurable functions \(a\in {L^\lambda }\) that have weak partial derivatives up to order k, those of order i belonging to the Lebesgue space \(L^{\lambda /i}\). We shall also use the “subordinate” spaces \({W^{k,\lambda ;l}}\), for certain integer values of l. Let \({\lambda ^\circ }\) be the following Lebesgue exponent: if (E) holds and \(k\ge 1\) then \({\lambda ^\circ }=\lambda \), otherwise \({\lambda ^\circ }=\lambda -\epsilon \) for some \(0<\epsilon \ll 1\). For \(1\le l\le \lfloor {\lambda ^\circ }\rfloor \), the space \({W^{k,\lambda ;l}}\) comprises measurable functions \(a\in L^{{\lambda ^\circ }/l}\) that have weak partial derivatives up to order \(k_l:=\min \{k,\lfloor {\lambda ^\circ }\rfloor -l\}\), those of order i belonging to the Lebesgue space \(L^{{\lambda ^\circ }/(l+i)}\). (For convenience \(W^{k,\lambda ;0}:=W^{k,\lambda }\).)
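
The bookkeeping of Lebesgue exponents in these spaces can be tabulated mechanically. The sketch below is illustrative only: it assumes that (E) holds and \(k\ge 1\), so that \({\lambda ^\circ }=\lambda \), and it takes \(\lambda \) to be an integer so that the arithmetic is exact. The function name `lebesgue_exponents` is hypothetical.

```python
from fractions import Fraction
from math import floor

def lebesgue_exponents(k: int, lam: int, l: int = 0):
    """Map derivative order i -> Lebesgue exponent for W^{k,lam;l}.

    Assumes hypothesis (E) and k >= 1, so lam_circ = lam (illustrative
    assumption). l = 0 corresponds to the space W^{k,lam} itself.
    """
    lam_circ = lam
    if not 0 <= l <= floor(lam_circ):
        raise ValueError("l must satisfy 0 <= l <= floor(lam_circ)")
    if l == 0:
        # order 0 lies in L^lam, order i >= 1 in L^{lam/i}
        return {i: Fraction(lam, max(i, 1)) for i in range(k + 1)}
    k_l = min(k, floor(lam_circ) - l)  # highest derivative order retained
    return {i: Fraction(lam_circ, l + i) for i in range(k_l + 1)}

for l in range(0, 5):
    print(l, lebesgue_exponents(k=2, lam=4, l=l))
```

For \(k=2\) and \(\lambda =4\), the space \(W^{2,4;3}\) retains only first-order derivatives (\(k_3=1\)), with exponents 4/3 and 1, and \(W^{2,4;4}\) is simply \(L^1\).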

Model spaces with more general derivative structures were developed in [17], including fixed-norm spaces; however, the Lebesgue exponents in \(W^{k,\lambda }\) and its subordinates are especially suited to the deformed logarithms used here. Weak derivatives are defined in the usual way: for any \(\varphi \in {C_0^\infty }(\mathbb {R}^d;\mathbb {R})\),

$$\begin{aligned} \int (\partial _i a)\,\varphi \;dx = - \int a \,(\partial _i\varphi )\; dx \quad \mathrm{where\ } \partial _ia \mathrm{\ is\ shorthand\ for\ }\frac{\partial a}{\partial x_i }. \end{aligned}$$
(4)

In order to express higher-order weak derivatives in an efficient way, we use the following standard “multi-index” notation. Let \(S:=\{0,\ldots , k\}^d\) be the set of d-tuples of integers in the range \(0\le s_i\le k\). For \(s\in S\), we define \(|s|=\sum _is_i\), and denote by \(S_i:=\{s\in S:1\le |s|\le i\}\) the set of d-tuples of weight at most i. For appropriate a, we define the following

$$\begin{aligned} D^sa= & {} \partial _1^{s_1}\cdots \partial _d^{s_d}a \nonumber \\ \Vert a\Vert _{W^{k,\lambda }}^\lambda= & {} \Vert a\Vert _{L^{\lambda }}^\lambda + \sum _{s\in S_k}\Vert D^sa\Vert _{L^{\lambda /(|s|)}}^\lambda \nonumber \\ \Vert a\Vert _{W^{k,\lambda ;l}}^\lambda= & {} \Vert a\Vert _{L^{{\lambda ^\circ }/l}}^\lambda + \sum _{s\in S_{k_l}}\Vert D^sa\Vert _{L^{{\lambda ^\circ }/(l+|s|)}}^\lambda , \quad 1\le l\le \lfloor {\lambda ^\circ }\rfloor . \end{aligned}$$
(5)
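
As an illustration of the mixed norm in (5), the following sketch evaluates \(\Vert a\Vert _{W^{k,\lambda }}\) in the one-dimensional case (\(d=1\)) for the Gaussian-type reference density \(r(x)=\exp (-x^2)/\sqrt{\pi }\). The derivatives of a are supplied by hand, and the quadrature rule and function names are illustrative assumptions.

```python
import math

def expect(g, half_width=10.0, n=20_000):
    """E_mu[g] for mu(dx) = exp(-x^2)/sqrt(pi) dx, by the trapezoidal rule."""
    h = 2.0 * half_width / n
    total = 0.0
    for j in range(n + 1):
        x = -half_width + j * h
        w = 0.5 if j in (0, n) else 1.0
        total += w * g(x) * math.exp(-x * x)
    return total * h / math.sqrt(math.pi)

def lebesgue_norm(f, p):
    """Norm of f in L^p(mu)."""
    return expect(lambda x: abs(f(x)) ** p) ** (1.0 / p)

def mixed_norm(derivs, lam):
    """Norm (5) for d = 1; derivs[i] is the i-th derivative of a (derivs[0] = a)."""
    total = lebesgue_norm(derivs[0], lam) ** lam
    for i in range(1, len(derivs)):
        # derivatives of order i are measured in L^{lam/i}
        total += lebesgue_norm(derivs[i], lam / i) ** lam
    return total ** (1.0 / lam)

# a(x) = x, so a' = 1 and a'' = 0; here k = 2 and lam = 4.
norm = mixed_norm([lambda x: x, lambda x: 1.0, lambda x: 0.0], lam=4)
print(norm ** 4)
```

Here \({{\textrm{E}}_\mu }|x|^4=3/4\) and \(\Vert 1\Vert _{L^4}^4=1\), so the printed value should be close to 1.75.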

Theorem 1

  1. (i)

    For any \(1\le l\le \lfloor {\lambda ^\circ }\rfloor \), \({W^{k,\lambda ;l}}\) and \({W^{k,\lambda }}\) are Banach spaces with respect to the norms in (5);

  2. (ii)

    For any \(1\le l\le \lfloor {\lambda ^\circ }\rfloor \), \({C_0^\infty }(\mathbb {R}^d;\mathbb {R})\) is dense in \({W^{k,\lambda ;l}}\) and \({W^{k,\lambda }}\).

Proof

Both parts are proved in Theorem 1 and Lemmas 1 and 2 in [17]. (The only property required of the log density, \(l_r\), is its continuity.) Part (ii) is a consequence of the non-increasing nature of the Lebesgue exponents in \({W^{k,\lambda }}\) and \({W^{k,\lambda ;l}}\). \(\square \)

The spaces admit the following continuous embeddings:

$$\begin{aligned} W^{k,\lambda ;0} := {W^{k,\lambda }}\prec {W^{k,\lambda ;l}}\prec W^{k,\lambda ;{\tilde{l}}},\quad \mathrm{where\ } 1\le l< {\tilde{l}}\le \lfloor {\lambda ^\circ }\rfloor . \end{aligned}$$
(6)

The spaces \({W^{k,\lambda }}\) will be used as model spaces for manifolds of finite measures in Sect. 3, and centred versions of them, as model spaces for manifolds of probability measures in Sect. 4. The following theorem derives some properties of particular types of map acting on them. It will be used in the sequel.

Theorem 2

  1. (i)

    For any \(\psi \in {C^\infty }(\mathbb {R};\mathbb {R})\) having bounded derivatives of all orders, the nonlinear superposition operator \(\Psi :{W^{k,\lambda }}\rightarrow {W^{k,\lambda }}\), defined by \(\Psi (a)(x)=\psi (a(x))\), is continuous. Its spatial derivatives are given by the Faà di Bruno formula

    $$\begin{aligned} D^s\Psi (a) = F_s(a) := \sum _{\pi \in \Pi (s)}\psi ^{(|\pi |)}(a)\prod _{\sigma \in \pi }D^\sigma a, \end{aligned}$$
    (7)

    where \(\pi =\{\sigma _1,\ldots ,\sigma _{|\pi |}\in S_{|s|};1\le |\sigma _j|\le |s|,\sum _j\sigma _j=s\}\) is a partition of s, \(|\pi |\) is the cardinal of \(\pi \), and \(\Pi (s)\) is the set of all such partitions.

  2. (ii)

    For appropriate Banach spaces of functions on \(\mathbb {R}^d\), A, B and C, let \(\Pi _{A,B}:A\times B\rightarrow C\) be defined by \(\Pi _{A,B}(a,b)(x)=a(x)b(x)\). \(\Pi _{A,B}\) is a well defined, continuous, bilinear map in the following instances:

    $$\begin{aligned}{} & {} A =B = C = L^\infty \cap W^{k,\lambda } \quad \mathrm{normed\ by\ } \Vert \cdot \Vert _{L^\infty }+\Vert \cdot \Vert _{W^{k,\lambda }};\nonumber \\ {}{} & {} A = L^\infty \cap W^{k,\lambda },\quad B=W^{k,\lambda },\quad C=W^{k,\lambda ;1}; \nonumber \\ {}{} & {} A =L^\infty \cap W^{k,\lambda },\quad B=C= W^{k,\lambda ;l}\quad 1\le l\le \lfloor \lambda ^\circ \rfloor ; \nonumber \\ {}{} & {} A=W^{k,\lambda ;l}, \quad B=W^{k,\lambda ;{\tilde{l}}},\quad C=W^{k,\lambda ;l+{\tilde{l}}} \quad 1\le l,{\tilde{l}};\ l+{\tilde{l}}\le \lfloor {\lambda ^\circ }\rfloor . \end{aligned}$$
    (8)

    The spatial derivatives of \(\Pi _{A,B}\) are given by the Leibniz formula

    $$\begin{aligned} D^s\Pi _{A,B}(a,b) = H_s(a,b) := \sum _{\sigma \le s}\frac{s!}{\sigma !(s-\sigma )!} D^\sigma a D^{s-\sigma }b, \end{aligned}$$
    (9)

    where \(s!:=s_1!\cdots s_d!\), and \(\sigma \le s\) if and only if \(\sigma _i\le s_i\) for \(1\le i\le d\).

  3. (iii)

    For \(\psi \) as in part (i), and any \(0\le l\le \lfloor {\lambda ^\circ }\rfloor \), the superposition operator \(\Psi _l:{W^{k,\lambda }}\rightarrow {W^{k,\lambda ;l}}\), defined by \(\Psi _l(a)(x)=\psi (a(x))\), is of class \(C^l\). Its derivatives are as follows:

    $$\begin{aligned} \Psi _l^{(i)}(a)(u_1,\ldots ,u_i)(x) = \psi ^{(i)}(a(x))u_1(x)\cdots u_i(x),\quad \mathrm{for\ }1\le i\le l. \end{aligned}$$
    (10)
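
The multi-index sum in the Leibniz formula (9) can be enumerated directly. The following sketch lists the terms \((\sigma ,s-\sigma )\) and their coefficients \(s!/(\sigma !(s-\sigma )!)\) for a given s; the helper names are hypothetical.

```python
from itertools import product
from math import factorial

def multi_factorial(s):
    """s! := s_1! * ... * s_d! for a multi-index s."""
    out = 1
    for s_i in s:
        out *= factorial(s_i)
    return out

def leibniz_terms(s):
    """All triples (sigma, s - sigma, coefficient) with sigma <= s, as in (9)."""
    terms = []
    for sigma in product(*(range(s_i + 1) for s_i in s)):
        rest = tuple(s_i - g_i for s_i, g_i in zip(s, sigma))
        coeff = multi_factorial(s) // (multi_factorial(sigma) * multi_factorial(rest))
        terms.append((sigma, rest, coeff))
    return terms

terms = leibniz_terms((2, 1))
for sigma, rest, coeff in terms:
    print(coeff, "* D^%s a * D^%s b" % (sigma, rest))
```

For \(s=(2,1)\) there are six terms, and the coefficients sum to \(2^{|s|}=8\).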

The proof makes use of the following Lemma.

Lemma 1

Let \(a\in {W^{k,\lambda }}\), let \((a_n\ne a)\) and \((b_n)\) be sequences converging to a in the sense of \({W^{k,\lambda }}\), and let B be the unit ball of \({W^{k,\lambda }}\). For any continuous, bounded function \(f:\mathbb {R}\rightarrow \mathbb {R}\),

$$\begin{aligned}&\Vert a_n-a\Vert _{W^{k,\lambda }}^{-1}\Vert (f(b_n)-f(a))(a_n-a)\Vert _{L^{{\lambda ^\circ }}} \rightarrow 0 \end{aligned}$$
(11)
$$\begin{aligned}&\textrm{and}\qquad \qquad \qquad \sup _{u\in B}\Vert (f(a_n)-f(a))u\Vert _{L^{{\lambda ^\circ }}} \rightarrow 0.\qquad \qquad \qquad \qquad \qquad \end{aligned}$$
(12)

Proof

We use the generalised Hölder inequality,

$$\begin{aligned} \Vert (f(b_n)-f(a))(a_n-a)\Vert _{L^{{\lambda ^\circ }}} \le \Vert f(b_n)-f(a)\Vert _A\Vert |a_n-a|^\lambda \Vert _E^{1/\lambda }, \end{aligned}$$
(13)

where A and E are the following Banach spaces. If (E) does not hold or \(k=0\), then \(A=L^{\lambda {\lambda ^\circ }/\epsilon }\) and \(E=L^1\); (13) is then the classical Hölder inequality on dual Lebesgue spaces. If (E) holds and \(k\ge 1\) then \(A=\exp L^{1/\beta }(\mu )\) and \(E=L^1\log ^\beta L(\mu )\), where \(\beta =(t-1)/t\); these are Orlicz spaces based on the complementary Young functions:

$$\begin{aligned} G_\beta (z) = \int _0^z\left( \exp (y^{1/\beta })-1\right) \,dy \quad \textrm{and}\quad F_\beta (z) = \int _0^z\log ^\beta (y+1)\,dy. \end{aligned}$$
(14)

It follows from a log-Sobolev embedding theorem (see, for example, Theorem 7.12 in [6]), and the following representation for first-order weak derivatives

$$\begin{aligned} \partial _i|a_n-a|^\lambda = \lambda |a_n-a|^{\lambda -1}\textrm{sgn}(a_n-a)\partial _i(a_n-a)\in L^1, \quad 1\le i\le d, \end{aligned}$$

that \(\Vert |a_n-a|^\lambda \Vert _E\le K\Vert a_n-a\Vert _{W^{k,\lambda }}^\lambda \), for some \(K<\infty \). (See the proof of Lemma 4 in [17] for fuller details.) Now \(f(b_n)-f(a)\) is bounded and converges to zero in probability, and so it converges to zero in the sense of A (with either definition of A), and (11) follows. A similar argument establishes (12). \(\square \)
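
For instance, in the first case (\(A=L^{\lambda {\lambda ^\circ }/\epsilon }\), \(E=L^1\), \({\lambda ^\circ }=\lambda -\epsilon \)) the exponents in (13) can be checked directly:

$$\begin{aligned} \frac{\epsilon }{\lambda {\lambda ^\circ }} + \frac{1}{\lambda } = \frac{\epsilon +{\lambda ^\circ }}{\lambda {\lambda ^\circ }} = \frac{1}{{\lambda ^\circ }}, \qquad \Vert |a_n-a|^\lambda \Vert _{L^1}^{1/\lambda } = \Vert a_n-a\Vert _{L^\lambda } \le \Vert a_n-a\Vert _{W^{k,\lambda }}. \end{aligned}$$

The left-hand side of (11) is therefore bounded by \(\Vert f(b_n)-f(a)\Vert _A\) in this case.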

Proof of Theorem 2

A proof of part (i) is given in Proposition 2 of [17]. It involves a sequence \(f_n\in {C^\infty }(\mathbb {R}^d;\mathbb {R})\) converging to a in the sense of \({W^{k,\lambda }}\). \(D^s\Psi (f_n)\) is defined in the classical sense, and is equal to \(F_s(f_n)\). It is then shown that \(F_s(f_n)\) converges to \(F_s(a)\) in the sense of \(L^{\lambda /|s|}\). \(F_s(a)\) is then shown to be the weak derivative of \(\Psi (a)\) by the standard procedure of integrating \(F_s(f_n)\varphi \) by parts for a \(\varphi \in {C_0^\infty }(\mathbb {R}^d;\mathbb {R})\). Finally, continuity is established by repeating the argument for a sequence \({W^{k,\lambda }}\ni a_n\rightarrow a\).
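
To illustrate (7) and the role of the mixed-norm exponents in the simplest nontrivial case, take \(d=1\), \(s=2\) and assume \(k\ge 2\). The partitions of s are \(\{2\}\) and \(\{1,1\}\), so that

$$\begin{aligned} F_2(a) = \psi ^{(1)}(a)\,D^2a + \psi ^{(2)}(a)\,(D^1a)^2. \end{aligned}$$

Since \(\psi ^{(1)}\) and \(\psi ^{(2)}\) are bounded, \(D^2a\in L^{\lambda /2}\), and \(D^1a\in L^\lambda \) implies \((D^1a)^2\in L^{\lambda /2}\), both terms belong to \(L^{\lambda /2}\), which is the exponent required of second-order derivatives in \({W^{k,\lambda }}\).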

The proof of part (ii) is similar. Let \(f_n, g_n\in {C^\infty }(\mathbb {R}^d;\mathbb {R})\) be sequences converging to a (respectively b) in the sense of \({W^{k,\lambda ;l}}\) (respectively \(W^{k,\lambda ;{\tilde{l}}}\)). Clearly \(f_ng_n\rightarrow ab\) in the sense of \(L^{{\lambda ^\circ }/(l+{\tilde{l}})}\). Furthermore, for any \(s\in S_{k_{l+{\tilde{l}}}}\),

$$\begin{aligned} H_s(f_n,g_n) - H_s(a,g_n) = \sum _{\sigma \le s}\frac{s!}{\sigma !(s-\sigma )!}(D^\sigma f_n - D^\sigma a) D^{s-\sigma }g_n, \end{aligned}$$

and so it follows from Hölder’s inequality that

$$\begin{aligned} \Vert H_s(f_n,g_n) - H_s(a,g_n)\Vert _{L^{{\lambda ^\circ }/(|s|+l+{\tilde{l}})}} \rightarrow 0. \end{aligned}$$

A similar argument can be applied to \(H_s(a,g_n) - H_s(a,b)\), and so \(D^s(f_ng_n)\rightarrow H_s(a,b)\) in \(L^{{\lambda ^\circ }/(|s|+l+{\tilde{l}})}\). Once again, an argument involving integration by parts establishes that \(H_s(a,b)=D^s\Pi (a,b)\), and an argument involving sequences \({W^{k,\lambda ;l}}\ni a_n\rightarrow a\), and \(W^{k,\lambda ;{\tilde{l}}}\ni b_n\rightarrow b\) establishes continuity. Similar arguments can be used with the other cases.

The case \(l=0\) of part (iii) is established in part (i). Let \((a_n\ne a)\) be a sequence converging to a in the sense of \({W^{k,\lambda }}\), and let

$$\begin{aligned} \Delta _n := \psi (a_n) - \psi (a) - \psi ^{(1)}(a)(a_n-a). \end{aligned}$$
(15)

Then, according to the mean-value theorem, \(\Delta _n=\delta _n(a_n-a)\), where

$$\begin{aligned} \delta _n = \psi ^{(1)}((1-\beta _n)a+\beta _n a_n) - \psi ^{(1)}(a) \quad \mathrm{for\ some\ }0\le \beta _n(x)\le 1. \end{aligned}$$

Lemma 1 shows that \(\Vert a_n-a\Vert _{W^{k,\lambda }}^{-1}\Vert \Delta _n\Vert _{L^{{\lambda ^\circ }}}\rightarrow 0\), and so \(\Psi _1:{W^{k,\lambda }}\rightarrow L^{{\lambda ^\circ }}\) is differentiable, with derivative as in (10) with \(i=1\). That this derivative is continuous follows from (12).

For \(0\le i\le l-1\), let \(F_{s,i}(a)\) be as in (7), but with \(\psi \) replaced by \(\psi ^{(i)}\); then

$$\begin{aligned}{} & {} F_s(a_n) - F_s(a) - F_{s,1}(a)(a_n-a) - \sum _{\pi \in \Pi (s)}\psi ^{(|\pi |)}(a)\sum _j\Gamma _{\pi ,j}D^{\sigma _j}(a_n-a) \\{} & {} \qquad = \sum _{\pi \in \Pi (s)}\bigg (\Gamma _{\pi ,0}\delta _\pi (a_n-a) + \sum _j(\Gamma _{\pi ,j}\zeta _\pi + \Theta _{\pi ,j})D^{\sigma _j}(a_n-a)\bigg ), \end{aligned}$$

where \(\{\sigma _1,\ldots ,\sigma _j,\ldots \sigma _{|\pi |}\}\) is an enumeration of \(\pi \),

$$\begin{aligned} \Gamma _{\pi ,0}= & {} \prod _{\sigma \in \pi }D^\sigma a, \qquad \Gamma _{\pi ,j} = \prod _{m\ne j}D^{\sigma _m} a, \qquad \\ \delta _\pi= & {} \psi ^{(|\pi |+1)}(\beta _na_n+(1-\beta _n) a)-\psi ^{(|\pi |+1)}(a), \\ \zeta _\pi= & {} \psi ^{(|\pi |)}(a_n) - \psi ^{(|\pi |)}(a), \\ \Theta _{\pi ,j}= & {} \psi ^{(|\pi |)}(a_n) \bigg (\prod _{m<j}D^{\sigma _m}a_n - \prod _{m<j}D^{\sigma _m}a\bigg ) \prod _{m>j}D^{\sigma _m}a. \end{aligned}$$

Hölder’s inequality now shows that

$$\begin{aligned} R_1:= & {} \Vert \Gamma _{\pi ,0}\delta _\pi (a_n-a)\Vert _{L^{\gamma _0}} \le \Vert \Gamma _{\pi ,0}\Vert _{L^{{\lambda ^\circ }/|s|}} \Vert \delta _\pi (a_n-a)\Vert _{L^{{\lambda ^\circ }}} \\ R_2:= & {} \Vert \Gamma _{\pi ,j}\zeta _\pi D^{\sigma _j}(a_n-a)\Vert _{L^{\gamma _0}} \le \Vert \Gamma _{\pi ,j}\zeta _\pi \Vert _{L^{\gamma _j}} \Vert D^{\sigma _j}(a_n-a)\Vert _{L^{\lambda /|\sigma _j|}} \\ R_3:= & {} \Vert \Theta _{\pi ,j}D^{\sigma _j}(a_n-a)\Vert _{L^{\gamma _0}} \le \Vert \Theta _{\pi ,j}\Vert _{L^{\gamma _j}} \Vert D^{\sigma _j}(a_n-a)\Vert _{L^{\lambda /|\sigma _j|}} \end{aligned}$$

where \(\gamma _0={\lambda ^\circ }/(|s|+1)\) and \(\gamma _j={\lambda ^\circ }/(|s|-|\sigma _j|+1)\). We now claim that

$$\begin{aligned} \Vert a_n-a\Vert _{W^{k,\lambda }}^{-1} R_i \rightarrow 0\quad \mathrm{for\ }i=1,2,3. \end{aligned}$$

That this is true of \(R_1\) follows from Lemma 1. Regarding \(R_2\), \(\zeta _\pi \) is bounded and converges to zero in probability and so, according to the dominated convergence theorem, \(\Gamma _{\pi ,j}\zeta _\pi \rightarrow 0\) in the sense of \(L^{\gamma _j}\). Finally, the bracketed term in \(\Theta _{\pi ,j}\) can be expanded as a telescopic sum of products each containing one of the differences \(D^{\sigma _m}(a_n-a)\). Hölder’s inequality then shows that \(\Theta _{\pi ,j}\rightarrow 0\) in the sense of \(L^{\gamma _j}\).

We have thus shown that \(D^s\Psi _1:{W^{k,\lambda }}\rightarrow L^{{\lambda ^\circ }/(|s|+1)}\) is differentiable, and

$$\begin{aligned} (D^s\Psi _1)^{(1)}u = F_{s,1}(a)u + \sum _{\pi \in \Pi (s)}\psi ^{(|\pi |)}(a)\sum _j \Gamma _{\pi ,j} D^{\sigma _j}u. \end{aligned}$$

We can now apply the Leibniz and Faà di Bruno formulae to \(D^s(\psi ^{(1)}(a)u)\) to show that it is equal to \((D^s\Psi _1)^{(1)}u\), and is continuous in (a, u). This proves (10) for the case \(l=1\).

We now proceed by induction on l. Suppose that (10) is correct for l; then, since \({W^{k,\lambda ;l}}\prec W^{k,\lambda ;l+1}\), \(\Psi _{l+1}\) is of class \(C^l\), with derivatives as in (10). Setting \(\Delta _{l,n}=\psi ^{(l)}(a_n)-\psi ^{(l)}(a)-\psi ^{(l+1)}(a)(a_n-a)\), we can apply the arguments used above on \(\Delta _n\) of (15), and the fact that

$$\begin{aligned} \sup _{u_i\in B}\Vert \Pi (\Delta _{l,n},u_1\cdots u_l)\Vert _{W^{k,\lambda ;l+1}} \le K \Vert \Delta _{l,n}\Vert _{W^{k,\lambda ;1}},\quad \mathrm{for\ some\ }K<\infty , \end{aligned}$$

where B is the unit ball of \({W^{k,\lambda }}\), to show that \(\Psi _{l+1}^{(l)}\) is of class \(C^1\). \(\square \)

3 The manifolds of finite measures

The charts of the statistical manifolds developed here are based on a two-parameter family of \(\eta \)-deformed logarithms. These are defined in terms of Amari’s \(\alpha \)-logarithms, \(\Lambda _\alpha :(0,\infty )\rightarrow \mathbb {R}\):

$$\begin{aligned} \Lambda _\alpha (y) = \left\{ \begin{array}{ll} \frac{2}{1-\alpha }\left( y^{(1-\alpha )/2}-1\right) &{} \mathrm{if\ }\alpha \ne 1 \\ \log y &{} \mathrm{if\ }\alpha =1 \end{array}\right. \end{aligned}$$
(16)

The \(\eta \)-logarithm is defined for \(\eta =(\eta _-,\eta _+)\), (\(-\infty<\eta _-< 1\le \eta _+<\infty \)) as:

$$\begin{aligned} \log _\eta (y) = \Lambda _{\eta _-}(y) + \Lambda _{\eta _+}(y) \end{aligned}$$
(17)

The deformed logarithm \(\log _{(-1,+1)}\) is that used to construct a family of highly inclusive statistical manifolds in [14, 15, 17]. Setting \(\kappa =(\eta _+-\eta _-)/4\) and \(r=(2-\eta _--\eta _+)/4\), \(\log _\eta \) is essentially the two-parameter \((\kappa ,r)\)-logarithm defined in [8]. The different weightings for the two components of the deformed logarithm used here have no effect on the membership or properties of the manifolds, but are more convenient in the context of information geometry.
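
Definitions (16) and (17) translate directly into code. The following sketch (function names are illustrative) implements both and checks numerically that \(\log _{(-1,+1)}y=y-1+\log y\).

```python
import math

def amari_log(alpha, y):
    """Amari alpha-logarithm of (16), for y > 0."""
    if alpha == 1:
        return math.log(y)
    p = (1.0 - alpha) / 2.0
    return (y ** p - 1.0) / p  # equals (2/(1-alpha)) * (y**((1-alpha)/2) - 1)

def log_eta(eta_minus, eta_plus, y):
    """The eta-logarithm of (17); requires eta_minus < 1 <= eta_plus."""
    if not (eta_minus < 1 <= eta_plus):
        raise ValueError("need eta_minus < 1 <= eta_plus")
    return amari_log(eta_minus, y) + amari_log(eta_plus, y)

for y in (0.1, 1.0, 7.5):
    print(y, log_eta(-1, 1, y), y - 1.0 + math.log(y))
```

Each printed row should show the same value in its last two columns.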

Now \(\inf _y\log _\eta y=-\infty \), \(\sup _y\log _\eta y=+\infty \), and \(\log _\eta \in C^\infty ((0,\infty );\mathbb {R})\) with strictly positive first derivative \(y^{-(1+\eta _-)/2}+y^{-(1+\eta _+)/2}\) and so, according to the inverse function theorem, \(\log _\eta \) is a diffeomorphism from \((0,\infty )\) onto \(\mathbb {R}\). Let \({\textrm{exp}}_\eta \) be its inverse. This can be thought of as a deformed exponential function. Using \(f^{(n)}\) to denote the n-th derivative of a function f, we have

$$\begin{aligned} {\textrm{exp}}_\eta ^{(1)} = \frac{1}{1+{\textrm{exp}}_\eta ^{\delta /2}}{\textrm{exp}}_\eta ^{[1+\eta _+]/2} = \frac{{\textrm{exp}}_\eta ^{\delta /2}}{1+{\textrm{exp}}_\eta ^{\delta /2}}{\textrm{exp}}_\eta ^{[1+\eta _-]/2}, \end{aligned}$$
(18)

where \(\delta :=\eta _+-\eta _-\). So \({\textrm{exp}}_\eta \) satisfies the differential inequality

$$\begin{aligned} {\textrm{exp}}_\eta ^{(1)} < {\textrm{exp}}_\eta ^{[1+\eta _-]/2} \end{aligned}$$
(19)

and, since \({\textrm{exp}}_\eta (0)=1\), there exists a \(K_\eta <\infty \) such that

$$\begin{aligned} {\textrm{exp}}_\eta (z)\le K_\eta (1+z^{2/(1-\eta _-)}) \quad \mathrm{for\ all\ }z\ge 0. \end{aligned}$$
(20)

If \(\eta _-=-1\) (as is the case in [14, 15, 17]) then the exponent is 1, and \({\textrm{exp}}_\eta \) has linear growth; otherwise it has sublinear or power law growth. The exponent itself grows without limit as \(\eta _-\) approaches 1 from below.
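
Since \({\textrm{exp}}_\eta \) is defined only implicitly, as the inverse of \(\log _\eta \), a numerical sketch may help. The code below inverts \(\log _\eta \) by bisection and evaluates the first expression in (18); the bisection scheme, iteration count and function names are illustrative assumptions, and the routine is intended for moderate values of z only.

```python
import math

def amari_log(alpha, y):
    """Amari alpha-logarithm of (16), for y > 0."""
    if alpha == 1:
        return math.log(y)
    p = (1.0 - alpha) / 2.0
    return (y ** p - 1.0) / p

def log_eta(eta_minus, eta_plus, y):
    """The eta-logarithm of (17)."""
    return amari_log(eta_minus, y) + amari_log(eta_plus, y)

def exp_eta(eta_minus, eta_plus, z, iterations=200):
    """Inverse of log_eta, found by bisection on t = log(y)."""
    lo, hi = -1.0, 1.0
    while log_eta(eta_minus, eta_plus, math.exp(lo)) > z:
        lo *= 2.0  # widen the bracket downwards
    while log_eta(eta_minus, eta_plus, math.exp(hi)) < z:
        hi *= 2.0  # widen the bracket upwards
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_eta(eta_minus, eta_plus, math.exp(mid)) < z:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

def exp_eta_prime(eta_minus, eta_plus, z):
    """First derivative of exp_eta, from the first expression in (18)."""
    y = exp_eta(eta_minus, eta_plus, z)
    delta = eta_plus - eta_minus
    return y ** ((1.0 + eta_plus) / 2.0) / (1.0 + y ** (delta / 2.0))

# Compare (18) with a central finite difference, for eta = (0, 2).
z, h = 0.7, 1e-5
fd = (exp_eta(0, 2, z + h) - exp_eta(0, 2, z - h)) / (2.0 * h)
print(fd, exp_eta_prime(0, 2, z))
```

The two printed values should agree to several decimal places, and the differential inequality (19) can be checked in the same way at any chosen z.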

For any \(\alpha \in \mathbb {R}\), the Amari embedding map \(\xi _\alpha :\mathbb {R}\rightarrow \mathbb {R}\) is as follows:

$$\begin{aligned} \xi _\alpha (z) = \Lambda _\alpha \circ {\textrm{exp}}_\eta (z). \end{aligned}$$
(21)

These maps can be used in the analysis of a large class of divergences, and their associated tensors. The maps \(\xi _{-1}\) and \(\xi _{+1}\) are especially important since they will represent the density of a measure and its log, respectively. The following lemma establishes some of their properties in the context of the \(\eta \)-log.
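
For orientation, the following special cases follow directly from (16), (17) and (21):

$$\begin{aligned} \xi _{-1}(z) = {\textrm{exp}}_\eta (z) - 1, \qquad \xi _{+1}(z) = \log {\textrm{exp}}_\eta (z), \qquad \xi _{\eta _-}(z) + \xi _{\eta _+}(z) = \log _\eta ({\textrm{exp}}_\eta (z)) = z. \end{aligned}$$

The first two make explicit the sense in which \(\xi _{-1}\) and \(\xi _{+1}\) represent a density (up to the additive constant 1) and its log; the third shows that the embedding maps at the two chosen parameter values sum to the identity.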

Lemma 2

  1. (i)

    For any \(\alpha \in \mathbb {R}\), \(\xi _\alpha \in C^\infty (\mathbb {R};\mathbb {R})\); its derivatives are

    $$\begin{aligned} \xi _\alpha ^{(i)} = \frac{f_{\alpha ,i}({\textrm{exp}}_\eta )}{\big (1+{\textrm{exp}}_\eta ^{\delta /2}\big )^{(2i-1)}} \quad \mathrm{for\ }1\le i<\infty , \end{aligned}$$
    (22)

    where \(f_{\alpha ,1}(y)=y^{(\eta _+-\alpha )/2}\) and

    $$\begin{aligned} f_{\alpha ,i+1}(y)= & {} (y^{\delta /2}+y^\delta )y^{(\eta _-+1)/2}f_{\alpha ,i}^{(1)}(y) - (i-1/2)\delta y^\delta y^{(\eta _--1)/2}f_{\alpha ,i}(y) \nonumber \\= & {} (1+y^{\delta /2})y^{(\eta _++1)/2}f_{\alpha ,i}^{(1)}(y) - (i-1/2)\delta y^{(2\eta _+-\eta _--1)/2}f_{\alpha ,i}(y). \nonumber \\ \end{aligned}$$
    (23)
  2. (ii)

    For any \(1\le i<\infty \), and any \(\alpha \in \mathbb {R}\),

    $$\begin{aligned} \limsup _{z\rightarrow \infty }z^{-\beta _i}|\xi _\alpha ^{(i)}(z)| < \infty , \quad \mathrm{where\ }\beta _i := \frac{1-\alpha }{1-\eta _-} - i. \end{aligned}$$
    (24)
  3. (iii)

    For any \(1\le i<\infty \), and any \(\alpha \le \eta _+\),

    $$\begin{aligned} \limsup _{z\rightarrow -\infty }|\xi _\alpha ^{(i)}(z)| < \infty . \end{aligned}$$
    (25)
  4. (iv)

    If \(\alpha \in [\eta _-,\eta _+]\), then \(\xi _\alpha ^{(i)}\) is bounded for all \(1\le i<\infty \).

Proof

Part (i) is straightforward.

The power of y in \(f_{\alpha ,1}(y)\) is \((\eta _+-\alpha )/2=\delta /2+(\eta _--\alpha )/2\), and so \(\xi _\alpha ^{(1)}(z)\) grows as \({\textrm{exp}}_\eta (z)^{(\eta _--\alpha )/2}\) for large z. It now follows from (20) that (24) is correct for \(i=1\). That it is also correct for \(i\ge 2\) follows from an induction argument based on the first representation of \(f_{\alpha ,i+1}\) in (23).

If \(\alpha \le \eta _+\) then the power of y in \(f_{\alpha ,1}(y)\) is greater than or equal to 0, and so (25) is correct for \(i=1\). That it is also correct for \(i\ge 2\) follows from an induction argument based on the second representation of \(f_{\alpha ,i+1}\) in (23). Part (iv) is an immediate consequence of parts (ii) and (iii). \(\square \)
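
The case \(i=1\) of part (iv) can be explored numerically without inverting \(\log _\eta \), by regarding \(\xi _\alpha ^{(1)}\) as a function of \(y={\textrm{exp}}_\eta (z)\in (0,\infty )\) via (22). The sketch below uses the illustrative choice \(\eta =(-1,2)\) and a logarithmic grid of y values; both are assumptions made for the example.

```python
def xi_prime_of_y(alpha, eta_minus, eta_plus, y):
    """xi_alpha^{(1)} expressed through y = exp_eta(z), from (22) with i = 1."""
    delta = eta_plus - eta_minus
    return y ** ((eta_plus - alpha) / 2.0) / (1.0 + y ** (delta / 2.0))

ETA_MINUS, ETA_PLUS = -1.0, 2.0
grid = [10.0 ** e for e in range(-12, 13)]  # y from 1e-12 to 1e12

def sup_on_grid(alpha):
    return max(xi_prime_of_y(alpha, ETA_MINUS, ETA_PLUS, y) for y in grid)

for alpha in (-2.0, -1.0, 0.5, 2.0, 3.0):
    print(alpha, sup_on_grid(alpha))
```

For \(\alpha \in [\eta _-,\eta _+]\) the power of y in the numerator lies between 0 and \(\delta /2\), so the quotient never exceeds 1; for \(\alpha \) outside this interval the values on the grid become large at one end or the other, consistent with parts (ii) and (iii).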

Let \(\theta _0:=(1-\eta _-)\lambda /2\). We assume that \(\theta _0>1\); it then follows from (20) and (24) that, for any \(0\le i< \lambda /\theta _0\) and any \(a\in {L^\lambda }\),

$$\begin{aligned} {\textrm{exp}}_\eta ^{(i)}(a) \in L^{\theta _0\lambda /(\lambda -i\theta _0)}. \end{aligned}$$
(26)

(If \(i\ge \lambda /\theta _0\) then \({\textrm{exp}}_\eta ^{(i)}\) is bounded.) We can now construct the manifold M \((=M_\eta ^{k,\lambda })\). This is the set of finite measures on \(\mathbb {R}^d\) satisfying the following:

  1. (M1)

    P is mutually absolutely continuous with respect to Lebesgue measure;

  2. (M2)

    \(\log _\eta p\in G\) (\(=G^{k,\lambda }:={W^{k,\lambda }}\)).

Here, p denotes the density of P with respect to the reference probability measure \(\mu \). Its density with respect to Lebesgue measure is pr, where r is as in (1). The chart \(\phi :M\rightarrow G\) is defined by:

$$\begin{aligned} \phi (P) = \log _\eta p. \end{aligned}$$
(27)

Proposition 1

\(\phi \) is a bijection onto G. Its inverse is

$$\begin{aligned} \phi ^{-1}(a) = P(dx) = {\textrm{exp}}_\eta (a(x))\mu (dx). \end{aligned}$$
(28)

Proof

It follows from (M2) that, for any \(P\in M\), \(\phi (P)\in G\). Suppose, conversely, that \(a\in G\); since \({\textrm{exp}}_\eta (a)\in L^1\), we can define the finite measure \(P(dx)={\textrm{exp}}_\eta (a(x))\mu (dx)\). Since \({\textrm{exp}}_\eta \) is strictly positive, P satisfies (M1). That it also satisfies (M2) follows from the fact that \(\log _\eta {\textrm{exp}}_\eta (a)=a\in G\). We have thus shown that \(P\in M\), and clearly \(\phi (P)=a\). \(\square \)
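
The round trip between a chart value a and the corresponding density can be sketched numerically. The example below takes \(d=1\), \(\eta =(-1,+1)\), the Gaussian-type reference density \(r(x)=\exp (-x^2)/\sqrt{\pi }\) and the chart value \(a(x)=x\); these choices, the grid, and the bisection-based inverse are illustrative assumptions rather than part of the construction.

```python
import math

def amari_log(alpha, y):
    """Amari alpha-logarithm of (16), for y > 0."""
    if alpha == 1:
        return math.log(y)
    p = (1.0 - alpha) / 2.0
    return (y ** p - 1.0) / p

def log_eta(eta, y):
    """The eta-logarithm of (17), with eta = (eta_minus, eta_plus)."""
    return amari_log(eta[0], y) + amari_log(eta[1], y)

def exp_eta(eta, z, iterations=200):
    """Inverse of log_eta by bisection on t = log(y) (moderate z only)."""
    lo, hi = -1.0, 1.0
    while log_eta(eta, math.exp(lo)) > z:
        lo *= 2.0
    while log_eta(eta, math.exp(hi)) < z:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_eta(eta, math.exp(mid)) < z:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

ETA = (-1.0, 1.0)
h = 0.05
xs = [-6.0 + h * j for j in range(241)]           # grid on [-6, 6]
a = list(xs)                                      # chart value a(x) = x
p = [exp_eta(ETA, a_x) for a_x in a]              # density w.r.t. mu, as in (28)
recovered = [log_eta(ETA, p_x) for p_x in p]      # chart (27) applied to P
roundtrip_error = max(abs(u - v) for u, v in zip(recovered, a))

# Total mass P(R) = integral of p dmu, by the trapezoidal rule on the grid.
weights = [0.5 if j in (0, len(xs) - 1) else 1.0 for j in range(len(xs))]
mass = h * sum(w * p_x * math.exp(-x * x) / math.sqrt(math.pi)
               for w, p_x, x in zip(weights, p, xs))
print(roundtrip_error, mass)
```

The resulting measure is finite but is not a probability measure in general; the restriction to probability measures is the subject of Sect. 4.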

Remark 1

This proposition shows that M is, in one sense, nothing more than a (whole) Banach space. Manifold theory enters the picture with the introduction of base-point dependent tensors such as the Fisher-Rao metric and Amari-Chentsov tensor on the tangent bundle.

The tangent space at basepoint P, \(T_PM\), is the linear space of signed measures, U, that are absolutely continuous with respect to Lebesgue measure and take the form

$$\begin{aligned} U(dx) = {\textrm{exp}}_\eta ^{(1)}(a(x))u(x)\mu (dx),\quad \mathrm{for\ some\ }u\in G, \end{aligned}$$
(29)

where \(a=\phi (P)\). U is well defined because of (26). The representation in (29) is obtained from the tangent map of the chart; u is then the natural representation of the tangent vector, U, in the model space. The tangent bundle is the disjoint union \(TM:=\cup _{P\in M}(P,T_PM)\), and is globally trivialised by the chart \(\Phi :TM\rightarrow G\times G\), where

$$\begin{aligned} \Phi (P,U) = (a,u), \quad {\mathrm{and \ }} a \ {\mathrm{and \ }} u{\mathrm{\ are\ as\ in \ }} (29). \end{aligned}$$
(30)

We now investigate some of the smoothness properties of Amari’s embedding maps (21) in the context of the manifold M. These, in turn, can be used to analyse the smoothness properties of divergences and tensors.

Proposition 2

  1. (i)

    For any \(\alpha \in [\eta _-,\eta _+]\), any \(0\le l\le \lfloor {\lambda ^\circ }\rfloor \) and any \(a\in G\), \(\xi _\alpha (a)\in {W^{k,\lambda ;l}}\). The superposition operator \(\Xi _{\alpha ,l}:G\rightarrow W^{k,\lambda ;l}\), defined by \(\Xi _{\alpha ,l}(a)(x)=\xi _\alpha (a(x))\), is of class \(C^l\), with derivatives:

    $$\begin{aligned} \Xi _{\alpha ,l}^{(i)}(a)(u_1,\ldots ,u_i)(x) = \xi _\alpha ^{(i)}(a(x))u_1(x)\cdots u_i(x). \end{aligned}$$
    (31)
  2. (ii)

    For any \(1\le \theta <\theta _0\) (as defined before (26)), the superposition operator \(\textrm{Exp}_{\eta ,\theta }:G\rightarrow L^\theta \), defined by \(\textrm{Exp}_{\eta ,\theta }(a)(x)={\textrm{exp}}_\eta (a(x))\) is of class \(C^{\lceil \lambda \rceil -1}\), with derivatives:

    $$\begin{aligned} \textrm{Exp}_{\eta ,\theta }^{(i)}(a)(u_1,\ldots ,u_i)(x) = {\textrm{exp}}_\eta ^{(i)}(a(x))u_1(x)\cdots u_i(x). \end{aligned}$$
    (32)

Proof

Part (i) is a special case of Theorem 2(iii). Part (ii) can be proved in a similar way; the essential differences are that the derivatives of \({\textrm{exp}}_\eta \) are not necessarily bounded, and the range space of the superposition operator has a weaker topology. Let \((a_n\in G\setminus \{a\})\) be a sequence converging to a in the sense of G. For any \(1\le i\le \lceil \lambda \rceil -1\) let

$$\begin{aligned} \Delta _n:= & {} {\textrm{exp}}_\eta ^{(i-1)}(a_n) - {\textrm{exp}}_\eta ^{(i-1)}(a) - {\textrm{exp}}_\eta ^{(i)}(a)(a_n-a) \nonumber \\ \Gamma _n:= & {} {\textrm{exp}}_\eta ^{(i)}(a_n)-{\textrm{exp}}_\eta ^{(i)}(a). \end{aligned}$$
(33)

According to the mean-value theorem \(\Delta _n=\delta _n(a_n-a)\), where

$$\begin{aligned} \delta _n = {\textrm{exp}}_\eta ^{(i)}(\beta _na_n+(1-\beta _n)a) - {\textrm{exp}}_\eta ^{(i)}(a) \quad \mathrm{for\ some\ }0\le \beta _n(x)\le 1. \end{aligned}$$

It follows from (26) and Hölder’s inequality that, for any \(u_1,\ldots ,u_i\) in the unit ball of G,

$$\begin{aligned} \Vert \Delta _n u_1\cdots u_{i-1}\Vert _{L^\theta } \le \Vert \Delta _n\Vert _{L^\gamma } \quad \textrm{and}\quad \Vert \Gamma _n u_1\cdots u_i\Vert _{L^\theta } \le \Vert \Gamma _n u_i\Vert _{L^\gamma }, \end{aligned}$$

where \(\gamma :=\lambda \theta /(\lambda -i\theta )\). In order to prove part (ii), it thus suffices to show that

$$\begin{aligned} \Vert a_n-a\Vert _G^{-1}\Vert \Delta _n\Vert _{L^\gamma }\rightarrow 0 \quad \textrm{and}\quad \sup _{\Vert u\Vert _G=1}\Vert \Gamma _n u\Vert _{L^\gamma }\rightarrow 0. \end{aligned}$$
(34)

According to (26) and the de la Vallée-Poussin theorem, \({\textrm{exp}}_\eta ^{(i)}(a_n)^\gamma \) is uniformly integrable. Now \(\delta _n\) and \(\Gamma _n\) both converge to zero in measure, and so (34) follows from the Lebesgue-Vitali theorem. \(\square \)
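
The first limit in (34) can be illustrated numerically in the simplest case \(i=1\). The following minimal sketch evaluates the quotient \(\Vert \Delta _n\Vert _{L^\theta }/\Vert a_n-a\Vert \) along the sequence \(a_n=a+u/n\) on a finite sample space. Several ingredients are assumptions made for illustration and are not taken from this paper: the deformed exponential is replaced by the inverse of \(\log _d y=y-1+\log y\) from [14, 15] (the two-parameter \({\textrm{exp}}_\eta \) is not implemented), the \(L^\lambda \) norm stands in for the norm of G, and the values of \(\theta \), \(\lambda \), the weights and the functions a and u are arbitrary. All function names are hypothetical.

```python
import math

def exp_d(x):
    """Stand-in deformed exponential: inverse of log_d(y) = y - 1 + log(y).

    Assumption: this is the deformed exponential of [14, 15], used here
    only as a substitute for the two-parameter exp_eta of the paper.
    Adequate for moderate x (math.exp underflows for very negative x).
    """
    y = min(1.0, math.exp(x))          # start at or to the left of the root
    for _ in range(100):               # Newton's method on log_d(y) - x = 0
        step = (x - (y - 1.0 + math.log(y))) / (1.0 + 1.0 / y)
        y += step
        if abs(step) <= 1e-15 * y:
            break
    return y

def exp_d_prime(x):
    """Derivative of exp_d, from the inverse function rule: y / (1 + y)."""
    y = exp_d(x)
    return y / (1.0 + y)

def lp_norm(mu, f, p):
    """L^p(mu) norm of f on a finite sample space with weights mu."""
    return sum(w * abs(v) ** p for w, v in zip(mu, f)) ** (1.0 / p)

# Illustrative data (assumed): a probability vector mu and functions a, u.
mu = [0.1, 0.2, 0.3, 0.4]
a = [2.0, 0.0, 1.0, -1.5]
u = [0.3, 1.0, -0.5, 0.8]
theta, lam = 1.0, 3.0

ratios = []
for n in (1, 10, 100, 1000):
    a_n = [aj + uj / n for aj, uj in zip(a, u)]
    # Delta_n for i = 1: first-order Taylor remainder of exp_d at a
    delta = [exp_d(an) - exp_d(aj) - exp_d_prime(aj) * (an - aj)
             for an, aj in zip(a_n, a)]
    diff = [an - aj for an, aj in zip(a_n, a)]
    ratios.append(lp_norm(mu, delta, theta) / lp_norm(mu, diff, lam))

print(ratios)  # the quotient decreases as n grows
```

On a finite sample space all norms are equivalent, so this sketch only illustrates the shape of the quotient; the substance of the proof above lies in the infinite-dimensional setting, where uniform integrability is needed.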

Remark 2

  1. (i)

    There is a vast choice of range spaces for superposition operators of this type, each of which results in operators with different properties. Proposition 2 is not intended to be exhaustive, but to cover some of the more interesting and useful cases.

  2. (ii)

    The case \(l=0\) is worth special mention since the domain and range spaces of the superposition operators are then both G. If, for example, \(\eta _-\le -1\) then the density p (\(=\xi _{-1}(a)+1\)) belongs to the model space and varies continuously on the manifold, as does the log of the density.

The superposition operators \(\Xi _{\alpha ,l}\) can be used in the analysis of divergences and entropies. This analysis was carried out for the \(\alpha \)-divergences, \(\alpha \in [-1,1]\), in [15], where it was shown that they are of class \(C^l(M\times M)\), for values of l dependent on \(\lambda \). Although we do not pursue these issues here, it is clear that similar methods can be used with the manifolds of this paper for any \(\alpha \in [\eta _-,\eta _+]\). We would also expect the \((\kappa ,r)\) divergences of [8] to exhibit an equivalent degree of smoothness. Divergences can be used to define various tensor fields on M, which depend naturally on the superposition operators, \(\Xi _{\alpha ,l}\). In particular, the Fisher-Rao metric on M can be expressed in terms of the maps \((\xi _\alpha ,\alpha \in [\eta _-,\eta _+])\) in two different ways, according to the value of \(\eta _-\):

$$\begin{aligned} \langle U, V\rangle _P = \left\{ \begin{array}{ll} {{\textrm{E}}_\mu }\xi _0^{(1)}(a)\,\xi _0^{(1)}(a)uv &{} \quad \mathrm{if\ }\eta _-\le 0 \\ {{\textrm{E}}_\mu }{\textrm{exp}}_\eta (a)^{\eta _-}\xi _{\eta _-}^{(1)}(a)\,\xi _{\eta _-}^{(1)}(a)uv &{} \quad \mathrm{if\ } \eta _->0, \end{array} \right. \end{aligned}$$
(35)

where \((a,u)=\Phi (P,U)\) and \((a,v)=\Phi (P,V)\).

Corollary 1

The Fisher-Rao metric is of class \(C^l\) on M, where

$$\begin{aligned} l = \left\{ \begin{array}{ll} \lfloor {\lambda ^\circ }\rfloor -2 &{} \quad \mathrm{if\ }\eta _-\le 0 \\ \lfloor {\lambda ^\circ }-2\eta _-/(1-\eta _-)\rfloor - 2 &{} \quad \mathrm{if\ }\eta _->0. \end{array}\right. \end{aligned}$$
(36)

Proof

The case \(\eta _-\le 0\) follows from a repeated application of Lemma 1, starting with \(\psi =(\xi _0^{(1)})^2\). A similar technique can be applied if \(\eta _->0\); the essential difference is that, at each stage, we must use Hölder’s inequality and Proposition 2(ii) with \(\theta =\eta _-\) to remove the term in \({\textrm{exp}}_\eta \). \(\square \)

The Fisher-Rao metric is positive definite and dominated by the chart-induced norm on \(T_PM\). However, the norms are not equivalent, and so the metric is a weak Riemannian metric. The Fisher-Rao metric and higher-order tensor fields, such as the Amari-Chentsov tensor, become smoother with increasing values of \(\lambda \). As Corollary 1 shows, log-Sobolev embedding plays a role in these results for certain integer values of \(\lambda \).

Of course, the use of Sobolev model spaces enables the analysis of quantities that depend on the weak derivatives of probability densities, such as the Hyvärinen divergence, and this is one of the motivations for extending the results of [14, 15]. The manifolds may also be useful in the theory of partial differential equations, as discussed in the final section of [17]. These aspects will be pursued elsewhere.

4 The manifolds of probability measures

Let \(M_0\subset M\) be the subset of the manifold of Sect. 3, whose members are probability measures, and let \({L_0^\lambda }\) (respectively \(G_0\)) be the co-dimension 1 subspaces of \({L^\lambda }\) (respectively G) whose members have zero \(\mu \)-mean. Let \(\phi _0:M_0\rightarrow G_0\) be defined by

$$\begin{aligned} \phi _0(P) = \phi (P) - {{\textrm{E}}_\mu }\phi (P) = \log _\eta p - {{\textrm{E}}_\mu }\log _\eta p. \end{aligned}$$
(37)

Proposition 3

  1. (i)

    \(\phi _0\) is a bijection onto \(G_0\).

  2. (ii)

    \((M_0,G_0)\) is a \(C^{\lceil \lambda \rceil -1}\)-embedded submanifold of \((M,G)\). The inclusion map \(\rho :G_0\rightarrow G\) takes the form \(\rho (a)=a+Z(a)\), where \(Z:G_0\rightarrow \mathbb {R}\) is an (implicitly defined) additive normalisation constant.

  3. (iii)

    \(\rho \) and all its derivatives are bounded on bounded sets. The first (and if \(\lambda >2\), second) derivatives of \(\rho \) are as follows:

    $$\begin{aligned} \rho _a^{(1)}u= & {} u-{\textrm{E}}_{P_a}u \nonumber \\ \rho _a^{(2)}(u,v)= & {} -\frac{{{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(2)}(\rho (a))(u-{\textrm{E}}_{P_a}u)(v-{\textrm{E}}_{P_a}v)}{{{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))}, \end{aligned}$$
    (38)

    where \(P_a(dx):={\textrm{exp}}_\eta ^{(1)}(\rho (a)(x))\mu (dx)/{{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))\) is the escort probability [13].

Proof

Let \(\Upsilon :G_0\times \mathbb {R}\rightarrow (0,\infty )\) be defined by

$$\begin{aligned} \Upsilon (a,z) = {{\textrm{E}}_\mu }{\textrm{exp}}_\eta (a+z) = {{\textrm{E}}_\mu }\textrm{Exp}_\eta (a+z), \end{aligned}$$
(39)

where \(\textrm{Exp}_\eta \) is as defined in Proposition 2(ii) with \(\theta =1\). It follows from Proposition 2(ii) that \(\Upsilon \) is of class \(C^{\lceil \lambda \rceil -1}\) and that, for any \(u\in G_0\),

$$\begin{aligned} \Upsilon _{a,z}^{(1,0)}u = {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(a+z)u \quad \textrm{and}\quad \Upsilon _{a,z}^{(0,1)} = {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(a+z) > 0. \end{aligned}$$
(40)

Since \({\textrm{exp}}_\eta \) is monotone increasing,

$$\begin{aligned} \Upsilon (a,z) \ge {{\textrm{E}}_\mu }{\textbf{1}}_{[-1,\infty )}(a){\textrm{exp}}_\eta (a+z) \ge \mu (a\ge -1){\textrm{exp}}_\eta (z-1), \end{aligned}$$

and so \(\lim _{z\uparrow \infty }\Upsilon (a,z)=\infty \). Furthermore, the monotone convergence theorem shows that

$$\begin{aligned} \lim _{z\downarrow -\infty } \Upsilon (a,z) = {{\textrm{E}}_\mu }\lim _{z\downarrow -\infty }{\textrm{exp}}_\eta (a+z) = 0. \end{aligned}$$

So \(\Upsilon (a,{\,\cdot \,})\) is a bijection with strictly positive derivative, and the inverse function theorem shows that it is a \(C^{\lceil \lambda \rceil -1}\)-isomorphism. The implicit mapping theorem shows that \(Z:G_0\rightarrow \mathbb {R}\), defined by \(Z(a)=\Upsilon (a,{\,\cdot \,})^{-1}(1)\), is of class \(C^{\lceil \lambda \rceil -1}\). For any \(a\in G_0\), let P be the probability measure with density \(p={\textrm{exp}}_\eta (a+Z(a))\); then \(\phi _0(P)=a\) and \(P\in M_0\), which proves part (i).

The argument above shows that the inclusion map, \(\rho \), is of class \(C^{\lceil \lambda \rceil -1}\). Let \(c:G\rightarrow G_0\) be the (linear) superposition operator defined by \(c(a)(x)=a(x)-{{\textrm{E}}_\mu }a\); then c is continuous, and has derivative \(c_a^{(1)}u=u-{{\textrm{E}}_\mu }u\). Now \(c\circ \rho \) is the identity map of \(G_0\), which shows that \(\rho \) is homeomorphic onto its image, \(\rho (G_0)\), endowed with the relative topology. Furthermore, for any \(u\in G_0\),

$$\begin{aligned} u=(c\circ \rho )_a^{(1)}u = c_{\rho (a)}^{(1)}\rho _a^{(1)}u, \end{aligned}$$

and so \(\rho _a^{(1)}\) is a toplinear isomorphism, and its image, \(\rho _a^{(1)}G_0\), is a closed linear subspace of G. Let \(E_a\) be the one dimensional subspace of G defined by \(E_a=\{y{\textrm{exp}}_\eta ^{(1)}(\rho (a)): y\in \mathbb {R}\}\). If \(u\in E_a\) and \(v\in \rho _a^{(1)}G_0\) then there exist \(y\in \mathbb {R}\) and \(w\in G_0\) such that

$$\begin{aligned} {{\textrm{E}}_\mu }uv = y{{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))(w-{\textrm{E}}_{P_a}w) = 0. \end{aligned}$$

So \(E_a\cap \rho _a^{(1)}G_0=\{0\}\), and \(\rho _a^{(1)}\) splits G into the direct sum \(E_a\oplus \rho _a^{(1)}G_0\). We have thus shown that \(\rho \) is a \(C^{\lceil \lambda \rceil -1}\)-immersion, and this completes the proof of part (ii).

Jensen’s inequality shows that there exists a \(K_{\eta }<\infty \) such that

$$\begin{aligned} -\log {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a)) \le -{{\textrm{E}}_\mu }\log {\textrm{exp}}_\eta ^{(1)}(\rho (a)) \le K_\eta {{\textrm{E}}_\mu }|\log p| \le K_\eta \Vert a\Vert . \end{aligned}$$

So, for bounded B, \(\inf _{P\in B}{{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))>0\), where \(a=\phi _0(P)\). \(\square \)
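
The formulae in (38) can be recovered by implicit differentiation of the normalisation constraint. The following sketch assumes that Z is sufficiently differentiable (once for the first line, and twice, which requires \(\lambda >2\), for the second); the notation \(Z_a^{(i)}\) for the derivatives of Z is introduced here by analogy with \(\rho _a^{(i)}\). Since \(\rho (a)=a+Z(a)\) and \(\Upsilon (a,Z(a))=1\) for all \(a\in G_0\), differentiating once in the direction \(u\in G_0\) gives

$$\begin{aligned} {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))\big (u+Z_a^{(1)}u\big ) = 0, \quad \mathrm{so\ that\ }\quad Z_a^{(1)}u = -{\textrm{E}}_{P_a}u \quad \textrm{and}\quad \rho _a^{(1)}u = u-{\textrm{E}}_{P_a}u. \end{aligned}$$

Differentiating a second time, in the direction \(v\in G_0\), gives

$$\begin{aligned} {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(2)}(\rho (a))\big (\rho _a^{(1)}u\big )\big (\rho _a^{(1)}v\big ) + {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))\,Z_a^{(2)}(u,v) = 0. \end{aligned}$$

Since \(\rho _a^{(2)}=Z_a^{(2)}\), solving for \(Z_a^{(2)}(u,v)\) yields the second line of (38).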

For any \(P\in M_0\), the tangent space \(T_PM_0\) is a subspace of \(T_PM\) of co-dimension 1; in fact, as shown in the proof of Proposition 3(ii),

$$\begin{aligned} T_PM = T_PM_0 \oplus \{y\hat{U},\,y\in \mathbb {R}\}, \quad \mathrm{where\ }\hat{U}\phi ={\textrm{exp}}_\eta ^{(1)}(\phi (P)). \end{aligned}$$
(41)

Let \(\Phi _0:TM_0\rightarrow G_0\times G_0 \) be defined as follows:

$$\begin{aligned} \Phi _0(P,U) = \Phi (P,U)-{{\textrm{E}}_\mu }\Phi (P,U). \end{aligned}$$
(42)

Then \(\Phi \circ \Phi _0^{-1}(a,u)=(\rho (a),\rho _a^{(1)}u)\). For any \((P,U)\in TM_0\), \(U\phi =\rho _a^{(1)}u=u-{\textrm{E}}_{P_a}u\), and so tangent vectors in \(T_PM_0\) are distinguished from those merely in \(T_PM\) by the fact that their total mass is zero.
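
The zero-mass property can be checked directly from (29) and the definition of the escort probability \(P_a\) in Proposition 3(iii). For \((P,U)\in TM_0\) with \(\Phi _0(P,U)=(a,u)\), the signed measure U has density \({\textrm{exp}}_\eta ^{(1)}(\rho (a))\rho _a^{(1)}u\) with respect to \(\mu \), and so

$$\begin{aligned} \int U(dx) = {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))\big (u-{\textrm{E}}_{P_a}u\big ) = {{\textrm{E}}_\mu }{\textrm{exp}}_\eta ^{(1)}(\rho (a))\,\big ({\textrm{E}}_{P_a}u-{\textrm{E}}_{P_a}u\big ) = 0. \end{aligned}$$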

Any regularity possessed by divergences, entropies and tensors on M involving fewer than \(\lceil \lambda \rceil \) derivatives is also enjoyed by their restrictions to \(M_0\).
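
The objects of this section can be explored numerically on a finite sample space. The following minimal sketch computes the normalisation constant \(Z(a)\) by bisection, and compares the formula \(\rho _a^{(1)}u=u-{\textrm{E}}_{P_a}u\) of (38) with a central finite difference. It rests on assumptions that are not part of this paper: the deformed exponential is replaced by the inverse of \(\log _d y=y-1+\log y\) from [14, 15] (the two-parameter \({\textrm{exp}}_\eta \) is not implemented), the sample space has four points, and the weights and functions are arbitrary. All function names are hypothetical.

```python
import math

def exp_d(x):
    """Stand-in deformed exponential: inverse of log_d(y) = y - 1 + log(y).

    Assumption: taken from [14, 15] as a substitute for exp_eta.
    Adequate for moderate x (math.exp underflows for very negative x).
    """
    y = min(1.0, math.exp(x))          # start at or to the left of the root
    for _ in range(100):               # Newton's method on log_d(y) - x = 0
        step = (x - (y - 1.0 + math.log(y))) / (1.0 + 1.0 / y)
        y += step
        if abs(step) <= 1e-15 * y:
            break
    return y

def exp_d_prime(x):
    """Derivative of exp_d, from the inverse function rule: y / (1 + y)."""
    y = exp_d(x)
    return y / (1.0 + y)

def mean(w, f):
    """Expectation of f under the weights w."""
    return sum(wj * fj for wj, fj in zip(w, f))

def centre(mu, f):
    """Subtract the mu-mean, so that the result lies in G_0."""
    m = mean(mu, f)
    return [fj - m for fj in f]

def normaliser(mu, a):
    """Z(a): the root z of Upsilon(a, z) = E_mu exp_d(a + z) = 1."""
    def upsilon(z):
        return mean(mu, [exp_d(aj + z) for aj in a])
    lo, hi = -1.0, 1.0
    while upsilon(lo) > 1.0:           # Upsilon is increasing in z,
        lo *= 2.0                      # so widen until the root is bracketed
    while upsilon(hi) < 1.0:
        hi *= 2.0
    for _ in range(200):               # bisection
        mid = 0.5 * (lo + hi)
        if upsilon(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rho(mu, a):
    """Inclusion map rho(a) = a + Z(a)."""
    z = normaliser(mu, a)
    return [aj + z for aj in a]

# Illustrative data (assumed): reference probability mu, and a, u in G_0.
mu = [0.1, 0.2, 0.3, 0.4]
a = centre(mu, [1.5, -0.7, 0.2, -2.0])
u = centre(mu, [0.3, 1.0, -0.5, 0.8])

r = rho(mu, a)
total_mass = mean(mu, [exp_d(rj) for rj in r])       # should be close to 1

# Escort probability P_a and the first line of (38)
weights = [mj * exp_d_prime(rj) for mj, rj in zip(mu, r)]
escort = [wj / sum(weights) for wj in weights]
e_u = mean(escort, u)
drho = [uj - e_u for uj in u]

# Central finite difference of rho in the direction u
eps = 1e-5
r_plus = rho(mu, [aj + eps * uj for aj, uj in zip(a, u)])
r_minus = rho(mu, [aj - eps * uj for aj, uj in zip(a, u)])
drho_fd = [(p - m) / (2.0 * eps) for p, m in zip(r_plus, r_minus)]
fd_error = max(abs(x - y) for x, y in zip(drho, drho_fd))

# Total mass of the tangent vector with model-space representation drho
tangent_mass = sum(wj * dj for wj, dj in zip(weights, drho))

print(total_mass, fd_error, tangent_mass)
```

Bisection is used for \(Z(a)\) because \(\Upsilon (a,{\,\cdot \,})\) is monotone increasing, which mirrors the existence argument in the proof of Proposition 3; any other root-finder would serve equally well.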

5 Concluding remarks

This paper has developed a family of non-parametric statistical manifolds that use the two-parameter deformed logarithm of (17), and a variety of model spaces of Sobolev type. It has shown that the mixed-norm space \({W^{k,\lambda }}\) is especially suited to this application. The Amari embedding maps, \(\xi _\alpha \), which are central to the analysis of divergences, entropies and associated tensors, “lift” to continuous nonlinear superposition operators acting on the Sobolev model spaces. (A rare property in the theory of such operators.) Variants of the superposition operators having Sobolev range spaces with weaker topologies enjoy greater regularity; they were shown to admit multiple derivatives on the manifolds, according to the values of the parameters \(k,\lambda \) and \(\eta \). Of course, this paper takes only the first step in a fuller analysis of the information geometry of the manifolds constructed. However, for reasons of space, we shall go no further here.