1 Introduction

Since the seminal work of [1], infinite-dimensional exponential manifolds have been studied from various viewpoints. The dual geometric structure [2, 3] of finite-dimensional exponential families has been extended to infinite-dimensional cases using the topology of Orlicz spaces [4, 5, 6]. Formulations with other function spaces have been explored [7, 8], and the geometry of more general measure spaces has been studied [9]. While theoretical studies on infinite-dimensional exponential manifolds have advanced, the practical use of infinite-dimensional or nonparametric exponential families remains limited.

One of the main difficulties in using an infinite-dimensional exponential family for density estimation is the computation of the normalization factor. The infinite-dimensional exponential family is typically defined by the form \(\exp (f(x)-A(f))q_0(x)\), where f belongs to an appropriate function space, \(q_0\) is the probability density function (p.d.f.) of a base probability, and A(f) is the log of the normalization constant: \(A(f) = \log \int \exp (f(x))q_0(x)d\mu (x)\). This integral is, however, not computable for a general function f.

For finite-dimensional exponential families, the maximum likelihood estimator (MLE) is the central device to estimate a parameter that fits a given sample. The MLE is based on the Kullback–Leibler (KL) divergence to measure the discrepancy between the true distribution and a probability defined by the model. However, to obtain the MLE, we need to compute the mapping from the mean parameter (the mean of sufficient statistics) to the natural parameter. This is not straightforward even for general finite-dimensional exponential families, and is not computable for infinite-dimensional ones.

Recently, Sriperumbudur et al. [10] applied the score matching method [11] to estimate the functional parameter of an infinite-dimensional exponential family defined by a reproducing kernel Hilbert space. The score matching method is based on the Fisher divergence, which does not require the normalization constants to measure the discrepancy and is thus applicable to any infinite- or finite-dimensional exponential family.

The purpose of this paper is twofold. One is to give a brief survey of the estimation methods for infinite-dimensional exponential families, following [10]. The other is to investigate the Fisher divergence as a member of an extended family of divergences. The extended family introduces new ideas to define divergence measures of probabilities and provides a description of Riemannian metrics induced by the divergences.

2 Estimation with infinite-dimensional exponential family

This section explains estimation with the model of the infinite-dimensional exponential family with positive definite kernels [7, 10, 12], focusing mainly on [10].

Let \((\Omega ,\mathcal {B},\mu )\) be a measure space, and k be a measurable positive definite kernel on \(\Omega \). The reproducing kernel Hilbert space (RKHS) \((\mathcal {H}_k,\langle \cdot ,\cdot \rangle _{\mathcal {H}_k})\) is uniquely defined as a function space over \(\Omega \) such that \(\mathcal {H}_k\) has the reproducing kernel \(k(\cdot ,x)\). See, for example, [13, 14] for the basics of RKHSs.

Let \(q_0(x)\) be a probability density function (p.d.f.) with respect to the measure \(\mu \). The exponential family with positive definite kernel k (at \(q_0\)) is defined by

$$\begin{aligned} \mathcal {P}_{k,q_0}:= \left\{ p_f(x) = e^{f(x)-A(f)}q_0(x) \mid f\in \mathcal {F}_{k,q_0} \right\} , \end{aligned}$$
(1)

where

$$\begin{aligned} \mathcal {F}_{k,q_0}:= \left\{ f\in \mathcal {H}_k\mid \int e^{f(x)}q_0(x)d\mu (x) < \infty \right\} \end{aligned}$$

is the parameter space, and \(A(f):= \log \int e^{f(x)}q_0(x)d\mu (x)\) is the log of the normalization constant, which is also known as the cumulant generating function. The function \(f\in \mathcal {F}_{k,q_0}\) is the natural parameter in the terminology of the exponential family; from the reproducing property, the function value is expressed by the inner product as \(f(x) = \langle f, k(\cdot ,x)\rangle _{\mathcal {H}_k}\) so that f and \(k(\cdot ,x)\) are regarded as the natural parameter and sufficient statistics, respectively. The parameter space \(\mathcal {F}_{k,q_0}\) may not be a linear space in general, as a scalar multiple or addition may not preserve the finiteness of the integral. Also, the parameter space depends on the kernel. If \(\mathcal {H}_{k_1}\subset \mathcal {H}_{k_2}\) holds for two kernels \(k_1\) and \(k_2\), then the corresponding parameter spaces also satisfy \(\mathcal {F}_{k_1,q_0}\subset \mathcal {F}_{k_2,q_0}\).
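To make the model concrete, the following minimal Python sketch evaluates the unnormalized density \(\exp (f(x))q_0(x)\) for an f represented as a finite linear combination of Gaussian kernel sections. The function names, the choice of kernel, and the standard normal base density are illustrative assumptions for this example, not part of the definition above.

```python
import numpy as np

def gauss_kernel(x, z, sigma2=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma2)) for rows of x and z."""
    d2 = np.sum((np.atleast_2d(x)[:, None, :] - np.atleast_2d(z)[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma2))

def unnormalized_pf(x, alpha, z, q0, sigma2=1.0):
    """Evaluate exp(f(x)) q0(x) with f = sum_j alpha_j k(., z_j) in H_k;
    the normalizer exp(A(f)) is deliberately left out."""
    f = gauss_kernel(x, z, sigma2) @ alpha
    return np.exp(f) * q0(np.atleast_2d(x))

# toy usage: q0 = standard normal on R, f spanned by three kernel sections
q0 = lambda x: np.exp(-np.sum(x ** 2, axis=1) / 2) / np.sqrt(2 * np.pi)
z = np.array([[-1.0], [0.0], [1.0]])
alpha = np.array([0.5, -0.2, 0.3])
print(unnormalized_pf(np.array([[0.3]]), alpha, z, q0))
```

For a bounded kernel, any such finite combination f is a bounded function, so the integral defining A(f) is finite and f indeed lies in \(\mathcal {F}_{k,q_0}\).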

Let \((X_1,\ldots ,X_n)\) be an i.i.d. sample from the true probability distribution, and suppose one wishes to estimate the density function of the true probability. To estimate the functional parameter f in (1), the maximum likelihood estimation is formulated as

$$\begin{aligned} \max _{f\in \mathcal {F}_{k,q_0}} \frac{1}{n} \sum _{i=1}^n f(X_i) - A(f). \end{aligned}$$

From \(f(X_i) = \langle f,k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), this reduces to the likelihood equation:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n k(\cdot ,X_i) - \frac{\partial A(f)}{\partial f} = 0. \end{aligned}$$
(2)

It is difficult, however, to solve this equation, since A(f) is not computable in most cases. Thus, the maximum likelihood approach has not been successfully used for infinite-dimensional exponential families.

In [10], the score matching method [11] is employed to obtain an estimator of f while avoiding the use of A(f). Let \(\Omega = {\mathbb {R}}^d\), and \(\mu =dx\) be the Lebesgue measure. The score matching method uses the Fisher divergence as the objective of the estimation. For differentiable densities p and q on \({\mathbb {R}}^d\), the Fisher divergence is defined by

$$\begin{aligned} J(p\Vert q)=\frac{1}{2}\int _\Omega \left\| \nabla \log p(x)-\nabla \log q(x) \right\| ^2_2 p(x) dx, \end{aligned}$$
(3)

where \(\nabla \log p(x)= \left( \partial _1\log p(x),\ldots ,\partial _d\log p(x)\right) \) with \(\partial _a\log p(x):=\frac{\partial }{\partial x_a}\log p(x)\). It is important to note that when the p.d.f. to estimate is \(q(x) = p_f(x)\in \mathcal {P}_{k,q_0}\) in (1), the cumulant generating function A(f) does not need to be computed, since \({\partial _a}\log q(x)={\partial _a}f(x)+{\partial _a}\log q_0(x)\); the Fisher divergence is thus readily applicable to the infinite-dimensional exponential family.

It is known that the Fisher divergence has various relations with other divergences. For example, with an appropriate choice of class for q(x), the total variation, Kolmogorov distance, and Wasserstein distance can be upper bounded by \(c_q J(p\Vert q)^{1/2}\), where \(c_q\) is a constant depending on q and the distance measure to compare (see [16], for example). It is also known that the Fisher divergence J and the KL divergence \(D_{KL}\) have an interesting relation in the line of de Bruijn's identity [17]. More precisely, for two random vectors \(X^{(i)}\) (\(i=1,2\)) with p.d.f. \(p^{(i)}\), define their smoothed versions by \(X^{(i)}_t:= X^{(i)} + \sqrt{t}W^{(i)}\) (\(t\ge 0\)) with p.d.f. \(p^{(i)}_t\), where \(W^{(i)}\) are random vectors with the standard normal law \(N(0,I_d)\), independent of \(X^{(i)}\). Then, we have [18]

$$\begin{aligned} \frac{d}{dt} D_{KL}(p^{(1)}_t \Vert p^{(2)}_t) = -J(p^{(1)}_t \Vert p^{(2)}_t). \end{aligned}$$
(4)
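For one-dimensional Gaussians, both sides of (4) have closed forms, so the identity can be checked numerically. The following sketch does this with a central finite difference in t; the function names and the chosen parameters are only for illustration.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL divergence between N(m1, v1) and N(m2, v2) in closed form (1-d)."""
    return 0.5 * (np.log(v2 / v1) + v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0)

def fisher_gauss(m1, v1, m2, v2):
    """Fisher divergence J(N(m1,v1) || N(m2,v2)); the score difference is
    affine in x, so the Gaussian expectation is available in closed form."""
    a = 1.0 / v2 - 1.0 / v1
    b = m1 / v1 - m2 / v2
    return 0.5 * (a ** 2 * v1 + (a * m1 + b) ** 2)

# smoothing with N(0, t) leaves the means unchanged and shifts the variances by t
m1, v1, m2, v2, t, eps = 0.3, 1.0, -0.5, 2.0, 0.7, 1e-5
lhs = (kl_gauss(m1, v1 + t + eps, m2, v2 + t + eps)
       - kl_gauss(m1, v1 + t - eps, m2, v2 + t - eps)) / (2 * eps)
rhs = -fisher_gauss(m1, v1 + t, m2, v2 + t)
print(lhs, rhs)   # the two values agree up to O(eps^2)
```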

We now discuss the empirical estimation of (3). Since p(x) is the p.d.f. of the unknown true probability, it is also desirable to avoid estimating the density p(x) in \(\nabla \log p(x)\), which is difficult in general for high-dimensional data. This is cleverly achieved in [11] by integration by parts. More precisely, assuming that \(\lim _{\vert x_a\vert \rightarrow \infty } \frac{\partial \log q(x)}{\partial x_a}p(x) = 0\), we observe

$$\begin{aligned} \int \frac{\partial \log p(x)}{\partial x_a} \frac{\partial \log q(x)}{\partial x_a} p(x) dx_a&= \int \frac{\partial \log q(x)}{\partial x_a}\frac{\partial p(x)}{\partial x_a} dx_a \nonumber \\&= \left[ \frac{\partial \log q(x)}{\partial x_a}p(x) \right] _{-\infty }^\infty - \int \frac{\partial ^2 \log q(x)}{\partial x_a^2}p(x) dx_a \nonumber \\&= - \int \frac{\partial ^2 \log q(x)}{\partial x_a^2}p(x) dx_a, \end{aligned}$$
(5)

and thus (3) can be rewritten as

$$\begin{aligned}&J(p\Vert q) \nonumber \\&\quad =\int \left\{ \frac{1}{2} \Vert \nabla \log q(x) \Vert ^2_2 + \sum _{a=1}^d \partial _a^2 \log q(x) \right\} p(x) dx + \frac{1}{2} \int \Vert \nabla \log p(x) \Vert ^2_2 p(x)dx, \end{aligned}$$
(6)

where \(\partial _a^2\) denotes \(\frac{\partial ^2}{\partial x_a^2}\). The last term does not depend on q(x), and the first integral is an expectation with respect to p(x), which can be approximated by the sample average. Based on these facts and \(\partial _a \log p_f(x) = \partial _a f(x) +\partial _a \log q_0(x)\), for an i.i.d. sample \((X_1,\ldots ,X_n)\) with p.d.f. p(x), the objective function of the score matching is given by

$$\begin{aligned} \tilde{J}_n(f):= & {} \frac{1}{n}\sum _{i=1}^n \sum _{a=1}^d \Bigl ( \frac{1}{2} \bigl ( \partial _a f(X_i)\bigr )^2 + \partial _a f(X_i)\partial _a \log q_0(X_i) + \partial _a^2 f(X_i) \Bigr ) \nonumber \\{} & {} + \frac{\lambda }{2}\Vert f\Vert _{\mathcal {H}_k}^2, \end{aligned}$$
(7)

where only the terms dependent on the natural parameter f are collected. As in many kernel methods, the regularization term by the squared RKHS norm with coefficient \(\lambda \) is added to prevent the overfitting caused by the optimization over the infinite-dimensional function space. This regularization gives a unique minimizer to the objective function due to the representer theorem [19].

Taking an RKHS as the functional parameter space further provides a simple solution to the minimization of (7). In fact, \(\tilde{J}_n\) is reduced to a quadratic form of f using the facts \(\partial _a f(X_i) = \langle f, \partial _a k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\) and \(\partial _a^2 f(X_i) = \langle f, \partial _a^2 k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), assuming that the kernel is twice differentiable [20, Section 4.3]. The objective function is then given by

$$\begin{aligned} \tilde{J}_n(f)= \frac{1}{2} \langle f, (\hat{C}+\lambda I) f\rangle _{\mathcal {H}_k} + \langle f, \hat{\xi }\rangle _{\mathcal {H}_k}, \end{aligned}$$
(8)

where

$$\begin{aligned} \hat{C}&:=\frac{1}{n}\sum ^n_{i=1}\sum ^d_{a=1}\partial _{a} k(X_i,\cdot )\otimes \partial _{a} k(X_i,\cdot ), \\ \hat{\xi }&:=\frac{1}{n}\sum ^n_{i=1}\sum ^d_{a=1}\left\{ \partial _{a} k(X_i,\cdot )\partial _a\log q_0(X_i)+\partial ^2_{a}k(X_i,\cdot ) \right\} . \end{aligned}$$

By the representer theorem [19], the solution \(f_{\lambda ,n}\) is a linear combination of \(\partial _a k(\cdot ,X_i)\) and \(\partial _a^2 k(\cdot ,X_i)\), and given by

$$\begin{aligned} f_{\lambda ,n}=-\frac{1}{\lambda }\hat{\xi }+\sum ^n_{i=1}\sum ^d_{a=1}\beta _{(i-1)d+a} \partial _a k(X_i,\cdot ), \end{aligned}$$
(9)

where \(\varvec{\beta }=(\beta _{(i-1)d+a})_{i,a}\) is a solution to the linear equation

$$\begin{aligned} \left( G+n\lambda I\right) \varvec{\beta }=\frac{1}{\lambda }\varvec{h} \end{aligned}$$
(10)

with

$$\begin{aligned} (G)_{(i-1)d+a,(j-1)d+b}=\partial _a\partial _{b+d}k(X_i,X_j) \end{aligned}$$

and

$$\begin{aligned} h_{(i-1)d+a}&=\langle \hat{\xi },\partial _a k(X_i,\cdot )\rangle _{\mathcal {H}_k} \\&=\frac{1}{n}\sum ^n_{j=1}\sum ^d_{b=1} \partial _a\partial _{b+d}^{2} k(X_i,X_j) +\partial _a\partial _{b+d} k(X_i,X_j)\partial _b \log q_0(X_j). \end{aligned}$$

Here, the partial derivatives \(\partial _a\) and \(\partial _{b+d}\) are applied to the first and second argument of \(k(X_i,X_j)\), respectively. It is noteworthy that the score matching estimator is obtained simply by solving a linear equation. Some numerical comparisons with the standard kernel density estimation have been reported in [10], which show favorable results in estimating unnormalized densities. The matrix involved in (10) has the size \(nd \times nd\); the linear equation may be difficult to solve for large n and d. A gradient-based optimization may be a solution to this computational difficulty.
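As a concrete illustration of (9)–(10), the following sketch implements the estimator for \(d=1\) with a Gaussian kernel and a standard normal base density \(q_0\) (so that \(\partial \log q_0(x)=-x\) and the kernel derivatives have simple closed forms). The function names, kernel choice, and default parameters are assumptions made for this example; it is not the reference implementation of [10].

```python
import numpy as np

def fit_kernel_score_matching(X, s2=1.0, lam=1e-3):
    """Sketch of the score-matching estimator (9)-(10) for d = 1, with
    Gaussian kernel k(x,y) = exp(-(x-y)^2/(2 s2)) and base density q0 = N(0,1),
    so that d/dx log q0(x) = -x.  Returns the estimated natural parameter f."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    U = X[:, None] - X[None, :]                     # U[i, j] = X_i - X_j
    K = np.exp(-U ** 2 / (2 * s2))
    G = (1.0 / s2 - U ** 2 / s2 ** 2) * K           # d_x d_y k(X_i, X_j)
    DxDyy = U / s2 ** 2 * (3.0 - U ** 2 / s2) * K   # d_x d_y^2 k(X_i, X_j)
    dlogq0 = -X                                     # score of the base density
    # h_i = (1/n) sum_j [ d_x d_y^2 k + (d_x d_y k) * dlog q0(X_j) ]
    h = (DxDyy + G * dlogq0[None, :]).mean(axis=1)
    beta = np.linalg.solve(G + n * lam * np.eye(n), h / lam)   # Eq. (10)

    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        V = X[:, None] - x[None, :]                 # V[i, m] = X_i - x_m
        Kx = np.exp(-V ** 2 / (2 * s2))
        dk = -V / s2 * Kx                           # d/dX_i k(X_i, x)
        d2k = (V ** 2 / s2 ** 2 - 1.0 / s2) * Kx    # d^2/dX_i^2 k(X_i, x)
        xi_hat = (dk * dlogq0[:, None] + d2k).mean(axis=0)   # hat{xi}(x)
        return -xi_hat / lam + beta @ dk            # Eq. (9)
    return f_hat

# toy usage: data from N(1, 0.5^2)
rng = np.random.default_rng(0)
f_hat = fit_kernel_score_matching(rng.normal(1.0, 0.5, size=200), s2=0.5, lam=1e-2)
print(f_hat(np.linspace(-1, 3, 5)))
```

The returned function is the estimated natural parameter \(f_{\lambda ,n}\), so \(\exp (f_{\lambda ,n}(x))q_0(x)\) gives an unnormalized density estimate.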

More recently, methods for estimating the functional parameter f(x) by neural networks have also been considered. In that case, the derivatives in the score matching objective can cause computational difficulty. To mitigate this issue, Song et al. [21] have proposed Sliced Score Matching, which reduces the objective function to one-dimensional problems by projecting the scores onto random directions.

3 Fisher divergence and its extensions

In this section, a family of divergences is introduced that includes the Fisher divergence as a special case. The new family uses operators on a function space to define divergence measures of probabilities.

Many traditional divergences and distances of probabilities are formulated as f-divergences [22,23,24]. Let P and Q be two probability measures on \({\mathbb {R}}^d\). For a convex function \(f:[0,\infty )\rightarrow {\mathbb {R}}\) such that \(f(1)=0\), the f-divergence from P to Q is defined by

$$\begin{aligned} D_f(P\Vert Q):= E_Q\left[ f\left( \frac{dP}{dQ}\right) \right] = \int f\left( \frac{dP}{dQ}(x) \right) dQ(x), \end{aligned}$$
(11)

where the Radon–Nikodym derivative \(\frac{dP}{dQ}\) is assumed to exist. It is well known that popular divergences such as the KL divergence, total variation, \(\chi ^2\)-divergence, and squared Hellinger distance are instances of the f-divergence with a suitable choice of f (see [25], for example).
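For one-dimensional densities on a grid, (11) is straightforward to evaluate by quadrature. The sketch below does so for \(f(t)=t\log t\), which recovers the KL divergence; the helper name and the grid are illustrative.

```python
import numpy as np

def f_divergence(f, p, q, grid):
    """Evaluate (11) on a 1-d grid by the trapezoidal rule; p and q play the
    roles of the densities dP/dmu and dQ/dmu with respect to Lebesgue measure."""
    return np.trapz(f(p(grid) / q(grid)) * q(grid), grid)

# f(t) = t log t gives the KL divergence
grid = np.linspace(-12.0, 12.0, 6001)
p = lambda x: np.exp(-(x - 0.5) ** 2 / 2) / np.sqrt(2 * np.pi)   # N(0.5, 1)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)           # N(0, 1)
print(f_divergence(lambda t: t * np.log(t), p, q, grid))         # analytic value: 0.125
```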

The Fisher divergence does not belong to the f-divergences, however. While the f-divergence is defined by the point-wise value of the Radon–Nikodym derivative dP/dQ(x), the Fisher divergence requires the derivative of dP/dQ(x). In the sequel, a class of divergences based on operators applied to the function dP/dQ(x), or p(x)/q(x), will be considered.

Let \((\mathcal {X},\mathcal {B},\mu )\) be a measure space, \(\mathcal {F}\) be a function space of measurable functions on \(\mathcal {X}\), and \(\mathcal {G}_K\) be a space of K-dimensional vector-valued measurable functions on \(\mathcal {X}\). Assume that there is an operator \(L:\mathcal {F}\rightarrow \mathcal {G}_K\) and a function \(\psi :{\mathbb {R}}^K\rightarrow [0,\infty ]\) such that

  (i) \(\psi (z)=0\) if and only if \(z=0\),

  (ii) \(L\phi = 0\) if and only if \(\phi \in \mathcal {F}\) is a constant function.

Then, for p.d.f.s p and q on \((\mathcal {X},\mathcal {B},\mu )\), the divergence \(D_{\psi L}\) is defined by

$$\begin{aligned} D_{\psi L}(p\Vert q):= E_{X\sim q}\left[ \psi \left( L \left( \frac{p}{q}\right) (X)\right) \right] = \int \psi \left( L\left( \frac{p}{q}\right) (x) \right) q(x)d\mu (x).\nonumber \\ \end{aligned}$$
(12)

It is obvious from the assumptions that \(D_{\psi L}(p\Vert q) \ge 0\) and that the equality holds if and only if p/q is constant, which means \(p=q\). This class of divergences includes the Fisher divergence and other interesting ones, as seen in the following subsections. In the literature, a divergence is often defined so as to satisfy the positive definiteness condition that \(D(p\Vert p+dp)\) should be a positive definite quadratic form in dp. We do not necessarily impose this condition in defining divergences, but discuss it in Sect. 3.4.

An extension of the Fisher divergence using operators has already been introduced in [18], where the authors focus on the specific form \(\frac{L p}{p}\), with L a linear operator, in defining divergences. The class proposed in the current paper is wider, as it allows general nonlinear operators and functional forms.

3.1 Differential operators

Suppose that \(\mathcal {X}\) is the d-dimensional Euclidean space with Lebesgue measure. For \(\alpha \in {\mathbb {R}}\) and \(\beta > 0\), define a differential operator \(L_{\alpha ,\beta }\) with \(K=d\) by

$$\begin{aligned} L_{\alpha ,\beta }\, \phi = \phi ^{-\alpha /\beta }\left( \frac{\partial \phi }{\partial x_1},\ldots , \frac{\partial \phi }{\partial x_d}\right) . \end{aligned}$$
(13)

With the notation \(\partial _a = \frac{\partial }{\partial x_a}\), it follows that

$$\begin{aligned} L_{\alpha ,\beta } \left( \frac{p}{q}\right)= & {} \left( \frac{p}{q}\right) ^{1-\alpha /\beta } \left( \frac{\partial _a p}{p} - \frac{\partial _a q}{q} \right) _{a=1,\ldots ,d}\nonumber \\= & {} \left( \frac{p}{q}\right) ^{1-\alpha /\beta }\left( \partial _a \log p - \partial _a \log q \right) _{a=1,\ldots ,d}. \end{aligned}$$
(14)

Define \(\psi _{\beta }:{\mathbb {R}}^d \rightarrow [0,\infty )\) by

$$\begin{aligned} \psi _{\beta }(z) = \frac{1}{\beta }\sum _{a=1}^d \vert z_a\vert ^\beta . \end{aligned}$$
(15)

Then, the \(\psi L\)-divergence with \(\psi _{\beta }\) and \(L_{\alpha ,\beta }\) is given by

$$\begin{aligned} D_{\alpha ,\beta }(p\Vert q) = \frac{1}{\beta }\sum _{a=1}^d \int \vert \partial _a \log p - \partial _a \log q\vert ^\beta \left( \frac{p}{q}\right) ^{\beta -\alpha -1} p(x)dx, \end{aligned}$$
(16)

which provides a family of divergences. For \(\alpha = \beta -1\), the divergence is given by

$$\begin{aligned} D_{\beta -1,\beta }(p\Vert q) = \frac{1}{\beta }\sum _{a=1}^d \int \vert \partial _a \log p - \partial _a \log q\vert ^\beta p(x)dx, \end{aligned}$$
(17)

which is a generalization of the Fisher divergence (\(\beta =2\)). Note, however, that except for the Fisher divergence, the empirical estimation of (17) requires the estimation of \(\partial _a \log p(x)\), since the integration-by-parts technique (5) is not applicable. This estimation is not straightforward in general for high-dimensional spaces.
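When both densities are known explicitly and one-dimensional, (17) can be evaluated directly by quadrature, which is useful for getting a feel for this family; in a real estimation problem p is only available through samples and its score would have to be estimated. The sketch below makes these simplifying assumptions, and the function name and grid are illustrative.

```python
import numpy as np

def d_beta(dlogp, dlogq, p, beta, grid):
    """Evaluate the divergence (17) on a 1-d grid by the trapezoidal rule,
    assuming the scores d/dx log p, d/dx log q and the density p are known."""
    integrand = np.abs(dlogp(grid) - dlogq(grid)) ** beta * p(grid) / beta
    return np.trapz(integrand, grid)

# example: p = N(0, 1), q = N(0.5, 1.5^2); beta = 2 recovers the Fisher divergence
grid = np.linspace(-10.0, 10.0, 4001)
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
dlogp = lambda x: -x
dlogq = lambda x: -(x - 0.5) / 1.5 ** 2
for beta in (1.5, 2.0, 3.0):
    print(beta, d_beta(dlogp, dlogq, p, beta, grid))
```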

3.2 Integral operators

Let \(h:\mathcal {X}\times \mathcal {X}\rightarrow {\mathbb {R}}\) be an integrable function with respect to the product measure \(\mu \times \mu \) such that the integral operator \(L_h:\mathcal {F}\rightarrow \mathcal {G}\) is well-defined by

$$\begin{aligned} (L_h \phi )(x) = \int h(x,x')\phi (x')d\mu (x'). \end{aligned}$$

This is valid for various choices of \(\mathcal {F}\), \(\mathcal {G}\), and a function class of h. For example, if \(h(x,x')\) is square integrable, which means \(\int \vert h(x,x')\vert ^2 d\mu (x)d\mu (x')<\infty \), then with \(\mathcal {F}=\mathcal {G}=L^2(\mathcal {X};\mu )\) (\(K=1\)), the integral operator \(L_h:L^2(\mathcal {X};\mu )\rightarrow L^2(\mathcal {X};\mu )\) is well-defined as usual.

Assume further that the relation

$$\begin{aligned} \int h(x,x')\phi (x')d\mu (x') = 0\qquad (\text {a.s.}\; x) \end{aligned}$$
(18)

holds if and only if \(\phi \) is a constant function. Then, define an operator \(L_{h}\) by

$$\begin{aligned} L_{h}(\phi )(x):= \phi (x)^{1/2} \int h(x,x')\phi (x')d\mu (x'). \end{aligned}$$
(19)

With \(\psi (z) = z^2\), the divergence \(D_h\) as in (12) is given by

$$\begin{aligned} D_{h}(p\Vert q)&:= \int p(x) \Bigl \{ \int h(x,x')\frac{p}{q}(x') d\mu (x') \Bigr \}^2 d\mu (x) \nonumber \\&= \int \int \int h(x,x')h(x,x'')p(x) d\mu (x) \nonumber \\&\quad \cdot \Bigl (\frac{p}{q}(x')\Bigr ) \Bigl (\frac{p}{q}(x'')\Bigr ) d\mu (x') d\mu (x'') \nonumber \\&= \int \int k_p(x',x'')\frac{p}{q}(x')\frac{p}{q}(x'') d\mu (x') d\mu (x''), \end{aligned}$$
(20)

where

$$\begin{aligned} k_p(x',x''):= \int h(x,x')h(x,x'')p(x) d\mu (x). \end{aligned}$$
(21)

The positivity of the divergence comes from the assumption (18); if \(D_h(p\Vert q)=0\), then p/q is constant, which requires \(p=q\). If there are ways of estimating the kernel \(k_p\) and the ratio p/q, the measure \(D_h\) gives an estimable divergence.

Some methods can be applied to construct a kernel h that satisfies the condition (18). Let \(\mathcal {F}=\mathcal {G}=L^2([0,1]^d)\) with Lebesgue measure, and \(\tilde{h}(x,x')\) be a continuous positive definite kernel on \([0,1]^d\). Then, Mercer's theorem [26] tells us that \(\tilde{h}\) admits a decomposition

$$\begin{aligned} \tilde{h}(x,x') = \sum _{j=0}^\infty \lambda _j \xi _j(x) \xi _j(x'), \end{aligned}$$

which converges absolutely and uniformly on \([0,1]^d\), where \((\xi _j)\) is the orthonormal basis of \(L^2([0,1]^d)\) and \(\lambda _j\ge 0\). Assume that \(\xi _0 = 1\). In this case, by eliminating \(\lambda _0 \xi _0(x)\xi _0(x')\), the kernel

$$\begin{aligned} h(x,x'):= \sum _{j=1}^\infty \lambda _j \xi _j(x)\xi _j(x') = \tilde{h}(x,x') - \lambda _0 \end{aligned}$$

satisfies the condition (18), provided that \(\lambda _j>0\) for all \(j\ge 1\).
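The following small sketch illustrates this construction numerically on \([0,1]\), using a hand-built Mercer-type expansion with the cosine basis \(\xi _0=1\), \(\xi _j(x)=\sqrt{2}\cos (\pi j x)\) and illustrative eigenvalues: removing the \(j=0\) term yields a kernel that annihilates constants but not non-constant functions. The basis, eigenvalues, and truncation level are assumptions made for the example.

```python
import numpy as np

# orthonormal basis of L^2([0,1]): xi_0 = 1, xi_j(x) = sqrt(2) cos(pi j x)
def xi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

J = 20
lam = 1.0 / (1.0 + np.arange(J + 1)) ** 2          # illustrative eigenvalues, all positive

def h(x, xp):
    """h(x,x') = sum_{j>=1} lam_j xi_j(x) xi_j(x') = tilde_h(x,x') - lam_0."""
    return sum(lam[j] * xi(j, x)[:, None] * xi(j, xp)[None, :] for j in range(1, J + 1))

# check condition (18) numerically: constants are annihilated, non-constants are not
x = np.linspace(0.0, 1.0, 501)
H = h(x, x)                                        # matrix h(x_i, x_j) on the grid
print(np.max(np.abs(np.trapz(H * np.ones_like(x)[None, :], x, axis=1))))        # ~ 0
print(np.max(np.abs(np.trapz(H * np.sin(2 * np.pi * x)[None, :], x, axis=1))))  # clearly nonzero
```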

Another example is given by an odd function. Let \(\mathcal {F}=C^1({\mathbb {R}}^d)\) be the Banach space of bounded differentiable functions with bounded continuous derivatives, and \(\mathcal {G}=L^2({\mathbb {R}}^d,{\mathbb {R}}^d;dx)\) be the space of \({\mathbb {R}}^d\)-valued square integrable functions on \({\mathbb {R}}^d\). Assume that a square integrable function w(z) on \({\mathbb {R}}^d\) satisfies \(w(z_1,\ldots ,z_{a-1},-z_a,z_{a+1},\ldots ,z_d)=-w(z_1,\ldots ,z_d)\) for every \(a=1,\ldots ,d\). Then, the kernel

$$\begin{aligned} h(x,x'):= w(x-x') \end{aligned}$$

satisfies \(\int h(x,x')dx'= 0\), so that constant functions are annihilated. The “only if” part can also be satisfied with many such functions; for example, \(w(x)=(x_1,\ldots ,x_d) e^{- \Vert x\Vert ^2/2}\) satisfies the condition (18). In fact, by integration by parts,

$$\begin{aligned}&\int (x_a-x_a')e^{-\Vert x-x'\Vert ^2/2}\phi (x')dx_a' \\&\quad = \left[ e^{-\Vert x-x'\Vert ^2/2} \phi (x')\right] _{-\infty }^{\infty } - \int e^{-\Vert x-x'\Vert ^2/2}\frac{\partial \phi (x')}{\partial x_a'}dx_a' \\&\quad = - \int e^{-\Vert x-x'\Vert ^2/2}\frac{\partial \phi (x')}{\partial x_a'}dx_a', \end{aligned}$$

we see that \(\int h(x,x')\phi (x')dx' = 0\) holds if and only if \(\frac{\partial \phi (x)}{\partial x_a} = 0\) for almost every x and every \(a=1,\ldots ,d\), which means \(\phi (x)\) is constant.

3.3 Combinations of operators

Combining differential and integral operators provides further interesting divergences. Let \(\mathcal {X}={\mathbb {R}}^d\) with Lebesgue measure. We consider an operator L, which is applied to an appropriate function \(\phi \), with integral kernels \(h_1,\ldots ,h_d\). Define L by

$$\begin{aligned} L\phi (x):= \Bigl (\; \phi (x)^{1/2}\int h_a(x,x')\bigl (\phi ^{-1}\partial _a\phi )(x')dx' \; \Bigr )_{a=1,\ldots ,d}, \end{aligned}$$
(22)

and

$$\begin{aligned} \psi (z):= \Vert z \Vert ^2. \end{aligned}$$

As in (14), if \(\phi = p/q\) is plugged in, we have \(\phi ^{-1}\partial _a\phi = \partial _a \log (p/q) = \partial _a\log p - \partial _a\log q\). Instead of (18), we assume for the integral kernels that

$$\begin{aligned} \int h_a(x,x')\phi (x')d\mu (x') = 0\qquad (\text {a.s.}\; x) \end{aligned}$$
(23)

holds if and only if \(\phi \) is zero, which means that the integral operator with kernel \(h_a\) is injective. Under this assumption, we can guarantee that \(L(p/q)=0\) implies \(p=q\). The Gaussian kernel \(h_a(x,x')=\exp (-\frac{1}{2\sigma ^2}\Vert x-x'\Vert ^2)\) satisfies this assumption for the class of \(\phi \) bounded by a polynomial, i.e., \(\vert \phi (x)\vert \le \zeta (x)\) for all x with some polynomial \(\zeta (x)\).

In a similar derivation to (20), the \(\psi L\)-divergence in this case is given by

$$\begin{aligned}&D^M(p\Vert q) \nonumber \\&\quad = \sum _{a=1}^d \int \int k_p^a(x',x'')\bigl (\partial _a\log q(x')-\partial _a \log p(x')\bigr )\bigl (\partial _a\log q(x'')-\partial _a \log p(x'')\bigr )dx'dx'', \end{aligned}$$
(24)

where \(k_p^a\) corresponds to the kernel (21) with \(h_a\). Note that the Fisher divergence can be regarded as a special case of \(D^M\) with the choice \(h_a(x,x') = \delta (x-x')\), which leads to \(k_p^a(x',x'')=\delta (x'-x'')p(x')\).

Given a sample from p(x), empirical estimation of this divergence is possible without knowing the normalization constants of p or q. Let \((\tilde{X}_1, \ldots ,\tilde{X}_n)\) be an i.i.d. sample from p(x)dx. With the estimator

$$\begin{aligned} \hat{k}_p^a(x',x''):= \frac{1}{n} \sum _{i=1}^n h_a(\tilde{X}_i,x')h_a(\tilde{X}_i,x''), \end{aligned}$$

the divergence \(D^M(p\Vert q)\) can be estimated by

$$\begin{aligned} \hat{D}^M := \sum _{a=1}^d \int \int \hat{k}_p^a(x',x'')\bigl (\partial _a\log q(x')-\partial _a \log p(x')\bigr )\bigl (\partial _a\log q(x'')-\partial _a \log p(x'')\bigr )dx'dx''. \end{aligned}$$

Assume further that the densities p and q belong to an infinite-dimensional exponential family; \(p(x)=\exp (f(x)-A(f))q_0(x)\) and \(q(x)=\exp (g(x)-A(g))q_0(x)\). With the choice of

$$\begin{aligned} h_a(x,x')=\frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x-x'\Vert ^2}{2\sigma ^2}}, \end{aligned}$$

the estimator of the divergence is given by

$$\begin{aligned} \hat{D}^M= & {} \sum _{a=1}^d \frac{1}{n}\sum _{i=1}^n \int \int \frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x'-\tilde{X}_i\Vert ^2}{2\sigma ^2}} \frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x''-\tilde{X}_i\Vert ^2}{2\sigma ^2}} \\{} & {} \cdot \Bigl ( \frac{\partial g(x')}{\partial x'_a}-\frac{\partial f(x')}{\partial x'_a} \Bigr )\Bigl ( \frac{\partial g(x'')}{\partial x''_a}-\frac{\partial f(x'')}{\partial x''_a} \Bigr )dx'dx''. \end{aligned}$$

Based on the above formula, the following empirical estimation is possible.

1. \((\tilde{X}_1,\ldots ,\tilde{X}_n)\) is given (an i.i.d. sample from p(x)dx).

2. For each i, take independent samples \(X'_{(i,j)}, X''_{(i,j)}\) (\(j=1,\ldots ,N\)) from \(N(\tilde{X}_i,\sigma ^2 I_d)\).

3. Compute

    $$\begin{aligned} \hat{\hat{D}}^M:= \frac{1}{n}\frac{1}{N}\sum _{i=1}^n\sum _{j=1}^N \sum _{a=1}^d \left( \frac{\partial g(X'_{(i,j)})}{\partial x_a}-\frac{\partial f(X'_{(i,j)})}{\partial x_a} \right) \left( \frac{\partial g(X''_{(i,j)})}{\partial x_a}-\frac{\partial f(X''_{(i,j)})}{\partial x_a} \right) .\nonumber \\ \end{aligned}$$
    (25)

Suppose that the functional parameters f and g are available; for example, density estimation in Sect. 2 may give their estimators. In such a case, the above formula provides a tractable estimate of the divergence without estimating the normalization constants A(f), A(g) or the density functions p(x), q(x).
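The following sketch implements the Monte Carlo estimator (25), assuming that the gradients of f and g are available as callables (for instance, from the estimator of Sect. 2 or a parametric model). The function name, the toy gradients, and the default values of \(\sigma \) and N are illustrative assumptions.

```python
import numpy as np

def dm_estimate(X_tilde, grad_f, grad_g, sigma=0.5, N=50, rng=None):
    """Monte Carlo estimator (25) of D^M(p||q): for each sample point, draw two
    independent Gaussian perturbations and average the product of the
    coordinate-wise score differences.  grad_f, grad_g map (m, d) arrays to (m, d)."""
    rng = np.random.default_rng() if rng is None else rng
    X_tilde = np.atleast_2d(X_tilde)
    n, d = X_tilde.shape
    total = 0.0
    for i in range(n):
        Xp = X_tilde[i] + sigma * rng.standard_normal((N, d))    # X'_(i,j)
        Xpp = X_tilde[i] + sigma * rng.standard_normal((N, d))   # X''_(i,j)
        diff_p = grad_g(Xp) - grad_f(Xp)
        diff_pp = grad_g(Xpp) - grad_f(Xpp)
        total += np.sum(diff_p * diff_pp) / N                    # sum over j and a
    return total / n

# toy usage with hypothetical natural parameters whose gradients are affine
grad_f = lambda x: -0.5 * x
grad_g = lambda x: -0.8 * (x - 1.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(dm_estimate(X, grad_f, grad_g, sigma=0.5, N=100, rng=rng))
```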

3.4 Geometry of divergences

A divergence defined for an exponential family provides a Riemannian structure on the family as a manifold [27]. Below, the Riemannian metrics induced by some of the proposed divergences are discussed for the infinite-dimensional exponential family.

As discussed in [27], given a divergence \(D(p\Vert q)\) on a finite-dimensional exponential family \(\{p_\theta \mid \theta =(\theta _1,\ldots ,\theta _d)\in \Theta \}\), the Riemannian metric \(\sum _{i,j=1}^d g_{ij}(\theta )d\theta ^i d\theta ^j\) is defined by

$$\begin{aligned} g_{ij}(\theta ) = \frac{\partial ^2}{\partial \theta _i \partial \theta _j} D( p_\theta \Vert p_{\tilde{\theta }} ) \vert _{\tilde{\theta }=\theta }. \end{aligned}$$
(26)

By extending this construction to the \(\psi L\)-divergences on the infinite-dimensional exponential family (1), under the assumption that the function \(\psi (z)\) is twice differentiable at \(z=0\), the Riemannian metric tensor at \(p_s=\exp (s(x)-A(s))q_0(x)\) is given by

$$\begin{aligned} g(p_s)[U,V] = \sum _{a,b=1}^K E_{p_s}\left[ \frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\Bigr \vert _{z=0}\, \frac{\partial L^a(\phi )}{\partial \phi }\Bigr \vert _{\phi =1} (U)\, \frac{\partial L^b(\phi )}{\partial \phi }\Bigr \vert _{\phi =1} (V) \right] , \end{aligned}$$
(27)

where \(\frac{\partial L^a(\phi )}{\partial \phi }\) is the functional derivative of the a-th component of the operator L, which is a (generally nonlinear) operator on the tangent space of \(\mathcal {F}\) at \(\phi \). It is known that the tangent space is canonically identified with the vectors \(W\in \mathcal {F}\) with \(E_{p_s}[W]=0\), and the variables U and V in the tensor (27) are two such tangent vectors at \(p_s\). The symmetric tensor \(g(p_s)\) is strictly positive if the Hessian \(\bigl (\frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\Bigr \vert _{z=0}\bigr )\) is strictly positive definite and the operator \(\frac{\partial L^a(\phi )}{\partial \phi }\vert _{\phi =1}\) is injective. Under these conditions, \(g(p_s)\) defines a Riemannian metric tensor on \(\mathcal {P}_{k,q_0}\). Note that every f-divergence induces the same Riemannian metric, namely the Fisher information metric up to scale, irrespective of f. In contrast, the Riemannian metric (27) depends on the operator L.

In the sequel, some examples of the Riemannian metrics will be shown. First, for the divergence \(D_{\alpha ,\beta }\) (16) defined by the differential operator, a non-degenerate Riemannian metric is obtained only in the case \(\alpha =1\) and \(\beta =2\), which is the Fisher divergence. We can see that \((\partial L^a(\phi )/\partial \phi )(U) = -\frac{1}{2}\phi ^{-3/2}(\partial _a\phi )\, U + \phi ^{-1/2} \partial _a U\), so that \((\partial L^a(\phi )/\partial \phi )\vert _{\phi =1}(U)= \partial _a U\). The Riemannian metric \(g^D\) induced by the Fisher divergence is thus given by

$$\begin{aligned} g^D(p_s)[U,V] = \sum _{a=1}^d E_{p_s}\bigl [ \partial _a U(X) \partial _a V(X)\bigr ]. \end{aligned}$$
(28)

Next, for the divergences (20) and (24), the Riemannian metrics are respectively given by

$$\begin{aligned} g^I(p_s)[U,V] = \int \int k_{p_s}(x',x'') U(x') V(x'')d\mu (x')d\mu (x''), \end{aligned}$$
(29)

and

$$\begin{aligned} g^M(p_s)[U,V] = \sum _{a=1}^d \int \int k_{p_s}^a(x',x'') \partial _a U(x') \partial _a V(x'')dx'dx''. \end{aligned}$$
(30)

The derivation is straightforward and omitted here. The tensors \(g^I\) and \(g^M\) are non-degenerate if the kernel \(k_{p_s}\) (or \(k_{p_s}^a\)) is strictly positive definite, which can be guaranteed with a Gaussian kernel for h (or \(h_a\)) under mild conditions on \(p_s\).

4 Conclusion

This paper discussed infinite-dimensional exponential families and the Fisher divergence, with a focus on estimation. The first half of the paper considered density estimation using an infinite-dimensional exponential family and presented a brief survey of previous studies. These studies have shown that the score matching method avoids the computation of the normalization constant and of the probability density function, and leads to a simple, though possibly high-dimensional, linear equation for the estimator.

The second topic was the Fisher divergence, which forms the foundation of the score matching method. We explored the Fisher divergence and introduced an extended class of divergences that employ operators on a function space. This extends the scope of previous studies based on the f-divergence, which uses only point-wise values of the density ratio. We also described the Riemannian structures induced by these divergences.

The geometry of Fisher divergence and relevant divergences remains ripe for further investigation and represents an important area for future research.