1 Introduction

Since the seminal work of [1], infinite-dimensional exponential manifolds have been studied from various viewpoints. The dual geometric structure [2, 3] of finite-dimensional exponential families has been extended to infinite-dimensional cases using the topology of Orlicz spaces [4, 5, 6]. Formulations with other function spaces have been explored [7, 8], and the geometry of more general measure spaces has been studied [9]. While theoretical studies on infinite-dimensional exponential manifolds have advanced, the practical use of infinite-dimensional or nonparametric exponential families remains limited.

One of the main difficulties in using an infinite-dimensional exponential family for density estimation is the computation of the normalization factor. The infinite-dimensional exponential family is typically defined by the form \(\exp (f(x)-A(f))q_0(x)\), where f belongs to an appropriate function space, \(q_0\) is the probability density function (p.d.f.) of a base probability, and A(f) is the log of the normalization constant: \(A(f) = \log \int \exp (f(x))q_0(x)d\mu (x)\). This integral is, however, not computable for a general function f.

For finite-dimensional exponential families, the maximum likelihood estimator (MLE) is the central device to estimate a parameter that fits a given sample. The MLE is based on the Kullback–Leibler (KL) divergence to measure the discrepancy between the true distribution and a probability defined by the model. However, to obtain the MLE, we need to compute the mapping from the mean parameter (the mean of sufficient statistics) to the natural parameter. This is not straightforward even for general finite-dimensional exponential families, and is not computable for infinite-dimensional ones.

Recently, Sriperumbudur et al. [10] applied the score matching method [11] to estimate the functional parameter of an infinite-dimensional exponential family defined by a reproducing kernel Hilbert space. The score matching method is based on the Fisher divergence, which does not require the normalization constants to measure the discrepancy and is thus applicable to any infinite- or finite-dimensional exponential family.

The purpose of this paper is twofold. One is to give a brief survey of the estimation methods for infinite-dimensional exponential families, following [10]. The other is to investigate the Fisher divergence as a member of an extended family of divergences. The extended family introduces new ideas to define divergence measures of probabilities and provides a description of Riemannian metrics induced by the divergences.

2 Estimation with infinite-dimensional exponential family

This section explains estimation with the model of the infinite-dimensional exponential family with positive definite kernels [7, 10, 12], focusing mainly on [10].

Let \((\Omega ,\mathcal {B},\mu )\) be a measure space, and k be a measurable positive definite kernel on \(\Omega \). The reproducing kernel Hilbert space (RKHS) \((\mathcal {H}_k,\langle \cdot ,\cdot \rangle _{\mathcal {H}_k})\) is uniquely defined as a function space over \(\Omega \) such that \(\mathcal {H}_k\) has the reproducing kernel \(k(\cdot ,x)\). See, for example, [13, 14] for the basics of RKHSs.

Let \(q_0(x)\) be a probability density function (p.d.f.) with respect to the measure \(\mu \). The exponential family with positive definite kernel k (at \(q_0\)) is defined by

$$\begin{aligned} \mathcal {P}_{k,q_0}:= \left\{ p_f(x) = e^{f(x)-A(f)}q_0(x) \mid f\in \mathcal {F}_{k,q_0} \right\} , \end{aligned}$$
(1)

where

$$\begin{aligned} \mathcal {F}_{k,q_0}:= \left\{ f\in \mathcal {H}_k\mid \int e^{f(x)}q_0(x)d\mu (x) < \infty \right\} \end{aligned}$$

is the parameter space, and \(A(f):= \log \int e^{f(x)}q_0(x)d\mu (x)\) is the log of the normalization constant, which is also known as the cumulant generating function. The function \(f\in \mathcal {F}_{k,q_0}\) is the natural parameter in the terminology of the exponential family; from the reproducing property, the function value is expressed by the inner product as \(f(x) = \langle f, k(\cdot ,x)\rangle _{\mathcal {H}_k}\) so that f and \(k(\cdot ,x)\) are regarded as the natural parameter and sufficient statistics, respectively. The parameter space \(\mathcal {F}_{k,q_0}\) may not be a linear space in general, as a scalar multiple or addition may not preserve the finiteness of the integral. Also, the parameter space depends on the kernel. If \(\mathcal {H}_{k_1}\subset \mathcal {H}_{k_2}\) holds for two kernels \(k_1\) and \(k_2\), then the corresponding parameter spaces also satisfy \(\mathcal {F}_{k_1,q_0}\subset \mathcal {F}_{k_2,q_0}\).
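To make the model concrete, the following minimal Python sketch evaluates the unnormalized density \(\exp (f(x))q_0(x)\) for an f represented as a finite linear combination of Gaussian kernel sections. The function names, the choice of kernel, and the standard normal base density are illustrative assumptions for this example, not part of the definition above.

```python
import numpy as np

def gauss_kernel(x, z, sigma2=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma2)) for rows of x and z."""
    d2 = np.sum((np.atleast_2d(x)[:, None, :] - np.atleast_2d(z)[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma2))

def unnormalized_pf(x, alpha, z, q0, sigma2=1.0):
    """Evaluate exp(f(x)) q0(x) with f = sum_j alpha_j k(., z_j) in H_k;
    the normalizer exp(A(f)) is deliberately left out."""
    f = gauss_kernel(x, z, sigma2) @ alpha
    return np.exp(f) * q0(np.atleast_2d(x))

# toy usage: q0 = standard normal on R, f spanned by three kernel sections
q0 = lambda x: np.exp(-np.sum(x ** 2, axis=1) / 2) / np.sqrt(2 * np.pi)
z = np.array([[-1.0], [0.0], [1.0]])
alpha = np.array([0.5, -0.2, 0.3])
print(unnormalized_pf(np.array([[0.3]]), alpha, z, q0))
```

For a bounded kernel, any such finite combination f is a bounded function, so the integral defining A(f) is finite and f indeed lies in \(\mathcal {F}_{k,q_0}\).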

Let \((X_1,\ldots ,X_n)\) be an i.i.d. sample from the true probability distribution, and suppose one wishes to estimate the density function of the true probability. To estimate the functional parameter f in (1), the maximum likelihood estimation is formulated as

$$\begin{aligned} \max _{f\in \mathcal {F}_{k,q_0}} \frac{1}{n} \sum _{i=1}^n f(X_i) - A(f). \end{aligned}$$

From \(f(X_i) = \langle f,k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), this reduces to the likelihood equation:

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n k(\cdot ,X_i) - \frac{\partial A(f)}{\partial f} = 0. \end{aligned}$$
(2)

It is difficult, however, to solve this equation, since A(f) is not computable in most cases. Thus, the maximum likelihood approach has not been successfully used for infinite-dimensional exponential families.

In [10], the score matching method [11] is employed to obtain an estimator of f while avoiding the use of A(f). Let \(\Omega = {\mathbb {R}}^d\), and \(\mu =dx\) be the Lebesgue measure. The score matching method uses the Fisher divergence as the objective of the estimation. For differentiable densities p and q on \({\mathbb {R}}^d\), the Fisher divergence is defined by

$$\begin{aligned} J(p\Vert q)=\frac{1}{2}\int _\Omega \left\| \nabla \log p(x)-\nabla \log q(x) \right\| ^2_2 p(x) dx, \end{aligned}$$
(3)

where \(\nabla \log p(x)= \left( \partial _1\log p(x),\ldots ,\partial _d\log p(x)\right) \) with \(\partial _a\log p(x):=\frac{\partial }{\partial x_a}\log p(x)\). It is important to note that when the p.d.f. to estimate is \(q(x) = p_f(x)\in \mathcal {P}_{k,q_0}\) in (1), the cumulant generating function A(f) does not need to be computed, since \({\partial _a}\log q(x)={\partial _a}f(x)+{\partial _a}\log q_0(x)\); the Fisher divergence is thus readily applicable to the infinite-dimensional exponential family.

It is known that the Fisher divergence has various relations with other divergences. For example, with an appropriate choice of class for q(x), the total variation, Kolmogorov distance, and Wasserstein distance can be upper bounded by \(c_q J(p\Vert q)^{1/2}\), where \(c_q\) is a constant depending on q and the distance measure to compare (see [16], for example). It is also known that the Fisher divergence J and the KL divergence \(D_{KL}\) have an interesting relation in the line of de Bruijn's identity [17]. More precisely, for two random vectors \(X^{(i)}\) (\(i=1,2\)) with p.d.f. \(p^{(i)}\), define their smoothed versions by \(X^{(i)}_t:= X^{(i)} + \sqrt{t}W^{(i)}\) (\(t\ge 0\)) with p.d.f. \(p^{(i)}_t\), where \(W^{(i)}\) are random vectors with the standard normal law \(N(0,I_d)\), independent of \(X^{(i)}\). Then, we have [18]

$$\begin{aligned} \frac{d}{dt} D_{KL}(p^{(1)}_t \Vert p^{(2)}_t) = -J(p^{(1)}_t \Vert p^{(2)}_t). \end{aligned}$$
(4)
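For one-dimensional Gaussians, both sides of (4) have closed forms, so the identity can be checked numerically. The following sketch does this with a central finite difference in t; the function names and the chosen parameters are only for illustration.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL divergence between N(m1, v1) and N(m2, v2) in closed form (1-d)."""
    return 0.5 * (np.log(v2 / v1) + v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0)

def fisher_gauss(m1, v1, m2, v2):
    """Fisher divergence J(N(m1,v1) || N(m2,v2)); the score difference is
    affine in x, so the Gaussian expectation is available in closed form."""
    a = 1.0 / v2 - 1.0 / v1
    b = m1 / v1 - m2 / v2
    return 0.5 * (a ** 2 * v1 + (a * m1 + b) ** 2)

# smoothing with N(0, t) leaves the means unchanged and shifts the variances by t
m1, v1, m2, v2, t, eps = 0.3, 1.0, -0.5, 2.0, 0.7, 1e-5
lhs = (kl_gauss(m1, v1 + t + eps, m2, v2 + t + eps)
       - kl_gauss(m1, v1 + t - eps, m2, v2 + t - eps)) / (2 * eps)
rhs = -fisher_gauss(m1, v1 + t, m2, v2 + t)
print(lhs, rhs)   # the two values agree up to O(eps^2)
```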

We now discuss the empirical estimation of (3). Since p(x) is the p.d.f. of the unknown true probability, it is also desirable to avoid estimating the density p(x) in \(\nabla \log p(x)\), which is difficult in general for high-dimensional data. This is cleverly achieved in [11] by integration by parts. More precisely, assuming that \(\lim _{\vert x_a\vert \rightarrow \infty } \frac{\partial \log q(x)}{\partial x_a}p(x) = 0\), we observe

$$\begin{aligned} \int \frac{\partial \log p(x)}{\partial x_a} \frac{\partial \log q(x)}{\partial x_a} p(x) dx_a&= \int \frac{\partial \log q(x)}{\partial x_a}\frac{\partial p(x)}{\partial x_a} dx_a \nonumber \\&= \left[ \frac{\partial \log q(x)}{\partial x_a}p(x) \right] _{-\infty }^\infty - \int \frac{\partial ^2 \log q(x)}{\partial x_a^2}p(x) dx_a \nonumber \\&= - \int \frac{\partial ^2 \log q(x)}{\partial x_a^2}p(x) dx_a, \end{aligned}$$
(5)

and thus (3) can be rewritten as

$$\begin{aligned}&J(p\Vert q) \nonumber \\&\quad =\int \left\{ \frac{1}{2} \Vert \nabla \log q(x) \Vert ^2_2 + \sum _{a=1}^d \partial _a^2 \log q(x) \right\} p(x) dx + \frac{1}{2} \int \Vert \nabla \log p(x) \Vert ^2_2 p(x)dx, \end{aligned}$$
(6)

where \(\partial _a^2\) denotes \(\frac{\partial ^2}{\partial x_a^2}\). The last term does not depend on q(x), and the first integral is an expectation with respect to p(x), which can be approximated by the sample average. Based on these facts and \(\partial _a \log p_f(x) = \partial _a f(x) +\partial _a \log q_0(x)\), for an i.i.d. sample \((X_1,\ldots ,X_n)\) with p.d.f. p(x), the objective function of the score matching is given by

$$\begin{aligned} \tilde{J}_n(f):= & {} \frac{1}{n}\sum _{i=1}^n \sum _{a=1}^d \Bigl ( \frac{1}{2} \bigl ( \partial _a f(X_i)\bigr )^2 + \partial _a f(X_i)\partial _a \log q_0(X_i) + \partial _a^2 f(X_i) \Bigr ) \nonumber \\{} & {} + \frac{\lambda }{2}\Vert f\Vert _{\mathcal {H}_k}^2, \end{aligned}$$
(7)

where only the terms dependent on the natural parameter f are collected. As in many kernel methods, the regularization term by the squared RKHS norm with coefficient \(\lambda \) is added to prevent the overfitting caused by the optimization over the infinite-dimensional function space. This regularization gives a unique minimizer to the objective function due to the representer theorem [19].

Taking an RKHS as the functional parameter space further provides a simple solution to the minimization of (7). In fact, \(\tilde{J}_n\) is reduced to a quadratic form of f using the facts \(\partial _a f(X_i) = \langle f, \partial _a k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\) and \(\partial _a^2 f(X_i) = \langle f, \partial _a^2 k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), assuming that the kernel is twice differentiable [20, Section 4.3]. The objective function is then given by

$$\begin{aligned} \tilde{J}_n(f)= \frac{1}{2} \langle f, (\hat{C}+\lambda I) f\rangle _{\mathcal {H}_k} + \langle f, \hat{\xi }\rangle _{\mathcal {H}_k}, \end{aligned}$$
(8)

where

$$\begin{aligned} \hat{C}&:=\frac{1}{n}\sum ^n_{i=1}\sum ^d_{a=1}\partial _{a} k(X_i,\cdot )\otimes \partial _{a} k(X_i,\cdot ), \\ \hat{\xi }&:=\frac{1}{n}\sum ^n_{i=1}\sum ^d_{a=1}\left\{ \partial _{a} k(X_i,\cdot )\partial _a\log q_0(X_i)+\partial ^2_{a}k(X_i,\cdot ) \right\} . \end{aligned}$$

By the representer theorem [19], the solution \(f_{\lambda ,n}\) is a linear combination of \(\partial _a k(\cdot ,X_i)\) and \(\partial _a^2 k(\cdot ,X_i)\), and given by

$$\begin{aligned} f_{\lambda ,n}=-\frac{1}{\lambda }\hat{\xi }+\sum ^n_{i=1}\sum ^d_{a=1}\beta _{(i-1)d+a} \partial _a k(X_i,\cdot ), \end{aligned}$$
(9)

where \(\varvec{\beta }=(\beta _{(i-1)d+a})_{i,a}\) is a solution to the linear equation

$$\begin{aligned} \left( G+n\lambda I\right) \varvec{\beta }=\frac{1}{\lambda }\varvec{h} \end{aligned}$$
(10)

with

$$\begin{aligned} (G)_{(i-1)d+a,(j-1)d+b}=\partial _a\partial _{b+d}k(X_i,X_j) \end{aligned}$$

and

$$\begin{aligned} h_{(i-1)d+a}&=\langle \hat{\xi },\partial _a k(X_i,\cdot )\rangle _{\mathcal {H}_k} \\&=\frac{1}{n}\sum ^n_{j=1}\sum ^d_{b=1} \partial _a\partial _{b+d}^{2} k(X_i,X_j) +\partial _a\partial _{b+d} k(X_i,X_j)\partial _b \log q_0(X_j). \end{aligned}$$

Here, the partial derivatives \(\partial _a\) and \(\partial _{b+d}\) are applied to the first and second argument of \(k(X_i,X_j)\), respectively. It is noteworthy that the score matching estimator is obtained simply by solving a linear equation. Some numerical comparisons with the standard kernel density estimation have been reported in [10], which show favorable results in estimating unnormalized densities. The matrix involved in (10) has the size \(nd \times nd\); the linear equation may be difficult to solve for large n and d. A gradient-based optimization may be a solution to this computational difficulty.
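As a concrete illustration of (9)–(10), the following sketch implements the estimator for \(d=1\) with a Gaussian kernel and a standard normal base density \(q_0\) (so that \(\partial \log q_0(x)=-x\) and the kernel derivatives have simple closed forms). The function names, kernel choice, and default parameters are assumptions made for this example; it is not the reference implementation of [10].

```python
import numpy as np

def fit_kernel_score_matching(X, s2=1.0, lam=1e-3):
    """Sketch of the score-matching estimator (9)-(10) for d = 1, with
    Gaussian kernel k(x,y) = exp(-(x-y)^2/(2 s2)) and base density q0 = N(0,1),
    so that d/dx log q0(x) = -x.  Returns the estimated natural parameter f."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    U = X[:, None] - X[None, :]                     # U[i, j] = X_i - X_j
    K = np.exp(-U ** 2 / (2 * s2))
    G = (1.0 / s2 - U ** 2 / s2 ** 2) * K           # d_x d_y k(X_i, X_j)
    DxDyy = U / s2 ** 2 * (3.0 - U ** 2 / s2) * K   # d_x d_y^2 k(X_i, X_j)
    dlogq0 = -X                                     # score of the base density
    # h_i = (1/n) sum_j [ d_x d_y^2 k + (d_x d_y k) * dlog q0(X_j) ]
    h = (DxDyy + G * dlogq0[None, :]).mean(axis=1)
    beta = np.linalg.solve(G + n * lam * np.eye(n), h / lam)   # Eq. (10)

    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        V = X[:, None] - x[None, :]                 # V[i, m] = X_i - x_m
        Kx = np.exp(-V ** 2 / (2 * s2))
        dk = -V / s2 * Kx                           # d/dX_i k(X_i, x)
        d2k = (V ** 2 / s2 ** 2 - 1.0 / s2) * Kx    # d^2/dX_i^2 k(X_i, x)
        xi_hat = (dk * dlogq0[:, None] + d2k).mean(axis=0)   # hat{xi}(x)
        return -xi_hat / lam + beta @ dk            # Eq. (9)
    return f_hat

# toy usage: data from N(1, 0.5^2)
rng = np.random.default_rng(0)
f_hat = fit_kernel_score_matching(rng.normal(1.0, 0.5, size=200), s2=0.5, lam=1e-2)
print(f_hat(np.linspace(-1, 3, 5)))
```

The returned function is the estimated natural parameter \(f_{\lambda ,n}\), so \(\exp (f_{\lambda ,n}(x))q_0(x)\) gives an unnormalized density estimate.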

More recently, methods for estimating the functional parameter f(x) by neural networks have also been considered. In that case, the derivatives in the score matching objective can cause computational difficulty. To mitigate this issue, Song et al. [21] have proposed Sliced Score Matching, which reduces the objective function to one-dimensional problems by projecting the scores onto random directions.

3 Fisher divergence and its extensions

In this section, a family of divergences is introduced that includes the Fisher divergence as a special case. The new family uses operators on a function space to define divergence measures of probabilities.

Many traditional divergences and distances of probabilities are formulated as f-divergences [22,23,24]. Let P and Q be two probability measures on \({\mathbb {R}}^d\). For a convex function \(f:[0,\infty )\rightarrow {\mathbb {R}}\) such that \(f(1)=0\), the f-divergence from P to Q is defined by

$$\begin{aligned} D_f(P\Vert Q):= E_Q\left[ f\left( \frac{dP}{dQ}\right) \right] = \int f\left( \frac{dP}{dQ}(x) \right) dQ(x), \end{aligned}$$
(11)

where the Radon–Nikodym derivative \(\frac{dP}{dQ}\) is assumed to exist. It is well known that popular divergences such as the KL divergence, total variation, \(\chi ^2\)-divergence, and squared Hellinger distance are instances of the f-divergence with a suitable choice of f (see [25], for example).
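For one-dimensional densities on a grid, (11) is straightforward to evaluate by quadrature. The sketch below does so for \(f(t)=t\log t\), which recovers the KL divergence; the helper name and the grid are illustrative.

```python
import numpy as np

def f_divergence(f, p, q, grid):
    """Evaluate (11) on a 1-d grid by the trapezoidal rule; p and q play the
    roles of the densities dP/dmu and dQ/dmu with respect to Lebesgue measure."""
    return np.trapz(f(p(grid) / q(grid)) * q(grid), grid)

# f(t) = t log t gives the KL divergence
grid = np.linspace(-12.0, 12.0, 6001)
p = lambda x: np.exp(-(x - 0.5) ** 2 / 2) / np.sqrt(2 * np.pi)   # N(0.5, 1)
q = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)           # N(0, 1)
print(f_divergence(lambda t: t * np.log(t), p, q, grid))         # analytic value: 0.125
```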

The Fisher divergence does not belong to the f-divergences, however. While the f-divergence is defined by the point-wise value of the Radon–Nikodym derivative dP/dQ(x), the Fisher divergence requires the derivative of dP/dQ(x). In the sequel, a class of divergences based on operators applied to the function dP/dQ(x), or p(x)/q(x), will be considered.

Let \((\mathcal {X},\mathcal {B},\mu )\) be a measure space, \(\mathcal {F}\) be a function space of measurable functions on \(\mathcal {X}\), and \(\mathcal {G}_K\) be a space of K-dimensional vector-valued measurable functions on \(\mathcal {X}\). Assume that there is an operator \(L:\mathcal {F}\rightarrow \mathcal {G}_K\) and a function \(\psi :{\mathbb {R}}^K\rightarrow [0,\infty ]\) such that

  (i) \(\psi (z)=0\) if and only if \(z=0\),

  (ii) \(L\phi = 0\) if and only if \(\phi \in \mathcal {F}\) is a constant function.

Then, for p.d.f.s p and q on \((\mathcal {X},\mathcal {B},\mu )\), the divergence \(D_{\psi L}\) is defined by

$$\begin{aligned} D_{\psi L}(p\Vert q):= E_{X\sim q}\left[ \psi \left( L \left( \frac{p}{q}\right) (X)\right) \right] = \int \psi \left( L\left( \frac{p}{q}\right) (x) \right) q(x)d\mu (x).\nonumber \\ \end{aligned}$$
(12)

It is obvious from the assumptions that \(D_{\psi L}(p\Vert q) \ge 0\) and that the equality holds if and only if p/q is constant, which means \(p=q\). This class of divergences includes the Fisher divergence and other interesting ones, as seen in the following subsections. In the literature, a divergence is often defined so as to satisfy the positive definiteness condition that \(D(p\Vert p+dp)\) should be a positive definite quadratic form in dp. We do not necessarily impose this condition in defining divergences, but discuss it in Sect. 3.4.

An extension of the Fisher divergence using operators has already been introduced in [18], where the authors focus on the specific form \(\frac{L p}{p}\), with L a linear operator, in defining divergences. The class proposed in the current paper is wider, as it allows general nonlinear operators and functional forms.

3.1 Differential operators

Suppose that \(\mathcal {X}\) is the d-dimensional Euclidean space with Lebesgue measure. For \(\alpha \in {\mathbb {R}}\) and \(\beta > 0\), define a differential operator \(L_{\alpha ,\beta }\) with \(K=d\) by

$$\begin{aligned} L_{\alpha ,\beta }\, \phi = \phi ^{-\alpha /\beta }\left( \frac{\partial \phi }{\partial x_1},\ldots , \frac{\partial \phi }{\partial x_d}\right) . \end{aligned}$$
(13)

With the notation \(\partial _a = \frac{\partial }{\partial x_a}\), it follows that

$$\begin{aligned} L_{\alpha ,\beta } \left( \frac{p}{q}\right)= & {} \left( \frac{p}{q}\right) ^{1-\alpha /\beta } \left( \frac{\partial _a p}{p} - \frac{\partial _a q}{q} \right) _{a=1,\ldots ,d}\nonumber \\= & {} \left( \frac{p}{q}\right) ^{1-\alpha /\beta }\left( \partial _a \log p - \partial _a \log q \right) _{a=1,\ldots ,d}. \end{aligned}$$
(14)

Define \(\psi _{\beta }:{\mathbb {R}}^d \rightarrow [0,\infty )\) by

$$\begin{aligned} \psi _{\beta }(z) = \frac{1}{\beta }\sum _{a=1}^d \vert z_a\vert ^\beta . \end{aligned}$$
(15)

Then, the \(\psi L\)-divergence with \(\psi _{\beta }\) and \(L_{\alpha ,\beta }\) is given by

$$\begin{aligned} D_{\alpha ,\beta }(p\Vert q) = \frac{1}{\beta }\sum _{a=1}^d \int \vert \partial _a \log p - \partial _a \log q\vert ^\beta \left( \frac{p}{q}\right) ^{\beta -\alpha -1} p(x)dx, \end{aligned}$$
(16)

which provides a family of divergences. For \(\alpha = \beta -1\), the divergence is given by

$$\begin{aligned} D_{\beta -1,\beta }(p\Vert q) = \frac{1}{\beta }\sum _{a=1}^d \int \vert \partial _a \log p - \partial _a \log q\vert ^\beta p(x)dx, \end{aligned}$$
(17)

which is a generalization of the Fisher divergence (\(\beta =2\)). Note, however, that except for the Fisher divergence, the empirical estimation of (17) requires the estimation of \(\partial _a \log p(x)\), since the integration-by-parts technique (5) is not applicable. This estimation is not straightforward in general for high-dimensional spaces.
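When both densities are known explicitly and one-dimensional, (17) can be evaluated directly by quadrature, which is useful for getting a feel for this family; in a real estimation problem p is only available through samples and its score would have to be estimated. The sketch below makes these simplifying assumptions, and the function name and grid are illustrative.

```python
import numpy as np

def d_beta(dlogp, dlogq, p, beta, grid):
    """Evaluate the divergence (17) on a 1-d grid by the trapezoidal rule,
    assuming the scores d/dx log p, d/dx log q and the density p are known."""
    integrand = np.abs(dlogp(grid) - dlogq(grid)) ** beta * p(grid) / beta
    return np.trapz(integrand, grid)

# example: p = N(0, 1), q = N(0.5, 1.5^2); beta = 2 recovers the Fisher divergence
grid = np.linspace(-10.0, 10.0, 4001)
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
dlogp = lambda x: -x
dlogq = lambda x: -(x - 0.5) / 1.5 ** 2
for beta in (1.5, 2.0, 3.0):
    print(beta, d_beta(dlogp, dlogq, p, beta, grid))
```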

3.2 Integral operators

Let \(h:\mathcal {X}\times \mathcal {X}\rightarrow {\mathbb {R}}\) be an integrable function with respect to the product measure \(\mu \times \mu \) such that the integral operator \(L_h:\mathcal {F}\rightarrow \mathcal {G}\) is well-defined by

$$\begin{aligned} (L_h \phi )(x) = \int h(x,x')\phi (x')d\mu (x'). \end{aligned}$$

This is valid for various choices of \(\mathcal {F}\), \(\mathcal {G}\), and a function class of h. For example, if \(h(x,x')\) is square integrable, which means \(\int \vert h(x,x')\vert ^2 d\mu (x)d\mu (x')<\infty \), then with \(\mathcal {F}=\mathcal {G}=L^2(\mathcal {X};\mu )\) (\(K=1\)), the integral operator \(L_h:L^2(\mathcal {X};\mu )\rightarrow L^2(\mathcal {X};\mu )\) is well-defined as usual.

Assume further that the relation

$$\begin{aligned} \int h(x,x')\phi (x')d\mu (x') = 0\qquad (\text {a.s.}\; x) \end{aligned}$$
(18)

holds if and only if \(\phi \) is a constant function. Then, define an operator \(L_{h}\) by

$$\begin{aligned} L_{h}(\phi )(x):= \phi (x)^{1/2} \int h(x,x')\phi (x')d\mu (x'). \end{aligned}$$
(19)

With \(\psi (z) = z^2\), the divergence \(D_h\) as in (12) is given by

$$\begin{aligned} D_{h}(p\Vert q)&:= \int p(x) \Bigl \{ \int h(x,x')\frac{p}{q}(x') d\mu (x') \Bigr \}^2 d\mu (x) \nonumber \\&= \int \int \int h(x,x')h(x,x'')p(x) d\mu (x) \nonumber \\&\quad \cdot \Bigl (\frac{p}{q}(x')\Bigr ) \Bigl (\frac{p}{q}(x'')\Bigr ) d\mu (x') d\mu (x'') \nonumber \\&= \int \int k_p(x',x'')\frac{p}{q}(x')\frac{p}{q}(x'') d\mu (x') d\mu (x''), \end{aligned}$$
(20)

where

$$\begin{aligned} k_p(x',x''):= \int h(x,x')h(x,x'')p(x) d\mu (x). \end{aligned}$$
(21)

The positivity of the divergence comes from the assumption (18); if \(D_h(p\Vert q)=0\), then p/q is constant, which requires \(p=q\). If there are ways of estimating the kernel \(k_p\) and the ratio p/q, the measure \(D_h\) gives an estimable divergence.

Some methods can be applied to construct a kernel h that satisfies the condition (18). Let \(\mathcal {F}=\mathcal {G}=L^2([0,1]^d)\) with Lebesgue measure, and \(\tilde{h}(x,x')\) be a continuous positive definite kernel on \([0,1]^d\). Then, Mercer's theorem [26] tells us that \(\tilde{h}\) admits a decomposition

$$\begin{aligned} \tilde{h}(x,x') = \sum _{j=0}^\infty \lambda _j \xi _j(x) \xi _j(x'), \end{aligned}$$

which converges absolutely and uniformly on \([0,1]^d\), where \((\xi _j)\) is the orthonormal basis of \(L^2([0,1]^d)\) and \(\lambda _j\ge 0\). Assume that \(\xi _0 = 1\). In this case, by eliminating \(\lambda _0 \xi _0(x)\xi _0(x')\), the kernel

$$\begin{aligned} h(x,x'):= \sum _{j=1}^\infty \lambda _j \xi _j(x)\xi _j(x') = \tilde{h}(x,x') - \lambda _0 \end{aligned}$$

satisfies the condition (18), provided that \(\lambda _j>0\) for all \(j\ge 1\).
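The following small sketch illustrates this construction numerically on \([0,1]\), using a hand-built Mercer-type expansion with the cosine basis \(\xi _0=1\), \(\xi _j(x)=\sqrt{2}\cos (\pi j x)\) and illustrative eigenvalues: removing the \(j=0\) term yields a kernel that annihilates constants but not non-constant functions. The basis, eigenvalues, and truncation level are assumptions made for the example.

```python
import numpy as np

# orthonormal basis of L^2([0,1]): xi_0 = 1, xi_j(x) = sqrt(2) cos(pi j x)
def xi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

J = 20
lam = 1.0 / (1.0 + np.arange(J + 1)) ** 2          # illustrative eigenvalues, all positive

def h(x, xp):
    """h(x,x') = sum_{j>=1} lam_j xi_j(x) xi_j(x') = tilde_h(x,x') - lam_0."""
    return sum(lam[j] * xi(j, x)[:, None] * xi(j, xp)[None, :] for j in range(1, J + 1))

# check condition (18) numerically: constants are annihilated, non-constants are not
x = np.linspace(0.0, 1.0, 501)
H = h(x, x)                                        # matrix h(x_i, x_j) on the grid
print(np.max(np.abs(np.trapz(H * np.ones_like(x)[None, :], x, axis=1))))        # ~ 0
print(np.max(np.abs(np.trapz(H * np.sin(2 * np.pi * x)[None, :], x, axis=1))))  # clearly nonzero
```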

Another example is given by an odd function. Let \(\mathcal {F}=C^1({\mathbb {R}}^d)\) be the Banach space of bounded differentiable functions with bounded continuous derivatives, and \(\mathcal {G}=L^2({\mathbb {R}}^d,{\mathbb {R}}^d;dx)\) be the space of \({\mathbb {R}}^d\)-valued square integrable functions on \({\mathbb {R}}^d\). Assume that a square integrable function w(z) on \({\mathbb {R}}^d\) satisfies \(w(z_1,\ldots ,z_{a-1},-z_a,z_{a+1},\ldots ,z_d)=-w(z_1,\ldots ,z_d)\) for every \(a=1,\ldots ,d\). Then, the kernel

$$\begin{aligned} h(x,x'):= w(x-x') \end{aligned}$$

satisfies \(\int h(x,x')dx'= 0\), so that constant functions are annihilated. The “only if” part can also be satisfied with many such functions; for example, \(w(x)=(x_1,\ldots ,x_d) e^{- \Vert x\Vert ^2/2}\) satisfies the condition (18). In fact, by integration by parts,

$$\begin{aligned}&\int (x_a-x_a')e^{-\Vert x-x'\Vert ^2/2}\phi (x')dx_a' \\&\quad = \left[ e^{-\Vert x-x'\Vert ^2/2} \phi (x')\right] _{-\infty }^{\infty } - \int e^{-\Vert x-x'\Vert ^2/2}\frac{\partial \phi (x')}{\partial x_a'}dx_a' \\&\quad = - \int e^{-\Vert x-x'\Vert ^2/2}\frac{\partial \phi (x')}{\partial x_a'}dx_a', \end{aligned}$$

we see that \(\int h(x,x')\phi (x')dx' = 0\) holds if and only if \(\frac{\partial \phi (x)}{\partial x_a} = 0\) for almost every x and every \(a=1,\ldots ,d\), which means \(\phi (x)\) is constant.

3.3 Combinations of operators

Combining differential and integral operators provides further interesting divergences. Let \(\mathcal {X}={\mathbb {R}}^d\) with Lebesgue measure. We consider an operator L, which is applied to an appropriate function \(\phi \), with integral kernels \(h_1,\ldots ,h_d\). Define L by

$$\begin{aligned} L\phi (x):= \Bigl (\; \phi (x)^{1/2}\int h_a(x,x')\bigl (\phi ^{-1}\partial _a\phi )(x')dx' \; \Bigr )_{a=1,\ldots ,d}, \end{aligned}$$
(22)

and

$$\begin{aligned} \psi (z):= \Vert z \Vert ^2. \end{aligned}$$

As in (14), if \(\phi = p/q\) is plugged in, we have \(\phi ^{-1}\partial _a\phi = \partial _a \log (p/q) = \partial _a\log p - \partial _a\log q\). Instead of (18), we assume for the integral kernels that

$$\begin{aligned} \int h_a(x,x')\phi (x')d\mu (x') = 0\qquad (\text {a.s.}\; x) \end{aligned}$$
(23)

holds if and only if \(\phi \) is zero, which means that the integral operator with kernel \(h_a\) is injective. Under this assumption, we can guarantee that \(L(p/q)=0\) implies \(p=q\). The Gaussian kernel \(h_a(x,x')=\exp (-\frac{1}{2\sigma ^2}\Vert x-x'\Vert ^2)\) satisfies this assumption for the class of \(\phi \) bounded by a polynomial, i.e., \(\vert \phi (x)\vert \le \zeta (x)\) for all x with some polynomial \(\zeta (x)\).

In a similar derivation to (20), the \(\psi L\)-divergence in this case is given by

$$\begin{aligned}&D^M(p\Vert q) \nonumber \\&\quad = \sum _{a=1}^d \int \int k_p^a(x',x'')\bigl (\partial _a\log q(x')-\partial _a \log p(x')\bigr )\bigl (\partial _a\log q(x'')-\partial _a \log p(x'')\bigr )dx'dx'', \end{aligned}$$
(24)

where \(k_p^a\) corresponds to the kernel (21) with \(h_a\). Note that the Fisher divergence can be regarded as a special case of \(D^M\) with the choice \(h_a(x,x') = \delta (x-x')\), which leads to \(k_p^a(x',x'')=\delta (x'-x'')p(x')\).

Given a sample from p(x), empirical estimation of this divergence is possible without knowing the normalization constants of p or q. Let \((\tilde{X}_1, \ldots ,\tilde{X}_n)\) be an i.i.d. sample from p(x)dx. With the estimator

$$\begin{aligned} \hat{k}_p^a(x',x''):= \frac{1}{n} \sum _{i=1}^n h_a(\tilde{X}_i,x')h_a(\tilde{X}_i,x''), \end{aligned}$$

the divergence \(D^M(p\Vert q)\) can be estimated by

$$\begin{aligned} \hat{D}^M := \sum _{a=1}^d \int \int \hat{k}_p^a(x',x'')\bigl (\partial _a\log q(x')-\partial _a \log p(x')\bigr )\bigl (\partial _a\log q(x'')-\partial _a \log p(x'')\bigr )dx'dx''. \end{aligned}$$

Assume further that the densities p and q belong to an infinite-dimensional exponential family; \(p(x)=\exp (f(x)-A(f))q_0(x)\) and \(q(x)=\exp (g(x)-A(g))q_0(x)\). With the choice of

$$\begin{aligned} h_a(x,x')=\frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x-x'\Vert ^2}{2\sigma ^2}}, \end{aligned}$$

the estimator of the divergence is given by

$$\begin{aligned} \hat{D}^M= & {} \sum _{a=1}^d \frac{1}{n}\sum _{i=1}^n \int \int \frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x'-\tilde{X}_i\Vert ^2}{2\sigma ^2}} \frac{1}{(2\pi \sigma ^2)^{d/2}}e^{-\frac{\Vert x''-\tilde{X}_i\Vert ^2}{2\sigma ^2}} \\{} & {} \cdot \Bigl ( \frac{\partial g(x')}{\partial x'_a}-\frac{\partial f(x')}{\partial x'_a} \Bigr )\Bigl ( \frac{\partial g(x'')}{\partial x''_a}-\frac{\partial f(x'')}{\partial x''_a} \Bigr )dx'dx''. \end{aligned}$$

Based on the above formula, the following empirical estimation is possible.

1. \((\tilde{X}_1,\ldots ,\tilde{X}_n)\) is given (an i.i.d. sample from p(x)dx).

2. For each i, take independent samples \(X'_{(i,j)}, X''_{(i,j)}\) (\(j=1,\ldots ,N\)) from \(N(\tilde{X}_i,\sigma ^2 I_d)\).

3. Compute

    $$\begin{aligned} \hat{\hat{D}}^M:= \frac{1}{n}\frac{1}{N}\sum _{i=1}^n\sum _{j=1}^N \sum _{a=1}^d \left( \frac{\partial g(X'_{(i,j)})}{\partial x_a}-\frac{\partial f(X'_{(i,j)})}{\partial x_a} \right) \left( \frac{\partial g(X''_{(i,j)})}{\partial x_a}-\frac{\partial f(X''_{(i,j)})}{\partial x_a} \right) .\nonumber \\ \end{aligned}$$
    (25)

Suppose that the functional parameters f and g are available; for example, density estimation in Sect. 2 may give their estimators. In such a case, the above formula provides a tractable estimate of the divergence without estimating the normalization constants A(f), A(g) or the density functions p(x), q(x).
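The following sketch implements the Monte Carlo estimator (25), assuming that the gradients of f and g are available as callables (for instance, from the estimator of Sect. 2 or a parametric model). The function name, the toy gradients, and the default values of \(\sigma \) and N are illustrative assumptions.

```python
import numpy as np

def dm_estimate(X_tilde, grad_f, grad_g, sigma=0.5, N=50, rng=None):
    """Monte Carlo estimator (25) of D^M(p||q): for each sample point, draw two
    independent Gaussian perturbations and average the product of the
    coordinate-wise score differences.  grad_f, grad_g map (m, d) arrays to (m, d)."""
    rng = np.random.default_rng() if rng is None else rng
    X_tilde = np.atleast_2d(X_tilde)
    n, d = X_tilde.shape
    total = 0.0
    for i in range(n):
        Xp = X_tilde[i] + sigma * rng.standard_normal((N, d))    # X'_(i,j)
        Xpp = X_tilde[i] + sigma * rng.standard_normal((N, d))   # X''_(i,j)
        diff_p = grad_g(Xp) - grad_f(Xp)
        diff_pp = grad_g(Xpp) - grad_f(Xpp)
        total += np.sum(diff_p * diff_pp) / N                    # sum over j and a
    return total / n

# toy usage with hypothetical natural parameters whose gradients are affine
grad_f = lambda x: -0.5 * x
grad_g = lambda x: -0.8 * (x - 1.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(dm_estimate(X, grad_f, grad_g, sigma=0.5, N=100, rng=rng))
```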

3.4 Geometry of divergences

A divergence defined for an exponential family provides a Riemannian structure on the family as a manifold [27]. Below, the Riemannian metrics induced by some of the proposed divergences are discussed for the infinite-dimensional exponential family.

As discussed in [27], given a divergence \(D(p\Vert q)\) on a finite-dimensional exponential family \(\{p_\theta \mid \theta =(\theta _1,\ldots ,\theta _d)\in \Theta \}\), the Riemannian metric \(\sum _{i,j=1}^d g_{ij}(\theta )d\theta ^i d\theta ^j\) is defined by

$$\begin{aligned} g_{ij}(\theta ) = \frac{\partial ^2}{\partial \theta _i \partial \theta _j} D( p_\theta \Vert p_{\tilde{\theta }} ) \vert _{\tilde{\theta }=\theta }. \end{aligned}$$
(26)

By extending this construction to the \(\psi L\)-divergences on the infinite-dimensional exponential family (1), under the assumption that the function \(\psi (z)\) is twice differentiable at \(z=0\), the Riemannian metric tensor at \(p_s=\exp (s(x)-A(s))q_0(x)\) is given by

$$\begin{aligned} g(p_s)[U,V] = \sum _{a,b=1}^K E_{p_s}\left[ \frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\Bigr \vert _{z=0}\, \frac{\partial L^a(\phi )}{\partial \phi }\Bigr \vert _{\phi =1} (U)\, \frac{\partial L^b(\phi )}{\partial \phi }\Bigr \vert _{\phi =1} (V) \right] , \end{aligned}$$
(27)

where \(\frac{\partial L^a(\phi )}{\partial \phi }\) is the functional derivative of the a-th component of the operator L, which is a (generally nonlinear) operator on the tangent space of \(\mathcal {F}\) at \(\phi \). It is known that the tangent space is canonically identified with the vectors \(W\in \mathcal {F}\) with \(E_{p_s}[W]=0\), and the variables U and V in the tensor (27) are two such tangent vectors at \(p_s\). The symmetric tensor \(g(p_s)\) is strictly positive if the Hessian \(\bigl (\frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\Bigr \vert _{z=0}\bigr )\) is strictly positive definite and the operator \(\frac{\partial L^a(\phi )}{\partial \phi }\vert _{\phi =1}\) is injective. Under these conditions, \(g(p_s)\) defines a Riemannian metric tensor on \(\mathcal {P}_{k,q_0}\). Note that every f-divergence induces the same Riemannian metric, namely the Fisher information metric up to scale, irrespective of f. In contrast, the Riemannian metric (27) depends on the operator L.

In the sequel, some examples of the Riemannian metrics will be shown. First, for the divergence \(D_{\alpha ,\beta }\) (16) defined by the differential operator, a non-degenerate Riemannian metric is obtained only in the case \(\alpha =1\) and \(\beta =2\), which is the Fisher divergence. We can see that \((\partial L^a(\phi )/\partial \phi )(U) = -\frac{1}{2}\phi ^{-3/2}(\partial _a\phi )\, U + \phi ^{-1/2} \partial _a U\), so that \((\partial L^a(\phi )/\partial \phi )\vert _{\phi =1}(U)= \partial _a U\). The Riemannian metric \(g^D\) induced by the Fisher divergence is thus given by

$$\begin{aligned} g^D(p_s)[U,V] = \sum _{a=1}^d E_{p_s}\bigl [ \partial _a U(X) \partial _a V(X)\bigr ]. \end{aligned}$$
(28)

Next, for the divergences (20) and (24), the Riemannian metrics are respectively given by

$$\begin{aligned} g^I(p_s)[U,V] = \int \int k_{p_s}(x',x'') U(x') V(x'')d\mu (x')d\mu (x''), \end{aligned}$$
(29)

and

$$\begin{aligned} g^M(p_s)[U,V] = \sum _{a=1}^d \int \int k_{p_s}^a(x',x'') \partial _a U(x') \partial _a V(x'')dx'dx''. \end{aligned}$$
(30)

The derivation is straightforward and omitted here. The tensors \(g^I\) and \(g^M\) are non-degenerate if the kernel \(k_{p_s}\) (or \(k_{p_s}^a\)) is strictly positive definite, which can be guaranteed with a Gaussian kernel for h (or \(h_a\)) under mild conditions on \(p_s\).

4 Conclusion

This paper discussed infinite-dimensional exponential families and the Fisher divergence, with a focus on estimation. The first half of the paper considered density estimation using an infinite-dimensional exponential family and presented a brief survey of previous studies. These studies have shown that the score matching method avoids the computation of the normalization constant and of the probability density function, and leads to a simple, though possibly high-dimensional, linear equation for the estimator.

The second topic was the Fisher divergence, which forms the foundation of the score matching method. We explored the Fisher divergence and introduced an extended class of divergences that employ operators on a function space. This extends the scope of previous studies based on the f-divergence, which uses only point-wise values of the density ratio. We also described the Riemannian structures induced by these divergences.

The geometry of Fisher divergence and relevant divergences remains ripe for further investigation and represents an important area for future research.