Abstract
Infinite dimensional exponential families have been theoretically studied, but their practical applications are still limited because empirical estimation is not straightforward. This paper first gives a brief survey of studies on the estimation method for infinite-dimensional exponential families. The method uses score matching, which is based on the Fisher divergence. The second topic is to investigate the Fisher divergence as a member of an extended family of divergences, which employ operators in defining divergences.
1 Introduction
Since the seminal work of [1], infinite-dimensional exponential manifolds have been studied from various viewpoints. The dual geometric structure [2, 3] of finite-dimensional exponential families has been extended to infinite-dimensional cases using the topology of Orlicz spaces [4,5,6]. Formulations with other function spaces have been explored [7, 8], and the geometry of more general measure spaces has been studied [9]. While the theoretical studies on infinite-dimensional exponential manifolds have advanced, the practical use of infinite-dimensional or nonparametric exponential families is limited.
One of the main difficulties in using an infinite-dimensional exponential family for density estimation is the computation of the normalization factor. The infinite-dimensional exponential family is typically defined by the form \(\exp (f(x)-A(f))q_0(x)\), where f belongs to an appropriate function space, \(q_0\) is the probability density function (p.d.f.) of a base probability, and A(f) is the log of the normalization constant: \(A(f) = \log \int \exp (f(x))q_0(x)\,d\mu (x)\). The integral is, however, not computable for a general function f.
For finite-dimensional exponential families, the maximum likelihood estimator (MLE) is the central device to estimate a parameter that fits a given sample. The MLE is based on the Kullback–Leibler (KL) divergence to measure the discrepancy between the true distribution and a probability defined by the model. However, to obtain the MLE, we need to compute the mapping from the mean parameter (the mean of sufficient statistics) to the natural parameter. This is not straightforward even for general finite-dimensional exponential families, and is not computable for infinite-dimensional ones.
Recently, Sriperumbudur et al. [10] applied the score matching method [11] to estimate the functional parameter of an infinite-dimensional exponential family defined by a reproducing kernel Hilbert space. The score matching method is based on the Fisher divergence, which does not require the normalization constants to measure the discrepancy and is thus applicable to any infinite- or finite-dimensional exponential families.
The purpose of this paper is twofold. One is to give a brief survey of the estimation methods for infinite-dimensional exponential families, following [10]. The other is to investigate the Fisher divergence as a member of an extended family of divergences. The extended family introduces new ideas to define divergence measures of probabilities and provides a description of Riemannian metrics induced by the divergences.
2 Estimation with infinite-dimensional exponential family
This section explains the estimation with the model of the infinite-dimensional exponential family with positive definite kernels [7, 10, 12], focusing mainly on [10].
Let \((\Omega ,\mathcal {B},\mu )\) be a measure space, and k be a measurable positive definite kernel on \(\Omega \). The reproducing kernel Hilbert space (RKHS) \((\mathcal {H}_k,\langle \cdot ,\cdot \rangle _{\mathcal {H}_k})\) is uniquely defined as a function space over \(\Omega \) such that \(\mathcal {H}_k\) has the reproducing kernel \(k(\cdot ,x)\). See, for example, [13, 14] for the basics of RKHS.
Let \(q_0(x)\) be a probability density function (p.d.f.) with respect to the measure \(\mu \). The exponential family with positive definite kernel k (at \(q_0\)) is defined by
$$\begin{aligned} \mathcal {P}_{k,q_0} := \bigl \{ p_f(x) = e^{f(x)-A(f)}q_0(x) \;\big \vert \; f\in \mathcal {F}_{k,q_0} \bigr \}, \end{aligned}$$(1)
where
$$\begin{aligned} \mathcal {F}_{k,q_0} := \Bigl \{ f\in \mathcal {H}_k \;\Big \vert \; \int e^{f(x)}q_0(x)\,d\mu (x) < \infty \Bigr \} \end{aligned}$$
is the parameter space, and \(A(f):= \log \int e^{f(x)}q_0(x)d\mu (x)\) is the log of the normalization constant, which is also known as the cumulant generating function. The function \(f\in \mathcal {F}_{k,q_0}\) is the natural parameter in the terminology of the exponential family; from the reproducing property, the function value is expressed by the inner product as \(f(x) = \langle f, k(\cdot ,x)\rangle _{\mathcal {H}_k}\) so that f and \(k(\cdot ,x)\) are regarded as the natural parameter and sufficient statistics, respectively. The parameter space \(\mathcal {F}_{k,q_0}\) may not be a linear space in general, as a scalar multiple or addition may not preserve the finiteness of the integral. Also, the parameter space depends on the kernel. If \(\mathcal {H}_{k_1}\subset \mathcal {H}_{k_2}\) holds for two kernels \(k_1\) and \(k_2\), then the corresponding parameter spaces also satisfy \(\mathcal {F}_{k_1,q_0}\subset \mathcal {F}_{k_2,q_0}\).
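As a concrete illustration of the reproducing property used above, the following sketch (with an assumed Gaussian kernel on \({\mathbb {R}}\)) checks numerically that \(f(x) = \langle f, k(\cdot ,x)\rangle _{\mathcal {H}_k}\) for a finite kernel expansion \(f = \sum _i \alpha _i k(\cdot ,z_i)\); the points \(z_i\) and coefficients \(\alpha _i\) are arbitrary choices for the illustration.

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2)) on R
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

# f = sum_i alpha_i k(., z_i) is an element of the RKHS H_k.
z = np.array([-1.0, 0.5, 2.0])
alpha = np.array([0.3, -0.7, 1.2])

def f(x):
    return float(np.sum(alpha * k(x, z)))

# Reproducing property: <f, k(., x)> = sum_i alpha_i <k(., z_i), k(., x)>
#                                    = sum_i alpha_i k(z_i, x) = f(x).
x0 = 0.8
inner_product = float(np.sum(alpha * k(z, x0)))
assert abs(f(x0) - inner_product) < 1e-12
```

The same identity is what later turns derivatives \(\partial _a f(X_i)\) into inner products with \(\partial _a k(\cdot ,X_i)\).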
Let \((X_1,\ldots ,X_n)\) be an i.i.d. sample from the true probability distribution, and suppose one wishes to estimate the density function of the true probability. To estimate the functional parameter f in (1), the maximum likelihood estimation is given by
$$\begin{aligned} \max _{f\in \mathcal {F}_{k,q_0}}\; \frac{1}{n}\sum _{i=1}^n \bigl \{ f(X_i) - A(f) \bigr \}. \end{aligned}$$
From \(f(X_i) = \langle f,k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), it is reduced to the maximum likelihood equation:
$$\begin{aligned} E_{p_{\hat{f}}}\bigl [ k(\cdot ,X) \bigr ] = \frac{1}{n}\sum _{i=1}^n k(\cdot ,X_i), \end{aligned}$$
which equates the mean parameter (the mean of the sufficient statistics under \(p_{\hat{f}}\)) with its empirical mean.
It is difficult, however, to solve this equation, since A(f) is not computable in most cases. Thus, the maximum likelihood approach has not been successfully used for infinite-dimensional exponential families.
In [10], the score matching method [11] is employed to obtain an estimator of f while avoiding the use of A(f). Let \(\Omega = {\mathbb {R}}^d\), and \(\mu =dx\) be the Lebesgue measure. The score matching method uses the Fisher divergenceFootnote 1 for the objective of the estimation. For differentiable densities p and q on \({\mathbb {R}}^d\), the Fisher divergence is defined by
$$\begin{aligned} J(p\Vert q) := \frac{1}{2}\int \bigl \Vert \nabla \log p(x) - \nabla \log q(x) \bigr \Vert ^2\, p(x)\,dx, \end{aligned}$$(3)
where \(\nabla \log p(x)= \left( \partial _1\log p(x),\ldots ,\partial _d\log p(x)\right) \) with \(\partial _a\log p(x):=\frac{\partial }{\partial x_a}\log p(x)\). It is important to note that when the p.d.f. to estimate is \(q(x) = p_f(x)\in \mathcal {P}_{k,q_0}\) in (1), the cumulant generating function A(f) need not be computed, since \({\partial _a}\log q(x)={\partial _a}f(x)+{\partial _a}\log q_0(x)\); the method thus applies easily to the infinite-dimensional exponential family.
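For a concrete feeling of the Fisher divergence, consider two normal densities \(N(\mu _1,\sigma ^2)\) and \(N(\mu _2,\sigma ^2)\): the score difference is the constant \((\mu _2-\mu _1)/\sigma ^2\), so \(J(p\Vert q) = (\mu _1-\mu _2)^2/(2\sigma ^4)\) with the conventional factor 1/2 in the definition. A small numeric check (quadrature on a grid; the parameter values are arbitrary):

```python
import numpy as np

def fisher_divergence_1d(dlogp, dlogq, p, xs):
    # J(p||q) = (1/2) * integral of (d/dx log p - d/dx log q)^2 p(x) dx,
    # approximated by a Riemann sum on the grid xs.
    dx = xs[1] - xs[0]
    return 0.5 * float(np.sum((dlogp(xs) - dlogq(xs)) ** 2 * p(xs)) * dx)

mu1, mu2, s = 0.0, 1.0, 1.0
p = lambda x: np.exp(-(x - mu1) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
dlogp = lambda x: -(x - mu1) / s**2
dlogq = lambda x: -(x - mu2) / s**2

xs = np.linspace(-10.0, 10.0, 20001)
J = fisher_divergence_1d(dlogp, dlogq, p, xs)
# Closed form for equal-variance normals: (mu1 - mu2)^2 / (2 s^4) = 0.5
assert abs(J - 0.5) < 1e-4
```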
It is known that the Fisher divergence has various relations with other divergences. For example, with an appropriate choice of class for q(x), the total variation, Kolmogorov distance, and Wasserstein distance can be upper bounded by \(c_q J(p\Vert q)^{1/2}\), where \(c_q\) is a respective constant depending on q and the distance measure to compare (see [16], for example). It is also known that the Fisher divergence J and the KL divergence \(D_{KL}\) have an interesting relation in the line of de Bruijn's identity [17]. More precisely, for two random vectors \(X^{(i)}\) (\(i=1,2\)) with p.d.f. \(p^{(i)}\), define their smoothed versions by \(X^{(i)}_t:= X^{(i)} + \sqrt{t}W^{(i)}\) (\(t\ge 0\)) with p.d.f. \(p^{(i)}_t\), where \(W^{(i)}\) are independent random vectors with the standard normal distribution \(N(0,I_d)\). Then, we have [18]
$$\begin{aligned} \frac{d}{dt}\, D_{KL}\bigl ( p^{(1)}_t \,\big \Vert \, p^{(2)}_t \bigr ) = - J\bigl ( p^{(1)}_t \,\big \Vert \, p^{(2)}_t \bigr ). \end{aligned}$$(4)
We now discuss the empirical estimation of (3). Given that p(x) is the p.d.f. of the true probability, it is also desirable to avoid estimating the density p(x) in \(\nabla \log p(x)\), which is difficult in high-dimensional cases in general. This is cleverly achieved in [11] by integration by parts. More precisely, assuming that \(\lim _{\vert x_a\vert \rightarrow \infty } \frac{\partial \log q(x)}{\partial x_a}p(x) = 0\), we observe
$$\begin{aligned} \int \partial _a \log q(x)\, \partial _a \log p(x)\, p(x)\, dx = \int \partial _a \log q(x)\, \partial _a p(x)\, dx = - \int \partial _a^2 \log q(x)\, p(x)\, dx, \end{aligned}$$(5)
and thus we can rewrite (3) by
$$\begin{aligned} J(p\Vert q) = \int \sum _{a=1}^d \Bigl [ \frac{1}{2} \bigl ( \partial _a \log q(x) \bigr )^2 + \partial _a^2 \log q(x) \Bigr ]\, p(x)\, dx + \frac{1}{2} \int \bigl \Vert \nabla \log p(x) \bigr \Vert ^2\, p(x)\, dx, \end{aligned}$$(6)
where \(\partial _a^2\) denotes \(\frac{\partial ^2}{\partial x_a^2}\). The last term does not depend on q(x), and the first two terms are integrals with respect to p(x). Based on these facts and \(\partial _a \log p_f(x) = \partial _a f(x) +\partial _a \log q_0(x)\), for an i.i.d. sample \((X_1,\ldots ,X_n)\) with p.d.f. p(x), the objective function of the score matching is given by
$$\begin{aligned} \tilde{J}_n(f) = \frac{1}{n} \sum _{i=1}^n \sum _{a=1}^d \Bigl [ \frac{1}{2} \bigl ( \partial _a f(X_i) \bigr )^2 + \partial _a f(X_i)\, \partial _a \log q_0(X_i) + \partial _a^2 f(X_i) \Bigr ], \end{aligned}$$(7)
where only the terms dependent on the natural parameter f are collected. As in many kernel methods, the regularization term by the squared RKHS norm with coefficient \(\lambda \) is added to prevent the overfitting caused by the optimization over the infinite-dimensional function space. This regularization gives a unique minimizer to the objective function due to the representer theorem [19].
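The integration-by-parts identity (5) underlying this trick can be checked numerically in one dimension; the sketch below uses \(p = N(0,1)\) and \(q = N(1,2^2)\) (arbitrary choices), for which both sides equal 1/4:

```python
import numpy as np

# 1-D check of the integration-by-parts identity behind score matching:
#   integral (d/dx log q)(d/dx log p) p dx = - integral (d^2/dx^2 log q) p dx,
# valid when (d/dx log q)(x) p(x) -> 0 as |x| -> infinity.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # p = N(0, 1)
dlogp = lambda x: -x
dlogq = lambda x: -(x - 1.0) / 4.0                     # q = N(1, 2^2)
d2logq = lambda x: np.full_like(x, -0.25)

xs = np.linspace(-12.0, 12.0, 40001)
dx = xs[1] - xs[0]
lhs = float(np.sum(dlogq(xs) * dlogp(xs) * p(xs)) * dx)
rhs = -float(np.sum(d2logq(xs) * p(xs)) * dx)
assert abs(lhs - rhs) < 1e-6   # both sides equal 1/4 here
```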
Taking an RKHS as a functional parameter space further provides a simple solution to the minimization of (7). In fact, \(\tilde{J}_n\) is reduced to a quadratic form of f using the facts \(\partial _a f(X_i) = \langle f, \partial _a k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\) and \(\partial _a^2 f(X_i) = \langle f, \partial _a^2 k(\cdot ,X_i)\rangle _{\mathcal {H}_k}\), assuming that the kernel is two times differentiable [20, Section 4.3]. The regularized objective function is then given by the quadratic form
$$\begin{aligned} \tilde{J}_n(f) + \lambda \Vert f\Vert _{\mathcal {H}_k}^2 = \frac{1}{2} \bigl \langle f,\, \hat{C} f \bigr \rangle _{\mathcal {H}_k} + \bigl \langle f,\, \hat{\delta } \bigr \rangle _{\mathcal {H}_k} + \lambda \Vert f \Vert _{\mathcal {H}_k}^2, \end{aligned}$$
where
$$\begin{aligned} \hat{C} := \frac{1}{n}\sum _{i=1}^n \sum _{a=1}^d \partial _a k(\cdot ,X_i) \otimes \partial _a k(\cdot ,X_i), \qquad \hat{\delta } := \frac{1}{n}\sum _{i=1}^n \sum _{a=1}^d \bigl [ \partial _a \log q_0(X_i)\, \partial _a k(\cdot ,X_i) + \partial _a^2 k(\cdot ,X_i) \bigr ]. \end{aligned}$$
By the representer theorem [19], the solution \(f_{\lambda ,n}\) is a linear combination of \(\partial _a k(\cdot ,X_i)\) and \(\partial _a^2 k(\cdot ,X_i)\), and given by
$$\begin{aligned} f_{\lambda ,n} = \sum _{i=1}^n \sum _{a=1}^d \beta _{(i-1)d+a}\, \partial _a k(\cdot ,X_i) \;-\; \frac{1}{2\lambda n} \sum _{i=1}^n \sum _{a=1}^d \bigl [ \partial _a \log q_0(X_i)\, \partial _a k(\cdot ,X_i) + \partial _a^2 k(\cdot ,X_i) \bigr ], \end{aligned}$$
where \(\varvec{\beta }=(\beta _{(i-1)d+a})_{i,a}\) is a solution to the linear equation
$$\begin{aligned} \bigl ( G + 2 n \lambda \, I_{nd} \bigr )\, \varvec{\beta } = \varvec{h}, \end{aligned}$$(10)
with
$$\begin{aligned} G_{(i-1)d+a,\,(j-1)d+b} = \partial _a \partial _{b+d}\, k(X_i,X_j) \end{aligned}$$
and
$$\begin{aligned} h_{(i-1)d+a} = \frac{1}{2\lambda n} \sum _{j=1}^n \sum _{b=1}^d \bigl [ \partial _b \log q_0(X_j)\, \partial _a \partial _{b+d}\, k(X_i,X_j) + \partial _a \partial _{b+d}^2\, k(X_i,X_j) \bigr ]. \end{aligned}$$
Here, the partial derivatives \(\partial _a\) and \(\partial _{b+d}\) are applied to the first and second argument of \(k(X_i,X_j)\), respectively. It is noteworthy that the score matching estimator is obtained simply by solving a linear equation. Some numerical comparisons with the standard kernel density estimation have been reported in [10], which show favorable results in estimating unnormalized densities. The matrix involved in (10) has the size \(nd \times nd\); the linear equation may be difficult to solve for large n and d. A gradient-based optimization may be a solution to this computational difficulty.
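As an illustration of the resulting computation, the following is a simplified one-dimensional sketch, not the exact linear system of [10]: the parameter f is restricted to the span of the first-derivative features \(\partial k(\cdot ,X_i)\) of a Gaussian kernel, the empirical score matching objective plus the RKHS-norm penalty is a convex quadratic in the coefficients, and its stationarity condition is solved as a linear system. All concrete choices (kernel width, \(\lambda \), base density \(q_0\)) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n, sig, lam = 300, 1.0, 1e-2
X = rng.standard_normal(n)        # sample from the true density p = N(0, 1)
s0 = 2.0                          # base density q0 = N(0, s0^2)
r = -X / s0**2                    # d/dx log q0 evaluated at the sample

Dm = X[:, None] - X[None, :]
K = np.exp(-Dm**2 / (2 * sig**2))
# Features phi_j(x) = d k(x, y)/dy at y = X_j; their derivatives on the sample:
P = (1 / sig**2 - Dm**2 / sig**4) * K           # P[i, j] = phi_j'(X_i)
Q = Dm * (Dm**2 / sig**6 - 3 / sig**4) * K      # Q[i, j] = phi_j''(X_i)
G = P.copy()   # Gram matrix <phi_j, phi_l>_H = d_x d_y k(X_j, X_l), equal to P here

def objective(beta):
    # empirical score matching objective (f-dependent terms) + RKHS penalty
    fp, fpp = P @ beta, Q @ beta
    return float(np.mean(0.5 * fp**2 + fp * r + fpp) + lam * beta @ G @ beta)

# Convex quadratic in beta; solve the stationarity condition as a linear system:
Amat = P.T @ P / n + 2 * lam * G + 1e-10 * np.eye(n)
bvec = -(P.T @ r + Q.T @ np.ones(n)) / n
beta = np.linalg.solve(Amat, bvec)

assert objective(beta) <= objective(np.zeros(n)) + 1e-9
# The fitted score correction should be negative at x = 1 (true value -0.75):
phi_p = (1 / sig**2 - (1.0 - X) ** 2 / sig**4) * np.exp(-(1.0 - X) ** 2 / (2 * sig**2))
assert float(phi_p @ beta) < 0.0
```

With \(p=N(0,1)\) and \(q_0=N(0,4)\), the optimal derivative is \(\partial \log p - \partial \log q_0 = -0.75x\), which the fitted expansion should approximate near the data.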
More recently, a method for estimating the functional parameter f(x) by neural networks has also been considered. In that case, the derivatives in the score matching objective can cause computational difficulty. To mitigate this issue, Song et al. [21] have proposed sliced score matching, which reduces the objective function to one-dimensional problems via random projections.
3 Fisher divergence and its extensions
In this section, a family of divergences is introduced that includes the Fisher divergence as a special case. The new family uses operators on a function space to define divergences.
Many traditional divergences and distances of probabilities are formulated as f-divergences [22,23,24]. Let P, Q be two probability measures on \({\mathbb {R}}^d\). For a convex function \(f:[0,\infty )\rightarrow {\mathbb {R}}\) such that \(f(1)=0\), the f-divergence from P to Q is defined by
$$\begin{aligned} D_f(P\Vert Q) := \int f\Bigl ( \frac{dP}{dQ}(x) \Bigr )\, dQ(x), \end{aligned}$$
where the Radon–Nikodym derivative \(\frac{dP}{dQ}\) is assumed to exist. It is well known that popular divergences such as the KL divergence, total variation, \(\chi ^2\)-divergence, and squared Hellinger distance are instances of f-divergence with a choice of f (see [25], for example).
The Fisher divergence does not belong to the f-divergences, however. While the f-divergence is defined by the point-wise value of the Radon–Nikodym derivative dP/dQ(x), the Fisher divergence requires the derivative of dP/dQ(x). In the sequel, a class of divergences based on operators of the function dP/dQ(x) or p(x)/q(x) will be considered.
Let \((\mathcal {X},\mathcal {B},\mu )\) be a measure space, \(\mathcal {F}\) be a function space of measurable functions on \(\mathcal {X}\), and \(\mathcal {G}_K\) be a space of K-dimensional vector-valued measurable functions on \(\mathcal {X}\). Assume that there is an operator \(L:\mathcal {F}\rightarrow \mathcal {G}_K\) and a function \(\psi :{\mathbb {R}}^K\rightarrow [0,\infty ]\) such that
(i) \(\psi (z)=0\) if and only if \(z=0\);
(ii) \(L\phi = 0\) if and only if \(\phi \in \mathcal {F}\) is a constant function.
Then, for p.d.f.s p and q on \((\mathcal {X},\mathcal {B},\mu )\), the divergence \(D_{\psi L}\) is defined by
$$\begin{aligned} D_{\psi L}(p\Vert q) := \int \psi \Bigl ( L\Bigl ( \frac{p}{q} \Bigr )(x) \Bigr )\, q(x)\, d\mu (x). \end{aligned}$$(12)
It is obvious from the assumptions that \(D_{\psi L}(p\Vert q) \ge 0\) and that the equality holds if and only if p/q is constant, which means \(p=q\). This class of divergences includes the Fisher divergence and other interesting ones, as seen in the following subsections. In the literature, a divergence is often defined to satisfy the positive definiteness condition that \(D(p\Vert p+dp)\) should be a positive definite quadratic form in dp. We do not necessarily impose this condition in defining divergences, but discuss it in Sect. 3.4.
An extension of the Fisher divergence using operators has already been introduced in [18], in which the authors focus on the specific form \(\frac{L p}{p}\), with L a linear operator, in defining divergences. The class proposed in the current paper is wider, as it allows general nonlinear operators and functional forms.
3.1 Differential operators
Suppose that \(\mathcal {X}\) is the d-dimensional Euclidean space with Lebesgue measure. For \(\alpha \in {\mathbb {R}}\) and \(\beta > 0\), define a differential operator \(L_{\alpha ,\beta }\) with \(K=d\) by
$$\begin{aligned} L_{\alpha ,\beta }\,\phi := \phi ^{-\alpha /\beta }\, \nabla \phi . \end{aligned}$$
With the notation \(\partial _a = \frac{\partial }{\partial x_a}\), it follows that
$$\begin{aligned} \bigl ( L_{\alpha ,\beta }\,\phi \bigr )_a = \phi ^{(\beta -\alpha )/\beta }\, \frac{\partial _a \phi }{\phi }. \end{aligned}$$(14)
Define \(\psi _{\beta }:{\mathbb {R}}^d \rightarrow [0,\infty )\) by
$$\begin{aligned} \psi _{\beta }(z) := \sum _{a=1}^d \vert z_a \vert ^{\beta }. \end{aligned}$$
Then, the \(\psi L\)-divergence with \(\psi _{\beta }\) and \(L_{\alpha ,\beta }\) is given by
$$\begin{aligned} D_{\alpha ,\beta }(p\Vert q) = \int \sum _{a=1}^d \Bigl ( \frac{p(x)}{q(x)} \Bigr )^{\beta -\alpha } \Bigl \vert \partial _a \log \frac{p(x)}{q(x)} \Bigr \vert ^{\beta }\, q(x)\, dx, \end{aligned}$$(16)
which provides a family of divergences. For \(\alpha = \beta -1\), the divergence is given by
$$\begin{aligned} D_{\beta -1,\beta }(p\Vert q) = \int \sum _{a=1}^d \Bigl \vert \partial _a \log \frac{p(x)}{q(x)} \Bigr \vert ^{\beta }\, p(x)\, dx, \end{aligned}$$(17)
which is a generalization of the Fisher divergence (\(\beta =2\)). Note, however, that except for the Fisher divergence, the empirical estimation of (17) requires estimating \(\partial _a \log p(x)\), since the integration-by-parts technique (5) is not applicable. This estimation is not straightforward in general for high-dimensional spaces.
3.2 Integral operators
Let \(h:\mathcal {X}\times \mathcal {X}\rightarrow {\mathbb {R}}\) be an integrable function with respect to the product measure \(\mu \times \mu \) such that the integral operator \(L_h:\mathcal {F}\rightarrow \mathcal {G}\) is well-defined by
$$\begin{aligned} (L_h\phi )(x) := \int h(x,x')\, \phi (x')\, d\mu (x'). \end{aligned}$$
This is valid for various choices of \(\mathcal {F}\), \(\mathcal {G}\), and a function class of h. For example, if \(h(x,x')\) is square integrable, which means \(\int \vert h(x,x')\vert ^2 d\mu (x)d\mu (x')<\infty \), then with \(\mathcal {F}=\mathcal {G}=L^2(\mathcal {X};\mu )\) (\(K=1\)), the integral operator \(L_h:L^2(\mathcal {X};\mu )\rightarrow L^2(\mathcal {X};\mu )\) is well-defined as usual.
Assume further that the relation
$$\begin{aligned} \int h(x,x')\, \phi (x')\, d\mu (x') = 0 \quad \text {for } \mu \text {-almost every } x \end{aligned}$$(18)
holds if and only if \(\phi \) is a constant function. Then, define an operator \(L_{h}\) by
$$\begin{aligned} L_h\phi := \int h(\cdot ,x')\, \phi (x')\, d\mu (x'). \end{aligned}$$
With \(\psi (z) = z^2\), the divergence \(D_h\) as in (12) is given by
$$\begin{aligned} D_h(p\Vert q) = \int \!\!\!\int k_p(x',x'')\, \frac{p(x')}{q(x')}\, \frac{p(x'')}{q(x'')}\, d\mu (x')\, d\mu (x''), \end{aligned}$$(20)
where
$$\begin{aligned} k_p(x',x'') := \int h(x,x')\, h(x,x'')\, p(x)\, d\mu (x). \end{aligned}$$(21)
The positivity condition comes from the assumption (18); if \(D_h(p\Vert q)=0\), it means p/q is constant, which requires \(p=q\). If there are ways of estimating the kernel \(k_p\) and the ratio p/q, the measure \(D_h\) gives an estimable divergence.
Some methods can be applied to construct a kernel h that satisfies the condition (18). Let \(\mathcal {F}=\mathcal {G}=L^2([0,1]^d)\) with Lebesgue measure, and \(\tilde{h}(x,x')\) be a continuous positive definite kernel on \([0,1]^d\). Then, Mercer's theorem [26] tells us that \(\tilde{h}\) admits a decomposition
$$\begin{aligned} \tilde{h}(x,x') = \sum _{j=0}^{\infty } \lambda _j\, \xi _j(x)\, \xi _j(x'), \end{aligned}$$
which converges absolutely and uniformly on \([0,1]^d\), where \((\xi _j)\) is the orthonormal basis of \(L^2([0,1]^d)\) and \(\lambda _j\ge 0\). Assume that \(\xi _0 = 1\). In this case, by eliminating \(\lambda _0 \xi _0(x)\xi _0(x')\), the kernel
$$\begin{aligned} h(x,x') := \tilde{h}(x,x') - \lambda _0 = \sum _{j\ge 1} \lambda _j\, \xi _j(x)\, \xi _j(x') \end{aligned}$$
satisfies the condition.
Another example is given by an odd function. Let \(\mathcal {F}=C^1({\mathbb {R}}^d)\) be the Banach space of bounded differentiable functions with bounded continuous derivatives, and \(\mathcal {G}=L^2({\mathbb {R}}^d,{\mathbb {R}}^d;dx)\) be the \({\mathbb {R}}^d\)-valued square integrable functions on \({\mathbb {R}}^d\). Assume that a square integrable function w(z) on \({\mathbb {R}}^d\) satisfies \(w(z_1,\ldots ,z_{a-1},-z_a,z_{a+1},\ldots ,z_d)=-w(z_1,\ldots ,z_d)\) for every \(a=1,\ldots ,d\). Then, the kernel
$$\begin{aligned} h(x,x') := w(x - x') \end{aligned}$$
satisfies \(\int h(x,x')dx'= 0\). The "only if" part can be satisfied with many functions; for example, \(w(x)=(x_1,\ldots ,x_d) e^{- \Vert x\Vert ^2/2}\) satisfies the condition (18). In fact, from the integration by parts
$$\begin{aligned} \int (x_a - x'_a)\, e^{-\Vert x-x'\Vert ^2/2}\, \phi (x')\, dx' = \int \frac{\partial }{\partial x'_a}\Bigl ( e^{-\Vert x-x'\Vert ^2/2} \Bigr )\, \phi (x')\, dx' = - \int e^{-\Vert x-x'\Vert ^2/2}\, \frac{\partial \phi (x')}{\partial x'_a}\, dx', \end{aligned}$$
we see that \(\int h(x,x')\phi (x')dx' = 0\) holds if and only if \(\frac{\partial \phi (x)}{\partial x_a} = 0\) for almost every x and every \(a=1,\ldots ,d\), which means \(\phi (x)\) is constant.
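The construction above can be checked numerically in one dimension with \(w(z) = z e^{-z^2/2}\): the integral operator annihilates constants but not, e.g., \(\phi (x)=x\) (grid quadrature; the evaluation point is arbitrary):

```python
import numpy as np

# h(x, x') = w(x - x') with the odd function w(z) = z * exp(-z^2 / 2).
# Constants are annihilated: integral of h(x, x') over x' is 0;
# a non-constant phi (here phi(x) = x) is not annihilated.
w = lambda z: z * np.exp(-z**2 / 2)

xs = np.linspace(-15.0, 15.0, 30001)
dx = xs[1] - xs[0]
x0 = 0.7                            # arbitrary evaluation point
h_row = w(x0 - xs)                  # h(x0, x') on the grid

const_image = float(np.sum(h_row * 1.0) * dx)   # image of phi = 1
lin_image = float(np.sum(h_row * xs) * dx)      # image of phi(x') = x'

assert abs(const_image) < 1e-8      # constants are mapped to 0
assert abs(lin_image) > 1.0         # phi(x) = x is not (value is about -sqrt(2*pi))
```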
3.3 Combinations of operators
Combining differential and integral operators provides further interesting divergences. Let \(\mathcal {X}={\mathbb {R}}^d\) with Lebesgue measure. We consider an operator L, which is applied to an appropriate function \(\phi \), with integral kernels \(h^1,\ldots ,h^d\). Define L by
$$\begin{aligned} L(\phi ) = \bigl ( L^1(\phi ), \ldots , L^d(\phi ) \bigr ) \end{aligned}$$
and
$$\begin{aligned} L^a(\phi )(x) := \int h^a(x,x')\, \frac{\partial _a \phi (x')}{\phi (x')}\, dx'. \end{aligned}$$
From (14), if \(\phi = p/q\) is plugged in, we have \(\phi ^{-1}\partial _a\phi = \partial _a \log (p/q)\). Instead of (18), we assume for the integral kernels that
$$\begin{aligned} \int h^a(x,x')\, \phi (x')\, dx' = 0 \quad \text {for almost every } x \end{aligned}$$
holds if and only if \(\phi \) is zero, which means that the integral operator with kernel \(h^a\) is injective. Under this assumption, we can guarantee that \(L(p/q)=0\) implies \(p=q\). The Gaussian kernel \(h^a(x,x')=\exp (-\frac{1}{2\sigma ^2}\Vert x-x'\Vert ^2)\) satisfies this assumption for the class of \(\phi \) such that some polynomial \(\zeta (x)\) gives a bound \(\vert \phi (x)\vert \le \zeta (x)\) for all x.
In a similar derivation to (20), the \(\psi L\)-divergence in this case is given by
$$\begin{aligned} D^M(p\Vert q) = \sum _{a=1}^d \int \!\!\!\int k_p^a(x',x'')\, \partial _a \log \frac{p(x')}{q(x')}\, \partial _a \log \frac{p(x'')}{q(x'')}\, dx'\, dx'', \end{aligned}$$(24)
where \(k_p^a\) corresponds to the kernel (21) with \(h^a\). Note that the Fisher divergence can be regarded as a special case of \(D^M\) with the choice \(h^a(x,x') = \delta (x-x')\), which leads to \(k_p^a(x',x'')=\delta (x'-x'')p(x')\).
Given a sample from p(x), empirical estimation of this divergence is possible without knowing the normalization constants of p or q. Let \((\tilde{X}_1, \ldots ,\tilde{X}_n)\) be an i.i.d. sample of p(x)dx. With the estimator
$$\begin{aligned} \hat{k}_p^a(x',x'') := \frac{1}{n} \sum _{i=1}^n h^a(\tilde{X}_i, x')\, h^a(\tilde{X}_i, x''), \end{aligned}$$
the divergence \(D^M(p\Vert q)\) can be estimated by
$$\begin{aligned} \hat{D}^M = \sum _{a=1}^d \int \!\!\!\int \hat{k}_p^a(x',x'')\, \partial _a \log \frac{p(x')}{q(x')}\, \partial _a \log \frac{p(x'')}{q(x'')}\, dx'\, dx''. \end{aligned}$$
Assume further that the densities p and q belong to an infinite-dimensional exponential family; \(p(x)=\exp (f(x)-A(f))q_0(x)\) and \(q(x)=\exp (g(x)-A(g))q_0(x)\). With the choice of
$$\begin{aligned} h^a(x,x') = (2\pi \sigma ^2)^{-d/2} \exp \Bigl ( - \frac{\Vert x - x'\Vert ^2}{2\sigma ^2} \Bigr ), \end{aligned}$$
the estimator of the divergence is given by
$$\begin{aligned} \hat{D}^M = \frac{1}{n} \sum _{i=1}^n \sum _{a=1}^d E\Bigl [ \bigl ( \partial _a g(X') - \partial _a f(X') \bigr ) \bigl ( \partial _a g(X'') - \partial _a f(X'') \bigr ) \Bigr ], \end{aligned}$$
where \(X'\) and \(X''\) are independent random vectors with distribution \(N(\tilde{X}_i, \sigma ^2 I_d)\).
Based on the above formula, the following empirical estimation is possible.
1. \((\tilde{X}_1,\ldots ,\tilde{X}_n)\) is given (an i.i.d. sample from p(x)dx).
2. For each i, take independent samples \(X'_{(i,j)}, X''_{(i,j)}\) (\(j=1,\ldots ,N\)) from \(N(\tilde{X}_i,\sigma ^2 I_d)\).
3. Compute
$$\begin{aligned} \hat{\hat{D}}^M:= \frac{1}{n}\frac{1}{N}\sum _{i=1}^n\sum _{j=1}^N \sum _{a=1}^d \left( \frac{\partial g(X'_{(i,j)})}{\partial x_a}-\frac{\partial f(X'_{(i,j)})}{\partial x_a} \right) \left( \frac{\partial g(X''_{(i,j)})}{\partial x_a}-\frac{\partial f(X''_{(i,j)})}{\partial x_a} \right) .\nonumber \\ \end{aligned}$$(25)
Suppose that the functional parameters f and g are available; for example, density estimation in Sect. 2 may give their estimators. In such a case, the above formula provides a tractable estimate of the divergence without estimating the normalization constants A(f), A(g) or the density functions p(x), q(x).
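The following sketch illustrates the Monte Carlo procedure for \(d=1\) with hypothetical parameters \(f(x)=-x^2/2\) and \(g(x)=-1.5\,x^2/2\) (so only the derivatives \(\partial f, \partial g\) enter; no normalization constant appears). When \(g=f\) the estimate is exactly zero, and for distinct parameters it is positive:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical natural parameters for the illustration; only their
# derivatives are used, mirroring the estimator (25):
df = lambda x: -x            # f'(x) for f(x) = -x^2/2
dg = lambda x: -1.5 * x      # g'(x) for g(x) = -1.5 * x^2/2

def D_hat(df, dg, X_tilde, sigma=0.5, N=200, rng=rng):
    total = 0.0
    for xi in X_tilde:
        Xp = rng.normal(xi, sigma, N)   # X'_{(i,j)} ~ N(X_i, sigma^2)
        Xq = rng.normal(xi, sigma, N)   # X''_{(i,j)}, independent copy
        total += float(np.mean((dg(Xp) - df(Xp)) * (dg(Xq) - df(Xq))))
    return total / len(X_tilde)

X_tilde = rng.standard_normal(100)      # stands in for the sample from p

assert D_hat(df, df, X_tilde) == 0.0    # f = g gives exactly zero
assert D_hat(df, dg, X_tilde) > 0.0     # distinct parameters: positive estimate
```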
3.4 Geometry of divergences
A divergence defined for an exponential family provides the Riemannian structure on the family as a manifold [27]. Below, the Riemannian metrics induced by some of the proposed divergences are discussed on the infinite-dimensional exponential family.
As discussed in [27], given a divergence \(D(p\Vert q)\) on a finite-dimensional exponential family \(\{p_\theta \mid \theta =(\theta _1,\ldots ,\theta _d)\in \Theta \}\), the Riemannian metric \(\sum _{i,j=1}^d g_{ij}(\theta )d\theta ^i d\theta ^j\) is defined by
$$\begin{aligned} g_{ij}(\theta ) := - \frac{\partial ^2}{\partial \theta _i\, \partial \theta '_j}\, D\bigl ( p_{\theta } \,\big \Vert \, p_{\theta '} \bigr ) \Big \vert _{\theta ' = \theta }. \end{aligned}$$(26)
By extending this construction to the \(\psi L\)-divergences on the infinite-dimensional exponential family (1), under the assumption that the function \(\psi (z)\) is two times differentiable at \(z=0\), the Riemannian metric tensor at \(p_s=\exp (s(x)-A(s))q_0(x)\)Footnote 2 is given by
$$\begin{aligned} g(p_s)(U,V) = \sum _{a,b=1}^K \frac{1}{2} \frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\Big \vert _{z=0} \int \Bigl ( \frac{\partial L^a(\phi )}{\partial \phi }\Big \vert _{\phi =1} U \Bigr )(x)\, \Bigl ( \frac{\partial L^b(\phi )}{\partial \phi }\Big \vert _{\phi =1} V \Bigr )(x)\, p_s(x)\, d\mu (x), \end{aligned}$$(27)
where \(\frac{\partial L^a(\phi )}{\partial \phi }\) is the functional derivative of the a-th component of the operator L, which is a (generally nonlinear) operator on the tangent space of \(\mathcal {F}\) at \(\phi \). It is known that the tangent space is canonically identified with the vectors \(W\in \mathcal {F}\) with \(E_{p_s}[W]=0\), and the variables U and V in the tensor (27) are two such tangent vectors at \(p_s\). The symmetric tensor \(g(p_s)\) is strictly positive if the Hessian \(\bigl (\frac{\partial ^2 \psi (z)}{\partial z_a \partial z_b}\big \vert _{z=0}\bigr )\) is strictly positive definite and the operator \(\frac{\partial L^a(\phi )}{\partial \phi }\vert _{\phi =1}\) is injective. Under these conditions, \(g(p_s)\) defines a Riemannian metric tensor on \(\mathcal {P}_{k,q_0}\). Note that any f-divergence defines the unique Riemannian metric given by the Fisher information up to scale, irrespective of f. In contrast, the Riemannian metric (27) depends on the operator L.
In the sequel, some examples of the Riemannian metrics will be shown. First, for the divergence \(D_{\alpha ,\beta }\) (16) defined by the differential operator, a non-degenerate Riemannian metric is obtained only in the case \(\alpha =1\) and \(\beta =2\), which is the Fisher divergence. We can see that \((\partial L^a(\phi )/\partial \phi )(U) = -\frac{1}{2}\phi ^{-3/2}\partial _a\phi \cdot U + \phi ^{-1/2}\, \partial _a U\), so that \((\partial L^a(\phi )/\partial \phi )\vert _{\phi =1}(U)= \partial _a U\). The Riemannian metric \(g^D\) induced by the Fisher divergence is thus given by
$$\begin{aligned} g^D(U,V) = \int \sum _{a=1}^d \partial _a U(x)\, \partial _a V(x)\, p_s(x)\, dx. \end{aligned}$$
Next, for the divergences (20) and (24), the Riemannian metrics are respectively given by
$$\begin{aligned} g^I(U,V) = \int \!\!\!\int k_{p_s}(x',x'')\, U(x')\, V(x'')\, d\mu (x')\, d\mu (x'') \end{aligned}$$
and
$$\begin{aligned} g^M(U,V) = \sum _{a=1}^d \int \!\!\!\int k_{p_s}^a(x',x'')\, \partial _a U(x')\, \partial _a V(x'')\, dx'\, dx''. \end{aligned}$$
The derivation is easy and omitted here. The tensors \(g^I\) and \(g^M\) are non-degenerate if the kernel \(k_{p_s}\) (or \(k_{p_s}^a\)) is strictly positive definiteFootnote 3, which can be guaranteed with Gaussian kernel for h (or \(h^a\)) under mild conditions on \(p_s\).
4 Conclusion
This paper discussed infinite-dimensional exponential families and the Fisher divergence, with a focus on estimation. The first half of the paper delved into density estimation using an infinite-dimensional exponential family and presented a brief survey of previous studies. These studies have shown that the score matching method avoids the computation of the normalization constant and the probability density function, and derives a simple, though possibly high-dimensional, linear equation for the estimator.
The second topic was the Fisher divergence, which forms the foundation of the score matching method. We explored the Fisher divergence and introduced an extended class of divergences that employ operators on a function space. This extends the scope of previous studies based on f-divergence, which uses a point-wise function. We also demonstrated the Riemannian structures induced by these divergences.
The geometry of Fisher divergence and relevant divergences remains ripe for further investigation and represents an important area for future research.
Data availability
Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study.
Code Availability
No code is available.
Notes
1. This divergence is also known as Hyvärinen divergence in some literature, e.g., [15].
2. The parameter function is denoted by s in this subsection to avoid the conflict with f in f-divergence.
3. Here, a symmetric kernel \(k(x,x')\) is said to be strictly positive definite if \( \int k(x,x')f(x)g(x')d\mu (x)d\mu (x')>0\) for any non-zero f and g.
References
Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
Amari, S.: Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics. Springer, Berlin (1985)
Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Rhode Island (2000)
Pistone, G., Rogantin, M.P.: The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli 5, 721–760 (1999)
Gibilisco, P., Pistone, G.: Connections on non-parametric statistical manifolds by Orlicz space geometry. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 01, 325–347 (1998). https://doi.org/10.1142/S021902579800017X
Cena, A., Pistone, G.: Exponential statistical manifold. Ann. Inst. Stat. Math. 59, 27–56 (2007). https://doi.org/10.1007/s10463-006-0096-y
Fukumizu, K.: Exponential Manifold by Reproducing Kernel Hilbert Spaces, pp. 291–306. Cambridge University Press, Cambridge (2010). https://doi.org/10.1017/CBO9780511642401.019
Newton, N.J.: A class of non-parametric statistical manifolds modelled on Sobolev space. Inf. Geom. 2, 283–312 (2019). https://doi.org/10.1007/s41884-019-00024-z
Ay, N., Jost, J., Lê, H.V., Schwachhöfer, L.: Information Geometry. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56478-4
Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., Kumar, R.: Density estimation in infinite dimensional exponential families. J. Mach. Learn. Res. 18(57), 1–59 (2017)
Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005)
Canu, S., Smola, A.: Kernel methods and the exponential family. Neurocomputing 69, 714–720 (2006). https://doi.org/10.1016/j.neucom.2005.12.009
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B.: Kernel mean embedding of distributions: a review and beyond. Found. Trends Mach. Learn. 10, 1–141 (2017). https://doi.org/10.1561/2200000060
Amari, S.: Information Geometry and Its Applications. Springer, Tokyo (2016)
Ley, C., Swan, Y.: Stein’s density approach and information inequalities. Electron. Commun. Probab. 18, 1–14 (2013). https://doi.org/10.1214/ECP.v18-2578
Cover, T.M., Thomas, J.A.: Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), 2nd edn. Wiley-Interscience, New Jersey (2006)
Lyu, S.: Interpretation and generalization of score matching. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 359–366. AUAI Press, Virginia, USA (2009)
Kimeldorf, G.S., Wahba, G.: A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Stat. 41(2), 495–502 (1970). https://doi.org/10.1214/aoms/1177697089
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Song, Y., Garg, S., Shi, J., Ermon, S.: Sliced score matching: A scalable approach to density and score estimation. In: Adams, R.P., Gogate, V. (eds.) Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, vol. 115, pp. 574–584 (2020). https://proceedings.mlr.press/v115/song20a.html
Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2, 229–318 (1967)
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodological) 28, 131–142 (1966). https://doi.org/10.1111/j.2517-6161.1966.tb00626.x
Rényi, A.: On measures of entropy and information. In: The 4th Berkeley Symposium on Mathematics, Statistics and Probability, pp. 547–561 (1961)
Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52, 4394–4412 (2006)
Mercer, J.: XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Character 209, 415–446 (1909). https://doi.org/10.1098/rsta.1909.0016
Amari, S.-I., Cichocki, A.: Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. Sci. 58, 183–195 (2010). https://doi.org/10.2478/v10175-010-0019-1
Acknowledgements
This work has been supported in part by JSPS Grant-in-Aid for Transformative Research Areas (A) 22H05106.
Author information
Contributions
100% (single author).
Ethics declarations
Conflict of interest
The author states that there is no conflict of interest.
Ethics approval
Not available.
Consent to participate
Not available.
Consent for publication
I consent to publication.
Additional information
Communicated by Giovanni Pistone.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Fukumizu, K. Estimation with infinite-dimensional exponential family and Fisher divergence. Info. Geo. 7 (Suppl 1), 609–622 (2024). https://doi.org/10.1007/s41884-023-00122-z