Abstract
This paper introduces a new way to extract a set of representative points from a continuous distribution. The selection of points is essentially deterministic, with an emphasis on achieving an accurate approximation when the number of points is small. The points are generated by minimizing the Kullback–Leibler divergence, an information-based measure of the disparity between two probability distributions; we refer to them as Kullback–Leibler points. Using the link between the total variation distance and the Kullback–Leibler divergence, we prove that the empirical distribution of Kullback–Leibler points converges to the target distribution. We also illustrate through simulations that Kullback–Leibler points compare favorably with representative points generated by Monte Carlo or other representative-point methods. In addition, to avoid frequent evaluations of expensive functions, we propose a sequential version of Kullback–Leibler points, which adaptively updates the representative points by learning the expensive or unknown function sequentially. Two potential applications of Kullback–Leibler points, to the simulation of complex probability densities and to the optimization of complex response surfaces, are discussed and demonstrated with examples.
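To make the construction concrete, here is a minimal one-dimensional sketch (illustrative only, not the authors' implementation): points are selected greedily so that the Gaussian kernel density estimate of the chosen set minimizes a grid-based estimate of the Kullback–Leibler divergence to a standard normal target. The bandwidth, grid, and one-point-at-a-time greedy scheme are simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

def kl_points(n, h=0.4, grid=np.linspace(-4, 4, 401)):
    """Greedily pick n points whose kernel density estimate f_n
    minimizes a grid estimate of D(f_n || f) for f = N(0, 1)."""
    f = norm.pdf(grid)              # target density on the grid
    dx = grid[1] - grid[0]
    points = []
    for _ in range(n):
        best, best_kl = None, np.inf
        for c in grid:              # candidate next point
            cand = np.array(points + [c])
            # Gaussian kernel density estimate of the candidate set
            fn = norm.pdf((grid[:, None] - cand) / h).mean(axis=1) / h
            kl = np.sum(fn * np.log((fn + 1e-12) / (f + 1e-12))) * dx
            if kl < best_kl:
                best, best_kl = c, kl
        points.append(best)
    return np.array(points)

pts = kl_points(5)
print(np.round(np.sort(pts), 2))    # spread roughly symmetrically about 0
```

A full implementation would optimize all n points jointly (or refine the greedy solution) and use a divergence estimator suited to higher dimensions, but the sketch shows the deterministic character of the selection.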
References
Billingsley P (2008) Probability and measure. Wiley, New York
Brooks S, Gelman A, Jones G, Meng XL (2011) Handbook of Markov chain Monte Carlo. Chapman and Hall/CRC, Boca Raton
Chen WY, Mackey L, Gorham J, Briol FX, Oates CJ (2018) Stein points. In: Proceedings of the 35th international conference on machine learning, vol 80, pp 844–853
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Dick J, Kuo FY, Sloan IH (2013) High-dimensional integration: the quasi-Monte Carlo way. Acta Numer 22:133–573
Dudewicz EJ, van der Meulen EC (1981) Entropy-based tests of uniformity. J Am Stat Assoc 76:967–974
Fang KT, Li RZ, Sudjianto A (2006) Design and modeling for computer experiments. Chapman and Hall/CRC, Boca Raton
Fasshauer G (2007) Meshfree approximation methods with MATLAB. World Scientific, Singapore
Haario H, Saksman E, Tamminen J (1999) Adaptive proposal distribution for random walk Metropolis algorithm. Comput Stat 14:375–395
Härdle WG, Werwatz A, Müller M, Sperlich S (2004) Nonparametric and semiparametric models. Springer, New York
Hickernell F (1998) A generalized discrepancy and quadrature error bound. Math Comput 67:299–322
Joseph VR, Dasgupta T, Tuo R, Wu CFJ (2015) Sequential exploration of complex surfaces using minimum energy designs. Technometrics 57:64–74
Joseph VR, Wang DP, Gu L, Lv SJ, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
Jourdan A, Franco J (2010) Optimal Latin hypercube designs for the Kullback–Leibler criterion. AStA Adv Stat Anal 94:341–351
Kennedy MC, O’Hagan A (2001) Bayesian calibration of computer models (with discussion). J R Stat Soc B 63:425–464
Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22:79–86
Lin CD, Tang BX (2015) Latin hypercubes and space-filling designs. In: Handbook of design and analysis of experiments. Chapman and Hall/CRC, Boca Raton, pp 593–626
Mak S, Joseph VR (2017) Projected support points: a new method for high-dimensional data reduction. arXiv:1708.06897
Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
McKay MD, Beckman RJ, Conover WJ (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21:239–245
Miettinen K (2012) Nonlinear multiobjective optimization. Springer, New York
Morris MD, Mitchell TJ (1995) Exploratory designs for computational experiments. J Stat Plan Inference 43:381–402
Sacks J, Schiller SB, Welch WJ (1989) Designs for computer experiments. Technometrics 31:41–47
Santner TJ, Williams BJ, Notz WI (2019) The design and analysis of computer experiments. Springer, New York
Shi CL, Tang BX (2020) Construction results for strong orthogonal arrays of strength three. Bernoulli 26:418–431
Sobol’ IM (1967) On the distribution of points in a cube and the approximate evaluation of integrals. Zh Vychisl Mat Mat Fiz 7:784–802
Tsybakov AB (2009) Introduction to nonparametric estimation. Springer, New York
Wang Q, Kulkarni SR, Verdú S (2006) A nearest-neighbor approach to estimating divergence between continuous random vectors. In: IEEE international symposium on information theory
Worley BA (1987) Deterministic uncertainty analysis. Technical Report ORNL-6428. Oak Ridge National Laboratories
Wu Y, Ghosal S (2008) Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electron J Stat 2:298–331
Acknowledgements
This study was supported by the National Natural Science Foundation of China (Grant Nos. 11971098, 11471069 and 12131001) and National Key Research and Development Program of China (Grant Nos. 2020YFA0714102 and 2022YFA1003701).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix A: Proofs
This appendix provides proofs of Theorems 1–3. Define \(f_n^*\) as the kernel density estimator based on \(\{\textbf{x}_j^*\}_{j=1}^{n}\), where \(\{\textbf{x}_j^*\}_{j=1}^{n} \overset{i.i.d.}{\sim }\ f(\textbf{x})\). The proof of Theorem 1 relies on Lemma 1 below, which shows that the expectation of \(D(f_n^*\Vert f)\) converges to 0 as \(n \rightarrow \infty \).
Lemma 1
Suppose the probability density function f satisfies conditions (A1)–(A2), that \(\{\textbf{x}_j^*\}_{j=1}^{n} \overset{i.i.d.}{\sim }\ f\), that \(f_n^*\) is the kernel density estimator based on \(\{\textbf{x}_j^*\}_{j=1}^{n}\), and that the kernel function K and the bandwidth \(h_n\) satisfy (K1)–(K5). Then \(\lim _{n\rightarrow \infty } E[D(f_n^*\Vert f)]=0\).
Proof
For simplicity, we take \(d=1\); the proof for \(d>1\) is similar. Let \(\{x_j^*\}_{j=1}^{n} \overset{i.i.d.}{\sim }\ f\), and let \(f_n^*(x)=\frac{1}{nh_{n}}\sum _{i=1}^n K(\frac{x-x_i^*}{h_{n}})\) be the kernel density estimator based on \(\{x_j^*\}_{j=1}^{n}\), where the kernel function K and the bandwidth \(h_n\) satisfy (K1)–(K5).
Since f satisfies (A1)–(A2), f is continuous on \(\mathcal {X}\). Moreover, because f is a density function, there exists a constant \(f_{\max }\) such that \(f\le f_{\max } <\infty \). The variance of \(f_n^*(x)\) satisfies:
where \(C_1=f_{\max } \Vert K\Vert _{2}^{2}\), \(\Vert K\Vert _{2}^{2}=\int _{\mathcal {X}}K^{2}(t)dt\).
The bias of \(f_n^*(x)\) satisfies the following bound.
Since f is Lipschitz, i.e., \(|f(x)-f(y)|\le L|x- y|\) for all \(x, y \in \mathcal {X}\), we obtain
where \(C_2=L\left\{ \int _{\mathcal {X}}t^2K(t)dt\right\} ^{1/2}\).
The Kullback–Leibler divergence between f and \(f_n^*\) is
By the inequality \(\log x\le (x-1)\), the expectation of \(D(f_n^*\Vert f)\) satisfies:
The penultimate equality follows from Fubini's theorem.
According to (A1) and (A3), we can obtain
Obviously, \(\frac{G_n(x)}{f(x)}\) is monotonically decreasing in n and \(\lim _{n\rightarrow \infty }\frac{G_n(x)}{f(x)}=0\), so \(\frac{G_n(x)}{f(x)} \le \frac{G_1(x)}{f(x)}\) for all n. Since \(\int _{\mathcal {X}}\frac{G_1(x)}{f(x)}dx < \infty \), Lebesgue's dominated convergence theorem (Billingsley 2008) yields
Note that
and we have
Then the conclusion \(\lim _{n\rightarrow \infty } E[D(f_n^*\Vert f)]=0\) is established. \(\square \)
To prove Theorems 1 and 2, we also need the following lemma.
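Before turning to Lemma 2, the conclusion of Lemma 1 can be checked numerically (this is illustration only, not part of the proof). Assuming a standard normal f, a Silverman-type bandwidth rule, and a fixed evaluation grid, the Monte Carlo average of \(D(f_n^*\Vert f)\) shrinks as n grows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
grid = np.linspace(-5, 5, 501)
dx = grid[1] - grid[0]
f = norm.pdf(grid)                     # target density f = N(0, 1)

def mean_kl(n, reps=30):
    """Monte Carlo estimate of E[D(f_n^* || f)] over `reps` samples."""
    vals = []
    for _ in range(reps):
        x = rng.standard_normal(n)     # x_1*, ..., x_n* iid from f
        h = 1.06 * n ** (-0.2)         # Silverman-type bandwidth (assumption)
        fn = norm.pdf((grid[:, None] - x) / h).mean(axis=1) / h
        vals.append(np.sum(fn * np.log((fn + 1e-12) / (f + 1e-12))) * dx)
    return float(np.mean(vals))

small_n, large_n = mean_kl(50), mean_kl(800)
print(small_n, large_n)                # the average divergence decreases in n
```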
Lemma 2
Let f and g be two density functions supported on \(\mathcal {X}\subseteq R^d\), then
-
(a)
$$\begin{aligned} V(f, g)\overset{\textrm{def}}{=}\sup _{A\in \mathcal {B}}\left|\int _{A} (f(\textbf{x})-g(\textbf{x}))d\textbf{x}\right|=\frac{1}{2}\int _{\mathcal {X}}|f(\textbf{x})-g(\textbf{x})|d\textbf{x}, \end{aligned}$$
where \(\mathcal {B}\) is the Borel \(\sigma \)-algebra of \(\mathcal {X}\).
-
(b)
$$\begin{aligned} 2V^{2}(f, g) \le D(g \Vert f). \end{aligned}$$
Parts (a) and (b) of this lemma follow from Scheffé's theorem (Tsybakov 2009, p. 84) and Pinsker's inequality (Tsybakov 2009, p. 88), respectively.
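Both parts of Lemma 2 are easy to verify numerically; the following snippet does so for \(f = N(0,1)\) and \(g = N(1,1)\), for which \(D(g\Vert f) = 1/2\) in closed form (the grid range and resolution are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of V(f, g) and D(g || f) for f = N(0,1), g = N(1,1).
grid = np.linspace(-8, 8, 4001)
dx = grid[1] - grid[0]
f, g = norm.pdf(grid), norm.pdf(grid, loc=1.0)

V = 0.5 * np.sum(np.abs(f - g)) * dx       # total variation, Lemma 2(a)
D = np.sum(g * np.log(g / f)) * dx         # KL divergence, here exactly 1/2
print(round(V, 4), round(D, 4))
assert 2 * V**2 <= D                       # Pinsker's inequality, Lemma 2(b)
```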
Proof of Theorem 1
Define the sequence of random variables \(\{\textbf{x}_j^*\}_{j=1}^{\infty } \overset{i.i.d.}{\sim }\ f\), and let \(f_n^*\) denote the kernel density estimator based on \(\{\textbf{x}_j^*\}_{j=1}^{n}\). By Lemma 1, \(\lim _{n\rightarrow \infty } E[D(f_n^*\Vert f)]=0\).
Consider now the kernel density estimator \(f_{n}^{KL}\) based on the KL points \(\{\varvec{\xi }_{i}\}_{i=1}^n\). By the definition of KL points,
so \(\lim _{n\rightarrow \infty }D(f_{n}^{KL}\Vert f)=0\).
Using Pinsker’s inequality in Lemma 2 (b),
and the conclusion \(\lim _{n\rightarrow \infty } \int _{\mathcal {X}} |f_{n}^{KL} (\textbf{x})-f(\textbf{x})|d\textbf{x}=0\) follows. \(\square \)
Proof of Theorem 2
Since \(f_{n}^{KL}\) is the kernel density estimator based on the KL points \(\{\varvec{\xi }_{i}\}_{i=1}^n\) and satisfies (K1)–(K5), we have
by Lemma 2 (a). Combining Theorem 1 and (A8), we have
Setting \(A=(-\infty , \textbf{x}]\in \mathcal {B}\), we have
where \(F_n\) is the cumulative distribution function of density function \(f_{n}^{KL}\), and F is the cumulative distribution function of f. \(\square \)
Two lemmas will be needed for the proof of Theorem 3.
Lemma 3
Suppose the kernel K satisfies (K1)–(K5). Then,
-
(a)
For all \(\epsilon >0\), there exists \(M>0\) such that \(\int _{[-M, M]^d}K(\textbf{t})d\textbf{t}\ge 1-\epsilon \).
-
(b)
For the above \(M>0\), there exists \(\textbf{y}_0\) such that \(\varphi (\textbf{x}, \textbf{y}_0)\ge 1/3\), where
$$\begin{aligned} \varphi (\textbf{x}, \textbf{y})=\int _{\prod _{i=1}^d[x_i-y_i, x_i+y_i]}K(\textbf{t})d\textbf{t}, \end{aligned}$$
for \(\textbf{x}=(x_1, \ldots , x_d)\in [-M, M]^d\) and \(\textbf{y}=(y_1, \ldots , y_d)\).
Proof
(a) follows from the fact that K is a density function.
(b) Note that \(\lim _{\textbf{y}\rightarrow +\infty } \varphi (\textbf{x}, \textbf{y})\ge 1/2\) for any \(\textbf{x}=(x_1, \ldots , x_d)\in [-M, M]^d\); the conclusion follows. \(\square \)
Lemma 4
Let \(\{\varvec{\xi }_{i}\}_{i=1}^n\) be the KL points of f. Then
where
\(h_n\) is the bandwidth used to generate KL points and M is defined in Lemma 3.
Proof
For simplicity, we take \(d = 1\). Since \(\lim _{n\rightarrow \infty }\frac{\delta }{h_n}=+\infty \) for all \(\delta >0 \), there exists \(n_0\) such that \(\frac{\delta }{h_{n}}\ge y_0\) for all \(n> n_0\), where \(y_0\) satisfies \(I(|x|\le M)\varphi (x, y_0)\ge 1/3\) as defined in Lemma 3.
We prove the result by contradiction. If Lemma 4 does not hold, then there exists \(x^{*}\) such that for every \( N>n_0\) there exists \(n_{k}>N\) with \( \frac{N_{x^*}}{n_k}\ge c_0\), which means
where \(c_0\) is a positive constant.
Then, we have
where \(F_{n_k}\) is the cumulative distribution function corresponding to the kernel density estimator \(f_{n_k}\) used to generate the KL points. The penultimate “\(\ge \)” holds by Lemma 3. This contradicts the continuity of the distribution \(F_{n_k}\), which completes the proof. \(\square \)
Proof of Theorem 3
Let \(F_n^{^{KL}}\) denote the standard empirical distribution of the KL points \(\{\varvec{\xi }_{i}\}_{i=1}^n\), and let \(F_{n}\) denote the cumulative distribution function of \(f_{n}^{KL}\). We first prove that
For simplicity, we take \(d=1\). For any \(x \in \mathcal {X}\),
If \(\xi _{i} \le x\), then \(\frac{x-\xi _{i}}{h_n} \ge 0\), and
where M and \( \epsilon \) are defined as in Lemma 3, i.e., for all \( \epsilon >0\) there exists \(M>0\) such that \(\int _{-M}^{M}K(t)dt\ge 1-\epsilon \) and \(\int _{-\infty }^{-M}K(t)dt=\int _{M}^{+\infty }K(t)dt<\frac{\epsilon }{2}\).
If \(\xi _{i} > x\), then \(\frac{x-\xi _{i}}{h_n} <0\), and
In summary, for all \(x\in \mathcal {X}\), the absolute error between \(F_{n}(x)\) and \(F_n^{^{KL}}(x)\) is
By Lemma 4,
Hence, for all \(\epsilon > 0\), there exists N such that for all \(n\ge N\) and all \(x\in \mathcal {X}\),
Consequently, we obtain \(\lim _{n \rightarrow \infty }\sup _{x} |F_{n}(x)-F_n^{^{KL}}(x)|=0\). The proof for \(d > 1\) is similar; hence
Since
and combining this with Theorem 2, we have
\(\square \)
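The key step above, that the kernel-smoothed distribution \(F_n\) and the empirical distribution \(F_n^{KL}\) of the same point set are uniformly close when \(h_n\) is small, can be illustrated numerically. The snippet below uses i.i.d. standard normal draws as a stand-in for KL points (an assumption made purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
xi = np.sort(rng.standard_normal(200))   # stand-in for KL points xi_1..xi_n
h = 0.1                                  # small bandwidth h_n
grid = np.linspace(-4, 4, 801)

# Empirical CDF F_n^KL and Gaussian-kernel-smoothed CDF F_n on the grid.
F_emp = np.searchsorted(xi, grid, side="right") / xi.size
F_kde = norm.cdf((grid[:, None] - xi) / h).mean(axis=1)

gap = float(np.max(np.abs(F_emp - F_kde)))
print(round(gap, 3))                     # the sup-norm gap is small
assert gap < 0.05
```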
Cite this article
Wang, S., Sun, F. Deterministic sampling based on Kullback–Leibler divergence and its applications. Stat Papers (2023). https://doi.org/10.1007/s00362-023-01449-6
Keywords
- Bayesian computation
- Computer experiments
- Gaussian process model
- Representative points
- Space-filling design