Abstract
For a parametric model of distributions, we consider the distribution in the model that is closest to the true distribution, which lies outside the model. Measuring the closeness between two distributions by the Kullback–Leibler divergence, the closest distribution is called the “information projection.” The estimation risk of the maximum likelihood estimator is defined as the expectation of the Kullback–Leibler divergence between the information projection and the maximum likelihood estimative density (the predictive distribution with the plugged-in maximum likelihood estimator). Here, the asymptotic expansion of the risk is derived up to the second order in the sample size, and a sufficient condition on the risk is investigated for the Bayes error rate between the predictive distribution and the information projection to be no lower than a specified value. Combining these results, the “p/n criterion” is proposed, which determines whether the estimative density is sufficiently close to the information projection for the given model and sample. This criterion can serve as a solution to the sample size or model selection problem. The use of the p/n criterion is demonstrated for two practical datasets.
1 Introduction
Given a certain data set, we may assume an unknown probability distribution that generates the data as an independent, identically distributed (i.i.d.) sample. Under this assumption, if a certain parametric distribution model is adopted to “explain” the data, the first task is to find the “best” approximating distribution in the model. Because the true distribution is assumed to lie outside the model (except in some rare cases), “best” means “closest” to the true distribution.
Consider the following parametric distribution model:
where \(g(x;\theta )\) is the probability density function (p.d.f.) with respect to a reference measure \(d\mu \) on a measurable space. The p.d.f. of the unknown true distribution with respect to \(d\mu \) is denoted by g(x). If we use a certain divergence \(D[ \cdot \,|\, \cdot ]\) to measure the closeness between g(x) and \(g(x;\theta )\), then the “best” approximating distribution in \(\mathcal {M}\) is given by the predictive distribution \(g(x;\theta _*)\), where
Following Csiszár (1975), we will call \(g(x;\theta _*)\) the “information projection” in this paper.
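As a concrete illustration (ours, not part of the paper's derivation), the information projection can be computed numerically. The sketch below assumes a hypothetical setting: the true distribution is a Gamma(3) and the model is the normal family. Since the entropy of g is fixed, minimizing the K–L divergence over θ is equivalent to minimizing the cross-entropy \(E[-\log g(X;\theta )]\); for the normal model the minimizer is known to match the first two moments of g.

```python
import numpy as np
from scipy import optimize, stats

# Assumed setup: true distribution g = Gamma(shape=3), which lies
# outside the normal model {N(mu, sigma^2)}.
g = stats.gamma(a=3.0)

# Quadrature grid for the cross-entropy integral E_g[-log g(X; theta)].
xs = np.linspace(1e-3, 30.0, 6000)
dx = xs[1] - xs[0]
w = g.pdf(xs)

def cross_entropy(theta):
    # Minimizing this over theta minimizes D[g | g_theta], since the
    # entropy of g does not depend on theta.
    mu, log_sd = theta
    return -np.sum(w * stats.norm.logpdf(xs, mu, np.exp(log_sd))) * dx

res = optimize.minimize(cross_entropy, x0=np.array([1.0, 0.0]))
mu_star, sd_star = res.x[0], np.exp(res.x[1])

# For the normal model the information projection is moment matching:
# mu* = E[X] = 3 and sd* = sqrt(Var[X]) = sqrt(3).
print(mu_star, sd_star)
```

The numerical minimizer recovers the moment-matching solution, which makes the abstract definition of θ_* tangible before the asymptotics are developed.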
Let \(\hat{\theta }\) denote the maximum likelihood estimator (MLE) based on the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x). Consider the predictive density \(g(x; \hat{\theta })\). Since the MLE converges to \(\theta _*\) in probability (see, e.g., Theorem 5.21 of van der Vaart (1998)) as the sample size, n, increases,
also converges to zero in probability. The predictive density \(g(x; \hat{\theta })\) is produced by plugging in the MLE. This type of predictive density is called an “estimative density.” Another common way to construct a predictive density is the Bayesian predictive density; for its asymptotic properties, see, e.g., Komaki (1996), Hartigan (1998), Komaki (2015) and Zhang et al. (2018).
Take the expectation
with respect to the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x). Throughout this study, the expectation under g(x) is denoted by \(E[\cdot ]\), while the expectation under \(g(x;\theta _*)\) is denoted by \(E_{\theta _*}[\cdot ]\). We call (2) the “estimation risk” to distinguish it from the “total risk”
The estimation risk converges to zero under some mild conditions. We will use this estimation risk as the measure of the closeness between \(g(x; \theta _*)\) and \( g(x; \hat{\theta })\).
Given the data and the model, we need to know whether \(g(x;\hat{\theta })\) is sufficiently close to the information projection. Thus, with a certain threshold C, the following criterion is considered.
where the left-hand side is the estimator of the estimation risk.
This criterion gives a solution to the following two problems.
-
Sample size problem: With the model fixed, it indicates how large a sample size n is needed for \(g(x;\hat{\theta })\) to be close to the information projection. If the criterion is not satisfied, we need to collect a larger sample.
-
Model selection problem: With the sample size fixed, it tells us whether a model is simple enough (especially, whether the dimension of the parameter, p, is small enough) to guarantee that \(g(x;\hat{\theta })\) is close to the information projection. Unless the criterion is satisfied, simplifying the model could be a remedy.
As seen later in the manuscript, the estimation risk is mainly determined by p/n when the information projection is close to the true distribution; hence, we will call this criterion the “p/n criterion” hereafter.
In this paper, the Kullback–Leibler divergence is taken as the divergence, that is,
Note that for this divergence, the information projection is given by
and its solution \(\theta _*\) is naturally estimated via the MLE, which is the solution of
For other divergences, the information projection is more complicated, and its natural estimator is not as simple as the MLE.
This paper aims to present a simple and practical criterion (3), and proceeds as follows:
-
1.
The asymptotic expansion of the estimation risk is derived.
-
2.
The asymptotic expansion combined with the estimated moments gives the estimator of the estimation risk.
-
3.
The reasonable (persuasive) threshold C is proposed.
An overview of the contents of each section is now provided. First, the asymptotic expansion of the estimation risk is given for both the general model (Sect. 2.1) and an exponential family model (Sect. 2.2). The estimator of the estimation risk is given in Sect. 2.3. Next, a concrete threshold C is proposed in view of the Bayes error rate. With these results combined, the p/n criterion is proposed in an explicit form (Sect. 3.1). As an application of the p/n criterion, the bin number problem in a multinomial distribution or a histogram is considered (Sect. 3.2). In Sect. 3.3, the algorithm for calculating the p/n criterion in the case of an exponential family is described. In Sect. 3.4, the use of the p/n criterion is demonstrated for two practical examples.
2 Estimation risk for general case and exponential family
In this section, the asymptotic expansion with respect to n of the estimation risk (2) is presented up to the first-order term for a general distribution, and up to the second-order term for an exponential family distribution.
Hartigan (1998) derives the asymptotic expansion of the estimation risk (2) up to the second order under the assumption that g(x) belongs to \(\mathcal {M}\). The result here is an extension of his result in the sense that the true distribution is not necessarily located in \(\mathcal {M}\).
On the risk for an exponential family, the most relevant work is that of Barron and Sheu (1991). They consider the convergence rate of the K–L divergence (not the risk, but the divergence itself) for an exponential family on a compact set. Their interest lies in the closeness between g(x) and \(g(x;\hat{\theta })\), while this research focuses on the closeness between \(g(x;\theta _*)\) and \(g(x;\hat{\theta })\).
2.1 Estimation risk for general case
Taylor expansion of
as a function of \(\hat{\theta }\) around \(\theta _*\) is considered:
where \(\tilde{\theta }_*\) is a point between \(\theta _*\) and \(\hat{\theta }\). Because
it turns out that
Here,
and \(g^*_{ij}\) indicates the components of the Fisher metric matrix on \(\mathcal {M}\), given by
As \(\theta _*\) is the solution of equation (4) and \(\hat{\theta }\) is its empirical solution (i.e., the M-estimator), the following result holds (see, e.g., Theorem 5.21 of van der Vaart (1998)).
where
For a general distribution, the estimation risk is asymptotically given as follows:
Theorem 1
Because the \(n^{-2}\)-order term is prohibitively lengthy, incorporating it into the p/n criterion would make the result unsuitable for practical use; hence, it is omitted here. (Interested readers are referred to Theorem 1 of Sheena (2021), where the proof of the whole expansion can also be found.)
Note that, if g(x) lies within the model, then \(G=\tilde{G}=G^*\); hence, the first-order term equals p/(2n) (for a more general result for the well-specified model, see Sheena (2018)). Thus, the first-order term is mainly determined by p if \(g(x;\theta _*)\) is close to g(x).
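The leading p/(2n) behavior for a well-specified model can be checked by simulation. The following sketch (our illustration, not from the paper) uses the normal location-scale family, so p = 2 and the true distribution lies in the model; the Monte Carlo estimate of the risk should then be close to p/(2n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20000
mu0, sd0 = 0.0, 1.0
p = 2                                   # parameter dimension: (mu, sigma^2)

def kl_normal(mu0, v0, mu1, v1):
    # Closed-form D[N(mu0, v0) | N(mu1, v1)]
    return 0.5 * (np.log(v1 / v0) + (v0 + (mu0 - mu1) ** 2) / v1 - 1.0)

# reps independent samples of size n from the true distribution
x = rng.normal(mu0, sd0, size=(reps, n))
mu_hat = x.mean(axis=1)
v_hat = x.var(axis=1)                   # MLE of the variance (divisor n)

# Monte Carlo estimate of the estimation risk E[ D[g_theta* | g_theta-hat] ]
risk = kl_normal(mu0, sd0 ** 2, mu_hat, v_hat).mean()
print(risk, p / (2 * n))                # both close to 0.005
```

The simulated risk agrees with p/(2n) up to the O(n⁻²) correction and Monte Carlo noise, illustrating the first-order term of Theorem 1 in the well-specified case.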
2.2 Estimation risk for exponential family
This subsection investigates the estimation risk when the parametric model is an exponential family (for general references on exponential families, see Brown (1986), Barndorff-Nielsen (2014) and Sundberg (2019)). In the case of the exponential family, the \(n^{-2}\)-order term in the asymptotic expansion of the estimation risk has a simpler form.
Let the model \(\mathcal {M}\) be given by
where \(\Psi (\theta )\) is the cumulant-generating function of the \(\xi \) terms, such that,
The “dual coordinate” \(\eta \) is defined as
In particular, from the definition of \(\theta _*\) (see (4)),
The last equation requires the means of \(\xi _i\) to coincide under g(x) and \(g(x;\theta _*)\). It is known that \(g(x;\theta _*)\) maximizes the Shannon entropy among all probability distributions of \((\xi _1,\ldots ,\xi _p)\) with a given \(E[\xi _i],\ i=1,\ldots p\) (the “entropy maximization property” of an exponential family; see, e.g., Wainwright and Jordan (2008)). The K–L divergence is the difference between the cross-entropy and Shannon entropy.
The \(\eta \) coordinate is easily estimated. In fact, \(\hat{\eta }\), the MLE for \(\eta \), is the sample mean of \(\xi \). Hence,
In contrast, \(\hat{\theta }\) is difficult to obtain explicitly because \(\Psi \) or its derivative cannot be obtained theoretically for a complex model. This could pose a serious obstacle to the application of an exponential family model to a practical problem, and is discussed in Sect. 3.3.
Let the matrix \(\ddot{\Psi }(\theta )\) be defined by
Thus, \(\ddot{\Psi }\) is a covariance matrix of the \(\xi _i\) terms under \(g(x;\theta )\); hence, it is positive definite. Therefore, \(\Psi (\theta )\) is a convex function. The notable property
is proven by the fact that both sides are equal to \((\ddot{\Psi }(\theta ))_{ij}\).
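For a concrete one-parameter example (ours, not from the paper): the Bernoulli family with \(\xi (x)=x\) has \(\Psi (\theta )=\log (1+e^\theta )\), \(\eta (\theta )=\Psi '(\theta )=e^\theta /(1+e^\theta )\), and \(\ddot{\Psi }(\theta )=\eta (1-\eta )\), so the MLE \(\hat{\eta }\) is the sample mean and \(\hat{\theta }\) is its logit.

```python
import numpy as np

# Bernoulli as an exponential family: g(x; theta) = exp(theta*x - Psi(theta)),
# x in {0, 1}, with Psi(theta) = log(1 + e^theta), eta(theta) = sigmoid(theta).
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=20000)          # i.i.d. sample, true eta = 0.3

eta_hat = x.mean()                            # MLE of eta: sample mean of xi
theta_hat = np.log(eta_hat / (1 - eta_hat))   # invert eta = sigmoid(theta)

# Psi''(theta_hat) = eta_hat * (1 - eta_hat): the variance of xi under the
# fitted distribution, illustrating that Psi-ddot is the covariance of xi.
print(eta_hat, theta_hat, eta_hat * (1 - eta_hat))
```

Here Ψ is available in closed form, so θ̂ follows from η̂ by a one-line inversion; Sect. 3.3 treats the case where this inversion must be done numerically.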
The following notation is used for the third- or fourth-order cumulant:
for \(1\le i, j, k,l \le p\).
The next theorem states the asymptotic expansion of the estimation risk for an exponential family distribution. In this case, the second-order term is relatively simple and can be practically used when incorporated into the p/n criterion proposed in the next section.
In the theorem, for brevity, Einstein notation is used and the dependency on \(\theta _*\) is omitted; e.g., G for \(G(\theta _*)\) and \(\tilde{g}^{ij}\) for \(\tilde{g}^{ij}(\theta _*)\).
Theorem 2
If the parametric model is an exponential family, the estimation risk is given by
Proof
The calculation is carried out straightforwardly from the expansion for the general distribution. See Sheena (2021) for the proof. \(\square \)
The estimation risk up to the second-order term is determined by the moments of the \(\xi _i\) terms, \(g_{ij}\), and \(\kappa _{ijk}\) under g(x), as well as their moments under \(g(x;\theta _*)\), \(\tilde{g}^{ij}\), \(\kappa ^*_{ijk}\), and \(\kappa ^*_{ijkl}\).
2.3 Estimator of estimation risk
We will use Theorems 1 and 2 to approximate the estimation risk. In order to establish the criterion (3), we need an estimator of the (approximated) estimation risk. The moments contained in (5) or (8) need to be estimated: the second moments (Fisher information metric)
and the cumulants
Naive estimators of these quantities (denoted by the “hat” mark: \(\hat{G}\), \(\hat{\kappa }_{ijk}\), etc.) are obtained by replacing \(\theta _*\) with the MLE \(\hat{\theta }\), and the expectation \(E[\cdot ]\) with the empirical mean.
First, the estimators of the second moments are given as follows:
Now we have the p/n criterion for a general distribution with a given C.
Criterion for a general distribution
Next, the criterion for the exponential family is considered. \(\hat{G}\) equals the sample covariance matrix of the \(\xi _i\) terms, \(\hat{\Sigma }\):
where \(\bar{\xi }_i = n^{-1} \sum _{t} \xi _i(X_t).\) Similarly, the estimator of the true third-order cumulant is given by the sample third-order cumulant:
Further,
Consequently, for an exponential family, the p/n criterion is given as follows.
Criterion for an exponential family
How to determine C in (9) or (15) is studied in the next section. Once C is determined, we can use these criteria for the two problems, that is, the sample size problem and the model selection problem, introduced in Sect. 1.
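As an illustration of the plug-in estimators of this subsection (our sketch; the distribution of the ξ statistics here is arbitrary), the sample covariance matrix and the sample third-order cumulant can be computed directly from the data matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 3
xi = rng.exponential(size=(n, p))       # n observations of p xi-statistics

xi_bar = xi.mean(axis=0)
dev = xi - xi_bar                       # deviations from the sample mean

# G-hat: sample covariance matrix with divisor n (the plug-in/MLE form)
G_hat = dev.T @ dev / n

# kappa-hat_{ijk}: sample third-order cumulant (= sample third central moment)
kappa3_hat = np.einsum('ti,tj,tk->ijk', dev, dev, dev) / n

# For Exp(1) components: variance 1 and third cumulant 2 on the diagonals.
print(np.diag(G_hat), np.einsum('iii->i', kappa3_hat))
```

The estimators of the cumulants under \(g(x;\hat{\theta })\) are computed the same way, with the empirical mean replaced by the expectation under the fitted distribution (approximated by sampling when needed, as in Sect. 3.3).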
3 Criterion for model complexity and sample size
In this section, we complete the p/n criterion by providing a reasonable threshold C for (9) or (15) (Sect. 3.1). As an immediate application of the criterion, we deal with the bin number problem in a multinomial distribution or a histogram (Sect. 3.2). We also state the algorithm for the calculation of the \(n^{-2}\)-order term in (15) (Sect. 3.3). Finally, the use of the p/n criterion is demonstrated for two practical examples (Sect. 3.4).
3.1 Choice of threshold
Because the value of the divergence (1) or the risk (2) has no absolute standard by itself, we relate it to another reasonable standard. One often-used measure of the closeness between two distributions is the error rate, which is more intuitive than the divergence and is suitable for setting a threshold. Let \(g_i(x),\ i=1,2\), be p.d.f.s. If both are known, the Bayes discriminant rule (with the noninformative prior) is as follows.
For the sample X from either \(g_1(x)\) or \(g_2(x)\),
The Bayes error rate, Er, i.e., the probability that this rule gives an error, is formally defined by
The next theorem states the relation between Er and the K–L divergence.
Theorem 3
If \(D[g_1(x) \,| \, g_2(x)] \le \delta \), then
where
Proof
See Appendix. \(\square \)
Corollary 1
Let \(\delta = D[g(x;\theta _1)\, | \, g(x;\theta _2)]\) and \(\alpha \) be a certain small positive number (e.g. \(\alpha =0.05, 0.01\)). If
then
Analytical calculation of \(\min \{ t \,|\, (x,t) \in A(\delta ) \}\) is difficult. An approximation for t close to 1/2 is given here. Since \(\log {(1+x)} \doteqdot x - x^2/2\) around \(x=0\),
Therefore, \(A(\delta )\) is approximated by
Note that
Hence, the condition \(\sqrt{\delta /8} \le \alpha \) or, equivalently, \(\delta \le 8\alpha ^2\) is approximately sufficient for (16). Let \(C_\alpha \) denote the solution in \(\delta \) of the equation
or more simply, let \(C_\alpha \) be given by
In the latter case, if \(\alpha =0.05\,(0.01)\), then \(C_\alpha =1/50\,(1/1250)\). The final form of the p/n criterion is given by substituting \(C_\alpha \) for C in (9) or (15).
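The simplified threshold and its inversion can be written in a few lines (our sketch; the function names are ours). Inverting \(\delta = 8\alpha ^2\) turns an estimated risk into the guaranteed lower bound \(1/2-\alpha \) on the Bayes error rate:

```python
import math

def threshold(alpha):
    # Simplified threshold C_alpha = 8 * alpha**2, from the approximation
    # sqrt(delta / 8) <= alpha derived above.
    return 8 * alpha ** 2

def implied_alpha(risk):
    # Invert risk = 8 * alpha**2: an estimated risk delta guarantees a
    # Bayes error rate of at least 1/2 - sqrt(delta / 8).
    return math.sqrt(risk / 8)

print(threshold(0.05), threshold(0.01))  # 0.02 (= 1/50) and 0.0008 (= 1/1250)
print(implied_alpha(2.93e-2))            # ≈ 0.061 (cf. Example 1 in Sect. 3.4)
```

This is exactly the computation used in the wine example of Sect. 3.4, where the estimated risk 2.93e-02 translates to \(\alpha \doteqdot 0.06\).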
3.2 p/n Criterion for multinomial distribution
In this section, we present a formula for the bin number of a multinomial distribution using the p/n criterion. The bin number problem in a histogram can be treated similarly. Although several formulas have been proposed for the bin number (or the bin width) of a histogram, such as Sturges’ formula and Freedman–Diaconis’ formula (see Chapter 3 of Scott (2015)), the formula here is derived from a new perspective.
In view of the true distribution g(x) and the information projection \(g(x;\theta _*)\), a multinomial distribution can be seen as the approximation by the step function model. Let
with
where \(S_i, i=0,1,\ldots ,p\) is the partition of the range of x with volume
and \(I(x \in S_i)\) is the indicator function of \(S_i\). In this case, from (4), the information projection \(g(x; m^*)\) is given by \(m^*_i = P(X \in S_i | g(x))\). The step-function model is not an exponential family. However, one can easily verify that the Kullback–Leibler divergence between two step functions (where \(d\mu \) is the continuous measure) is equal to the divergence between the two corresponding multinomial distributions (where \(d\mu \) is the counting measure). Hence, the argument of the estimation risk can be deduced from that of the multinomial distribution model. Notably, if X is originally a discrete random variable, the model always contains g(x).
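The claimed equality of the two divergences can be verified directly (our sketch, with an arbitrary unequal-width partition): the step-function heights are \(m_i/|S_i|\), and the bin widths cancel inside the logarithm.

```python
import numpy as np

# Two step densities on [0, 1] over the same (unequal-width) partition.
edges = np.array([0.0, 0.2, 0.5, 1.0])
widths = np.diff(edges)
m1 = np.array([0.3, 0.5, 0.2])      # cell probabilities of density 1
m2 = np.array([0.1, 0.6, 0.3])      # cell probabilities of density 2

# K-L divergence between the step densities (w.r.t. Lebesgue measure):
# the heights are m_i / |S_i|, and the widths cancel inside the log.
kl_step = np.sum(m1 * np.log((m1 / widths) / (m2 / widths)))

# K-L divergence between the corresponding multinomials (counting measure).
kl_multinomial = np.sum(m1 * np.log(m1 / m2))

print(np.isclose(kl_step, kl_multinomial))   # True
```

This identity is what allows the bin number problem for a histogram to be reduced to the multinomial case below.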
Consider a multinomial distribution with \(p+1\) possible values \(x_i, i=0,\ldots ,p\), with the corresponding probabilities \(m=(m_0,\ldots ,m_p)\). This is an exponential family (6), where
and \(d\mu \) is the counting measure on \(\{x_0,\ldots ,x_p\}\). Here,
The asymptotic expansion of the estimation risk up to the second order can be derived as follows (this corresponds to equation (41) of Sheena (2018) with \(\alpha =-1\)).
where \(\theta =(m_1,\ldots ,m_p)\) is the free parameter of the true distribution. Note that if some of the \(m_i\) are close to zero, the convergence speed is reduced considerably.
If we combine the first-order approximation in (18) with the threshold (17), the p/n criterion becomes
If we adopt \(\alpha =0.05\,(0.01)\), then the sample size n or the bin number \(p+1\) is determined by the following formula.
Simple criterion for the sample size or the bin number
The second-order approximation gives the following p/n criterion:
where
and \(\hat{m}_i\) is the MLE, i.e., the sample relative frequency, for each i. Applying the criterion to determine n gives the formula
In contrast, if the criterion is used for the bin number problem, the formula is given by
Use of these criteria for practical examples is discussed in Sect. 3.4.
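The first-order version of the simple criterion can be sketched as follows. This is our reconstruction from the condition \(p/(2n) \le C_\alpha = 8\alpha ^2\) only; the paper's second-order formulas, which involve the estimated quantity \(\hat{M}\), are omitted here, so the numbers below differ slightly from the second-order answers of Sect. 3.4.

```python
import math

def min_sample_size(p, alpha):
    # First-order condition p/(2n) <= 8*alpha**2  =>  n >= p / (16*alpha**2)
    return math.ceil(p / (16 * alpha ** 2))

def max_bins(n, alpha):
    # The same inequality solved for the bin number p + 1
    return math.floor(16 * alpha ** 2 * n) + 1

print(min_sample_size(62, 0.05))  # 1550 draws needed for 63 bins at alpha=0.05
print(max_bins(4177, 0.01))       # only 7 bins affordable at alpha=0.01
```

With the abalone dimensions of Sect. 3.4 (p = 62, n = 4177), the first-order requirement at \(\alpha =0.05\) is n ≥ 1550, consistent with the conclusion there that the actual sample size is large enough for a Bayes error rate of 0.45 but not for 0.49.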
3.3 Algorithm for p/n criterion of exponential family
This section describes the calculation of the right-hand side of (15). If the function \(\Psi (\theta )\) can be calculated analytically, the algorithm is simply the following.
-
Step 1
Calculate \(\hat{\eta }_i=\bar{\xi }_i,\ i=1,\ldots ,p\) from the sample.
-
Step 2
Solve the simultaneous equations w.r.t. \(\theta \) in (7) to give \(\hat{\theta }=(\hat{\theta }_1,\ldots ,\hat{\theta }_p)\):
$$\begin{aligned} \hat{\eta }_i = \eta _i(\hat{\theta }) = \frac{\partial \Psi }{\partial \theta _i}(\hat{\theta }),\qquad i=1,\ldots ,p. \end{aligned}$$
-
Step 3
Calculate (12), (13), and (14) from \(\Psi (\hat{\theta })\).
- Step 4
-
Step 5
Calculate the right-hand side of (15) and compare it with \(C_\alpha \).
Often, \(\Psi (\theta )\) is not explicitly given, especially for a complex model. Then, \(\hat{\theta }\) can be calculated iteratively using the Newton–Raphson method with the Jacobian matrix (12). Because \(\ddot{\Psi }(\theta )\) is the variance–covariance matrix of the \(\xi _i\) terms under the \(g(x;\theta )\) distribution, its value can be approximated from a generated sample. The alternative steps are as follows.
-
Step 2’
Iteratively search for \(\hat{\theta }\) with
$$\begin{aligned} \theta ^{(n+1)} = \theta ^{(n)} - \bigl (\eta (\theta ^{(n)})-\hat{\eta }\bigr )\bigl (\ddot{\Psi }(\theta ^{(n)})\bigr )^{-1}, \end{aligned}$$
where \(\eta (\theta ^{(n)})\) and \(\ddot{\Psi }(\theta ^{(n)})\) are approximated by the sample mean and the sample covariance matrix of the \(\xi _i\) terms from the \(g(x;\theta ^{(n)})\) distribution.
Further, (12), (13), and (14) can also be approximated using the generated sample.
-
Step 3’
Approximate (12), (13), and (14) using the sample moments and cumulants, where the sample is generated from \(g(x;\hat{\theta })\).
The point here is that \(\Psi (\theta )\) is not required for sample generation in Steps 2’ and 3’ if methods such as MCMC (requiring no normalizing constant) are used. Although Steps 2’ and 3’ are computationally heavy tasks, they enable construction of a complex model without calculation of \(\Psi \).
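A minimal sketch of Step 2’ is given below (ours, on a toy finite sample space where direct sampling is easy; the damping factor and iteration counts are our choices). On this toy space the sampler normalizes the weights directly, whereas in a real application MCMC would replace it, so that no normalizing constant \(e^{\Psi (\theta )}\) is needed. The update is written in the column-vector form of the Newton step.

```python
import numpy as np

rng = np.random.default_rng(3)
support = np.arange(10.0)               # toy finite sample space {0,...,9}
xi = np.stack([support, support ** 2])  # xi_1(x) = x, xi_2(x) = x^2 (p = 2)

def sample_xi(theta, size):
    # Draw from g(x; theta) ∝ exp(theta · xi(x)) and return xi-statistics.
    # On this toy space we normalize directly; MCMC would avoid that.
    t = theta @ xi
    w = np.exp(t - t.max())             # guard against overflow
    idx = rng.choice(len(support), size=size, p=w / w.sum())
    return xi[:, idx]

# Step 1: target dual coordinates = sample means of the xi terms ("data").
data = rng.binomial(9, 0.4, size=5000).astype(float)
eta_hat = np.array([data.mean(), (data ** 2).mean()])

# Step 2': damped Newton-type search with simulated eta and Psi-ddot.
theta = np.zeros(2)
for _ in range(40):
    s = sample_xi(theta, 20000)
    eta = s.mean(axis=1)                # simulated eta(theta)
    cov = np.cov(s)                     # simulated Psi-ddot(theta)
    theta = theta - 0.5 * np.linalg.solve(cov, eta - eta_hat)

print(eta, eta_hat)                     # the two should roughly agree
```

The iteration drives the simulated \(\eta (\theta )\) toward \(\hat{\eta }\) up to Monte Carlo noise; Step 3’ would then reuse the final sample to approximate the cumulants (12), (13), and (14).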
3.4 Real data examples for p/n criterion
This section demonstrates the use of the p/n criterion through two practical examples under the exponential family model.
Example 1
(Red Wine) The first example is a well-known dataset on wine quality, taken from the U.C.I. Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).
Only red wine data are used. The sample size is 1599, and the variables consist of 11 chemical substances (continuous variables) and “quality” indexes (integers from 3 to 8). The vector of the chemical substances and the “quality” variable are denoted by \(x^{(1)} =(x^{(1)}_{1},\ldots , x^{(1)}_{11})\) and \(x^{(2)}\), respectively. We divided the sample into two halves randomly, one of which (“data_base”) was used for the model formulation and the other (“data_est”) was used for the estimation of the parameter.
For the model formulation, we determined the following: the normalization method for the original data, the reference (probability) measure \(d\mu (x)\), and the \(\xi \) elements. Using “data_base”, we proceed as follows:
-
1.
Each variable \(x^{(1)}_i\ (i=1,\ldots ,11)\) is divided by twice its maximum so that its range is \([0,\ 1)\). Further, 2 is subtracted from each “quality” index to give a range of \(\{1,2,\ldots ,6\}\).
-
2.
As \(d\mu (x)\), 11 independent Beta distributions are applied to \(x^{(1)}\) so that their means and variances are equal to those of the “data_base”. The multinomial distribution of \(x^{(2)}\) is adopted, using each category’s sample relative frequency as the category probability parameter (say, \(m_i, \ i=1,\ldots ,6\)). In addition, \(x^{(1)}\) and \(x^{(2)}\) are taken to be independent.
Consequently, \(d\mu \) is selected as
where \(d(x^{(1)})\) is the Lebesgue measure on \([0,\ 1]^{11}\), \(d^*(x^{(2)})\) is the counting measure on \(\{1,2,\ldots ,6\}\), and \(I(\cdot )\) is the indicator function. Further, \(\beta _{1i}\), \(\beta _{2i}\), and \(m_i\) satisfy the relations
3. The candidates for the \(\xi _i\) terms are as follows:
and
Because some of these terms are highly correlated, we eliminated one member of each pair with a correlation higher than 0.95. The following 20 \(\xi _i\) terms were removed from the full model:
Consequently, an exponential family model with \(p=47\) is formulated. As the probability distribution \(g(x;\theta )d\mu \) equals \(d\mu \) when the \(\theta \) terms all equal zero, it is denoted by g(x; 0). Note that the \(g(x ;\theta _*)\) of this model is the closest to g(x; 0) in the sense that
where \(\mathcal {H}\) is the p.d.f. set of h(x) (w.r.t. \(d\mu \)) that satisfies
for each \(\xi _i\) in the model. This is a consequence of the so-called “minimum relative entropy characterization” of an exponential family (see Csiszár (1975)).
Under the formulated exponential family model, the algorithm in the previous section was implemented and the right-hand side of (15) was calculated using “data_est”, the size of which (n) equals 799. Because of the model complexity, the explicit form of \(\Psi (\theta )\) could not be obtained; hence, the alternative Steps 2’ and 3’ were used. The R and RStan program codes for the whole risk calculation are presented on GitHub (https://github.com/YSheena/P-N_Criteria_Program.git). The first- and second-order terms and the total estimation risk in (15) were as follows:
First-order term: 2.95e-02, Second-order term: -1.30e-04, Estimation Risk: 2.93e-02
Note that the second-order term contributes little to the estimation risk; thus, the first-order approximation seems sufficient for this model and data. With the threshold (17), the equation 2.93e-02 \(= 8\alpha ^2\) gives the solution \(\alpha \doteqdot 0.06\). Hence, the Bayes error rate between \(g(x;\hat{\theta })\) and \(g(x;\theta _*)\) is higher than 0.44. If we set the threshold at \(\alpha =0.05\), we must trim the model further. For example, if we eliminate one member of each pair of \(\xi \) elements with a correlation higher than 0.9, then p becomes as small as 37. For this model, the estimation risk is lower than the target value \(8\times (0.05)^2=0.02\), as follows:
First-order term: 1.60e-02, Second-order term: 2.04e-04, Estimation Risk: 1.62e-02
Example 2
(Abalone Data) The next example also features a well-known dataset, in this case for the physical measurements of abalones (U.C.I. Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Abalone). These data comprise eight properties (sex, length, diameter, etc.) of 4177 abalones. Here, only two discrete variables were considered: “sex” and “rings,” where “sex” had the three values “Female,” “Infant,” and “Male,” and “rings” had integer values from 1 to 29. The frequency of each group classified by “sex” and “rings” is given in Table 1. The original frequencies were aggregated at both ends: in the table, if a star-marked cell is located to the immediate left or right of a cell, the number in that cell is an aggregated count. For example, for the female abalones, cells with 24 or more rings were aggregated to a frequency of 4. The total number of cells was 63.
A multinomial distribution over the 63 cells was considered; hence, \(p=62\). First, the simple criterion (19) is adopted; then
but \(p/n > 1/625\). Consequently, the model distribution is close to the information projection (in this case, the true distribution) to the extent that the Bayes error rate is more than 0.45 but less than 0.49.
In order to use the second-order term, M needs to be estimated. From the sample relative frequency of each cell, \(\hat{m}_i,\ i=0,\ldots ,62\),
Use of the n formula (20) yielded
which indicates that the actual sample size of 4177 is large enough for a Bayes error rate of 0.45. However, to attain a Bayes error rate of 0.49, the required sample size equals 38847, which is far beyond 1642.
References
Barndorff-Nielsen OE (2014) Information and exponential families in statistical theory. Wiley, New York
Barron AR, Sheu C (1991) Approximation of density functions by sequences of exponential families. Ann Stat 19(3):1347–1369
Brown LD (1986) Fundamentals of statistical exponential families. IMS
Csiszár I (1975) I-divergence geometry of probability distributions and minimization problems. Ann Probab 3:146–158
Hartigan JA (1998) The maximum likelihood prior. Ann Stat 26(6):2083–2103
Komaki F (1996) On asymptotic properties of predictive distributions. Biometrika 83(2):299–313
Komaki F (2015) Asymptotic properties of Bayesian predictive densities when the distributions of data and target variables are different. Bayesian Anal 10(1):31–51
Scott DW (2015) Multivariate density estimation. Wiley, New York
Sheena Y (2018) Asymptotic expansion of the risk of maximum likelihood estimator with respect to \(\alpha \)-divergence as a measure of the difficulty of specifying a parametric model. Commun Stat Theory Methods 47(16):4059–4087
Sheena Y (2021) MLE convergence speed to information projection of exponential family: criterion for model dimension and sample size (complete proof version). arXiv:2105.08947
Sundberg R (2019) Statistical modeling for exponential families. Cambridge University Press, Cambridge
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Now Publishers
Zhang F, Shi Y, Ng HKT, Wang R (2018) Information geometry of generalized Bayesian prediction using \(\alpha \)-divergence as loss functions. IEEE Trans Inf Theory 64(3):1812–1824
Acknowledgements
The author greatly appreciates the reviewers’ constructive comments on the previous version of the manuscript, which made the present version more concise and readable. This research was partially supported by Grant-in-Aid for Scientific Research (20K11706).
Appendix
Proof of Theorem 3. A suitably fine partition \(S_i,\ i=1,\ldots ,m\), of the domain of \(d\mu \) and the associated step functions \(g_j(x) = \sum _{i=1}^m c_{ji} I(x \in S_i),\ j=1,2\), are taken such that the two integrations
are sufficiently well approximated by
respectively. Furthermore, we can choose the partition such that
where \(\Delta _i = c_{2i}/c_{1i},\ i=1,\ldots ,m\). Suppose that \(D[g(x;\theta _1) \,| \, g(x;\theta _2)] \le \delta \). Then, with a sufficiently fine partition \(S_i,\ i=1,\ldots ,m\), we have
We search for the lower bound of \(t(\Delta )\) under condition (23). Let
Note that, as the partition \(S_i,\ i=1,\ldots ,m\) becomes finer,
Without loss of generality, the following can be assumed:
Let \(t=m-s\) and
Note that
and, because of the concavity of \(f(\Delta )\),
Therefore, in the search for the lower bound of \(t(\Delta )\), we need only consider the case where
Under condition (25), the relations (23) and (24) are
respectively, or equivalently,
where
Substituting the relation from (27), i.e.,
into \(\Delta ^- > 0\) and (26) gives
Furthermore, under condition (25),
Consider the minimization of \(t(x;\Delta ^+)\) under conditions (28), (29), and (30). Notice that
Since
and
the minimum value of \(t(x;\Delta ^+)\) (say, \(t^*\)) is attained when (30) holds with equality. Let \((x^*, \Delta _*^+)\) denote the point that attains \(t^*\); then,
Inserting (31) into the left-hand side of (30) and equating it with \(-\delta \) gives
while, from (28), (29), and (31),
Let us define the region \(\tilde{A}(\delta )\) by
Then,
Taking the limit on both sides as the partition becomes finer gives the result.
About this article
Cite this article
Sheena, Y. Convergence of estimative density: criterion for model complexity and sample size. Stat Papers 64, 117–137 (2023). https://doi.org/10.1007/s00362-022-01309-9
Keywords
- Kullback–Leibler divergence
- Exponential family
- Asymptotic risk
- Information projection
- Predictive density