1 Introduction

Given a data set, one can assume that an unknown probability distribution generates the data as an independent, identically distributed (i.i.d.) sample. Under this assumption, if a parametric distribution model is adopted to “explain” the data, the first task is to find the “best” approximating distribution in the model. Because the true distribution generally lies outside the model (except in rare cases), “best” means “closest” to the true distribution.

Consider the following parametric distribution model:

$$\begin{aligned} \mathcal {M} = \{g(x;\theta )\,|\, \theta =(\theta ^1,\ldots ,\theta ^p) \in \Theta \}, \end{aligned}$$

where \(g(x;\theta )\) is the probability density function (p.d.f.) with respect to a reference measure \(d\mu \) on a measurable space. The p.d.f. of the unknown true distribution with respect to \(d\mu \) is denoted by g(x). If we use a certain divergence \(D[ \cdot \,|\, \cdot ]\) to measure the closeness between g(x) and \(g(x;\theta )\), then the “best” approximating distribution in \(\mathcal {M}\) is given by \(g(x;\theta _*)\), where

$$\begin{aligned} \theta _* = \mathop {\mathrm {arg\,min}}\limits _{\theta \in \Theta } D[ g(x) \,|\,g(x;\theta )] . \end{aligned}$$

Following Csiszár (1975), we will call \(g(x;\theta _*)\) the “information projection” in this paper.

Let \(\hat{\theta }\) denote the maximum likelihood estimator (MLE) based on the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x), and consider the predictive density \(g(x; \hat{\theta })\). Since the MLE converges to \(\theta _*\) in probability as the sample size n increases (see, e.g., Theorem 5.21 of van der Vaart (1998)),

$$\begin{aligned} D[g(x;\theta _*) \,| \, g(x; \hat{\theta })] \end{aligned}$$
(1)

also converges to zero in probability. The predictive density \(g(x; \hat{\theta })\) is obtained by plugging in the MLE; this type of predictive density is called an “estimative density”. Another common way to construct a predictive density is the Bayesian predictive density. For the asymptotic properties of Bayesian predictive densities, see, e.g., Komaki (1996), Hartigan (1998), Komaki (2015) and Zhang et al. (2018).

Take the expectation

$$\begin{aligned} R[g(x;\theta _*) \,| \, g(x; \hat{\theta })] = E\Bigl [D[g(x;\theta _*) \,| \, g(x; \hat{\theta })] \Bigr ] \end{aligned}$$
(2)

with respect to the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x). Throughout this study, the expectation under g(x) is denoted by \(E[\cdot ]\), while the expectation under \(g(x;\theta _*)\) is denoted by \(E_{\theta _*}[\cdot ]\). We call (2) the “estimation risk” to distinguish it from the “total risk”

$$\begin{aligned} R[g(x) \,| \, g(x; \hat{\theta })] = E\Bigl [D[g(x) \,| \, g(x; \hat{\theta })] \Bigr ]. \end{aligned}$$

The estimation risk converges to zero under some mild conditions. We will use this estimation risk as the measure of the closeness between \(g(x; \theta _*)\) and \( g(x; \hat{\theta })\).

Given the data and the model, we need to know whether \(g(x;\hat{\theta })\) is sufficiently close to the information projection. Thus, with a certain threshold C, the following criterion is considered.

$$\begin{aligned} \hat{R}[g(x;\theta _*) \,| \, g(x; \hat{\theta })] < C, \end{aligned}$$
(3)

where the left-hand side is an estimator of the estimation risk.

This criterion gives a solution to the following two problems.

  • Sample size problem: With the model fixed, it indicates how large the sample size n must be for \(g(x;\hat{\theta })\) to be close to the information projection. If the criterion is not satisfied, we need to collect more samples.

  • Model selection problem: With the sample size fixed, it tells us whether a model is simple enough (in particular, whether the dimension p of the parameter is small enough) to guarantee that \(g(x;\hat{\theta })\) is close to the information projection. If the criterion is not satisfied, simplifying the model could be a remedy.

As seen later in the manuscript, the estimation risk is mainly determined by p/n when the information projection is close to the true distribution; hence, we will call this criterion the “p/n criterion” hereafter.

In this paper, the Kullback–Leibler (K–L) divergence is adopted as the divergence, that is,

$$\begin{aligned} D[ g(x) \,|\,g(x;\theta )] = \int g(x) \log \Bigl (g(x)/g(x;\theta ) \Bigr ) d\mu . \end{aligned}$$

Note that for this divergence, the information projection is characterized by the equations

$$\begin{aligned} E \left[ \frac{\partial \ }{\partial \theta ^i}\log g(X;\theta ) \right] =0,\quad i=1,\ldots ,p,\end{aligned}$$
(4)

and its solution \(\theta _*\) is naturally estimated by the MLE, which is the solution of

$$\begin{aligned} \sum _{t=1}^n \frac{\partial \ }{\partial \theta ^i}\log g(X_t;\theta ) = 0,\quad i=1,\ldots ,p. \end{aligned}$$

For other divergences, the information projection is more complicated, and its natural estimator is not as simple as the MLE.
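
As a concrete illustration (ours, not from the paper): if \(\mathcal {M}\) is the exponential model \(g(x;\theta )=\theta e^{-\theta x}\), equation (4) reduces to \(E[1/\theta - X]=0\), so \(\theta _*=1/E[X]\), and the MLE \(1/\bar{X}\) is its empirical counterpart. The following R sketch checks numerically that the divergence (1) shrinks as n grows when the data are actually gamma distributed (a misspecified case).

```r
## Minimal sketch (not the authors' code): exponential model fitted to gamma data.
## For g(x; theta) = theta * exp(-theta * x), equation (4) gives theta_* = 1 / E[X];
## the MLE is 1 / mean(X), and D[Exp(a) | Exp(b)] = log(a / b) + b / a - 1.
set.seed(1)
shape <- 2; rate <- 3                  # true (gamma) distribution, outside the model
theta_star <- rate / shape             # = 1 / E[X]
kl_exp <- function(a, b) log(a / b) + b / a - 1
for (n in c(50, 500, 5000)) {
  x <- rgamma(n, shape = shape, rate = rate)
  theta_hat <- 1 / mean(x)             # MLE of the misspecified exponential model
  cat(sprintf("n = %5d  theta_hat = %.3f  D = %.2e\n",
              n, theta_hat, kl_exp(theta_star, theta_hat)))
}
```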

This paper aims to present the simple and practical criterion (3), and proceeds as follows:

  1. The asymptotic expansion of the estimation risk is derived.

  2. Combining the asymptotic expansion with the estimated moments gives an estimator of the estimation risk.

  3. A reasonable (persuasive) threshold C is proposed.

An overview of the contents of each section is now provided. First, the asymptotic expansion of the estimation risk is given for both the general model (Sect. 2.1) and an exponential family model (Sect. 2.2). The estimator of the estimation risk is given in Sect. 2.3. Next, a concrete threshold C is proposed in view of the Bayes error rate. With these results combined, the p/n criterion is proposed in an explicit form (Sect. 3.1). As an application of the p/n criterion, the bin number problem in a multinomial distribution or a histogram is considered (Sect. 3.2). In Sect. 3.3, the algorithm for calculating the p/n criterion in the case of an exponential family is described. In Sect. 3.4, the use of the p/n criterion is demonstrated for two practical examples.

2 Estimation risk for general case and exponential family

In this section, the asymptotic expansion with respect to n of the estimation risk (2) is presented up to the first-order term for a general distribution, and up to the second-order term for an exponential family distribution.

Hartigan (1998) derives the asymptotic expansion of the estimation risk (2) up to the second order under the assumption that g(x) belongs to \(\mathcal {M}\). The result here extends his result in the sense that the true distribution is not necessarily located in \(\mathcal {M}\).

On the risk for an exponential family, the most relevant work is that of Barron and Sheu (1991). They consider the convergence rate of the K–L divergence (not the risk, but the divergence itself) for an exponential family on a compact set. Their interest lies in the closeness between g(x) and \(g(x;\hat{\theta })\), while this research focuses on the closeness between \(g(x;\theta _*)\) and \(g(x;\hat{\theta })\).

2.1 Estimation risk for general case

Taylor expansion of

$$\begin{aligned} D[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = \int g(x;\theta _*) \log (g(x;\theta _*) /g(x;\hat{\theta }) ) d\mu \end{aligned}$$

as a function of \(\hat{\theta }\) around \(\theta _*\) is considered:

$$\begin{aligned}&D[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] \\&\quad = -\sum _i \int \frac{\partial \ }{\partial \theta ^i} g(x;\theta )\bigg |_{\theta =\theta _*} d\mu \ (\hat{\theta }^i - \theta _*^i) \\&\qquad + \frac{1}{2} \sum _{i,j} \int g(x;\theta _*)\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \bigg |_{\theta =\theta _*}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta ) \bigg |_{\theta =\theta _*}\Bigr ) d\mu \\&\qquad \times (\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j) \\&\qquad - \frac{1}{2}\sum _{i,j} \int \frac{\partial ^2 \ \ \ }{\partial \theta ^i \partial \theta ^j} g(x, \theta )\bigg |_{\theta =\theta _*} d\mu \ (\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j)\\&\qquad - \frac{1}{3!}\sum _{i_1,i_2,i_3}\int g(x;\theta _*) \frac{\partial ^3\log g(x, \theta )}{\partial \theta ^{i_1}\partial \theta ^{i_2}\partial \theta ^{i_3}} \bigg |_{\theta =\tilde{\theta }_*} d\mu (\hat{\theta }^{i_1} - \theta _*^{i_1}) (\hat{\theta }^{i_2} - \theta _*^{i_2})(\hat{\theta }^{i_3} - \theta _*^{i_3}), \end{aligned}$$

where \(\tilde{\theta }_*\) is a point between \(\theta _*\) and \(\hat{\theta }\). Because

$$\begin{aligned} \int \frac{\partial \ }{\partial \theta ^i} g(x;\theta ) d\mu =0,\qquad \int \frac{\partial ^2 \ \ }{\partial \theta ^i \partial \theta ^j} g(x, \theta ) d\mu =0,\qquad \forall \theta \in \Theta , \end{aligned}$$

it turns out that

$$\begin{aligned} R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ]&= \frac{1}{2}\sum _{i,j} g^*_{ij}(\theta _*) E[(\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j)] \nonumber \\&\quad - \frac{1}{3!}\sum _{i_1,i_2,i_3} E[\tau _{i_1,i_2,i_3}(\hat{\theta }^{i_1} - \theta _*^{i_1})\cdots (\hat{\theta }^{i_3} - \theta _*^{i_3})]. \end{aligned}$$

Here,

$$\begin{aligned} \tau _{i_1,i_2,i_3} = \int g(x;\theta _*) \frac{\partial ^3 \log g(x, \theta )}{\partial \theta ^{i_1}\partial \theta ^{i_2} \partial \theta ^{i_3}} \bigg |_{\theta =\tilde{\theta }_*} d\mu \end{aligned}$$

and \(g^*_{ij}\) indicates the components of the Fisher metric matrix on \(\mathcal {M}\), given by

$$\begin{aligned} g^*_{ij}(\theta _*)= (G^*(\theta _*))_{ij} = E_{\theta _*}\Bigl [\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \Big |_{\theta =\theta _*}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta )\Big |_{\theta =\theta _*} \Bigr ) \Bigr ].\\ \end{aligned}$$

As \(\theta _*\) is the solution of equation (4) and \(\hat{\theta }\) is its empirical solution (i.e., the M-estimator), the following result holds (see, e.g., Theorem 5.21 of van der Vaart (1998)).

$$\begin{aligned} \sqrt{n}\bigl (\hat{\theta }-\theta _*\bigr ) {\mathop {\rightarrow }\limits ^{d}} N_p (0, \tilde{G}^{-1}G\tilde{G}^{-1}), \end{aligned}$$

where

$$\begin{aligned} g_{ij}(\theta _*) = (G(\theta _*))_{ij}&= E\Bigl [\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(X;\theta )\Big |_{\theta =\theta _*} \frac{\partial \ }{\partial \theta ^j}\log g(X;\theta )\Big |_{\theta =\theta _*}\Bigr )\Bigr ], \\ \tilde{g}_{ij}(\theta _*) = (\tilde{G}(\theta _*))_{ij}&= -E\Bigl [\frac{\partial ^2 \ \quad }{\partial \theta ^j \partial \theta ^i} \log g(X;\theta )\Big |_{\theta =\theta _*}\Bigr ] . \end{aligned}$$

For a general distribution, the estimation risk is asymptotically given as follows.

Theorem 1

$$\begin{aligned} R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = (2n)^{-1} \mathrm {tr} \Bigl (\tilde{G}(\theta _*)^{-1}G(\theta _*)\tilde{G}(\theta _*)^{-1}G^*(\theta _*) \Bigr ) + O(n^{-2}). \end{aligned}$$
(5)

Because the \(n^{-2}\)-order term is prohibitively lengthy, incorporating it into the p/n criterion would make the result unsuitable for practical use; hence, it is omitted here. (Interested readers are referred to Theorem 1 of Sheena (2021), where the proof of the whole expansion can also be found.)

Note that, if g(x) lies within the model, then \(G=\tilde{G}=G^*\); hence, the first-order term equals p/(2n) (for a more general result for the well-specified model, see Sheena (2018)). Thus, the first-order term is mainly determined by p if \(g(x;\theta _*)\) is close to g(x).

2.2 Estimation risk for exponential family

This subsection investigates the estimation risk when the parametric model is an exponential family (for general references on exponential families, see Brown (1986), Barndorff-Nielsen (2014) and Sundberg (2019)). In the case of the exponential family, the \(n^{-2}\)-order term in the asymptotic expansion of the estimation risk has a simpler form.

Let the model \(\mathcal {M}\) be given by

$$\begin{aligned} \mathcal {M} = \Bigl \{g(x ;\theta ) = \exp \Bigl (\sum _{i=1}^p \theta ^i \xi _i(x) -\Psi (\theta )\Bigr ) \ \Big | \theta \in \Theta \Bigr \}. \end{aligned}$$
(6)

Here, \(\Psi (\theta )\) is the cumulant-generating function of the \(\xi _i\) terms:

$$\begin{aligned} \Psi (\theta ) = \log \int \exp \Bigl (\sum _{i=1}^p \theta ^i \xi _i(x)\Bigr ) d\mu . \end{aligned}$$

The “dual coordinate” \(\eta \) is defined as

$$\begin{aligned} \eta _i(\theta ) = \frac{\partial \Psi (\theta )}{\partial \theta ^i} = E_{\theta }[\xi _i], \quad i=1,\ldots ,p. \end{aligned}$$

In particular, from the definition of \(\theta _*\) (see (4)),

$$\begin{aligned} \eta _i^* = \eta _i(\theta _*) = E_{\theta _*}[\xi _i] =E[\xi _i], \quad i=1,\ldots ,p. \end{aligned}$$

The last equality states that the means of the \(\xi _i\) coincide under g(x) and \(g(x;\theta _*)\). It is known that \(g(x;\theta _*)\) maximizes the Shannon entropy among all probability distributions of \((\xi _1,\ldots ,\xi _p)\) with the given means \(E[\xi _i],\ i=1,\ldots ,p\) (the “entropy maximization property” of an exponential family; see, e.g., Wainwright and Jordan (2008)). Recall that the K–L divergence is the difference between the cross-entropy and the Shannon entropy.

The \(\eta \) coordinate is easily estimated. In fact, \(\hat{\eta }\), the MLE for \(\eta \), is the sample mean of \(\xi \). Hence,

$$\begin{aligned} \hat{\eta }_i= \frac{\partial \Psi }{\partial \theta ^i}(\hat{\theta }) = \bar{\xi }_i \bigl (= n^{-1}\sum _{t=1}^n \xi _i(X_t)\bigr ). \end{aligned}$$
(7)

In contrast, \(\hat{\theta }\) is difficult to obtain explicitly because \(\Psi \) or its derivatives cannot be obtained in closed form for a complex model. This can pose a serious obstacle to applying an exponential family model to a practical problem, and is discussed in Sect. 3.3.

Let the matrix \(\ddot{\Psi }(\theta )\) be defined by

$$\begin{aligned} (\ddot{\Psi }(\theta ))_{ij}= \frac{\partial ^2 \Psi (\theta ) }{\partial \theta ^i\partial \theta ^j} = E_{\theta }[(\xi _i-\eta _i)(\xi _j-\eta _j)], \quad 1 \le i,j \le p. \end{aligned}$$

Thus, \(\ddot{\Psi }\) is a covariance matrix of the \(\xi _i\) terms under \(g(x;\theta )\); hence, it is positive definite. Therefore, \(\Psi (\theta )\) is a convex function. The notable property

$$\begin{aligned} g^*_{ij}(\theta ) =\tilde{g}_{ij}(\theta ), \quad 1 \le i,j \le p, \quad \forall \theta \end{aligned}$$

is proven by the fact that both sides are equal to \((\ddot{\Psi }(\theta ))_{ij}\).

The following notation is used for the third- and fourth-order cumulants:

$$\begin{aligned}&\kappa _{ijk} = E[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)] \\&\kappa ^*_{ijk} = E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)]= \frac{\partial ^3\Psi (\theta _*)}{\partial \theta ^i \partial \theta ^j \partial \theta ^k}\\&\kappa ^*_{ijkl} = E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)]E_{\theta _*}[(\xi _k-\eta ^*_k)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _k-\eta ^*_k)]E_{\theta _*}[(\xi _j-\eta ^*_j)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _l-\eta ^*_l)]E_{\theta _*}[(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)]=\frac{\partial ^4\Psi (\theta _*)}{\partial \theta ^i \partial \theta ^j \partial \theta ^k \partial \theta ^l} \end{aligned}$$

for \(1\le i, j, k,l \le p\).

The next theorem states the asymptotic expansion of the estimation risk for an exponential family distribution. In the case of an exponential family, the second-order term is relatively simple and can be used in practice when incorporated into the p/n criterion proposed in the next section.

In the theorem, for brevity, Einstein notation is used and the dependency on \(\theta _*\) is omitted; e.g., G for \(G(\theta _*)\) and \(\tilde{g}^{ij}\) for \(\tilde{g}^{ij}(\theta _*)\).

Theorem 2

If the parametric model is an exponential family, the estimation risk is given by

$$\begin{aligned} \begin{aligned}&R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = \frac{1}{2n} \mathrm {tr} \Bigl (\tilde{G}^{-1}G \Bigr ) \\&\quad +\frac{1}{24n^2}\Bigl [-8\tilde{g}^{uk}\tilde{g}^{ls}\tilde{g}^{mt}\kappa _{kst}\kappa ^*_{lmu}\\&\quad +9\tilde{g}^{ko}\tilde{g}^{lu}\tilde{g}^{sv}\tilde{g}^{tw}\tilde{g}^{hm}\kappa ^*_{lmo}\kappa ^*_{sth} (g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv})\\&\quad -3\tilde{g}^{kw}\tilde{g}^{ls}\tilde{g}^{mu}\tilde{g}^{tv}\kappa ^*_{lmtw} (g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su})\Bigr ] + O(n^{-3}). \end{aligned} \end{aligned}$$
(8)

Proof

The calculation is carried out straightforwardly from the expansion for the general distribution. See Sheena (2021) for the proof. \(\square \)

The estimation risk up to the second-order term is determined by the moments of the \(\xi _i\) terms under g(x), namely \(g_{ij}\) and \(\kappa _{ijk}\), and by their moments under \(g(x;\theta _*)\), namely \(\tilde{g}^{ij}\), \(\kappa ^*_{ijk}\), and \(\kappa ^*_{ijkl}\).

2.3 Estimator of estimation risk

We will use Theorems 1 and 2 to approximate the estimation risk. In order to establish the criterion (3), we need an estimator of the (approximated) estimation risk. The moments contained in (5) or (8) need to be estimated: the second moments (Fisher information metrics)

$$\begin{aligned} G^*=(g^*_{ij}),\quad \tilde{G}=(\tilde{g}_{ij}), \quad G=(g_{ij}) \end{aligned}$$

and the cumulants

$$\begin{aligned} \kappa _{ijk},\quad \kappa ^*_{ijk}, \quad \kappa ^*_{ijkl},\qquad 1\le i,j,k,l \le p. \end{aligned}$$

Naive estimators of these quantities (denoted by the “hat” mark: \(\hat{G}\), \(\hat{\kappa }_{ijk}\), etc.) are obtained by replacing \(\theta _*\) with the MLE \(\hat{\theta }\) and the expectation \(E[\cdot ]\) with the empirical mean.

First, the estimators of the second moments are given as follows:

$$\begin{aligned} (\hat{G})_{ij}&= n^{-1}\sum _{t=1}^n \frac{\partial }{\partial \theta ^i}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }} \frac{\partial }{\partial \theta ^j}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }}\\ (\hat{\tilde{G}})_{ij}&= -n^{-1}\sum _{t=1}^n \frac{\partial ^2 }{\partial \theta ^i\partial \theta ^j}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }}\\ (\hat{G}^*)_{ij}&= \int g(x;\hat{\theta })\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \Big |_{\theta =\hat{\theta }}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta )\Big |_{\theta =\hat{\theta }} \Bigr ) d\mu . \\ \end{aligned}$$

Now we have the p/n criterion for a general distribution with a given C.

Criterion for a general distribution

$$\begin{aligned} C \ge \frac{1}{2n}\mathrm {tr} \Bigl (\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^* \Bigr ) \end{aligned}$$
(9)
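
As an illustration (our sketch, not the authors' code), criterion (9) can be evaluated with a few lines of R for a simple two-parameter model, here a normal \(N(\mu ,\sigma ^2)\) fitted to possibly non-normal data. The per-observation scores and Hessians are written analytically, and \(\hat{G}^*\) is the Fisher information of the fitted normal.

```r
## Sketch of criterion (9) for the normal model g(x; mu, sigma) (not the authors' code).
## Scores: d/dmu = (x - mu)/sigma^2,  d/dsigma = -1/sigma + (x - mu)^2/sigma^3.
pn_criterion_normal <- function(x, alpha = 0.05) {
  n  <- length(x)
  mu <- mean(x); s <- sqrt(mean((x - mu)^2))   # MLE of (mu, sigma)
  z  <- x - mu
  score  <- cbind(z / s^2, -1 / s + z^2 / s^3) # n x 2 matrix of scores at the MLE
  G_hat  <- crossprod(score) / n               # empirical outer product of scores
  H12    <- -2 * z / s^3                       # mixed second derivative
  H22    <- 1 / s^2 - 3 * z^2 / s^4            # second derivative in sigma
  Gt_hat <- -matrix(c(-1 / s^2, mean(H12), mean(H12), mean(H22)), 2, 2)  # minus averaged Hessian
  Gs_hat <- diag(c(1 / s^2, 2 / s^2))          # Fisher information of N(mu, sigma^2)
  risk1  <- sum(diag(solve(Gt_hat) %*% G_hat %*% solve(Gt_hat) %*% Gs_hat)) / (2 * n)
  list(estimated_risk = risk1, threshold = 8 * alpha^2, ok = risk1 < 8 * alpha^2)
}
set.seed(1)
pn_criterion_normal(rexp(300), alpha = 0.05)   # clearly non-normal data
```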

Next, the criterion for an exponential family is considered. \(\hat{G}\) equals the sample covariance matrix of the \(\xi _i\) terms, \(\hat{\Sigma }\):

$$\begin{aligned} \hat{G} = \hat{\Sigma },\quad {\hat{g}}_{ij} = \bigl (\hat{\Sigma }\bigr )_{ij},\quad \bigl (\hat{\Sigma }\bigr )_{ij} = n^{-1}\sum _{t=1}^n (\xi _i(X_t)- \bar{\xi }_i)(\xi _j(X_t)- \bar{\xi }_j), \end{aligned}$$
(10)

where \(\bar{\xi }_i = n^{-1} \sum _{t} \xi _i(X_t).\) Similarly, the estimator of the true third-order cumulant is given by the sample third-order cumulant:

$$\begin{aligned} \hat{\kappa }_{ijk} = n^{-1}\sum _{t=1}^n (\xi _i(X_t)- \bar{\xi }_i)(\xi _j(X_t)- \bar{\xi }_j)(\xi _k(X_t)- \bar{\xi }_k). \end{aligned}$$
(11)
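
For concreteness, (10) and (11) can be computed with a short helper (ours, not from the paper) that takes the \(n\times p\) matrix whose (t, i) entry is \(\xi _i(X_t)\):

```r
## Sample covariance (10) and sample third-order cumulant (11) of the xi terms.
## `xi` is an n x p matrix with (t, i) entry xi_i(X_t); denominators are n, not n - 1.
sample_moments <- function(xi) {
  n  <- nrow(xi); p <- ncol(xi)
  xc <- sweep(xi, 2, colMeans(xi))             # centered xi values
  Sigma_hat <- crossprod(xc) / n               # (10)
  kappa_hat <- array(0, dim = c(p, p, p))      # (11)
  for (i in 1:p) for (j in 1:p) for (k in 1:p)
    kappa_hat[i, j, k] <- mean(xc[, i] * xc[, j] * xc[, k])
  list(Sigma_hat = Sigma_hat, kappa_hat = kappa_hat)
}
```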

Further,

$$\begin{aligned} \hat{\tilde{G}}&= \ddot{\Psi }(\hat{\theta } ), \quad {\hat{\tilde{g}}}_{ij}= \bigl (\ddot{\Psi }(\hat{\theta })\bigr )_{ij} \end{aligned}$$
(12)
$$\begin{aligned} \hat{\kappa }^*_{ijk}&= \frac{\partial ^3 \ }{\partial \theta ^i \partial \theta ^j \partial \theta ^k}\Psi (\theta )\Big |_{\theta =\hat{\theta }} \end{aligned}$$
(13)
$$\begin{aligned} \hat{\kappa }^*_{ijkl}&=\frac{\partial ^4 \ }{\partial \theta ^i \partial \theta ^j \partial \theta ^k \partial \theta ^l}\Psi (\theta )\Big |_{\theta =\hat{\theta }}. \end{aligned}$$
(14)

Consequently, for an exponential family, the p/n criterion is given as follows.

Criterion for an exponential family

$$\begin{aligned} \begin{aligned} C&\ge \frac{1}{2n} \mathrm {tr} \Bigl (\hat{\Sigma }(\ddot{\Psi }(\hat{\theta }) )^{-1}\Bigr )\\&+\frac{1}{24n^2}\Bigl [-8\hat{\tilde{g}}^{uk}\hat{\tilde{g}}^{ls}\hat{\tilde{g}}^{mt}\hat{\kappa }_{kst}\hat{\kappa }^*_{lmu} +9\hat{\tilde{g}}^{ko}\hat{\tilde{g}}^{lu}\hat{\tilde{g}}^{sv}\hat{\tilde{g}}^{tw}\hat{\tilde{g}}^{hm}\hat{\kappa }^*_{lmo}\hat{\kappa }^*_{sth} (\hat{g}_{ku}\hat{g}_{vw}+\hat{g}_{kv}\hat{g}_{uw}\\&+\hat{g}_{kw}\hat{g}_{uv})-3\hat{\tilde{g}}^{kw}\hat{\tilde{g}}^{ls}\hat{\tilde{g}}^{mu}\hat{\tilde{g}}^{tv}\hat{\kappa }^*_{lmtw} (\hat{g}_{ks}\hat{g}_{uv}+\hat{g}_{ku}\hat{g}_{sv}+\hat{g}_{kv}\hat{g}_{su})\Bigr ]. \end{aligned}\nonumber \\ \end{aligned}$$
(15)

How to determine C in (9) or (15) is studied in the next section. Once C is determined, we can use these criteria for the two problems introduced in Sect. 1, namely the sample size problem and the model selection problem.

3 Criterion for model complexity and sample size

In this section, we complete the p/n criterion by providing a reasonable threshold C for (9) or (15) (Sect. 3.1). As an immediate application of the criterion, we deal with the bin number problem in a multinomial distribution or a histogram (Sect. 3.2). We also state the algorithm for calculating the \(n^{-2}\)-order term in (15) (Sect. 3.3). Finally, the use of the p/n criterion is demonstrated for two practical examples (Sect. 3.4).

3.1 Choice of threshold

Because the value of the divergence (1) or the risk (2) has no absolute standard by itself, we relate it to another reasonable standard. One frequently used measure of the closeness between two distributions is the error rate, which is more intuitive than the divergence and is suitable for setting a threshold. Let \(g_i(x),\ i=1,2\), be p.d.f.s. If both \(g_i(x),\ i=1,2\), are known, the Bayes discriminant rule (with the noninformative prior) is as follows.

For the sample X from either \(g_1(x)\) or \(g_2(x)\),

$$\begin{aligned} \frac{g_{i_1}(X)}{g_{i_2}(X)} > 1 \Longleftrightarrow \text { Judge that }X\text { is generated from }g_{i_1}(x). \end{aligned}$$

The Bayes error rate, Er, i.e., the probability that this rule gives an error, is formally defined by

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] = \frac{1}{2}\int \min \Bigl (g_1(x), g_2(x)\Bigr ) d\mu . \end{aligned}$$

The next theorem states the relation between Er and the K–L divergence.

Theorem 3

If \(D[g_1(x) \,| \, g_2(x)] \le \delta \), then

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] \ge \min \{ t \,|\, (x,t) \in A(\delta ) \}, \end{aligned}$$

where

$$\begin{aligned} A(\delta )= & {} \Bigl \{ (x,t) \,\Big |\, x\log {\Bigl (\frac{1-2t}{x} + 1\Bigr )}+(1-x) \log {\Bigl (\frac{2t-1}{1-x}+1\Bigr )}\\= & {} -\delta ,\quad 0< x< 2t < 1\Bigr \}. \end{aligned}$$

Proof

See Appendix. \(\square \)

Corollary 1

Let \(\delta = D[g_1(x)\, | \, g_2(x)]\) and let \(\alpha \) be a certain small positive number (e.g. \(\alpha =0.05, 0.01\)). If

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A(\delta ) \} \ge 1/2- \alpha , \end{aligned}$$
(16)

then

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] \ge 1/2- \alpha . \end{aligned}$$

Analytical calculation of \(\min \{ t \,|\, (x,t) \in A(\delta ) \}\) is difficult. An approximation for t close to 1/2 is given here. As \(\log {(1+x)} \doteqdot x - x^2/2\) around \(x=0\),

$$\begin{aligned}&x\log {\Bigl (\frac{1-2t}{x} + 1\Bigr )}+(1-x) \log {\Bigl (\frac{2t-1}{1-x}+1\Bigr )} \\&\doteqdot x\Bigl (\frac{1-2t}{x}\Bigr ) -\frac{x}{2}\Bigl (\frac{1-2t}{x}\Bigr )^2 + (1-x)\frac{2t-1}{1-x}-\frac{(1-x)}{2}\Bigl (\frac{2t-1}{1-x}\Bigr )^2= -\frac{1}{2}\frac{(1-2t)^2}{x(1-x)}. \end{aligned}$$

Therefore, \(A(\delta )\) is approximated by

$$\begin{aligned} A^*(\delta ) = \Bigl \{ (x,t) \,\Big |\, t= \frac{1}{2}\Bigl (1-\sqrt{2\delta x(1-x)}\Bigr ),\quad 0< x< 2t < 1\Bigr \}. \end{aligned}$$

Note that

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A^*(\delta ) \} \ge \min _{0<x<1}{\frac{1}{2}\Bigl (1-\sqrt{2\delta x(1-x)}\Bigr ) }= \frac{1}{2}-\sqrt{\delta /8}. \end{aligned}$$

Hence, the condition \(\sqrt{\delta /8} \le \alpha \), or equivalently \(\delta \le 8\alpha ^2\), is approximately sufficient for (16). Let \(C_\alpha \) denote the solution in \(\delta \) of the equation

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A(\delta ) \} = 1/2- \alpha , \end{aligned}$$

or more simply, let \(C_\alpha \) be given by

$$\begin{aligned} C_\alpha = 8 \alpha ^2. \end{aligned}$$
(17)

In the latter case, if \(\alpha =0.05\ (0.01)\), then \(C_\alpha =1/50\ (1/1250)\). The final form of the p/n criterion is obtained by substituting \(C_\alpha \) for C in (9) or (15).
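
The following R sketch (added here for illustration) computes \(C_\alpha =8\alpha ^2\) and numerically evaluates \(\min \{ t \,|\, (x,t) \in A(\delta ) \}\) over a grid of x, so the approximation \(1/2-\sqrt{\delta /8}\) can be checked.

```r
## Sketch (not the authors' code): threshold C_alpha and a numerical check of
## min{ t : (x, t) in A(delta) } against the approximation 1/2 - sqrt(delta / 8).
C_alpha <- function(alpha) 8 * alpha^2
min_t_exact <- function(delta, xgrid = seq(0.01, 0.99, by = 0.01)) {
  f <- function(t, x) x * log((1 - 2 * t) / x + 1) +
    (1 - x) * log((2 * t - 1) / (1 - x) + 1) + delta
  roots <- sapply(xgrid, function(x)
    uniroot(f, lower = x / 2 + 1e-9, upper = 0.5 - 1e-9, x = x)$root)
  min(roots)                                  # minimum over the x grid
}
delta <- C_alpha(0.05)                        # = 0.02
c(exact = min_t_exact(delta), approx = 0.5 - sqrt(delta / 8))   # both close to 0.45
```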

3.2 p/n Criterion for multinomial distribution

In this section, we present a formula for the bin number of a multinomial distribution using the p/n criterion. The bin number problem in a histogram can be treated similarly. Although several formulas have been proposed for the bin number (or the bin width) of a histogram, such as Sturges’ formula and the Freedman–Diaconis formula (see Chapter 3 of Scott (2015)), the formula here is derived from a new perspective.

In view of the relation between the true distribution g(x) and the information projection \(g(x;\theta _*)\), a multinomial distribution can be seen as an approximation by a step-function model. Let

$$\begin{aligned} \mathcal {M} = \{g(x;m)\,|\, m=(m_0,\ldots ,m_p)\} \end{aligned}$$

with

$$\begin{aligned} g(x;m) = \sum _{i=0}^p I(x \in S_i) \frac{m_i}{Vol(S_i)}, \end{aligned}$$

where \(S_i,\ i=0,1,\ldots ,p\), form a partition of the range of x, with volume

$$\begin{aligned} Vol(S_i) = \int _{S_i} 1 d\mu (x), \end{aligned}$$

and \(I(x \in S_i)\) is the indicator function of \(S_i\). In this case, from (4), the information projection \(g(x; m^*)\) is given by \(m^*_i = P(X \in S_i)\), the probability under g(x). The step-function model is not an exponential family. However, the Kullback–Leibler divergence between two step functions (where \(d\mu \) is the continuous measure) equals the divergence between the two corresponding multinomial distributions (where \(d\mu \) is the counting measure). Hence, the argument on the estimation risk can be deduced from that for the multinomial distribution model. Notably, if X is originally a discrete random variable, the model always contains g(x).

Consider a multinomial distribution with \(p+1\) possible values \(x_i, i=0,\ldots ,p\), with the corresponding probabilities \(m=(m_0,\ldots ,m_p)\). This is an exponential family (6), where

$$\begin{aligned}&\theta ^i = \log (m_i/m_0), \quad i=1,\ldots , p,\\&\xi _i(x) = {\left\{ \begin{array}{ll} 1,&{}\text { if }x = x_i,\\ 0,&{}\text { otherwise,} \end{array}\right. } \quad i=1,\ldots ,p \end{aligned}$$

and \(d\mu \) is the counting measure on \(\{x_0,x_1,\ldots ,x_p\}\). Here,

$$\begin{aligned} \Psi (\theta ) = \log \bigl (1+\sum _{i=1}^p \exp (\theta ^i)\bigr )=-\log m_0=-\log \bigl (1-\sum _{i=1}^p m_i\bigr ). \end{aligned}$$
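
As a quick numerical illustration (ours, not from the paper), the dual-coordinate relation \(\eta _i(\theta )=\partial \Psi /\partial \theta ^i=m_i\) can be verified in R for a chosen m:

```r
## Sketch (not the authors' code): check dPsi/dtheta^i = m_i by central differences.
m     <- c(0.3, 0.2, 0.1, 0.25, 0.15)        # cell probabilities m_0, ..., m_p (p = 4)
theta <- log(m[-1] / m[1])                   # theta^i = log(m_i / m_0)
Psi   <- function(th) log(1 + sum(exp(th)))  # = -log m_0
eps   <- 1e-6
eta   <- sapply(seq_along(theta), function(i) {
  e <- replace(numeric(length(theta)), i, eps)
  (Psi(theta + e) - Psi(theta - e)) / (2 * eps)
})
rbind(eta = eta, m = m[-1])                  # the two rows agree
```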

The asymptotic expansion of the estimation risk up to the second order can be derived as follows (this corresponds to equation (41) of Sheena (2018) with \(\alpha =-1\)).

$$\begin{aligned} R[g(x;\theta )\, | \, g(x;\hat{\theta })]=\frac{p}{2n} + \frac{1}{12n^2} (M-1) +O(n^{-3}), \qquad M = \sum _{i=0}^p {m_i}^{-1}, \end{aligned}$$
(18)

where \(\theta \), or equivalently \(m=(m_1,\ldots ,m_p)\), is the free parameter of the true distribution. Note that if some of the \(m_i\) are close to zero, the convergence becomes considerably slower.

If we combine the first-order approximation in (18) with the threshold (17), the p/n criterion becomes

$$\begin{aligned} \frac{p}{n} \le 16 \alpha ^2. \end{aligned}$$

If we adopt \(\alpha =0.05\ (0.01)\), then the sample size n or the bin number \(p+1\) is determined by the following formula.

Simple criterion for the sample size or the bin number

$$\begin{aligned} \frac{p}{n} \le 1/25(1/625). \end{aligned}$$
(19)

The second-order approximation gives the following p/n criterion:

$$\begin{aligned} 96 n^2 \alpha ^2 - 6 n p - (\hat{M}-1) > 0, \end{aligned}$$

where

$$\begin{aligned} \hat{M}=\sum _{i=0}^{p} {\hat{m}_i}^{-1} \end{aligned}$$

and \(\hat{m}_i\) is the MLE, i.e., the sample relative frequency, for each i. Applying the criterion to determine n gives the formula

$$\begin{aligned} n \ge \frac{3p+\sqrt{9p^2+96\alpha ^2(\hat{M}-1)}}{96\alpha ^2}. \end{aligned}$$
(20)

In contrast, if the criterion is used for the bin number problem, the formula is given by

$$\begin{aligned} 6np + \hat{M} < 96n^2\alpha ^2+1. \end{aligned}$$
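
As a quick illustration (our sketch, not the authors' code), the approximation (18), the sample-size formula (20), and the bin-number check can be written as short R helpers acting on observed cell counts (all cells are assumed to have nonzero counts):

```r
## Second-order risk approximation (18), sample-size formula (20), and the
## bin-number check, from a vector of observed cell counts of length p + 1.
risk_multinom <- function(n, m) {                # (18), m = cell probabilities
  p <- length(m) - 1
  p / (2 * n) + (sum(1 / m) - 1) / (12 * n^2)
}
n_required <- function(counts, alpha = 0.05) {   # (20)
  p     <- length(counts) - 1
  M_hat <- sum(sum(counts) / counts)             # sum of 1 / m_hat_i
  ceiling((3 * p + sqrt(9 * p^2 + 96 * alpha^2 * (M_hat - 1))) / (96 * alpha^2))
}
bins_ok <- function(counts, alpha = 0.05) {      # bin-number criterion
  n     <- sum(counts)
  p     <- length(counts) - 1
  M_hat <- sum(n / counts)
  6 * n * p + M_hat < 96 * n^2 * alpha^2 + 1
}
```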

Use of these criteria for practical examples is discussed in Sect. 3.4.

3.3 Algorithm for p/n criterion of exponential family

This section describes the calculation of the right-hand side of (15). If the function \(\Psi (\theta )\) can be calculated analytically, the algorithm is simply the following.

Step 1. Calculate \(\hat{\eta }_i=\bar{\xi }_i,\ i=1,\ldots ,p\), from the sample.

Step 2. Solve the simultaneous equations w.r.t. \(\theta \) in (7) to obtain \(\hat{\theta }=(\hat{\theta }^1,\ldots ,\hat{\theta }^p)\):

$$\begin{aligned} \hat{\eta }_i = \eta _i(\hat{\theta }) = \frac{\partial \Psi }{\partial \theta ^i}(\hat{\theta }),\qquad i=1,\ldots ,p. \end{aligned}$$

Step 3. Calculate (12), (13), and (14) from the derivatives of \(\Psi (\theta )\) at \(\hat{\theta }\).

Step 4. Calculate (10) and (11) from the sample.

Step 5. Calculate the right-hand side of (15) and compare it with \(C_\alpha \).

Often, \(\Psi (\theta )\) is not explicitly given, especially for a complex model. Then, \(\hat{\theta }\) can be calculated iteratively using the Newton–Raphson method with the Jacobian matrix (12). Because \(\ddot{\Psi }(\theta )\) is the variance–covariance matrix of the \(\xi _i\) terms under the \(g(x;\theta )\) distribution, its value can be approximated from a generated sample. The alternative steps are as follows.

Step 2'. Iteratively search for \(\hat{\theta }\) with

$$\begin{aligned} \theta ^{(n+1)} = \theta ^{(n)} - \bigl (\eta (\theta ^{(n)})-\hat{\eta }\bigr )\bigl (\ddot{\Psi }(\theta ^{(n)})\bigr )^{-1}, \end{aligned}$$

where \(\eta (\theta ^{(n)})\) and \(\ddot{\Psi }(\theta ^{(n)})\) are approximated by the sample mean and the sample covariance matrix of the \(\xi _i\) terms of a sample generated from the \(g(x;\theta ^{(n)})\) distribution.

Further, (12), (13), and (14) can also be approximated using the generated sample.

Step 3'. Approximate (12), (13), and (14) using the sample moments and cumulants, where the sample is generated from \(g(x;\hat{\theta })\).

The point here is that \(\Psi (\theta )\) is not required for sample generation in Steps 2' and 3' if methods such as MCMC (which require no normalizing constant) are used. Although Steps 2' and 3' are computationally heavy, they enable the construction of a complex model without calculating \(\Psi \).
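
To make Step 2' concrete, the sketch below (ours, not the authors' code) runs the Monte Carlo version of the Newton iteration for a toy family with \(\xi _1(x)=x\), \(\xi _2(x)=x^2\) and the Lebesgue reference measure, so that \(g(x;\theta )\) is a normal distribution and direct sampling happens to be easy; in a genuinely complex model the sampler would be replaced by MCMC, as noted above.

```r
## Sketch of Step 2' (not the authors' code) for the toy family with xi = (x, x^2):
## g(x; theta) is N(-theta1 / (2 theta2), -1 / (2 theta2)) with theta2 < 0, and
## eta(theta), Psi-double-dot(theta) are approximated by Monte Carlo.
set.seed(1)
x_data  <- runif(500)                          # observed sample (any data will do)
eta_hat <- c(mean(x_data), mean(x_data^2))     # (7): MLE of the dual coordinate

sample_model <- function(theta, N = 2e5) {     # direct sampler for this toy model
  mu <- -theta[1] / (2 * theta[2]); s <- sqrt(-1 / (2 * theta[2]))
  rnorm(N, mu, s)
}
theta <- c(0, -0.5)                            # initial value: the standard normal
for (it in 1:20) {
  y      <- sample_model(theta)
  xi     <- cbind(y, y^2)
  eta_mc <- colMeans(xi)                       # Monte Carlo estimate of eta(theta)
  V_mc   <- cov(xi)                            # Monte Carlo estimate of Psi-double-dot
  theta  <- theta - solve(V_mc, eta_mc - eta_hat)   # Newton update of Step 2'
  theta[2] <- min(theta[2], -1e-6)             # keep theta2 negative (sketch-level guard)
}
## Closed-form check, available only because this toy model happens to be Gaussian:
v <- mean(x_data^2) - mean(x_data)^2
rbind(iterated = theta, closed_form = c(mean(x_data) / v, -1 / (2 * v)))
```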

3.4 Real data examples for p/n criterion

This section demonstrates use of the p/n criterion for a particular problem through two practical examples under the exponential family model.

Example 1

(Red Wine) The first example is a well-known dataset on wine quality, taken from the U.C.I. Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).

Only the red wine data are used. The sample size is 1599, and the variables consist of 11 chemical substances (continuous variables) and a “quality” index (integers from 3 to 8). The vector of the chemical substances and the “quality” variable are denoted by \(x^{(1)} =(x^{(1)}_{1},\ldots , x^{(1)}_{11})\) and \(x^{(2)}\), respectively. We randomly divided the sample into two halves, one of which (“data_base”) was used for the model formulation and the other (“data_est”) for the estimation of the parameters.

For the model formulation, we determined the following: the normalization method for the original data, the reference (probability) measure \(d\mu (x)\), and the \(\xi \) elements. Using “data_base”, we proceed as follows.

  1. Each variable \(x^{(1)}_i\ (i=1,\ldots ,11)\) is divided by twice its maximum so that its range is \([0,\ 1)\). Further, 2 is subtracted from each “quality” index to give a range of \(\{1,2,\ldots ,6\}\).

  2. As \(d\mu (x)\), 11 independent Beta distributions are applied to \(x^{(1)}\) so that their means and variances are equal to those of “data_base”. The multinomial distribution of \(x^{(2)}\) is adopted, using each category’s sample relative frequency as the category probability parameter (say, \(m_i, \ i=1,\ldots ,6\)). In addition, \(x^{(1)}\) and \(x^{(2)}\) are taken to be independent.

Consequently, \(d\mu \) is selected as

$$\begin{aligned} x= & {} (x^{(1)},x^{(2)}),\quad d\mu (x) = \prod _{i=1}^{11} {x^{(1)}_i}^{(\beta _{1i}-1)} {(1-x^{(1)}_i)}^{(\beta _{2i}-1)} d(x^{(1)})\\&\times \prod _{i=1}^6 m_i^{I(x^{(2)}=i)} d^*(x^{(2)}), \end{aligned}$$

where \(d(x^{(1)})\) is the Lebesgue measure on \([0,\ 1]^{11}\), \(d^*(x^{(2)})\) is the counting measure on \(\{1,2,\ldots ,6\}\), and \(I(\cdot )\) is the indicator function. Further, \(\beta _{1i}\), \(\beta _{2i}\), and \(m_i\) satisfy the relations

$$\begin{aligned}&\frac{\beta _{1i}}{\beta _{1i}+\beta _{2i}} = \text { Sample mean of }x^{(1)}_i,\quad i=1,\ldots ,11\\&\frac{\beta _{1i}\beta _{2i}}{(\beta _{1i}+\beta _{2i})^2(\beta _{1i}+\beta _{2i}+1)} = \text {Sample variance of }x^{(1)}_i,\quad i=1,\ldots ,11\\&m_i = \text {Relative frequency of }i\text { in }x^{(2)} \end{aligned}$$

  3. The candidates for the \(\xi _i\) terms are as follows:

$$\begin{aligned}&\xi _1(x) = x^{(1)}_1 x^{(1)}_2, \quad \xi _2(x)=x^{(1)}_1 x^{(1)}_3, \quad \ldots \quad \xi _{10}(x) = x^{(1)}_1x^{(1)}_{11} \\&\xi _{11}(x) = x^{(1)}_2 x^{(1)}_3,\quad \ldots \quad \xi _{19}(x) = x^{(1)}_2 x^{(1)}_{11}\\& \cdots \\&\xi _{55}(x) = x^{(1)}_{10} x^{(1)}_{11} \end{aligned}$$

and

$$\begin{aligned} \xi _{56}(x) = x^{(1)}_1x^{(2)}, \quad \ldots \quad \xi _{66}(x)= x^{(1)}_{11}x^{(2)}. \end{aligned}$$

Because some of these terms are highly correlated, we eliminate one member of each pair with correlation higher than 0.95. The following 19 \(\xi _i\) terms were removed from the full model:

$$\begin{aligned} \xi _i,\ i=8, 17, 19, 24, 25, 27, 32, 34, 38, 40, 43, 45, 46, 47, 49, 53, 58, 62, 64. \end{aligned}$$
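
A hypothetical R sketch (ours, not the published code) of this construction and of the correlation-based elimination is given below; `wine` is assumed to be a data frame whose first 11 columns are the rescaled chemical variables and whose 12th column is the shifted quality index, and the exact set of removed terms depends on the order in which pairs are scanned.

```r
## Hypothetical sketch: build the 66 xi terms and drop one member of each pair
## with correlation above the cutoff (column names/order are assumptions).
build_xi <- function(wine) {
  x1 <- as.matrix(wine[, 1:11]); x2 <- wine[[12]]
  xi <- matrix(nrow = nrow(wine), ncol = 0)
  for (i in 1:10) for (j in (i + 1):11)        # xi_1, ..., xi_55
    xi <- cbind(xi, x1[, i] * x1[, j])
  for (i in 1:11)                              # xi_56, ..., xi_66
    xi <- cbind(xi, x1[, i] * x2)
  xi
}
drop_correlated <- function(xi, cutoff = 0.95) {
  r <- abs(cor(xi)); keep <- rep(TRUE, ncol(xi))
  for (i in 1:(ncol(xi) - 1)) for (j in (i + 1):ncol(xi))
    if (keep[i] && keep[j] && r[i, j] > cutoff) keep[j] <- FALSE
  xi[, keep, drop = FALSE]
}
```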

Consequently, an exponential family model with \(p=47\) is formulated. As the probability distribution \(g(x;\theta )d\mu \) equals \(d\mu \) when the \(\theta \) terms all equal zero, it is denoted by g(x; 0). Note that the \(g(x ;\theta _*)\) of this model is the closest to g(x; 0) in the sense that

$$\begin{aligned} D[g(x;\theta _*) | g(x;0)] = \min _{h \in \mathcal {H}} D[h(x) | g(x;0)], \end{aligned}$$

where \(\mathcal {H}\) is the set of p.d.f.s h(x) (w.r.t. \(d\mu \)) that satisfy

$$\begin{aligned} E_h[\xi _i(X)] = \int h(x) \xi _i(x) d\mu (x) = E[ \xi _i(X) ], \end{aligned}$$

for each \(\xi _i\) in the model. This is a consequence of the so-called “minimum relative entropy characterization” of an exponential family (see Csiszár (1975)).

Under the formulated exponential family model, the algorithm in the previous section was implemented, and the right-hand side of (15) was calculated using “data_est”, whose size n equals 799. Because of the model complexity, the explicit form of \(\Psi (\theta )\) could not be obtained; hence, the alternative Steps 2' and 3' were used. The R and RStan program codes for the whole risk calculation are available on GitHub (https://github.com/YSheena/P-N_Criteria_Program.git). The first- and second-order terms and the estimation risk (their total in (15)) were as follows:

First-order term: 2.95e-02, Second-order term: -1.30e-04, Estimation Risk: 2.93e-02

Note that the second-order term contributes little to the estimation risk; thus, the first-order approximation seems sufficient for this model and data. With the threshold (17), the equation 2.93e-02 \(=8\alpha ^2\) gives the solution \(\alpha \doteqdot 0.06\). Hence, the Bayes error rate between \(g(x;\hat{\theta })\) and \(g(x;\theta _*)\) is higher than 0.44. If we set the threshold at \(\alpha =0.05\), we must trim the model further. For example, if we eliminate one member of each pair of \(\xi \) elements with correlation higher than 0.9, then p becomes as small as 37. For this model, the estimation risk is lower than the target value \(8\times (0.05)^2=0.02\), as follows:

First-order term: 1.60e-02, Second-order term: 2.04e-04, Estimation Risk: 1.62e-02

Example 2

(Abalone Data) The next example also features a well-known dataset, in this case the physical measurements of abalones (U.C.I. Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Abalone). The data comprise eight attributes (sex, length, diameter, etc.) of 4177 abalones. Here, only two discrete variables were considered: “sex” and “rings,” where “sex” takes the three values “Female,” “Infant,” and “Male,” and “rings” takes integer values from 1 to 29. The frequency of each group classified by “sex” and “rings” is given in Table 1. The original frequencies were aggregated at both ends: in the table, if a cell with a star mark is located to the immediate left or right, the number in the cell is the aggregated one. For example, for the female abalones, the cells with 24 or more rings were aggregated into a frequency of 4. The total number of cells is 63.

A multinomial distribution over the 63 cells was considered; hence, \(p=62\). First, the simple criterion (19) is adopted; then

$$\begin{aligned} p/n = 62/4177 \doteqdot 0.015 < 1/25, \end{aligned}$$

but \(p/n > 1/625\). Consequently, the model distribution is close to the information projection (in this case, the true distribution) to the extent that a Bayes error rate of more than 0.45 is guaranteed, while the stronger guarantee of 0.49 is not attained.

Table 1 Abalones by sex and rings

In order to use the second-order term, M needs to be estimated. From the sample relative frequency \(\hat{m}_i\) of each cell, \(i=0,\ldots ,62\),

$$\begin{aligned} \hat{M} = \sum _{i=0}^{62} {\hat{m}_i}^{-1} = 36128.33. \end{aligned}$$

Use of the sample-size formula (20) yielded

$$\begin{aligned} n \ge 1642, \end{aligned}$$

which indicates that the actual sample size of 4177 is large enough for a Bayes error rate of 0.45. However, to attain a Bayes error rate of 0.49, the required sample size is 38847, which far exceeds the actual sample size.
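
For reference, the arithmetic of this example can be reproduced with formula (20) in a few lines of R (our check, not the authors' code), using the reported \(\hat{M}=36128.33\) and \(p=62\):

```r
## Reproducing the Example 2 figures from formula (20).
n_required_from_M <- function(p, M_hat, alpha)
  ceiling((3 * p + sqrt(9 * p^2 + 96 * alpha^2 * (M_hat - 1))) / (96 * alpha^2))
n_required_from_M(62, 36128.33, 0.05)   # 1642
n_required_from_M(62, 36128.33, 0.01)   # 38847
```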