1 Introduction

Given a data set, one can assume that an unknown probability distribution generates the data as an independent, identically distributed (i.i.d.) sample. Under this assumption, if a parametric distribution model is adopted to “explain” the data, the first task is to find the “best” approximating distribution in the model. Because the true distribution generally lies outside the model (except in rare cases), “best” means “closest” to the true distribution.

Consider the following parametric distribution model:

$$\begin{aligned} \mathcal {M} = \{g(x;\theta )\,|\, \theta =(\theta ^1,\ldots ,\theta ^p) \in \Theta \}, \end{aligned}$$

where \(g(x;\theta )\) is the probability density function (p.d.f.) with respect to a reference measure \(d\mu \) on a measurable space. The p.d.f. of the unknown true distribution with respect to \(d\mu \) is denoted by g(x). If we use a certain divergence \(D[ \cdot \,|\, \cdot ]\) to measure the closeness between g(x) and \(g(x;\theta )\), then the “best” approximating distribution in \(\mathcal {M}\) is given by \(g(x;\theta _*)\), where

$$\begin{aligned} \theta _* = \mathop {\mathrm {arg\,min}}\limits _{\theta \in \Theta } D[ g(x) \,|\,g(x;\theta )] . \end{aligned}$$

Following Csiszár (1975), we will call \(g(x;\theta _*)\) the “information projection” in this paper.

Let \(\hat{\theta }\) denote the maximum likelihood estimator (MLE) based on the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x), and consider the predictive density \(g(x; \hat{\theta })\). Since the MLE converges to \(\theta _*\) in probability as the sample size n increases (see, e.g., Theorem 5.21 of van der Vaart (1998)),

$$\begin{aligned} D[g(x;\theta _*) \,| \, g(x; \hat{\theta })] \end{aligned}$$
(1)

also converges to zero in probability. The predictive density \(g(x; \hat{\theta })\) is obtained by plugging in the MLE; this type of predictive density is called an “estimative density”. Another common way to construct a predictive density is the Bayesian predictive density. For the asymptotic properties of Bayesian predictive densities, see, e.g., Komaki (1996), Hartigan (1998), Komaki (2015) and Zhang et al. (2018).

Take the expectation

$$\begin{aligned} R[g(x;\theta _*) \,| \, g(x; \hat{\theta })] = E\Bigl [D[g(x;\theta _*) \,| \, g(x; \hat{\theta })] \Bigr ] \end{aligned}$$
(2)

with respect to the i.i.d. sample \(\varvec{X}= (X_1,\ldots ,X_n)\) from g(x). Throughout this study, the expectation under g(x) is denoted by \(E[\cdot ]\), while the expectation under \(g(x;\theta _*)\) is denoted by \(E_{\theta _*}[\cdot ]\). We call (2) the “estimation risk” to distinguish it from the “total risk”

$$\begin{aligned} R[g(x) \,| \, g(x; \hat{\theta })] = E\Bigl [D[g(x) \,| \, g(x; \hat{\theta })] \Bigr ]. \end{aligned}$$

The estimation risk converges to zero under some mild conditions. We will use this estimation risk as the measure of the closeness between \(g(x; \theta _*)\) and \( g(x; \hat{\theta })\).

Given the data and the model, we need to know whether \(g(x;\hat{\theta })\) is sufficiently close to the information projection. Thus, with a certain threshold C, the following criterion is considered.

$$\begin{aligned} \hat{R}[g(x;\theta _*) \,| \, g(x; \hat{\theta })] < C, \end{aligned}$$
(3)

where the left-hand side is an estimator of the estimation risk.

This criterion gives a solution to the following two problems.

  • Sample size problem: With the model fixed, it indicates how large the sample size n must be for \(g(x;\hat{\theta })\) to be close to the information projection. If the criterion is not satisfied, we need to collect more samples.

  • Model selection problem: With the sample size fixed, it tells us whether a model is simple enough (in particular, whether the dimension p of the parameter is small enough) to guarantee that \(g(x;\hat{\theta })\) is close to the information projection. If the criterion is not satisfied, simplifying the model could be a remedy.

As seen later in the manuscript, the estimation risk is mainly determined by p/n when the information projection is close to the true distribution; hence, we will call this criterion the “p/n criterion” hereafter.

In this paper, the Kullback–Leibler (K–L) divergence is adopted as the divergence, that is,

$$\begin{aligned} D[ g(x) \,|\,g(x;\theta )] = \int g(x) \log \Bigl (g(x)/g(x;\theta ) \Bigr ) d\mu . \end{aligned}$$

Note that for this divergence, the information projection is characterized by the equations

$$\begin{aligned} E \left[ \frac{\partial \ }{\partial \theta ^i}\log g(X;\theta ) \right] =0,\quad i=1,\ldots ,p,\end{aligned}$$
(4)

and its solution \(\theta _*\) is naturally estimated by the MLE, which is the solution of

$$\begin{aligned} \sum _{t=1}^n \frac{\partial \ }{\partial \theta ^i}\log g(X_t;\theta ) = 0,\quad i=1,\ldots ,p. \end{aligned}$$

For other divergences, the information projection is more complicated, and its natural estimator is not as simple as the MLE.
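
As a concrete illustration (ours, not from the paper): if \(\mathcal {M}\) is the exponential model \(g(x;\theta )=\theta e^{-\theta x}\), equation (4) reduces to \(E[1/\theta - X]=0\), so \(\theta _*=1/E[X]\), and the MLE \(1/\bar{X}\) is its empirical counterpart. The following R sketch checks numerically that the divergence (1) shrinks as n grows when the data are actually gamma distributed (a misspecified case).

```r
## Minimal sketch (not the authors' code): exponential model fitted to gamma data.
## For g(x; theta) = theta * exp(-theta * x), equation (4) gives theta_* = 1 / E[X];
## the MLE is 1 / mean(X), and D[Exp(a) | Exp(b)] = log(a / b) + b / a - 1.
set.seed(1)
shape <- 2; rate <- 3                  # true (gamma) distribution, outside the model
theta_star <- rate / shape             # = 1 / E[X]
kl_exp <- function(a, b) log(a / b) + b / a - 1
for (n in c(50, 500, 5000)) {
  x <- rgamma(n, shape = shape, rate = rate)
  theta_hat <- 1 / mean(x)             # MLE of the misspecified exponential model
  cat(sprintf("n = %5d  theta_hat = %.3f  D = %.2e\n",
              n, theta_hat, kl_exp(theta_star, theta_hat)))
}
```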

This paper aims to present the simple and practical criterion (3), and proceeds as follows:

  1. The asymptotic expansion of the estimation risk is derived.

  2. Combining the asymptotic expansion with the estimated moments gives an estimator of the estimation risk.

  3. A reasonable (persuasive) threshold C is proposed.

An overview of the contents of each section is now provided. First, the asymptotic expansion of the estimation risk is given for both the general model (Sect. 2.1) and an exponential family model (Sect. 2.2). The estimator of the estimation risk is given in Sect. 2.3. Next, a concrete threshold C is proposed in view of the Bayes error rate. With these results combined, the p/n criterion is proposed in an explicit form (Sect. 3.1). As an application of the p/n criterion, the bin number problem in a multinomial distribution or a histogram is considered (Sect. 3.2). In Sect. 3.3, the algorithm for calculating the p/n criterion in the case of an exponential family is described. In Sect. 3.4, the use of the p/n criterion is demonstrated for two practical examples.

2 Estimation risk for general case and exponential family

In this section, the asymptotic expansion with respect to n of the estimation risk (2) is presented up to the first-order term for a general distribution, and up to the second-order term for an exponential family distribution.

Hartigan (1998) derives the asymptotic expansion of the estimation risk (2) up to the second order under the assumption that g(x) belongs to \(\mathcal {M}\). The result here extends his result in the sense that the true distribution is not necessarily located in \(\mathcal {M}\).

On the risk for an exponential family, the most relevant work is that of Barron and Sheu (1991). They consider the convergence rate of the K–L divergence (not the risk, but the divergence itself) for an exponential family on a compact set. Their interest lies in the closeness between g(x) and \(g(x;\hat{\theta })\), while this research focuses on the closeness between \(g(x;\theta _*)\) and \(g(x;\hat{\theta })\).

2.1 Estimation risk for general case

Taylor expansion of

$$\begin{aligned} D[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = \int g(x;\theta _*) \log (g(x;\theta _*) /g(x;\hat{\theta }) ) d\mu \end{aligned}$$

as a function of \(\hat{\theta }\) around \(\theta _*\) is considered:

$$\begin{aligned}&D[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] \\&\quad = -\sum _i \int \frac{\partial \ }{\partial \theta ^i} g(x;\theta )\bigg |_{\theta =\theta _*} d\mu \ (\hat{\theta }^i - \theta _*^i) \\&\qquad + \frac{1}{2} \sum _{i,j} \int g(x;\theta _*)\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \bigg |_{\theta =\theta _*}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta ) \bigg |_{\theta =\theta _*}\Bigr ) d\mu \\&\qquad \times (\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j) \\&\qquad - \frac{1}{2}\sum _{i,j} \int \frac{\partial ^2 \ \ \ }{\partial \theta ^i \partial \theta ^j} g(x, \theta )\bigg |_{\theta =\theta _*} d\mu \ (\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j)\\&\qquad - \frac{1}{3!}\sum _{i_1,i_2,i_3}\int g(x;\theta _*) \frac{\partial ^3\log g(x, \theta )}{\partial \theta ^{i_1}\partial \theta ^{i_2}\partial \theta ^{i_3}} \bigg |_{\theta =\tilde{\theta }_*} d\mu (\hat{\theta }^{i_1} - \theta _*^{i_1}) (\hat{\theta }^{i_2} - \theta _*^{i_2})(\hat{\theta }^{i_3} - \theta _*^{i_3}), \end{aligned}$$

where \(\tilde{\theta }_*\) is a point between \(\theta _*\) and \(\hat{\theta }\). Because

$$\begin{aligned} \int \frac{\partial \ }{\partial \theta ^i} g(x;\theta ) d\mu =0,\qquad \int \frac{\partial ^2 \ \ }{\partial \theta ^i \partial \theta ^j} g(x, \theta ) d\mu =0,\qquad \forall \theta \in \Theta , \end{aligned}$$

it turns out that

$$\begin{aligned} R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ]&= \frac{1}{2}\sum _{i,j} g^*_{ij}(\theta _*) E[(\hat{\theta }^i - \theta _*^i) (\hat{\theta }^j - \theta _*^j)] \nonumber \\&\quad - \frac{1}{3!}\sum _{i_1,i_2,i_3} E[\tau _{i_1,i_2,i_3}(\hat{\theta }^{i_1} - \theta _*^{i_1})\cdots (\hat{\theta }^{i_3} - \theta _*^{i_3})]. \end{aligned}$$

Here,

$$\begin{aligned} \tau _{i_1,i_2,i_3} = \int g(x;\theta _*) \frac{\partial ^3 \log g(x, \theta )}{\partial \theta ^{i_1}\partial \theta ^{i_2} \partial \theta ^{i_3}} \bigg |_{\theta =\tilde{\theta }_*} d\mu \end{aligned}$$

and \(g^*_{ij}\) indicates the components of the Fisher metric matrix on \(\mathcal {M}\), given by

$$\begin{aligned} g^*_{ij}(\theta _*)= (G^*(\theta _*))_{ij} = E_{\theta _*}\Bigl [\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \Big |_{\theta =\theta _*}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta )\Big |_{\theta =\theta _*} \Bigr ) \Bigr ].\\ \end{aligned}$$

As \(\theta _*\) is the solution of equation (4) and \(\hat{\theta }\) is its empirical solution (i.e., the M-estimator), the following result holds (see, e.g., Theorem 5.21 of van der Vaart (1998)).

$$\begin{aligned} \sqrt{n}\bigl (\hat{\theta }-\theta _*\bigr ) {\mathop {\rightarrow }\limits ^{d}} N_p (0, \tilde{G}^{-1}G\tilde{G}^{-1}), \end{aligned}$$

where

$$\begin{aligned} g_{ij}(\theta _*) = (G(\theta _*))_{ij}&= E\Bigl [\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(X;\theta )\Big |_{\theta =\theta _*} \frac{\partial \ }{\partial \theta ^j}\log g(X;\theta )\Big |_{\theta =\theta _*}\Bigr )\Bigr ], \\ \tilde{g}_{ij}(\theta _*) = (\tilde{G}(\theta _*))_{ij}&= -E\Bigl [\frac{\partial ^2 \ \quad }{\partial \theta ^j \partial \theta ^i} \log g(X;\theta )\Big |_{\theta =\theta _*}\Bigr ] . \end{aligned}$$

For a general distribution, the estimation risk is asymptotically given as follows.

Theorem 1

$$\begin{aligned} R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = (2n)^{-1} \mathrm {tr} \Bigl (\tilde{G}(\theta _*)^{-1}G(\theta _*)\tilde{G}(\theta _*)^{-1}G^*(\theta _*) \Bigr ) + O(n^{-2}). \end{aligned}$$
(5)

Because the \(n^{-2}\)-order term is prohibitively lengthy, incorporating it into the p/n criterion would make the result unsuitable for practical use; hence, it is omitted here. (Interested readers are referred to Theorem 1 of Sheena (2021), where the proof of the whole expansion can also be found.)

Note that, if g(x) lies within the model, then \(G=\tilde{G}=G^*\); hence, the first-order term equals p/(2n) (for a more general result for the well-specified model, see Sheena (2018)). Thus, the first-order term is mainly determined by p if \(g(x;\theta _*)\) is close to g(x).

2.2 Estimation risk for exponential family

This subsection investigates the estimation risk when the parametric model is an exponential family (for general references on exponential families, see Brown (1986), Barndorff-Nielsen (2014) and Sundberg (2019)). In the case of the exponential family, the \(n^{-2}\)-order term in the asymptotic expansion of the estimation risk has a simpler form.

Let the model \(\mathcal {M}\) be given by

$$\begin{aligned} \mathcal {M} = \Bigl \{g(x ;\theta ) = \exp \Bigl (\sum _{i=1}^p \theta ^i \xi _i(x) -\Psi (\theta )\Bigr ) \ \Big | \theta \in \Theta \Bigr \}. \end{aligned}$$
(6)

Here, \(\Psi (\theta )\) is the cumulant-generating function of the \(\xi _i\) terms:

$$\begin{aligned} \Psi (\theta ) = \log \int \exp \Bigl (\sum _{i=1}^p \theta ^i \xi _i(x)\Bigr ) d\mu . \end{aligned}$$

The “dual coordinate” \(\eta \) is defined as

$$\begin{aligned} \eta _i(\theta ) = \frac{\partial \Psi (\theta )}{\partial \theta ^i} = E_{\theta }[\xi _i], \quad i=1,\ldots ,p. \end{aligned}$$

In particular, from the definition of \(\theta _*\) (see (4)),

$$\begin{aligned} \eta _i^* = \eta _i(\theta _*) = E_{\theta _*}[\xi _i] =E[\xi _i], \quad i=1,\ldots ,p. \end{aligned}$$

The last equality states that the means of the \(\xi _i\) coincide under g(x) and \(g(x;\theta _*)\). It is known that \(g(x;\theta _*)\) maximizes the Shannon entropy among all probability distributions of \((\xi _1,\ldots ,\xi _p)\) with the given means \(E[\xi _i],\ i=1,\ldots ,p\) (the “entropy maximization property” of an exponential family; see, e.g., Wainwright and Jordan (2008)). Recall that the K–L divergence is the difference between the cross-entropy and the Shannon entropy.

The \(\eta \) coordinate is easily estimated. In fact, \(\hat{\eta }\), the MLE for \(\eta \), is the sample mean of \(\xi \). Hence,

$$\begin{aligned} \hat{\eta }_i= \frac{\partial \Psi }{\partial \theta ^i}(\hat{\theta }) = \bar{\xi }_i \bigl (= n^{-1}\sum _{t=1}^n \xi _i(X_t)\bigr ). \end{aligned}$$
(7)

In contrast, \(\hat{\theta }\) is difficult to obtain explicitly because \(\Psi \) or its derivatives cannot be obtained in closed form for a complex model. This can pose a serious obstacle to applying an exponential family model to a practical problem, and is discussed in Sect. 3.3.

Let the matrix \(\ddot{\Psi }(\theta )\) be defined by

$$\begin{aligned} (\ddot{\Psi }(\theta ))_{ij}= \frac{\partial ^2 \Psi (\theta ) }{\partial \theta ^i\partial \theta ^j} = E_{\theta }[(\xi _i-\eta _i)(\xi _j-\eta _j)], \quad 1 \le i,j \le p. \end{aligned}$$

Thus, \(\ddot{\Psi }\) is a covariance matrix of the \(\xi _i\) terms under \(g(x;\theta )\); hence, it is positive definite. Therefore, \(\Psi (\theta )\) is a convex function. The notable property

$$\begin{aligned} g^*_{ij}(\theta ) =\tilde{g}_{ij}(\theta ), \quad 1 \le i,j \le p, \quad \forall \theta \end{aligned}$$

is proven by the fact that both sides are equal to \((\ddot{\Psi }(\theta ))_{ij}\).

The following notation is used for the third- and fourth-order cumulants:

$$\begin{aligned}&\kappa _{ijk} = E[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)] \\&\kappa ^*_{ijk} = E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)]= \frac{\partial ^3\Psi (\theta _*)}{\partial \theta ^i \partial \theta ^j \partial \theta ^k}\\&\kappa ^*_{ijkl} = E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _j-\eta ^*_j)]E_{\theta _*}[(\xi _k-\eta ^*_k)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _k-\eta ^*_k)]E_{\theta _*}[(\xi _j-\eta ^*_j)(\xi _l-\eta ^*_l)]\\&\qquad \quad -E_{\theta _*}[(\xi _i-\eta ^*_i)(\xi _l-\eta ^*_l)]E_{\theta _*}[(\xi _j-\eta ^*_j)(\xi _k-\eta ^*_k)]=\frac{\partial ^4\Psi (\theta _*)}{\partial \theta ^i \partial \theta ^j \partial \theta ^k \partial \theta ^l} \end{aligned}$$

for \(1\le i, j, k,l \le p\).

The next theorem states the asymptotic expansion of the estimation risk for an exponential family distribution. In the case of an exponential family, the second-order term is relatively simple and can be used in practice when incorporated into the p/n criterion proposed in the next section.

In the theorem, for brevity, Einstein notation is used and the dependency on \(\theta _*\) is omitted; e.g., G for \(G(\theta _*)\) and \(\tilde{g}^{ij}\) for \(\tilde{g}^{ij}(\theta _*)\).

Theorem 2

If the parametric model is an exponential family, the estimation risk is given by

$$\begin{aligned} \begin{aligned}&R[g(x;\theta _*)\,|\,g(x;\hat{\theta }) ] = \frac{1}{2n} \mathrm {tr} \Bigl (\tilde{G}^{-1}G \Bigr ) \\&\quad +\frac{1}{24n^2}\Bigl [-8\tilde{g}^{uk}\tilde{g}^{ls}\tilde{g}^{mt}\kappa _{kst}\kappa ^*_{lmu}\\&\quad +9\tilde{g}^{ko}\tilde{g}^{lu}\tilde{g}^{sv}\tilde{g}^{tw}\tilde{g}^{hm}\kappa ^*_{lmo}\kappa ^*_{sth} (g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv})\\&\quad -3\tilde{g}^{kw}\tilde{g}^{ls}\tilde{g}^{mu}\tilde{g}^{tv}\kappa ^*_{lmtw} (g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su})\Bigr ] + O(n^{-3}). \end{aligned} \end{aligned}$$
(8)

Proof

The calculation is carried out straightforwardly from the expansion for the general distribution. See Sheena (2021) for the proof. \(\square \)

The estimation risk up to the second-order term is determined by the moments of the \(\xi _i\) terms under g(x), namely \(g_{ij}\) and \(\kappa _{ijk}\), and by their moments under \(g(x;\theta _*)\), namely \(\tilde{g}^{ij}\), \(\kappa ^*_{ijk}\), and \(\kappa ^*_{ijkl}\).

2.3 Estimator of estimation risk

We will use Theorems 1 and 2 to approximate the estimation risk. In order to establish the criterion (3), we need an estimator of the (approximated) estimation risk. The moments contained in (5) or (8) need to be estimated: the second moments (Fisher information metrics)

$$\begin{aligned} G^*=(g^*_{ij}),\quad \tilde{G}=(\tilde{g}_{ij}), \quad G=(g_{ij}) \end{aligned}$$

and the cumulants

$$\begin{aligned} \kappa _{ijk},\quad \kappa ^*_{ijk}, \quad \kappa ^*_{ijkl},\qquad 1\le i,j,k,l \le p. \end{aligned}$$

Naive estimators of these quantities (denoted by the “hat” mark: \(\hat{G}\), \(\hat{\kappa }_{ijk}\), etc.) are obtained by replacing \(\theta _*\) with the MLE \(\hat{\theta }\) and the expectation \(E[\cdot ]\) with the empirical mean.

First, the estimators of the second moments are given as follows:

$$\begin{aligned} (\hat{G})_{ij}&= n^{-1}\sum _{t=1}^n \frac{\partial }{\partial \theta ^i}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }} \frac{\partial }{\partial \theta ^j}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }}\\ (\hat{\tilde{G}})_{ij}&= -n^{-1}\sum _{t=1}^n \frac{\partial ^2 }{\partial \theta ^i\partial \theta ^j}\log g(X_t;\theta )\Big |_{\theta =\hat{\theta }}\\ (\hat{G}^*)_{ij}&= \int g(x;\hat{\theta })\Bigl ( \frac{\partial \ }{\partial \theta ^i}\log g(x;\theta ) \Big |_{\theta =\hat{\theta }}\Bigr ) \Bigl ( \frac{\partial \ }{\partial \theta ^j} \log g(x;\theta )\Big |_{\theta =\hat{\theta }} \Bigr ) d\mu . \\ \end{aligned}$$

Now we have the p/n criterion for a general distribution with a given C.

Criterion for a general distribution

$$\begin{aligned} C \ge \frac{1}{2n}\mathrm {tr} \Bigl (\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^* \Bigr ) \end{aligned}$$
(9)
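
As an illustration (our sketch, not the authors' code), criterion (9) can be evaluated with a few lines of R for a simple two-parameter model, here a normal \(N(\mu ,\sigma ^2)\) fitted to possibly non-normal data. The per-observation scores and Hessians are written analytically, and \(\hat{G}^*\) is the Fisher information of the fitted normal.

```r
## Sketch of criterion (9) for the normal model g(x; mu, sigma) (not the authors' code).
## Scores: d/dmu = (x - mu)/sigma^2,  d/dsigma = -1/sigma + (x - mu)^2/sigma^3.
pn_criterion_normal <- function(x, alpha = 0.05) {
  n  <- length(x)
  mu <- mean(x); s <- sqrt(mean((x - mu)^2))   # MLE of (mu, sigma)
  z  <- x - mu
  score  <- cbind(z / s^2, -1 / s + z^2 / s^3) # n x 2 matrix of scores at the MLE
  G_hat  <- crossprod(score) / n               # empirical outer product of scores
  H12    <- -2 * z / s^3                       # mixed second derivative
  H22    <- 1 / s^2 - 3 * z^2 / s^4            # second derivative in sigma
  Gt_hat <- -matrix(c(-1 / s^2, mean(H12), mean(H12), mean(H22)), 2, 2)  # minus averaged Hessian
  Gs_hat <- diag(c(1 / s^2, 2 / s^2))          # Fisher information of N(mu, sigma^2)
  risk1  <- sum(diag(solve(Gt_hat) %*% G_hat %*% solve(Gt_hat) %*% Gs_hat)) / (2 * n)
  list(estimated_risk = risk1, threshold = 8 * alpha^2, ok = risk1 < 8 * alpha^2)
}
set.seed(1)
pn_criterion_normal(rexp(300), alpha = 0.05)   # clearly non-normal data
```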

Next, the criterion for an exponential family is considered. \(\hat{G}\) equals the sample covariance matrix of the \(\xi _i\) terms, \(\hat{\Sigma }\):

$$\begin{aligned} \hat{G} = \hat{\Sigma },\quad {\hat{g}}_{ij} = \bigl (\hat{\Sigma }\bigr )_{ij},\quad \bigl (\hat{\Sigma }\bigr )_{ij} = n^{-1}\sum _{t=1}^n (\xi _i(X_t)- \bar{\xi }_i)(\xi _j(X_t)- \bar{\xi }_j), \end{aligned}$$
(10)

where \(\bar{\xi }_i = n^{-1} \sum _{t} \xi _i(X_t).\) Similarly, the estimator of the true third-order cumulant is given by the sample third-order cumulant:

$$\begin{aligned} \hat{\kappa }_{ijk} = n^{-1}\sum _{t=1}^n (\xi _i(X_t)- \bar{\xi }_i)(\xi _j(X_t)- \bar{\xi }_j)(\xi _k(X_t)- \bar{\xi }_k). \end{aligned}$$
(11)
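
For concreteness, (10) and (11) can be computed with a short helper (ours, not from the paper) that takes the \(n\times p\) matrix whose (t, i) entry is \(\xi _i(X_t)\):

```r
## Sample covariance (10) and sample third-order cumulant (11) of the xi terms.
## `xi` is an n x p matrix with (t, i) entry xi_i(X_t); denominators are n, not n - 1.
sample_moments <- function(xi) {
  n  <- nrow(xi); p <- ncol(xi)
  xc <- sweep(xi, 2, colMeans(xi))             # centered xi values
  Sigma_hat <- crossprod(xc) / n               # (10)
  kappa_hat <- array(0, dim = c(p, p, p))      # (11)
  for (i in 1:p) for (j in 1:p) for (k in 1:p)
    kappa_hat[i, j, k] <- mean(xc[, i] * xc[, j] * xc[, k])
  list(Sigma_hat = Sigma_hat, kappa_hat = kappa_hat)
}
```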

Further,

$$\begin{aligned} \hat{\tilde{G}}&= \ddot{\Psi }(\hat{\theta } ), \quad {\hat{\tilde{g}}}_{ij}= \bigl (\ddot{\Psi }(\hat{\theta })\bigr )_{ij} \end{aligned}$$
(12)
$$\begin{aligned} \hat{\kappa }^*_{ijk}&= \frac{\partial ^3 \ }{\partial \theta ^i \partial \theta ^j \partial \theta ^k}\Psi (\theta )\Big |_{\theta =\hat{\theta }} \end{aligned}$$
(13)
$$\begin{aligned} \hat{\kappa }^*_{ijkl}&=\frac{\partial ^4 \ }{\partial \theta ^i \partial \theta ^j \partial \theta ^k \partial \theta ^l}\Psi (\theta )\Big |_{\theta =\hat{\theta }}. \end{aligned}$$
(14)

Consequently, for an exponential family, the p/n criterion is given as follows.

Criterion for an exponential family

$$\begin{aligned} \begin{aligned} C&\ge \frac{1}{2n} \mathrm {tr} \Bigl (\hat{\Sigma }(\ddot{\Psi }(\hat{\theta }) )^{-1}\Bigr )\\&+\frac{1}{24n^2}\Bigl [-8\hat{\tilde{g}}^{uk}\hat{\tilde{g}}^{ls}\hat{\tilde{g}}^{mt}\hat{\kappa }_{kst}\hat{\kappa }^*_{lmu} +9\hat{\tilde{g}}^{ko}\hat{\tilde{g}}^{lu}\hat{\tilde{g}}^{sv}\hat{\tilde{g}}^{tw}\hat{\tilde{g}}^{hm}\hat{\kappa }^*_{lmo}\hat{\kappa }^*_{sth} (\hat{g}_{ku}\hat{g}_{vw}+\hat{g}_{kv}\hat{g}_{uw}\\&+\hat{g}_{kw}\hat{g}_{uv})-3\hat{\tilde{g}}^{kw}\hat{\tilde{g}}^{ls}\hat{\tilde{g}}^{mu}\hat{\tilde{g}}^{tv}\hat{\kappa }^*_{lmtw} (\hat{g}_{ks}\hat{g}_{uv}+\hat{g}_{ku}\hat{g}_{sv}+\hat{g}_{kv}\hat{g}_{su})\Bigr ]. \end{aligned}\nonumber \\ \end{aligned}$$
(15)

How to determine C in (9) or (15) is studied in the next section. Once C is determined, we can use these criteria for the two problems introduced in Sect. 1, namely the sample size problem and the model selection problem.

3 Criterion for model complexity and sample size

In this section, we complete the p/n criterion by providing a reasonable threshold C for (9) or (15) (Sect. 3.1). As an immediate application of the criterion, we deal with the bin number problem in a multinomial distribution or a histogram (Sect. 3.2). We also state the algorithm for calculating the \(n^{-2}\)-order term in (15) (Sect. 3.3). Finally, the use of the p/n criterion is demonstrated for two practical examples (Sect. 3.4).

3.1 Choice of threshold

Because the value of the divergence (1) or the risk (2) has no absolute standard by itself, we relate it to another reasonable standard. One frequently used measure of the closeness between two distributions is the error rate, which is more intuitive than the divergence and is suitable for setting a threshold. Let \(g_i(x),\ i=1,2\), be p.d.f.s. If both \(g_i(x),\ i=1,2\), are known, the Bayes discriminant rule (with the noninformative prior) is as follows.

For the sample X from either \(g_1(x)\) or \(g_2(x)\),

$$\begin{aligned} \frac{g_{i_1}(X)}{g_{i_2}(X)} > 1 \Longleftrightarrow \text { Judge that }X\text { is generated from }g_{i_1}(x). \end{aligned}$$

The Bayes error rate, Er, i.e., the probability that this rule gives an error, is formally defined by

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] = \frac{1}{2}\int \min \Bigl (g_1(x), g_2(x)\Bigr ) d\mu . \end{aligned}$$

The next theorem states the relation between Er and the K–L divergence.

Theorem 3

If \(D[g_1(x) \,| \, g_2(x)] \le \delta \), then

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] \ge \min \{ t \,|\, (x,t) \in A(\delta ) \}, \end{aligned}$$

where

$$\begin{aligned} A(\delta )= & {} \Bigl \{ (x,t) \,\Big |\, x\log {\Bigl (\frac{1-2t}{x} + 1\Bigr )}+(1-x) \log {\Bigl (\frac{2t-1}{1-x}+1\Bigr )}\\= & {} -\delta ,\quad 0< x< 2t < 1\Bigr \}. \end{aligned}$$

Proof

See Appendix. \(\square \)

Corollary 1

Let \(\delta = D[g_1(x)\, | \, g_2(x)]\) and let \(\alpha \) be a certain small positive number (e.g. \(\alpha =0.05, 0.01\)). If

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A(\delta ) \} \ge 1/2- \alpha , \end{aligned}$$
(16)

then

$$\begin{aligned} Er[g_1(x)\, | \, g_2(x)] \ge 1/2- \alpha . \end{aligned}$$

Analytical calculation of \(\min \{ t \,|\, (x,t) \in A(\delta ) \}\) is difficult. An approximation for t close to 1/2 is given here. As \(\log {(1+x)} \doteqdot x - x^2/2\) around \(x=0\),

$$\begin{aligned}&x\log {\Bigl (\frac{1-2t}{x} + 1\Bigr )}+(1-x) \log {\Bigl (\frac{2t-1}{1-x}+1\Bigr )} \\&\doteqdot x\Bigl (\frac{1-2t}{x}\Bigr ) -\frac{x}{2}\Bigl (\frac{1-2t}{x}\Bigr )^2 + (1-x)\frac{2t-1}{1-x}-\frac{(1-x)}{2}\Bigl (\frac{2t-1}{1-x}\Bigr )^2= -\frac{1}{2}\frac{(1-2t)^2}{x(1-x)}. \end{aligned}$$

Therefore, \(A(\delta )\) is approximated by

$$\begin{aligned} A^*(\delta ) = \Bigl \{ (x,t) \,\Big |\, t= \frac{1}{2}\Bigl (1-\sqrt{2\delta x(1-x)}\Bigr ),\quad 0< x< 2t < 1\Bigr \}. \end{aligned}$$

Note that

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A^*(\delta ) \} \ge \min _{0<x<1}{\frac{1}{2}\Bigl (1-\sqrt{2\delta x(1-x)}\Bigr ) }= \frac{1}{2}-\sqrt{\delta /8}. \end{aligned}$$

Hence, the condition \(\sqrt{\delta /8} \le \alpha \), or equivalently \(\delta \le 8\alpha ^2\), is approximately sufficient for (16). Let \(C_\alpha \) denote the solution in \(\delta \) of the equation

$$\begin{aligned} \min \{ t \,|\, (x,t) \in A(\delta ) \} = 1/2- \alpha , \end{aligned}$$

or more simply, let \(C_\alpha \) be given by

$$\begin{aligned} C_\alpha = 8 \alpha ^2. \end{aligned}$$
(17)

In the latter case, if \(\alpha =0.05\ (0.01)\), then \(C_\alpha =1/50\ (1/1250)\). The final form of the p/n criterion is obtained by substituting \(C_\alpha \) for C in (9) or (15).
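
The following R sketch (added here for illustration) computes \(C_\alpha =8\alpha ^2\) and numerically evaluates \(\min \{ t \,|\, (x,t) \in A(\delta ) \}\) over a grid of x, so the approximation \(1/2-\sqrt{\delta /8}\) can be checked.

```r
## Sketch (not the authors' code): threshold C_alpha and a numerical check of
## min{ t : (x, t) in A(delta) } against the approximation 1/2 - sqrt(delta / 8).
C_alpha <- function(alpha) 8 * alpha^2
min_t_exact <- function(delta, xgrid = seq(0.01, 0.99, by = 0.01)) {
  f <- function(t, x) x * log((1 - 2 * t) / x + 1) +
    (1 - x) * log((2 * t - 1) / (1 - x) + 1) + delta
  roots <- sapply(xgrid, function(x)
    uniroot(f, lower = x / 2 + 1e-9, upper = 0.5 - 1e-9, x = x)$root)
  min(roots)                                  # minimum over the x grid
}
delta <- C_alpha(0.05)                        # = 0.02
c(exact = min_t_exact(delta), approx = 0.5 - sqrt(delta / 8))   # both close to 0.45
```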

3.2 p/n Criterion for multinomial distribution

In this section, we present a formula for the bin number of a multinomial distribution using the p/n criterion. The bin number problem in a histogram can be treated similarly. Although several formulas have been proposed for the bin number (or the bin width) of a histogram, such as Sturges’ formula and the Freedman–Diaconis formula (see Chapter 3 of Scott (2015)), the formula here is derived from a new perspective.

In view of the relation between the true distribution g(x) and the information projection \(g(x;\theta _*)\), a multinomial distribution can be seen as an approximation by a step-function model. Let

$$\begin{aligned} \mathcal {M} = \{g(x;m)\,|\, m=(m_0,\ldots ,m_p)\} \end{aligned}$$

with

$$\begin{aligned} g(x;m) = \sum _{i=0}^p I(x \in S_i) \frac{m_i}{Vol(S_i)}, \end{aligned}$$

where \(S_i,\ i=0,1,\ldots ,p\), form a partition of the range of x, with volume

$$\begin{aligned} Vol(S_i) = \int _{S_i} 1 d\mu (x), \end{aligned}$$

and \(I(x \in S_i)\) is the indicator function of \(S_i\). In this case, from (4), the information projection \(g(x; m^*)\) is given by \(m^*_i = P(X \in S_i)\), the probability under g(x). The step-function model is not an exponential family. However, the Kullback–Leibler divergence between two step functions (where \(d\mu \) is the continuous measure) equals the divergence between the two corresponding multinomial distributions (where \(d\mu \) is the counting measure). Hence, the argument on the estimation risk can be deduced from that for the multinomial distribution model. Notably, if X is originally a discrete random variable, the model always contains g(x).

Consider a multinomial distribution with \(p+1\) possible values \(x_i, i=0,\ldots ,p\), with the corresponding probabilities \(m=(m_0,\ldots ,m_p)\). This is an exponential family (6), where

$$\begin{aligned}&\theta ^i = \log (m_i/m_0), \quad i=1,\ldots , p,\\&\xi _i(x) = {\left\{ \begin{array}{ll} 1,&{}\text { if }x = x_i,\\ 0,&{}\text { otherwise,} \end{array}\right. } \quad i=1,\ldots ,p \end{aligned}$$

and \(d\mu \) is the counting measure on \(\{x_0,x_1,\ldots ,x_p\}\). Here,

$$\begin{aligned} \Psi (\theta ) = \log \bigl (1+\sum _{i=1}^p \exp (\theta ^i)\bigr )=-\log m_0=-\log \bigl (1-\sum _{i=1}^p m_i\bigr ). \end{aligned}$$
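
As a quick numerical illustration (ours, not from the paper), the dual-coordinate relation \(\eta _i(\theta )=\partial \Psi /\partial \theta ^i=m_i\) can be verified in R for a chosen m:

```r
## Sketch (not the authors' code): check dPsi/dtheta^i = m_i by central differences.
m     <- c(0.3, 0.2, 0.1, 0.25, 0.15)        # cell probabilities m_0, ..., m_p (p = 4)
theta <- log(m[-1] / m[1])                   # theta^i = log(m_i / m_0)
Psi   <- function(th) log(1 + sum(exp(th)))  # = -log m_0
eps   <- 1e-6
eta   <- sapply(seq_along(theta), function(i) {
  e <- replace(numeric(length(theta)), i, eps)
  (Psi(theta + e) - Psi(theta - e)) / (2 * eps)
})
rbind(eta = eta, m = m[-1])                  # the two rows agree
```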

The asymptotic expansion of the estimation risk up to the second order can be derived as follows (this corresponds to equation (41) of Sheena (2018) with \(\alpha =-1\)).

$$\begin{aligned} R[g(x;\theta )\, | \, g(x;\hat{\theta })]=\frac{p}{2n} + \frac{1}{12n^2} (M-1) +O(n^{-3}), \qquad M = \sum _{i=0}^p {m_i}^{-1}, \end{aligned}$$
(18)

where \(\theta \), or equivalently \(m=(m_1,\ldots ,m_p)\), is the free parameter of the true distribution. Note that if some of the \(m_i\) are close to zero, the convergence becomes considerably slower.

If we combine the first-order approximation in (18) with the threshold (17), the p/n criterion becomes

$$\begin{aligned} \frac{p}{n} \le 16 \alpha ^2. \end{aligned}$$

If we adopt \(\alpha =0.05\ (0.01)\), then the sample size n or the bin number \(p+1\) is determined by the following formula.

Simple criterion for the sample size or the bin number

$$\begin{aligned} \frac{p}{n} \le 1/25(1/625). \end{aligned}$$
(19)

The second-order approximation gives the following p/n criterion:

$$\begin{aligned} 96 n^2 \alpha ^2 - 6 n p - (\hat{M}-1) > 0, \end{aligned}$$

where

$$\begin{aligned} \hat{M}=\sum _{i=0}^{p} {\hat{m}_i}^{-1} \end{aligned}$$

and \(\hat{m}_i\) is the MLE, i.e., the sample relative frequency, for each i. Applying the criterion to determine n gives the formula

$$\begin{aligned} n \ge \frac{3p+\sqrt{9p^2+96\alpha ^2(\hat{M}-1)}}{96\alpha ^2}. \end{aligned}$$
(20)

In contrast, if the criterion is used for the bin number problem, the formula is given by

$$\begin{aligned} 6np + \hat{M} < 96n^2\alpha ^2+1. \end{aligned}$$
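
As a quick illustration (our sketch, not the authors' code), the approximation (18), the sample-size formula (20), and the bin-number check can be written as short R helpers acting on observed cell counts (all cells are assumed to have nonzero counts):

```r
## Second-order risk approximation (18), sample-size formula (20), and the
## bin-number check, from a vector of observed cell counts of length p + 1.
risk_multinom <- function(n, m) {                # (18), m = cell probabilities
  p <- length(m) - 1
  p / (2 * n) + (sum(1 / m) - 1) / (12 * n^2)
}
n_required <- function(counts, alpha = 0.05) {   # (20)
  p     <- length(counts) - 1
  M_hat <- sum(sum(counts) / counts)             # sum of 1 / m_hat_i
  ceiling((3 * p + sqrt(9 * p^2 + 96 * alpha^2 * (M_hat - 1))) / (96 * alpha^2))
}
bins_ok <- function(counts, alpha = 0.05) {      # bin-number criterion
  n     <- sum(counts)
  p     <- length(counts) - 1
  M_hat <- sum(n / counts)
  6 * n * p + M_hat < 96 * n^2 * alpha^2 + 1
}
```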

Use of these criteria for practical examples is discussed in Sect. 3.4.

3.3 Algorithm for p/n criterion of exponential family

This section describes the calculation of the right-hand side of (15). If the function \(\Psi (\theta )\) can be calculated analytically, the algorithm is simply the following.

Step 1. Calculate \(\hat{\eta }_i=\bar{\xi }_i,\ i=1,\ldots ,p\), from the sample.

Step 2. Solve the simultaneous equations w.r.t. \(\theta \) in (7) to obtain \(\hat{\theta }=(\hat{\theta }^1,\ldots ,\hat{\theta }^p)\):

$$\begin{aligned} \hat{\eta }_i = \eta _i(\hat{\theta }) = \frac{\partial \Psi }{\partial \theta ^i}(\hat{\theta }),\qquad i=1,\ldots ,p. \end{aligned}$$

Step 3. Calculate (12), (13), and (14) from the derivatives of \(\Psi (\theta )\) at \(\hat{\theta }\).

Step 4. Calculate (10) and (11) from the sample.

Step 5. Calculate the right-hand side of (15) and compare it with \(C_\alpha \).

Often, \(\Psi (\theta )\) is not explicitly given, especially for a complex model. Then, \(\hat{\theta }\) can be calculated iteratively using the Newton–Raphson method with the Jacobian matrix (12). Because \(\ddot{\Psi }(\theta )\) is the variance–covariance matrix of the \(\xi _i\) terms under the \(g(x;\theta )\) distribution, its value can be approximated from a generated sample. The alternative steps are as follows.

Step 2'. Iteratively search for \(\hat{\theta }\) with

$$\begin{aligned} \theta ^{(n+1)} = \theta ^{(n)} - \bigl (\eta (\theta ^{(n)})-\hat{\eta }\bigr )\bigl (\ddot{\Psi }(\theta ^{(n)})\bigr )^{-1}, \end{aligned}$$

where \(\eta (\theta ^{(n)})\) and \(\ddot{\Psi }(\theta ^{(n)})\) are approximated by the sample mean and the sample covariance matrix of the \(\xi _i\) terms of a sample generated from the \(g(x;\theta ^{(n)})\) distribution.

Further, (12), (13), and (14) can also be approximated using the generated sample.

Step 3'. Approximate (12), (13), and (14) using the sample moments and cumulants, where the sample is generated from \(g(x;\hat{\theta })\).

The point here is that \(\Psi (\theta )\) is not required for sample generation in Steps 2' and 3' if methods such as MCMC (which require no normalizing constant) are used. Although Steps 2' and 3' are computationally heavy, they enable the construction of a complex model without calculating \(\Psi \).
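
To make Step 2' concrete, the sketch below (ours, not the authors' code) runs the Monte Carlo version of the Newton iteration for a toy family with \(\xi _1(x)=x\), \(\xi _2(x)=x^2\) and the Lebesgue reference measure, so that \(g(x;\theta )\) is a normal distribution and direct sampling happens to be easy; in a genuinely complex model the sampler would be replaced by MCMC, as noted above.

```r
## Sketch of Step 2' (not the authors' code) for the toy family with xi = (x, x^2):
## g(x; theta) is N(-theta1 / (2 theta2), -1 / (2 theta2)) with theta2 < 0, and
## eta(theta), Psi-double-dot(theta) are approximated by Monte Carlo.
set.seed(1)
x_data  <- runif(500)                          # observed sample (any data will do)
eta_hat <- c(mean(x_data), mean(x_data^2))     # (7): MLE of the dual coordinate

sample_model <- function(theta, N = 2e5) {     # direct sampler for this toy model
  mu <- -theta[1] / (2 * theta[2]); s <- sqrt(-1 / (2 * theta[2]))
  rnorm(N, mu, s)
}
theta <- c(0, -0.5)                            # initial value: the standard normal
for (it in 1:20) {
  y      <- sample_model(theta)
  xi     <- cbind(y, y^2)
  eta_mc <- colMeans(xi)                       # Monte Carlo estimate of eta(theta)
  V_mc   <- cov(xi)                            # Monte Carlo estimate of Psi-double-dot
  theta  <- theta - solve(V_mc, eta_mc - eta_hat)   # Newton update of Step 2'
  theta[2] <- min(theta[2], -1e-6)             # keep theta2 negative (sketch-level guard)
}
## Closed-form check, available only because this toy model happens to be Gaussian:
v <- mean(x_data^2) - mean(x_data)^2
rbind(iterated = theta, closed_form = c(mean(x_data) / v, -1 / (2 * v)))
```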

3.4 Real data examples for p/n criterion

This section demonstrates use of the p/n criterion for a particular problem through two practical examples under the exponential family model.

Example 1

(Red Wine) The first example is a well-known dataset on wine quality, taken from the U.C.I. Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).

Only the red wine data are used. The sample size is 1599, and the variables consist of 11 chemical substances (continuous variables) and a “quality” index (integers from 3 to 8). The vector of the chemical substances and the “quality” variable are denoted by \(x^{(1)} =(x^{(1)}_{1},\ldots , x^{(1)}_{11})\) and \(x^{(2)}\), respectively. We randomly divided the sample into two halves, one of which (“data_base”) was used for the model formulation and the other (“data_est”) for the estimation of the parameters.

For the model formulation, we determined the following: the normalization method for the original data, the reference (probability) measure \(d\mu (x)\), and the \(\xi \) elements. Using “data_base”, we proceed as follows.

  1. Each variable \(x^{(1)}_i\ (i=1,\ldots ,11)\) is divided by twice its maximum so that its range is \([0,\ 1)\). Further, 2 is subtracted from each “quality” index to give a range of \(\{1,2,\ldots ,6\}\).

  2. As \(d\mu (x)\), 11 independent Beta distributions are applied to \(x^{(1)}\) so that their means and variances are equal to those of “data_base”. The multinomial distribution of \(x^{(2)}\) is adopted, using each category’s sample relative frequency as the category probability parameter (say, \(m_i, \ i=1,\ldots ,6\)). In addition, \(x^{(1)}\) and \(x^{(2)}\) are taken to be independent.

Consequently, \(d\mu \) is selected as

$$\begin{aligned} x= & {} (x^{(1)},x^{(2)}),\quad d\mu (x) = \prod _{i=1}^{11} {x^{(1)}_i}^{(\beta _{1i}-1)} {(1-x^{(1)}_i)}^{(\beta _{2i}-1)} d(x^{(1)})\\&\times \prod _{i=1}^6 m_i^{I(x^{(2)}=i)} d^*(x^{(2)}), \end{aligned}$$

where \(d(x^{(1)})\) is the Lebesgue measure on \([0,\ 1]^{11}\), \(d^*(x^{(2)})\) is the counting measure on \(\{1,2,\ldots ,6\}\), and \(I(\cdot )\) is the indicator function. Further, \(\beta _{1i}\), \(\beta _{2i}\), and \(m_i\) satisfy the relations

$$\begin{aligned}&\frac{\beta _{1i}}{\beta _{1i}+\beta _{2i}} = \text { Sample mean of }x^{(1)}_i,\quad i=1,\ldots ,11\\&\frac{\beta _{1i}\beta _{2i}}{(\beta _{1i}+\beta _{2i})^2(\beta _{1i}+\beta _{2i}+1)} = \text {Sample variance of }x^{(1)}_i,\quad i=1,\ldots ,11\\&m_i = \text {Relative frequency of }i\text { in }x^{(2)} \end{aligned}$$

  3. The candidates for the \(\xi _i\) terms are as follows:

$$\begin{aligned}&\xi _1(x) = x^{(1)}_1 x^{(1)}_2, \quad \xi _2(x)=x^{(1)}_1 x^{(1)}_3, \quad \ldots \quad \xi _{10}(x) = x^{(1)}_1x^{(1)}_{11} \\&\xi _{11}(x) = x^{(1)}_2 x^{(1)}_3,\quad \ldots \quad \xi _{19}(x) = x^{(1)}_2 x^{(1)}_{11}\\& \cdots \\&\xi _{55}(x) = x^{(1)}_{10} x^{(1)}_{11} \end{aligned}$$

and

$$\begin{aligned} \xi _{56}(x) = x^{(1)}_1x^{(2)}, \quad \ldots \quad \xi _{66}(x)= x^{(1)}_{11}x^{(2)}. \end{aligned}$$

Because some of these terms are highly correlated, we eliminate one member of each pair with correlation higher than 0.95. The following 19 \(\xi _i\) terms were removed from the full model:

$$\begin{aligned} \xi _i,\ i=8, 17, 19, 24, 25, 27, 32, 34, 38, 40, 43, 45, 46, 47, 49, 53, 58, 62, 64. \end{aligned}$$
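
A hypothetical R sketch (ours, not the published code) of this construction and of the correlation-based elimination is given below; `wine` is assumed to be a data frame whose first 11 columns are the rescaled chemical variables and whose 12th column is the shifted quality index, and the exact set of removed terms depends on the order in which pairs are scanned.

```r
## Hypothetical sketch: build the 66 xi terms and drop one member of each pair
## with correlation above the cutoff (column names/order are assumptions).
build_xi <- function(wine) {
  x1 <- as.matrix(wine[, 1:11]); x2 <- wine[[12]]
  xi <- matrix(nrow = nrow(wine), ncol = 0)
  for (i in 1:10) for (j in (i + 1):11)        # xi_1, ..., xi_55
    xi <- cbind(xi, x1[, i] * x1[, j])
  for (i in 1:11)                              # xi_56, ..., xi_66
    xi <- cbind(xi, x1[, i] * x2)
  xi
}
drop_correlated <- function(xi, cutoff = 0.95) {
  r <- abs(cor(xi)); keep <- rep(TRUE, ncol(xi))
  for (i in 1:(ncol(xi) - 1)) for (j in (i + 1):ncol(xi))
    if (keep[i] && keep[j] && r[i, j] > cutoff) keep[j] <- FALSE
  xi[, keep, drop = FALSE]
}
```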

Consequently, an exponential family model with \(p=47\) is formulated. As the probability distribution \(g(x;\theta )d\mu \) equals \(d\mu \) when the \(\theta \) terms all equal zero, it is denoted by g(x; 0). Note that the \(g(x ;\theta _*)\) of this model is the closest to g(x; 0) in the sense that

$$\begin{aligned} D[g(x;\theta _*) | g(x;0)] = \min _{h \in \mathcal {H}} D[h(x) | g(x;0)], \end{aligned}$$

where \(\mathcal {H}\) is the set of p.d.f.s h(x) (w.r.t. \(d\mu \)) that satisfy

$$\begin{aligned} E_h[\xi _i(X)] = \int h(x) \xi _i(x) d\mu (x) = E[ \xi _i(X) ], \end{aligned}$$

for each \(\xi _i\) in the model. This is a consequence of the so-called “minimum relative entropy characterization” of an exponential family (see Csiszár (1975)).

Under the formulated exponential family model, the algorithm in the previous section was implemented, and the right-hand side of (15) was calculated using “data_est”, whose size n equals 799. Because of the model complexity, the explicit form of \(\Psi (\theta )\) could not be obtained; hence, the alternative Steps 2' and 3' were used. The R and RStan program codes for the whole risk calculation are available on GitHub (https://github.com/YSheena/P-N_Criteria_Program.git). The first- and second-order terms and the estimation risk (their total in (15)) were as follows:

First-order term: 2.95e-02, Second-order term: -1.30e-04, Estimation Risk: 2.93e-02

Note that the second-order term contributes little to the estimation risk; thus, the first-order approximation seems sufficient for this model and data. With the threshold (17), the equation 2.93e-02 \(=8\alpha ^2\) gives the solution \(\alpha \doteqdot 0.06\). Hence, the Bayes error rate between \(g(x;\hat{\theta })\) and \(g(x;\theta _*)\) is higher than 0.44. If we set the threshold at \(\alpha =0.05\), we must trim the model further. For example, if we eliminate one member of each pair of \(\xi \) elements with correlation higher than 0.9, then p becomes as small as 37. For this model, the estimation risk is lower than the target value \(8\times (0.05)^2=0.02\), as follows:

First-order term: 1.60e-02, Second-order term: 2.04e-04, Estimation Risk: 1.62e-02

Example 2

(Abalone Data) The next example also features a well-known dataset, in this case the physical measurements of abalones (U.C.I. Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Abalone). The data comprise eight attributes (sex, length, diameter, etc.) of 4177 abalones. Here, only two discrete variables were considered: “sex” and “rings,” where “sex” takes the three values “Female,” “Infant,” and “Male,” and “rings” takes integer values from 1 to 29. The frequency of each group classified by “sex” and “rings” is given in Table 1. The original frequencies were aggregated at both ends: in the table, if a cell with a star mark is located to the immediate left or right, the number in the cell is the aggregated one. For example, for the female abalones, the cells with 24 or more rings were aggregated into a frequency of 4. The total number of cells is 63.

A multinomial distribution over the 63 cells was considered; hence, \(p=62\). First, the simple criterion (19) is adopted; then

$$\begin{aligned} p/n = 62/4177 \doteqdot 0.015 < 1/25, \end{aligned}$$

but \(p/n > 1/625\). Consequently, the model distribution is close to the information projection (in this case, the true distribution) to the extent that a Bayes error rate of more than 0.45 is guaranteed, while the stronger guarantee of 0.49 is not attained.

Table 1 Abalones by sex and rings

In order to use the second-order term, M needs to be estimated. From the sample relative frequency \(\hat{m}_i\) of each cell, \(i=0,\ldots ,62\),

$$\begin{aligned} \hat{M} = \sum _{i=0}^{62} {\hat{m}_i}^{-1} = 36128.33. \end{aligned}$$

Use of the sample-size formula (20) yielded

$$\begin{aligned} n \ge 1642, \end{aligned}$$

which indicates that the actual sample size of 4177 is large enough for a Bayes error rate of 0.45. However, to attain a Bayes error rate of 0.49, the required sample size is 38847, which far exceeds the actual sample size.
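
For reference, the arithmetic of this example can be reproduced with formula (20) in a few lines of R (our check, not the authors' code), using the reported \(\hat{M}=36128.33\) and \(p=62\):

```r
## Reproducing the Example 2 figures from formula (20).
n_required_from_M <- function(p, M_hat, alpha)
  ceiling((3 * p + sqrt(9 * p^2 + 96 * alpha^2 * (M_hat - 1))) / (96 * alpha^2))
n_required_from_M(62, 36128.33, 0.05)   # 1642
n_required_from_M(62, 36128.33, 0.01)   # 38847
```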