Convergence of estimative density: criterion for model complexity and sample size

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback–Leibler divergence, the closest distribution is called the “information projection.” The estimation risk of the maximum likelihood estimator is defined as the expectation of the Kullback–Leibler divergence between the information projection and the maximum likelihood estimative density (the predictive distribution with the plugged-in maximum likelihood estimator). Here, the asymptotic expansion of the risk is derived up to the second order in the sample size, and the sufficient condition on the risk for the Bayes error rate between the predictive distribution and the information projection to be higher than a specified value is investigated. Combining these results, the “p/n criterion” is proposed, which determines whether the estimative density is sufficiently close to the information projection for the given model and sample. This criterion can constitute a solution to the sample size or model selection problem. The use of the p/n criterion is demonstrated for two practical datasets.


Introduction
Given a certain data set, an unknown probability distribution that generates the data as the independent, identically distributed (i.i.d.) sample can be assumed. Under this assumption, if a certain parametric distribution model is adopted to "explain" the data, the first task is to find the "best" approximating distribution in the model. Because the true distribution is assumed to be outside the model (except for some rare cases), the "best" means the "closest" to the true distribution.
Consider the following parametric distribution model:

M = { g(x; θ) dμ : θ = (θ^1, ..., θ^p) ∈ Θ ⊂ R^p },

where g(x; θ) is the probability density function (p.d.f.) with respect to a reference measure dμ on a measurable space. The p.d.f. of the unknown true distribution with respect to dμ is denoted by g(x). If we use a certain divergence D[· | ·] to measure the closeness between g(x) and g(x; θ), then the "best" approximating distribution in M is given by g(x; θ*), where

θ* = argmin_{θ ∈ Θ} D[ g(x) | g(x; θ) ].

Following Csiszár (1975), we will call g(x; θ*) the "information projection" in this paper. Let θ̂ denote the maximum likelihood estimator (MLE) based on the i.i.d. sample X = (X_1, ..., X_n) from g(x), and consider the predictive density g(x; θ̂) together with the divergence

D[ g(x; θ*) | g(x; θ̂) ].   (1)

Since the MLE converges to θ* in probability (see, e.g., Theorem 5.21 of van der Vaart (1998)) as the sample size n increases, (1) also converges to zero in probability. The predictive density g(x; θ̂) is produced with the plugged-in MLE; this type of predictive density is called an "estimative density". Another common method to formulate a predictive density is the Bayesian predictive density; for its asymptotic properties, see, e.g., Komaki (1996), Hartigan (1998), Komaki (2015) and Zhang et al. (2018). Taking the expectation of (1) with respect to the i.i.d. sample X = (X_1, ..., X_n) from g(x), we define

R = E[ D[ g(x; θ*) | g(x; θ̂) ] ].   (2)

Throughout this study, the expectation under g(x) is denoted by E[·], while the expectation under g(x; θ*) is denoted by E_θ*[·]. We call (2) the "estimation risk", to distinguish it from the "total risk" E[ D[ g(x) | g(x; θ̂) ] ]. The estimation risk converges to zero under some mild conditions. We will use this estimation risk as the measure of the closeness between g(x; θ*) and g(x; θ̂).
Given the data and the model, we need to know whether g(x; θ̂) is sufficiently close to the information projection. Thus, with a certain threshold C, the following criterion is considered:

R̂ ≤ C,   (3)

where the left-hand side R̂ is the estimator of the estimation risk. This criterion gives a solution to the following two problems.
• Sample size problem: With the model fixed, it indicates how large a sample size n is needed for g(x; θ̂) to be close to the information projection. If the criterion is not satisfied, we need to collect a larger sample.
• Model selection problem: With the sample size fixed, it tells us whether a model is simple enough (especially whether the dimension of the parameter, p, is small enough) to guarantee that g(x; θ̂) is close to the information projection. Unless the criterion is satisfied, simplifying the model could be a remedy.
As seen later in the manuscript, the estimation risk is mainly determined by p/n when the information projection is close to the true distribution, and we will call this criterion the "p/n criterion" hereafter. In this paper, the Kullback–Leibler divergence is taken as the divergence, that is,

D[ g_1(x) | g_2(x) ] = ∫ g_1(x) log( g_1(x)/g_2(x) ) dμ.

Note that for this divergence, the information projection is given by the solution θ* of

E[ (∂/∂θ^i) log g(X; θ) ] = 0,  i = 1, ..., p,   (4)

and θ* is naturally estimated via the MLE θ̂, which is the solution of the empirical counterpart

(1/n) Σ_{t=1}^n (∂/∂θ^i) log g(X_t; θ) = 0,  i = 1, ..., p.

For other divergences, the information projection is more complicated, and its natural estimator is not as simple as the MLE. This paper aims to present a simple and practical criterion (3), and proceeds as follows:
1. The asymptotic expansion of the estimation risk is derived.
2. The asymptotic expansion combined with the estimated moments gives the estimator of the estimation risk.
3. A reasonable (persuasive) threshold C is proposed.
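To make the definition of the information projection concrete, the following sketch (an illustration added here, not taken from the paper; the Gamma/Exponential pair and all names are assumptions) numerically minimizes the K-L divergence D[g | g(·; λ)] when the true g is a Gamma density and the model is the one-parameter exponential-distribution family, which is misspecified:

```python
import numpy as np
from scipy import integrate, optimize
from scipy.stats import gamma

# True distribution g(x): Gamma(shape = 3, scale = 1), which lies OUTSIDE the
# model M = { Exponential(lam) : lam > 0 }.  The information projection
# minimizes D[g | f_lam] = E_g[log g(X)] - E_g[log f_lam(X)].
SHAPE = 3.0

# E_g[log g(X)] (negative Shannon entropy of g); constant in lam.
NEG_ENTROPY = integrate.quad(
    lambda x: gamma.pdf(x, SHAPE) * gamma.logpdf(x, SHAPE), 1e-12, 60)[0]

def kl_to_exponential(lam):
    # log f_lam(x) = log(lam) - lam * x, and E_g[X] = SHAPE for scale 1.
    return NEG_ENTROPY - (np.log(lam) - lam * SHAPE)

res = optimize.minimize_scalar(kl_to_exponential, bounds=(0.01, 5.0),
                               method="bounded")
print(res.x)  # information projection: lam* = 1 / E_g[X] = 1/3
```

For this model the projection simply matches the mean of the true distribution, which is why the MLE is its natural estimator, as noted in the text.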
An overview of the contents of each section is now provided. First, the asymptotic expansion of the estimation risk is given for both the general model (Sect. 2.1) and an exponential family model (Sect. 2.2). The estimator of the estimation risk is given in Sect. 2.3. Next, a concrete threshold C is proposed in view of the Bayes error rate, and with these results combined, the p/n criterion is proposed in an explicit form (Sect. 3.1). As an application of the p/n criterion, the bin number problem in a multinomial distribution or a histogram is considered (Sect. 3.2). In Sect. 3.3, the algorithm for calculating the p/n criterion in the case of an exponential family is described. In Sect. 3.4, the use of the p/n criterion is demonstrated for two practical examples.

Estimation risk for general case and exponential family
In this section, the asymptotic expansion with respect to n of the estimation risk (2) is presented up to the first-order term for a general distribution, and up to the second-order term for an exponential family distribution. Hartigan (1998) derives the asymptotic expansion of the estimation risk (2) up to the second order under the assumption that g(x) belongs to M. The result here extends his result in the sense that the true distribution is not necessarily located in M.
On the risk of an exponential family, the most relevant work is that of Barron and Sheu (1991). They consider the convergence rate of the K-L divergence (not the risk, but the divergence itself) for an exponential family on a compact set. Their interest lies in the closeness between g(x) and g(x; θ̂), while this research focuses on the closeness between g(x; θ*) and g(x; θ̂).

Estimation risk for general case
Consider the Taylor expansion of

D[ g(x; θ*) | g(x; θ̂) ] = ∫ g(x; θ*) log( g(x; θ*)/g(x; θ̂) ) dμ

as a function of θ̂ around θ*:

D[ g(x; θ*) | g(x; θ̂) ] = (1/2) g*_ij (θ̂ − θ*)^i (θ̂ − θ*)^j + (remainder terms evaluated at θ̄*),

where θ̄* is a point between θ* and θ̂; the first-order term vanishes because D[ g(x; θ*) | g(x; θ) ], as a function of θ, attains its minimum, zero, at θ = θ*. Here, g*_ij indicates the components of the Fisher metric matrix G* on M at θ*, given by

g*_ij = E_θ*[ (∂/∂θ^i) log g(X; θ*) (∂/∂θ^j) log g(X; θ*) ].

As θ* is the solution of equation (4) and θ̂ is its empirical solution (i.e., the M-estimator), the following result holds (see, e.g., Theorem 5.21 of van der Vaart (1998)):

√n (θ̂ − θ*) →d N_p( 0, G̃^{-1} G G̃^{-1} ),

where G = (g_ij) and G̃ = (g̃_ij) are the matrices with elements

g_ij = E[ (∂/∂θ^i) log g(X; θ*) (∂/∂θ^j) log g(X; θ*) ],   g̃_ij = −E[ (∂²/∂θ^i∂θ^j) log g(X; θ*) ].
For a general distribution, the estimation risk is asymptotically given as follows.

Theorem 1

R = (1/(2n)) tr( G* G̃^{-1} G G̃^{-1} ) + O(n^{-2}).   (5)

Because the n^{-2}-order term is prohibitively lengthy, incorporating it into the p/n criterion would make the result unsuitable for practical use; hence, it is omitted here (interested readers are referred to Theorem 1 of Sheena (2021), where the proof of the whole expansion can also be found). Note that, if g(x) lies within the model, then G = G̃ = G*; hence, the first-order term equals p/(2n) (for a more general result for the well-specified model, see Sheena (2018)). Thus, the first-order term is mainly determined by p if g(x; θ*) is close to g(x).
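The claim that the first-order risk reduces to p/(2n) for a well-specified model can be checked by simulation. The sketch below (a numerical check added for illustration; the Gaussian location model is an assumption) uses the p-dimensional model N(θ, I_p) with true distribution N(0, I_p), for which D[ g(·; θ*) | g(·; θ̂) ] = ‖θ̂ − θ*‖²/2 exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 5, 200, 2000

risks = []
for _ in range(reps):
    x = rng.normal(size=(n, p))    # i.i.d. sample from N(0, I_p): well-specified
    theta_hat = x.mean(axis=0)     # MLE of the mean parameter
    # KL divergence N(0, I_p) || N(theta_hat, I_p) = ||theta_hat||^2 / 2
    risks.append(0.5 * np.sum(theta_hat ** 2))

print(np.mean(risks), p / (2 * n))  # Monte Carlo risk vs first-order term 0.0125
```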

Estimation risk for exponential family
This subsection investigates the estimation risk when the parametric model is an exponential family (for general references on exponential families, see Brown (1986), Barndorff-Nielsen (2014) and Sundberg (2019)). In the case of the exponential family, the n −2 -order term in the asymptotic expansion of the estimation risk has a simpler form.
Let the model M be given by

M = { g(x; θ) dμ : g(x; θ) = exp( θ^i ξ_i(x) − ψ(θ) ), θ = (θ^1, ..., θ^p) ∈ Θ },   (6)

where ψ(θ) is the cumulant-generating function of the ξ terms, such that

ψ(θ) = log ∫ exp( θ^i ξ_i(x) ) dμ.

The "dual coordinate" η is defined as

η_i(θ) = E_θ[ ξ_i(X) ] = (∂ψ/∂θ^i)(θ),  i = 1, ..., p.

In particular, from the definition of θ* (see (4)),

η_i(θ*) = E[ ξ_i(X) ],  i = 1, ..., p.

The last equation requires the means of the ξ_i to coincide under g(x) and g(x; θ*). It is known that g(x; θ*) maximizes the Shannon entropy among all probability distributions of (ξ_1, ..., ξ_p) with a given E[ξ_i], i = 1, ..., p (the "entropy maximization property" of an exponential family; see, e.g., Wainwright and Jordan (2008)). The K-L divergence is the difference between the cross-entropy and the Shannon entropy. The η coordinate is easily estimated: in fact, η̂, the MLE for η, is the sample mean of ξ, i.e.,

η̂_i = (1/n) Σ_{t=1}^n ξ_i(X_t),  i = 1, ..., p.

In contrast, θ̂ is difficult to obtain explicitly because ψ(θ) or its derivative cannot be theoretically obtained for a complex model. This could pose a serious obstacle to the application of an exponential family model to a practical problem, and is discussed in Sect. 3.3.
Let the matrix ψ̈(θ) be defined by

( ψ̈(θ) )_ij = (∂²ψ/∂θ^i∂θ^j)(θ).

Thus, ψ̈(θ) is the covariance matrix of the ξ_i terms under g(x; θ); hence, it is positive definite, and ψ(θ) is a convex function. The notable property

g_ij(θ) = (∂η_i/∂θ^j)(θ)

is proven by the fact that both sides are equal to ( ψ̈(θ) )_ij.
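The fact that the dual coordinate is the gradient of ψ and the Fisher metric is its Hessian is easy to verify numerically for a one-parameter family. In the sketch below (an illustration; the Poisson family is an assumption), ψ(θ) = e^θ, so the dual coordinate and the Fisher metric both equal the Poisson mean e^θ:

```python
import numpy as np

# Poisson as an exponential family: g(x; theta) = exp(theta * x - psi(theta))
# with respect to the base measure (1/x!) * (counting measure), and
# psi(theta) = exp(theta).
def psi(t):
    return np.exp(t)

theta, h = 0.7, 1e-4
eta = (psi(theta + h) - psi(theta - h)) / (2 * h)                   # dpsi/dtheta
fisher = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h**2  # psi''(theta)

print(eta, fisher, np.exp(theta))  # dual coordinate = Fisher metric = mean
```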
The following notation is used for the third- or fourth-order cumulants: κ_ijk denotes the third-order joint cumulant of (ξ_i, ξ_j, ξ_k) under g(x), while κ*_ijk and κ*_ijkl denote the third- and fourth-order joint cumulants under g(x; θ*); for the exponential family, the latter equal the corresponding third and fourth partial derivatives of ψ(θ) at θ*. The next theorem states the asymptotic expansion of the estimation risk for an exponential family distribution. In the case of an exponential family, the second-order term is relatively simple and can be practically used when incorporated into the p/n criterion proposed in the next section.
In the theorem, for brevity, Einstein notation is used and the dependency on θ* is omitted; e.g., G for G(θ*) and g̃_ij for g̃_ij(θ*).
Theorem 2 If the parametric model is an exponential family, the estimation risk is given by (8).

Proof The calculation is carried out straightforwardly from the expansion for the general distribution. See Sheena (2021) for the proof.
The estimation risk up to the second-order term is determined by the moments of the ξ_i terms under g(x), namely g_ij and κ_ijk, as well as their moments under g(x; θ*), namely g̃_ij, κ*_ijk, and κ*_ijkl.

Estimator of estimation risk
We will use Theorems 1 and 2 for the approximation of the estimation risk. In order to establish the criterion (3), we need an estimator of the (approximated) estimation risk; the moments contained in (5) or (8), namely the second moments (Fisher information metrics) and the cumulants, need to be estimated. Naive estimators of these quantities (denoted by the "hat" mark: Ĝ, κ̂_ijk, etc.) are gained by replacing θ* with the MLE θ̂ and the expectation E[·] with the empirical mean. First, the estimators of the second moments are given as follows:

ĝ_ij = (1/n) Σ_{t=1}^n (∂/∂θ^i) log g(X_t; θ̂) (∂/∂θ^j) log g(X_t; θ̂),
ĝ̃_ij = −(1/n) Σ_{t=1}^n (∂²/∂θ^i∂θ^j) log g(X_t; θ̂),
ĝ*_ij = g*_ij with θ* replaced by θ̂ (i.e., the Fisher metric on M evaluated at θ̂).

Now we have the p/n criterion for a general distribution with a given C.

Criterion for a general distribution
(1/(2n)) tr( Ĝ* Ĝ̃^{-1} Ĝ Ĝ̃^{-1} ) ≤ C.   (9)

Next the criterion for the exponential family is considered. Ĝ equals the sample covariance matrix of the ξ_i terms, Σ̂:

ĝ_ij = (1/n) Σ_{t=1}^n ( ξ_i(X_t) − η̂_i )( ξ_j(X_t) − η̂_j ).

Similarly, the estimator of the true third-order cumulant is given by the sample third-order cumulant:

κ̂_ijk = (1/n) Σ_{t=1}^n ( ξ_i(X_t) − η̂_i )( ξ_j(X_t) − η̂_j )( ξ_k(X_t) − η̂_k ).

Further, the moments under g(x; θ*) are estimated by plugging θ̂ in:

ĝ̃_ij = ( ψ̈(θ̂) )_ij,   (12)
κ̂*_ijk = κ*_ijk with θ* replaced by θ̂,   (13)
κ̂*_ijkl = κ*_ijkl with θ* replaced by θ̂.   (14)

Consequently, for an exponential family, the p/n criterion is given as follows.
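The plug-in estimators above are plain sample moments. A minimal sketch (the simulated data and array names are assumptions made for illustration) of the sample covariance Ĝ and the sample third-order cumulant array for p = 3 ξ-statistics:

```python
import numpy as np

rng = np.random.default_rng(2)
xi = rng.exponential(size=(500, 3))   # 500 observations of p = 3 xi-statistics

xi_c = xi - xi.mean(axis=0)           # center at the sample mean (eta-hat)
G_hat = xi_c.T @ xi_c / len(xi)       # sample covariance: estimator of G
kappa_hat = np.einsum("ti,tj,tk->ijk", xi_c, xi_c, xi_c) / len(xi)  # 3rd cumulants

print(G_hat.shape, kappa_hat.shape)   # (3, 3) and (3, 3, 3)
```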

Criterion for an exponential family
How to determine C in (9) or (15) is studied in the next section. Once C is determined, we can use these criteria for the two problems, namely the sample size problem and the model selection problem, introduced in Sect. 1.

Criterion for model complexity and sample size
In this section, we complete the p/n criterion by providing a reasonable threshold C for (9) or (15) (Sect. 3.1). As an immediate application of the criterion, we deal with the bin number problem in a multinomial distribution or a histogram (Sect. 3.2). We also state the algorithm for the calculation of the n^{-2}-order term in (15) (Sect. 3.3), and demonstrate the use of the criterion for two real datasets (Sect. 3.4).

Choice of threshold
Because the value of the divergence (1) or the risk (2) has no absolute standard by itself, we relate it to another reasonable standard. One often-used measure of the closeness between two distributions is the error rate, which is more intuitive than the divergence and is suitable for setting a threshold. Let g_i(x), i = 1, 2, be the p.d.f.s of two distributions. If both g_i(x), i = 1, 2, are known, the Bayes discriminant rule (with the noninformative prior) is as follows.
For the sample X from either g_1(x) or g_2(x), choose g_1 if log( g_1(X)/g_2(X) ) ≥ 0, and choose g_2 otherwise. The Bayes error rate, Er, i.e., the probability that this rule gives an error, is formally defined by

Er = (1/2) P( the rule chooses g_2 | X ∼ g_1 ) + (1/2) P( the rule chooses g_1 | X ∼ g_2 ).

The next theorem states the relation between Er and the K-L divergence.
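For a single observation from one of two known densities, the Bayes error rate is easy to evaluate by Monte Carlo. The sketch below (an illustration; the Gaussian pair N(0, 1) vs N(0.2, 1) is an assumption) applies the discriminant rule above, which for this pair reduces to thresholding x at μ/2, so the exact error rate is Φ(−μ/2):

```python
import numpy as np
from scipy.stats import norm

mu, reps = 0.2, 200_000
rng = np.random.default_rng(1)

x1 = rng.normal(0.0, 1.0, reps)   # observations truly from g1 = N(0, 1)
x2 = rng.normal(mu, 1.0, reps)    # observations truly from g2 = N(mu, 1)

# Bayes rule with equal priors: choose g1 iff log g1(x) - log g2(x) >= 0,
# which for this pair is x < mu / 2.
err = 0.5 * np.mean(x1 >= mu / 2) + 0.5 * np.mean(x2 < mu / 2)
print(err, norm.cdf(-mu / 2))     # Monte Carlo error rate vs exact value
```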

Therefore, A(δ) is approximated by

A(δ) ≈ 1/2 − √(δ/8).

Hence, the condition √(δ/8) ≤ α or, equivalently, δ ≤ 8α² is approximately sufficient for (16). Let C_α denote the solution in δ of the equation A(δ) = 1/2 − α, or, more simply, let C_α be given by

C_α = 8α².   (17)

In the latter case, if α = 0.05 (resp. 0.01), then C_α = 1/50 (resp. 1/1250). The final form of the p/n criterion is given by substituting C_α for C in (9) or (15).
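The threshold is trivial to compute; the helper below (the function name is mine) just evaluates C_α = 8α² and reproduces the two values quoted above:

```python
def c_alpha(alpha: float) -> float:
    # Threshold from the approximate sufficient condition sqrt(delta/8) <= alpha:
    # the estimated risk is acceptable when it does not exceed 8 * alpha^2.
    return 8.0 * alpha ** 2

print(c_alpha(0.05), c_alpha(0.01))  # 0.02 (= 1/50) and 0.0008 (= 1/1250)
```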

p/n Criterion for multinomial distribution
In this section, we present a formula for the bin number of a multinomial distribution using the p/n criterion. The bin number problem in a histogram can be treated similarly. Although several formulas have been proposed for the bin number (or the bin width) of a histogram, such as Sturges' formula and the Freedman–Diaconis formula (see Chapter 3 of Scott (2015)), the formula here is derived from a new perspective. In view of the true distribution g(x) and the information projection g(x; θ*), a multinomial distribution can be seen as an approximation by the step-function model. Let S_0, ..., S_p be a partition of the sample space, and consider the model

g(x; m) = Σ_{i=0}^p ( m_i / μ(S_i) ) I(x ∈ S_i),  m = (m_0, ..., m_p),  Σ_i m_i = 1,

where I(x ∈ S_i) is the indicator function of S_i. In this case, from (4), the information projection g(x; m*) is given by m*_i = P(X ∈ S_i | g(x)). The step-function model is not an exponential family. However, we easily notice that the Kullback–Leibler divergence between two step functions (where dμ is the continuous measure) is equal to the divergence between the two corresponding multinomial distributions (where dμ is the counting measure). Hence, the argument on the estimation risk can be deduced from that for the multinomial distribution model. It is notable that, if X is originally a discrete random variable, the model always contains g(x).
Consider a multinomial distribution with p + 1 possible values x_i, i = 0, ..., p, with the corresponding probabilities m = (m_0, ..., m_p). This is an exponential family (6) with

ξ_i(x) = I(x = x_i),  θ^i = log( m_i/m_0 ),  i = 1, ..., p,  ψ(θ) = log( 1 + Σ_{i=1}^p exp θ^i ) = −log m_0,

and dμ is the counting measure on {x_0, x_1, ..., x_p}. Here, η_i = m_i, i = 1, ..., p. The asymptotic expansion of the estimation risk up to the second order can be derived as follows (this corresponds to equation (41) of Sheena (2018) with α = −1).
R = p/(2n) + ( M − 1 )/(12n²) + O(n^{-3}),  M = Σ_{i=0}^p 1/m_i,   (18)

where the free parameter θ = (m_1, ..., m_p) is evaluated at the true distribution. Note that if some m_i's are close to zero, the convergence speed deteriorates considerably. If we combine the first-order approximation in (18) with the threshold (17), the p/n criterion becomes

p/n ≤ 16α².   (19)
If we adopt α = 0.05 (resp. 0.01), then the sample size n or the bin number p + 1 is determined by the simple criterion for the sample size or the bin number

p/n ≤ 1/25  (resp. p/n ≤ 1/625).

The second-order approximation gives the following p/n criterion:

p/(2n) + ( M̂ − 1 )/(12n²) ≤ 8α²,  M̂ = Σ_{i=0}^p 1/m̂_i,

where m̂_i is the MLE, the sample relative frequency, for each i. Applying the criterion to the determination of n gives the formula

n ≥ ( 3p + √( 9p² + 96α²( M̂ − 1 ) ) ) / (96α²).

In contrast, if the criterion is used for the bin number problem, the formula is given by

6np + M̂ < 96n²α² + 1.
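Both multinomial criteria can be wrapped in small helpers. In the sketch below (the function names are mine, and taking M̂ to be the sum of reciprocal estimated cell probabilities is an assumption of this sketch), the abalone figures p = 62, n = 4177 used in Sect. 3.4 pass the α = 0.05 check but fail the α = 0.01 one:

```python
import numpy as np

def simple_criterion(p: int, n: int, alpha: float) -> bool:
    # First-order p/n criterion for a multinomial with p + 1 cells.
    return p / n <= 16 * alpha ** 2

def second_order_criterion(counts, alpha: float):
    # Second-order check 6*n*p + M_hat < 96*n^2*alpha^2 + 1, with M_hat the
    # sum of reciprocal estimated cell probabilities (an assumption here).
    counts = np.asarray(counts, dtype=float)
    n, p = counts.sum(), len(counts) - 1
    M_hat = np.sum(n / counts)
    return 6 * n * p + M_hat < 96 * n ** 2 * alpha ** 2 + 1

print(simple_criterion(62, 4177, 0.05), simple_criterion(62, 4177, 0.01))
# True False
```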
Use of these criteria for practical examples is discussed in Sect. 3.4.

Algorithm for p/n criterion of exponential family
This section describes calculation of the right-hand side of (15). If we can calculate the function ψ(θ) analytically, the algorithm is simply the following.

Step 5 Calculate the right-hand side of (15) and compare it with C_α.

Often, ψ(θ) is not explicitly given, especially for a complex model. Then, θ̂ can be iteratively calculated using the Newton–Raphson method with the Jacobian matrix (12). Because ψ̈(θ) is the variance–covariance matrix of the ξ_i terms under the g(x; θ) distribution, its value can be approximated from a generated sample. The alternative steps are as follows.
Step 2' Iteratively search for θ̂ with

θ^{(n+1)} = θ^{(n)} + ψ̈(θ^{(n)})^{-1} ( η̂ − η(θ^{(n)}) ),

where η(θ^{(n)}) and ψ̈(θ^{(n)}) are approximated by the sample mean and the sample covariance matrix of the ξ_i terms from the g(x; θ^{(n)}) distribution. Further, (12), (13), and (14) can also be approximated using the generated sample.
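Step 2' is ordinary Newton–Raphson applied to the moment equation η(θ) = η̂, with ψ̈ as the Jacobian. A one-parameter sketch (the Bernoulli family and all names are assumptions; here η and ψ̈ are available in closed form, whereas Step 2' would approximate them from a generated sample):

```python
import numpy as np

# Bernoulli family: psi(theta) = log(1 + e^theta), eta(theta) = sigmoid(theta),
# psi''(theta) = eta * (1 - eta)  (the 1x1 Fisher metric).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def solve_theta(eta_hat, theta=0.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        eta = sigmoid(theta)              # eta(theta) = dpsi/dtheta
        fisher = eta * (1.0 - eta)        # psi''(theta)
        step = (eta_hat - eta) / fisher   # Newton update toward eta(theta) = eta_hat
        theta += step
        if abs(step) < tol:
            break
    return theta

theta_hat = solve_theta(0.3)
print(theta_hat, np.log(0.3 / 0.7))  # MLE matches the closed-form logit
```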
Step 3' Approximate (12), (13), and (14) using the sample moments and cumulants, where the sample is generated from g(x; θ̂). The point here is that ψ(θ) is not required for sample generation in Steps 2' and 3' if methods such as MCMC (which require no normalizing constant) are used. Although Steps 2' and 3' are computationally heavy, they enable the construction of a complex model without calculation of ψ.

Real data examples for p/n criterion
This section demonstrates the use of the p/n criterion through two practical examples under exponential family models.

Example 1 (Red Wine)
The first example is a well-known dataset on wine quality, taken from the U.C.I. Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).
Only the red wine data are used. The sample size is 1599, and the variables consist of 11 chemical substances (continuous variables) and a "quality" index (integers from 3 to 8). The vector of the chemical substances and the "quality" variable are denoted by x^(1) = (x^(1)_1, ..., x^(1)_11) and x^(2), respectively. We randomly divided the sample into two halves, one of which ("data_base") was used for the model formulation and the other ("data_est") for the estimation of the parameters.
For the model formulation, we determined the following: the normalization method for the original data, the reference (probability) measure dμ(x), and the ξ elements. Using "data_base", we proceed as follows.
1. Each chemical-substance variable x^(1)_i (i = 1, ..., 11) is divided by twice its maximum so that its range is [0, 1). Further, 2 is subtracted from each "quality" index to give the range {1, 2, ..., 6}.
2. As dμ(x), 11 independent Beta distributions are applied to x^(1) so that their means and variances are equal to those of "data_base". The multinomial distribution of x^(2) is adopted, using each category's sample relative frequency as the category probability parameter (say, m_i, i = 1, ..., 6). In addition, x^(1) and x^(2) are taken to be independent.
Consequently, dμ is selected as

dμ(x) = ( Π_{i=1}^{11} Beta( x^(1)_i ; β_{1i}, β_{2i} ) dx^(1)_i ) × ( Σ_{j=1}^{6} m_j I( x^(2) = j ) ) dν(x^(2)),

where dν(x^(2)) is the counting measure on {1, 2, ..., 6}, and I(·) is the indicator function. Further, β_{1i}, β_{2i}, and m_i satisfy the moment-matching relations with "data_base" described in step 2.
3. The candidates for the ξ_i terms are as follows. Because some of these terms are highly correlated, we eliminated one of each pair with correlation higher than 0.95. The following 20 ξ_i terms were removed from the full model: 17, 19, 24, 25, 27, 32, 34, 38, 40, 43, 45, 46, 47, 49, 53, 58, 62, 64. Consequently, an exponential family model with p = 47 is formulated. As the probability distribution g(x; θ) dμ equals dμ when the θ terms all equal zero, dμ is denoted by g(x; 0). Note that the g(x; θ*) of this model is the closest distribution to g(x; 0) among the distributions that share the given expectations E[ξ_i] for each ξ_i in the model. This is a consequence of the so-called "minimum relative entropy characterization" of an exponential family (see Csiszár (1975)).
Under the formulated exponential family model, the algorithm in the previous section was implemented, and the right-hand side of (15) was calculated using "data_est", whose size n equals 799. Because of the model complexity, the explicit form of ψ(θ) could not be obtained; hence, the alternative Steps 2' and 3' were used. The R and RStan program codes for the whole risk calculation are presented on GitHub (https://github.com/YSheena/P-N_Criteria_Program.git). The first- and second-order terms and the estimation risk in the total of (15) were as follows:

First-order term: 2.95e-02, Second-order term: -1.30e-04, Estimation risk: 2.93e-02.

Note that the second-order term contributes little to the estimation risk; thus, the first-order approximation seems sufficient for this model and data. With the threshold (17), the equation 2.93e-02 = 8α² gives the solution α ≈ 0.06. Hence, the Bayes error rate between g(x; θ̂) and g(x; θ*) is higher than 0.44. If we set the threshold at α = 0.05, so that C_α = 1/50 = 2.00e-02, the estimated risk exceeds C_α and the criterion is not satisfied.

Table 1 Abalones by sex and rings

Rings   1    2    3    4    5    6    7    8    9    10   11   12   13
F       *    *    *    *    4    16   44   122  238  248  200  128  88
I       1    1    12   51   100  216  267  274  173  92   62   21   24
M       *    *    3    6    11   27   80   172  278  294  225  118  91

Rings   14   15   16   17   18   19   20   21   22   23   24

Example 2 (Abalone Data)

The next example also features a well-known dataset, in this case for the physical measurements of abalones (U.C.I. Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Abalone). The data comprise eight properties (sex, length, diameter, etc.) of 4177 abalones. Here, only two discrete variables were considered: "sex" and "rings", where "sex" had three values, "Female", "Infant", and "Male", and "rings" had integer values from 1 to 29. The frequency of each group classified by "sex" and "rings" is given in Table 1. The original frequencies were aggregated at both ends.
In the table, if a cell with a star mark is located to the immediate left or right, the number in the cell is aggregated. For example, of the female abalones, the cells with 24 or more rings were aggregated to the frequency 4. The total number of cells was 63.
A multinomial distribution over the 63 cells was considered; hence, p = 62. First, the simple criterion (19) is adopted: p/n = 62/4177 ≈ 0.015, which is smaller than 1/25 but larger than 1/625. Consequently, the model distribution is close to the information projection (in this case, the true distribution) to the extent that the Bayes error rate is more than 0.45 but less than 0.49.