$P$-value model selection criteria for exponential families of increasing dimension

Let $\mathcal{M}_{\underline{i}}$ be an exponential family of densities on $[0,1]$ pertaining to a vector of orthonormal functions $b_{\underline{i}}=(b_{i_1}(x),\ldots ,b_{i_p}(x))^\mathbf{T}$ and consider a problem of estimating a density $f$ belonging to such family for unknown set ${\underline{i}}\subset \{1,2,\ldots ,m\}$, based on a random sample $X_1,\ldots ,X_n$.
Pokarowski and Mielniczuk (2011) introduced model selection criteria in a general setting based on p-values of likelihood ratio statistic for $H_0: f\in \mathcal{M}_0$ versus $H_1: f\in \mathcal{M}_{\underline{i}}\setminus \mathcal{M}_0$, where $\mathcal{M}_0$ is the minimal model. In the paper we study consistency of these model selection criteria when the number of the models is allowed to increase with a sample size and $f$ ultimately belongs to one of them.
The results are then generalized to the case when the logarithm of $f$ has infinite expansion with respect to $(b_i(\cdot ))_1^\infty$. Moreover, it is shown how the results can be applied to study convergence rates of ensuing post-model-selection estimators of the density with respect to Kullback-Leibler distance. We also present results of simulation study comparing small sample performance of the discussed selection criteria and the post-model-selection estimators with analogous entities based on Schwarz's rule as well as their greedy counterparts.


Introduction
Consider the following general scenario of model selection. Let $\{\mathcal{M}_i\}_{i\in I}$ be a list of models, where $I \subset 2^{\mathrm{full}}$ and $\mathrm{full} = \{1, 2, \ldots, m\}$, consisting of densities described by a finite-dimensional parameter $\theta_i$, such that for every $i \in I$ we have $\mathcal{M}_0 \subseteq \mathcal{M}_i \subseteq \mathcal{M}_{\mathrm{full}}$ for some minimal model $\mathcal{M}_0$. In the case of a correct specification of the list of models, the density $f$ from which an iid sample of size $n$ is available belongs to a model $\mathcal{M}_t$ for some $t \in I$ of cardinality $|t|$ such that $f \in \mathcal{M}_j$ implies $\mathcal{M}_t \subseteq \mathcal{M}_j$. Throughout, $|i|$ will denote the cardinality of a set $i$. Pokarowski and Mielniczuk (2011) considered general parametric models $\mathcal{M}_i$ and introduced the (unscaled) minimal p-value criterion $M_n^m$ based on asymptotic p-values of the likelihood ratio test statistics $\Lambda_{n,0,i}$ for testing $H_0: f \in \mathcal{M}_0$ versus $H_1: f \in \mathcal{M}_i \setminus \mathcal{M}_0$. It is proved there that, under mild conditions on the regularity of the models and for general lists of parametric models of fixed size $m$ which does not depend on the sample size, the introduced minimal p-value criterion (mPVC) is consistent in the sense considered for selection rules, i.e. $P(M_n^m = t) \to 1$. Moreover, it is shown that the Bayesian Information Criterion BIC (cf. Schwarz 1978) is an approximation of mPVC.
In the present paper we focus on a special family of models, namely exponential models, and we generalize and strengthen these consistency results to the case of possibly growing families, allowing for misspecification of the true density. We also consider the maximal p-value criterion (MPVC) as well as greedy versions of both mPVC and MPVC.
We note that allowing the size of the list of models to grow together with the sample size is of interest, as it makes it possible to handle nonparametric situations, e.g. the case when the logarithm of the underlying density has an infinite orthonormal expansion. Therefore, besides results on consistency of the introduced rules and on the Kullback-Leibler distance from the ensuing post-model-selection estimator to the true density when it belongs to one of the underlying models, we were able to prove similar results in the case when the list of models is misspecified. In particular, Theorem 3 states that the selector based on the minimal p-value criterion is then conservative in a sense defined in Sect. 3.1.
BIC-based selection of an appropriate exponential model was applied by Ledwina (1994) to construct data-adaptive Neyman smooth tests. It turned out to be a powerful tool in many testing problems (cf. e.g. Inglot and Ledwina 1996; Kallenberg and Ledwina 1997). In the present paper we investigate the usefulness of such an approach in a parallel problem of estimation. In particular, in Sect. 4 we compare numerically the performance of post-model-selection estimators based on p-values with those based on BIC. It is important to stress that, unlike in testing problems, which usually consider a nested family of models, we deal here with an arbitrary family of exponential models.
The presented approach can also lead to effective methods of estimating functionals of a density which compare favorably with the kernel plug-in method. In particular, Mielniczuk and Wojtyś (2010) investigated this problem in the case of Fisher information with BIC applied as a selection criterion, and a similar method may also be used to approximate the entropy of a density, which is usually estimated with the use of a kernel estimate truncated from below.
Moreover, let us note that the construction of selection criteria using p-values is quite general and can be applied in other scenarios. In particular, an analogous approach may be used to choose the most parsimonious linear model from a growing list of linear models provided that a design matrix satisfies mild regularity conditions.
The paper is organized as follows. In Sect. 2 we introduce exponential families of densities, define p-value criteria and state auxiliary lemmas. Section 3 contains the main results on consistency and conservativeness of the mPVC and MPVC criteria as well as their greedy counterparts. The results concern both the situation when the underlying density belongs to one of the exponential models considered (correct specification of the family of models) and the opposite, misspecified case. We also show how these results can be applied to bound convergence rates of the ensuing post-model-selection estimators of the density with respect to Kullback-Leibler distance. In Sect. 4 the behavior of post-model-selection estimators is studied by means of numerical experiments.

Minimal and maximal p-value selection criteria
We specify $\{\mathcal{M}_i\}_{i\in I}$ to be a list of exponential models.
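Let $(b_j)_{j=0}^{\infty}$, with $b_0 \equiv 1$, be an orthonormal system of functions on $[0,1]$ which is linearly independent in the sense that $\sum_{j=0}^{k} a_j b_j = 0$ and $k \in \mathbb{N}$ implies $a_j = 0$ for $j = 0, 1, \ldots, k$. For fixed $m \in \mathbb{N}$ and every nonempty subset $i$ of $\{1, \ldots, m\}$ define a regular exponential family of densities on $[0,1]$ of the form
$$\mathcal{M}_i = \left\{ f_\theta : f_\theta(x) = \exp\{\theta^{\mathbf{T}} b_i(x) - \psi_i(\theta)\},\ \theta \in \mathbb{R}^{|i|} \right\}, \qquad (1)$$
where $\psi_i(\theta) = \log \int_0^1 \exp\{\theta^{\mathbf{T}} b_i(x)\}\,dx$ is the normalizing constant corresponding to the model $\mathcal{M}_i$ under consideration. Let $1(\cdot)$ be an indicator function and $\mathcal{M}_0 = \{ f_0(x) = 1(x \in [0, 1])\}$ the smallest (null) model consisting of the uniform density only. We refer to van der Vaart (2000), Sect. 4.2, and Wainwright and Jordan (2008) for an introduction to exponential families.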
Let $f \in L^2([0, 1], \lambda)$ be some unknown density on $[0, 1]$ and consider the problem of choosing one of the above $2^m$ models in order to estimate $f$ with the use of a random sample $X_1, \ldots, X_n \sim f$. We consider the following model selection criteria, which were proposed in Pokarowski and Mielniczuk (2011) and are based on the idea of likelihood ratio goodness-of-fit tests. Namely, for a fixed $m \in \mathbb{N}$ and for every nonempty $i \subset \{1, \ldots, m\}$ consider testing the null hypothesis $H_0: f \in \mathcal{M}_0$ versus $H_1: f \in \mathcal{M}_i \setminus \mathcal{M}_0$ by means of the statistic
$$\Lambda_{n,0,i} = 2\sum_{j=1}^{n} \log f_{\hat\theta_i^{ML}}(X_j) \qquad (2)$$
calculated for the random sample $X_1, \ldots, X_n \sim f$, where $\hat\theta_i^{ML}$ is the maximum likelihood estimator of the parameter $\theta$ in the family $\mathcal{M}_i$ based on $X_1, \ldots, X_n$. Note that since for $i = \emptyset$ the estimator $f_{\hat\theta_i^{ML}}$ is equal to $f_0(x) = 1(x \in [0, 1])$, the statistic $\Lambda_{n,0,i}$ is the likelihood ratio test statistic (LRT) for the above testing problem.
In order to compute the p-value of (2), knowledge of the distribution of the likelihood ratio statistic under the null hypothesis is needed. Since its exact distribution is not known in the case of the exponential family (1), we use its asymptotically valid approximation by a chi-squared distribution. Namely, Wilks' theorem (cf. e.g. Theorem 5.6.3 in Sen and Singer 1993) implies that the likelihood ratio statistic defined as in (2) has an asymptotic $\chi^2_{|i|}$ distribution when the null hypothesis is true. Let
$$p(x \mid k) = 1 - F_k(x), \qquad (3)$$
where $F_k$ is the cumulative distribution function of the $\chi^2_k$ distribution. We consider $p(\Lambda_{n,0,i} \mid |i|)$ as an approximate p-value of the test statistic (2), i.e. the conditional probability that, given $\Lambda_{n,0,i}$, a random variable having the $\chi^2_{|i|}$ distribution exceeds $\Lambda_{n,0,i}$. Observe that if $G_n$ is the c.d.f. of $\Lambda_{n,0,i}$ then the c.d.f. of the random variable $p(\Lambda_{n,0,i} \mid |i|)$ equals $1 - G_n(F_{|i|}^{-1}(1-x))$, which under $H_0$ is approximately the c.d.f. of the uniform distribution. As the best fitted model for $f$ we choose the one having the smallest scaled p-value:
$$\hat i_{\mathrm{mPVC}} = \arg\min_{i \subset \{1,\ldots,m\}} p(\Lambda_{n,0,i} \mid |i|)\, e^{a_n |i|},$$
where $a_n \ge 0$, $a_n/n \to 0$ as $n \to \infty$ and we put $p(\Lambda_{n,0,0} \mid 0) := n^{-1/2}$. In the case of ties the model $\mathcal{M}_i$ having the smallest number of parameters $|i|$ is chosen. The model selection criterion based on choosing the smallest p-value was introduced in Pokarowski and Mielniczuk (2011), where it is called the minimal p-value criterion (mPVC). Its original definition considered only the case $a_n = 0$. In this case, from among the pairs $\{H_0, H_i\}$ we choose the pair for which we are most inclined to reject $H_0$, i.e. we select the model corresponding to the most convincing alternative hypothesis. If $a_n > 0$, the scaling factor is interpreted as an additional penalization for the complexity of the model.
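The mPVC rule described above can be illustrated with a short Python sketch. It assumes that the statistics $\Lambda_{n,0,i}$ have already been computed and are passed as a dictionary keyed by index tuples; the names `chi2_sf` and `select_mpvc` and this input format are hypothetical choices made for illustration and do not come from the paper. The chi-squared tail is evaluated with an exact recurrence for integer degrees of freedom, so only the standard library is needed.

```python
import math
from typing import Dict, Tuple

Subset = Tuple[int, ...]


def chi2_sf(x: float, k: int) -> float:
    """Upper tail P(Z > x) of the chi-squared distribution with k >= 1 d.f.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)              # starting value for k = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # starting value for k = 1
    while a < k / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def select_mpvc(lrt: Dict[Subset, float], n: int, a_n: float = 0.0) -> Subset:
    """Return the index set with the smallest scaled p-value (illustrative helper).

    lrt maps each nonempty index tuple i to Lambda_{n,0,i}; the empty tuple
    stands for the null (uniform) model, whose score is set to n**(-1/2).
    """
    scores = {(): n ** -0.5}
    for subset, stat in lrt.items():
        scores[subset] = chi2_sf(stat, len(subset)) * math.exp(a_n * len(subset))
    # Ties are resolved in favour of the model with fewer parameters.
    return min(scores, key=lambda s: (scores[s], len(s)))
```

For instance, `select_mpvc({(1,): 40.0, (2,): 1.0, (1, 2): 41.0}, n=300)` returns `(1,)`: adding the second index raises the statistic only slightly, so the extra degree of freedom makes the p-value larger. For very large statistics the p-values underflow to zero in floating point, so a production version would more likely compare log p-values.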
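As an alternative method, the maximal p-value criterion (MPVC) is also proposed in Pokarowski and Mielniczuk (2011), which involves test statistics for a family of $2^m$ null hypotheses of the form $H_0: f \in \mathcal{M}_i$ versus $H_1: f \in \mathcal{M}_{\mathrm{full}} \setminus \mathcal{M}_i$, with the likelihood ratio statistic
$$\Lambda_{n,i,\mathrm{full}} = 2\sum_{j=1}^{n} \log \frac{f_{\hat\theta_{\mathrm{full}}^{ML}}(X_j)}{f_{\hat\theta_i^{ML}}(X_j)}, \qquad (4)$$
where $\mathrm{full} = \{1, \ldots, m\}$ corresponds to the largest considered model. In this case we choose the model for which the likelihood ratio statistic attains the largest scaled p-value:
$$\hat i_{\mathrm{MPVC}} = \arg\max_{i \subset \{1,\ldots,m\}} p(\Lambda_{n,i,\mathrm{full}} \mid m - |i|)\, e^{-a_n |i|},$$
where $a_n \ge 0$, $a_n/n \to 0$ as $n \to \infty$ and, as before, $p(\Lambda_{n,i,\mathrm{full}} \mid m - |i|)$ is an approximate p-value defined as in (3). We also put $p(\Lambda_{n,\mathrm{full},\mathrm{full}} \mid 0) := 1$. Here we use the fact that the likelihood ratio statistic (4) has, under $H_0: f \in \mathcal{M}_i$ and for fixed $m$, an asymptotic $\chi^2_{m-|i|}$ distribution. The motivation is similar to the motivation of mPVC, namely we choose the model which we are the least inclined to reject when compared to the full model.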
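A matching sketch for the MPVC rule is given below, again assuming precomputed statistics $\Lambda_{n,i,\mathrm{full}}$ supplied as a dictionary (with the empty tuple for the null model). The function names and the tie-breaking rule (fewer parameters win) are assumptions made for this illustration; the text above does not state a tie rule for MPVC.

```python
import math
from typing import Dict, Tuple

Subset = Tuple[int, ...]


def chi2_sf(x: float, k: int) -> float:
    """Upper tail P(Z > x) of the chi-squared distribution with k >= 1 d.f."""
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < k / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def select_max_pvalue(lrt_full: Dict[Subset, float], m: int, a_n: float) -> Subset:
    """Return the index set with the largest scaled p-value (illustrative helper).

    lrt_full maps each proper subset i of {1, ..., m} (including the empty
    tuple) to Lambda_{n,i,full}; the full model uses the convention p := 1.
    """
    full = tuple(range(1, m + 1))
    scores = {full: math.exp(-a_n * m)}
    for subset, stat in lrt_full.items():
        if len(subset) < m:
            p_value = chi2_sf(stat, m - len(subset))
            scores[subset] = p_value * math.exp(-a_n * len(subset))
    # Assumed tie rule: prefer the model with fewer parameters.
    return max(scores, key=lambda s: (scores[s], -len(s)))
```

With `m = 2`, `n = 300`, `a_n = 0.5 * math.log(300)` and statistics `{(): 45.0, (1,): 0.8, (2,): 44.0}` the sketch returns `(1,)`. With `a_n = 0` the full model has score 1 and is always returned by this sketch, which is in line with the role played by the condition $a_n \to \infty$ in the consistency results for MPVC.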
The aim of the paper is to study properties of the above criteria when the fixed $m$ is replaced by $m_n$ possibly depending on $n$. In other words, the number of models is allowed to change with the sample size. Such an assumption allows us to consider the case when the unknown density $f$ belongs to one of the models only ultimately, as well as the case when the logarithm of $f$ has an infinite expansion with respect to $(b_i)_1^\infty$. We assume throughout that the size $m_n$ of the list is a nondecreasing function of $n$. In the paper we establish conditions under which consistency properties of mPVC and MPVC hold in such settings. In particular, it turns out that under the introduced scaling conditions the MPVC rule is in general conservative. The assumption $a_n \to \infty$ is a sufficient condition for consistency of MPVC (cf. Theorems 5 and 6). This is a difference between the mPVC and MPVC criteria, as mPVC may be consistent for $\limsup a_n < \infty$. In the following we use the notation $m_n$ wherever possible; however, in some formulas, in order to avoid subscripts of multiple levels, we abbreviate the symbol $m_n$ to $m$.
Observe that both p-value criteria are strictly monotone functions of the maximized likelihood of a model provided the number of degrees of freedom is fixed. The same property holds for the BIC and AIC criteria. It follows that if two criteria enjoying this property choose models having the same number of parameters, e.g. if $|\hat i_{\mathrm{mPVC}}| = |\hat i_{\mathrm{MPVC}}|$, and these models are uniquely determined, then the chosen models necessarily coincide.
Before presenting the main results of the paper, we provide several auxiliary results concerning the properties of the likelihood ratio statistic $\Lambda_n$, which are crucial to prove the consistency of mPVC and MPVC in the case of exponential families of growing dimensions. Some of them, in particular Lemmas 3 and 4, are of independent interest. In their statements the notions of relative entropy and information projection will be used. Thus, let $D(f \| g)$ denote the relative entropy (or Kullback-Leibler distance) between densities $f$ and $g$, which is equal to $\int_{-\infty}^{\infty} f \log(f/g)$ if $f$ is absolutely continuous with respect to $g$, and $\infty$ otherwise.
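As a small numerical illustration of the relative entropy just defined, the following sketch approximates $D(f \| g)$ for two densities supported on $[0,1]$ with a midpoint rule. The function name `kl_divergence`, the grid size and the use of the midpoint rule are assumptions made for this example only.

```python
import math
from typing import Callable

Density = Callable[[float], float]


def kl_divergence(f: Density, g: Density, grid_size: int = 20000) -> float:
    """Midpoint-rule approximation of D(f||g) = int_0^1 f log(f/g) dx."""
    total = 0.0
    for k in range(grid_size):
        x = (k + 0.5) / grid_size
        fx, gx = f(x), g(x)
        if fx == 0.0:
            continue           # convention: 0 * log 0 = 0
        if gx == 0.0:
            return math.inf    # f puts mass where g vanishes (detected on the grid)
        total += fx * math.log(fx / gx)
    return total / grid_size
```

For the density $f(x) = 2x$ and the uniform density $g \equiv 1$ the exact value is $\log 2 - 1/2$, which the sketch reproduces to several decimal places.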
For the sake of simplicity of notation, in the following we use the symbol $\hat f_i$ to denote the maximum likelihood estimator of the density $f$ in the family $\mathcal{M}_i$. For an exponential family the likelihood equations give $\Lambda_{n,0,i} = 2n D(\hat f_i \| f_0)$. Thus, as the p-value $p(\Lambda_{n,0,i} \mid |i|)$ is a strictly monotone function of $\Lambda_{n,0,i}$ for a fixed $|i|$, it follows that on the stratum $|i| = j$ the mPVC criterion chooses an alternative $i$ such that $\hat f_i$ has the largest KL distance from the uniform density.

Information projection and basic assumptions
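Suppose that $\theta_i^* \in \mathbb{R}^{|i|}$ is the unique vector which satisfies the equation $\int_0^1 b_i(x) f_{\theta_i^*}(x)\,dx = \int_0^1 b_i(x) f(x)\,dx$. The density $f_i^* = f_{\theta_i^*}$ is then the information projection of $f$ onto $\mathcal{M}_i$, and the Pythagorean-like equality $D(f \| f_\theta) = D(f \| f_i^*) + D(f_i^* \| f_\theta)$ holds for every $f_\theta \in \mathcal{M}_i$. This implies uniqueness of the projection. For more of its properties see e.g. Barron and Sheu (1991).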
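Moreover, define the Kullback-Leibler distance between a density $f$ and a set $\mathcal{M}$ of densities as $D(f, \mathcal{M}) = \inf_{g \in \mathcal{M}} D(f \| g)$. For $t_m^*$ denoting the set of indices of the nonzero coordinates of $\theta^*_{\mathrm{full}}$, it follows that $t_m^*$ is the minimal, with respect to inclusion, set of indices such that $D(f, \mathcal{M}_{t_m^*}) = D(f, \mathcal{M}_{\mathrm{full}})$. Indeed, the fact that every other model which attains the minimum of $D(f, \cdot)$ includes $\mathcal{M}_{t_m^*}$ as its subset is implied by the Pythagorean-like equality mentioned above and the assumed linear independence of the system $(b_i)_{i=0}^{\infty}$ with $b_0 \equiv 1$. Our aim in the paper is to identify $t_m^*$ for a given family $\{\mathcal{M}_i\}_{i \subset \{1,\ldots,m_n\}}$ using the introduced model selection criteria.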
In the following $f$ will denote an arbitrary density on $[0,1]$. Assumptions (A1), (A2) and (A5) below are general assumptions about $f$, whereas (A6) stipulates that $\log f$ has an expansion with respect to the functions $(b_j)_{j=0}^{\infty}$. Existence of the information projection is discussed in Theorem 3.3 in Wainwright and Jordan (2008). Assumption (A3) concerns the growth of $\|b_j\|_\infty$, where $\|b\|_\infty = \sup_{x \in [0,1]} |b(x)|$ denotes the supremum norm of $b$, and (A4) constrains the growth of $m_n$. For all main results (A6) will be assumed.
The lemmas below are stated under minimal subsets of assumptions from (A1) to (A6). Their proofs rely on methods developed by Barron and Sheu (1991). In particular, Lemma 2 is an extension of a result proved there to the case when the quantities considered in (8)-(9) are maximized over all subsets of $\{1, \ldots, m_n\}$. Such properties are useful when a selection rule $\hat i$ based on an exhaustive search of all subsets is considered. Lemmas 3-5 are, to the best of our knowledge, new. Let $(\theta_j)_{j=1}^{\infty}$ be the vector of coefficients given in the representation (A6) of $f$. We use throughout the notation $t = \{j \in \mathbb{N} : \theta_j \neq 0\}$ for the set of indices of nonzero coefficients of $(\theta_j)_{j=1}^{\infty}$. Note that if $f$ belongs to an exponential family, the set $t$ corresponds to the minimal model containing it discussed in the Introduction. For every vector $v = (v_j)_{j=1}^{\infty}$ let $v_i$ denote the vector obtained from $(v_j)_{j=1}^{\infty}$ by choosing only those coordinates whose indices are elements of the set $i$, $i \subset \mathbb{N}$. Moreover, for $m \in \mathbb{N}$ we set $v_m = v_{\{1,\ldots,m\}}$.
If $f$ belongs to some exponential family of distributions $\mathcal{M}_t$, i.e. $|t| < \infty$, and the maximal element of $t$, denoted by $\max t$, satisfies $\max t \le m_n$, then $t = t_m^*$, since in this case $\theta^*_{\mathrm{full}} = \theta_{\mathrm{full}}$. Observe also that if $t$ is infinite then $f$ does not belong to any exponential model (1). We recall that $m = m_n$ may depend on the sample size $n$.

Auxiliary lemmas
In this subsection we give several auxiliary lemmas which are crucial to prove the consistency of mPVC and MPVC in the case of exponential families of growing dimensions. Their proofs are deferred to the Appendix.
Lemmas 3 and 4 concern the asymptotic properties of the likelihood ratio statistic $\Lambda_{n,0,i}$ in the case of exponential families of growing dimensions and as such are of independent interest. For $i \subset \{1, \ldots, m_n\}$ define Note that $\hat f_i$ and $f_i^*$ do not depend on the specific system of functions $(b_j(x))_{j \in i}$ provided the systems span the same linear space. In particular, in Lemma 4 we will investigate the behavior of the quantities $D(f_i^* \| \hat f_i)$ and $\Lambda_{n,0,i}$ considering an additional system of functions stands for a linear space spanned by functions from system $B$ and $i_1 \subset i_2 = \{1, \ldots, m_n\}$. Such a system can be obtained by the Gram-Schmidt orthonormalisation procedure. Obviously, the $\tilde b_j$ depend on $f$, but it follows from Lemma 7 in the Appendix that in order to bound $D(f_i^* \| \hat f_i)$ using the constructed system it is enough to evaluate the constant $A_m(f)$ appearing there. This approach was used by Barron and Sheu (1991) (cf. the proof of (6.7) in their paper) and it yields better rates of convergence of $D(f_i^* \| \hat f_i)$ than a direct method based on the Yurinski (1976) inequality.

Main results
Sections 3.1 and 3.2 deal with consistency of the minimal and maximal p-value criteria, respectively. In Sect. 3.3 greedy counterparts of the criteria are introduced and their consistency is proved. We first state a lemma of a different character than those presented in Sect. 2. It concerns bounds for the tail of the $\chi^2_k$ distribution. Recall that $p(x \mid k)$ is the p-value defined in (3). For Part (i) of the above lemma was proved by Gordon (1941), whereas part (ii) follows from Inglot and Ledwina (2006) after noticing that the product $c(k)E_k(x)$ of the functions defined there equals $C(x, k)$.
For all the results below we assume that assumption (A6) holds, i.e. the logarithm of the underlying density f has L 2 expansion w.r.

Minimal p-value criterion mPVC
The first main result states that if $m$ is held constant then $\hat i_{\mathrm{mPVC}}$ identifies, with probability tending to 1, the indices corresponding to the nonzero coefficients of the information projection of $f$ on $\mathcal{M}_{\mathrm{full}}$.
In the case of a correct specification of the list of models we call a selection rule $\hat i = \hat i(X_1, \ldots, X_n)$ conservative if $P(t \subset \hat i) \to 1$ when $n \to \infty$. The theorem below states the conditions for conservativeness and consistency of the mPVC criterion when the true density $f$ belongs to one of the models on the list. Observe that some growth conditions on $m_n$ have to be imposed in this context, as the method of calculating approximate p-values relies on Wilks' theorem and the quality of the approximation of $\Lambda_{n,0,i}$ by $\chi^2_{|i|}$ deteriorates when $|i|$ increases together with $n$.
Theorem 2 (mPVC consistency) Assume (A1)-(A6), $|t| < \infty$ and $\lim_{n\to\infty} m_n \ge \max t$. Then (i) $\lim_{n\to\infty} P(t \subset \hat i_{\mathrm{mPVC}}) = 1$, (ii) if $m_n \log m_n = o(\log n + a_n)$ as $n \to \infty$ then $\lim_{n\to\infty} P(\hat i_{\mathrm{mPVC}} = t) = 1$.
As in the proof of Theorem 1, Lemmas 3(i), (ii) and 5 imply that there exists $\varepsilon > 0$ such that It is easily seen that the inequality $\Gamma$ Moreover, for $L(n, i) := \frac{|i|-2}{2} \log\left(\frac{\Lambda_{n,0,i}}{2}\right)$ we have $L(n, i) \ge 0$ for $|i| \ge 2$ and Lemma 3 easily implies $\min_{i: |i|=1} L(n, i) \ge -cn$ with probability tending to 1 for any positive $c$. Thus, using the assumption $a_n/n \to 0$, bounds (11) and (12) and the relations above, we obtain equation (10) and it follows that $\lim_{n\to\infty} P(t \subset \hat i_{\mathrm{mPVC}}) = 1$.
Assume now that $t \subsetneq i \subset \{1, \ldots, m_n\}$, where $f \not\equiv 1$. As $\lambda_i = \lambda_t > 0$, Lemma 6 may be applied to both p-values. As in proof of Theorem Lemma 4 implies $\max_{i:\, t \subset i \subset \{1,\ldots,m_n\}} |\Lambda_{n,0,i} - \Lambda_{n,0,t}| = O_P(m_n)$. Moreover observe that The first term in the above decomposition is positive as $|i| \ge 2$ and $t \subset i$, and the second is easily seen to be larger than $-c \log n$ with probability tending to 1 for any positive $c$. Finally, using the fact that $\log\left(\Gamma(m_n/2)/\Gamma(|t|/2)\right) = O(m_n \log m_n)$ together with the assumption $m_n \log m_n = o(\log n + a_n)$ we obtain
$$\lim_{n\to\infty} P\Big(\min_{i:\, t \subsetneq i \subset \{1,\ldots,m_n\}} p(\Lambda_{n,0,i} \mid |i|)\, e^{a_n |i|} > p(\Lambda_{n,0,t} \mid |t|)\, e^{a_n |t|}\Big) = 1.$$
The proof in the case $|t| = 1$ is similar. Consider now the case when $f \equiv 1$, i.e. the minimal model is true. We will show that $P\big(\min_{i:\, i \subset \{1,\ldots,m_n\}} p(\Lambda_{n,0,i} \mid |i|)\, e^{a_n |i|} < n^{-1/2}\big) \to 0$.
As Lemma 4 yields that, when the minimal model is true, $\max_{i \subset \{1,\ldots,m_n\}} \Lambda_{n,0,i} = O_P(m_n)$, it is enough to show that for any $C > 0$, $n^{1/2} \min_{k=1,\ldots,m_n} e^{a_n k} P(Z_k > C m_n) \to \infty$, where $Z_k$ pertains to the $\chi^2_k$ distribution. This is easily shown using Lemma 6, the assumption $m_n \log m_n = o(a_n + \log n)$ and the fact that $\max_{k \le m_n} \log \Gamma(k/2) = O(m_n \log m_n)$.
Remark 2 Careful examination of the previous proof yields that under the conditions of Theorem 2, $P(t \not\subset \hat i_{\mathrm{mPVC}}) \le C (m_n^{1+2\omega}/n)^{1/2}$ for some absolute constant $C > 0$.
Theorem 3 below states that any index $i_0$ corresponding to a nonzero coefficient in the expansion of $\log f$ will eventually be included in $\hat i_{\mathrm{mPVC}}$. We call a selection rule $\hat i$ conservative for $f$ satisfying assumption (A6) if for any $M \in \mathbb{N}$, $P(t \cap \{1, \ldots, M\} \subset \hat i) \to 1$. Note that this notion coincides with the usual definition of conservativeness under correct specification given above Theorem 2 when $|t| < \infty$. In Theorem 3, $t$ can be either infinite or finite. However, the interesting part of the result corresponds to the former case, i.e. when $\log f$ has an infinite expansion w.r.t. the system $(b_i)_{i=0}^{\infty}$. As was noticed before, this corresponds to the misspecification case, when the density $f$ does not belong to any model $\mathcal{M}_{m_n}$, $n = 1, 2, \ldots$. For finite $t$ Theorem 3 coincides with Theorem 2(i).

Theorem 3 Assume (A1)-(A6), moreover that (i)
Proof It is enough to prove that for any $i_0 \in t$, $P(i_0 \in \hat i_{\mathrm{mPVC}}) \to 1$ as $n \to \infty$. The proof proceeds analogously to the proof of Theorem 2 (a). In the following we prove that there exists $\varepsilon > 0$ such that First note that, reasoning analogously as in the proof of Lemma 5, we have $\liminf$ Indeed, assumptions (A2) and (A5) imply as in Lemma 1 that there exists a constant (throughout, the integration is performed w.r.t. Lebesgue measure $\lambda$ on $[0,1]$). Thus To see this note that since , where $\theta_{t_m}$ is defined in Sect. 2.2. Lemma 1 in Barron and Sheu (1991) implies It is easy to see that assumption (i) implies $\inf_x f_{\theta_{t_m}}(x) > \varepsilon$ for some $\varepsilon > 0$ and sufficiently large $n$. Thus assumption (A5) yields (16) follows from the fact that $\sum_{i > m_n} \theta_i^2 \to 0$ and $\psi(\theta) - \psi_{t_{m_n}}(\theta_{t_{m_n}}) \to 0$ as $n \to \infty$. The last convergence follows from the proof of Lemma 4 in Barron and Sheu (1991). Now (15) and (16) imply (14). The rest of the proof is analogous to that of Theorem 2(a).
Remark 3 Observe that the above proof also yields the following more general fact. Assume that subsets $M_n \subset 2^{\{1,\ldots,m_n\}}$ are such that the following separability condition generalizing (14) holds for all $n \in \mathbb{N}$ and some $\varepsilon > 0$, or equivalently $\min_{i \in M_n} D$
Proof Observe that we have for $\hat i = \hat i_{\mathrm{mPVC}}$ The first term is 0 with probability tending to 1 in view of Theorem 2 (i). Using Lemma 7 analogously as in the proof of Lemma 4 yields $D(f_t^* \| \hat f_t) = O_P(m_n/n)$. This and the convergence $P(\hat i = t) \to 1$ implied by Theorem 2 (i) complete the proof.

Maximal p-value criterion MPVC
Theorems 5 and 6 below are analogous to Theorems 1 and 2, respectively, for MPVC criterion.
The last convergence follows from the assumption $a_n \to \infty$ and the fact that > 0 then analogously as in the proof of Theorem 1 we obtain using Lemma 3 P min for some $\varepsilon > 0$. Moreover, Lemma 6 implies for $\Lambda_{n,t_m^*,\mathrm{full}} > 0$ and $m - |t_m^*| > 1$ and for $\Lambda_{n,i,\mathrm{full}} \to \infty$ and $m - |i| > 1$. An analogous inequality holds for $i$ such that $m - |i| = 1$. This together with (19) and the assumption $a_n/n \to 0$ imply that when $m - |t_m^*| > 1$
$$P\Big(\max_{i:\, t_m^* \not\subset i \subset \{1,\ldots,m\}} p(\Lambda_{n,i,\mathrm{full}} \mid m - |i|)\, e^{-a_n |i|} < p(\Lambda_{n,t_m^*,\mathrm{full}} \mid m - |t_m^*|)\, e^{-a_n |t_m^*|}\Big) \to 1.$$
Proof We first consider the case when the model $i$ is misspecified: $t \not\subset i$ and $t \neq \mathrm{full}$.
Observe that Lemma 3(i) implies that $\max_{i \subset \{1,\ldots,m_n\}} |\Lambda_{n,i,\mathrm{full}}/(2n) - (\lambda_t - \lambda_i)| \xrightarrow{P} 0$ and thus it follows in view of Lemma 5 that $\min_{i \subset \{1,\ldots,m_n\}:\, t \not\subset i} \Lambda_{n,i,\mathrm{full}} \to \infty$ in probability. Assume first that $m_n - |i| > 1$. Whence, using Lemma 6(ii), it is enough to show that with probability tending to 1 we have
$$e^{-a_n |t|}\, C(\Lambda_{n,t,\mathrm{full}}, m_n - |t|) > e^{-a_n |i|}\, C(\Lambda_{n,i,\mathrm{full}}, m_n - |i|) \left(1 + \frac{m_n - |i| - 2}{\Lambda_{n,i,\mathrm{full}} - m_n + |i| + 2}\right)$$
for all $i \subset \{1, \ldots, m_n\}$ with $t \not\subset i$ and such that $m_n - |i| > 1$. This easily follows as in the case of mPVC consistency from P min for some $\varepsilon > 0$. The proof for $m_n - |i| = 1$ is analogous. The case $t = \mathrm{full}$ follows easily from $a_n/n \to 0$ after noting that in this case the right hand side of (20) is $O(e^{-\varepsilon n})$ for some $\varepsilon > 0$, whereas the left hand side equals $e^{-a_n |t|}$. This yields the proof of (a). We now consider the case when the model $i$ contains the minimal true model: $t \subsetneq i$. As in the proof of Theorem 5 we obtain
$$P\Big(\max_{i:\, t \subsetneq i \subset \{1,\ldots,m_n\}} p(\Lambda_{n,i,\mathrm{full}} \mid m_n - |i|)\, e^{-a_n |i|} < p(\Lambda_{n,t,\mathrm{full}} \mid m_n - |t|)\, e^{-a_n |t|}\Big) \ge P\big(p(\Lambda_{n,t,\mathrm{full}} \mid m_n - |t|) > e^{-a_n}\big).$$
Observe that in view of Lemma 4, $\Lambda_{n,t,\mathrm{full}} = O_P(m_n)$. Thus it is enough to show that $e^{a_n} P(Z_{\chi^2_{m_n-|t|}} > C m_n) \to \infty$ for any fixed $C > 0$, where $Z_{\chi^2_{m_n-|t|}} \sim \chi^2_{m_n-|t|}$. This follows easily from Lemma 6 and the assumed condition on $m_n$, as the above expression is bounded from below by
$$\exp\left(a_n - \frac{C m_n}{2} + \left(\frac{\tilde m_n}{2} - 1\right) \log \frac{C m_n}{2} - \log \left(\frac{\tilde m_n}{2}\right)^{\tilde m_n/2}\right) \to \infty,$$
where $\tilde m_n = m_n - |t|$. This yields the proof of (b).
Remark 4 Note that the assumption of Theorem 6 (b) implies that $a_n \to \infty$, whereas Theorem 2 asserts that consistency of the mPVC criterion holds also for constant $a_n$ provided $m_n = o(\log n)$.
From the proof of Theorem 6 (cf. (21)) it is easy to see that the analogue of Theorem 3 holds for MPVC criterion under the same conditions.

Greedy mPVC and MPVC criteria and their consistency
Optimization of a criterion function over all subsets of $\{1, 2, \ldots, m_n\}$ involves a considerable computational cost for large $m_n$. This is a drawback of all criterion-based procedures. We discuss here a two-step modification of the p-value criteria introduced in Pokarowski and Mielniczuk (2011) which involves only $O(m_n)$ calculations of the criterion instead of $O(2^{m_n})$. The approach is motivated by Zheng and Loh (1997), who used such an approach for linear models.
Greedy mPVC In the first step for every $j \in \{1, \ldots, m_n\}$ we consider testing hypotheses Observe that in both greedy methods all $m$ likelihood ratio test statistics (LRT) have under the null hypothesis the same asymptotic distribution, namely $\chi^2_{m-1}$ in the case of Greedy mPVC and $\chi^2_1$ in the case of Greedy MPVC. Hence in both cases ordering the p-values coincides with ordering the values of the test statistics monotonically. Moreover, it follows that the sets $\mathcal{M}$ yielded by both procedures are equal.
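The two-step idea can be sketched in Python as follows. Since the displayed first-step hypotheses are not available in this version of the text, the sketch relies on a reading reconstructed from the surrounding discussion (the $\chi^2_{m-1}$ reference distribution and the proof of Corollary 1): for each $j$ the statistic $\Lambda_{n,0,\bar j}$ of the model $\bar j = \{1,\ldots,m\} \setminus \{j\}$ is computed, indices are ordered by increasing value of that statistic, and mPVC is then applied to the resulting nested list. This reading, the callback `lrt0` and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, Tuple

Subset = Tuple[int, ...]


def chi2_sf(x: float, k: int) -> float:
    """Upper tail P(Z > x) of the chi-squared distribution with k >= 1 d.f."""
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < k / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def greedy_mpvc(lrt0: Callable[[Subset], float], m: int, n: int,
                a_n: float = 0.0) -> Subset:
    """Two-step mPVC sketch; lrt0(i) is assumed to return Lambda_{n,0,i}."""
    full = tuple(range(1, m + 1))
    # Step 1 (assumed form): statistic of the model with index j left out.
    # A small value means that dropping j hurts the fit, so j is ranked early.
    drop_stat = {j: lrt0(tuple(k for k in full if k != j)) for j in full}
    order = sorted(full, key=lambda j: drop_stat[j])
    # Step 2: minimal scaled p-value over the nested list only (m evaluations).
    best, best_score = (), n ** -0.5
    for k in range(1, m + 1):
        subset = tuple(sorted(order[:k]))
        score = chi2_sf(lrt0(subset), k) * math.exp(a_n * k)
        if score < best_score:   # strict inequality keeps the smaller model on ties
            best, best_score = subset, score
    return best
```

In total the sketch calls `lrt0` $2m$ times rather than $2^m$ times. With a toy additive statistic, e.g. gains `{1: 50.0, 2: 0.3, 3: 30.0, 4: 0.1}` summed over the chosen indices and `n = 300`, it returns `(1, 3)`.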
It is also of interest to note that selection methods based on truncation, $\hat i = \{ j : |\hat\theta^{\mathrm{full}}_j| > C_n \}$, can also be viewed as two-step procedures similar to those defined above. Namely, in the first step the estimates of the parameters in the full model are ordered: $|\hat\theta^{\mathrm{full}}_{R_1}| \ge |\hat\theta^{\mathrm{full}}_{R_2}| \ge \ldots \ge |\hat\theta^{\mathrm{full}}_{R_m}|$. It is easy to see that $\hat i = \{R_1, R_2, \ldots, R_{k_0}\}$. Observe that the criterion function in the second step is a penalized score statistic $\sum_{i=1}^{k} (\hat\theta^{\mathrm{full}}_{R_i})^2$ for the model indexed by $\{R_1, R_2, \ldots, R_k\}$. We refer to Wojtyś (2011) for a discussion of such rules.
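The equivalence between truncation and a two-step procedure can be checked with a small sketch. The exact penalty used in the second step is not recoverable from the text above, so the sketch assumes the penalty $k C_n^2$, i.e. the second step maximizes $\sum_{i=1}^{k} (\hat\theta^{\mathrm{full}}_{R_i})^2 - k C_n^2$ over $k$; under this assumption both routes select the same set. Function names and the dictionary input are hypothetical.

```python
from typing import Dict, List


def truncation_rule(theta_full: Dict[int, float], c_n: float) -> List[int]:
    """Hard thresholding: keep indices whose full-model estimate exceeds c_n."""
    return sorted(j for j, value in theta_full.items() if abs(value) > c_n)


def two_step_rule(theta_full: Dict[int, float], c_n: float) -> List[int]:
    """Order |theta_j| decreasingly, then maximize the penalized score statistic.

    The penalty k * c_n**2 is an assumption made for this illustration.
    """
    ranked = sorted(theta_full, key=lambda j: -abs(theta_full[j]))  # R_1, R_2, ...
    best_k, best_value, running = 0, 0.0, 0.0
    for k, j in enumerate(ranked, start=1):
        running += theta_full[j] ** 2 - c_n ** 2
        if running > best_value:
            best_k, best_value = k, running
    return sorted(ranked[:best_k])
```

Because the increments $(\hat\theta^{\mathrm{full}}_{R_k})^2 - C_n^2$ are nonincreasing in $k$, the penalized sum is maximized at the last index whose estimate exceeds $C_n$ in absolute value, which is why the two functions agree.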
Corollary 1 Under conditions of Theorem 2 (b) and Theorem 6 (b), respectively, greedy mPVC and MPVC methods are consistent.
Proof For every $j \notin t$ we have $\Lambda_{n,0,t} \le \Lambda_{n,0,\bar j}$. Thus, as for $k \in t$ we have $\bar k \not\supset t$, (13) implies that for every such $k$ Thus with probability tending to 1 after the initial ordering all indices belonging to the set $t$ will precede the remaining indices. Hence the consistency of both greedy methods follows from the consistency of their respective full search method applied to $\mathcal{M}$.
Remark 5 It can be shown using Lemma 3(i) that under its assumptions the probability of incorrect ordering in the first step of the procedure is bounded by where $\lambda = \min_{i \notin t,\, j \in t} (\lambda_{\bar i} - \lambda_{\bar j})$ and $C$ is an absolute constant. Note that the bound becomes larger when the selection problem becomes more difficult, i.e. when $\lambda$ decreases.

Simulation study
We conducted numerical experiments to check how the considered selectors behave in practice for moderate sample sizes. The considered sample size was $n = 300$ with $m_n = 5$. Random samples were generated either from the uniform density or from a density belonging to one of the following four families: Recall that e.g. $t = \{1, 5\}$ means that only the first and the fifth coefficients are allowed to be nonzero. In all cases we considered the Legendre polynomial basis. The densities are plotted in Fig. 1. The values of the parameters were chosen to obtain typical shapes in the considered family. The number of Legendre polynomials appearing in the definitions of densities (i)-(iii) corresponds to shape complexity. Observe e.g. that a typical density in example (ii) has two modes in $(0,1)$, in contrast to four modes in case (iii). Example (iv) corresponds to the misspecification case, i.e. the situation when the underlying density does not belong to any model in the considered family.
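For readers who wish to experiment with a Legendre basis on $[0,1]$, a minimal sketch is given below. It assumes the usual orthonormal shifted Legendre polynomials $b_j(x) = \sqrt{2j+1}\,P_j(2x-1)$; the exact normalization used in the simulations is not stated in the text, and the function name `legendre_basis` is hypothetical.

```python
import math
from typing import List


def legendre_basis(x: float, m: int) -> List[float]:
    """Values b_1(x), ..., b_m(x) of orthonormal shifted Legendre polynomials."""
    u = 2.0 * x - 1.0               # map [0, 1] onto [-1, 1]
    p_prev, p_curr = 1.0, u         # P_0(u), P_1(u)
    values = []
    for j in range(1, m + 1):
        values.append(math.sqrt(2 * j + 1) * p_curr)
        # Bonnet recurrence: (j + 1) P_{j+1}(u) = (2j + 1) u P_j(u) - j P_{j-1}(u)
        p_prev, p_curr = p_curr, ((2 * j + 1) * u * p_curr - j * p_prev) / (j + 1)
    return values
```

With this normalization $\int_0^1 b_j(x) b_k(x)\,dx$ equals 1 for $j = k$ and 0 otherwise, and each $b_j$ is orthogonal to the constant $b_0 \equiv 1$, as required of the system used to build the exponential models.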
Six selection criteria were taken into account: BIC (Bayesian Information Criterion, Schwarz (1978)), mPVC, MPVC and their greedy counterparts. BIC is defined as Several values of the scaling constants $a_n$ for the p-value based criteria were considered. We report the results for the constants which performed the best on average: $a_n = 0$ for mPVC and $a_n = (\log n)/2$ for MPVC, i.e. an unscaled $p(\Lambda_{n,0,i} \mid |i|)$ for mPVC and $p(\Lambda_{n,i,\mathrm{full}} \mid m - |i|)$ scaled down by the factor $n^{-|i|/2}$ for MPVC.
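For comparison with the p-value rules, a BIC selector can be sketched in the same style. The displayed definition of BIC is not available in this version of the text, so the sketch assumes the standard Schwarz form $-2\log L_i + |i| \log n$; since the uniform density has log-likelihood zero, minimizing it is the same as maximizing $\Lambda_{n,0,i} - |i| \log n$. The function name and the dictionary input are hypothetical.

```python
import math
from typing import Dict, Tuple

Subset = Tuple[int, ...]


def select_bic(lrt: Dict[Subset, float], n: int) -> Subset:
    """Maximize Lambda_{n,0,i} - |i| log n (assumed standard Schwarz penalty)."""
    scores = {(): 0.0}   # null model: zero statistic, zero parameters
    for subset, stat in lrt.items():
        scores[subset] = stat - len(subset) * math.log(n)
    # Assumed tie rule: prefer the model with fewer parameters.
    return max(scores, key=lambda s: (scores[s], -len(s)))
```

For example, `select_bic({(1,): 40.0, (2,): 1.0, (1, 2): 41.0}, n=300)` returns `(1,)`, while `select_bic({(1,): 3.0}, n=300)` returns the empty tuple because the gain of 3.0 does not cover the penalty $\log 300 \approx 5.7$.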
For the three main selectors, ML estimators had to be calculated for each of the $2^m$ models or, after initial preordering, for $m$ models in the case of their greedy counterparts. As ML estimators are calculated using an iterative Newton-Raphson procedure, some cases of non-convergence occur. In such a case the corresponding model was excluded from the list and only the models for which ML estimators were obtained were considered for optimization. The number of cases in which lack of convergence occurs increases with the complexity of the model; however, it does not exceed 3 % of the number of models considered.
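A Newton-Raphson fit of one exponential model can be sketched as follows. This is only an illustration under stated assumptions: the normalizing integral and the moments are approximated on a midpoint grid, no step-size control is used, and the function names, grid size and stopping rule are hypothetical rather than taken from the simulations. The gradient of the average log-likelihood is $\bar b - E_\theta b$ and its negative Hessian is the covariance matrix of $b$ under $f_\theta$.

```python
import math
from typing import Callable, List, Sequence, Tuple

BasisFn = Callable[[float], float]


def solve_linear(a: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_exponential_family(sample: Sequence[float], basis: Sequence[BasisFn],
                           grid_size: int = 2000, max_iter: int = 50,
                           tol: float = 1e-10) -> Tuple[List[float], float]:
    """Return (theta_hat, Lambda_{n,0,i}) for the model spanned by `basis`."""
    n, p = len(sample), len(basis)
    b_bar = [sum(b(x) for x in sample) / n for b in basis]      # sample moments
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]
    b_grid = [[b(x) for b in basis] for x in grid]

    def weights(theta: List[float]) -> List[float]:
        return [math.exp(sum(t * v for t, v in zip(theta, row))) for row in b_grid]

    theta = [0.0] * p
    for _ in range(max_iter):
        w = weights(theta)
        total = sum(w)
        mean = [sum(wi * row[j] for wi, row in zip(w, b_grid)) / total
                for j in range(p)]
        cov = [[sum(wi * row[j] * row[k] for wi, row in zip(w, b_grid)) / total
                - mean[j] * mean[k] for k in range(p)] for j in range(p)]
        step = solve_linear(cov, [b_bar[j] - mean[j] for j in range(p)])
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    else:
        raise RuntimeError("Newton-Raphson did not converge")
    psi = math.log(sum(weights(theta)) / grid_size)             # log normalizer
    lrt = 2.0 * n * (sum(t * b for t, b in zip(theta, b_bar)) - psi)
    return theta, lrt
```

A caller that mirrors the procedure described above would catch the `RuntimeError` (and possible overflow or division errors from an ill-conditioned step) and drop the corresponding model from the list of candidates.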
Two main measures of performance were taken into account: the fraction of correct model specifications averaged over $10^4$ repetitions of the experiment, and the averaged empirical Integrated Squared Error (ISE), defined as $\mathrm{ISE} = \int_0^1 (\hat{f}(x) - f(x))^2\,dx$, where $f$ is the theoretical density of the random sample and $\hat{f}$ is its post-model-selection estimator. The mean of ISE is denoted as MISE in the plots. It turned out (see Figs. 2, 3, 4, 5) that these two measures are approximately concordant, in the sense that a large probability of correct specification corresponds to a small ISE, and the rankings of selectors with respect to both measures coincide in general. The only exceptions are the densities close to $\mathcal{M}_0$ (e.g. members of $\mathcal{M}_1$ with small $\theta_1$), for which a method may err by choosing the uniform density but still have a small MISE. For the misspecification case the accuracy percentage was replaced by the averaged Kullback-Leibler distance $D(f \| \hat{f})$ from $f$ to the post-model-selection estimator. Both integrals, ISE and the Kullback-Leibler distance, were calculated numerically using Gaussian quadrature for Legendre polynomials. In order not to obscure the plots, in each example we plotted the best overall performing selector and, depending on whether it was a main selector or one of the greedy modifications, supplemented it with the two remaining selectors from the same group.
We start with the comparison of methods for data from the uniform density, presented in Table 1. In this case the best performance was achieved by greedy MPVC, which selected the correct model in more than 99 % of Monte Carlo repetitions and yielded the smallest MISE. MPVC had comparably good properties: it attained nearly 99 % accuracy of model selection and had a MISE three times as large as that of the winner. BIC achieved more than 90 % correct model selections; however, its estimated MISE was 10 times larger than that of greedy MPVC. The methods mPVC and greedy mPVC behaved considerably worse.
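The numerical evaluation of the two integrals can be sketched in Python as follows. The example assumes that both densities are available as vectorised functions on $[0,1]$ and uses a Gauss-Legendre rule transported from $[-1,1]$ to $[0,1]$. The function names and the number of nodes are illustrative choices, not those of the experiments.

```python
import numpy as np
from numpy.polynomial import legendre


def unit_interval_rule(n_nodes=64):
    """Gauss-Legendre nodes and weights transported from [-1, 1] to [0, 1]."""
    nodes, weights = legendre.leggauss(n_nodes)
    return 0.5 * (nodes + 1.0), 0.5 * weights


def ise(f, f_hat, n_nodes=64):
    """Integrated squared error: integral over [0, 1] of (f_hat - f)^2."""
    x, w = unit_interval_rule(n_nodes)
    return float(np.sum(w * (f_hat(x) - f(x)) ** 2))


def kl_distance(f, f_hat, n_nodes=64):
    """Kullback-Leibler distance D(f || f_hat): integral of f * log(f / f_hat)."""
    x, w = unit_interval_rule(n_nodes)
    fx = f(x)
    return float(np.sum(w * fx * np.log(fx / f_hat(x))))


# Hypothetical check: true density uniform, estimator f_hat(x) = 2x.
uniform = lambda x: np.ones_like(x)
linear = lambda x: 2.0 * x
ise_value = ise(uniform, linear)
```

Averaging `ise` over the Monte Carlo repetitions gives an estimate of MISE. The quadrature is exact for polynomial integrands of moderate degree and accurate for smooth densities, but it would lose accuracy if either density vanished or had a singularity inside $[0,1]$.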
It may come as a surprise that a greedy method may actually perform best even when optimization over all subsets is taken into account. This happens for the simplest densities considered (case (i), cf. Fig. 2), where greedy MPVC performs most favorably. For the other case with only one nonzero coefficient (example (ii), cf. Fig. 3) its non-greedy counterpart is the best. A similar situation occurs in the misspecification case (example (iv)). However, for the most complex shapes of densities (example (iii), cf. Fig. 4) the minimal p-value criterion mPVC was a clear winner. This case is clearly the most difficult among those considered, as the accuracy falls below 0.4, and it is here that improvement is most desirable. Note that in this case the gain of the winner over its BIC competitor (greedy or non-greedy) was more pronounced in terms of identification of the correct model than in terms of the accuracy of the estimator measured by MISE. This happened especially for the larger values of the considered parameters. Thus it seems that the main advantage of the introduced methods lies in an increased probability of correct model specification rather than in increased accuracy of the density estimator measured by MISE.
Figure 6 shows the probability of correct ordering in the first step of the greedy methods for model (i) as a function of $m$. In this case, in order to reduce the computational cost of the simulations, the sample size was taken as $n = 100$. The plotted curves indicate that the probability of correct ordering depends heavily on the size of the list of models; however, once the correct order is established, errors in the second step are relatively unlikely with either the p-value based methods or BIC. This underlines the importance of choosing a small subfamily of models before optimizing the criterion. We also recorded the cases in which a supermodel, that is a subset $i \supset t$ instead of $t$, is selected.
It turns out that greedy methods are in general more conservative than their full-search counterparts, i.e. much more prone to choosing a supermodel. Moreover, mPVC is much more conservative than MPVC; for the former, the fraction of supermodels chosen is in some cases around 30 %, whereas for MPVC it is always below 3 %. This is possibly due to the penalty $a_n$, which equals 0 for mPVC.
Finally, note that in all the cases considered one of the proposed methods performed better than its BIC-like competitor. A natural question, which is a topic of current research, is whether some combined version of the p-value based criteria would exhibit better overall performance.