1 Introduction

Let \({\varvec{X}}\in {\mathbb {X}}\) be a random variable. Let \({\mathcal {K}}\left( {\mathbb {X}}\right)\) be a class of probability density functions (PDFs), defined on the set \({\mathbb {X}}\), which we shall refer to as components. We say that \({\varvec{X}}\) arises from a g component mixture model of class \({\mathcal {K}}\) if the PDF \(f_{0}\) of \({\varvec{X}}\) belongs to the convex class

$$\begin{aligned} {\mathcal {M}}_{g}\left( {\mathbb {X}}\right) =\left\{ f\left( {\varvec{x}}\right) :f\left( {\varvec{x}}\right) =\sum _{z=1}^{g}\pi _{z}f_{z}\left( {\varvec{x}}\right) ; \; \pi _{z}\ge 0,\sum _{z=1}^{g}\pi _{z}=1,f_{z}\in {\mathcal {K}}\left( {\mathbb {X}}\right) ,z\in \left[ g\right] \right\} \text {,} \end{aligned}$$

where \(g\in {\mathbb {N}}\) and \(\left[ g\right] =\left\{ 1,\dots ,g\right\}\).
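The convex class above has a simple generative reading: a draw from a g component mixture can be produced by first sampling a latent label \(z\) with probabilities \(\left( \pi _{z}\right) _{z=1}^{g}\), and then sampling from the component \(f_{z}\). A minimal Python sketch of this two-stage sampler, with illustrative weights and normal components that are our own choices rather than anything from the text:

```python
import random
import statistics

def sample_mixture(n, weights, components, rng):
    """Draw n IID variates from sum_z pi_z f_z: first draw a latent
    component label z with probabilities pi_z, then sample from f_z."""
    draws = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        draws.append(components[z](rng))
    return draws

# Illustrative two-component (g = 2) univariate normal mixture.
weights = [0.3, 0.7]
components = [lambda r: r.gauss(-2.0, 1.0), lambda r: r.gauss(3.0, 1.0)]
x = sample_mixture(10_000, weights, components, random.Random(0))
emp_mean = statistics.fmean(x)  # should be near 0.3*(-2) + 0.7*3 = 1.5
```

With these illustrative choices, the mixture mean is \(0.3\times \left( -2\right) +0.7\times 3=1.5\), which the empirical mean of a large sample should approximate.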

Suppose that we observe an independent and identically distributed (IID) sample sequence of data \({\mathbf {X}}_{n}=\left( {\varvec{X}}_{i}\right) _{i=1}^{n}\), where each \({\varvec{X}}_{i}\) has the same data generating process (DGP) as \({\varvec{X}}\), which is unknown. Under the assumption that \(f_{0}\in {\mathcal {M}}_{g_{0}}\left( {\mathbb {X}}\right)\) for some \(g_{0}\in {\mathbb {N}}\), we wish to use the data \({\mathbf {X}}_{n}\) in order to determine the possible values of \(g_{0}\). This problem is generally referred to as order selection in the mixture modeling literature, and reviews regarding the problem can be found in McLachlan and Peel (2000, Ch. 6) and McLachlan and Rathnayake (2014), for example.

Notice that the sequence \(\left( {\mathcal {M}}_{g}\right) _{g=1}^{\infty }\) is nested, in the sense that \({\mathcal {M}}_{g}\subset {\mathcal {M}}_{g+1}\), for each g, and that the existence of some \(g_{0}\in {\mathbb {N}}\) with \(f_{0}\in {\mathcal {M}}_{g_{0}}\) is equivalent to \(f_{0}\in {\mathcal {M}}=\bigcup _{g=1}^{\infty }{\mathcal {M}}_{g}\). We shall write the null hypothesis that \(f_{0}\in {\mathcal {M}}_{g}\) (or equivalently, \(g_{0}\le g\)) as \(\text {H}_{g}\). We assume that we have available a \(p \text {-value}\) \(P_{g}\left( {\mathbf {X}}_{n}\right)\) for each hypothesis, and that \(P_{g}\left( {\mathbf {X}}_{n}\right)\) correctly controls the size of the corresponding hypothesis test, in the sense that

$$\begin{aligned} \sup _{f\in {\mathcal {M}}_{g}}\text {Pr}_{f}\left( P_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \le \alpha \text {,} \end{aligned}$$
(1)

for any \(\alpha \in \left( 0,1\right)\). Here, \(\text {Pr}_{f}\) is the probability measure corresponding to the PDF f. In Wasserman et al. (2020), the following simple sequential testing procedure (STP) is proposed for determining the value of \(g_{0}\) (for general nested models, not necessarily mixtures):

  1. Choose some significance level \(\alpha \in \left( 0,1\right)\) and initialize \(\hat{g}=0\);

  2. Set \(\hat{g}=\hat{g}+1\);

  3. Test the null hypothesis \(\text {H}_{\hat{g}}\) using the \(p\text {-value}\) \(P_{\hat{g}}\left( {\mathbf {X}}_{n}\right)\);

     (a) If \(P_{\hat{g}}\left( {\mathbf {X}}_{n}\right) \le \alpha\), then go to Step 2.

     (b) If \(P_{\hat{g}}\left( {\mathbf {X}}_{n}\right) >\alpha\), then go to Step 4.

  4. Output the estimated number of components \(\hat{g}_{n}=\hat{g}\).
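The steps above amount to a single loop over \(\hat{g}\). A minimal Python sketch, where `p_value` stands for any routine returning \(P_{g}\left( {\mathbf {X}}_{n}\right)\) for a given g (the dictionary of p-values below is a made-up illustration, not the output of a real test):

```python
def sequential_test(p_value, alpha):
    """Sequential testing procedure (STP): increment g-hat while the
    local test rejects H_g (p-value <= alpha); stop at the first
    non-rejection and report that index."""
    g_hat = 0
    while True:
        g_hat += 1
        if p_value(g_hat) > alpha:  # H_{g_hat} not rejected: stop
            return g_hat

# Illustrative p-values: H_1 and H_2 rejected at alpha = 0.05, H_3 not.
mock_p = {1: 0.001, 2: 0.012, 3: 0.430}
g_hat = sequential_test(lambda g: mock_p.get(g, 1.0), alpha=0.05)
```

Here the procedure stops at \(\hat{g}_{n}=3\), the first hypothesis whose p-value exceeds \(\alpha\).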

It was argued informally in Wasserman et al. (2020) that, although the procedure above involves a sequence of multiple tests, each with local size \(\alpha\), it still correctly controls the Type I error in the sense that

$$\begin{aligned} \text {Pr}_{f_{0}}\left( f_{0}\in {\mathcal {M}}_{\hat{g}_{n}-1}\right) \le \alpha \end{aligned}$$
(2)

for any \(f_{0}\in {\mathcal {M}}\). Here, we note that the complement of the event \(\left\{ f_{0}\in {\mathcal {M}}_{\hat{g}_{n}-1}\right\}\) is \(\left\{ f_{0}\in {\mathcal {M}}\backslash {\mathcal {M}}_{\hat{g}_{n}-1}\right\}\) or equivalently \(\left\{ g_{0}\ge \hat{g}_{n}\right\}\). Thus, from (2), we can make the confidence statement that

$$\begin{aligned} \text {Pr}_{f_{0}}\left( g_{0}\ge \hat{g}_{n}\right) \ge 1-\alpha \text {,} \end{aligned}$$
(3)

for any \(f_{0}\in {\mathcal {M}}\).

In the present work, we shall provide a formal proof of result (2) using the closed testing principle of Marcus et al. (1976) (see also Dickhaus, 2014, Sec. 3.3). Using this result and the universal inference framework of Wasserman et al. (2020), we construct a sequence of tests for \(\left( \text {H}_{g}\right) _{g=1}^{\infty }\) with \(p \text {-values}\) satisfying (1) and prove that each of the tests is consistent under some regularity conditions. We then demonstrate the performance of our testing procedure for the problem of order selection for finite mixtures of normal distributions, and verify the empirical manifestation of the confidence result (3). Extensions of the STP are also considered, whereupon we construct a method that consistently estimates the order \(g_{0}\), and consider the application of the STP to asymptotically valid tests.

We note that hypothesis testing for order selection in mixture models is a well-studied area of research. Difficulties in applying testing procedures to the order selection problem arise due to identifiability and boundary issues of the null hypothesis parameter spaces (see, e.g., Quinn et al., 1987, and references therein regarding parametric mixture models, and Andrews, 2001, more generally). Examples of testing methods proposed to overcome the problem include the parametric bootstrapping techniques of McLachlan (1987), Feng and McCulloch (1996), and Polymenis and Titterington (1998), whereupon bootstrapped distributions of test statistics are used to approximate finite sample distributions, in the absence of asymptotic results. Another approach is the penalization techniques of Chen (1998), Li and Chen (2010), and Chen et al. (2012), where asymptotically well-behaved penalized likelihood ratio statistics are proposed, with limiting distributions that are computable or simulatable. It is noteworthy that the bootstrap approaches provide only an approximate finite sample distribution of test statistics and thus the tests are not guaranteed to have the correct size. The penalization approach, on the other hand, provides asymptotic tests of the correct size, although the construction of the penalization of the test statistic must be specialized to every individual testing problem and is only suitable for parametric families of densities \({\mathcal {K}}\) that are characterized by a low-dimensional parameter.

In fact, the sequential procedure described above was also considered for order selection in the mixture model context by Windham and Cutler (1992) and Polymenis and Titterington (1998), although the properties of the approach were not formally established. The possibility of constructing intervals of form (3) via bounding of discrete functionals of the underlying probability measure is discussed in Donoho (1988), although no implementation is suggested. Citing observations made by Donoho (1988) and Cutler and Windham (1994), it is suggested in McLachlan and Peel (2000, Sec. 6.1) that intervals of form (3) are sensible in practice, because reasonable functionals that characterize properties of \(f_{0}\), such as the number of components \(g_{0}\), can be lower bounded with high probability from data, but often cannot be upper bounded.

As previously mentioned, we plan to prove that (2) holds by demonstrating that the sequential test is a closed testing procedure. However, we note that the procedure may also be considered under the sequential rejection principle of Goeman and Solari (2010), and if \({\mathcal {M}}=\bigcup _{g=1}^{G}{\mathcal {M}}_{g}\) for some fixed \(G\in {\mathbb {N}}\), then we may also view the procedure as a fixed sequence procedure, as considered by Maurer et al. (1995). Another perspective regarding the sequential test is via the general procedures of Bauer and Kieser (1996), who consider the construction of confidence intervals using sequences of tests for nested and ordered sets of hypotheses. We also remark that the use of multiple testing procedures for model selection is well studied in the literature, as exemplified by the works of Finner and Giani (1996) and Hansen et al. (2011), who both consider the application of hypothesis testing schemes to generate confidence sets over model spaces.

For completeness, we note that apart from hypothesis testing, numerous solutions to the order selection problem for finite mixture models have been suggested. These related works include the use of information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), and variants of such techniques (Biernacki et al., 2000; Keribin, 2000; Leroux, 1992), and parameter regularization, such as via the Lasso and elastic net, and penalization approaches (Chen & Khalili, 2009; Xu & Chen, 2015; Yin et al., 2019), among other techniques.

We note that the aforementioned order selection techniques are all, in a sense, point estimation procedures that each serve the purpose of consistently estimating the number of components of the DGP mixture model, in the sense that the estimate is close to the true number of components, for sufficiently large n. Our approach does not output a consistent estimator, but instead produces a fixed-probability confidence set \(\left\{ g_0\ge \hat{g}_{n}\right\}\), and should thus be viewed as an interval estimator. Although the lower bound of the interval, \(\hat{g}_n\), can be an accurate estimator of the true number of components, it should not be considered a competitor to proper point estimators; instead, it should be viewed as complementary to point estimation approaches. We finally note that outside of the multiple testing framework, the problem of model selection with confidence has also been addressed in the articles of Ferrari and Yang (2015) and Zheng et al. (2019).

The remainder of the manuscript proceeds as follows. In Sect. 2, we recall the closed testing principle and use it to prove the inequality (2). In Sect. 3, we use the universal inference framework of Wasserman et al. (2020) to construct a class of likelihood ratio-based tests for the hypotheses \(\left( \text {H}_{g}\right) _{g=1}^{\infty }\). In the context of normal mixture models, numerical simulations and real data examples are used to assess the performance of the sequential procedure using the constructed tests in Sect. 4. Extensions to the STP are discussed in Sect. 5. Finally, conclusions are provided in Sect. 6 and technical proofs are provided in the Appendix.

2 Confidence via the closed testing principle

Let \({\mathbb {H}}=\left\{ \text {H}_{g}:g\in {\mathbb {G}}\right\}\) be a set of hypotheses that are indexed by some (possibly infinite) set \({\mathbb {G}}\), where each hypothesis \(\text {H}_{g}\) corresponds to the statement \(\left\{ {\varvec{\theta }}\in {\mathbb {T}}_{g}\right\}\) regarding the parameter of interest \({\varvec{\theta }}\in {\mathbb {T}}\), where \({\mathbb {T}}_{g}\subset {\mathbb {T}}\). We say that \({\mathbb {H}}\) is a \(\cap \text {-closed}\) system if for each \({\mathbb {I}}\subseteq {\mathbb {G}}\), either \(\bigcap _{g\in {\mathbb {I}}}{\mathbb {T}}_{g}=\emptyset\) or \(\bigcap _{g\in {\mathbb {I}}}{\mathbb {T}}_{g}\in \left\{ {\mathbb {T}}_{g}:g\in {\mathbb {G}}\right\}\). That is, for every set \({\mathbb {I}}\) of indices that yields a non-empty statement \(\left\{ {\varvec{\theta }}\in \bigcap _{g\in {\mathbb {I}}}{\mathbb {T}}_{g}\right\}\), there exists a hypothesis \(\text {H}_{j}\in {\mathbb {H}}\) such that \(\bigcap _{g\in {\mathbb {I}}}{\mathbb {T}}_{g}={\mathbb {T}}_{j}\).

Recalling the notation from Sect. 1, we say that \(\text {H}_{g}\) is rejected if \(R_{g}\left( {\mathbf {X}}_{n}\right) =\mathbf {1}\left\{ P_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right\}\) is equal to 1, and we say that \(\text {H}_{g}\) is not rejected, otherwise. Here, \(\mathbf {1}\left\{ \cdot \right\}\) is the indicator function. We further say that the familywise error rate (FWER) of a set of rejections \(\left\{ R_{g}\left( {\mathbf {X}}_{n}\right) \right\} _{g\in {\mathbb {G}}}\) is strongly controlled at level \(\alpha \in \left( 0,1\right)\) if for all \({\varvec{\theta }}\in {\mathbb {T}}\),

$$\begin{aligned} \text {Pr}_{{\varvec{\theta }}}\left( \bigcup _{g\in {\mathbb {G}}_{0}\left( {\varvec{\theta }}\right) }\left\{ R_{g}\left( {\mathbf {X}}_{n}\right) =1\right\} \right) \le \alpha \text {,} \end{aligned}$$

where \(\text {Pr}_{{\varvec{\theta }}}\) denotes the probability measure corresponding to parameter value \({\varvec{\theta }}\), and \({\mathbb {G}}_{0}\left( {\varvec{\theta }}\right) \subset {\mathbb {G}}\) is the set of indices with corresponding hypotheses that are true under \(\text {Pr}_{{\varvec{\theta }}}\).

We note that the statement \(\left\{ \bigcup _{g\in {\mathbb {G}}_{0}\left( {\varvec{\theta }}\right) }\left\{ R_{g}\left( {\mathbf {X}}_{n}\right) =1\right\} \right\}\) reads as: at least one true hypothesis has been rejected. The complement of the statement is therefore that no true hypotheses have been rejected, and hence the strong control of the FWER implies that the true parameter value lies in the complement of the union of the rejected subsets with probability at least \(1-\alpha\). That is, for all \({\varvec{\theta }}\in {\mathbb {T}}\),

$$\begin{aligned} \text {Pr}_{{\varvec{\theta }}}\left( {\varvec{\theta }}\in \bigcap _{g\in {\mathbb {G}}_{1}\left( {\varvec{X}}_{n}\right) }{\mathbb {T}}_{g}^{\complement }\right) \ge 1-\alpha \text {,} \end{aligned}$$

where \(\left( \cdot \right) ^{\complement }\) is the set complement operation and \({\mathbb {G}}_{1}\left( {\varvec{X}}_{n}\right) =\left\{ g\in {\mathbb {G}}:R_{g}\left( {\varvec{X}}_{n}\right) =1\right\}\) is the set of rejected hypotheses.

Define the set of closed tests corresponding to \({\mathbb {H}}\) as the rejection rules: \(\left( \bar{R}_{g}\left( {\mathbf {X}}_{n}\right) \right) _{g\in {\mathbb {G}}}\), where for each \(g\in {\mathbb {G}}\),

$$\begin{aligned} \bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =\min _{\left\{ j:{\mathbb {T}}_{j}\subseteq {\mathbb {T}}_{g}\right\} }R_{j}\left( {\mathbf {X}}_{n}\right) \text {.} \end{aligned}$$
(4)

That is, \(\bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =1\) if and only if every hypothesis \(\text {H}_{j}\) with \({\mathbb {T}}_{j}\subseteq {\mathbb {T}}_{g}\) is rejected by its local test, and \(\bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =0\) otherwise. Then, we have the following result regarding the closed testing principle (cf. Dickhaus, 2014, Thm. 3.4).

Theorem 1

For an \(\cap \text {-closed}\) system of hypotheses \({\mathbb {H}}\) with corresponding \(\alpha\) level local tests \(\left( R_{g}\left( {\mathbf {X}}_{n}\right) \right) _{g\in {\mathbb {G}}}\), the closed testing procedure defined by \(\left( \bar{R}_{g}\left( {\mathbf {X}}_{n}\right) \right) _{g\in {\mathbb {G}}}\) strongly controls the FWER at level \(\alpha\) in the sense that

$$\begin{aligned} \mathrm {Pr}_{{\varvec{\theta }}}\left( \bigcup _{g\in {\mathbb {G}}_{0}\left( {\varvec{\theta }}\right) }\left\{ \bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =1\right\} \right) \le \alpha \text {,} \end{aligned}$$

for each \({\varvec{\theta }}\in {\mathbb {T}}\).

We now demonstrate that the sequential procedure constitutes a set of closed tests of the form (4) and thus permits the conclusion of Theorem 1, which in turn implies (2) and thus (3). That is, we show that the sequence of hypotheses \(\left( \text {H}_{g}\right) _{g=1}^{\infty }\) corresponds to a \(\cap\)-closed system, where each \(\text {H}_{g}\) is defined by \(f_{0}\in {\mathcal {M}}_{g}\), and that the STP corresponds to a sequence of tests of form (4).

Theorem 2

The hypotheses \(\left( \mathrm {H}_{g}\right) _{g=1}^{\infty }\) and the STP from Sect. 1 constitute a \(\cap \text {-closed}\) system and a closed testing procedure, respectively, when testing using \(p \text {-values}\) \(\left( P_{g}\left( {\mathbf {X}}_{n}\right) \right) _{g=1}^{\infty }\) satisfying (1). The sequential test therefore permits conclusions (2) and (3).

Proof

The proof of this result appears in the Appendix. \(\square\)

Thus, under the assumption that the data \({\mathbf {X}}_{n}\) arises from a DGP with density function \(f_{0}\), corresponding to a \(g_{0}\) component mixture model, the STP outputs a point estimator \(\hat{g}_{n}\), where the event \(\left\{ g_{0}\ge \hat{g}_{n}\right\}\) occurs with probability at least \(1-\alpha\).
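For the nested hypotheses considered here, \({\mathbb {T}}_{j}\subseteq {\mathbb {T}}_{g}\) if and only if \(j\le g\), so the closed test (4) reduces to \(\bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =\min _{j\le g}R_{j}\left( {\mathbf {X}}_{n}\right)\), and the STP output \(\hat{g}_{n}\) is the first index at which the closed test fails to reject. A minimal Python sketch of this equivalence (the local rejection pattern below is invented purely for illustration):

```python
def closed_rejections(local_rejections):
    """For nested hypotheses (T_j subset of T_g iff j <= g), the closed
    test rejects H_g iff every local test among H_1, ..., H_g rejects."""
    closed, all_so_far = [], 1
    for r in local_rejections:
        all_so_far = min(all_so_far, r)  # running minimum over j <= g
        closed.append(all_so_far)
    return closed

# Local tests reject H_1 and H_2, fail on H_3, then (spuriously) reject H_4.
local = [1, 1, 0, 1]
closed = closed_rejections(local)
g_hat = closed.index(0) + 1  # first non-rejection of the closed test
```

Note that the spurious local rejection of \(\text {H}_{4}\) is discarded by the closed procedure, which is exactly why the STP stops at the first non-rejection.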

3 Test of order via universal inference

Let \({\mathbf {X}}_{n}\) be split into two subsequences of lengths \(n_{1}\) and \(n_{2}\), where \({\mathbf {X}}_{n}^{1}=\left( {\varvec{X}}_{i}\right) _{i=1}^{n_{1}}\) and \({\mathbf {X}}_{n}^{2}=\left( {\varvec{X}}_{i}\right) _{i=n_{1}+1}^{n}\), and \(n_{1}+n_{2}=n\). Assume that \({\varvec{X}}\) has DGP characterized by the PDF \(f_{0}\) and for each \(g\in {\mathbb {N}}\), let \(\hat{f}_{g}^{1}\in \bar{{\mathcal {M}}}_{g}\) and \(\hat{f}_{g}^{2}\in \bar{{\mathcal {M}}}_{g}\) be estimators of \(f_{0}\) (not necessarily maximum likelihood estimators), based on \({\mathbf {X}}_{n}^{1}\) and \({\mathbf {X}}_{n}^{2}\), respectively, where \(\bar{{\mathcal {M}}}_{g}\subseteq {\mathcal {M}}\) is a class that characterizes an alternative to the null hypothesis that \(f_{0}\in {\mathcal {M}}_{g}\), with \({\mathcal {M}}_{g}\subset \bar{{\mathcal {M}}}_{g}\).

For notational convenience, for each \(k\in \left\{ 1,2\right\}\), we reindex the elements of \({\mathbf {X}}_n^k\) by inclusion of a superscript k, so that \({\mathbf {X}}_{n}^{k}=\left( {\varvec{X}}_{i}^{k}\right) _{i=1}^{n_{k}}\), and let

$$\begin{aligned} L_{f}\left( {\mathbf {X}}_{n}^{k}\right) =\prod _{i=1}^{n_{k}}f\left( {\varvec{X}}_{i}^{k}\right) \text {,} \end{aligned}$$

be the likelihood function corresponding to subsample \({\mathbf {X}}_{n}^{k}\), evaluated under PDF f. We wish to test the null hypothesis \(\text {H}_{g}\): \(f_{0}\in {\mathcal {M}}_{g}\) against the alternative \(\bar{\text {H}}_{g}\): \(f_{0}\in \bar{{\mathcal {M}}}_{g}\), using the Split test statistics

$$\begin{aligned} V_{g}^{k}\left( {\mathbf {X}}_{n}\right) =\frac{L_{\hat{f}_{g}^{3-k}}\left( {\mathbf {X}}_{n}^{k}\right) }{L_{\tilde{f}_{g}^{k}}\left( {\mathbf {X}}_{n}^{k}\right) }\text {,} \end{aligned}$$

for \(k\in \left\{ 1,2\right\}\), and the Swapped test statistic

$$\begin{aligned} \bar{V}_{g}\left( {\mathbf {X}}_{n}\right) =\frac{1}{2}\left\{ V_{g}^{1}\left( {\mathbf {X}}_{n}\right) +V_{g}^{2}\left( {\mathbf {X}}_{n}\right) \right\} \text {,} \end{aligned}$$

as introduced in Wasserman et al. (2020). Here, the denominator estimator \(\tilde{f}_{g}^{k}\) is the maximum likelihood estimator of \(f_{0}\), based on \({\mathbf {X}}_{n}^{k}\) under the null hypothesis \(\text {H}_{g}\), in the sense that

$$\begin{aligned} \tilde{f}_{g}^{k}\in \left\{ \tilde{f}\in {\mathcal {M}}_{g}:L_{\tilde{f}}\left( {\mathbf {X}}_{n}^{k}\right) =\max _{f\in {\mathcal {M}}_{g}}L_{f}\left( {\mathbf {X}}_{n}^{k}\right) \right\} \text {.} \end{aligned}$$

We define the \(p \text {-values}\) for the Split and Swapped test statistics as \(P_{g}^{k}\left( {\mathbf {X}}_{n}\right) =\min \{1/V_{g}^{k}\left( {\mathbf {X}}_{n}\right) ,1\}\) and \(\bar{P}_{g}\left( {\mathbf {X}}_{n}\right) =\min \{1/\bar{V}_{g}\left( {\mathbf {X}}_{n}\right) ,1\}\), respectively. An adaptation of Wasserman et al. (2020, Thm. 3) demonstrates that the two tests have correct size for any sample size n (i.e., \(P_{g}^{k}\left( {\mathbf {X}}_{n}\right)\) and \(\bar{P}_{g}\left( {\mathbf {X}}_{n}\right)\) satisfy condition (1), for any n).

Theorem 3

For any \(n\in {\mathbb {N}}\) and \(\alpha \in \left( 0,1\right)\),

$$\begin{aligned} \sup _{f\in {\mathcal {M}}_{g}}\mathrm {Pr}_{f}\left( P_{g}^{k}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \le \alpha \end{aligned}$$

and

$$\begin{aligned} \sup _{f\in {\mathcal {M}}_{g}}\mathrm {Pr}_{f}\left( \bar{P}_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \le \alpha \text {.} \end{aligned}$$

Theorem 3 implies that for each \(g\in {\mathbb {N}}\) and \(k\in \left\{ 1,2\right\}\), and for any sample size \(n\in {\mathbb {N}}\), if \(f_{0}\in {\mathcal {M}}_{g}\) is the DGP of \({\mathbf {X}}_{n}\), then events \(\left\{ P_{g}^{k}\left( {\mathbf {X}}_{n}\right) \le \alpha \right\}\) and \(\left\{ \bar{P}_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right\}\), corresponding to a rejection of the null hypothesis \(\text {H}_{g}\), occur with probability no greater than \(\alpha\), as required for a test of size \(\alpha\).
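The finite-sample validity in Theorem 3 stems from the fact that \(\hat{f}_{g}^{3-k}\) is fitted on the half-sample other than the one entering \(L_{\cdot }\left( {\mathbf {X}}_{n}^{k}\right)\), so that \(\mathrm {E}_{f_{0}}V_{g}^{k}\left( {\mathbf {X}}_{n}\right) \le 1\) under \(\text {H}_{g}\), and Markov's inequality then bounds the rejection probability by \(\alpha\). A minimal univariate Python sketch of computing \(P_{g}^{1}\), using a toy null class of unit-variance normals (so the constrained MLE is simply the subsample mean); all distributional choices here are our own illustrative assumptions, not the paper's setup:

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(400)]  # H_0 true for the toy null
x1, x2 = x[:200], x[200:]                      # split into X^1 and X^2

def log_lik(data, dist):
    return sum(math.log(dist.pdf(v)) for v in data)

# Null MLE on X^1 within the toy null class {N(mu, 1)}: the subsample mean.
f_tilde = NormalDist(fmean(x1), 1.0)
# Alternative estimator fitted on the held-out half X^2 (free sigma).
f_hat = NormalDist(fmean(x2), max(pstdev(x2), 1e-9))

log_v = log_lik(x1, f_hat) - log_lik(x1, f_tilde)  # log V_g^1
p_split = min(math.exp(-log_v), 1.0)               # P_g^1 = min(1/V, 1)
```

Since \(P_{g}^{1}=\min \{1/V_{g}^{1},1\}\), rejection at level \(\alpha =0.05\) requires \(V_{g}^{1}\ge 20\); under the null DGP used here, \(P_{g}^{1}\) should typically be far from 0.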

It is suggested by Windham and Cutler (1992), Polymenis and Titterington (1998), and Wasserman et al. (2020) that the alternative hypothesis for each \(\text {H}_{g}\) should be that \(f_{0}\in \bar{{\mathcal {M}}}_{g}={\mathcal {M}}_{g+1}\). However, since we are only looking to reject \(\text {H}_{g}\), rather than making conclusions regarding the alternative, we can take \(\bar{{\mathcal {M}}}_{g}\) to be a richer class of PDFs that is still feasible to estimate. Thus, in the sequel, we shall consider the possibility that \(\bar{{\mathcal {M}}}_{g}={\mathcal {M}}_{g+l_{g}}\) for some \(l_{g}\in {\mathbb {N}}\), for each \(g\in {\mathbb {N}}\). Typically, we can let \(l_{g}=l\) for all g, but we anticipate that there may be circumstances where one may wish for \(l_{g}\) to vary.

3.1 Consistency of order tests

Although Theorem 3 guarantees the control of the Type I error for each local test of \(\text {H}_{g}\), it makes no statement regarding the power of the tests. For tests against alternatives of the form: \(\bar{{\mathcal {M}}}_{g}={\mathcal {M}}_{g+l_{g}}\), we shall consider the issue of power from an asymptotic perspective in the parametric context. That is, we suppose that

$$\begin{aligned} {\mathcal {K}}\left( {\mathbb {X}}\right) =\left\{ f\left( {\varvec{x}}\right) =f\left( {\varvec{x}};{\varvec{\theta }}\right) :{\varvec{\theta }}\in {\mathbb {T}}\right\} \text {,} \end{aligned}$$
(5)

where \({\mathbb {T}}\subseteq {\mathbb {R}}^{p}\) for some \(p\in {\mathbb {N}}\), and thus

$$\begin{aligned}{\mathcal {M}}_{g}\left( {\mathbb {X}}\right) &=\left\{ f\left( {\varvec{x}};{\varvec{\vartheta }}^{\left( g\right) }\right) :f\left( {\varvec{x}};{\varvec{\vartheta }}^{\left( g\right) }\right) =\sum _{z=1}^{g}\pi _{z}f\left( {\varvec{x}};{\varvec{\theta }}_{z}\right) ;\pi _{z}\ge 0,\right. \nonumber \\&\quad \left. \sum _{z=1}^{g}\pi _{z}=1,{\varvec{\theta }}_{z}\in {\mathbb {T}},z\in \left[ g\right] \right\} \text {.} \end{aligned}$$
(6)

We put the pairs \(\left( \left( \pi _{z},{\varvec{\theta }}_{z}\right) \right) _{z=1}^{g}\) in the vector \({\varvec{\vartheta }}^{\left( g\right) }\in \left( \left[ 0,1\right] \times {\mathbb {T}}\right) ^{g}={\mathbb {T}}_{g}\). Here, we further replace \(\hat{f}_{g}^{2}\) and \(\tilde{f}_{g}^{1}\) by \(f\left( \cdot ;\hat{{\varvec{\vartheta }}}_{n}^{\left( g+l_{g}\right) }\right)\) and \(f\left( \cdot ;\tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\right)\), respectively, where \(\hat{{\varvec{\vartheta }}}_{n}^{\left( g+l_{g}\right) }\) is a function of \({\mathbf {X}}_{n}^{2}\) and \(\tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\) is a function of \({\mathbf {X}}_{n}^{1}\). Further, since \(f\left( \cdot ;\tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\right)\) is the maximum likelihood estimator of \(f_{0}\in {\mathcal {M}}_{g}\), we also write

$$\begin{aligned} \tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\in \left\{ \tilde{{\varvec{\vartheta }}}^{\left( g\right) }\in {\mathbb {T}}_{g}:\prod _{i=1}^{n_{1}}f\left( {\varvec{X}}_{i}^{1};\tilde{{\varvec{\vartheta }}}^{\left( g\right) }\right) =\max _{{\varvec{\vartheta }}^{\left( g\right) }\in {\mathbb {T}}_{g}}\prod _{i=1}^{n_{1}}f\left( {\varvec{X}}_{i}^{1};{\varvec{\vartheta }}^{\left( g\right) }\right) \right\} \text {.} \end{aligned}$$
(7)

Following DasGupta (2008, Def. 23.1), we say that a sequence of tests \(\left( R_{g}\left( {\mathbf {X}}_{n}\right) \right) _{n=1}^{\infty }\) for \(\text {H}_{g}\) is consistent if under the true DGP, characterized by \(f_{0}\notin {\mathcal {M}}_{g}\), it is true that \(\text {Pr}_{f_{0}}\left( R_{g}\left( {\mathbf {X}}_{n}\right) =1\right) \rightarrow 1\), as \(n\rightarrow \infty\). Let \(\left\| \cdot \right\|\) denote the Euclidean norm and define the Kullback–Leibler divergence between two PDFs on \({\mathbb {X}}\): \(f_{1}\) and \(f_{2}\), as

$$\begin{aligned} \text {D}\left( f_{1},f_{2}\right) =\int _{{\mathbb {X}}}f_{1}\left( {\varvec{x}}\right) \log \frac{f_{1}\left( {\varvec{x}}\right) }{f_{2}\left( {\varvec{x}}\right) }\text {d}{\varvec{x}}\text {.} \end{aligned}$$
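As a concrete check of this definition, the divergence between two univariate normals has the closed form \(\text {D}\left( \phi \left( \cdot ;\mu _{1},\sigma _{1}^{2}\right) ,\phi \left( \cdot ;\mu _{2},\sigma _{2}^{2}\right) \right) =\log \left( \sigma _{2}/\sigma _{1}\right) +\left[ \sigma _{1}^{2}+\left( \mu _{1}-\mu _{2}\right) ^{2}\right] /\left( 2\sigma _{2}^{2}\right) -1/2\), which can be verified against a direct numerical evaluation of the integral; a Python sketch (grid limits and resolution are arbitrary choices):

```python
import math
from statistics import NormalDist

def kl_normals(m1, s1, m2, s2):
    """Closed-form KL divergence D(N(m1, s1^2), N(m2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_numeric(f1, f2, lo=-15.0, hi=15.0, m=100_000):
    """Midpoint-rule approximation of the integral defining D(f1, f2)."""
    h = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * h
        p = f1.pdf(x)
        if p > 0:
            total += p * math.log(p / f2.pdf(x)) * h
    return total

exact = kl_normals(0.0, 1.0, 2.0, 1.5)
approx = kl_numeric(NormalDist(0.0, 1.0), NormalDist(2.0, 1.5))
```

The two evaluations should agree to several decimal places, with the discrepancy driven only by grid resolution and tail truncation.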

Further, say that a class of parametric mixture models \({\mathcal {M}}_{g}\) is identifiable if

$$\begin{aligned} \sum _{z=1}^{g}\pi _{z}f\left( {\varvec{x}};{\varvec{\theta }}_{z}\right) =\sum _{z=1}^{g}\pi _{z}^{\prime }f\left( {\varvec{x}};{\varvec{\theta }}_{z}^{\prime }\right) \end{aligned}$$

if and only if \(\sum _{z=1}^{g}\pi _{z}\mathbf {1}\left( {\varvec{\theta }}={\varvec{\theta }}_{z}\right) =\sum _{z=1}^{g}\pi _{z}^{\prime }\mathbf {1}\left( {\varvec{\theta }}={\varvec{\theta }}_{z}^{\prime }\right)\) for all \({\varvec{\theta }}\in {\mathbb {T}}\), where \(\mathbf {1}\left( \cdot \right)\) is the usual indicator function. For \(R_{g}\left( {\mathbf {X}}_{n}\right) =\mathbf {1}\left( P_{g}^{1}\left( {\mathbf {X}}_{n}\right) <\alpha \right)\), where \(P_{g}^{1}\left( {\mathbf {X}}_{n}\right)\) is obtained from testing \(\text {H}_{g}\) against the alternative \(\bar{{\mathcal {M}}}_{g}={\mathcal {M}}_{g+l_{g}}\), we obtain the following result. The equivalent result regarding \(\bar{P}_{g}\left( {\varvec{X}}_{n}\right)\) can be established analogously.

Theorem 4

Make the following assumptions:

(A1):

for each \(g\in {\mathbb {N}}\), the class \({\mathcal {M}}_{g}\) is identifiable;

(A2):

the PDF \(f\left( {\varvec{x}};{\varvec{\theta }}\right) >0\) is everywhere positive and continuous for all \(\left( {\varvec{x}},{\varvec{\theta }}\right) \in {\mathbb {X}}\times {\mathbb {T}}\), where \({\mathbb {X}}\) and \({\mathbb {T}}\) are Euclidean spaces and \({\mathbb {T}}\) is compact;

(A3):

for all \({\varvec{x}}\in {\mathbb {X}}\) and \({\varvec{\theta }}_{1},{\varvec{\theta }}_{2}\in {\mathbb {T}}\), \(\left| \log f\left( {\varvec{x}};{\varvec{\theta }}_{1}\right) \right| \le M_{1}\left( {\varvec{x}}\right)\) and

$$\begin{aligned} \left| \log f\left( {\varvec{x}};{\varvec{\theta }}_{1}\right) -\log f\left( {\varvec{x}};{\varvec{\theta }}_{2}\right) \right| \le M_{2}\left( {\varvec{x}}\right) \left\| {\varvec{\theta }}_{1}-{\varvec{\theta }}_{2}\right\| , \end{aligned}$$

where \(\mathrm {E}_{f_{0}}M_{1}\left( {\varvec{X}}\right) <\infty\) and \(\mathrm {E}_{f_{0}}M_{2}\left( {\varvec{X}}\right) <\infty\);

(A4):

the estimator \(\hat{{\varvec{\vartheta }}}_{n}^{\left( g+l_{g}\right) }\rightarrow {\varvec{\vartheta }}_{0}^{\left( g+l_{g}\right) }\), in probability, as \(n_{2}\rightarrow \infty\), where

$$\begin{aligned} {\varvec{\vartheta }}_{0}^{\left( g+l_{g}\right) } & \in \Bigg\{ \hat{{\varvec{\vartheta }}}^{\left( g+l_{g}\right) }\in {\mathbb {T}}_{g+l_{g}}:\mathrm {E}_{f_{0}}\log f\left( {\varvec{X}};\hat{{\varvec{\vartheta }}}^{\left( g+l_{g}\right) }\right) \\&\quad =\max _{{\varvec{\vartheta }}^{\left( g+l_{g}\right) }\in {\mathbb {T}}_{g+l_{g}}}\mathrm {E}_{f_{0}}\log f\left( {\varvec{X}};{\varvec{\vartheta }}^{\left( g+l_{g}\right) }\right) \Bigg\} . \end{aligned}$$

Under Assumptions (A1)–(A4), if \(f_{0}\in {\mathcal {M}}\backslash {\mathcal {M}}_{g}\), and \(n_{1},n_{2}\rightarrow \infty\), then \(R_{g}\left( {\mathbf {X}}_{n}\right) =\mathbf {1}\left( P_{g}^{1}\left( {\mathbf {X}}_{n}\right) <\alpha \right)\) is a consistent test for \(\text {H}_{g}\).

Proof

The proof of this result appears in the Appendix. \(\square\)

Assumption A1 ensures that the elements of (6) [i.e., the g component mixtures of densities of class (5)] are distinct (as noted in Titterington et al., 1985, Sec. 3.1.1), and A2 implies that the log-likelihood cannot take infinitely negative values and that it is continuous for any \({\varvec{x}}\) and \({\varvec{\theta }}\), where the compactness of \({\mathbb {T}}\) ensures that \(f\left( {\varvec{x}};{\varvec{\theta }}\right)\) is bounded for each fixed \({\varvec{x}}\in {\mathbb {X}}\). Assumption A3 then implies that the expected log-likelihood \(\text {E}_{f_{0}}\log f\left( {\varvec{X}};{\varvec{\theta }}\right)\) is bounded for each \({\varvec{\theta }}\), and since \(f\left( {\varvec{x}};{\varvec{\theta }}\right)\) is continuous, \(\text {E}_{f_{0}}\log f\left( {\varvec{X}};{\varvec{\theta }}\right)\) is also continuous and thus has global optima within the compact set \({\mathbb {T}}\). Assumption A3 also implies that \(\text {E}_{f_{0}}\log f\left( {\varvec{X}};{\varvec{\theta }}\right)\) is Lipschitz continuous, with respect to \({\varvec{\theta }}\in {\mathbb {T}}\) equipped with the Euclidean norm, and A4 implies that \(\hat{{\varvec{\vartheta }}}_{n}^{\left( g+l_{g}\right) }\), characterizing \(\hat{f}_{g}^{2}\), behaves asymptotically (with respect to convergence in probability) like a parametric maximum likelihood estimator, under the potentially misspecified supposition that \(f_{0}\in {\mathcal {M}}_{g+l_{g}}\).

Assumptions A1–A3 are required for the application of Leroux (1992, Lem. 1), with A2 and A3 also required for establishing the consistency of the estimators \(\tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\). The Lipschitz condition of A3 and Assumption A4 are further required to show that the logarithm of the split test statistic \(V_{g}^{1}\left( {\mathbf {X}}_{n}\right)\) consistently estimates a difference of Kullback–Leibler divergences that arises in the proof.

Theorem 4 states that for each \(g<g_0\) and for any significance level \(\alpha \in (0,1)\), the rejection probability of the test of \(\text {H}_g\), based on \(P_{g}^{1}\left( {\mathbf {X}}_{n}\right)\), converges to 1, as n gets large. We note that Assumptions A1–A4 are verifiable for typical models of interest. For example, when \({\mathcal {M}}_{g}\) is the class of g component normal mixture models (see Sect. 4), A1 is verified due to Yakowitz and Spragins (1968), A2 is satisfied by the usual compact restrictions on the parameter space (see, e.g., Ritter, 2014, Sec. B.6.2), and A3 is satisfied under A2.
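The mechanism behind Theorem 4 can be illustrated numerically: when \(g<g_{0}\), the fitted alternative separates from the null MLE at a rate that makes \(\log V_{g}^{1}\) grow linearly in \(n_{1}\). In the Python sketch below, for brevity we plug the true mixture density in place of the fitted alternative \(\hat{f}_{g}^{2}\), which is the limit it approaches under (A4); the DGP, sample sizes, and null class are all our own illustrative choices:

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(7)
# True DGP: a well-separated two-component normal mixture (g0 = 2).
f0 = lambda v: 0.5 * NormalDist(-3.0, 1.0).pdf(v) + 0.5 * NormalDist(3.0, 1.0).pdf(v)
x = [rng.gauss(-3.0, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
     for _ in range(1000)]
x1, x2 = x[:500], x[500:]

# Null MLE on X^1 within M_1 (a single normal): sample mean and sd.
f_tilde = NormalDist(fmean(x1), pstdev(x1))
# Stand-in for the fitted alternative on X^2: under (A4) it converges to
# the KL-optimal member of M_2, which here is f0 itself.
log_v = sum(math.log(f0(v)) - math.log(f_tilde.pdf(v)) for v in x1)
p_split = min(math.exp(-log_v), 1.0)
rejected = p_split <= 0.05
```

Because the single-normal null cannot match the bimodal DGP, \(\log V_{1}^{1}\) is large and the split p-value is essentially zero, so \(\text {H}_{1}\) is rejected, in line with the consistency statement.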

4 Normal mixture models

We apply the STP with the Split and Swapped tests from Sect. 3 to the classic problem of order selection for normal mixture models, whereby \({\mathbb {X}}={\mathbb {R}}^{d}\) for some \(d\in {\mathbb {N}}\) and

$$\begin{aligned} {\mathcal {K}}\left( {\mathbb {X}}\right) & ={\mathcal {K}}\left( {\mathbb {R}}^{d}\right) =\Biggl\{ f\left( {\varvec{x}}\right) =\phi \left( {\varvec{x}};{\varvec{\mu }},{\varvec{\Sigma }}\right) :\phi \left( {\varvec{x}};{\varvec{\mu }},{\varvec{\Sigma }}\right) =\left| 2\pi {\varvec{\Sigma }}\right| ^{-1/2}\Biggr. \\&\quad \Biggl. \times \exp \left[ -\frac{1}{2}\left( {\varvec{x}}-{\varvec{\mu }}\right) ^{\top }{\varvec{\Sigma }}^{-1}\left( {\varvec{x}}-{\varvec{\mu }}\right) \right] \Biggr\} \text {,} \end{aligned}$$

where \({\varvec{\mu }}\in {\mathbb {R}}^{d}\) and \({\varvec{\Sigma }}\in {\mathbb {R}}^{d\times d}\) is symmetric positive definite.
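Evaluating a member of \({\mathcal {M}}_{g}\left( {\mathbb {R}}^{d}\right)\) at a point simply combines the density \(\phi\) above with the mixture sum from Sect. 1. A brief numpy sketch, with a made-up two-component bivariate example (weights, means, and covariances are illustrative assumptions):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """phi(x; mu, Sigma) = |2 pi Sigma|^(-1/2) exp(-(x-mu)' Sigma^(-1) (x-mu) / 2)."""
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt(np.linalg.det(2 * np.pi * sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_pdf(x, weights, mus, sigmas):
    """g-component normal mixture density sum_z pi_z phi(x; mu_z, Sigma_z)."""
    return sum(w * mvn_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Illustrative two-component bivariate (d = 2) mixture.
weights = [0.4, 0.6]
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
val = mixture_pdf(np.zeros(2), weights, mus, sigmas)
```

As a sanity check, \(\phi \left( {\varvec{0}};{\varvec{0}},{\varvec{I}}_{2}\right) =1/\left( 2\pi \right)\), so the first component contributes \(0.4/\left( 2\pi \right)\) to `val`.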

To assess the performance of the STP, we conduct a thorough simulation study within the \(\mathsf {R}\) programming environment (R Core Team, 2020). For each \(d\in \left\{ 2,4\right\}\), we generate data sets \({\mathbf {X}}_{n}\), with \(n_{1}=n_{2}\in \left\{ 300,500,1000,2000,5000,10{,}000\right\}\) observations (recall that \(n=n_{1}+n_{2}\)), where each \({\varvec{X}}_{i}\in {\mathbb {R}}^{d}\), from a multivariate normal mixture model in \({\mathcal {M}}_{g_{0}}\left( {\mathbb {R}}^{d}\right)\) for \(g_{0}\in \left\{ 5,10\right\}\), with parameter elements \(\left( \pi _{z},{\varvec{\mu }}_{z},{\varvec{\Sigma }}_{z}\right) _{z=1}^{g_{0}}\) of \({\mathcal {M}}_{g_{0}}\left( {\mathbb {R}}^{d}\right)\) generated using the \(\mathsf {MixSim}\) package (Melnykov et al., 2012), using the setting \(\bar{\omega }\in \left\{ 0.01,0.05,0.1\right\}\) and \(\min _{z\in \left[ g_{0}\right] }\pi _{z}\ge \left( 2g_{0}\right) ^{-1}\). Here, the \(\bar{\omega }\) parameter is described in Melnykov et al. (2012), and controls the level of overlap between the normal components of the mixture model. Four examples of data sets generated using various combinations of simulation parameters \(\left( g_{0},\bar{\omega }\right)\), with \(d=2\) and \(n_{1}=1000\), are provided in Fig. 1.

Fig. 1

Example data sets of \(n_{1}=1000\) random observations from a \(d=2\) dimensional \(g_{0}\) component normal mixture model, with parameters determined via parameter \(\bar{\omega }\). Here, the pairs \(\left( g_{0},\bar{\omega }\right)\) visualized in subplots a, b, c, and d are \(\left( 5,0.01\right)\), \(\left( 5,0.05\right)\), \(\left( 5,0.1\right)\), and \(\left( 10,0.01\right)\), respectively

For each set of simulation parameters \(\left( g_{0},\bar{\omega },d,n_{1}\right)\), we simulate \(r=100\) replicate data sets, whereupon we apply the STP at the \(\alpha =0.05\) level, using the Split and Swapped test p-values of the forms \(P_{g}^{1}\) and \(\bar{P}_{g}\), for each of the r data sets. To compute the maximum likelihood estimators \(\tilde{f}_{g}^{k}=f\left( \cdot ;\tilde{{\varvec{\vartheta }}}_{n}^{\left( g\right) }\right)\), under the null hypotheses that \(f_{0}\in {\mathcal {M}}_{g}\), we use the \(\texttt {gmm\_full}\) function from the \(\textsf {Armadillo}\) \(\textsf {C++}\) library, implemented in \(\textsf {R}\) using the \(\textsf {RcppArmadillo}\) package (Eddelbuettel & Sanderson, 2014). We likewise use the maximum likelihood estimator \(\hat{f}_{g}^{3-k}=f\left( \cdot ;\hat{{\varvec{\vartheta }}}_{n}^{\left( g+l_{g}\right) }\right)\) under the alternative hypotheses \(f_{0}\in \bar{{\mathcal {M}}}_{g}={\mathcal {M}}_{g+l_{g}}\), where we set \(l_{g}=l\in \left\{ 1,2\right\}\), for all \(g\in {\mathbb {N}}\). From each of the r STP results, we compute the coverage proportion (CovProp; the proportion of the r repetitions for which \(g_{0}\ge \hat{g}_{n}\)), the mean estimated number of components (MeanComp; the average of \(\hat{g}_{n}\) over the r repetitions), and the proportion of times that the estimated number of components corresponded with \(g_{0}\) (CorrProp; the proportion of the r repetitions in which the event \(\hat{g}_{n}=g_{0}\) occurs).
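To fix ideas, the following Python sketch illustrates the construction of the Split and Swapped p-values in the simplest univariate case: testing \(\text {H}_{1}\) (one component, closed-form null MLE) against a \(g=2\) alternative fitted by a minimal EM routine. The study itself fits general \(g\)-component models with Armadillo's \(\texttt {gmm\_full}\) in R; the EM initialization and iteration count here are illustrative assumptions.

```python
import numpy as np

def norm_logpdf(x, mu, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def fit_null_g1(x):
    """MLE under H_1 (a single normal component): closed form."""
    return x.mean(), x.var()

def fit_alt_g2(x, iters=200):
    """Minimal EM fit of a g = 2 univariate normal mixture, standing
    in for the gmm_full fits (means initialized at the quartiles
    for determinism)."""
    mu = np.quantile(x, [0.25, 0.75])
    var = np.full(2, x.var())
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        comp = np.log(pi)[None, :] + np.column_stack(
            [norm_logpdf(x, mu[k], var[k]) for k in range(2)])
        comp -= comp.max(axis=1, keepdims=True)
        r = np.exp(comp)
        r /= r.sum(axis=1, keepdims=True)       # responsibilities
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum(
            (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk, 1e-6)
    return pi, mu, var

def mix_loglik(x, pi, mu, var):
    comp = np.log(pi)[None, :] + np.column_stack(
        [norm_logpdf(x, mu[k], var[k]) for k in range(len(pi))])
    m = comp.max(axis=1)
    return float((m + np.log(np.exp(comp - m[:, None]).sum(axis=1))).sum())

def split_log_stat(x_eval, x_fit):
    """log V: the alternative is fit on x_fit, the null MLE on x_eval,
    and the likelihood ratio is evaluated on x_eval."""
    mu0, var0 = fit_null_g1(x_eval)
    pi, mu, var = fit_alt_g2(x_fit)
    return mix_loglik(x_eval, pi, mu, var) - float(
        norm_logpdf(x_eval, mu0, var0).sum())

rng = np.random.default_rng(1)
x = rng.permutation(np.concatenate(
    [rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)]))
x1, x2 = x[:500], x[500:]
lv1 = split_log_stat(x1, x2)                      # log Split statistic
lvbar = np.logaddexp(lv1, split_log_stat(x2, x1)) - np.log(2.0)  # log Swapped
p_split = float(np.exp(-max(lv1, 0.0)))           # p = min(1/V, 1), log scale
p_swap = float(np.exp(-max(lvbar, 0.0)))
```

With two well-separated components, both p-values are far below \(\alpha = 0.05\), so \(\text {H}_{1}\) is rejected.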

It is worth recalling that our estimators \(\hat{g}_{n}\) are not in fact point estimators of \(g_{0}\), but are instead lower bounds of the STP interval estimators \(\left\{ g_{0}\ge \hat{g}_{n}\right\}\), and are thus only expected to be close to \(g_{0}\), with \(\hat{g}_{n}=g_{0}\) indicating that the interval is efficient. We can complement the output of the interval estimator with a point estimator of \(g_{0}\), such as via the AIC or BIC procedures, whereupon for data set \({\mathbf {X}}_{n}\), and for each \(g\in {\mathbb {N}}\), the AIC or BIC values:

$$\begin{aligned}&\text {AIC}_{g}=\frac{2}{n}\text {dim}_{g}-\frac{2}{n}\sum _{i=1}^{n}\log f\left( {\varvec{X}}_{i};{\varvec{\vartheta }}_{n}^{\left( g\right) }\right) \text {, and} \nonumber \\&\text {BIC}_{g}=\frac{\log n}{n}\text {dim}_{g}-\frac{2}{n}\sum _{i=1}^{n}\log f\left( {\varvec{X}}_{i};{\varvec{\vartheta }}_{n}^{\left( g\right) }\right) \text {,} \end{aligned}$$
(8)

respectively, are computed. Here \(\text {dim}_g\) is the dimensionality of \({\varvec{\vartheta }}^{(g)}\) and \({\varvec{\vartheta }}^{(g)}_n\) is a maximum likelihood estimator of form (7), computed using \(\varvec{X}_n\) instead of \(\varvec{X}_n^1\). The AIC and BIC procedures then estimate \(g_{0}\) via \(\arg \min _{g}\text {AIC}_{g}\) or \(\arg \min _{g}\text {BIC}_{g}\), respectively. To complement our results regarding \(\left\{ g_{0}\ge \hat{g}_{n}\right\}\), we also provide the MeanComp and CorrProp values for the AIC and BIC procedures. All of our R scripts are made available at https://github.com/ex2o/oscfmm.
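The criteria in (8) can be sketched directly, assuming full-covariance normal components so that \(\text {dim}_{g}=\left( g-1\right) +gd+gd\left( d+1\right) /2\); the maximized log-likelihood values below are hypothetical, used only to show the \(\arg \min\) selection step.

```python
import numpy as np

def dim_g(g, d):
    """Free parameters of a g-component d-variate normal mixture:
    (g-1) weights + g*d means + g*d*(d+1)/2 covariance entries."""
    return (g - 1) + g * d + g * d * (d + 1) // 2

def aic_bic(loglik, g, d, n):
    """The normalized AIC_g and BIC_g of display (8)."""
    aic = 2.0 * dim_g(g, d) / n - 2.0 * loglik / n
    bic = np.log(n) * dim_g(g, d) / n - 2.0 * loglik / n
    return aic, bic

# hypothetical maximized log-likelihoods for g = 1..4, with n = 1000, d = 2
loglik = {1: -4200.0, 2: -3650.0, 3: -3640.0, 4: -3638.0}
n, d = 1000, 2
crit = {g: aic_bic(ll, g, d, n) for g, ll in loglik.items()}
g_aic = min(crit, key=lambda g: crit[g][0])  # AIC order estimate
g_bic = min(crit, key=lambda g: crit[g][1])  # BIC order estimate
```

For these hypothetical values, AIC selects a larger order than BIC, reflecting BIC's heavier \(\log n\) penalty.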

4.1 Simulation results

For all scenario combinations \(\left( g_{0},\bar{\omega },d,n_{1},l\right)\), the CovProp was \(100\%\) over the r repetitions. This confirms the conclusions of Theorems 2 and 3. It also implies that the tests are underpowered, which conforms to the observations from the simulations of Wasserman et al. (2020). This result is unsurprising, since the tests are constructed via a Markov inequality argument, which makes no use of the topological features of the sets \({\mathcal {M}}_{g}\) and \(\bar{{\mathcal {M}}}_{g}\) that could be used to derive sharper results.

We report the STP interval estimator, and AIC and BIC point estimator results for all of the combinations \(\left( g_{0},\bar{\omega },d,n_{1},l\right)\), partitioned by \(\left( g_{0},\bar{\omega }\right)\) in Tables 1, 2, 3, 4, 5 and 6. Here, Tables 1, 2, 3, 4, 5 and 6 contain results for pairs \(\left( 5,0.01\right)\), \(\left( 5,0.05\right)\), \(\left( 5,0.1\right)\), \(\left( 10,0.01\right)\), \(\left( 10,0.05\right)\), and \(\left( 10,0.1\right)\), respectively. In the \(\left( 5,0.01\right)\) case, we observe that both the Split and Swapped test-based STPs were able to identify the generative value of g in over \(90\%\) of the cases, except when \(n_{1}<1000\) and for the case \(\left( d,n_{1},l\right) =\left( 4,1000,2\right)\). There is some evidence that the Swapped test is more powerful than the Split test in all cases, as indicated by the higher values of MeanComp and CorrProp. Furthermore, the \(l=2\) alternative appears to be more powerful than the \(l=1\) alternative in all cases except when \(\left( d,n_{1}\right) =\left( 4,1000\right)\).

Table 1 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 5,0.01\right)\)
Table 2 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 5,0.05\right)\)
Table 3 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 5,0.1\right)\)
Table 4 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 10,0.01\right)\)
Table 5 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 10,0.05\right)\)
Table 6 MeanComp and CorrProp results for different values of \(\left( d,n_{1},l\right)\), when \(\left( g_{0},\bar{\omega }\right) =\left( 10,0.1\right)\)

For the other pairs of \(\left( g_{0},\bar{\omega }\right)\), we observe the same relationships between the values of l and the Split and Swapped tests. That is, \(l=2\) tends to be more powerful than \(l=1\) (except when \(n_{1}\) is relatively small, i.e., \(n_{1}\in \left\{ {300,500},1000,2000\right\}\)), and the Swapped test tends to be more powerful than the Split test. In addition, we observe that the STP becomes more powerful as \(n_{1}\) increases, which supports the conclusions of Theorem 4, which applies to the normal mixture model that is under study.

For smaller sample sizes, we observe that the STP tended to be more powerful when \(d=2\) in almost all cases, while for larger sample sizes, the opposite appears to be true. This is likely due to a combination of the variability of the maximum likelihood estimator and the increased separability afforded by higher-dimensional spaces. Finally, we notice that the STP was more powerful when the data were more separable (i.e., for smaller values of \(\bar{\omega }\)). Here, we can see that for \(n_{1}=10{,}000\), the STP can identify the generative value of g in the \(g_{0}=5\) scenarios, in a large proportion of cases. However, when \(g_{0}=10\), the STP becomes less powerful. It is particularly remarkable that even when \(n_{1}=10{,}000\), the highest detection proportion was \(7\%\) in the \(\left( g_{0},\bar{\omega }\right) =\left( 10,0.1\right)\) scenarios. This again implies that the STP lacks power, when applied with the Split or Swapped tests, especially when the component densities of the generative mixture model are not well separated.

Regarding the AIC and BIC point estimators, we observe firstly that, across all scenarios, the AIC procedure produces a larger estimate of \(g_{0}\) than the BIC procedure, as seen from the MeanComp values. We also observe that the AIC estimator is often larger than \(g_{0}\), even for larger sample sizes. This observation is in concordance with the theory of Leroux (1992) and Keribin (2000), who show that the AIC procedure does not underestimate \(g_{0}\), asymptotically, but is also not consistent. On the other hand, we observe that the BIC procedure tends to underestimate \(g_{0}\) for small sample sizes, but becomes more accurate, on average, as \(n_{1}\) increases. Again, this is in concordance with the consistency results regarding the BIC estimator of Keribin (2000). Regarding the CorrProp values, we observe that for smaller sample sizes (\(n_{1}\in \left\{ 300,500\right\}\)), the AIC procedure outperforms the BIC procedure in all cases other than those reported in Table 1. This is likely due to the downward bias of the BIC estimates, as observed via the MeanComp values. This downward bias appears to be most apparent in situations where the data are less separable and \(g_{0}\) is larger, as is evident from the results of Table 6, where the AIC procedure outperforms the BIC procedure with respect to CorrProp across all sample sizes. Compared with the STP estimates \(\hat{g}_{n}\), the AIC and BIC procedures provide, as expected, point estimates of \(g_{0}\) that are at least as accurate in all simulation scenarios.

Overall, we observe that the conclusions of Theorems 2–4 appear to hold over the assessed simulation scenarios. From a practical perspective, we can make the following recommendations. Firstly, the STP based on the Swapped test is preferred over the Split test. Secondly, the alternative based on \(l=2\) is preferred over \(l=1\). Thirdly, to obtain intervals of a fixed level of efficiency, larger sample sizes are necessary when data arise from mixture models with larger numbers of mixture components and when the mixture components are not well separated. Finally, we note that the AIC and BIC procedures both provide accurate point estimation of \(g_{0}\) and are both complementary to the interval estimator \(\left\{ g_{0}\ge \hat{g}_{n}\right\}\) obtained from the STP.

4.2 Example applications

We proceed to demonstrate the utility of the STP via example applications of varying complexity.

4.2.1 Old faithful data

Our first example is to assess the number of Gaussian mixture components that are present in the \(\mathtt {faithful}\) data set from \(\mathsf {R}\), which was originally studied in Hardle (1991). The data set consists of a length \(n=272\) realization of a sequence \({\mathbf {X}}_{n}\), where \({\varvec{X}}_{i}\in {\mathbb {R}}^{2}\) for each \(i\in \left[ n\right]\). Here, each observation \({\varvec{X}}_{i}=\left( X_{i1},X_{i2}\right)\) contains measurements regarding the eruption length of time \(X_{i1}\) and the waiting time until the next eruption \(X_{i2}\), in minutes, of eruption event i, for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. A visualization of the data appears in Fig. 2.

Fig. 2

Scatter plot of the \(\mathtt {faithful}\) data set

We apply the STP using an \(n_{1}=n_{2}=136\) split. The p-values obtained from the Split tests of hypotheses \(\text {H}_{g}\) versus \(\bar{\text {H}}_{g}\) with \(l_{g}=2\), for \(g=1,2\), are \(3.40\times 10^{-32}\) and 1, respectively; the corresponding p-values for the Swapped test are \(6.80\times 10^{-32}\) and 1. Thus, using either the Split or the Swapped test variant of the STP, for \(\alpha >6.80\times 10^{-32}\), we can conclude that the event \(\left\{ g_{0}\ge 2\right\}\) occurs with a probability of at least \(1-\alpha\). We also obtain the \(\text {AIC}_{g}\) values for \(g=1,2,3\): 9.53, 8.40, and 8.42, and the respective \(\text {BIC}_{g}\) values: 9.61, 8.56, and 8.66. Thus, both procedures estimate the order of the underlying mixture distribution to be 2. Although there is no ground truth regarding the \(\mathtt {faithful}\) data set, a visual inspection of Fig. 2 suggests that both the STP interval estimator and the point estimates provided by the AIC and BIC procedures are reasonable.

4.2.2 Palmer penguins data

Our second example is to estimate the Gaussian mixture order of the \(\mathtt {penguins}\) data set from the \(\mathsf {R}\) package \(\mathsf {palmerpenguins}\), originally considered by Gorman et al. (2014). After removing rows with missing data, the data set contains a length \(n=342\) realization of a sequence \({\mathbf {X}}_{n}\), where \({\varvec{X}}_{i}\in {\mathbb {R}}^{4}\) for each \(i\in \left[ n\right]\). Here, each observation \({\varvec{X}}_{i}=\left( X_{i1},X_{i2},X_{i3},X_{i4}\right)\) contains measurements regarding penguins of the Adelie, Gentoo, and Chinstrap species. Specifically, for each i, the measurements are the bill length \(X_{i1}\), bill depth \(X_{i2}\), and flipper length \(X_{i3}\), all in millimeters, along with the body mass \(X_{i4}\), in grams. A visualization of the data, with separate symbols for the different penguin species, is provided in Fig. 3.

Fig. 3

Scatter plot of the \(\mathtt {penguins}\) data set. Adelie, Chinstrap, and Gentoo data points are plotted as circles, triangles, and plus signs, respectively

We apply the STP using an \(n_{1}=n_{2}=171\) split. The p-values obtained from the Split tests of hypotheses \(\text {H}_{g}\) versus \(\bar{\text {H}}_{g}\) with \(l_{g}=2\), for \(g=1,2\), are \(3.78\times 10^{-61}\) and 1, respectively; the corresponding p-values for the Swapped test are \(9.25\times 10^{-66}\) and 1. Thus, using either version of the STP, for \(\alpha >3.78\times 10^{-61}\), we conclude that the event \(\left\{ g_{0}\ge 2\right\}\) occurs with a probability of at least \(1-\alpha\). For these data, the \(\text {AIC}_{g}\) values for \(g=1,2,3,4,5\) are 32.37, 30.65, 30.38, 30.35, and 30.36, and the respective \(\text {BIC}_{g}\) values are 32.54, 30.99, 30.89, 31.03, and 31.20. Thus, the AIC and BIC procedures estimate the true order \(g_{0}\) to be 4 and 3, respectively. Compared to the ground truth of three penguin species, we observe that the AIC procedure overestimates the order, whereas the BIC estimate is accurate. The inference obtained from the STP is also correct, with the assessment that there are at least 2 mixture components, with high probability.

4.2.3 Cell lines data set

Our final example is to identify the number of mixture components in the \(\texttt {cell\_lines}\) data set from the \(\mathsf {harmony}\) package of Korsunsky et al. (2019). As presented in https://portals.broadinstitute.org/harmony/articles/quickstart.html, the data set consists of \(n=2370\) rows, constituting a realization of the sequence \({\mathbf {X}}_{n}\), where \({\varvec{X}}_{i}\in {\mathbb {R}}^{2}\) for each \(i\in \left[ n\right]\). Each observation \({\varvec{X}}_{i}=\left( X_{i1},X_{i2}\right)\) contains measurements of the first and second scaled principal components of single cell gene expression data. The data come from three sources: the first is a pure Jurkat cell line, the second is a pure HEK293T cell line, and the third consists of a half-and-half mix of Jurkat and HEK293T cells. Since the data from the mixed source are not registered to the pure-source data, there are in effect four separate subpopulations of observations. We plot the \(\texttt {cell\_lines}\) data in Fig. 4.

Fig. 4

Scatter plot of the \(\texttt {cell\_lines}\) data set. The pure and mixed Jurkat cells data are plotted as plus signs and circles, respectively, and the pure and mixed HEK293T data are plotted as crosses and triangles, respectively

We apply the STP using an \(n_{1}=n_{2}=1185\) split. The p-values obtained from the Split tests of hypotheses \(\text {H}_{g}\) versus \(\bar{\text {H}}_{g}\) with \(l_{g}=2\), for \(g=1,2,3,4\), are 0 (zero in double precision), \(5.22\times 10^{-49}\), \(5.70\times 10^{-13}\), and 0.21. The respective Swapped tests yield p-values 0, \(1.04\times 10^{-48}\), \(2.39\times 10^{-17}\), and 0.42. Thus, for any \(\alpha >5.70\times 10^{-13}\), the STP concludes that the event \(\left\{ g_{0}\ge 4\right\}\) occurs with a probability of at least \(1-\alpha\). Again, we compute the \(\text {AIC}_{g}\) and \(\text {BIC}_{g}\) values. For each \(g=1,2,3,4,5,6,7,8\), the \(\text {AIC}_{g}\) values are \(-14.23\), \(-16.32\), \(-16.45\), \(-16.52\), \(-16.54\), \(-16.55\), \(-16.56\), and \(-16.55\), respectively. Thus, the AIC procedure estimates \(g_{0}\) as 7. For each \(g=1,2,3,4,5,6\), the \(\text {BIC}_{g}\) values are \(-14.22\), \(-16.29\), \(-16.41\), \(-16.46\), \(-16.47\), and \(-16.46\). Thus, the BIC procedure estimates the mixture order to be 5. Compared to the ground truth, it appears that the confidence set \(\left\{ g_{0}\ge 4\right\}\) provides sensible inference regarding the underlying number of Gaussian mixture components. It would appear that the AIC and BIC procedures both overestimate the underlying mixture order. However, it could also be true that the subpopulations corresponding to each of the cell lines cannot be adequately modeled via Gaussian mixture components.

5 Extensions

5.1 A consistent sequential testing procedure

Important criteria regarding the validity of an order selection method are the large sample properties of conservativeness and consistency. These properties are defined by Leeb and Potscher (2009), in the context of this work, as

$$\begin{aligned} \lim _{n_{1},n_{2}\rightarrow \infty }\mathrm {Pr}_{f_{0}}\left( g_{0}\ge \hat{g}_{n}\right) =1 \end{aligned}$$

and

$$\begin{aligned} \lim _{n_{1},n_{2}\rightarrow \infty }\mathrm {Pr}_{f_{0}}\left( g_{0}=\hat{g}_{n}\right) =1\text {,} \end{aligned}$$

for all \(f_{0}\in {\mathcal {M}}\), respectively (see also Dickhaus, 2014, Sec. 7.1).

By Theorem 2, we have the fact that (3) holds for all n, and thus the STP, as stated in Sect. 1, cannot be conservative, nor consistent. However, if we replace \(\alpha\) by a sequence \(\left( \alpha _{n}\right) _{n=1}^{\infty }\), where \(\alpha _{n}\rightarrow 0\) as \(n_{1},n_{2}\rightarrow \infty\), then we can conclude that the modified procedure is conservative by taking the limits on both sides of inequality (3).

We now specialize our focus, again, to the parametric setting. Constructing a consistent procedure requires further modification of the STP. Namely, we additionally require that the individual tests of \(\text {H}_{g}\) are consistent (i.e., that Theorem 4 holds for the sequence \(\left( \alpha _{n}\right) _{n=1}^{\infty }\), replacing \(\alpha\) in each test). Thus, to make (14) hold with probability approaching one, we require that the third term on the left-hand side converges to zero. We observe that the sequence \(\left( \alpha _{n}\right) _{n=1}^{\infty }\) must simultaneously satisfy the conditions that \(\alpha _{n}\rightarrow 0\) and \(n_{1}^{-1}\log \alpha _{n}\rightarrow 0\), as \(n_{1},n_{2}\rightarrow \infty\). For instance, we may choose to set \(\alpha _{n}=n_{1}^{-\kappa }\), with \(\kappa >0\). We thus have the following result regarding the STP when applied using the sequence of \(p \text {-values}\) \(\left( P_{g}^{1}\left( {\mathbf {X}}_{n}\right) \right) _{g=1}^{\infty }\).
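The two rate conditions can be checked numerically for the suggested choice \(\alpha _{n}=n_{1}^{-\kappa }\), for which \(n_{1}^{-1}\log \alpha _{n}=-\kappa \log \left( n_{1}\right) /n_{1}\); the sketch below simply evaluates both quantities along a growing sequence of sample sizes.

```python
import math

# alpha_n = n1^{-kappa} satisfies both conditions:
# alpha_n -> 0 and n1^{-1} log alpha_n = -kappa * log(n1) / n1 -> 0
kappa = 1.0
ns = [10 ** k for k in range(2, 7)]           # n1 = 100, ..., 1,000,000
alphas = [n ** (-kappa) for n in ns]          # the significance levels
scaled = [math.log(a) / n for a, n in zip(alphas, ns)]  # n1^{-1} log alpha_n
```

Both sequences shrink toward zero, so the modified STP is conservative and, under the conditions of Corollary 1, consistent.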

Corollary 1

Assume (A1)–(A4) from Theorem 4, and that \(g_{0}<\infty\). If \(\alpha _{n}\rightarrow 0\) and \(n_{1}^{-1}\log \alpha _{n}\rightarrow 0\), as \(n_{1},n_{2}\rightarrow \infty\), then the STP for testing the sequence \(\left( \text {H}_{g}\right) _{g=1}^{\infty }\) is consistent, when applied using the rules \(\left( R_{g}\left( {\mathbf {X}}_{n}\right) \right) _{g=1}^{\infty }\), where \(R_{g}\left( {\mathbf {X}}_{n}\right) =\mathbf {1}\left( P_{g}^{1}\left( {\mathbf {X}}_{n}\right) <\alpha _{n}\right).\)

Proof

The proof of this result appears in the Appendix. \(\square\)

We note that the modified STP resembles the time series order selection procedure of Potscher (1983). In fact, the conditions placed on the sequence \(\left( \alpha _{n}\right) _{n=1}^{\infty }\) are the same as those imposed in Potscher (1983, Thm. 5.7). Furthermore, we note that the conditions placed on \(\left( \alpha _{n}\right) _{n=1}^{\infty }\) closely resemble the conditions that are required for the consistent application of information criteria methods; see Keribin (2000) and Baudry (2015). We can observe this resemblance by considering expression (8) and taking \(\text {BIC}_{g+l}-\text {BIC}_{g}\), for any \(g,l\in {\mathbb {N}}\). For the BIC procedure to be consistent, this expression must be negative, for large n, which requires that the difference in penalty \(n^{-1}\log n\left( \text {dim}_{g+l}-\text {dim}_{g}\right)\) goes to zero. In the STP, if we set \(n_{1}=n/2\) (or as any fraction of n) and \(\alpha _{n}=1/n\), then we have the similar requirement (of the same rate in n) that \(\left( 2n\right) ^{-1}\log n\) must go to zero.
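The claimed equality of rates can be verified directly: with \(n_{1}=n/2\) and \(\alpha _{n}=1/n\), the BIC penalty gap and the STP threshold rate differ only by a constant factor. In the sketch below, the difference in parameter counts is a hypothetical fixed constant.

```python
import math

# Compare the BIC penalty-gap rate n^{-1} log(n) * (dim_{g+l} - dim_g)
# with the STP threshold rate -n1^{-1} log alpha_n = 2 log(n) / n,
# for n1 = n/2 and alpha_n = 1/n.
delta_dim = 6  # hypothetical difference in parameter counts
ratios = []
for n in [10 ** k for k in range(2, 7)]:
    bic_gap = (math.log(n) / n) * delta_dim   # penalty difference in (8)
    stp_rate = -math.log(1.0 / n) / (n / 2.0)
    ratios.append(bic_gap / stp_rate)
```

The ratio is the constant \(\left( \text {dim}_{g+l}-\text {dim}_{g}\right) /2\) for every n, confirming that the two requirements vanish at the same rate.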

5.2 Asymptotic tests

Throughout the manuscript, we have assumed that the \(p \text {-values}\) from which tests are constructed satisfy (1) for all n. This assumption is compatible with our application of the STP using the local tests proposed in Sect. 3. We note that the STP still provides guarantees for \(p \text {-values}\) that only satisfy (1) asymptotically, in the sense that

$$\begin{aligned} \limsup _{n\rightarrow \infty }\text {Pr}_{f}\left( P_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \le \alpha \end{aligned}$$
(9)

for all \(f\in {\mathcal {M}}_{g}\). In such a case, we have the limiting version of the confidence statement (3):

$$\begin{aligned} \liminf _{n\rightarrow \infty }\text {Pr}_{f_{0}}\left( g_{0}\ge \hat{g}_{n}\right) \ge 1-\alpha \text {.} \end{aligned}$$
(10)

To obtain (10), suppose that \(f_{0}\in {\mathcal {M}}_{g_{0}}\), for some finite \(g_{0}\in {\mathbb {N}}\). In the notation of Sect. 2, we can write \({\mathbb {G}}_{0}\left( f_{0}\right) ={\mathbb {N}}\backslash \left[ g_{0}-1\right]\), and hence

$$\begin{aligned} \text {Pr}_{f_{0}}\left( f_{0}\in {\mathcal {M}}_{\hat{g}_{n}-1}\right)&=\mathrm {Pr}_{f_{0}}\left( \bigcup _{g\in {\mathbb {N}}\backslash \left[ g_{0}-1\right] }\left\{ \bar{R}_{g}\left( {\mathbf {X}}_{n}\right) =1\right\} \right) =\mathrm {Pr}_{f_{0}}\left( \bar{R}_{g_{0}}\left( {\mathbf {X}}_{n}\right) =1\right) \nonumber \\&=\mathrm {Pr}_{f_{0}}\left( \bigcap _{g\in \left[ g_{0}\right] }\left\{ P_{g}\left( {\mathbf {X}}_{n}\right) \le \alpha \right\} \right) \le \mathrm {Pr}_{f_{0}}\left( P_{g_{0}}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \text {,} \end{aligned}$$
(11)

Then, since (11) holds for all n, we can apply Rudin (1976, Thm. 3.19) to obtain

$$\begin{aligned} \limsup _{n\rightarrow \infty }\text {Pr}_{f_{0}}\left( f_{0}\in {\mathcal {M}}_{\hat{g}_{n}-1}\right) \le \limsup _{n\rightarrow \infty }\mathrm {Pr}_{f_{0}}\left( P_{g_{0}}\left( {\mathbf {X}}_{n}\right) \le \alpha \right) \le \alpha \text {,} \end{aligned}$$

as required. Using (10), we can justify the use of the STP with asymptotically valid tests, such as the procedure of Li and Chen (2010).

5.3 Aggregated tests

Under the null hypothesis that \(f_{0}\in {\mathcal {M}}_{g}\), both the Split and Swapped statistics, \(V_{g}^{k}\left( {\mathbf {X}}_{n}\right)\) and \(\bar{V}_{g}\left( {\mathbf {X}}_{n}\right)\), are examples of e-values (which we shall write generically as \(E_{g}\)), as defined in Vovk and Wang (2021) (note that these values also appear as s-values in Grunwald et al., 2020, and as betting scores in Shafer, 2021), based on the defining feature that

$$\begin{aligned} \sup _{f\in {\mathcal {M}}_{g}}\text {E}_{f}\left( E_{g}\right) \le 1\text {.} \end{aligned}$$
(12)

By Markov’s inequality, (12) implies

$$\begin{aligned} \sup _{f\in {\mathcal {M}}_{g}}\text {Pr}_{f}\left( E_{g}\ge 1/\alpha \right) \le \alpha \text {,} \end{aligned}$$

for any \(\alpha \in \left( 0,1\right)\), from which we can derive the p-value \(P_{g}=\min \left\{ 1/E_{g},1\right\}\), which satisfies (1).

As discussed in Wasserman et al. (2020), any set of possibly dependent e-values \(E_{g}^{1},\dots ,E_{g}^{m}\) (\(m\in {\mathbb {N}}\)) can be combined by simple averaging to generate a new e-value \(\bar{E}_{g}=m^{-1}\sum _{j=1}^{m}E_{g}^{j}\), which we shall call the aggregated e-value. As such, one may consider generating m different e-values based on either the Split or Swapped statistics, using different partitions of the data into subsequences \({\mathbf {X}}_{n}^{1}\) and \({\mathbf {X}}_{n}^{2}\). For any fixed \(n_{1}\) and \(n_{2}\), there are only a finite number of such partitions and thus one may imagine an aggregated e-value that averages over all such partitions. This hypothetical process was referred to as derandomization in Wasserman et al. (2020), since the resulting p-value is no longer dependent on any particular random partitioning of \({\mathbf {X}}_{n}\).
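A minimal sketch of this aggregation step, together with the conversion back to a p-value via \(P_{g}=\min \left\{ 1/\bar{E}_{g},1\right\}\); the e-values below are hypothetical stand-ins for Split or Swapped statistics computed on \(m=5\) different random partitions of the same data.

```python
import numpy as np

def aggregate_evalues(evalues):
    """Average possibly dependent e-values into a single e-value,
    then convert it to a p-value via p = min(1/E, 1)."""
    e_bar = float(np.mean(evalues))
    return e_bar, min(1.0 / e_bar, 1.0)

# hypothetical e-values from m = 5 random splits of the same data
e_bar, p = aggregate_evalues([8.0, 0.5, 12.0, 3.0, 1.5])
```

Note that the average remains a valid e-value even though the individual e-values share data, which is what licenses the derandomization over partitions.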

We further note that one can also aggregate the results from multiple instances of the Split and Swapped statistics via methods for combining p-values. Such methods are discussed at length in the work of Vovk and Wang (2020), who provide a detailed assessment of methods for combining arbitrarily dependent p-values via generalized averaging operations.

6 Conclusions

In this work, we proved that the closed testing principle can be used to construct a sequence of null hypothesis tests that generates a confidence statement regarding the true number of mixture components of a finite mixture model. Further, we derived tests for each of the null hypotheses in the STP using the universal inference framework of Wasserman et al. (2020), and proved that, in the parametric case, under regularity conditions, such tests are consistent against fixed alternative hypotheses.

The performance of the STP for order selection of normal mixture models was considered via a comprehensive simulation study. We observe from the study that the constructed confidence statements were conservative, as predicted by the theory, and we were also able to make recommendations regarding the different variants of the tests, for practical application. We also determined that the AIC and BIC point estimators provide accurate complements to the intervals provided by the STP. Example applications of the STP are further described to demonstrate the utility of our methods in practice. We recommend that our STP interval estimators be reported alongside an AIC or BIC point estimator to provide both an accurate and precise inference regarding the true order.

Extensions of the STP were also discussed, including the possibility of aggregating over multiple tests, and performing the STP with asymptotic tests. Of particular interest is our proof that the testing procedure can be modified to generate an order selection procedure that consistently determines the true number of mixture components, in the asymptotic sense. Our proof shows that such a procedure is essentially equivalent to other asymptotic model selection methods, such as the Bayesian information criterion and its variants.

We note that our general order selection confidence result of Theorem 2 applies not only to finite mixture models, but also to any nested sequences of models. For example, we may consider the same STP to generate confidence statements regarding the number of factors in a factor analysis model or the degree of a polynomial fit. We leave the application of the STP to such problems for future work, along with the applications of our discussed variants on the testing procedures.