Empirical evaluation of fully Bayesian information criteria for mixture IRT models using NUTS

This study is to evaluate the performance of fully Bayesian information criteria, namely, LOO, WAIC and WBIC in terms of the accuracy in determining the number of latent classes of a mixture IRT model while comparing it to the conventional model via non-random walk MCMC algorithms and to further compare their performance with conventional information criteria including AIC, BIC, CAIC, SABIC, and DIC. Monte Carlo simulations were carried out to evaluate these criteria under different situations. The results indicate that AIC, BIC, and their related CAIC and SABIC tend to select the simpler model and are not recommended when the actual data involve multiple latent classes. For the three fully Bayesian measures, WBIC can be used for detecting the number of latent classes for tests with at least 30 items, while WAIC and LOO are suggested to be used together with their effective number of parameters in choosing the correct number of latent classes.


Introduction
Conventional unidimensional item response theory (IRT) models assume that the observed response data stem from a homogenous population of individuals. However, under many test situations, and especially in situations where a mixture of several latent classes (i.e., subpopulations) is involved, fitting a conventional IRT model Communicated by Maomi Ueno. to the data produces biased estimates of model parameters (e.g., De Ayala et al. 2002). As a result, mixture IRT (MixIRT) models (Rost 1990) were developed to capture the presence of these latent classes that are qualitatively different but within which a conventional IRT model holds. De Ayala et al. (2002) further showed how the occurrence of differential item functioning (DIF) items can be explained by realizing that the data do not come from a homogenous population of individuals but are a mixture of multiple populations. In the MixIRT modeling framework, persons are characterized by their location on a continuous latent dimension as well as by their latent class membership. Also, each subpopulation has a unique set of item parameters.
Different estimation methods have been developed to estimate IRT models, with the current focus on the fully Bayesian estimation based on Markov chain Monte Carlo (MCMC ;Hastings 1970;Metropolis and Ulam 1949;Metropolis et al. 1953) techniques. Researchers have documented its advantages over the traditional maximum likelihood estimation (MLE ;Fisher 1922) methods in estimating various IRT models (e.g., Finch and French 2012;de la Torre et al. 2006;Wollack et al. 2002). The MLE method may result in infinite or implausible parameter estimates in situations where unusual response patterns are encountered such as perfect or zero scores. On the other hand, the fully Bayesian estimation via the use of the MCMC simulation techniques approximates the joint posterior distribution of all model parameters, and hence accounts for the uncertainty associated with any parameter estimation. However, using MCMC can be time consuming and computationally expensive. Recent developments of MCMC focus on non-random walk MCMCs such as the no-U-turn sampler (NUTS; Hoffman and Gelman 2014), which avoids the inefficient exploration of the parameter space via random walks. Consequently, NUTS converges to the posterior distribution faster than the common MCMC algorithms such as Gibbs sampling (Geman and Geman 1984) and Metropolis-Hastings (MH;Hastings 1970;Metropolis and Ulam 1949) even for complex models such as MixIRT models. Indeed, Luo and Jiao (2017) showed how NUTS was efficient in fitting various IRT models, namely the three-parameter logistic (3PL; Birnbaum 1968) IRT model, the graded response model (GRM; Samejima 1969), and the nominal response model (NRM; Bock 1972). For example, they found that around 200 iterations were sufficient to reach convergence for dichotomous IRT models (including the one-, two-, and three-parameter logistic models). Uto and Ueno (2020) also demonstrated the advantage of NUTS in estimating a complex generalized many-facet Rasch model that simultaneously incorporates three-rater characteristic parameters, including severity, consistency, and range restriction, especially for data with relatively small sample sizes.
Previous work on estimating mixture IRT models using MCMC algorithms focused mainly on implementing Gibbs sampling (e.g., Cho et al. 2013). In an effort of applying NUTS to a two-parameter MixIRT model, Al Hakmani and Sheng (2019) demonstrated the accuracy of NUTS in recovering model parameters and class membership of individual persons although the recovery of the class membership was not satisfactory for test situations where more than two classes were involved. Given the efficiency of NUTS, this study focuses on using it to evaluate three fully Bayesian information criteria, namely, the widely applicable (or Watanabe-Akaike) information criterion (WAIC; Watanabe 2010), the widely applicable Bayesian information criterion (WBIC; Watanabe 2013), and the leave-oneout cross-validation (LOO) implemented through a Pareto smoothed important sampling (LOO-PSIS; Vehtari et al. 2017), in the context of MixIRT models in terms of the accuracy in determining the number of latent classes, and further to compare the performance of these fully Bayesian information criteria with other information criteria.
In the literature, various forms of information criteria, under either the frequentist or the Bayesian framework, have been used to assess the fit of conventional IRT and MixIRT models, including the popular Akaike's information criterion (AIC; Akaike 1974), Bayesian information criterion (BIC; Schwarz 1978), and deviance information criterion (DIC; Spiegelhalter et al. 2002). Two adjusted forms of AIC and BIC, namely, consistent AIC (CAIC; Bozdogan 1987) and sample-size-adjusted BIC (SABIC; Sclove 1987), have been less commonly used in detecting the number of latent classes in the MixIRT literature.
While Vehtari et al. (2017) noted that the above-referenced criteria are not fully Bayesian and instead recommended the use of the WAIC and the LOO-PSIS indices, previous research has mainly focused on using them to evaluate their performance in fitting MixIRT models in the fully Bayesian framework. Specifically, Choi et al. (2017) investigated the performance of AIC, BIC, corrected AIC (AICc; Sugiura 1978), and SABIC in detecting the correct number of latent classes in the two-class mixture Rasch model. Their results revealed that the four information criteria performed differently under different class-distinction conditions with the overall conclusion that AICc and SABIC performed comparable to or better than AIC and BIC. Also, Lee and Beretvas (2014) evaluated the performance of AIC, BIC, CAIC, and SABIC in identifying the correct mixture Rasch model while manipulating DIF effect sizes and latent class proportions. Their findings indicated that these information criteria performed better with equal class proportions than unequal class proportions, and that for the condition of equal class proportions and small DIF effect sizes, the AIC performed better than other criteria in selecting the correct model. Alternatively, Li et al. (2009) examined the performances of AIC, BIC, and DIC in selecting the correct MixIRT model among three competing models (the mixture one-, two-and threeparameter logistic IRT models) via the use of Gibbs sampling, and found that BIC was the most effective, while AIC tended to choose more complex models in certain conditions and DIC was the least effective method. Similarly, Nylund et al. (2007) examined the performance of four information criteria including AIC, BIC, CAIC, and SABIC in detecting the number of classes for three mixture models (the latent class analysis, the factor mixture model, and the mixture growth model), and concluded that BIC and SABIC were better than AIC in identifying the correct number of classes and that this performance improved as the total sample size increased. They further noted that CAIC performed well in correctly identifying the number of classes for most conditions except for the most complicated model (i.e., the 10-item categorical latent class analysis model with a complex structure and unequal class sizes). Sen et al. (2019), while examining the performance of AIC, BIC, CAIC, and SABIC in detecting the best fitting multivariate mixture Rasch model with a varying number of classes at the student level and school level, concluded that CAIC and BIC performed better than AIC or SABIC in identifying the correct model.
The performance of fully Bayesian measures such as WAIC and LOO has been investigated in the context of unidimensional IRT models. For example, Luo and Al-Harbi (2017) compared their performances with four popular methods: the likelihood ratio test (LRT; Neyman and Pearson 1933), AIC, BIC, and DIC, for fitting dichotomous IRT models including the conventional one-, two-, and three-parameter logistic IRT models. Their results showed that WAIC and LOO performed consistently better than the other four criteria, especially for fitting the 3PL model. For polytomous IRT models [e.g., the graded response model (GRM; Samejima 1969), the rating scale model (RSM ;Andrich 1978), the partial credit model (PCM ;Masters 1982), and the generalized partial credit model (GPCM; Muraki 1992)], Luo (2019) compared the performances of WAIC and LOO with AIC, BIC, AICc, SABIC, and DIC and found that all the seven measures had relatively high statistical power in detecting the true polytomous IRT model with the frequentist-based measures (AIC, BIC, AICc, and SABIC) performing slightly better than the Bayesian ones (DIC, LOO, and WAIC). Moreover, da Silva et al. (2018) suggested to employ DIC, WAIC, and LOO over expected AIC (Brooks et al. 2002) or expected BIC (Carlin and Louis 2001) in fitting polytomous IRT models (GRM and GPCM), especially for data with small sample sizes (e.g., 50-150 persons) and short tests (e.g., 7-15 items). Nevertheless, to date, WAIC, LOO or WBIC has not been considered or evaluated for MixIRT models. It is hence necessary to explore their usage under the MixIRT framework in identifying the number of latent classes.
In view of the above, this study is to assess the performance of fully Bayesian information criteria, namely LOO, WAIC, and WBIC in terms of the accuracy in determining the number of latent classes of a MixIRT model by comparing it to the conventional IRT model. For the sake of comparisons, AIC, BIC, CAIC, SABIC, and DIC were considered in this study. The rest of the paper is organized as follows. Section 2 briefly describes a form of the MixIRT model with its prior specifications as implemented in NUTS, followed by Sect. 3 where we describe all the information criteria considered in this study. In Sect. 4, a Monte Carlo simulation study is presented to evaluate the performance of the three fully Bayesian measures in fitting a MixIRT model via NUTS while comparing them with other measures. Its results are summarized in Sect. 5 for tests with multiple latent populations and for those with a single population, with a few remarks discussed in Sect. 6.

Model and prior specifications
This study focuses on evaluating the three fully Bayesian information criteria in detecting the number of latent classes of the mixture two-parameter logistic (Mix2PL) model via implementing NUTS and further on comparing them with other information criteria. The model and its prior specifications as implemented in NUTS are briefly described in this section.

Two-parameter mixture IRT model
In the Mix2PL model, the conventional 2PL IRT model is assumed to hold for each latent class, allowing the item difficulty and discrimination parameters to differ for different classes. Moreover, each person is parameterized by a class membership parameter g and a class-specific ability parameter θ ig , whereas each item is parameterized by a different set of difficulty and discrimination parameters for each latent class. The probability of a correct (Y ij = 1) response for person i to item j in the Mix2PL IRT model is defined as where g = 1, 2,…, G is the latent class indicator; b jg and a jg denote the difficulty and discrimination parameters, respectively, for item j in the gth class; θ ig denotes the ability for person i who belongs to class g; and π g denotes the proportion of persons in each class (i.e., the mixing proportion) such that these proportions sum to one. In situations where there is only one latent class (i.e., G = 1), the Mix2PL model shown in Eq. (1) is reduced to the conventional 2PL model as defined in Eq. (2): where θ i is the ability of person i, and b j and a j are difficulty and discrimination parameters, respectively, for item j.

NUTS and prior specification
In the fully Bayesian framework, common MCMC algorithms such as Gibbs sampling (Geman and Geman 1984) and Metropolis-Hastings (MH;Hastings 1970;Metropolis and Ulam 1949) explore the posterior distribution via simple random walk proposals, and as a result, a large number of iterations are needed to sufficiently explore the parameter space. Conversely, non-random walk MCMC algorithms such as Hamiltonian Monte Carlo (HMC; Duane et al. 1987;Neal 2011) and the no-Uturn sampler (NUTS; Hoffman and Gelman 2014) avoid the inefficient exploration of the parameter space. HMC is a powerful tool, but its performance depends on choosing suitable values for the step size parameter ε and the number of leapfrog steps L (Hoffman and Gelman 2014). Tuning these parameters, and specifically L requires some expertise and preliminary runs (Hoffman and Gelman 2014;Neal 2011). NUTS is an extension of HMC that eliminates the need to specify the number of leapfrog steps parameter L. This is achieved using a criterion based on the dot product between the current momentum r and the distance between the proposal and the initial value of the model parameter , which is proportional to the progress one would make away from the starting point . This algorithm in which one runs 1 3 leapfrog steps until ( ̃ − ) •r is less than 0, however, does not ensure time reversibility and hence convergence to the correct distribution is not ensured. This issue is resolved using a recursive algorithm in which NUTS creates a set of candidate values that spans a wide path of the target distribution, stopping automatically when it starts to make a U-turn (Hoffman and Gelman 2014), at which point NUTS stops the simulation and samples from the set of values computed during the simulation.
In practice, NUTS performs as efficient as, and sometimes better than, a well-tuned HMC without requiring user interventions.
To implement NUTS for the Mix2PL model in this study, priors and hyperpriors have been specified to be similar to those used by others (e.g., Meyer 2010;Li et al. 2009) such that normal prior densities were assumed for person ability parameters θ ig ~ N(μ g , 1), with a standard normal distribution for the hyperparameters μ g , and a Dirichlet distribution for the mixing-proportion parameters (π 1 , …, π G ) ~ Dirichlet(1, …,1); a standard normal prior was considered for the class-specific difficulty parameters, and a truncated normal prior for the class-specific discrimination parameters a jg ~ N (0, ∞) (0, 1). In addition, the sum-to-one constraint on the mixing proportions was achieved by assigning the mixing proportions a unit simplex, and the problem of label switching was resolved by imposing an ordinal constraint on the mean ability (μ g ) parameters and the item difficulty parameters (b g ).

Information criteria
Information criteria are statistical measures for the comparative evaluation among candidate models, and they provide ways to assess a model based on its log-likelihood and complexity. Among the various measures considered in the IRT literature, AIC, BIC, CAIC, and SABIC are frequentist information criteria that have been developed based on the MLE method, and can be calculated as where d is the number of estimated parameters, log(L) is the natural logarithm of the likelihood function obtained from the MLE, and n is the sample size. These information criteria may provide different solutions to the same data due to differences in the penalty function applied to the likelihood. It is noted that in the literature, AIC is argued to be an inconsistent measure, and that given consistency is an asymptotic property expected from a model-selection method, Bozdogan (1987) proposed CAIC, an asymptotically consistent measure that includes a penalty for models with larger numbers of parameters using the sample size n. In addition, BIC applies a penalty term that uses both the number of parameters and the sample size, and has been found to be somewhat more accurate than AIC for selection of MixIRT models (Li et al. 2009;Preinerstorfer and Formann 2012). Sugiura (1978) introduced a sample-size-adjusted form of BIC (SABIC) that replaces n in the BIC equation with ( n + 2)∕24. Similar to BIC, the penalty for adding more parameters is represented by both the number of parameters and the sample size. The penalty term, however, is not as large as that in BIC. These frequentist measures are suitable when model parameters are estimated using the MLE estimation method. Nevertheless, they can be applied to the fully Bayesian setting via an approach described by Congdon (2003), where the MLE-based deviance value, −2log(L) , is replaced with the posterior mean of the deviance D( ) obtained using an MCMC algorithm (Congdon 2003) such as NUTS, in which denotes all model parameters. Congdon (2003) pointed out the limitation of the frequentist measures in likelihood comparisons of discrete mixture models involving varying numbers of classes while the process involved in Bayesian model selection is simpler and has advantages in comparing non-nested models. This approach has been adopted by multiple studies that implemented a random-walk MCMC algorithm (i.e., Gibbs sampler) in the mixture IRT literature either to evaluate the performance of, e.g., AIC and BIC in detecting the correct MixIRT model (e.g., Li et al. 2009;Sen et al. 2019) or to inform model selection (Cho et al. 2013;Sen et al. 2016), and consequently, is considered by this study.
In the context of Bayesian estimation, Spiegelhalter et al. (2002) proposed the deviance information criterion (DIC), a Bayesian counterpart of the AIC measure that is defined as where p DIC is the effective number of parameters defined as where D(̂ ) is the deviance obtained from the posterior estimates of the parameters. Given that D(̂ ) is based on a point estimate instead of the entire posterior distribution, DIC is not considered as a fully Bayesian measure. Vehtari et al. (2017) specifically pointed out that its effective number of parameters for a model can result in negative values, while Plummer (2008) noted to the argument in the literature between the advantages of DIC in practice and the lack of a clear theoretical foundation.
AIC and BIC can be used for assessing statistical models when such models are regular, and the likelihood function can be approximated by a normal distribution (Watanabe 2021). For statistical models with a hierarchical structure or with latent variables (i.e., singular models), where regularity is not met, two fully Bayesian information criteria, namely WAIC (Watanabe 2010) and WBIC (Watanabe 2013), were proposed as generalizations of AIC and BIC, respectively. These information criteria estimate the generalization loss (i.e., the average minus log predictive likelihood) and the Bayes free 1 3 energy (i.e., the minus logarithm of Bayes marginal likelihood), respectively (Watanabe 2021). WBIC is specifically defined as where = 1∕log(n) is the inverse temperature and E [G( )], for an arbitrary function G( ), is the expectation over the posterior distribution under the inverse temperature , defined as follows where ( ) denotes the prior probability density function of a given model parameter ⊂ Ξ . It is noted that for regular models, WBIC reduces to BIC (Watanabe 2013). Watanabe (2021) further described the mathematical foundation of WBIC and discussed its application to a normal mixture model, a common singular model.
In addition, WAIC and LOO are two fully Bayesian information criteria that estimate the predictive accuracy of the fitted model using available data, without waiting for out-of-sample data . WAIC estimates the out-of-sample expectation by first computing a log pointwise posterior predictive density (LPPD) of the data, which is defined as where p post ( ) = p( |y) is the posterior distribution of model parameters . In practice, LPPD is computed by evaluating the expectation via sampling from the posterior distribution p post ( ) such that where s = 1, 2, …, S denotes the number of simulation samples from the posterior density. After computing the LPPD, WAIC is obtained by adding a correction ( p WAIC ) for the effective number of parameters to adjust for overfitting The correction term, p WAIC , can be computed in the following two ways: The second adjustment as expressed in Eq. (15) is more computationally stable since summing the variance for each data point produces stability , and is implemented in the R package loo (Vehtari et al. 2017), which is used for computation of both WAIC and LOO. LOO, on the other hand, is tied with Bayesian cross-validation studies, where a dataset is repeatedly partitioned into a training set and a validation set. The model of interest is fitted to the training set to obtain the posterior distribution, with which the fit of the model to the validation set is evaluated. LOO is a special case of cross-validation in which one data point is left out each time and the LPPD is computed with the remaining data points as follows where p post(−i) ( ) is the posterior distribution without the ith data point, and is computed as where is is the sth simulated value from the posterior distribution conditioning on the dataset without the ith data point . To place LOO on the same scale as WAIC, the computed LPPD LOO is multiplied by − 2. According to Gelman et al. (2014), WAIC is asymptotically equal to LOO. To obtain more accurate estimate of LOO, Vehtari et al. (2017) suggested fitting a Pareto distribution to the upper tail of the distribution of the importance weights (PSIS) and argued that PSIS smoothing approach would benefit from using stable importance weights. LOO, WAIC, and WBIC are less used in practice due to the requirement of additional computational steps. In spite of the computational expense, they have advantages over simpler estimates in terms of being fully Bayesian measures that use the whole posterior distribution in contrast to point estimates such as AIC or DIC (Vehtari et al. 2017). Unlike simpler estimates, these fully Bayesian information criteria can be used for singular models with hierarchical and mixture structures or for models with different prior specifications, and hence are expected to perform better than point estimate-based information criteria in selecting or comparing singular complex models (Watanabe 2013) such as MixIRT models.

Method
Monte Carlo simulations were carried out to investigate the performance of the three fully Bayesian information criteria in recovering the number of latent classes, via fitting the Mix2PL model using NUTS and further in comparing with other information criterion measures when one or multiple latent classes exist. To ensure all Markov chains have converged to their stationary distributions, the number of warm-up iterations that should be discarded and the number of sampling iterations that should be used to estimate the posterior distribution were determined using the Gelman-Rubin R statistic (Gelman and Rubin 1992) with a threshold of 1.10 as suggested by Gelman et al. (2014), as non-convergence can result in inaccurate estimation of model parameters for MixIRT models (Jang and Cohen 2020), which in turn can affect the model-data fit.
In the MixIRT literature, the sample size, the test length, and the number of latent classes appear to affect parameter recovery of the MixIRT model. For instance, Preinerstorfer and Formann (2012) found that increasing both the sample size (500, 1000, 2500) and the number of items (10,15,25,40) led to higher accuracy in estimating parameters of the mixture Rasch model. Moreover, Li et al. (2009) found that recovery of item difficultly and discrimination parameters in different MixIRT models (i.e., mixture one-, two-, or three-parameter logistic models) differed based on the number of latent classes (1, 2, 3, 4), test lengths (6, 15, 30), and sample sizes (600, 1200). Difficulty and discrimination parameters were mostly affected by the number of latent classes such that when the number of latent classes increased, the recovery of model parameters was less accurate. Also, their results indicated that the root mean square error (RMSE) decreased as sample size and test length increased. Equal mixing proportions were considered by many studies to simulate test data involving multiple latent classes for different purposes (e.g., Bolt et al. 2001Bolt et al. , 2002Meyer 2010;Cho et al. 2013). For example, Bolt et al. (2001) Preinerstorfer and Formann (2012) found that item parameters were recovered more precisely in the condition of equally sized subgroups (i.e., π = 0.50, 0.50).
In the Monte Carlo simulations, two sets of test conditions were considered, with the first set treating the two-class Mix2PL model as the true model whereas the second set treating the conventional 2PL model as true. Simulation conditions were considered based on prior studies to reflect some practical considerations. Specifically, with binary item response data generated from each set of conditions for sample sizes (n = 500 and 1000) and test lengths (J = 15, 20, 30), NUTS was implemented to fit three candidate models, namely the 2PL model, the two-class Mix2PL model, and the three-class Mix2PL model (i.e., G = 1, 2 or 3). For the first set of test conditions, data were generated using the two-class Mix2PL model, as defined in Eq. (1), with equal mixing proportions (i.e., π 1 = 0.50 and π 2 = 0.50). The model parameters were generated such that the person ability parameters were generated from a mixture of two subpopulations where θ 1 ~ N(− 2, 1) and θ 2 ~ N(2, 1); the class-specific item difficulty parameters were generated from a uniform distribution where b 1 ~ U(− 2, 0) and b 2 ~ U(0, 2); and the class-specific item discrimination parameters were generated from a uniform distribution where a g ~ U(0, 2), g = 1 or 2. For the second set of test conditions, data were generated using the 2PL model, as defined in Eq. (2). The model parameters were generated such that the person ability parameters were generated from a standard normal distribution θ ~ N(0, 1); the item difficulty parameters were generated from a uniform distribution where b ~ U(− 2, 2); and the item discrimination parameters were generated from a uniform distribution where a ~ U(0, 2).
To assess the recovery of the number of latent classes for each data set, the three fitted models were compared using the three fully Bayesian information criteria, namely WAIC, LOO, and WBIC, as well as other measures including AIC, BIC, CAIC, SABIC, and DIC. After computing the respective information criterion measure for each of the three candidate models, the model with the smallest value was selected as the best fitting model. With 50 replications, the proportion of the time the generating model was selected as the best fitting model indicates the accuracy of recovering the number of latent classes by each information criterion. In addition, the values of the eight information criteria were further averaged across replications to provide summary information. To increase the efficiency of computations, the simulations were carried out on a high-performance computing cluster that includes a total of 16,016 cores across 572 nodes and 2.2 PB of storage.
RStan (Stan Development Team 2020) was used to implement NUTS to fit the IRT models, and the Stan code for computing the posterior distribution under the inverse temperature = 1∕log(n) is provided in Appendix 1. This code was used for fitting the conventional 2PL model and two-and three-class Mix2PL models before computing WBIC using R. The Stan code for computing the standard posterior distribution when the inverse temperature = 1 is presented in Appendix 2. This code is similar to that in Appendix 1 except that the log-likelihood is multiplied by 1, and it was used for fitting the same models before computing the other information criteria, AIC, BIC, CAIC, SABIC, DIC, LOO, and WAIC in R.

Results
With four Markov chains, 1000-4000 warm-up iterations and 4000-12,000 sampling iterations were used to ensure the convergence of each NUTS implementation, and specifically that the Gelman-Rubin R statistic was less than the recommended threshold of 1.10 for each model parameter in each simulated condition.

Results for tests with subpopulations
The results for the first six conditions where data conformed to the two-class Mix2PL model are summarized in Tables 1, 2, 3 and 4 with Tables 1 and 2     DIC, tended to choose either the correct two-class model or a more complicated three-class model. Among the three information criteria, DIC performed the best, followed by WAIC, and then LOO. • Performance of DIC or WAIC seemed to be affected by sample size, as larger sample sizes tended to result in a higher likelihood of selecting the correct model; for example, with J = 15, DIC (WAIC) selected the correct two-class Mix2PL model 76% (68%) of the time when n = 500 vs. 84% (72%) of the time when n = 1000. This is, however, not observed with other information criteria. • WBIC performed poorly when J < 30 as it consistently selected the more complicated model; when J = 30, it was able to identify the correct model more than 95% of the time regardless of sample size.
These findings are consistent to those from Tables 3 and 4, where we see that after averaging, AIC, BIC, CAIC, and SABIC were consistently smaller for the simpler 2PL model. DIC, on the other hand, consistently selected the correct two-class Mix2PL model. Among the three fully Bayesian information criteria, LOO tended to favor the more complex three-class model. It is, however, noted that the threeclass solution did not differ much from the two-class solution as the difference of the average values of LOO between the two-class Mix2PL model (correct model) and the three-class Mix2PL model was less than 1.0 when n = 1000 or J > 15. As a matter of fact, average LOO and WAIC values for fitting the two-class and three-class models were almost identical for these test length and sample size conditions. In addition, the average WAIC was able to identify the correct two-class model across all the simulated conditions except when n = 500 and J = 30. The average WBIC, on the other hand, performed better when test length increased as it selected the correct model only when J = 30 and in other conditions, WBIC preferred the more complicated three-class model. It is further noted that the average effective number of parameters for WAIC, LOO or DIC was the smallest for fitting the correct two-class Mix2PL model. Furthermore, as suggested by Gelman et al. (2014), when deciding on the best fitting model, the effective number of parameters associated with Bayesian information criteria should also be taken into account, especially when the differences between the values of these measures for the candidate models are small, such that the simpler model is preferred over the more complex one. This is particularly true for LOO and WAIC, as the average effective number of parameters for the two measures as presented in Tables 3 and 4 indicates that the two-class Mix2PL model was the least complex and shall be preferred although LOO tended to favor the more complex three-class Mix2PL model across all the simulated conditions except when n = 500 and J = 15 and WAIC tended to favor the more complex three-class Mix2PL model when n = 500 and J = 30. For example, in conditions where n = 500 and J = 30, the average effective number of parameters for the two-class Mix2PL model (p LOO = 422.834, p WAIC = 418.842) was relatively smaller compared to the three-class Mix2PL model (p LOO = 424.094, p WAIC = 420.508) or the 2PL IRT model (p LOO = 472.259, p WAIC = 467.023) although LOO or WAIC was slightly larger for the two-class model (LOO = 14,856.803,WAIC = 14,848.819) as compared to the three-class model (LOO = 14,855.829,WAIC = 14,848.657). Hence, we recommend that when using LOO or WAIC for comparing candidate models in practice, one shall also consider their effective number of parameters, especially when the difference between these values for candidate models is small. It is also noted that the conventional 2PL model (i.e., the one-class Mix2PL model), with relatively larger average values, was never preferred by any of the Bayesian information criteria when data involved multiple subpopulations.

Results for tests with a single population
The results for the second six conditions where data conformed to the conventional 2PL IRT model are summarized in Tables 5, 6, 7 and 8 with Tables 5 and 6 display the number and proportion of the time each model was identified as the best fitting model according to each information criterion, and Tables 7 and 8 show the average information criteria and effective number of parameters averaged across 50 replications, where the numbers in bold in Tables 5 and 6 are for the model with the highest frequency/proportion of being chosen as the best-fitting model, and those in Tables 7 and 8 are the smallest average values among the three candidate models. A close examination of Tables 5 and 6 indicate the following: • The four frequentist information criteria, namely AIC, BIC, CAIC, and SABIC, consistently selected the correct 2PL model regardless of sample size or test length.  82) 50 (1.00) 50 (1.00) 50 (1.00) 50 (1.00) 2C-Mix2PL 12 (0.24) 6 (0.12) 2 (0.04) 5 (0.10) 0 (0.00) 0 (0.00) 0 (0.00) 0 (0.00) • DIC identified the correct 2PL model fairly well especially for J ≥ 20, where its accuracy ranges from 94 to 100%. • Among the three fully Bayesian information criteria, with a few replications favoring the two-or three-class MixIRT model, LOO and WAIC tended to go with more complex models with LOO being more toward this tendency; WBIC tended to favor the correct 2PL model more frequently and hence is more preferred than WAIC or LOO.
These findings are consistent with results presented in Tables 7 and 8. Specifically, after averaging, LOO, WAIC, WBIC, AIC, BIC, CAIC, and SABIC consistently favored the correct 2PL model regardless of sample size or test length. Test length may have an effect on the average DIC, as it tended to favor a more complex MixIRT model when J = 15. It is noted that some of the average LOO values for the two-class solution did not differ much from the one-class solution as the difference of the average values of LOO between the 2PL model (correct model) and the two-class Mix2PL model was only 0.18 for n = 500 and J = 15. In effect, the Bayesian approach resulted in almost identical average information (especially LOO and WAIC) in fitting the correct 2PL model and the more complicated MixIRT models regardless of sample size or test length. Moreover, the average effective number of parameters for LOO or WAIC is consistently the smallest for the correct 2PL model.
The average effective number of parameters associated with both LOO and WAIC, as presented in Tables 7 and 8, indicates that the 2PL model was the least complex across all the conditions. For example, in conditions where n = 1000 and J = 15, the average effective number of parameters for the 2PL IRT model (p LOO = 722.034, p WAIC = 711.098) was relatively smaller compared to that of the    192.618 460.524 16,185.660 457.045 18,038.848 16,172.970 473.453 15,945.517 16,463.914 16,586.914 16,193.921 462.230 16,187.311 458.926 18,058.034 16,176.763 477.171 16,069.592 16,849.294 17,034.294 16,262.093 . Therefore, following our recommendation for tests involving multiple latent subpopulations, we again see the potential benefit of considering the effective number of parameters when LOO or WAIC is used in comparing candidate models where a single latent population is assumed.

Discussion and conclusion
This study focuses on the performance of fully Bayesian information criteria, namely, LOO, WAIC, and WBIC, in terms of the accuracy in determining the number of latent classes of the Mix2PL model while comparing it to the conventional 2PL model and further in terms of comparison with other information criteria including AIC, BIC, CAIC, SABIC, and DIC. Regarding the accuracy in determining the number of latent classes, for the condition where data conformed to the two-class Mix2PL model, the results indicate that among the three fully Bayesian information criteria, WBIC shall only be used when tests consist of at least 30 items; WAIC performed slightly better than LOO in recovering the number of latent classes, although the proportion of the time the correct model was selected as the best fitting model, for both measures, decreased compared to the situation where the generated model was the conventional 2PL model. When considering other information criteria, it is found that the frequentist information criteria all failed to identify the correct model as they consistently favored the simpler 2PL IRT model. DIC, on the other hand, has outperformed both the frequentist and fully-Bayesian information criteria. On the other hand, for the condition where data conformed to the conventional 2PL model, the results indicate that among the three fully Bayesian information criteria, WBIC performed better than WAIC or LOO alone in recovering the number of latent classes in terms of the proportion of the time the correct model was selected as the best fitting model. In addition, when considering other information criteria, it is found that all the frequentist information criteria consistently favored the correct 2PL model and the Bayesian information criterion DIC performed similarly if not better than WBIC. Summarizing across the two simulated test conditions, we make the following conclusions: 1. AIC, BIC, and their related CAIC and SABIC tend to select the simpler model and are not recommended when the actual data involve multiple latent classes. While this finding differs from those from previous studies evaluating their performances via other fully Bayesian estimation algorithms such as Gibbs sampling, it can be explained by citing Watanabe's (2021) argument that if AIC and BIC are used in model selection, the best model with the smallest values of both the generalization loss and the free energy are not chosen, rather tighter (or smaller) models are selected. 2. DIC alone performs equally well and sometimes better than some of the frequentist or fully-Bayesian criterion measures especially when sample size is relatively large (with e.g., 1000 or more subjects). This finding, however, differs from those from previous studies such as Li et al. (2009) and needs further validation.
3. For the three fully Bayesian information criteria, WBIC can be used for detecting the number of latent classes for tests with at least 30 items; WAIC and LOO alone are not suggested for fitting the MixIRT models, and they shall be used together with their effective number of parameters in choosing the correct number of latent classes.
It is noted the fully Bayesian information criteria alone can sometimes favor a more complicated model such as the two-or three-class Mix2PL model when data involve one or two latent subpopulations, respectively. This can be explained by what Watanabe (2021) noted: for a normal mixture model such as MixIRT, neither the generalization loss estimated by WAIC (or LOO) nor the free energy estimated by WBIC increases "even if the statistical model is redundant" (p. 18) as the case with regular statistical models; thus, WAIC (and similarly LOO) or WBIC does not increase either and hence the more complicated (or redundant) model may be selected more often.
In addition, for test conditions where data conformed to the two-class Mix2PL model, it is noted that although in the simulation study, LOO, WAIC, and WBIC sometimes selected the more complex three-class Mix2PL model as the best fitting model, the average proportion of persons (across the 50 replications) for one of the three classes was relatively low (see Table 9). For example, for test situations with 15 items, the average proportion of persons in one of the three classes was 0.14 with a sample size of 500 and 0.12 with a sample size of 1000. Same observation was made for the condition where data conformed to the conventional 2PL model, where LOO, WAIC, and WBIC sometimes selected the more complex and yet incorrect two-class Mix2PL model as the best fitting model. The average proportion of persons (across the 50 replications) for each class is summarized in Table 10 and suggests that the average proportion for one of the two classes was relatively very low. For example, for tests with 20 items, the average proportion of persons in one of the two classes of the Mix2PL model was 0.03 for both sample sizes, 500 and 1000. These results, in general, indicate that when a more complex while incorrect model was selected by one or more of the fully Bayesian criteria without considering the effective number of parameters, the proportion of persons in one of the classes can be relatively low. Given the results displayed in Tables 9 and 10, caution should be taken when fully Bayesian criteria such as LOO, WAIC, or WBIC select a more complicated model as the best fitting model. A further step should be taken to understand the nature of the estimated mixing proportions of the complex model favored by such information criteria.
Our findings that LOO tends to favor a more complex model and is less preferred than WAIC for fitting MixIRT models, however, differ from those of Luo and Al-Harbi (2017), who compared WAIC and LOO for conventional IRT models and concluded that WAIC had a slightly lower detection rate than LOO (although the difference is negligible) in the condition where the generating model was the conventional one-parameter IRT model. Similarly, these results differ from those of Luo (2019) who compared WAIC and LOO with the DIC, AIC, BIC, AICc, SABIC for polytomous IRT models, where the results indicated the detection rate of WAIC (0.935) was slightly lower than that of LOO (0.946). This difference may be a result of the models considered; namely, LOO tends to work well for unidimensional models with varying number of parameters whereas WAIC tends to work well for models with more complex latent structure. This certainly needs to be confirmed with additional studies.
Further, our result where the frequentist information criteria consistently favored the simpler IRT model when the test involved multiple latent subpopulations could be directly tied with how these indices were computed in the fully Bayesian framework. In this study, we followed Congdon (2003) to replace the MLE-based deviance, −2log(L) , with the posterior mean of the deviance as it has been commonly adopted in studies on MixIRT models. While this result differed from previous studies that warrant additional investigations, it suggests potential limitations of using the posterior mean of the deviance as suggested by Congdon (2003) for fully Bayesian estimation especially with non-random walk MCMC algorithms, as it may not correspond to the maximized likelihood, and hence calls for alternative approaches. Additional research should be carried out to further evaluate this.
Given the computational expense, we focused on the two-parameter models in this study to consider a few sample-size and test-length conditions assuming equal mixing proportions for test situations where data involve one or two subpopulations. Future research can consider other test conditions or those with more than two latent classes. Additional studies can also investigate the performance of these fully