Item factor analysis (IFA; Bock, Gibbons, & Muraki, 1988) is a statistical technique that aims to explain the dependency among item-level data by introducing latent factors. Usually, the item-level data are binary or, more generally, ordered categorical. Ability tests with yes–no questions provide a simple example. A generalization of this is the Likert-type response scale used in attitude and personality questionnaires (strongly disagree, disagree, neither, agree, and strongly agree). If modeling category response probabilities is the main purpose, IFA can also be regarded as a methodology of item response theory (IRT; for an equivalence, see Takane & de Leeuw, 1987). In practice, IFA is often used to evaluate the underlying factor structure of a psychological measurement. This structure provides a basis for latent trait estimation and dimension reduction, which are useful in statistical tasks such as regression, classification, and clustering.

Item parameter estimation is a crucial step in IFA, which can be achieved by frequentist or Bayesian estimation (see Chen, Li, Liu, & Ying, 2021; Wirth & Edwards, 2007, for reviews). From the frequentist view, marginal maximum likelihood (MML; Bock & Lieberman, 1970) seems to be the gold standard because of its consistency, asymptotic efficiency, and asymptotic normality. MML finds an estimate that maximizes the so-called marginal likelihood. To evaluate the marginal likelihood during optimization, MML must integrate over the M latent factors. If the number of latent factors is small (e.g., M < 5), the integral can be efficiently computed by Gauss–Hermite quadrature (e.g., Bock & Aitkin, 1981; Bock & Lieberman, 1970; Gibbons & Hedeker, 1992) or by its adaptive variation (e.g., Schilling & Bock, 2005). Otherwise, stochastic algorithms are used. Well-known stochastic algorithms for IFA include Monte Carlo expectation maximization (MCEM; e.g., Meng & Schilling, 1996; Song & Lee, 2005), stochastic expectation maximization (StEM; including an improved version, see Zhang, Chen, & Liu, 2020), and Metropolis–Hastings Robbins–Monro (MHRM; e.g., Cai, 2010a, 2010b) algorithms. Some of these implementations are available in mainstream IFA programs such as IRTPRO (Cai, Du Toit, & Thissen, 2011), flexMIRT (Cai, 2017), Mplus (Muthén & Muthén, 1998–2017), and mirt (Chalmers, 2012).

Nevertheless, fitting a high-dimensional IFA model by MML is still a time-consuming task. For example, MML might take about 36 min to fit an IFA model with ten factors to a data set with 2500 observations and 300 items (see Chen, Li, & Zhang, 2019). Note that 36 min is for just a single analysis. In practice, researchers may try many IFA models under different data conditions in search of an optimal result, which multiplies the computational cost. The computational burden of MML thus precludes researchers from exploring data-model fit thoroughly. Even a relatively small IFA model becomes very time-consuming when resampling methods are implemented for robust statistical inference (e.g., Liou & Yu, 1991; Patton, Cheng, Yuan, & Diao, 2014).

To make up for the deficiency of MML, psychometricians have developed less computationally intensive procedures, including variational inference (VI) methods (Cho, Wang, Zhang, & Xu, 2020; Hui, Warton, Ormerod, Haapaniemi, & Taskinen, 2017; Wu, Davis, Domingue, Piech, & Goodman, 2020), constrained joint maximum likelihood (CJML) (Chen et al., 2019; Chen, Li, & Zhang, 2020), and the importance-weighted autoencoder (IWAE) method (Urban & Bauer, 2021). However, these alternative methods are still restrictive in several ways. (1) The consistency of CJML and VI requires a large number of items for each latent trait (Chen et al., 2019; Cho et al., 2020). Most empirical settings, however, use fewer than six items to measure a latent trait (see Jackson, Gillaspy Jr, & Purc-Stephenson, 2009, for a review). (2) Both the theoretical properties and the empirical performance of CJML and IWAE rely on tuning parameters, whose values are usually determined by cross-validation; this complicates the implementation of these methods. (3) The existing findings on statistical inference for MML (for reviews, see Swaminathan, Hambleton, & Rogers, 2006; Yuan, Cheng, & Patton, 2014) cannot be directly applied to these alternatives. For these reasons, we believe that most IFA users will still prefer MML unless its implementation is totally infeasible.

If we stay with MML, we can still search for computationally more feasible implementations. The aim of the current study is to invoke help from the GPU (graphics processing unit) together with carefully designed vectorization to handle large-scale IFA applications. The GPU is an electronic circuit that assists the CPU (central processing unit) in handling computer graphics. After the release of CUDA (NVIDIA, Vingelmann, & Fitzek, 2020), the GPU gradually entered scientific computing because of its parallelization capability (Keckler, Dally, Khailany, Garland, & Glasco, 2011; Nickolls & Dally, 2010).

We believe that parallelization with GPU has two advantages. First, GPU machines are readily available today. Some online coding platforms such as Colab and Kaggle even provide free GPU computing resources. In contrast, powerful multi-core CPU workstations are difficult for end users to access. Second, it is not necessary to write low-level source code for GPU computing. Today, GPU computing is supported by many user-friendly deep learning libraries such as TensorFlow (Abadi et al., 2015), PyTorch (Paszke et al., 2019), and Jax (Bradbury et al., 2018). These libraries provide highly optimized functions with both CPU and GPU backends, chosen depending on the availability of a GPU. Therefore, programmers are only required to vectorize most operations using these functions.
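As a toy illustration of this point (our own example, not taken from the paper or any library tutorial), the following Jax snippet contains no device-specific code: the XLA-compiled function is dispatched to a GPU whenever one is available and falls back to the CPU otherwise.

    import jax
    import jax.numpy as jnp

    @jax.jit                           # compiled by XLA for the available backend
    def affine_tanh(x, y):
        return jnp.tanh(x @ y + 1.0)   # vectorized; no explicit loops

    x = jnp.ones((4096, 4096))
    y = jnp.ones((4096, 4096))

    print(jax.devices())               # e.g., a CPU device, or a CUDA device if a GPU is present
    z = affine_tanh(x, y)              # dispatched to the GPU automatically when available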

Note that accelerating psychometric modeling by parallelization is not new. Several works have considered this issue on either GPU or CPU (e.g., von Davier, 2017; Loossens et al., 2021; Sheng, Welling, & Zhu, 2014, 2015; Verdonck, Meers, & Tuerlinckx, 2016). However, none of them studied GPU computing for high-dimensional IFA. The unique contribution of the present work is to introduce a Python package called xifa (accelerated item factor analysis), which implements a vectorized MHRM (VMHRM) algorithm for a wide class of high-dimensional IFA models. The vectorized algorithm can be greatly sped up on a GPU. In addition, the present work establishes benchmarks for xifa by empirically comparing it with some popular or state-of-the-art implementations, including Bock–Aitkin EM (BAEM) (Bock & Aitkin, 1981), MHRM, and IWAE. As we shall see in our simulations, VMHRM on GPU can run 33 times faster than its CPU version. We believe this progress is a breakthrough.

The article is organized as follows: First, an IFA framework and the steps of MHRM are presented. Second, we introduce VMHRM and demonstrate how to use xifa; the technical details of VMHRM can be found in Appendices A and B. Third, numerical experiments comparing the algorithms are reported. Fourth, a real data example illustrates the power of our approach. Finally, the merits and limitations of the current study are discussed.

Item factor analysis and marginal likelihood

A framework for item factor analysis

Let \(v = (v_{1}, v_{2}, \ldots, v_{I})\) denote an I-dimensional response vector of I polytomous items. For each item \(v_{i}\), \(C_{i}\) denotes the number of response categories; that is, \(v_{i}\) takes on a value in \(\{0, 1, \ldots, C_{i} - 1\}\). The value of \(v_{i}\) is regarded as satisfying an ordinal scale (Stevens, 1946). To make things a bit simpler, we assume \(C_{i} = C\) for now and return to the general case later. IFA characterizes the response probability of \(v_{i}\) as a function of an M-dimensional latent factor vector, say \(\eta = (\eta_{1}, \eta_{2}, \ldots, \eta_{M})\). This probability is expressed as:

$$ \pi_{i}(\eta) = (\pi_{i0}(\eta), \pi_{i1}(\eta),...,\pi_{i(C-1)}(\eta)), $$
(1)

where \(\pi_{ic}(\eta)\) is the conditional probability of the event \(v_{i} = c\) given the trait level η, i.e., \(\pi_{ic}(\eta) = \Pr(v_{i} = c|\eta)\). Note that η is a latent variable that cannot be directly observed. The latent factor is often assumed to be normally distributed with zero mean, i.e., Pr(η) = Normal(0,Φ). In particular, the covariance matrix Φ is set to be standardized, with \(\phi _{mm^{\prime }}\) being the correlation of \(\eta_{m}\) and \(\eta _{m^{\prime }}\).

The exact functional form of \(\pi_{i}(\eta)\) depends on the IFA model class. For example, the graded response model (GRM; Samejima, 1969) assumes that:

$$ \pi_{ic}(\eta) = \frac{1}{1+\exp\left( -\nu_{ic} - {\lambda_{i}^{T}} \eta\right)} - \frac{1}{1+\exp\left( -\nu_{i(c+1)} - {\lambda_{i}^{T}} \eta\right)} , $$
(2)

where \(\nu_{ic}\) is the intercept for the cth response category of item i such that \(-\infty =\nu _{iC} < \nu _{i(C-1)}<...<\nu _{i0} = \infty \), and \(\lambda_{i} = (\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{iM})\) is an M-dimensional loading vector for item i. Another well-known example is the generalized partial credit model (GPCM; Muraki, 1992), which assumes:

$$ \pi_{ic}(\eta) = \frac{\exp {\sum}_{j=0}^{c} \left( \nu_{ij} + {\lambda_{i}^{T}} \eta\right) }{ {\sum}_{k=0}^{C - 1} \exp {\sum}_{j=0}^{k} \left( \nu_{ij} + {\lambda_{i}^{T}} \eta\right) } , $$
(3)

where \(\nu _{i0} + {\lambda _{i}^{T}} \eta \) is defined as zero. Both GRM and GPCM use linear predictors of the form:

$$ \tau_{ic} = \nu_{ic} + {\lambda_{i}^{T}} \eta. $$
(4)

Hence, GRM and GPCM can be formulated as generalized linear multivariate models (Fahrmeir & Tutz, 1994) if η can be directly observed. For more IFA model classes, refer to van der Linden (2016).
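To make Eqs. 2 and 3 concrete, the following NumPy sketch (our own illustration, not xifa's internal code) evaluates the GRM and GPCM category probabilities for a single item; the intercept conventions follow the definitions above.

    import numpy as np
    from scipy.special import expit          # logistic function 1 / (1 + exp(-x))

    def grm_probs(nu, lam, eta):
        """GRM (Eq. 2): nu has C + 1 entries with nu[0] = +inf > ... > nu[C] = -inf."""
        cum = expit(nu + lam @ eta)          # Eq. 4 predictors, then Pr(v >= c | eta)
        return cum[:-1] - cum[1:]            # pi_c = Pr(v >= c) - Pr(v >= c + 1)

    def gpcm_probs(nu, lam, eta):
        """GPCM (Eq. 3): nu has C entries; nu[0] + lam @ eta is defined as zero."""
        tau = nu + lam @ eta
        tau[0] = 0.0                         # convention stated below Eq. 3
        z = np.cumsum(tau)                   # partial sums over j = 0, ..., c
        z -= z.max()                         # stabilize before exponentiation
        return np.exp(z) / np.exp(z).sum()

    eta = np.array([0.3, -0.2])                      # M = 2 factors
    lam = np.array([1.5, 0.8])
    nu_grm = np.array([np.inf, 1.0, -0.5, -np.inf])  # C = 3 categories
    nu_gpcm = np.array([0.0, 1.0, -0.5])
    print(grm_probs(nu_grm, lam, eta).sum())         # 1.0
    print(gpcm_probs(nu_gpcm, lam, eta).sum())       # 1.0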

Marginal likelihood via MHRM

Consider a random sample \(V = (v_{n})_{n=1}^{N}\) of size N. Let \(\theta\) denote the parameter vector containing all freely estimated model parameters, including the intercepts (\(\nu_{ic}\)), the loadings (\(\lambda_{im}\)), and the correlations among factors (\(\phi _{mm^{\prime }}\)). To estimate \(\theta\), MML adopts the following log-marginal likelihood function:

$$ \ell(\theta; V) = \frac{1}{N} \sum\limits_{n=1}^{N} \ell(\theta; v_{n}), $$
(5)

where

$$ \ell(\theta; v) = \log \left[ \int \text{Pr}(v| \eta; \theta) \text{Pr}(\eta; \theta) \text{d} \eta \right] . $$
(6)

Any maximizer \(\widehat {\theta }\) of \(\ell(\theta; V)\) is called an MML estimate of \(\theta\). When M ≥ 5, the integral is generally evaluated by Monte Carlo methods, among which MHRM is one of the most widely used.

Let \(\mathrm {H} = (\eta _{n})_{n=1}^{N}\) denote the N × M array with \(\eta_{n}\) as the true factor level corresponding to \(v_{n}\). MHRM can be understood as an expectation maximization (EM; Dempster, Laird, & Rubin, 1977) algorithm that augments the latent factors into the so-called complete-data likelihood, denoted by \(\ell(\theta; V, \mathrm{H})\), and then maximizes this likelihood to obtain an MML estimate. In particular, MHRM uses the Metropolis–Hastings (MH) method to sample latent factors from their posterior distributions and then implements a Robbins–Monro (RM) step to update the current parameter estimate. An implementation of MHRM is presented in Algorithm 1 (for details, see Cai, 2010a).

Algorithm 1 Metropolis–Hastings Robbins–Monro (MHRM) algorithm.
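The following is a deliberately simplified one-iteration sketch of Algorithm 1 (our own paraphrase: it replaces Cai's (2010a) Robbins–Monro approximation of the information matrix with a plain stochastic-gradient update, and all function arguments are hypothetical).

    import numpy as np

    def mhrm_step(theta, eta, V, t, joint_loglik, grad_complete, jump_sd=1.0):
        """One simplified MHRM iteration.

        theta         : current parameter estimate.
        eta           : (N, M) array of current latent-factor draws.
        V             : (N, I) array of observed responses.
        t             : iteration counter, used in the gain gamma_t = 1 / t.
        joint_loglik  : (theta, eta, V) -> (N,) values of log Pr(v_n, eta_n; theta).
        grad_complete : gradient of the complete-data log-likelihood in theta.
        """
        # (1) MH step: random-walk proposal for each person's latent factor;
        #     the ratio of joint densities equals the ratio of posteriors.
        prop = eta + jump_sd * np.random.randn(*eta.shape)
        log_ratio = joint_loglik(theta, prop, V) - joint_loglik(theta, eta, V)
        accept = np.log(np.random.rand(eta.shape[0])) < log_ratio
        eta = np.where(accept[:, None], prop, eta)
        # (2) RM step: move theta along the stochastic gradient with a
        #     decreasing gain so that the sampling noise averages out.
        theta = theta + (1.0 / t) * grad_complete(theta, eta, V)
        return theta, eta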

Vectorization on GPU and xifa

GPU and vectorized MHRM

GPU was originally designed for graphics processing. Today, it also serves as a tool for general-purpose scientific computing. While a single CPU core still runs much faster than a single GPU thread, the advantage of GPU computing lies in its high capacity for parallelization. In fact, most modern deep learning implementations rely heavily on GPU (see Chapter 12 in Goodfellow, Bengio, & Courville, 2016). As Nickolls and Dally (2010, p. 59) put it, "Today's GPUs use hundreds of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems having substantial inherent parallelism." For example, the NVIDIA GeForce RTX 2080 Ti, the local GPU used in our study, possesses 68 streaming multiprocessors (SMs), each containing 64 CUDA cores. Hence, a total of 4352 cores can be used for floating-point operations.

Given the power of GPU computing, we propose a vectorized version of the MHRM algorithm for IFA, called vectorized MHRM (VMHRM); its details can be found in Appendix A. The principle is to stack data into higher-order arrays and then use vectorized operations during computation. Besides VMHRM itself, several practical considerations are presented in Appendix B, including handling missing data, dealing with different numbers of categories, imposing simple parameter constraints, tuning the jumping standard deviation, avoiding non-positive definite correlation matrices, and evaluating the log-marginal likelihood.
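The vectorization principle can be illustrated with jax.vmap (again a sketch under our own conventions, not xifa's actual internals): a single-item, single-person category-probability function is batched over items and then over persons, so that all N × I × C probabilities are computed by one parallel operation instead of nested Python loops.

    import jax
    import jax.numpy as jnp

    def grm_probs(nu, lam, eta):                 # one item, one person (cf. Eq. 2)
        cum = jax.nn.sigmoid(nu + lam @ eta)
        return cum[:-1] - cum[1:]

    # Batch over items (axis 0 of nu and lam), then over persons (axis 0 of eta).
    probs_by_item = jax.vmap(grm_probs, in_axes=(0, 0, None))              # -> (I, C)
    probs_all = jax.jit(jax.vmap(probs_by_item, in_axes=(None, None, 0)))  # -> (N, I, C)

    N, I, M = 1000, 20, 3
    key1, key2 = jax.random.PRNGKey(1), jax.random.PRNGKey(2)
    nu = jnp.tile(jnp.array([jnp.inf, 1.0, 0.4, -0.4, -1.0, -jnp.inf]), (I, 1))
    lam = jax.random.uniform(key1, (I, M))
    eta = jax.random.normal(key2, (N, M))
    print(probs_all(nu, lam, eta).shape)         # (1000, 20, 5), one batched GPU computation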

Python package xifa

xifa is a Python package for accelerated item factor analysis. It is built on Jax (Bradbury et al., 2018), which uses the XLA (accelerated linear algebra) compiler for efficient array computation on GPUs. In addition, Jax provides a complete treatment of automatic differentiation, allowing us to easily modularize our code for different IFA models. xifa supports IFA under the GRM (Samejima, 1969) and the GPCM (Muraki, 1992). The analysis can be either exploratory or confirmatory. Moreover, xifa is able to handle missing responses and items with unequal numbers of categories.

The design of xifa is highly motivated by scikit-learn (Pedregosa et al., 2011), a well-known machine learning library in Python. An example of xifa syntax for fitting a GRM is:


>>> from xifa import GRM               # import GRM from xifa
>>> grm = GRM(data=data, n_factors=5)  # create a GRM object
>>> grm.fit()                          # fit model to data
>>> grm.params["loading"]              # extract loading estimates

Here, data and n_factors specify the data for analysis and the number of factors for exploratory IFA. Note that in this example the data set, also named data, has already been prepared.

A complete tutorial with details for analyzing big-five personality data is available on GitHub.Footnote 1 The online material is a Jupyter notebook that can be run interactively on Colab. Hence, here we mention only several key points about using xifa. First, the data set must be an N × I NumPy array with integers coded from 0 to C − 1, where C is the maximal number of ordered categories; missing values must be represented by NumPy's nan. Second, xifa allows users to flexibly change the hyperparameters of VMHRM. However, our simulations showed that the default setting generally performed well (see the next section), so in most cases users can simply rely on it. During the fitting process, xifa prints the acceptance rate of the MH samples and the value of the minus complete-data likelihood at each step. This information is useful for monitoring the convergence of VMHRM. If the algorithm does not converge, one might (1) try more steps for warmup and RM updating, or (2) change the hyperparameters for MH sampling (e.g., the jumping variance or the number of chains). Third, when N, I, and M are all large (e.g., N ≥ 50,000, I ≥ 200, M ≥ 20), we may encounter GPU out-of-memory errors. An effective way to handle such an error is to use the mini-batch approach described in Appendix B. To determine a good batch size, try batch_size = round(N/L) with L = 2, 3, 4, ... and use the smallest L such that no out-of-memory error arises.
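Putting these points together, a hypothetical end-to-end call might look as follows; the file name is made up, and we assume here that batch_size is passed to the model constructor (its exact placement may differ, so consult the GitHub tutorial).

    import numpy as np
    from xifa import GRM

    data = np.loadtxt("responses.csv", delimiter=",")  # hypothetical N x I file
    data[data < 0] = np.nan         # responses must be 0..C-1; missing as np.nan

    N = data.shape[0]
    L = 2                           # increase L until no out-of-memory error arises
    grm = GRM(data=data, n_factors=5, batch_size=round(N / L))  # assumed keyword
    grm.fit()                       # monitor acceptance rate and loss while fitting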

Numerical experiments

Overview of experiments

In this section, two numerical experiments are presented. Experiment A was designed to compare the performance of VMHRM, as implemented by xifa, with well-established methods under small numbers of factors (M = 1, 3, 5). In particular, we chose BAEM and MHRM as performed by mirt (Chalmers, 2012), a popular R package for multidimensional IRT. Experiment B was designed to compare VMHRM with an IFA procedure for high-dimensional settings (M = 10, 15, 20). We chose IWAE because (1) it performed well in its original evaluation (Urban & Bauer, 2021), and (2) its code is available in PyTorch and can be run on GPU.

In both experiments, we varied the number of factors (M), the number of items (I), and the case size (N). These factors were mainly used to manipulate the degree of computational complexity. The number of items was set to I = 5 × M and I = 10 × M. The case size was set to N = 500, 1000, 2000, 4000, and 8000. The levels of our manipulation partly covered the simulation designs used in Chen et al. (2019) and Urban and Bauer (2021). As a result, 60 = 3 × 2 × 5 settings were considered in each experiment, with 100 replications per setting. Both experiments were conducted on an HP Z4 workstation with an Intel Xeon W-2123 CPU (3.60 GHz), 32 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU.

For each replication, a data set was generated by the GRM. Let \(1_M\) denote an M-dimensional vector with all elements equal to one. An individual latent factor was first sampled as \(\eta \sim \text {Normal}(0, {\Phi })\), where \({\Phi }=0.3 \times 1_{M} {1_{M}^{T}} + 0.7 \times I_{M}\). The corresponding linear predictor was an I × (C + 1) matrix τ = N + Λη with

$$ \begin{aligned} &\text{N} = 1_{I} \otimes (-\infty, -1, -.4, .4, 1, \infty)^{T},\\ & {\Lambda} = I_{M} \otimes \left [ 1_{K/5} \otimes (2, 1.5, 1, 1.5, 2) \right ], \end{aligned} $$
(7)

where K = I/M = 5, 10 is the number of items per factor and ⊗ denotes the Kronecker product. The ordinal response v was then sampled from the category probabilities in Eq. 2. These parameter values represent an ideal setting with symmetric thresholds and large communalities ranging from .5 to .8. We believe that this ideal setting reduces the occurrence of non-converged solutions (e.g., Li, 2016). To make the comparison fair, the different algorithm implementations were fitted to the same data sets.
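For reference, a NumPy sketch of this data-generating step is given below; it follows our reading of Eq. 7 (the five-value loading pattern repeated K/5 times per factor) and reverses the printed intercept vector so that it satisfies the decreasing ordering stated below Eq. 2.

    import numpy as np
    from scipy.special import expit

    def simulate_grm(N, M, K, rng):
        """Generate GRM responses with the Eq. 7 parameter pattern."""
        I, C = K * M, 5
        Phi = 0.3 * np.ones((M, M)) + 0.7 * np.eye(M)
        eta = rng.multivariate_normal(np.zeros(M), Phi, size=N)      # (N, M)
        nu = np.tile([np.inf, 1, .4, -.4, -1, -np.inf], (I, 1))      # (I, C + 1), decreasing
        lam = np.kron(np.eye(M), np.tile([2, 1.5, 1, 1.5, 2], K // 5)[:, None])
        tau = nu[None] + (eta @ lam.T)[:, :, None]                   # (N, I, C + 1)
        cum = expit(tau)                                             # Pr(v >= c | eta)
        probs = cum[..., :-1] - cum[..., 1:]                         # Eq. 2, all items at once
        u = rng.random((N, I, 1))
        return (u > np.cumsum(probs, axis=-1)).sum(axis=-1)          # categories 0..C-1

    V = simulate_grm(N=500, M=3, K=5, rng=np.random.default_rng(0))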

The performance evaluation was based on three metrics: mean square error (MSE), average computational time (speed), and average number of iterations (efficacy). MSE evaluated the overall estimation quality through:

$$ \widehat{MSE} = \frac{1}{100}\sum\limits_{r=1}^{100}\frac{1}{P}\sum\limits_{p=1}^{P} \left (\widehat{\vartheta}_{p}^{(r)} - \vartheta_{p}^{*} \right)^{2}, $$
(8)

where \(\vartheta _{p}^{*}\) denotes the true value of the pth model parameter and \(\widehat {\vartheta }_{p}^{(r)}\) denotes the corresponding estimate in the rth replication.
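In code, Eq. 8 is a single averaging operation, as in this small self-contained example with hypothetical arrays:

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.normal(size=50)                     # P = 50 hypothetical true values
    est = truth + 0.1 * rng.normal(size=(100, 50))  # 100 replicate estimates
    mse_hat = np.mean((est - truth) ** 2)           # Eq. 8: average over r and p
    print(mse_hat)                                  # roughly 0.01 here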

Experiment A: Low-dimensional settings

Experiment A compared four implementations under M = 1, 3, and 5: xifa-VMHRM1, xifa-VMHRM5, mirt-MHRM, and mirt-BAEM. The number after VMHRM indicates how many MH draws were used to approximate the complete-data likelihood. As mentioned before, VMHRM generates K samples by constructing K parallel chains, which differs from the usual Markov chain Monte Carlo (MCMC) practice of running a single long chain. In theory, a larger K results in a better approximation of the complete-data likelihood. For mirt-MHRM, only K = 1 was used, so we omitted the indicator after its name. Only the non-zero loadings, the finite intercepts, and the factor correlations were estimated here; in other words, confirmatory IFA was considered.

Both VMHRM and MHRM were conducted in three stages: (1) 150 warmup steps for MCMC; (2) 200 StEM iterations to obtain a starting value for the RM update; and (3) at most 500 MHRM iterations for computing an MML estimate. In each StEM and MHRM iteration, there were four warmup steps for MH sampling. The gain sequence \(\gamma _{t} = \frac {1}{t}\) was used for RM updating. BAEM used 61, 15, and 7 quadrature points per dimension for the 1-, 3-, and 5-factor conditions, respectively. The maximal number of BAEM iterations was set to 700. For all implementations, the tolerance for declaring convergence was set to \(\epsilon = 10^{-4}\). As VMHRM and MHRM are stochastic algorithms, they were considered to have converged when three successive differences between estimates fell below the tolerance.

Figure 1 displays the evaluation metrics of Experiment A. The MSEs of xifa-VMHRM1, xifa-VMHRM5, and mirt-MHRM were almost the same, suggesting estimates of equal quality. However, the mirt-BAEM estimates coincided with those of the other methods only under M = 1 and M = 3. For M = 5, mirt-BAEM resulted in higher MSE because of too few quadrature points per dimension. Note that using more quadrature points is not a practical remedy either, because of the extra computational time it requires.

Fig. 1 Mean square error, average computational time, and average number of iterations for Experiment A

The average computational time indicates that xifa-VMHRM1 and xifa-VMHRM5 were consistently fast, ranging from 3.45 s to 9.43 s across all conditions. They were the fastest under M = 3 and M = 5, with xifa-VMHRM1 slightly faster than xifa-VMHRM5. In contrast, the average running times of mirt-MHRM and mirt-BAEM varied greatly across conditions: from 4.69 s to 472.61 s for mirt-MHRM, and from 0.11 s to 499.56 s for mirt-BAEM. mirt-BAEM was the fastest implementation for M = 1 (below 1 s) but the slowest for M = 5 (499.56 s).

The average number of iterations shows that mirt-BAEM required the fewest iterations, followed by xifa-VMHRM5, xifa-VMHRM1, and mirt-MHRM. The low number of iterations of mirt-BAEM is likely related to its deterministic nature. The advantage of xifa-VMHRM5 over xifa-VMHRM1 is due to its larger number of MH draws for the approximation. The inefficacy of mirt-MHRM was unexpected: for most conditions, its average number of iterations was 700, indicating a lack of convergence.Footnote 2 In spite of this, mirt-MHRM still yielded estimates with comparable MSE.

To better understand the acceleration due to GPU, we conducted two supplemental analyses. (1) For each condition, we calculated the ratio of average computational time per iteration between mirt-MHRM and xifa-VMHRM1. This metric ignores the fact that mirt-MHRM required more iterations. The five-number summary (minimum, first quartile, median, third quartile, and maximum) of these ratios across conditions was 0.57, 2.20, 5.53, 10.53, and 37.08. In general, GPU had a larger effect under larger conditions (with respect to M, I, and N); xifa-VMHRM1 was slower than mirt-MHRM per iteration only when M = 1 and N ≤ 1000. (2) We ran xifa-VMHRM1 on CPU with only ten replications and calculated the ratio of average computational times between the CPU and GPU implementations. The resulting five-number summary was 1.74, 5.13, 9.54, 15.99, and 28.72. The acceleration increased with each of M, I, and N. The GPU-implemented VMHRM was at least seven times faster than its CPU version when M ≥ 3 and N ≥ 2000. Note that both computational times were obtained from the same Jax code.

Experiment B: High-dimensional settings

Experiment B compared four implementations under M = 10, 15, and 20: xifa-VMHRM1, xifa-VMHRM5, torch-IWAE5, and torch-IWAE25. The number after IWAE indicates how many importance-weighted (IW) samples were used. In theory, more IW samples result in a more accurate approximation of the marginal likelihood at the price of higher computational complexity. Because the available IWAE code was written for exploratory IFA, only factor loadings and intercepts were estimated in Experiment B. MSE was calculated from results rotated by GEOMIN (Yates, 1987) via the R package GPArotation (Bernaards & Jennrich, 2005). The maximal number of iterations for rotation was set to 5000. Because the gradient projection algorithm (GPA; Jennrich, 2002) might not converge, each rotation task was conducted repeatedly with different starting values up to 20 times. If no attempt converged, we chose the solution with the lowest GEOMIN function value.

The implementation of VMHRM was the same as in Experiment A. For IWAE, the implementation and the hyperparameter values mainly followed Urban and Bauer (2021). The maximal number of iterations was set to 700. Note that IWAE utilized the AMSGrad method (Reddi, Kale, & Kumar, 2018), which updates estimates with mini-batches of size 32; hence, each iteration updated the estimates ⌈N/32⌉ times.

Figure 2 displays the evaluation metrics of Experiment B. The four implementations resulted in quite different MSEs, except in two cases: (1) M = 10 and I = 50, and (2) M = 15 and I = 75. Under the other conditions, torch-IWAE5 and torch-IWAE25 resulted in much higher MSE than xifa-VMHRM1 and xifa-VMHRM5. Unfortunately, IWAE seemed to yield inconsistent estimates when both M and I were large; for example, the MSE did not decrease with sample size when M = 20 and I = 200. This inconsistency might be fixed by tuning the model and optimization hyperparameters, but such tuning is a difficult and time-consuming task.

Fig. 2 Mean square error, average computational time, and average number of iterations for Experiment B

The average computational times of xifa-VMHRM1 and xifa-VMHRM5 were the lowest among the four implementations, ranging from 6.07 s to 27.73 s. The minimum corresponded to the smallest condition (M = 10, I = 50, N = 500) and the maximum to the largest condition (M = 20, I = 200, N = 8000). Since xifa-VMHRM1 sampled fewer MH draws than xifa-VMHRM5, it was slightly faster under most conditions. In contrast, torch-IWAE5 and torch-IWAE25 were much slower, with average computational times ranging from 86.21 s to 206.46 s.

It is worth noting that xifa-VMHRM5 was occasionally faster than xifa-VMHRM1, even under a small sample size such as N = 500, because xifa-VMHRM5 needed fewer iterations to finish the optimization task. This result demonstrates the usefulness of VMHRM on GPU even under small sample sizes: by sampling more MH draws, xifa-VMHRM5 yields a better approximation of the complete-data likelihood in each iteration. torch-IWAE25 and torch-IWAE5 required the highest average numbers of iterations among the four implementations under N = 500, although this number decreased with N. Because the MSE of IWAE under large sample sizes was unacceptably high, its smaller number of iterations there is of little value. Perhaps the current convergence criterion of IWAE is too loose for large IFA models (e.g., M ≥ 15 and I ≥ 150); however, this speculation requires further experimentation to confirm.

To better understand the acceleration due to GPU, we also ran Experiment B on CPU with only ten replications and calculated the ratio of average computational times between the CPU and GPU implementations. For VMHRM, the five-number summary of the ratio was 6.42, 16.51, 25.84, 29.06, and 33.80, indicating large accelerations under most settings. In contrast, the summary for IWAE was only 0.28, 0.55, 0.70, 0.97, and 2.10; the GPU version was faster than the CPU version only under large M, I, and N. These results indicate that a GPU implementation alone is not sufficient for a speed-up; the design of the algorithm is still important. By default, the IWAE code processes only a small batch of data when updating estimates. This mini-batch approach might reduce the capacity and effectiveness of parallelization; however, a more detailed analysis is required to understand this phenomenon.

A real data example

In this section, we demonstrate the power of VMHRM on GPU through a real data example. Two versions of the IPIP-NEO (International Personality Item Pool NEO) data sets were used (Johnson, 2015, 2018). The IPIP-NEO scale is composed of 300 items measuring 30 personality facets belonging to the Big Five traits: neuroticism, extraversion, openness, agreeableness, and conscientiousness. For each item, a personality description is presented (e.g., "worries about things"), and the respondent chooses a degree of agreement from very inaccurate, moderately inaccurate, neither accurate nor inaccurate, moderately accurate, and very accurate. The first version, IPIPv1, includes the responses of 20,993 subjects (Johnson, 2015); the second version, IPIPv2, includes 307,313 subjects (Johnson, 2018). Zhang et al. (2020) used IPIPv1 to demonstrate their improved StEM.

We fit a 30-dimensional confirmatory GPCM to both IPIPv1 and IPIPv2. Each item was assumed to be influenced by only one of the 30 personality facets; the correspondence between items and facets can be found in Johnson (2021). The personality facets were allowed to correlate. To find MML estimates under IPIPv1 and IPIPv2, we used the same implementation described in our numerical experiments, except that (1) the factor correlation matrix was updated by an empirical method to avoid non-positive definiteness, and (2) because of the large sample size of IPIPv2, a mini-batch approach with a batch size of 80,000 was used to calculate the gradient (for details, see Appendix B). Code for the real data example is available as supplemental material.Footnote 3 The analyses were run on Kaggle (https://www.kaggle.com), a platform for data science. Notably, Kaggle provided a GPU with 16 GB of memory, exceeding that of our local GPU machine.

With a GPU on Kaggle, VMHRM took 30 s to finish the optimization task under IPIPv1 and 16 min under IPIPv2. For comparison, the improved StEM took 32 min to find an estimate based on only the 7325 complete cases in IPIPv1 (Zhang et al., 2020). VMHRM did not drop any incomplete case, thanks to the full-information approach we introduced to handle missing values. Since the sample size of IPIPv2 is larger, we present only its MML estimate. Table 1 shows the estimates of the non-zero loadings, and Fig. 3 visualizes the estimated factor correlation matrix. Their relative values are quite similar to those found in Zhang et al. (2020).

Table 1 The marginal maximum likelihood estimate for factor loadings under IPIPv2
Fig. 3 Visualization of the estimated factor correlation matrix under IPIPv2

The log-marginal likelihood at the MML estimate was evaluated by Monte Carlo integration with 5000 draws. To avoid out-of-memory problems, the batch size was set to 128. The evaluation took 8 s for IPIPv1 and 205 s for IPIPv2. The log-marginal likelihood value was −388.535 for IPIPv1 and negative infinity for IPIPv2. We found that the 53,557th and 103,686th cases produced the negative infinity; these two cases seem to correspond to aberrant response patterns. The log-marginal likelihood without these two cases was −391.606.
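For completeness, here is a sketch of the Monte Carlo evaluation used here (our own minimal version; the batching and xifa-specific details are omitted, and cond_loglik is a hypothetical argument): the integral in Eq. 6 is approximated by averaging conditional likelihoods over draws from the latent distribution, with log-sum-exp for numerical stability.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal(v, theta, cond_loglik, Phi, S=5000, seed=0):
        """Monte Carlo estimate of Eq. 6 for one response vector v.

        cond_loglik(v, eta, theta) -> (S,) values of log Pr(v | eta; theta).
        """
        rng = np.random.default_rng(seed)
        eta = rng.multivariate_normal(np.zeros(Phi.shape[0]), Phi, size=S)  # (S, M)
        # log (1/S) * sum_s Pr(v | eta_s; theta), computed stably;
        # a case with all-(-inf) conditional log-likelihoods yields -inf.
        return logsumexp(cond_loglik(v, eta, theta)) - np.log(S)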

Discussion

MML estimation for IFA is a challenging task under high dimensionality. In this study, we proposed the VMHRM algorithm for both the GRM and the GPCM, implemented in the Python package xifa. Our numerical experiments demonstrated that VMHRM on GPU may run up to 33 times faster than its CPU implementation. The median acceleration was about nine times for dimensions ≤ 5 and 25 times for dimensions ≥ 10. The degree of acceleration increases with the number of factors, the sample size, and the number of items. Therefore, the GPU-implemented VMHRM is most appropriate for high-dimensional IFA with large data sets. Even for small data sets, VMHRM is useful when handling more MH samples.

When the number of factors is at least five, the GPU-implemented VMHRM is much faster than existing IFA implementations such as BAEM and MHRM as implemented by mirt, and IWAE. With 20 factors, 200 items, and 8000 cases, VMHRM took about 28 s to finish the optimization task. This progress in computation time is a breakthrough for high-dimensional IFA. With the help of GPU, it becomes possible to use MML in many large-scale psychological and educational applications.

The existence of GPU acceleration does not imply that any code can be accelerated on GPU without modification. For example, the IWAE code written by Urban and Bauer (2021) is, surprisingly, slower on GPU under most conditions. The degree of GPU acceleration depends on whether the related operations can be parallelized. We have seen that GPU takes advantage of larger input arrays; otherwise, CPU-based serial computation remains faster.

VMHRM is not limited to the GRM or the GPCM. Other types of IFA models (e.g., item response trees; Böckenholt, 2012) can be used after replacing Eqs. 11 and 12 with the corresponding category response functions. The deep learning libraries automatically calculate the gradients via automatic differentiation; as a result, VMHRM would still compute the corresponding MML estimates. Our proposed algorithm could also be modified with other sampling and updating methods. It is possible to design new algorithms using the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2014), Polyak–Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988), or stochastic proximal methods (Zhang & Chen, 2021). It would be interesting to compare the empirical performance of VMHRM with these new algorithms.

The major limitation of the present work is that VMHRM was evaluated only under very restricted simulation settings. Our simulation used a GRM with "ideal" parameter values to generate data. In addition, the simulation considered only simple stochastic gradient descent with \(\gamma _{t} = \frac {1}{t}\) after a fixed number of StEM steps. It would be interesting to study the behavior of VMHRM under \(\gamma _{t} = \frac {1}{t^{\alpha }}\) for α ∈ (0.5, 1) after an adaptive number of StEM steps (e.g., Zhang et al., 2020). With the acceleration provided by GPU, it becomes possible to evaluate MML algorithms for IFA more comprehensively.

We believe that many psychometric modeling methods would also benefit from GPU, provided that their relevant operations are appropriately vectorized. For example, Bayesian psychometric models (e.g., Edwards, 2010; Muthén & Asparouhov, 2012) could be sped up by constructing parallel Markov chains, as we have done with VMHRM. We expect GPU computing to play a central role in large-scale psychometric modeling in the near future.

Open practices statements

The source code of the software package and the real data example are available at https://github.com/psyphh/xifa. The data were derived from public-domain resources. None of the experiments were preregistered.