HBMIRT: A SAS macro for multidimensional item response models in small (and large!) samples

Psychometric assessment of categorical (in many cases dichotomous) test items is often based on item response theory (IRT), particularly in large scale studies such as the Programme for International Student Assessment (PISA; Organization for Economic Co-Operation and Development, 2016) or the Trends in International Mathematics and Science Study (TIMSS; Martin et al., 2020). IRT models can be considered as a special case of latent variable models with categorical indicators that load on one or more continuous latent variable(s) (Glockner-Rist & Hoijtink, 2003; Muthén, 2002). These models address a variety of different analysis goals regarding psychometric properties at the item- and test level such as the analysis of item discriminations, within- and between-item multidimensionality (Adams et al., 1997), and test reliability (person separation reliability; Andrich, 1982). In addition, the models can address questions regarding individual persons parameter estimates such as their precision (i.e., standard errors or posterior standard deviations).

Despite the great potential of IRT analyses, in many cases, they cannot be applied to empirical data due to insufficient sample size. In fact, for a unidimensional two parameter logistic (2PL) model (i.e., difficulty and discrimination are estimated for each item), samples sizes of N > 500 are recommended, and even larger sample sizes for more complex (e.g., multidimensional) models (Liu & Yang, 2018).

Besides model complexity, the specific estimation method has been shown to play a major role for the required sample size (Garnier-Villarreal et al., 2021). Bayesian methods have been shown to be advantageous over frequentist methods in small samples when prior distributions are specified in an advantageous way (Garnier-Villarreal et al., 2021; König et al., 2020). However, the specification of such a prior distribution can be challenging: A prior that is advantageous in one case may not be so in another case. Often, so called uninformative (or diffuse, flat, vague) priors are specified when no information is a priori available about the parameters to be estimated, which can lead to severe bias when samples are small (Smid et al., 2020; Zitzmann, Lüdtke, et al., 2021).

To mitigate this problem, the use of hierarchical prior distributions (also called adaptive informative priors) has specifically been suggested for IRT models when these models are used in small samples (Fujimoto & Neugebauer, 2020; Gilholm et al., 2021; König et al., 2020; Sheng, 2013). The idea behind this approach is that distributions for the parameters of a type of prior for model parameters (e.g., for a set of item discrimination parameters) are assumed. As a consequence, the prior distribution (e.g., for a specific item discrimination) is “adaptive” in such a way that it is adapted during the estimation process.

From this point of view, the question arises why such hierarchical Bayesian IRT models are not more routinely applied. One major reason for this might be that easy-to-use tools are not available in which users can specify such models without formulas as, for instance, in standard IRT tools (that do not allow for hierarchical Bayesian estimation) such as ConQuest (Wu et al., 2007), PROC IRT (SAS Institute Inc., 2018), the SPIRIT macro (DiTrapani et al., 2018) for SPSS (IBM Corp, 2019) or the many R packages (for an overview see Choi & Asilkalkan, 2019) such as the TAM package (Robitzsch et al., 2024) or the mirt package (Chalmers, 2012).

To provide users with such a tool, we developed a SAS macro based on PROC MCMC (SAS Institute Inc., 2018) that allows users to easily specify and estimate hierarchical Bayesian multidimensional IRT models (for user-specified IRT models with PROC MCMC see Ames & Samonte, 2015; Stone & Zhu, 2015): the HBMIRTFootnote 1 macro for dichotomous items, which supports hierarchical priors for item intercept and item slope (i.e., item discrimination or factor loading) parameters in logistic (logit link) and normal ogive (probit link) models. We choose SAS PROC MCMC as it shows fewer memory limitations compared to different R packages and is computationally very efficient (Ames & Samonte, 2015) due to multithreading (which is however not available in the online version of SAS OnDemand for Academics). In the following, we explain the capabilities and features of the macro, including an automated procedure for monitoring convergence. Moreover, we compare the performance of the implemented hierarchical priors with different constant hyperparameters and maximum likelihood estimation in a small simulation study with unidimensional two-parameter logistic (2PL) models.

The HBMIRT macro

In the following, we first give a short overview about the range of modeling options that are available within the HBMIRT macro. We then explain how different models can be specified with the HBMIRT macro with concrete examples with illustrated listings and outputs. We begin with the specification of a standard unidimensional 2PL model and increase the complexity of the model and the applied (hierarchical) priors step by step and explain why such model extensions can be useful. In the subsequent section, we introduce the use of the remaining macro options, for instance, regarding the convergence criteria, the naming of output data sets, and the possibility of specifying priors with constant hyperparameters instead of hierarchical priors.

Modeling options

The HBMIRT macro allows users to specify a variety of uni- and multidimensional IRT models with binary outcomes. The general model and modeling options are shown in Fig. 1. It is essentially a multidimensional version of the 2PL model (M2PL; Reckase, 2009) – when the default logistic link function instead of the optional probit link function is used – with binary response yij for a person i on an item j. The responses yij are assumed to be a function of the person parameters θif for a person i on a factor (or dimension) f and the item parameters, namely the intercepts dj and the discriminations or slopes ajf (Chalmers, 2012). For both item intercept parameters and item slope parameters, normal prior distributions are specified. The parameters for at least one item per dimension are associated with a truncated normal prior ajf ~ N+a, σ2a), ensuring positive item slope parameters for identification purposes. The hyperparameters (µa and σ2a or µd and σ2d, respectively) of the prior distributions for the item parameters can be defined as prespecified constants (µa*, σ2a*, µd*, σ2d*) or – in the case of hierarchical priors – as parameters with associated hyperprior distributions (µa(s), σ2a(s), µd, σ2d). If hierarchical priors are applied, the hyperparameters regarding the means of the prior distributions are uniformly distributed as µd ~ U(-6, 6 ) and µa(s) ~ U(-6, 6 ) or µa(s) ~ U(0, 6 ), respectively, and the hyperparameters regarding the variances are given inverse gamma distributions with shape and scale parameters both equal to 0.01: σ2a(s) ~ IG(0.1, 0.1) and σ2d ~ IG(0.1, 0.1) as default. Note that the hierarchical prior distributions including the hyperparameters can be changed by macro parameters. Further, in the case of the slope parameters, different sets (S) of hyperparameters with identical hierarchical prior distributions may be defined that are related to different sets of item slope parameters, for instance, a first set relating to slope parameters regarding the second dimension – aj2 ~ Na(1), σ2a(1)) – and a second set relating to slope parameters regarding the first dimension – aj1 ~ Na(2), σ2a(2)). It is also possible to use hierarchical priors only for a subset of the item slope and intercept parameters and constant hyperparameters for the remaining priors – for instance, dj ~ Nd*, σ2d*) for j > 10. In the case of (multidimensional) 1-parameter models, where all item slopes regarding a given dimension are assumed to be equal (i.e., ajf reduces to af), constant hyperparameters af ~ N+a*, σ2a*) are used for the priors of the slope parameter(s). In the case of multidimensional models positive definite correlation matrices for the person parameters θif including the option to constrain selected correlations to zero are generated based on a regression approach (for details see Appendix A).

Fig. 1
figure 1

A directed acyclic graph of the general model and options

Specifying IRT models with the HBMIRT macro

Suppose we have assessed early numeracy (Dierendonck et al., 2021) with a standardized test in a sample consisting of 140 kindergarten children (the SAS listing to generate the data set and the listing for the analyses as well as the respective output are provided in Appendix B). The test encompasses three theoretical facets, each assessed by 10 items, namely counting, relations, and arithmetic. To illustrate the macro, we first estimate a unidimensional 2PL model (Fig. 2) with the hierarchical Bayesian approach (i.e., using hierarchical priors for the item parameters). Our data set numeracy in the SAS library lib comprises a unique numerical variable for each child (ID) named idchild and the variables y1-y30 (counting y1-y10, relations y11-y20, and arithmetic y21-y30) containing the scored items (0 = false, 1 = correct).

Fig. 2
figure 2

Unidimensional 2PL model with 30 indicators. Note. Item slope parameters are denoted as a1-a30. Item intercepts are not displayed.

Unidimensional 2PL model. To estimate our model, we use the following specification (Listing 1):

Listing 1

“Standard” unidimensional 2PL model with hierarchical priors

figure a

The 2PL model is the default choice. Therefore, what needs to be specified here are the name of the data set, the name of the ID variable, one factor (#1) with the respective items that load on that factor (y1-y30), and a label ([1]) that declares that the loadings (i.e., item slopes or aj parameters with j = 1, …, 30) of the items aj parameters share the same mean and variance hyperparameters. This way of modeling item slopes is consistent with the view that identical normal prior distributions would be specified in a non-hierarchical model. A similar assumption is made for the item intercepts (dj parameters with j = 1, …, 30). It should be noted that the model is parameterized as p(yj = 1) = logistic(-dj + aj · θi), meaning that higher intercept values indicate lower probabilities for correct answers for an item with a positive slope and a given person ability.

The results for this model are shown in Fig. 3. Each line – here denoted in SAS as Obs for “observation number” in the respective output data set – contains a parameter estimate. The name of the parameter can be found in the second column (Parameter) and the respective estimate and its posterior standard deviation in columns three (Estimate) and six (StdDev), respectively. The fourth column (constr) shows the constraints applied to identify the model or to avoid label switching (see also below). Only one positivity constraint was applied to the slope parameter of item y1 on dimension 1 (a_y1_dim1), which is the only assumed dimension here (therefore, all names of slope parameters have the suffix dim1). The slope parameters a_y1_dim1 … a_y30_dim1 correspond to those denoted as a1-a30 in Fig. 2, where “dim1” is denoted as θ. The fifth column (hprior) shows that hierarchical priors are used in the estimation. One set of hyperparameters (mean and variance) is used for all slope parameters and another set is used for all intercept parameters. The “y” in column hprior means that the item intercept (d_y1) belongs to the set of parameters with the same hierarchical prior (marked by “y” for “yes”, whereas unmarked intercepts would mean that they are not based on hierarchical priors), which are all intercept parameters, although they are not shown in Fig. 3. Columns eight and nine as well as columns 10 and 11 show credibility intervals (equal tail intervals and highest posterior density [HPD] intervals) at the specified alpha level (alpha, column seven, with default value .05). The right-most column in Fig. 3 (PSR) contains the potential scale reduction (PSR; Gelman & Rubin, 1992) for each parameter as described in Asparouhov and Muthén (2010) for the second half of the total chain (the first half of the chain is treated as a burn-in phase and always discarded for all calculations). If all PSR values fall below the cutoff (default PSR_conv = 1.1) then the column to the right of the PSR column (converged) – not shown in Fig. 3 – contains ones instead of zeros, indicating that the estimation has converged. If the expected sample sizes (ESS; Geyer, 1992; Kass et al., 1998) criterion is additionally used as a stop criterion (default ESS_conv = 0 means that ESS is not computed), convergence depends on both criteria. It is also possible to switch off the PSR criterion (PSR_conv=0) and to use only the ESS criterion. Note that a large ESS can improve the accuracy of the estimates (Zitzmann & Hecht, 2019; see also Zitzmann, Weirich, et al., 2021a, 2021b). Recently, a SAS macro called automcmc (Wagner et al., 2023) has been introduced that can be used to apply the identical stopping criteria as those in HBMIRT to IRT models that require user-defined PROC MCMC code and are not yet covered by HBMIRT.

Fig. 3
figure 3

Output for unidimensional 2PL model presented in Listing 1 (the five right-most columns and lines 36-64 are excluded)

In the IRT literature, item difficulties are often reported. It is also possible to obtain these parameters (b parameters), which can, due to the model parametrization p(yj = 1) = logistic(-dj + aj · θ), be computed as bj = dj /aj (Stone & Zhu, 2015) in unidimensional models (for multidimensional models see Reckase, 2009) using the output data set containing the generated posterior samples (called outpost). Convergence regarding these parameter estimates can be checked by applying the SAS ESS-autocall macro (which is also used in the HBMIRT macro if ESS is specified to monitor convergence)Footnote 2.

Different sets of prior distributions. It might be reasonable to assume different hierarchical prior distributions (means, variances) for different item loadings; for example, depending on the specific facet these loadings belong to (counting y1-y10, relations y11-y20, and arithmetic y21-y30). In this case, the model is specified as follows (Listing 2):

Listing 2

Unidimensional 2PL model with specific loading hierarchical priors for different sets of items

figure b

Here, three different means and three different variances (denoted as mua1-mua3 and vara1-vara3, respectively, in the output shown in Fig. 4 lines 1-6) are estimated for three different sets of prior distributions (as shown in column 5, lines 7-36). This means that the hierarchical prior distributions for the loadings of each facet (denoted as a1-a10, a11-a20 and a21-a30, respectively, in Fig. 2) are allowed to have different means and variances (e.g., highest loadings on average for facet 1). Notice that it is still a unidimensional model because all item sets load on the same dimension (#1 for dimension 1).

Fig. 4
figure 4

Output for unidimensional 2PL model presented in Listing 2 (the five right-most columns and lines 41-68 are excluded)

Excluding items from estimation with hierarchical priors. It is also possible to exclude single items from the hierarchical specification. If, for instance, three items (say y10, y20, and y30) have been newly developed and, therefore, may have less desirable psychometric properties than the established items, these can be estimated with an individual (nonhierarchical) normal prior for each of these items (e.g., by default, prior_mean_a = 0, prior_var_a = 4) as shown in Listing 3. In the output, the zeros in the hprior column would indicate that the respective item slopes do not fall under the umbrella of common hierarchical prior.

Listing 3

Exclusion of items from estimation with hierarchical priors

figure c

Regarding the intercepts, multiple sets of hierarchical priors are not allowed. However, as in the case of the loadings, single items can be excluded from the hierarchical prior by specifying the respective item names after the macro parameter duninf (e.g., duninf = y10 y20 y30). The intercepts of these items are then estimated using the specified (un)informative normal prior (e.g., with the default values prior_mean_d = 0, prior_var_d = 4).

Hyperpriors. The default priors proposed by Stone and Zhu (2015) for the hyperparameters in the hierarchical specification are uninformative with a uniform distribution ranging from -6 to 6 for the mean and an inverse gamma distribution with shape and scale parameters both equal to 0.01 for the variance. If the option muaconstr = yes is chosen, the mean is restricted to positive values. The priors for the hyperparameters can be modified by the macro parameters prior_mua and prior_mud for the means as well as prior_vara and prior_vard for the variances of the normal prior distributions of the item slope and intercept parameters. For instance, a more informative hyperprior (König et al., 2022) for the variance of the slope parameter prior could be specified as prior_vara = %STR(EXPON(ISCALE=0.4)) which would imply an exponential distribution with an inverse scale parameter equal to 0.4 or as a half-Cauchy distribution with location and scale parameters μ = 0 and σ = 2.5 by specifying prior_vara = %STR(CAUCHY(0, 2.5, lower=0)). Note that the %STR() function is necessary to mask the parentheses enclosing the distribution parameter(s) (here “iscale=0.4”) during macro compilation. If at least one of the options muaconstr = yes or the aconstr = yes (all slope parameters are constrained to be positive, see below) is used, the prior for the hyperparameter of the mean of the normal prior distributions for the item slope must exclude negative values. The default hyperprior here is specified as a uniform distribution ranging from 0 to 6 defined by the macro parameter prior_mua_ge0=%STR(UNIFORM(0, 6)).

Estimation without hierarchical priors. It is also possible to estimate a “traditional” Bayesian model with fixed prior means (as defined by prior_mean_a and prior_mean_d, respectively) and variances (as defined by prior_var_a and prior_var_d, respectively) for all a and d parameters by specifying the macro parameter priors = user instead of the default priors = hierarchical. It is important to note that the prior means and variances should be chosen with regard to the selected link function (link = logit or link = probit). The prior variance for a and d parameters with logit link refers to the logistic distribution with variance π2/3, whereas the probit link refers to the normal distribution with unit variance. So, for example, if mean and variance for a parameters with a logit link are chosen as prior_mean_a = 1 and prior_var_a = 4, the corresponding priors in a model with probit link should be approximately prior_mean_a = 1 / (π2/3)1/2 ≈ 0.55 and prior_var_a = 4 / (π2/3) ≈ 1.22.

Model identification, label switching. To identify the model, by default (aconstr = auto), one loading for each dimension is constrained to be positive. In some cases (e.g., if this loading is close to zero or the model is quite complex) label switching can appear because, for instance, in a unidimensional model, the likelihood for a set of loadings and the same set of loadings multiplied by -1 is identical. To avoid such problems, individual loadings can be constrained to be positive by adding the “>” sign after the item (without a blank). For example, “#1 y1> y2> y3> y4” means that y1-y3 are constrained to have positive slope parameters, whereas this is not the case for y4. Alternatively, all loadings can be constrained to be positive by setting aconstr = all. The positivity constraint is realized by truncating the specific normal prior distribution(s) at zero (see also Gilholm et al., 2021). This can also be applied to the hyperparameters for the means of the slope parameters by setting muaconstr = yes.

Multidimensional models. A multidimensional 2PL model with the three facets represented as different (correlated) dimensions (Fig. 5) can be estimated as follows (Listing 4):

Fig. 5
figure 5

Three-dimensional 2PL model with 30 indicators. Note. Item slope parameters are denoted as a1-a10, a11-a20 and a21-a30 for dimensions 1 to 3 (denoted as θi1- θi3), respectively. Item intercepts are not displayed.

Listing 4

Three-dimensional model

figure d

The factors are assumed to be intercorrelated by default (covar = free). If all correlations among factors should be constrained to zero, covar = diagonal can be specified. It is also possible to free correlations among factors, for example, by specifying covar = 1#2 1#3. Here, correlations between factors 1 and 2 (1#2) as well as between factors 1 and 3 (1#3) are estimated, whereas the correlation between factors 2 and 3 is fixed to zero (as 2#3 is not mentioned). If correlations among factors are estimated, the prior distributions of the correlations are given by default (rdist = yes). The prior distribution for the correlation matrix can be modified by choosing different values for var_beta (default: var_beta = 1) with smaller values implying distributions closer around zero.

In Listing 4, a global distribution of all item loadings is assumed. Factor-specific hierarchical prior distributions for the loadings can be specified by renaming [1] to [2] and [3] for the second and third factor (denoted as #2 and #3, respectively) in Listing 4. This would be in line with the population model used to generate the data (with loadings on factors 1-3 sampled from normal distributions with means 1, 1.5, and 2, respectively).

Finally, a bifactor model (Fig. 6) with one general competence factor (with a specific hierarchical prior for the loadings) and three facet-specific factors (all factors are specified as uncorrelated by covar = diagonal) can be estimated. As each indicator is assumed to depend on two dimensions (i.e., loads on two factors) two slope parameters and one intercept parameter per indicator are estimated per indicator. Therefore, the probability of solving item 1, for instance, is modelled as p(y1 = 1) = logistic(-d1 + a1 · θi1 + a31 · θi2). Such models can be seen as multidimensional extensions of the 2PL (M2PL; Reckase, 2009) model. The specification of the respective bifactor model is shown in Listing 5.

Fig. 6
figure 6

Four-dimensional M2PL bifactor model with 30 indicators. Note. Item slope parameters are denoted as a1-a30 for the general dimension (θi1) and a31-a40, a41-a50 and a51-a60 for the specific dimensions 2 to 4 (denoted as θi2i4), respectively. Item intercepts are not displayed

Listing 5

Four-dimensional bifactor model

figure e

Starting values. Starting values are given by default as aj = 1 for slope parameters (loadings) and dj = 0 for intercept parameters and r = 0 for correlations among factors that are not fixed to zero. It is, however, possible to customize the starting values, which can be helpful when factors correlate. Reasonable starting values for correlations can be assigned in squared brackets directly after the respective factor pair to accelerate the estimation process, for instance, covar = 1#2[0.5] 1#3[0.4] (in this case, the svindat option – see below – is skipped). The correlation matrix implied by the starting values (including assumed uncorrelated factors) has to be positive definite. The respective regression weights used to generate the multivariate normal distribution for the person parameters (see above) are estimated with PROC CALIS based on the specified correlations. Starting values for all parameter estimates can also be given by SAS data sets mentioned under svindat or svlastitdat. The svlastitdat option is particularly helpful if more iterations are needed to increase the precision of estimates (smaller potential scale reduction [PSR] or larger effective sample size [ESS]). In this case, the data set from a prior model estimation specified under svlastitout (svlastitout = svlastitout by default) containing all relevant parameter estimates from the last iteration are used as starting values for a next chain of iterations. The resulting iterations can be combined by stacking both outpost files (e.g., outpost = outpost1 in first step, outpost = outpost2 in second step) to one data set (e.g., DATA outpost_combined; SET outpost1 outpost2;). It is also possible to specify user-defined starting values in a specific SAS data set by the svindat option (e.g., svindat = starting_values). The svindat data set has to contain starting values for all parameters to be estimated (except the person parameters). A simple way to generate a data set with all necessary variables is to estimate a given model with only a few iterations and use the resulting svoutdat data set (svoutdat = svoutdat as default) after updating the starting values as intended. Whenever models with hierarchical priors are estimated (and not all starting values are given in a data set), starting values are generated by first automatically estimating a model with constants (defined under prior_mean_a, prior_var_a, prior_mean_d, and prior_var_d) as parameters for the prior distributions. Because these are only starting values that thus should not have an impact on the final results, the use of rather informative priors is recommended, in particular in small samples where uninformative priors may otherwise lead to extreme parameter estimates in some cases. Applying different sets of starting values can also be used to check convergence based on several chains. For instance, a fixed number of (burn-in) iterations (nbi = 5E3, iterationsteps = 1E4; for details see below) could be specified without applying stopping criteria (PSR_conv = 0, ESS_conv = 0), so that based on each set of starting values HBMIRT would generate 10,000 draws from the given chain. Note that different names for the outpost data sets (e.g., outpost = chain1outpost, outpost = chain2outpost etc.) must be specified. The convergence across the different chains can then be checked with the SAS GELMAN-autocall macro.

Output data sets. Besides the above-mentioned starting values-related data sets, the HBMIRT macro generates output data sets for posterior summaries and credible intervals that are printed in the SAS output window by default (output = yes). Further, a data set containing the posterior distributions of all (hyper)parameter estimates except the person parameters is available (outpost = outpost). The latter are stored in a data set at an aggregate level (eap = eap) as expected a posteriori (EAP) estimates and their posterior standard deviations (i.e., the means and the standard deviations of the posterior distributions) computed from the second half of the total chain. The estimated EAP reliability for each dimension is included in the output data set. If ratioCPO is set to a value larger than zero, an additional data set (CPOdat = CPOdat) is generated which contains item and test level information in terms of the natural logarithms of the conditional predictive ordinate (CPO). The calculation is done as described by Stone and Zhu (2015). This information can be used for model comparisons using a pseudo Bayes factor approach (Geisser & Eddy, 1979). It should be noted that this approach is a very reliable model selection approach for IRT models with hierarchical priors (Fujimoto & Neugebauer, 2020) and even more reliable than the deviance information criterion (DIC; Spiegelhalter et al., 2002), which is generated by default as -2 · M(log likelihood) + 2 · Var(log likelihood) from the second half of the chainFootnote 3. Further, in their simulation study, Fujimoto and Falk (2024) found that CPO-based Log-Predicted Marginal Likelihoods (LPMLs) showed an acceptable performance in IRT model selection regarding dimensionality (unidimensional versus multidimensional models including bifactor and two-tier structures) that was superior to the DIC and comparable to the Watanabe–Akaike Information Criterion (WAIC; Watanabe, 2013), although being inferior to the Pareto-Smoothed Importance Sampling-Leave-One-Out Cross-Validation (PSIS-LOO; Vehtari et al., 2022) that was however also the most computationally demanding to calculate.

Missing values. In empirical studies it is often the case that participants do not respond to every task or that several booklets were used with different subsets of the item pool (planned missingness). This means that the data set contains variables with missing values. As it may not be adequate to code missing values on items as incorrect (i.e., as zero) in the case of non-response (Pohl et al., 2014; Rose et al., 2016) – but see Robitzsch (2021) for arguments against treating non-response in terms of the latent ignorability assumption – and in the case of missingness by design, it is useful to rely on procedures that assume missingness at random (MAR; Rubin, 1976). This is the default in PROC MCMC (SAS Institute Inc., 2018), and the HBMIRT macro is designed to handle cases with this type of missingness.

Further macro parameters to control estimation and customize output

In addition to the automatic monitoring of convergence (by specifying the PSR and/or ESS as the convergence criterion), the maximum number of iterations (maxnmc = 1E5 = 100,000 by default) after a number of burn-in iterations (nbi = 5E3 = 5,000 by default) prior to the first block of iterations (see iterationsteps option below) can be specified. Also, thinning (i.e., using only every nth iteration) can be specified (thin = 1 by default, i.e., all iterations are used) in order to reduce the autocorrelation of the chain. However, thinning is only needed when there is not enough memory capacity to conduct the analysis (Link & Eaton, 2012). The check for convergence is applied after a specified number of iterations (iterationsteps = 5E3 = 5,000 by default). Regarding the sampling algorithm, we relied on the PROC MCMC default which is a random walk Metropolis with normal proposal for all parameters in our case except for the variance of the d-parameters (conjugate sampling) when a hyperprior is applied. Proposal tuning should generally be applied (maxtune = 100 loops – which is the maximum allowed number in PROC MCMC – for estimation with user-specified priors or hmaxtune = 100 for estimation with hierarchical priors by default), particularly when multidimensional models are estimated. The maximum number of tuning iterations can also be specified (ntu = 500 iterations – which is the maximum allowed number in PROC MCMC – for estimation with user-specified priors or hntu = 500 for estimation with hierarchical priors by default). As the tuning step implemented in PROC MCMC can be time consuming, the number of iterations for the estimation of the model parameters (iterationsteps) for a given PROC MCMC run should not be chosen too small, because each “iteration step” starts with another tuning step. When a model needs many steps, it might be reasonable to abort the job and choose a larger number of iterations for each step. Note that oftentimes SAS has to be closed to stop PROC MCMC (the “core” of the macro), because SAS does not respond to clicks on the break button. It is thus useful to start a new SAS instance for each job.

A simulation study

We tested the performance of the method applied by the HBMIRT macro in a small simulation study with a unidimensional 2PL model with three different sample sizes (N = 50, N = 100, and N = 500) and a common number of items (k = 25). Item and person parameters were randomly drawn from normal distributions: a ~ N(1, 0.04), d ~ N(0, 1), and θ ~ N(0, 1) for each data replicate. The distribution of the slopes in our study closely approximates the log-normal distribution with μ = 0 and σ2 = 0.04 (mean and variance: E[a] = 1.02, Var[a] = 0.04) used in a simulation study based on a 2PL model by Monroe (2021). It is also comparable to the log-normal distribution with μ = 0 and σ2 = 0.06 in König et al. (2020) regarding the mean (E[a] = 1.03) which, however, shows higher variance (Var[a] = 0.07). Zhang and Zhao (2019) applied a uniform distribution ranging from 0.7 to 1.5 to generate discrimination parameters in their simulation study, which implied a (slightly) higher mean and variance (E[a] = 1.15, Var[a] = 0.05) compared to our study. Item difficulties were based on normal distributions in all three simulation studies with μ = 0 and σ2 = 1 (König et al., 2020; Zhang & Zhao, 2019) or σ2 = 0.56 (Monroe, 2021). The item difficulty distribution (b = -d/a) in our study based on uncorrelated item intercepts with d ~ N(0, 1) and slopes a ~ N(1, 0.04) has an identical mean (μ = 0) but a (slightly) larger variance (σ2 = 1.15).

The data sets were analyzed by a frequentist maximum likelihood approach (SAS PROC IRT with logistic link and option RESFUNC = nominal to get intercept estimates instead of difficulty estimates) and the HBMIRT macro with four different settings for the priors—uninformative: a ~ N(1, 100), d ~ N(0, 100); informative: a ~ N(1, 4), d ~ N(0, 4); hierarchical with hyperparameters with assigned hyperprior distributions (µa ~ U(-6, 6), σ2a ~ IG(0.01, 0.01), µd ~ U(-6, 6), σ2d ~ IG(0.01, 0.01) ; correct priors reflecting perfect prior knowledge of the population distributions: a ~ N(1, 0.04), d ~ N(0, 1)—and default settings for parameter constraints (one item with positivity constraint) and convergence (PSR < 1.1; ESSconv = 0, i.e., ESS criterion not applied). A total of 100 replicates were generated and then analyzed with each estimation approach and prior setting.

The results are presented in Table 1. With regard to the slope parameters, a rather large average bias (bias ≥ 0.215) was observed for very small samples (N = 50) except for the hierarchical Bayesian approach (bias = 0.050, though still statistically significantly larger than zero, 95% CI [0.019, 0.079]) and, as expected, for the approach with correct priors (both upper limits of the 95% CIs were below the lower limits of all other approaches in this condition). With N = 100, bias was negligible for the maximum likelihood (ML) approach and the Bayesian approaches with hierarchical and correct priors (bias ≤ 0.040 and overlapping 95% CIs) but large for the Bayesian approaches with uninformative and informative priors (bias ≥ 0.236; lower limits of the 95% CIs larger than the respective upper limits of all other approaches). In the large sample condition (N = 500), bias was negligible for all estimation approaches (bias ≤ 0.052; but statistically significantly different from zero for the estimation with (un-)informative priors). The estimated variances of the parameter estimates (squared SE for ML, and posterior variance for Bayesian approaches) and the root mean squared errors (RMSE) were extremely large, in particular for ML in the N = 50 condition. A closer look at the results showed that this was partly due to a few very extreme ML estimates: The 99th percentile for the variances were 2.16 and 1.85 for the a and d parameters, respectively, which is far below the average variances of 29.41 and 57.88, respectively. Therefore, we additionally report the percentiles (5th, 50th, and 95th) for the length of the 95% confidence intervals (ML estimate ± 1.96 ∙ SE) and credibility intervals (highest posterior density, HPD, and equal tail intervals). In all conditions the upper limits for the variances for both a and d parameters of the 95% CIs for the hierarchical and correct prior specification were below the respective lower limits of all other approaches. This was also the case regarding the accuracy of the estimation (RMSE) with the exception of the d parameters in the N = 500 condition. Regarding the credibility intervals, the results showed acceptable coverage rates for the HPD intervals which – as they are always smaller than the equal tail intervals (Congdon, 2006) – were preferable over the equal tail intervals here. Therefore, we refer only to the HPD intervals in the following. Although the coverage rate was mostly acceptable across the conditions and the estimation approaches (with the exception of the Bayesian approach with uninformative priors and N = 50), we found large differences regarding the length of the intervals, which was on average markedly smaller for the hierarchical Bayesian approach compared with all other approaches with the exception of the correct prior specification and which is in line with the results for the variances. The same was true for the percentiles. This means that even in the case of N = 500, the hierarchical Bayesian approach outperformed ML estimation in this regard, with 3% compared to 31% larger average 95% CIs than those from the Bayesian approach with correct priors. Whereas in the N = 50 condition, the 95% CIs from the hierarchical Bayesian approach were 44% larger than those from the Bayesian approach with correct priors, in the N = 100 condition, they were only 18% larger than those from the Bayesian approach with correct priors.

Table 1 Results from simulation study for maximum likelihood (ML) estimation and Bayesian estimation with uninformative priors, informative priors, hierarchical priors, and correct priors

Regarding the item intercepts, the results were overall very similar to those for the item slopes with two exceptions: (1) Only minor bias was observed across all conditions. (2) Although the hierarchical Bayesian approach showed the highest precision in terms of the smallest average length of the 95% CIs compared with the other estimation approaches (except the Bayesian approach with correct priors), this advantage was less pronounced than for the item slopes. Compared to the correct prior specification, the 95% CIs for the hierarchical Bayesian approach were only 6% larger on average across all sample sizes, whereas for ML estimation, this difference was 151%.

It should be noted that in Table 1 the term bias refers to average bias across replications for all items. Another important aspect is the dependence of bias on the size of the population parameter. It is a well-known fact that in Bayesian estimation with informative priors estimates are “pulled” towards the mean of the prior distribution (e.g., van de Schoot et al., 2017). In the case of a hierarchical Bayesian approach this effect is also expected as the variance hyperparameters of the priors will typically be rather small and thus quite informative. Therefore, we also investigated the linear association of bias and population parameters. The results for the bias of the item slope parameters are shown in Table 2. Regarding the association of bias and population parameters the results show statistically significant positive (ML—except for the N = 50 condition—, uninformative, informative prior) or negative (hierarchical and correct priors) linear associations for the N = 50 and the N = 100 condition. However, almost no statistically significant associations but for hierarchical and the correct priors were found in the N = 500 condition. The negative associations for the hierarchical and correct priors are in line with the assumption of a “shrinkage to the mean” of the prior distribution: In the N = 100 condition, for instance, applying the correct prior would be expected to lead to the bias (âjaj) = 0.000 + (-0.726) · (aj – M(aj)) (the average bias was <0.001 as can be seen in Table 1). This means that the expected upward bias for a population parameter one unit below the average would be 0.726, or, equivalently, the expected downward bias for a population parameter one unit above the average would be -0.726.

Table 2 Linear associations of item slope bias and the size of the population parameter by sample size and estimation approach

With regard to the item intercepts (Table 3), the association between population parameter and bias was statistically significant for all approaches and all sample sizes with the same pattern as in the case of the item slopes (negative associations for the hierarchical and the correct priors, positive associations for all other approaches).

Table 3 Linear associations of item intercept bias and the size of the population parameter by sample size and estimation approach

We also checked the coverage and (estimated) reliability of the EAP person parameter estimates from the Bayesian models (Table 4). As a reliability estimate (also computed by the HBMIRT macro by default), we used the variance of the EAP scores, which underestimates the true variance of the person parameters by (1 – reliability) percent (Mislevy et al., 1992). The results show an excellent coverage for the CIsFootnote 4 based on estimate ± 1.96 · posterior SD from the Bayesian approaches with hierarchical and correct priors across all conditions, whereas for the Bayesian approaches with uninformative and informative priors, this was the case only in the N = 500 condition. A similar picture was found for the estimated reliabilities that were strongly underestimated (compared to the true reliabilities) when uninformative and informative priors were used, except for the N = 500 condition.

Table 4 Coverage and reliabilities (estimated and true) of the expected a posteriori (EAP) estimates from the Bayesian estimation

The simulation study was conducted using SAS 9.4 TS Level 1M6 on a Windows Version 1.0.19041 X64_10PRO platform notebook with an Intel Core i7-4600U CPU and 16 GB RAM. The average computation times for the estimation of the Bayesian models in the simulation study with the HBMIRT macro (that makes use of the multithreading capability of PROC MCMC) are given in Table 5. Further improvements regarding estimation time might be reached by specifying more appropriate starting values that should lead to faster convergence (i.e., less iterations on average). A very important aspect with regard to the computation time is a sufficient number of tuning loops for the proposal distributions (see, e.g., Hecht et al., 2021, who reported mixed results about the impact of the number of warmup iterations on the MCMC efficiency). On the one hand, a large number should lead to higher efficiency of the chain and, therefore, to a lower number of iterations needed to achieve a given ESS, for instance. On the other hand, the proposal tuning process might need a lot of computational resources. We are hesitant to give a general advice regarding the optimal settings for the proposal tuning. Based on a few experiments we skipped the tuning phase for non-hierarchical models (maxtune = 0) in our simulation study, but used proposal tuning for the hierarchical approach (hmaxtune = 100). If proposal tuning is needed, the number of iterations before the next check of the stopping criteria should not be chosen too small (e.g., iterationsteps = 2500), because the proposal tuning process starts again after each step. But if the proposal tuning phase can be skipped, (much) shorter “iteration steps” (e.g., iterationsteps = 500) can be chosen that my lead to a smaller number of total iterations given the stopping criteria. For instance, the first 10 replications of the model with the correct prior specification in our simulation study in the N = 500 condition with the applied setting (iterationsteps = 2500) all fulfilled the stopping criterion (PSRconv = 1.1) after the first “step” (average computation time: 3 min 3 s). With a smaller number of “iteration steps” (iterationsteps = 500), on average, the stopping criterion was fulfilled after 1,200 iterations (average computation time: 2 min 9 s). Another suggestion for reducing computational time in Bayesian estimation based on model reformulations was proposed by Hecht et al. (2020; see also Hecht & Zitzmann, 2020) and Merkle et al. (2021).

Table 5 Average computation time by model prior specification and sample size

Discussion

During the last decades IRT has become the standard psychometric approach for scaling tests, in particular when dichotomously coded items (responses coded as correct or false) are used. The estimation of such models, however, typically requires rather large sample sizes which makes it difficult for applied researchers in many fields to make use of IRT. This drawback can be overcome by Bayesian IRT approaches with hierarchical priors that allow to estimate (multidimensional) IRT models in rather small samples as recent studies have demonstrated (Fujimoto & Neugebauer, 2020; Gilholm et al., 2021; König et al., 2020; Sheng, 2013). To date, such approaches might be under-used by applied researchers for two reasons: First, researchers may not be aware of the advantages of the Bayesian IRT framework with hierarchical priors. Second, it might be quite difficult for non-experts to set up such IRT models in general purpose (e.g. SAS; SAS Institute Inc., 2018) or specialized (e.g. Stan; Carpenter et al., 2017) statistical computer programs. For these reasons, we developed the HBMIRT macro for SAS with the central aim of user-friendliness. We tried to keep the model specification as simple as possible (comparable to other existing IRT tools such as SAS PROC IRT)—for instance, by adding many default settings—but still allowing for setting up quite complex multidimensional models. Further, we added an automated convergence control by applying a stop criterion analog to that in Mplus as default (which can easily be “switched off” or modified by the user). Finally, we gauged the macro performance (convergence, parameter recovery) in a simulation study.

To evaluate the method employed by the HBMIRT macro, we simulated data according to unidimensional 2PL models each with 25 items and different sample sizes (N = 50, N = 100, and N = 500). For each condition 100 data sets were generated and item parameters as well as person parameters were estimated based on Bayesian IRT models with the HBMIRT macro (uninformative, informative, hierarchical, and correct priors—the latter exactly matched the distributions from which the item parameters were drawn) as well as an ML-IRT model (SAS PROC IRT). The results showed that the convergence criterion was reached in virtually all cases and that the coverage rates for the item parameters were mostly acceptable for all estimation approaches (i.e., the 95% CIs included the respective population parameters in approximately 95% of the cases). Slope parameter estimates from the Bayesian approaches with (un-)informative priors were biased, particularly in the N = 50 condition, whereas estimates based on hierarchical and correct priors were (largely) unbiased across all conditions. Moreover, regarding the precision of the slope parameter estimates in terms of the length of the 95% CIs, the results showed a clear advantage of the hierarchical approach over all other approaches—except, of course, the correct prior specification. Interestingly, this was even the case in the N = 500 condition, where the hierarchical Bayesian approach outperformed ML estimation with 3% compared to 31% larger average 95% CIs for the slope parameter estimates than those from the Bayesian approach with correct priors. This means that even in this condition, the prior still had a non-negligible impact on the posterior distribution (see Table 2) of the slope parameter estimates and a higher accuracy in terms of the RMSE for the hierarchical Bayesian compared to all other approaches—except the correct prior specification—by “borrowing” information from other slope parameter estimates in the model. The results for the item intercepts showed a comparable but less pronounced pattern as in the case of the slope parameters for the different estimation approaches with the exception that bias was generally very small. Further, with regard to the EAP person parameter estimates from the Bayesian approaches, we found almost perfect coverages (95% CIs) and unbiased reliability estimates only for the estimation based on hierarchical priors and the correct priors across all conditions. The reliability estimates may be biased in both uninformative and informative prior approaches when these estimates are compared to their true values (determined by the squared correlations of EAP estimates with population parameters). This could result from a positive bias in slope parameters, which is “compensated” by a reduced variance of the EAP estimates, our chosen reliability index. Taken together, the main findings of our simulation study regarding the hierarchical Bayesian approach are in line with those from prior studies, which also found unbiased item parameter estimates even in small(er) samples (Fujimoto & Neugebauer, 2020; Sheng, 2013) and, additionally, highly accurate person parameter estimates (König et al., 2020, 2022) for hierarchical Bayesian IRT approaches.

Conclusion

In particular with small and medium sample sizes and/or more complex models, less biased and more accurate parameter estimates in terms of a smaller RMSE can be expected with the Bayesian IRT approach with hierarchical priors compared to (un-)informative priors or ML estimation. With the HBMIRT macro for SAS the benefits of this approach are no longer restricted to experts with a strong background in Bayesian modelling. Uni- and multidimensional Bayesian IRT models with hierarchical priors for dichotomous items can easily be specified and estimated with automatic control of convergence by a default stop criterion. For more advanced users, the macro defaults can be modified, for instance, to estimate models with probit-link instead of the default logit-link, to alter the stop criterion (e.g., ESS as an alternative or additional criterion) or the hierarchical prior distributions, and to apply positivity constraints to specific item slope parameters or zero constraints on specific intercorrelations of latent dimensions (factors).