An item response theory analysis of the matrix reasoning item bank (MaRs-IB)

Matrix reasoning tasks are among the most widely used measures of cognitive ability in the behavioral sciences, but the lack of matrix reasoning tests in the public domain complicates their use. Here, we present an extensive investigation and psychometric validation of the matrix reasoning item bank (MaRs-IB), an open-access set of matrix reasoning items. In a first study, we calibrate the psychometric functioning of the items in the MaRs-IB in a large sample of adult participants (N = 1501). Using additive multilevel item structure models, we establish that the MaRs-IB has many desirable psychometric properties: its items span a wide range of difficulty, possess medium-to-large levels of discrimination, and exhibit robust associations between item complexity and difficulty. However, we also find that item clones are not always psychometrically equivalent and cannot be assumed to be exchangeable. In a second study, we demonstrate how experimenters can use the estimated item parameters to design new matrix reasoning tests using optimal item assembly. Specifically, we design and validate two new sets of test forms in an independent sample of adults (N = 600). We find these new tests possess good reliability and convergent validity with an established measure of matrix reasoning. We hope that the materials and results made available here will encourage experimenters to use the MaRs-IB in their research. Supplementary Information The online version contains supplementary material available at 10.3758/s13428-023-02067-8.

participants having reached that item. The logic is that, if items appearing later in the fixed-order test were disproportionately reached by participants sacrificing accuracy for speed, then we should observe a positive correlation between the total number of available responses and the proportion correct amongst the easiest items.
We detected strong positive correlations between proportion correct and number reached (dimension 1 items: ρ = 0.514, p = 0.050; dimension 2 items: ρ = 0.767, p < 0.001; combined: ρ = 0.695, p < 0.001). This result supports the hypothesis that participants who reached items later in the test did so by prioritizing speed at the expense of accuracy. It also suggests that the summary statistics for later items released as part of Chierchia et al. (2019) are likely biased indicators of item difficulty.
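This sanity check reduces to a rank correlation between two item-level summaries. A minimal sketch (the `spearman` helper and the item summaries below are our own illustration, not the analysis code or data):

```python
def _ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical item summaries: the number of participants reaching each
# item, and that item's observed proportion correct.
n_reached = [480, 455, 430, 390, 350, 310]
prop_correct = [0.91, 0.88, 0.86, 0.80, 0.74, 0.69]
rho = spearman(n_reached, prop_correct)  # positive: fewer reached, lower accuracy
```

A positive ρ here mirrors the pattern reported above: items reached by fewer (faster) participants show deflated accuracy.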

Defining a threshold for rapid guessing
In online testing environments, it is inevitable that some participants will not engage meaningfully with an experiment and will instead engage in careless or insufficient-effort responding. On matrix reasoning tasks, one such low-effort strategy is rapid guessing, wherein participants respond in so short a time that they could not have meaningfully considered the item (Wise, 2017). In sufficient quantities, rapid guesses can systematically bias estimates of item parameters. Thus, if possible, rapid guess responses should be identified and removed.
There are a number of approaches for identifying rapid guess responses (for a review, see Wise, 2017). We opted for a threshold approach, in which responses faster than a particular time are denoted as rapid guesses and participants exhibiting too many rapid guessing responses are excluded from the data. To define this threshold, we fit an extended version of the effort-moderated item response theory (EM-IRT) model (Wise & DeMars, 2006) to a small dataset of responses collected during piloting. In the EM-IRT model, the probability of a correct response by participant i to item j is defined as the following mixture:

Pr(y_ij = 1) = w_ij · γ_j + (1 − w_ij) · [γ_j + (1 − γ_j) · logit⁻¹(α_j(θ_i − β_j))]

where θ_i is the latent ability for person i, and β_j, α_j, and γ_j are the difficulty, discrimination, and guessing parameters for item j. As in the main text, we fixed the guessing parameter for every item clone to the nominal guessing rate (γ_j = 0.25).
Crucially, w_ij is a weight parameter, bounded between zero and one, that controls whether a participant responds effortfully in accordance with their ability (w_ij → 0) or engages in rapid guessing (w_ij → 1).
Here we defined the rapid guessing weight as a function of the participant's response time on that trial:

w_ij = logit⁻¹( Σ_n ζ_n · z_ij^n )

where z_ij is the (log-transformed) response time for participant i on item j, and ζ_n are regression coefficients mapping response times to rapid guessing weights. Thus, this form of the EM-IRT model learns, in a data-driven fashion, a function classifying responses as having originated from effortful or rapid guessing response strategies.
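For concreteness, the mixture can be written out directly. The sketch below is our own illustration of the model equation (function names are ours), with the guessing rate fixed at γ = 0.25 as in the main text:

```python
import math

def inv_logit(x):
    """logit^{-1}(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def em_irt_prob(theta, beta, alpha, w, gamma=0.25):
    """Effort-moderated response probability.

    w -> 1: rapid guessing, so the probability collapses to the nominal
    guessing rate gamma. w -> 0: effortful responding governed by the
    ability/difficulty/discrimination curve with a guessing floor.
    """
    p_effort = gamma + (1.0 - gamma) * inv_logit(alpha * (theta - beta))
    return w * gamma + (1.0 - w) * p_effort
```

For example, a person of average ability (θ = 0) facing an average item (β = 0, α = 1) has a 62.5% chance of a correct response when effortful (w = 0), but only the 25% chance level when rapid guessing (w = 1).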
We fit this model to data from N = 180 participants recruited from the Prolific Academic platform as part of a pilot experiment (independent of the experiments presented in the main text). Each participant completed one of two sets of eight items from the MaRs-IB. The EM-IRT model was estimated within a Bayesian framework using Hamiltonian Monte Carlo as implemented in Stan (v2.22) (Carpenter et al., 2017). Four separate chains with randomised start values each took 3,000 samples from the posterior. The first 2,000 samples from each chain were discarded, such that 4,000 post-warmup samples from the joint posterior were retained. The R̂ values for all parameters were equal to or less than 1.01, indicating acceptable convergence between chains, and there were no divergent transitions in any chain.
The estimated weights (w_ij) across all subjects and items are plotted as a function of their corresponding response times in Figure S3. As can be observed, the weights quickly approach 1 for response times faster than 5 seconds. Interestingly, the weights begin to rise again for responses slower than 20 seconds, suggesting that participants have an internal awareness of the amount of time that has elapsed.
To define a threshold for rapid guessing, we found the response time at which w = 0.5, which was approximately 3 seconds. Thus, in the main experiments we defined rapid guessing as responses taking fewer than 3 seconds.
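Numerically, the threshold is simply the response time at which the fitted weight curve crosses 0.5, which can be found by bisection. A sketch with made-up coefficients (the quadratic-in-log-RT form and the ζ values here are illustrative only, chosen so the crossing falls at 3 s; they are not the fitted pilot values):

```python
import math

def w_of_rt(rt, zeta):
    """Rapid-guessing weight as a logistic function of log response time.

    zeta holds illustrative polynomial coefficients (intercept, linear,
    quadratic) -- NOT the values estimated from the pilot data.
    """
    z = math.log(rt)
    eta = zeta[0] + zeta[1] * z + zeta[2] * z * z
    return 1.0 / (1.0 + math.exp(-eta))

def rapid_guess_threshold(zeta, lo=0.5, hi=10.0, tol=1e-6):
    """Bisect for the fast-RT crossing where w(rt) = 0.5.

    Assumes w decreases with rt over [lo, hi], as in the fast-response
    regime of the fitted curve.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if w_of_rt(mid, zeta) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# coefficients chosen so that w = 0.5 at exactly rt = 3 seconds
zeta = [2.0 * math.log(3.0), -2.0, 0.0]
threshold = rapid_guess_threshold(zeta)  # ~3.0 s
```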

Organization of geometric stimuli in the MaRs-IB
Every item template in the MaRs-IB has three unique versions that differ only in the geometric shapes populating their cells. There are 45 unique geometric shapes in total, organized into nine stimulus sets, where each item clone draws a subset of shapes from one of the nine sets (Figure S2). As such, shape set (indicating whether a clone is the 1st, 2nd, or 3rd version of an item template) is not a meaningful nominal variable. Furthermore, item clones drawing from the same shape set are not always populated by the same geometric shapes, which complicates modeling the nine stimulus sets instead. The most rigorous way of modeling the effect of a given geometric shape's presence or absence on item functioning would be to include each shape as a binary attribute predicting item difficulty and discrimination. To keep our models simple, however, we elected to leave shape set unmodeled, allowing its influence to be absorbed by the residual variability terms instead.

Bayesian additive multilevel item structure (AMIS) model priors
In the additive multilevel item structure (AMIS) models, there are three families of parameters: item difficulty parameters, item discrimination parameters, and person ability parameters. We describe the priors for each set of parameters in turn. The model code is publicly available at https://github.com/ndawlab/mars-irt.
The item difficulty parameters can be decomposed into fixed and random effects components. The fixed effects components include the grand mean, µ_β, the effects of template-level (level-1) attributes, δ_βn, and the effects of clone-level (level-2) attributes, δ_βm. All three fixed effects terms were assigned standard normal priors, N(0, 1). The template- and clone-level random effects components, ϵ_βj and ϵ_βk, were each assigned normally distributed priors with a mean of zero and standard deviations estimated by the model, σ_βj and σ_βk. These standard deviations were, in turn, assigned half-Student-t priors with a location of zero, a scale of one, and three degrees of freedom, StudentT(3, 0, 1).
Similarly, the item discrimination parameters can be decomposed into fixed and random effects components. The fixed effects components include the grand mean, µ_α, the effects of template-level (level-1) attributes, δ_αn, and the effects of clone-level (level-2) attributes, δ_αm. All three fixed effects terms were assigned standard normal priors, N(0, 1). The template- and clone-level random effects components, ϵ_αj and ϵ_αk, were each assigned normally distributed priors with a mean of zero and standard deviations estimated by the model, σ_αj and σ_αk. These standard deviations were, in turn, assigned half-Student-t priors with a location of zero, a scale of one, and three degrees of freedom, StudentT(3, 0, 1). Importantly, the item discrimination parameters were restricted to the range α_jk ∈ [0, 5]. This was achieved by passing the clone-level discrimination parameter through the standard normal cumulative distribution function (i.e., scaling it to between 0 and 1) and multiplying the result by 5.
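The bounding transform described above can be sketched as follows (the decomposition into arguments is our own notation; `upper = 5` matches the stated range):

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clone_discrimination(mu_a, delta_a, eps_template, eps_clone, upper=5.0):
    """Map an unconstrained discrimination score into [0, upper].

    Phi(raw) squashes the additive predictor into (0, 1); multiplying
    by the upper bound yields a discrimination in (0, upper).
    """
    raw = mu_a + delta_a + eps_template + eps_clone
    return upper * normal_cdf(raw)
```

For example, a clone whose additive predictor sums to zero lands at the midpoint of the range, α = 2.5.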
Finally, the person ability parameters can be decomposed into fixed and random effects components. The fixed effects components comprise the partial correlation coefficients, ρ_p. Before transformation, the partial correlation coefficients were assigned standard normal priors, N(0, 1). These values were then transformed such that the sum of their squared values could not exceed 1. The random effects components, ϵ_θ, were assigned a normally distributed prior with a mean of zero and a variance equal to 1 minus the sum of squared partial correlations (i.e., fixing the variance of the person ability distribution to 1).
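The variance bookkeeping implied by this construction can be verified by simulation. A sketch (our own illustration, with arbitrary partial correlations satisfying Σρ² < 1 and standardized covariates):

```python
import math
import random

def simulate_theta(rhos, n=200_000, seed=1):
    """Draw theta = sum_p rho_p * x_p + eps, with Var(eps) = 1 - sum(rho^2).

    Because the covariates x_p are independent standard normals, this
    fixes the marginal variance of theta at exactly 1.
    """
    rng = random.Random(seed)
    resid_sd = math.sqrt(1.0 - sum(r * r for r in rhos))
    out = []
    for _ in range(n):
        covariates = [rng.gauss(0.0, 1.0) for _ in rhos]
        eps = rng.gauss(0.0, resid_sd)
        out.append(sum(r * x for r, x in zip(rhos, covariates)) + eps)
    return out

thetas = simulate_theta([0.4, 0.3])  # illustrative partial correlations
mean = sum(thetas) / len(thetas)
var = sum((t - mean) ** 2 for t in thetas) / len(thetas)  # ~1.0
```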

Parameter recovery & power analysis
In order to assess the robustness of our item calibration analysis, we performed a parameter recovery and power analysis. The objectives of this analysis were threefold. First, we wanted to evaluate our ability to accurately estimate person and item parameters given our sample size and experiment design. Second, we wanted to quantify our power to detect associations between item attributes and item functioning (i.e., difficulty and discrimination) across different magnitudes of association. Third, we wanted to determine how imprecision in our estimates of item parameters impacts optimal test assembly and test score reliability.
We generated 100 artificial datasets whose statistical properties were matched to those observed empirically in the item calibration dataset. Specifically, each dataset comprised simulated item responses from 1,500 "participants" whose latent abilities were drawn from a standard normal distribution. Each participant "completed" 16 item clones from a pool of 384 item clones (nested within 64 item templates). Thus, each artificial item clone was completed by an average of 62.5 participants. The simulated distributions of item difficulty and discrimination parameters were matched to those observed empirically (µ_β = 0.20, σ_β = 1.55; µ_α = 1.30, σ_α = 0.22), as was the total variance in each parameter explained by item attributes (r²_β ≈ 60%; r²_α ≈ 40%). To estimate our statistical power to detect associations between item attributes and parameters, we specified two large effects and two small effects per item parameter. That is, two pairs of attributes explained 20% and 10% of the variance in item difficulty, respectively, and two pairs of attributes explained 14% and 7% of the variance in item discrimination, respectively. Based on what we observed empirically, the residual variance in item difficulty and discrimination parameters was split equally between the template (ϵ_j) and clone (ϵ_k) levels. Models 5 and 6 were fitted to each simulated dataset. We tested Model 5 because it was deemed the best-fitting, parsimonious model in the item calibration study. We tested Model 6 because it is the true data-generating model for the simulations.
Comparing these two models therefore allows us to evaluate whether selecting Model 5 produces any systematic bias in our analyses. All models were estimated within a Bayesian framework using Hamiltonian Monte Carlo as implemented in Stan (v2.22). For all models, four separate chains with randomised start values each took 4,000 samples from the posterior. The first 3,000 samples from each chain were discarded, such that 4,000 post-warmup samples from the joint posterior were retained. The R̂ values for all parameters were less than 1.02, indicating acceptable convergence between chains, and there were no divergent transitions in any chain.
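The simulation design described above can be sketched in a simplified form. The helper below is our own illustration (384 clones nested in 64 templates, 16 clones per simulated participant, guessing fixed at 0.25, and the residual variance split equally across levels); it omits the attribute structure of the real simulations:

```python
import math
import random

def simulate_responses(n_person=1500, n_templates=64, clones_per=6,
                       items_per_person=16, seed=0):
    """Simulate (person, item, correct) triples for 384 nested clones."""
    rng = random.Random(seed)
    # split each parameter's total variance equally across the template
    # and clone levels, as observed empirically
    sd_b = 1.55 / math.sqrt(2.0)
    sd_a = 0.22 / math.sqrt(2.0)
    betas, alphas = [], []
    for _ in range(n_templates):
        b_t = rng.gauss(0.20, sd_b)   # template-level difficulty effect
        a_t = rng.gauss(1.30, sd_a)   # template-level discrimination effect
        for _ in range(clones_per):
            betas.append(b_t + rng.gauss(0.0, sd_b))
            alphas.append(max(a_t + rng.gauss(0.0, sd_a), 0.05))
    n_items = n_templates * clones_per
    data = []
    for i in range(n_person):
        theta = rng.gauss(0.0, 1.0)
        for j in rng.sample(range(n_items), items_per_person):
            # 3PL with guessing floor fixed at 0.25
            p = 0.25 + 0.75 / (1.0 + math.exp(-alphas[j] * (theta - betas[j])))
            data.append((i, j, int(rng.random() < p)))
    return data
```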
The results of the parameter recovery study are summarized in Figure S4. For both models, we observed good recovery of the latent ability parameters (Model 5: r = 0.873; Model 6: r = 0.873), and excellent recovery of the item difficulty parameters (Model 5: r = 0.962; Model 6: r = 0.962). In contrast, we observed only adequate recovery of the item discrimination parameters (Model 5: r = 0.667; Model 6: r = 0.671). This result is not altogether unexpected given recent research recommending at least N = 100 observations per item for accurate item parameter estimation (König, Spoden, & Frey, 2020). This result, however, must be interpreted in context: we demonstrate below that the less-than-ideal recovery of item discrimination parameters does not seriously impact the outputs of optimal test assembly.
Next, we calculated the statistical power (i.e., true positive rate) for detecting an association between item attributes and item parameters. We defined a true positive as an association between an item attribute and parameter whose 95% highest density interval excluded zero. The results are summarized in the second column of Figure S4.
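This decision rule (a 95% HDI excluding zero) can be applied directly to posterior draws. A sketch, assuming the standard sorted-window HDI estimator for unimodal posteriors (helper names are ours):

```python
def hdi(samples, mass=0.95):
    """Highest density interval: the narrowest window of ordered draws
    containing the requested posterior mass (valid for unimodal samples)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(mass * n)))
    best_lo, best_width = 0, float("inf")
    for i in range(n - k + 1):
        width = xs[i + k - 1] - xs[i]
        if width < best_width:
            best_lo, best_width = i, width
    return xs[best_lo], xs[best_lo + k - 1]

def excludes_zero(samples, mass=0.95):
    """True positive criterion: the HDI lies entirely above or below zero."""
    lo, hi = hdi(samples, mass)
    return lo > 0 or hi < 0
```

Applied to the posterior draws of each attribute effect, `excludes_zero` flags the effect as detected, and the true positive rate is the fraction of simulations in which a nonzero simulated effect is flagged.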
For the item difficulty parameters, we detected both the large association (20% variance explained) and the small association (10% variance explained) in all 100 simulations. This result indicates that we were suitably powered to detect associations between item attributes and item difficulty of the magnitudes reported in the manuscript. In contrast, true positive rates for contrasts involving item discrimination were smaller. For large effects (14% variance explained), the TPR was approximately 70% (Model 5: TPR = 0.720; Model 6: TPR = 0.700). For small effects (7% variance explained), the TPR was approximately 40% (Model 5: TPR = 0.435; Model 6: TPR = 0.400). Thus, while our study design was moderately powered to detect larger associations between item attributes and item discrimination, we were underpowered to detect smaller associations. As such, our mostly null findings for associations between item attributes and item discrimination should be interpreted with caution. (That said, if any such associations do exist, they are likely small in magnitude.) Regardless, this result does not alter one of the main findings of the item calibration study; namely, that item clones systematically differ in difficulty by distractor type and are therefore not exchangeable.
Finally, we investigated the effects of item parameter misestimation on test assembly. For each simulated dataset, we submitted the ground-truth (simulated) and model-predicted (recovered) item parameters to test assembly under the same constraints used to generate the MaRs-IB short form measures. We then calculated each test form's test information function (TIF) and IRT test score reliability (Kim & Feldt, 2010; Nicewander, 2018). Crucially, we calculated the TIF and reliability for all short forms using the true (i.e., not recovered) item parameters; we can therefore quantify how much the psychometric properties of test forms are degraded by suboptimal item selection due to parameter estimation noise. The results are summarized in the third and fourth columns of Figure S4. As expected, the TIF was larger for the test forms assembled using the true item parameters, indicating that estimation noise can lead to suboptimal item selection. Despite this, estimation noise led to only marginal decrements in IRT test reliability. For item parameters estimated using Model 5, parameter noise resulted in an average loss of score reliability of ∆ρ = 0.020 (SD = 0.014); for Model 6, the average loss was ∆ρ = 0.019 (SD = 0.013).
In sum, our sample size and experiment design were sufficient with regard to our study objectives. The results of the parameter recovery analysis demonstrate that we are able to estimate item difficulty parameters, and their associations with item attributes, with excellent precision. In contrast, we were able to estimate item discrimination parameters, and their associations with item attributes, with only moderate precision. Because the observed variability in item discrimination parameters was small, however, imprecision in the estimation of these parameters had minimal impact on subsequent test assembly. Indeed, the loss of score reliability due to suboptimal item selection was negligible.

Correlations between performance on the MaRs-IB short/long form and self-report

Figure S1. Item parameter estimates for the best-fitting model (Model 5) with and without including mean response time as a clone-level attribute.

Figure S3. Estimates of the rapid guessing weights (w_ij) and their corresponding response times from the effort-moderated item response theory (EM-IRT) model fit to response data from N = 180 pilot participants. The dashed line indicates the chosen rapid guessing threshold, i.e., where w = 0.5.