A two-step, test-guided Mokken scale analysis, for nonclustered and clustered data

Purpose: Mokken scale analysis (MSA) is an attractive scaling procedure for ordinal data. MSA is frequently used in health-related quality of life research. Two of MSA's prime features are the scalability coefficients and the automated item selection procedure (AISP). The AISP partitions a (large) set of items into scales based on the observed item scores; the resulting scales can be used as measurement instruments. Two issues exist in MSA: First, point estimates, standard errors, and test statistics for scalability coefficients are inappropriate for clustered item scores, which are omnipresent in quality of life research data. Second, the AISP insufficiently takes sampling fluctuation of Mokken's scalability coefficients into account.
Methods: We solved both issues by providing point estimates and standard errors for the scalability coefficients for clustered data and by implementing a Wald-based significance test in the AISP algorithm, resulting in a test-guided AISP (T-AISP) that is available for both nonclustered and clustered test scores.
Results: We integrated the T-AISP into a two-step, test-guided MSA for scale construction, to guide the analysis for nonclustered and clustered data. The first step is performing a T-AISP and selecting the final scale(s). For clustered data, within-group dependency is investigated on the final scale(s). In the second step, the strength of the scale(s) is determined and further analyses are performed. The procedure was demonstrated on clustered item scores obtained from administering a questionnaire on quality of life in schools to 639 students nested in 30 classrooms.
Conclusions: We developed a two-step, test-guided MSA for scale construction that takes into account sampling fluctuation of all scalability coefficients and that can be applied to item scores obtained by a nonclustered or clustered sampling design.
Supplementary Information The online version contains supplementary material available at 10.1007/s11136-021-02840-2.

Items are indexed 1, 2, ..., I. The item-score pattern probabilities are collected in vector p.
Equations 1 to 3 estimate these probabilities by sample proportions, which amounts to averaging the frequencies across all respondents. Note that value S and subscript s both equal 1 in nonclustered data.
For clustered data, Snijders (2001) argued that the proportions in Equations 1 to 3 may be biased estimators of the probabilities when the group size R_s is related to the latent trait value of the group. To avoid biased estimates, he suggested computing the proportions as, for the univariate case, P̂(X_sri = 1) = (1/S) Σ_s (1/R_s) Σ_r X_sri (Eq. 4), and analogously for the bivariate proportions (Eq. 5) and the item-score pattern proportions in p (Eq. 6; see also Section 3.2 in Koopman et al., 2017). These computations amount to first averaging the frequencies per group into group proportions, and then averaging these group proportions across groups. Note that when group sizes are equal, there is no difference between averaging frequencies across all respondents (Eq. 1 to 3) and averaging group proportions (Eq. 4 to 6).
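The two pooling strategies can be contrasted in a few lines of code. This is an illustrative sketch (not the authors' code) for a single item with hypothetical clustered 0/1 scores; the variable names are ours.

```python
import numpy as np

# Hypothetical clustered 0/1 item scores: one array of scores per group,
# with deliberately unequal group sizes.
rng = np.random.default_rng(1)
scores = [rng.integers(0, 2, size=n) for n in (5, 10, 40)]

# Eq. 1-style estimate: average the frequencies across all respondents.
p_pooled = np.concatenate(scores).mean()

# Eq. 4-style estimate (Snijders, 2001): first compute a proportion per
# group, then average the group proportions across groups.
p_grouped = np.mean([g.mean() for g in scores])
```

With equal group sizes the two estimates coincide; with unequal group sizes they generally differ, because the grouped estimate weighs every group equally regardless of its size.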

Computing Scalability Coefficients and Standard Errors
At the population level, Mokken's scalability coefficients and Snijders' within-rater scalability coefficients are identical; hence we refer to both as H_ij for item-pair coefficients, H_i for item coefficients, and H for the total-scale coefficient. Let a (dichotomous) item set be ordered in descending popularity (or ascending difficulty), such that P(X_1 = 1) ≥ P(X_2 = 1) ≥ ... ≥ P(X_I = 1). Hence, item 1 is most popular (least difficult) and item I least popular (most difficult). For an item pair with i < j, let F_ij = P(X_i = 0, X_j = 1) denote the Guttman-error probability and E_ij = P(X_i = 0)P(X_j = 1) its expectation under marginal independence. Item-pair scalability coefficient H_ij is defined as H_ij = 1 − F_ij/E_ij (Eq. 8), the item scalability coefficient H_i as H_i = 1 − (Σ_{j≠i} F_ij)/(Σ_{j≠i} E_ij), where F_ji and E_ji are used for pairs with j < i (Eq. 9), and the total-scale coefficient H as H = 1 − (Σ_{i<j} F_ij)/(Σ_{i<j} E_ij) (Eq. 10) (for polytomous generalizations, see, e.g., Molenaar, 1991; Van der Ark, 2020). Mokken's scalability coefficients are estimated in data samples by replacing the probabilities P(X_sri = 1), P(X_sri = 0), and P(X_sri = 0, X_srj = 1) by the sample proportions in Equations 1 and 2, respectively, whereas Snijders' within-rater scalability coefficients are estimated by replacing the probabilities by the sample proportions in Equations 4 and 5, respectively.
Because there is no difference between Mokken's coefficients and Snijders' within-rater coefficients, the estimation methods for within-rater coefficients may be viable for estimating Mokken's coefficients in clustered data.
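The dichotomous coefficients defined above can be sketched in code as follows. This is a minimal illustration assuming a respondents-by-items 0/1 score matrix whose columns are already ordered by descending popularity; the function and variable names are ours, not part of any package.

```python
import numpy as np

def scalability(X):
    """Return (H_ij matrix, H_i vector, H) for a 0/1 matrix X whose columns
    are ordered by descending popularity. Only entries with i < j of the
    H_ij matrix are meaningful."""
    X = np.asarray(X)
    n, I = X.shape
    p1 = X.mean(axis=0)                  # sample P(X_i = 1)
    p0 = 1 - p1                          # sample P(X_i = 0)
    F = np.zeros((I, I))                 # observed Guttman-error proportions
    E = np.zeros((I, I))                 # expected proportions under independence
    for i in range(I):
        for j in range(i + 1, I):
            F[i, j] = np.mean((X[:, i] == 0) & (X[:, j] == 1))
            E[i, j] = p0[i] * p1[j]
    Hij = 1 - np.divide(F, E, out=np.zeros_like(F), where=E > 0)
    # For item i, sum errors over all pairs involving i (rows and columns).
    Hi = 1 - (F.sum(axis=0) + F.sum(axis=1)) / (E.sum(axis=0) + E.sum(axis=1))
    H = 1 - F.sum() / E.sum()
    return Hij, Hi, H
```

For data following a perfect Guttman pattern, no Guttman errors occur and all coefficients equal 1; values near 0 indicate that errors occur about as often as expected under independence.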
For nonclustered data, standard errors of scalability coefficients were derived by assuming that the item-score pattern frequencies follow a multinomial distribution with parameters p and R_s, where p is estimated using Equation 3 (Kuijpers et al., 2013). Bias of the point estimates and standard errors in nonclustered data was negligible (Kuijpers et al., 2016). For multi-rater data, standard errors were derived by assuming that the item-score pattern frequencies follow a multinomial distribution per group with parameters p and R_s, accounting for overdispersion in the data. Item-score probabilities in p are then estimated using Equation 6. Koopman, Zijlstra, De Rooij, and Van der Ark (2020) showed that bias of these standard errors was generally negligible, except when group sizes differed. A possible explanation is that the averaged group proportions from Equation 6 are used to estimate the standard errors, in which small groups weigh as heavily as large groups, even though they are usually less accurate. This can introduce relatively much noise, leading to overestimation of the standard error. For clustered data, we therefore propose replacing Equations 4 and 5 with Equations 1 and 2 to estimate the scalability coefficients in Equations 8 to 10, and replacing Equation 6 with Equation 3 to compute p for the multinomial distribution underlying the standard errors. Hence, we assume that there is no relation between the latent trait value of the group and the group size. We refer to the estimation method for nonclustered data as the one-level method and to our adapted estimation method for clustered data as the two-level method. The point estimates of the one- and two-level methods are identical, but their standard errors differ.

Simulation Study
To determine whether the two-level estimates of the scalability coefficients and standard errors outperform the one-level method in clustered data, we investigated their performance in a Monte Carlo simulation study (see, e.g., Morris et al., 2019). Because the one- and two-level point estimates are identical, we omit the level and refer to them simply as the point estimate. We investigated the performance of the point estimate, the one- and two-level standard errors, and the one- and two-level Wald-based and range-preserving confidence intervals. A Monte Carlo study involves repeatedly creating data by random sampling; because the true population model and parameter values are known, the behavior of the estimated statistic of interest can be evaluated under specific conditions, such as small samples. Syntax files are available for download from the Open Science Framework (OSF): http://osf.io/y7xud.

Method
Data were generated using a graded response model (Samejima, 1969), an item response model for polytomous data that is a special case of Mokken's NIRT model (Van der Ark, 2001).
Each respondent has a latent trait value θ sr = θ s + δ sr , which combines a group component with a respondent-specific (individual) component. Each simulated data set consisted of S groups, with R s respondents in group s, and item scores of seven 5-category items.
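The data-generating mechanism can be sketched as follows. This is our illustration, not the authors' syntax: the discrimination parameter (alpha) and the randomly drawn thresholds are assumptions for demonstration purposes, not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(S, R, var_delta, I=7, K=5, alpha=1.5):
    """Generate S x R x I item scores (categories 0..K-1) from a graded
    response model with clustered trait theta_sr = theta_s + delta_sr."""
    var_theta = 1 - var_delta                   # total trait variance equals 1
    thresholds = np.sort(rng.normal(0, 1, (I, K - 1)), axis=1)
    theta_s = rng.normal(0, np.sqrt(var_theta), S)      # group component
    delta = rng.normal(0, np.sqrt(var_delta), (S, R))   # individual component
    theta = theta_s[:, None] + delta                    # theta_sr, shape (S, R)
    # Graded response model: P(X_sri >= k) = logistic(alpha * (theta_sr - b_ik))
    p_geq = 1 / (1 + np.exp(-alpha * (theta[:, :, None, None] - thresholds)))
    # One uniform draw per response; counting the exceeded step probabilities
    # yields a draw from the implied category distribution (nested events).
    u = rng.uniform(size=(S, R, I, 1))
    return (u < p_geq).sum(axis=-1)

X = simulate(S=30, R=10, var_delta=.70)   # e.g., the medium-ICC level
```

Smaller values of var_delta shift trait variance to the group component, making respondents within a group more alike and hence raising the ICC.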
Design Factors. The data-generating mechanism varied in terms of the number of groups (S), the group size (R_s), and the degree of within-group dependency, denoted by the intraclass correlation coefficient (ICC).
Number of Groups. The levels for the number of groups S were 10, 30, 50, and 100 groups, where it is expected that S = 10 is too small to result in accurate estimates (e.g., Snijders & Bosker, 2012, p. 48).
Group Size. Group size R_s had eight levels. Six levels had equal group sizes (i.e., 2, 5, 10, 20, 50, and 100 respondents per group) and two levels had unequal group sizes sampled from a discrete uniform distribution on the interval [10, 30]. For the levels with unequal group sizes, the group size was either independent of the group trait value θ_s, or the group sizes were matched to the θ_s values, so that a larger group size implied a higher θ_s value (discussed in detail below).
Within-Group Dependency. The ICC had five levels: very small (.06), small (.13), medium (.25), large (.47), and very large (.69). The ICC was manipulated via the variance of the respondent-specific component δ_sr, denoted σ²_δ. The values were σ²_δ = .93, .85, .70, .45, and .20 for the five ICC levels, respectively; the resulting ICC at each level was computed using a large sample (S = 100,000, R_s = 10). The variance of the group component θ_s was set to σ²_θ = 1 − σ²_δ, so the variance of the combined trait θ_sr equals σ²_θ + σ²_δ = 1 for each level, resulting in an identical population value for all conditions. For larger σ²_δ values, the individual effect in the latent trait θ_sr is larger (and the group effect smaller), making item scores within the same group less similar; as a result, test scores within a group are less similar and the ICC of the test score is lower.
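For reference, the ICC of the test score can be estimated from clustered total scores with the standard one-way ANOVA variance estimators. This is a minimal sketch of ours, assuming equal group sizes:

```python
import numpy as np

def icc(total_scores):
    """Estimate the ICC from an S x R matrix of test scores (one row per
    group, equal group sizes) via one-way ANOVA mean squares."""
    g = np.asarray(total_scores, dtype=float)
    S, R = g.shape
    grand = g.mean()
    msb = R * ((g.mean(axis=1) - grand) ** 2).sum() / (S - 1)       # between groups
    msw = ((g - g.mean(axis=1, keepdims=True)) ** 2).sum() / (S * (R - 1))
    var_between = max((msb - msw) / R, 0.0)   # truncate negative estimates at 0
    return var_between / (var_between + msw)
```

The returned value is the estimated share of test-score variance attributable to group membership, the quantity the design factor above manipulates.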
All levels were fully crossed, which resulted in 4 × 8 × 5 = 160 conditions. Per condition, Q = 1000 datasets were generated, indexed by q (q = 1, 2, ..., Q). The population value of H was .515 for all conditions, determined using a large sample (S = 1,000,000). The empirical standard error SE_H was computed for each condition as the standard deviation of the estimates of H across the Q replications, and varied between .005 and .127.
Performance Measures. For the simulated dataset in each replication, we computed the point estimate of the total-scale scalability coefficient H_q, its standard error SE_q, the 95% Wald-based confidence interval CI_q, and the 95% range-preserving confidence interval CI*_q, using the one- and two-level methods. The population value H was computed using a large sample (S = 1,000,000) and its empirical standard error SE_H as the standard deviation of H_q across the replications within a condition. Performance measures were bias of H_q, bias of SE_q relative to SE_H, and coverage of the confidence intervals (the proportion of times H is included in CI_q or CI*_q). Also, we compared the symmetry of the noncoverage for both types of intervals; that is, whether the noncoverage was equally distributed across both sides of the interval. A satisfactory coverage (i.e., .95) with symmetric noncoverage in both tails (i.e., .025 each) means that one-sided significance tests and confidence intervals can also be used with confidence.
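The performance measures for a single condition can be sketched as follows. The per-replication estimates below are stand-in normal draws, not output of the actual estimation procedure; all names and values are illustrative.

```python
import numpy as np

# Stand-in estimates for one condition: Q replications of an estimate of H
# and its estimated standard error (here simply fixed at the true spread).
rng = np.random.default_rng(7)
Q, H_pop = 1000, .515
H_hat = rng.normal(H_pop, .03, Q)    # per-replication point estimates
se_hat = np.full(Q, .03)             # per-replication SE estimates

bias = H_hat.mean() - H_pop          # bias of the point estimate
emp_se = H_hat.std(ddof=1)           # empirical SE across replications
se_bias = se_hat.mean() - emp_se     # bias of the SE estimate

# 95% Wald-based interval: coverage and noncoverage in each tail
lo, hi = H_hat - 1.96 * se_hat, H_hat + 1.96 * se_hat
coverage = np.mean((lo <= H_pop) & (H_pop <= hi))
miss_left = np.mean(H_pop < lo)      # interval lies entirely above H_pop
miss_right = np.mean(H_pop > hi)     # interval lies entirely below H_pop
```

Symmetric noncoverage means miss_left and miss_right are both close to .025; a surplus in one tail signals an asymmetric interval.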
Hypotheses. Unless the group size was related to the group trait value, we expected the point estimate to be unbiased. We expected the one-level standard error estimates to show larger negative bias for conditions with a larger ICC or larger group size. Also, we expected the two-level standard errors to be biased in the conditions with unequal group sizes dependent on the trait value, but unbiased in the other conditions. We expected the coverage to display trends similar to the bias of the standard error estimates: undercoverage for the one-level intervals in larger ICC conditions, and over- or undercoverage for the two-level intervals for dependent unequal group sizes. The Wald-based and range-preserving confidence intervals were expected to display similar coverage, as H < .7, the value above which range-preserving methods are beneficial according to Koopman et al. (in press).

Results
For all conditions, bias of the point estimate of the scalability coefficient was negligible (approximately −1%; M = −.005, SD = .008, range = −.041 to .020). Table 1 shows the bias of the standard error estimates. The one-level SE_H was generally underestimated, more severely so for fewer groups and larger ICC conditions. For (very) small ICC conditions, the one-level bias was negligible. In general, the bias of the two-level SE_H was negligible for medium to (very) large ICC conditions, but the two-level SE_H was negatively biased in the conditions with few, large groups.

Discussion
This simulation study showed that for clustered data, the point estimates of Mokken's scalability coefficients are accurate, and that the two-level standard errors and confidence intervals usually outperformed their one-level counterparts, especially for larger levels of within-group dependency and for larger groups. For the conditions with 10 large groups and medium or larger ICCs, the two-level standard error estimate was negatively biased and resulted in undercoverage of the confidence interval. In such a situation, only little independent information is present in the data, which is a possible reason for the inaccuracy of the standard error estimates (e.g., Snijders & Bosker, 2012, p. 24). A larger number of groups is recommended for better coverage rates of the two-level confidence interval.
The confidence interval was somewhat asymmetric for nested data, possibly caused by skewness of the sampling distribution. As the symmetry was slightly better for the Wald-based confidence interval, we suggest using this type of interval if H is (likely) well below its upper bound of 1. The noncoverage at the left side of the distribution was below the nominal value of .025. This means that the one-sided significance tests that are relevant in Mokken scale analysis are likely to be somewhat conservative, with Type I error rates below the nominal significance level.
Our adaptation of Snijders's (2001) estimation method, using averaged frequencies rather than averaged group proportions to estimate the scalability coefficients and their standard errors, led to accurate estimates, even when group size was related to the trait value of the group. Therefore, we recommend using our two-level estimation method for clustered data. The two-level standard error was somewhat overestimated when the ICC was close to zero, especially for small groups. This may be explained by the underlying assumption of a multinomial distribution for each group, which cannot be accurately estimated in such situations. Alternative assumptions (e.g., a Dirichlet-multinomial distribution) may yield more accurate standard errors in these conditions.