A multidimensional generalized many-facet Rasch model for rubric-based performance assessment

Performance assessment, in which human raters assess examinee performance in a practical task, often involves the use of a scoring rubric consisting of multiple evaluation items to increase the objectivity of evaluation. However, even when using a rubric, assigned scores are known to depend on characteristics of the rubric’s evaluation items and the raters, thus decreasing ability measurement accuracy. To resolve this problem, item response theory (IRT) models that can estimate examinee ability while considering the effects of these characteristics have been proposed. These IRT models assume unidimensionality, meaning that a rubric measures one latent ability. In practice, however, this assumption might not be satisfied because a rubric’s evaluation items are often designed to measure multiple sub-abilities that constitute a targeted ability. To address this issue, this study proposes a multidimensional IRT model for rubric-based performance assessment. Specifically, the proposed model is formulated as a multidimensional extension of a generalized many-facet Rasch model. Moreover, a No-U-Turn variant of the Hamiltonian Markov chain Monte Carlo algorithm is adopted as a parameter estimation method for the proposed model. The proposed model is useful not only for improving the ability measurement accuracy, but also for detailed analysis of rubric quality and rubric construct validity. The study demonstrates the effectiveness of the proposed model through simulation experiments and application to real data.


Introduction
In various assessment fields, performance assessment, in which raters assess examinee outcomes or processes for a performance task, has attracted much attention as a way to measure practical and higher-order abilities, such as problem-solving, critical reasoning, and creative thinking skills (Mislevy 2018; Zlatkin-Troitschanskaia et al. 2019; Murtonen and Balloo 2019; Palm 2008; Shavelson et al. 2019; Linlin 2019; Hussein et al. 2019; Uto and Okano 2020). Performance assessment has been conducted in various formats, including essay writing, oral presentations, interview examinations, and group discussions.
Performance assessment often involves the use of a scoring rubric that consists of multiple evaluation items to increase the objectivity of evaluation. However, even when using a rubric, assigned scores are known to depend on the characteristics of the rubric's evaluation items and the raters, which decreases the ability measurement accuracy (Deng et al. 2018; Hua and Wind 2019; Myford and Wolfe 2003; Nguyen et al. 2015; Rahman et al. 2020; Uto and Ueno 2018). Therefore, to improve measurement accuracy, ability estimation that considers the effects of these characteristics is needed.
For this reason, item response theory (IRT) models that can estimate examinee abilities while considering the effects of these characteristics have been proposed (Uto and Ueno 2018; Linacre 1989; Jin and Wang 2018; Wilson and Hoskens 2001; Shin et al. 2019). One representative model is the many-facet Rasch model (MFRM) (Linacre 1989), which has been applied to various performance assessments (Linlin 2019; Hua and Wind 2019; Deng et al. 2018; Chan et al. 2017; Tavakol and Pinner 2019; Kaliski et al. 2013). However, the MFRM makes strong assumptions, such as all raters having the same consistency level and all evaluation items having the same discriminatory power, even though these assumptions rarely hold in practice (Myford and Wolfe 2003; Jin and Wang 2018; Patz et al. 2002; Soo Park and Xing 2019). To relax these assumptions, several extensions of the MFRM have recently been proposed (Uto and Ueno 2018; Jin and Wang 2018; Shin et al. 2019). These IRT models are known to measure abilities with higher accuracy than simple scoring methods based on point totals or averages (Uto and Ueno 2016, 2020).
These IRT models assume unidimensionality, meaning that a rubric measures a single latent ability. In practical rubric-based performance assessment, however, this assumption might not hold because the evaluation items in a rubric are often designed to measure multiple sub-abilities that comprise a targeted ability. Applying unidimensional IRT models to data with a multidimensional ability scale degrades model fit and ability measurement accuracy (Hutten 1980).
Multidimensional IRT models are used to measure examinee abilities on a multidimensional ability scale and are well known in objective testing scenarios (Reckase 2009). Traditional multidimensional IRT models, however, have no rater parameters, which prevents not only estimation of examinee ability while considering the effects of rater characteristics, but also direct application to rubric-based performance assessment data.
To address this issue, this study proposes a multidimensional IRT model for rubric-based performance assessment, formulated as a multidimensional extension of a generalized MFRM. The proposed model has the following features:

1. The proposed model allows for estimation of examinee abilities while simultaneously considering the effects of the rubric's evaluation items and the raters, which improves model fit and ability measurement accuracy.
2. Examinee abilities can be assessed on the multidimensional ability scale that is assumed to underlie the rubric's evaluation items.
3. The model provides characteristic parameters for the rubric's evaluation items while removing the effects of the raters and the examinees. These parameters enable detailed analysis of the rubric's characteristics and construct validity.
We demonstrate the effectiveness of the proposed model through simulation experiments and application to actual data.

Rating data from rubric-based performance assessment
This study assumes situations where examinee performance on a given task is assessed by multiple raters using a scoring rubric consisting of multiple evaluation items. The obtained rubric-based performance assessment data X are thus defined as

X = { x_ijr | i ∈ I, j ∈ J, r ∈ R },

where x_ijr is the rating assigned to the performance of examinee j ∈ J = {1, 2, …, J} by rater r ∈ R = {1, 2, …, R} based on evaluation item i ∈ I = {1, 2, …, I} in the rubric, K = {1, 2, …, K} represents the rating categories, and x_ijr = −1 represents missing data. This study aims to estimate examinee ability from data X by using IRT.

Item response theory
With widespread adoption of computer testing, there has been an increase in the use of IRT (Lord 1980), a test theory based on mathematical models. In objective testing contexts, IRT generally defines the relationship between observed examinee responses to test items and latent examinee ability variables using latent variable models, so-called IRT models. IRT models give the probability of an item response as a function of the examinee's latent ability and the item's characteristic parameters, which offers several practical benefits over scoring based on simple point totals or averages.

IRT has traditionally been applied to test items for which responses can be scored as correct or incorrect, such as multiple-choice items. In recent years, however, there have been attempts to apply polytomous IRT models to performance assessments (Reise and Revicki 2014). Well-known IRT models that are applicable to ordered-categorical data such as rubric-based performance assessment data include the rating scale model (RSM) (Andrich 1978), the partial credit model (PCM) (Masters 1982), and the generalized partial credit model (GPCM) (Muraki 1997). The GPCM gives the probability that examinee j receives score k for test item i as

P_ijk = exp( Σ_{m=1}^{k} 1.7 α_i (θ_j − β_i − d_im) ) / Σ_{c=1}^{K} exp( Σ_{m=1}^{c} 1.7 α_i (θ_j − β_i − d_im) ),

where θ_j is the latent ability of examinee j, α_i is a discrimination parameter for test item i, β_i is a difficulty parameter for item i, and d_im is a step parameter denoting the difficulty of transition between scores m − 1 and m for item i. Here, d_i1 = 0, Σ_{m=2}^{K} d_im = 0, and a normal prior for the ability θ_j are assumed for model identification. The constant 1.7 is often used to make the model similar to the normal ogive function. The PCM is a special case of the GPCM where α_i = 1.0 for all items. The RSM is a special case of the PCM where d_im = d_m for all items, with d_m denoting the difficulty of transition between categories m − 1 and m.
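As a concrete reference, the GPCM response function can be sketched numerically as follows; this is a minimal illustration in which the function name, the vectorized form, and the example parameter values are our own, not part of the original formulation.

```python
import numpy as np

def gpcm_probs(theta, alpha, beta, d, D=1.7):
    """GPCM probabilities for rating categories 1..K of one item.

    theta : latent ability of the examinee
    alpha : item discrimination parameter
    beta  : item difficulty parameter
    d     : step parameters d_i1..d_iK, with d[0] = 0 and sum(d[1:]) = 0
    """
    d = np.asarray(d, dtype=float)
    # Exponent terms 1.7 * alpha * (theta - beta - d_m); the numerator for
    # category k is exp of the cumulative sum of the first k terms.
    terms = D * alpha * (theta - beta - d)
    numerators = np.exp(np.cumsum(terms))
    return numerators / numerators.sum()
```

For example, `gpcm_probs(0.0, 1.0, 0.0, [0.0, -0.5, 0.0, 0.5])` returns four probabilities that sum to one, and raising `theta` shifts probability mass toward the higher categories.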
Such traditional IRT models are applicable to two-way data consisting of examinees × test items. However, they cannot be directly applied to rubric-based performance assessment data comprising examinees × raters × evaluation items, even if we regard the test item parameters as evaluation item parameters. To address this problem, IRT models that jointly consider these characteristics have been proposed (Uto and Ueno 2018; Linacre 1989; Jin and Wang 2018; Wilson and Hoskens 2001; Shin et al. 2019). Note that some of these IRT models originally consider the characteristics of performance tasks and raters, assuming three-way data consisting of examinees × raters × performance tasks. However, this study assumes that such IRT models are applied to rubric-based performance assessment data by regarding the performance task parameters as evaluation item parameters.

IRT models for performance assessment
The most widely used IRT model for performance assessment is the MFRM (Linacre 1989). There are several variants of the MFRM (Myford and Wolfe 2004; Eckes 2015), but the most representative modeling defines the probability that x_ijr = k ∈ K as

P_ijrk = exp( Σ_{m=1}^{k} (θ_j − β_i − β_r − d_m) ) / Σ_{c=1}^{K} exp( Σ_{m=1}^{c} (θ_j − β_i − β_r − d_m) ),

where β_i is a difficulty parameter for evaluation item i, β_r is the severity of rater r, and d_m is a step parameter denoting the difficulty of transition between rating categories m − 1 and m, shared by all raters. For model identification, Σ_{r=1}^{R} β_r = 0, d_1 = 0, Σ_{m=2}^{K} d_m = 0, and a normal prior for the ability θ_j are assumed.
A unique feature of this MFRM is that it has the fewest parameters among existing IRT models for performance assessment. The accuracy of parameter estimation generally increases as the number of parameters per data point decreases (Uto and Ueno 2016, 2020; van der Linden 2016). Thus, this MFRM can estimate model parameters from a small dataset more accurately than other models, resulting in higher ability measurement accuracy when it fits the given data well (Uto and Ueno 2018; van der Linden 2016).
By contrast, the MFRM relies on the following assumptions.
1. All evaluation items have the same discriminatory power.
2. All raters have the same assessment consistency.
3. All raters share an equal-interval rating scale.

However, these assumptions are rarely satisfied in practice (Deng et al. 2018; Myford and Wolfe 2003; Jin and Wang 2018; Patz et al. 2002; Soo Park and Xing 2019; Elliott et al. 2009). Violation of these assumptions decreases model fit and ability measurement accuracy (Uto and Ueno 2018).
To relax these assumptions, various extensions of the MFRM have recently been proposed (Uto and Ueno 2018; Jin and Wang 2018; Shin et al. 2019; Eckes 2015). This study introduces the generalized MFRM, which relaxes all three assumptions simultaneously. This model gives the probability that x_ijr = k as

P_ijrk = exp( Σ_{m=1}^{k} α_r α_i (θ_j − β_i − β_r − d_rm) ) / Σ_{c=1}^{K} exp( Σ_{m=1}^{c} α_r α_i (θ_j − β_i − β_r − d_rm) ),

where α_i is a discrimination parameter for evaluation item i, α_r is the consistency of rater r, and d_rm is a step parameter denoting the severity of rater r in the transition from rating category m − 1 to m. The rater-specific step parameter d_rm can represent the rating scale of each rater, meaning that the restriction of an equal-interval scale shared by all raters is relaxed. For model identification, ∏_{r=1}^{R} α_r = 1, Σ_{r=1}^{R} β_r = 0, d_r1 = 0, Σ_{m=2}^{K} d_rm = 0, and a normal prior for the ability θ_j are assumed. The generalized MFRM can represent various rater effects and evaluation item characteristics, so it is expected to provide better model fit and more accurate ability measurement than the MFRM, especially when diverse rater and evaluation item characteristics are present. Thus, this study develops a multidimensional model for rubric-based performance assessment based on the generalized MFRM.
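A numerical sketch of this response function follows, assuming the parameterization in which the rater consistency α_r and item discrimination α_i jointly scale the term θ_j − β_i − β_r − d_rm; the function and variable names are our own.

```python
import numpy as np

def gmfrm_probs(theta, alpha_i, alpha_r, beta_i, beta_r, d_r):
    """Generalized MFRM probabilities for rating categories 1..K.

    d_r : rater-specific step parameters d_r1..d_rK, with d_r[0] = 0.
    """
    d_r = np.asarray(d_r, dtype=float)
    terms = alpha_r * alpha_i * (theta - beta_i - beta_r - d_r)
    numerators = np.exp(np.cumsum(terms))
    return numerators / numerators.sum()
```

A severe rater (large β_r) lowers the probability of the highest category for the same examinee, while a low consistency α_r flattens the category probabilities toward uniformity.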
Note that there are various other approaches to dealing with rater effects, such as hierarchical rater models (Patz et al. 2002; DeCarlo et al. 2011), extensions based on signal detection models (Soo Park and Xing 2019; DeCarlo 2005), rater bundle models (Wilson and Hoskens 2001), and trifactor models (Shin et al. 2019). However, this study focuses on the MFRM-based approach because it is the most popular and traditional approach.

Multidimensional item response theory models
In objective testing contexts, the use of multidimensional IRT models that can measure examinee ability in a multidimensional space has become widespread (Reckase 2009). Multidimensional IRT models can generally be classified into compensatory and non-compensatory models. Compensatory models assume that an examinee achieves high performance if any one of multiple sub-abilities is sufficiently high, whereas non-compensatory models assume that performance quality depends concurrently on multiple sub-abilities. This study focuses on non-compensatory models for the following reasons. (1) The non-compensatory assumption would be more suitable for performance assessment settings because performing practical complex tasks generally requires multiple abilities, not a single specific ability. (2) Compensatory models require more complex model formulations, which increases the number of model parameters and makes interpretation and estimation of the parameters difficult. A representative non-compensatory multidimensional IRT model for polytomous data is the non-compensatory multidimensional GPCM (Yao and Schwarz 2006). When the test item parameters are regarded as evaluation item parameters, the model gives the probability that examinee j obtains score k for evaluation item i as

P_ijk = exp( Σ_{m=1}^{k} (Σ_{l=1}^{L} α_il θ_jl − β_i − d_im) ) / Σ_{c=1}^{K} exp( Σ_{m=1}^{c} (Σ_{l=1}^{L} α_il θ_jl − β_i − d_im) ),

where L indicates the number of assumed ability dimensions, θ_jl is the ability of examinee j in dimension l ∈ L = {1, …, L}, α_il indicates the discriminatory power of evaluation item i for the l-th ability dimension, and β_i is a difficulty parameter for evaluation item i. For model identification, d_i1 = 0, Σ_{m=2}^{K} d_im = 0, and a normal prior for the ability of each dimension θ_jl are assumed.
However, such conventional multidimensional IRT models have no rater parameters, which prevents not only estimation of examinee ability while considering the effects of rater characteristics, but also direct application to rubric-based writing assessment data. To address this limitation, this study proposes a new multidimensional IRT model that considers the characteristics of both the rubric's evaluation items and the raters.

Proposed model
The proposed model is formulated as a multidimensional extension of the generalized MFRM based on the multidimensional GPCM approach. Specifically, this model gives the probability that x_ijr = k ∈ K as

P_ijrk = exp( Σ_{m=1}^{k} α_r (Σ_{l=1}^{L} α_il θ_jl − β_i − β_r − d_im) ) / Σ_{c=1}^{K} exp( Σ_{m=1}^{c} α_r (Σ_{l=1}^{L} α_il θ_jl − β_i − β_r − d_im) ).

Unlike conventional models, which cannot consider rater characteristics or estimate abilities on a multidimensional scale, the proposed model can estimate examinee ability on a multidimensional scale while considering the characteristics of both the evaluation items and the raters. Thus, the proposed model is expected to provide better model fit and more accurate ability measurement than conventional models when ability multidimensionality and rater effects exist. Furthermore, applying the proposed model to rubric-based performance assessment data provides the various characteristics of each evaluation item and rater, which helps in interpreting the quality of the evaluation items and the raters. In addition, the evaluation items' discrimination parameters α_il offer information for interpreting what each ability dimension measures, which makes objective analysis of rubric construct validity possible.
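The proposed model's response probability can be sketched numerically as follows, using the inner term Σ_l α_il θ_jl − β_i − β_r − d_im scaled by the rater consistency α_r; the function and variable names are our own.

```python
import numpy as np

def proposed_probs(theta, alpha_il, alpha_r, beta_i, beta_r, d_i):
    """Proposed-model probabilities P(x_ijr = k) for categories 1..K.

    theta    : length-L ability vector (theta_j1 .. theta_jL)
    alpha_il : length-L discrimination vector of evaluation item i
    alpha_r  : consistency of rater r
    beta_i   : difficulty of evaluation item i
    beta_r   : severity of rater r
    d_i      : item step parameters d_i1..d_iK, with d_i[0] = 0
    """
    theta = np.asarray(theta, dtype=float)
    alpha_il = np.asarray(alpha_il, dtype=float)
    d_i = np.asarray(d_i, dtype=float)
    # Inner term of the response function, scaled by the rater consistency.
    terms = alpha_r * (theta @ alpha_il - beta_i - beta_r - d_i)
    numerators = np.exp(np.cumsum(terms))
    return numerators / numerators.sum()
```

Setting all discriminations for one dimension near zero makes the probabilities insensitive to that dimension's ability, which is exactly the behavior discussed for the discrimination parameters below.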
Note that the generalized MFRM incorporates a rater-specific step parameter d_rm, assuming that the appearance tendency of each rating category m depends on the raters. In rubric-based performance assessment, however, the appearance tendency of each category is expected to depend more strongly on the rubric's evaluation items than on the raters, because the evaluation criteria for each rating category are defined by the rubric. Therefore, the proposed model defines the step parameter as the evaluation item's parameter d_im instead of d_rm.

Parameter interpretation
This subsection describes how to interpret the parameters of the proposed model. Figure 2 shows item response surfaces (IRSs) based on Eq. (6) for six pairs of raters and evaluation items with different characteristics, with the number of ability dimensions fixed at L = 2 and the number of rating categories at K = 4. The parameters used for the IRSs are shown in Table 1. For example, the IRS in Fig. 2a uses the parameter values for Evaluation item 1 and Rater 1 in Table 1. The x-axis indicates the ability in the first dimension θ_j1, the y-axis indicates the ability in the second dimension θ_j2, and the z-axis indicates the probability P_ijrk. The IRSs show that the probability of obtaining higher scores increases as the examinees' abilities increase.
Figure 2b–f shows IRSs in which one model parameter is changed from that of Fig. 2a; how each parameter works is explained below by comparing each panel with Fig. 2a.

Figure 2b shows the IRS in which the discrimination parameter for the second dimension, α_i2, is smaller than in Fig. 2a. The change in response probabilities arising from a change in the second-dimension ability value becomes smaller than in Fig. 2a. This means that evaluation items with a smaller discrimination α_il for the l-th dimension do not distinguish the corresponding ability dimension θ_jl well. It also suggests that the ability dimension an evaluation item mainly measures can be interpreted from the discrimination parameters, as we explain in the next subsection.

Figure 2c shows the IRS with a higher difficulty parameter β_i. An increase in the difficulty parameter causes the IRS to shift in the direction of increasing ability values, reflecting that obtaining higher scores becomes more difficult as the difficulty parameter of an evaluation item increases.

Figure 2d shows the IRS in which the step difficulty parameters d_i2 and d_i3 are changed. As the difference d_i(m+1) − d_im increases, the probability of obtaining category m increases over a wider range of the ability scale. In Fig. 2d, the probability of rating category 2 increases compared with Fig. 2a because d_i3 − d_i2 is large, whereas the probability of rating category 3 decreases because d_i4 − d_i3 is small.

Figure 2e shows the IRS for a rater with a lower consistency parameter α_r. Compared with Fig. 2a, the differences in response probabilities among the categories decrease, which reflects inconsistency in the rater's scoring among examinees with the same ability level. In contrast, a higher consistency value α_r results in larger differences in response probabilities among the categories.
This observation reflects the tendency of consistent raters to assign the same ratings to examinees with the same ability level and higher ratings to examinees with higher ability levels. This suggests that raters who are more consistent in scoring are generally desirable for accurate ability measurement.

Figure 2f shows the IRS for a rater with a high severity parameter β_r. Compared with Fig. 2a, the IRS shifts in the direction of increasing ability values as the severity parameter increases, reflecting the difficulty of obtaining higher ratings from severe raters.

Interpretation of ability dimensions
As noted above, the discrimination parameter α_il for each dimension offers information for interpreting what each ability dimension measures. Specifically, by analyzing the commonality in content among the evaluation items with higher discrimination values for a dimension, we can interpret what that ability dimension mainly measures.
This analysis is explained in greater detail in Fig. 3, which depicts the discrimination parameters for five evaluation items. The horizontal axis indicates the index of the evaluation items, the vertical axis indicates the value of the discrimination parameters, and the colored bars correspond to the discrimination parameter for each dimension. The figure indicates that evaluation items 1 and 2 have larger discrimination values in the first dimension. This suggests that the first dimension mainly relates to a common ability that underlies evaluation items 1 and 2. Thus, by analyzing commonality in content between evaluation items 1 and 2, we can interpret the meaning of the first ability dimension. Similarly, the meaning of the second ability dimension can be interpreted by investigating commonality in content between evaluation items 3 and 4, which have higher discrimination parameters for the second dimension.
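This grouping step can be illustrated with a hypothetical discrimination matrix whose values are invented to mimic the pattern described for Fig. 3 (five items, two dimensions):

```python
import numpy as np

# Hypothetical discrimination estimates alpha[i, l]; rows are evaluation
# items 1-5, columns are ability dimensions 1-2 (values are invented).
alpha = np.array([
    [1.4, 0.2],  # item 1: mainly dimension 1
    [1.2, 0.3],  # item 2: mainly dimension 1
    [0.3, 1.5],  # item 3: mainly dimension 2
    [0.2, 1.1],  # item 4: mainly dimension 2
    [0.6, 0.5],  # item 5: no clearly dominant dimension
])

# Group items by the dimension on which their discrimination is largest;
# shared content within a group suggests what that dimension measures.
dominant = alpha.argmax(axis=1)
groups = {l + 1: [i + 1 for i in range(alpha.shape[0]) if dominant[i] == l]
          for l in range(alpha.shape[1])}
```

Here `groups` maps dimension 1 to items 1, 2, and 5 and dimension 2 to items 3 and 4, so the meaning of dimension 1 would be interpreted from the content shared by items 1 and 2 (item 5 discriminates neither dimension strongly and is weakly informative).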

Optimal number of dimensions
In the proposed model, the number of dimensions, L, is a hand-tuned parameter that must be determined in advance. In IRT studies, the optimal number of dimensions is generally explored based on principal component analysis. However, this analysis method is not applicable to the three-way data assumed in the present study.
Dimensionality selection, which is well known in machine learning, can also be considered a model selection task. Model selection is typically conducted using information criteria, such as the Akaike information criterion (AIC) (Akaike 1974), the Bayesian information criterion (BIC) (Schwarz 1978), the widely applicable information criterion (WAIC) (Watanabe 2010), and the widely applicable Bayesian information criterion (WBIC) (Watanabe 2013). The AIC and BIC are applicable when maximum likelihood estimation is used to estimate model parameters, whereas the WAIC and WBIC can be used with Bayesian estimation based on Markov chain Monte Carlo (MCMC) or variational inference methods. With the recent increase in complex statistical and machine learning models, various studies have used the WAIC and WBIC because Bayesian estimation tends to provide more robust estimation for complex models (Vehtari et al. 2017; Almond 2014; Luo and Al-Harbi 2017). Because this study uses Bayesian estimation based on MCMC, as described in Subsect. 6.5, it uses the WAIC and WBIC to select the optimal number of dimensions for the proposed model. Specifically, the dimensionality that minimizes these criteria is regarded as optimal.
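As a sketch, the WAIC can be computed from an S × N matrix of pointwise log-likelihood draws (S posterior samples, N observations); the function name is ours, and the definition follows the variance-based form of the effective number of parameters.

```python
import numpy as np

def waic(log_lik):
    """WAIC (deviance scale; smaller is better) from pointwise log-likelihoods."""
    # log pointwise predictive density, computed stably via a log-sum-exp shift
    max_ll = log_lik.max(axis=0)
    lppd = np.sum(np.log(np.mean(np.exp(log_lik - max_ll), axis=0)) + max_ll)
    # effective number of parameters: posterior variance of each log-likelihood
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```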

Model identifiability
The proposed model entails a non-identifiability problem whereby parameter values cannot be uniquely determined, because different sets of values can give the same response probability. For the proposed model without rater parameters, which is consistent with the conventional multidimensional GPCM, the parameters are known to be identifiable by assuming a specific distribution (e.g., the standard normal distribution) for the ability and constraining d_i1 = 0 and Σ_{m=2}^{K} d_im = 0 for each i. However, in the proposed model, the location of β_i + β_r is indeterminate even when these constraints are given, because the response probability with β_i and β_r gives the same value of P_ijrk with β′_i = β_i + c and β′_r = β_r − c for any constant c. Such location indeterminacy can be solved by fixing one parameter or restricting the mean of some parameters (Fox 2010).
There is another indeterminacy, in the scale of α_r. Letting ψ denote the term Σ_{l=1}^{L} α_il θ_jl − β_i − β_r − d_im in Eq. (6), the response probability P_ijrk with α_r and ψ gives the same value with α′_r = α_r c and ψ′ = ψ/c for any constant c, because α′_r ψ′ = (α_r c)(ψ/c) = α_r ψ. Such scale indeterminacy can be removed by fixing one parameter or restricting the product of some parameters (Fox 2010).
This study therefore uses the restrictions ∏_{r=1}^{R} α_r = 1 and Σ_{r=1}^{R} β_r = 0, together with the standard normal prior for the ability of each dimension θ_jl.
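These restrictions can also be imposed post hoc on an arbitrary parameter draw by centering the rater severities and rescaling the rater consistencies; this is a sketch that omits the corresponding shift and rescaling of the remaining parameters.

```python
import numpy as np

def apply_restrictions(beta_r, alpha_r):
    """Map rater parameters onto the constrained scale:
    sum(beta_r) = 0 (location) and prod(alpha_r) = 1 (scale)."""
    beta_r = np.asarray(beta_r, dtype=float) - np.mean(beta_r)
    alpha_r = np.asarray(alpha_r, dtype=float)
    alpha_r = alpha_r / np.exp(np.mean(np.log(alpha_r)))  # divide by geometric mean
    return beta_r, alpha_r
```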

Parameter estimation using MCMC
This subsection describes the parameter estimation method for the proposed model.
Marginal maximum likelihood estimation using an expectation-maximization algorithm is a widely used method for estimating IRT model parameters (Baker and Kim 2004). However, for complex models such as that proposed in this study, expected a posteriori (EAP) estimation, a type of Bayesian estimation, is known to provide more robust estimations (Uto and Ueno 2016;Fox 2010).
EAP estimates are calculated as the expected values of the marginal posterior distributions of the parameters. The posterior distribution in the proposed model is

g(S | X) ∝ L(X | θ_jl, log α_il, log α_r, β_i, β_r, d_im) ∏_{s ∈ S} g(s | τ_s),

where S is the set of parameters, g(s | τ_s) indicates the prior distribution for parameter s, and τ_s denotes its hyperparameters. L(X | θ_jl, log α_il, log α_r, β_i, β_r, d_im) is the likelihood, which can be calculated as

L(X | θ_jl, log α_il, log α_r, β_i, β_r, d_im) = ∏_{i=1}^{I} ∏_{j=1}^{J} ∏_{r=1}^{R} ∏_{k=1}^{K} P_ijrk^{z_ijrk},

where z_ijrk is a dummy variable that takes the value 1 if x_ijr = k, and zero otherwise.
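The likelihood can be sketched as a loop over observed entries in the three-way data, with x_ijr = −1 skipped, mirroring the dummy-variable formulation; this plain-loop version and its names are ours and favor clarity over speed.

```python
import numpy as np

def log_likelihood(X, theta, alpha_il, alpha_r, beta_i, beta_r, d):
    """Log-likelihood of ratings X (shape I x J x R; -1 marks missing data)."""
    I, J, R = X.shape
    ll = 0.0
    for i in range(I):
        for j in range(J):
            for r in range(R):
                k = X[i, j, r]
                if k == -1:  # missing rating contributes nothing
                    continue
                # Category probabilities for this (item, examinee, rater) triple
                terms = alpha_r[r] * (theta[j] @ alpha_il[i]
                                      - beta_i[i] - beta_r[r] - d[i])
                probs = np.exp(np.cumsum(terms))
                probs /= probs.sum()
                ll += np.log(probs[k - 1])  # categories are 1-indexed
    return ll
```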
The marginal posterior distribution for each parameter is derived by marginalizing across all parameters except the target parameter. For a complex IRT model, however, it is generally infeasible to derive or calculate the marginal posterior distribution because of high-dimensional multiple integrals. To address this problem, MCMC, a random sampling-based estimation method, is widely used in various fields, including IRT studies (Fox 2010; Uto 2019; van Lier et al. 2018; Fontanella et al. 2019; Zhang et al. 2011; Uto et al. 2017; Louvigné et al. 2018; Brooks et al. 2011).
The Metropolis-Hastings-within-Gibbs sampling method (Patz and Junker 1999) is a common MCMC algorithm for IRT models. The algorithm is simple and easy to implement but requires a long time to converge to the target distribution (Hoffman and Gelman 2014; Girolami and Calderhead 2011). As a more efficient MCMC algorithm, the No-U-Turn (NUT) sampler (Hoffman and Gelman 2014), a variant of Hamiltonian Monte Carlo (HMC), has recently been proposed along with the software package "Stan" (Carpenter et al. 2017), which makes implementation of a NUT-based HMC easy. Thus, NUT has recently seen widespread use for parameter estimation in various statistical models, including IRT models (Luo and Jiao 2018; Jiang and Carter 2019).
We therefore use a NUT-based MCMC algorithm for parameter estimation in the proposed model. The estimation program was implemented in RStan (Stan Development Team 2018), and the developed Stan code is provided in Appendix 1. In this study, the standard normal distribution N(0.0, 1.0) is used as the prior distribution for each parameter: θ_jl, log α_il, log α_r, β_i, β_r, and d_im. Furthermore, the EAP estimates are calculated as the means of the parameter samples obtained from the 2000th through the 4000th sampling periods (i.e., the first 2000 periods are discarded as burn-in).
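On the output side, the EAP estimate of each parameter is simply the mean of its retained draws; the synthetic trace below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MCMC trace for a single parameter: 4000 periods in total,
# discarding the first 2000 as burn-in, matching the setting above.
draws = rng.normal(loc=0.8, scale=0.1, size=4000)
eap = draws[2000:].mean()  # EAP estimate = mean of the retained samples
```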
Parameter recovery experiment

This subsection describes a simulation experiment for evaluating the accuracy of parameter estimation for the proposed model. The experimental procedure is as follows.

1. For J examinees, I evaluation items, R raters, and L dimensions, generate true model parameters randomly from the following distributions, which are the same as the prior distributions used in the MCMC algorithm: the standard normal distribution N(0.0, 1.0) for θ_jl, log α_il, log α_r, β_i, β_r, and d_im.
The number of rating categories, K, was fixed at 4 to match the conditions of the actual data used in a later section.

2. Set skewed discrimination values for the first L evaluation items i ∈ {1, …, L}, such that each of these items discriminates one distinct dimension dominantly. The necessity of this procedure is explained in the next paragraph.
3. Given the true parameters, randomly generate rating data from the proposed model.
4. Estimate the model parameters from the generated data.
5. Sort the order of the estimated dimensions based on the discrimination parameter estimates. Specifically, using the discrimination parameter estimates for the first L evaluation items, we sorted the dimensions so that Σ_{l=1}^{L} α_ll was maximized.
6. Calculate root mean square errors (RMSEs) and biases between the estimated and true parameters.
7. Repeat the above procedure 30 times, then calculate the average values of the RMSEs and biases.
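Procedure 5 can be sketched as a search over column permutations of the estimated discrimination matrix for the first L items; the function name is ours, and exhaustive search is feasible because L is small here.

```python
import numpy as np
from itertools import permutations

def align_dimensions(alpha_est, L):
    """Return the permutation of estimated dimensions that maximizes the
    sum of 'diagonal' discriminations over the first L evaluation items."""
    return max(permutations(range(L)),
               key=lambda p: sum(alpha_est[l, p[l]] for l in range(L)))
```

If the estimated dimensions come out in a swapped order, the returned permutation recovers the intended correspondence before RMSEs and biases are computed.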
Note that Procedures 2 and 5 are required because the ability dimensions in multidimensional IRT models, including the proposed model, are exchangeable, meaning that the dimension to which a sub-ability corresponds can change every time the parameter estimation runs. This dimension indeterminacy arises because interchanging α_il θ_jl and α_il′ θ_jl′ (l ∈ L, l′ ∈ L, l ≠ l′) results in the same value for the term Σ_{l=1}^{L} α_il θ_jl. To appropriately calculate the RMSEs and biases between the estimated and true parameters, this experiment addressed the problem by setting extreme discrimination parameter values for the first L evaluation items in Procedure 2 and by sorting the estimated dimensions based on the discrimination parameter estimates in Procedure 5, as in Martin-Fernandez and Revuelta (2017).

Table 2 shows the RMSE results, which confirm the following tendencies:

1. The RMSEs for ability values tend to decrease as the number of evaluation items and/or raters increases. Similarly, the RMSEs for rater and evaluation item parameters tend to decrease as the number of examinees increases. These tendencies are caused by the increase in the amount of data per parameter, which is consistent with previous studies (Uto and Ueno 2016, 2018).

2. An increase in the number of dimensions tends to lead to an increase in the RMSEs, because the numbers of ability and discrimination parameters increase without an increase in the amount of data. This tendency is also consistent with previous research on multidimensional IRT (Martin-Fernandez and Revuelta 2017; Svetina et al. 2017; Kose and Demirtasli 2012).
Moreover, Table 3 shows that the average bias was nearly zero in all cases, indicating no systematic overestimation or underestimation of the parameters. We also confirmed that the Gelman–Rubin statistic R̂ (Gelman and Rubin 1992; Gelman et al. 2013), a well-known convergence diagnostic index, was less than 1.1 in all cases, indicating that the MCMC runs converged.
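For reference, a basic between-/within-chain version of the Gelman–Rubin statistic can be sketched as follows (the cited references give refined variants, such as split-chain R̂):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for an (M chains x N draws) array of one parameter."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (N - 1) / N * W + B / N        # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Values below 1.1 for every parameter are conventionally taken as evidence of convergence, as in the experiment above.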
From the above, we conclude that the parameter estimation for the proposed model can be appropriately conducted using the MCMC algorithm.

Validity of dimensionality selection using information criteria
This subsection describes a simulation experiment for evaluating the accuracy of dimensionality selection using the WAIC and WBIC. Concretely, we conducted the following experiments while changing the numbers of examinees, evaluation items, raters, and dimensions over J ∈ {50, 100}, I ∈ {5, 15}, R ∈ {5, 15}, and L ∈ {1, 2, 3}, respectively.
1. For J examinees, I evaluation items, R raters, and L dimensions, generate rating data from the proposed model after the true model parameters are randomly generated from the distributions in Eq. (8). The number of categories, K, was fixed at 4, as in the parameter recovery experiment.
2. Using the generated data, estimate the parameters of the proposed model and calculate the WAIC and WBIC values while changing the number of dimensions L_e over {1, 2, 3}.
3. Rank the WAIC and WBIC values for each L_e, such that the L_e with the lowest WAIC and WBIC values is ranked first.
4. Repeat the above procedure 30 times, then calculate the average rank. Additionally, calculate the ratio of correct dimensionality identification for each setting.

Table 4 shows the results. The L_e = 1, 2, 3 columns show the average estimated ranks, with the highest average rank for each setting shown in bold. The Acc column shows the ratio of correct dimensionality identification.
The results show that the WAIC can select the true dimensionality in many cases. The WBIC can also select the true dimensionality when the size of the data increases, although it tends to be inferior to the WAIC when the dataset is small. These results suggest that both information criteria can find the optimal number of dimensions in larger-scale settings, whereas the WAIC is more accurate in relatively small-scale settings.
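The selection rule used in this experiment, i.e., ranking the candidate dimensionalities by an information criterion and choosing the minimizer, can be sketched as follows (the WAIC values below are invented for illustration):

```python
# Hypothetical WAIC values for candidate dimensionalities L_e = 1, 2, 3;
# smaller is better, so the selected L_e minimizes the criterion.
waic_by_dim = {1: 5230.4, 2: 5121.9, 3: 5140.2}

selected = min(waic_by_dim, key=waic_by_dim.get)
ranks = {L: r + 1 for r, L in enumerate(sorted(waic_by_dim, key=waic_by_dim.get))}
```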

Accuracy of ability measurement
This subsection evaluates whether the consideration of the rater characteristics in the proposed model is effective in improving ability measurement accuracy. For the evaluation, we compare ability measurement accuracy between the proposed model and the conventional multidimensional GPCM. Note that the conventional multidimensional GPCM is consistent with the proposed model without rater parameters. We conducted the following experiments by changing the number of examinees, evaluation items, raters, and dimensions to J ∈ {50, 100} , I ∈ {5, 15} , R ∈ {5, 15} , and L ∈ {1, 2, 3} , respectively.
1. For J examinees, I evaluation items, R raters, and L dimensions, randomly generate true model parameters for the proposed model and for the conventional multidimensional GPCM from the distributions in Eq. (8). The number of categories, K, was fixed at 4, as in the above experiments.
2. Given the parameters generated in Procedure 1, generate rating data from each of the two models.
3. Using each generated dataset, estimate the model parameters and examinee abilities under each model.
4. Calculate Pearson's correlation between the ability estimates and the true ability values generated in Procedure 1.
5. Repeat the above procedure 30 times, then calculate the average correlation.
Note that the conventional multidimensional GPCM cannot directly handle three-way data consisting of examinees, raters, and evaluation items. Thus, in this experiment, we applied the conventional multidimensional GPCM assuming that the probability for the three-way data, P_ijrk, is defined by P_ijk in Eq. (5) as follows:

P_ijrk = P_ijk. (10)

Table 5 shows the results. In the table, the Generation from Prop and Generation from Conv columns show the results when data are generated from the proposed model and from the conventional multidimensional GPCM, respectively. The sub-columns Prop and Conv show the average correlation values obtained when the proposed model and the conventional multidimensional GPCM are applied to each dataset. Furthermore, the sub-column Diff indicates the difference in average correlation values between Prop and Conv, where a larger Diff value indicates that the proposed model is more accurate. We also conducted a paired t-test on the average correlation between the proposed model and the conventional multidimensional GPCM; the resulting p values are shown in the p column.
The results show that the performance of the conventional model drops significantly when the data are generated from the proposed model, whereas the performance of the proposed model is almost equal to that of the conventional model when the data are generated from the conventional model. This suggests that ignoring rater characteristics deteriorates ability measurement accuracy when raters have different characteristics. Note that the p values of the paired t-test depend on two factors, namely (1) the mean of the difference between conditions and (2) the standard deviation of the difference between conditions. Thus, as seen in Table 5, a larger absolute Diff value does not necessarily result in a lower p value. Although the results demonstrate the effectiveness of the proposed model, Table 5 shows that the improvement in correlation values achieved by the proposed model is small. This is because the data X were generated as complete data under a fully crossed design, in which all raters evaluate all examinees. In this case, because the data per examinee are large and dense, ability measurement accuracy tends to be extremely high for both models, making the difference in performance between them small. In practice, however, the data will be sparser because only a few raters are typically assigned to each examinee to reduce the raters' assessment workload. In sparse data settings, we can expect the difference in performance between the models to be clearer.
Thus, we conducted the same experiment as described above, assuming a practical situation in which few raters are assigned to each examinee. Concretely, in Procedure 2, we first assigned two raters to each examinee based on a systematic link design (Shin et al. 2019; Uto 2020; Wind and Jones 2019), and then generated the data based on this rater assignment. Examples of a fully crossed design and a systematic link design are illustrated in Tables 6 and 7, where check marks indicate an assigned rater and blank cells indicate that no rater was assigned. Cells without assigned raters are treated as missing data. Table 8 shows the results. The average correlation values of the conventional model drop substantially when the data are generated from the proposed model, whereas the proposed model maintains high performance regardless of the data generation model. From these experimental results, we can conclude that considering rater characteristics, as the proposed model does, is effective in improving ability measurement accuracy.
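A systematic link design with two raters per examinee can be sketched as follows. This rotating-block layout is one plausible variant of such a design, not necessarily the exact assignment scheme used in the cited studies.

```python
import numpy as np

def systematic_links(n_examinees, n_raters, raters_per_examinee=2):
    """Build a boolean assignment matrix for a simple systematic link design:
    each examinee is rated by a block of consecutive raters, and the block
    is rotated across examinees so that all raters end up indirectly linked
    through shared examinees."""
    design = np.zeros((n_examinees, n_raters), dtype=bool)
    for j in range(n_examinees):
        start = j % n_raters                    # rotate the starting rater
        for k in range(raters_per_examinee):
            design[j, (start + k) % n_raters] = True
    return design

D = systematic_links(n_examinees=8, n_raters=5)
# Every examinee gets exactly two raters; overlapping rater pairs chain all
# raters onto a common scale, and blank (False) cells are treated as missing.
```

Generating ratings only for the True cells of such a design yields the sparse datasets used in this experiment.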

Actual data experiments
This section describes the performance of the proposed model in experiments based on actual data.

Actual data
In this experiment, actual rubric-based performance assessment data were gathered as follows:

1. We recruited 134 Japanese university students as participants.
2. The participants were asked to complete an essay-writing task translated from a task used in the National Assessment of Educational Progress assessments (Persky et al. 2003; Salahu-Din et al. 2008). No specific or preliminary knowledge was needed to complete the task.
3. The written essays were evaluated by 18 raters using a rubric consisting of 9 evaluation items, each with 4 rating categories. We assigned four raters to each essay based on a systematic link design (Shin et al. 2019; Uto 2020; Wind and Jones 2019) to reduce the raters' assessment workload. The Evaluation items column in Table 9 lists abstracts of the evaluation items in the rubric, which was created based on two writing assessment rubrics proposed by Matsushita et al. (2013) and Nakajima (2017) for Japanese university students. Furthermore, Appendix 2 presents the full rubric.
We evaluated the effectiveness of the proposed model using the obtained data.

Model comparison using information criteria
As explained above, the proposed model can estimate examinee ability on a multidimensional scale while considering the characteristics of both the raters and the rubric's evaluation items. To evaluate the effectiveness of considering multidimensionality and rater characteristics, we conducted a model fitting evaluation based on information criteria. Specifically, using the actual data, we calculated the WAIC and WBIC for the proposed model and for the conventional multidimensional GPCM, which is equivalent to the proposed model without rater parameters, for each dimensionality L ∈ {1, …, 5}. Table 10 shows the results, with the minimum value for each criterion in bold. The table indicates that the WAIC and WBIC are minimized at L = 2 for both the proposed model and the conventional model. This means that the unidimensionality assumption is not satisfied in the data, suggesting the need for multidimensional models. Furthermore, comparison of the two models shows that the proposed model fits better than the conventional model in all cases. These results suggest that considering rater characteristics is effective in improving model fit, which verifies the effectiveness of the proposed model.

Characteristic interpretation of the rubric's evaluation items
This subsection interprets the characteristics of the rubric's evaluation items. Table 9 shows the parameters of the evaluation items, estimated by the proposed model with L = 2. Here, L = 2 was used because it provided the best model fit, as shown in the experiment above.
According to Table 9, the evaluation items show different patterns of discrimination parameters. For example, evaluation items 1-6 have larger discrimination values in the second dimension, whereas evaluation items 7-9 have larger discrimination values in the first dimension. Moreover, evaluation item 4 has relatively low discrimination values in both dimensions, meaning that it might not be suitable for distinguishing examinee ability. In contrast, evaluation item 6 has moderate discrimination values in both dimensions, meaning that it measures both ability dimensions concurrently.
The discrimination parameters of each evaluation item enable us to interpret what each ability dimension mainly measures. Specifically, as described above, Table 9 shows that evaluation items 1-6 have larger discrimination values in the second dimension, and evaluation items 7-9 have larger values in the first dimension. These results suggest that the first ability dimension reflects a common ability underlying evaluation items 7-9, and the second dimension reflects a common ability underlying evaluation items 1-6. From the contents of the evaluation items (see Appendix 2), we can see that evaluation items 7-9 relate to stylistic skills (such as typographical errors and word choice), whereas evaluation items 1-6 relate to logical skills (such as argumentation and organization). Indeed, the rubric was designed such that evaluation items 1-6 mainly measure argumentative skills and evaluation items 7-9 measure stylistic skills (Matsushita et al. 2013; Nakajima 2017). These results suggest that the analysis based on the proposed model supports the rubric developers' expectation. Furthermore, Table 9 shows that difficulty differs among the evaluation items. For example, evaluation item 4 is the most difficult, and evaluation item 7 is the easiest. These are reasonable results because evaluation item 4 requires sufficient discussion of opposing opinions, whereas evaluation item 7 requires only superficial typographical correctness.
The step difficulty parameters d_im also show different patterns, meaning that the score distribution differs among the evaluation items. As examples, Fig. 4 depicts the IRSs for evaluation items 3 and 4, which have different step difficulty parameter patterns but relatively similar discrimination and difficulty values. The figure shows that, for evaluation item 4, a score of 2 tends to be avoided and a score of 3 tends to be preferred.
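The avoided-category effect can be reproduced with a small numerical sketch: in a GPCM, disordered step difficulties (an unusually hard step followed by an easier one) make the corresponding middle category rarely, or never, the most likely response. The parameter values below are hypothetical illustrations, not the Table 9 estimates.

```python
import numpy as np

def gpcm_probs(theta, a, b, d):
    """Category probabilities of a unidimensional GPCM item with
    discrimination a, difficulty b, and K-1 step difficulties d."""
    steps = a * (theta - b) - d                   # adjacent-category step terms
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta_grid = np.linspace(-3.0, 3.0, 121)
# Step 1 is much harder than step 2 (disordered steps), so the second of the
# four categories is never the modal response at any ability level.
P = np.array([gpcm_probs(t, a=1.0, b=0.0, d=np.array([1.5, -1.0, 0.5]))
              for t in theta_grid])
modal = P.argmax(axis=1)
```

Plotting the columns of P over theta_grid yields item response curves of the kind shown in Fig. 4, with the curve for the skipped category staying below its neighbors everywhere.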

Rater parameter estimates and ability estimates
To confirm whether rater characteristics differed, rater parameter estimates were obtained, as shown in Table 11. According to the table, severity and consistency differ among raters. For example, Raters 1 and 17 are highly inconsistent raters whose ratings might be unreliable, whereas Raters 8 and 15 are highly consistent raters. Furthermore, Raters 1, 4, and 8 have higher severity values, whereas Raters 3, 9, and 17 have lower severity values. This variety of rater characteristics is why the proposed model fit better than the conventional multidimensional GPCM.
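The roles of these two rater parameters can be illustrated numerically. In generalized-MFRM-style models, rater severity shifts the step locations while rater consistency acts as a rater-specific discrimination; the parameter values below are hypothetical, not the Table 11 estimates.

```python
import numpy as np

def rater_probs(theta, consistency, severity, d):
    """Rating-category probabilities for one rater under a GPCM-style model
    in which `consistency` multiplies the step terms (rater discrimination)
    and `severity` shifts them."""
    steps = consistency * (theta - severity) - d
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

d = np.zeros(3)                                    # neutral step difficulties
lenient = rater_probs(0.0, consistency=1.0, severity=-1.0, d=d)
severe = rater_probs(0.0, consistency=1.0, severity=1.0, d=d)
erratic = rater_probs(0.0, consistency=0.3, severity=0.0, d=d)

scores = np.arange(4)
# For the same examinee, the severe rater yields a lower expected score than
# the lenient one, and the low-consistency rater's ratings are nearly uniform,
# carrying little information about ability.
```

This is why highly inconsistent raters such as Raters 1 and 17 contribute less reliable information, and why ignoring severity differences biases ability estimates.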
Moreover, Fig. 5 shows the two-dimensional ability estimates for each examinee. The horizontal axis indicates the first-dimension ability value θ_j1, the vertical axis indicates the second-dimension ability value θ_j2, and each dot represents an examinee. The figure shows that the examinees have different ability patterns. Such multidimensional ability measurement cannot be realized by conventional unidimensional IRT models.

Conclusion
This study proposed a new IRT model for rubric-based performance assessment. The model was formulated as a multidimensional extension of the generalized MFRM. A No-U-Turn (NUTS) variant of the HMC algorithm for the proposed model was implemented using the software package Stan. Through simulation experiments, we demonstrated the following:

1. The MCMC algorithm appropriately estimates the model parameters.
2. An optimal number of dimensions for the proposed model can be determined using information criteria.
3. Considering rater characteristics, as the proposed model does, is effective in improving ability measurement accuracy.

We also conducted real data experiments to exemplify the analysis of rubric quality and rubric construct validity through interpretation of the dimensionality and the characteristics of the evaluation items. The actual data experiment also showed that considering multidimensionality and rater characteristics, as the proposed model does, improved model fit.
In future studies, we plan to evaluate the effectiveness of the proposed model using more varied and larger datasets. Furthermore, we hope to extend the proposed model to four-way data consisting of examinees × raters × evaluation items × performance tasks, because practical tests often include several tasks.

Appendix 1
The Stan code for the proposed model is as follows:

Appendix 2

Appropriateness of problem setting
Score 4: The problem described adheres to the given theme, and descriptions of why the problem was addressed and its background include an explanation of its importance
Score 3: The problem described adheres to the given theme, and there is a description of why the problem was addressed and its background
Score 2: The problem described adheres to the given theme, but there is insufficient description of why the problem was addressed or its background
Score 1: Does not meet criteria for a score of 2

Consistency between claims and conclusions
Score 4: Derives a conclusion corresponding to the author's assertions about the problem described. The conclusion is novel and goes beyond general theory
Score 3: Derives a conclusion corresponding to the author's assertions about the problem described
Score 2: A conclusion is described, but it insufficiently corresponds to the author's assertions
Score 1: Does not meet criteria for a score of 2

Presentation of evidence
Score 4: Evidence for the author's assertions is presented with the support of various reliable facts and data
Score 3: Evidence for the author's assertions is presented with the support of reliable facts or data in at least one case
Score 2: Evidence for the author's assertions is presented, but without the support of reliable facts or data in at least one case
Score 1: Does not meet criteria for a score of 2

Typographical errors
Score 2: There are 4-6 typographical errors
Score 1: Does not meet criteria for a score of 2

Stylistic consistency
Score 4: There is stylistic consistency and appropriate use of written language, which consists of easy-to-understand sentences and phrasings
Score 3: The author demonstrates efforts toward stylistic consistency and appropriate use of written language, but there are 1-3 mistakes
Score 2: There is general stylistic consistency, but there are also 4 or more mistakes, such as use of spoken language or ambiguous meanings
Score 1: Does not meet criteria for a score of 2

Usage of conjunctions
Score 4: Connective phrasings are appropriately used
Score 3: There are 1-2 misuses of connective phrasings
Score 2: There are 3-5 misuses of connective phrasings
Score 1: Does not meet criteria for a score of 2