1 Introduction

In various assessment fields, performance assessment, in which raters assess examinee outcomes or processes for a performance task, has attracted much attention as a way to measure practical and higher-order abilities, such as problem-solving, critical reasoning, and creative thinking skills (Mislevy 2018; Zlatkin-Troitschanskaia et al. 2019; Murtonen and Balloo 2019; Palm 2008; Shavelson et al. 2019; Linlin 2019; Hussein et al. 2019; Uto and Okano 2020). Performance assessment has been conducted in various formats, including essay writing, oral presentations, interview examinations, and group discussions.

Performance assessment often involves the use of a scoring rubric that consists of multiple evaluation items to increase the objectivity of evaluation. However, even when using a rubric, assigned scores are known to depend on the characteristics of the rubric’s evaluation items and the raters, which decreases the ability measurement accuracy (Deng et al. 2018; Hua and Wind 2019; Myford and Wolfe 2003; Nguyen et al. 2015; Rahman et al. 2020; Uto et al. 2020; Uto and Ueno 2018). Therefore, to improve measurement accuracy, ability estimation that considers the effects of these characteristics is needed.

For this reason, item response theory (IRT) models that can estimate examinee abilities while considering the effects of these characteristics have been proposed (Uto and Ueno 2018; Linacre 1989; Jin and Wang 2018; Wilson and Hoskens 2001; Shin et al. 2019). One representative model is the many-facet Rasch model (MFRM) (Linacre 1989), which has been applied to various performance assessments (Linlin 2019; Hua and Wind 2019; Deng et al. 2018; Chan et al. 2017; Tavakol and Pinner 2019; Kaliski et al. 2013). However, the MFRM makes strong assumptions, for example, that all raters have the same consistency level and that all evaluation items have the same discriminatory power, even though these assumptions rarely hold in practice (Myford and Wolfe 2003; Jin and Wang 2018; Patz et al. 2002; Uto and Ueno 2020; Soo Park and Xing 2019). To relax these assumptions, several extensions of the MFRM have recently been proposed (Uto and Ueno 2018; Jin and Wang 2018; Shin et al. 2019; Uto and Ueno 2020). These IRT models are known to measure abilities more accurately than simple scoring methods based on point totals or averages (Uto and Ueno 2016, 2020).

These IRT models assume unidimensionality, meaning that a rubric measures a single latent ability. However, this assumption might not hold in practical rubric-based performance assessment because the evaluation items in a rubric are often designed to measure multiple sub-abilities that make up a targeted ability. Applying unidimensional IRT models to data with a multidimensional ability scale deteriorates model fitting and ability measurement accuracy (Hutten 1980).

Multidimensional IRT models are used to measure examinee abilities on a multidimensional ability scale and are well known in objective testing scenarios (Reckase 2009). Traditional multidimensional IRT models, however, have no rater parameters, which prevents not only estimation of examinee ability while considering the effects of rater characteristics, but also direct application to rubric-based performance assessment data.

To resolve the abovementioned problems, the present study proposes a new multidimensional IRT model that incorporates parameters for the characteristics of both the rubric’s evaluation items and the raters. Specifically, the proposed model is formulated by extending the generalized MFRM (Uto and Ueno 2020), one of the latest MFRM extensions, based on the approach of the multidimensional generalized partial credit model (GPCM) (Yao and Schwarz 2006). For parameter estimation, the study adopts a Bayesian method based on the No-U-Turn (NUT) variant of Hamiltonian Monte Carlo (HMC), a state-of-the-art Markov chain Monte Carlo (MCMC) algorithm (Hoffman and Gelman 2014). As outlined in Fig. 1, the proposed model has the following advantages.

1. The proposed model allows for estimation of examinee abilities while simultaneously considering the effects of the rubric’s evaluation items and the raters, which improves model fitting and ability measurement accuracy.

2. Examinee abilities can be assessed on the multidimensional ability scale assumed to underlie the rubric’s evaluation items.

3. The model provides characteristic parameters for the rubric’s evaluation items while removing the effects of the raters and the examinees. These parameters enable detailed analysis of the rubric’s characteristics and construct validity.

We demonstrate the effectiveness of the proposed model through simulation experiments and application to actual data.

Fig. 1 Outline of an application of the proposed model

2 Rating data from rubric-based performance assessment

This study assumes situations where examinee performance on a given task is assessed by multiple raters using a scoring rubric consisting of multiple evaluation items. Thus, obtained rubric-based performance assessment data \(\varvec{X}\) are defined as follows:

$$\begin{aligned} \varvec{X} = \{ x_{ijr} | x_{ijr} \in {\mathcal {K}} \cup \{-1\}, i \in {\mathcal {I}}, j \in {\mathcal {J}}, r \in {\mathcal {R}}\}, \end{aligned}$$
(1)

where \(x_{ijr}\) is a rating assigned to the performance of examinee \(j\in {\mathcal {J}}=\{1, 2, \ldots ,J\}\) by rater \(r\in {\mathcal {R}} = \{1, 2, \ldots ,R\}\) based on evaluation item \(i \in {\mathcal {I}}=\{1, 2, \ldots ,I\}\) in the rubric. \({\mathcal {K}} =\{1, 2, \ldots , K\}\) represents rating categories, and \(x_{ijr} =-1\) represents missing data.
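As a concrete illustration, the three-way data \(\varvec{X}\) can be stored as an \(I \times J \times R\) integer array with \(-1\) marking missing ratings; the sizes and values in the following sketch are hypothetical and only show the layout.

```python
import numpy as np

# Toy example with hypothetical sizes: I = 2 evaluation items, J = 3 examinees,
# R = 2 raters, and K = 4 rating categories; -1 marks a missing rating.
I, J, R, K = 2, 3, 2, 4
X = np.full((I, J, R), -1, dtype=int)   # X[i, j, r] corresponds to x_{ijr}
X[0, 0, 0] = 3    # rater 1 assigned rating 3 to examinee 1 on evaluation item 1
X[0, 0, 1] = 2    # rater 2 assigned rating 2 to the same examinee and item
X[1, 2, 0] = 4    # rater 1 assigned rating 4 to examinee 3 on evaluation item 2

observed = np.argwhere(X != -1)          # (i, j, r) indices of observed ratings
```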

This study aimed to estimate examinee ability from data \(\varvec{X}\) by using IRT.

3 Item response theory

With widespread adoption of computer testing, there has been an increase in the use of IRT (Lord 1980), a test theory based on mathematical models. In objective testing contexts, IRT generally defines the relationship between observed examinee responses to test items and latent examinee ability variables using latent variable models, so-called IRT models. IRT models give the probability of an item response as a function of the examinee’s latent ability and the item’s characteristic parameters. IRT offers the following benefits:

1. Examinee ability can be estimated while considering test item characteristics.

2. Examinee responses to different test items can be assessed on the same scale.

3. Statistically sound methods for handling missing data can be easily applied.

IRT has traditionally been applied to test items for which responses can be scored as correct or incorrect, such as multiple-choice items. In recent years, however, there have been attempts to apply polytomous IRT models to performance assessments (Reise and Revicki 2014). Well-known IRT models that are applicable to ordered-categorical data such as rubric-based performance assessment data include the rating scale model (RSM) (Andrich 1978), partial credit model (PCM) (Masters 1982), and GPCM (Muraki 1997).

The GPCM gives the probability that examinee j receives score k for test item i as:

$$\begin{aligned} P_{ijk}= \frac{\exp \sum _{m=1}^{k}\left[ 1.7 \alpha _i (\theta _j-\beta _{i} - d_{im}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ 1.7 \alpha _i (\theta _j-\beta _{i} - d_{im}) \right] }\, , \end{aligned}$$
(2)

where \(\theta _j\) is the latent ability of examinee j, \(\alpha _i\) is a discrimination parameter for test item i, \(\beta _{i}\) is a difficulty parameter for item i, and \(d_{im}\) is a step parameter denoting difficulty of transition between scores \(m-1\) and m for item i. Here, \(d_{i1}=0\), \(\sum _{m=2}^{K} d_{im} = 0\), and a normal prior for the ability \(\theta _j\) are assumed for model identification. The constant 1.7 is often used to make the model similar to the normal ogive function.

The PCM is a special case of the GPCM, where \(\alpha _i=1.0\) for all the items. The RSM is a special case of the PCM, where \(d_{im}=d_{m}\) for all the items. Here, \(d_m\) denotes difficulty of transition between categories \(m-1\) and m.
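As a minimal numerical sketch of Eq. (2) (illustrative Python, not tied to any particular IRT package), the GPCM category probabilities can be computed as follows; setting \(\alpha _i = 1\) yields the PCM, and sharing the step parameters \(d_m\) across items yields the RSM.

```python
import numpy as np

def gpcm_prob(theta_j, alpha_i, beta_i, d_i, D=1.7):
    """Category probabilities P_{ijk} of Eq. (2) for examinee j and item i.

    theta_j : scalar ability
    alpha_i : item discrimination
    beta_i  : item difficulty
    d_i     : (K,) step parameters with d_i[0] = 0 and d_i[1:].sum() = 0
    """
    # Cumulative numerator exponents: sum_{m=1}^{k} 1.7 * alpha_i * (theta_j - beta_i - d_im)
    cum = np.cumsum(D * alpha_i * (theta_j - beta_i - d_i))
    cum -= cum.max()                     # guard against overflow
    p = np.exp(cum)
    return p / p.sum()

# Example with K = 4 rating categories
print(gpcm_prob(theta_j=0.5, alpha_i=1.2, beta_i=0.0, d_i=np.array([0.0, -0.8, 0.1, 0.7])))
```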

Such traditional IRT models are applicable to two-way data consisting of examinees \(\times\) test items. However, they cannot be directly applied to rubric-based performance assessment data comprising examinees \(\times\) raters \(\times\) evaluation items, even if we regard the test item parameters as the evaluation item parameters. To address this problem, IRT models that can jointly consider the characteristics of raters and evaluation items have been proposed (Uto and Ueno 2018; Linacre 1989; Jin and Wang 2018; Wilson and Hoskens 2001; Shin et al. 2019). Note that some of these IRT models were originally formulated for the characteristics of performance tasks and raters, assuming three-way data consisting of examinees \(\times\) raters \(\times\) performance tasks. However, this study assumes that such IRT models are applied to rubric-based performance assessment data by regarding the performance task parameters as the evaluation item parameters.

4 IRT models for performance assessment

The most widely used IRT model for performance assessment is the MFRM (Linacre 1989). There are several variants of the MFRM (Myford and Wolfe 2004; Eckes 2015), but the most representative modeling defines the probability that \(x_{ijr} =k \in {\mathcal {K}}\) as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ \theta _j-\beta _i-\beta _{r} - d_{m} \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l} \left[ \theta _j-\beta _i-\beta _{r} - d_{m} \right] }, \end{aligned}$$
(3)

where \(\beta _{i}\) is a difficulty parameter for evaluation item i, and \(\beta _{r}\) is the severity of rater r. For model identification, \(\sum _{r=1}^{R} \beta _{r}=0\), \(d_1=0\), \(\sum _{m=2}^{K} d_{m} = 0\), and a normal prior for the ability \(\theta _j\) are assumed.

A unique feature of the MFRM is that it has the fewest parameters among existing IRT models for performance assessment. The accuracy of parameter estimation generally increases as the number of parameters per data point decreases (Uto and Ueno 2016, 2020; van der Linden 2016). Thus, the MFRM can estimate model parameters from a small dataset more accurately than other models can, resulting in higher ability measurement accuracy if it fits the given data well (Uto and Ueno 2018; van der Linden 2016).

By contrast, the MFRM relies on the following assumptions.

1. All evaluation items have the same discriminatory power.

2. All raters have the same assessment consistency.

3. All raters share an equal-interval rating scale.

However, these assumptions are rarely satisfied in practice (Deng et al. 2018; Myford and Wolfe 2003; Jin and Wang 2018; Patz et al. 2002; Uto and Ueno 2020; Soo Park and Xing 2019; Elliott et al. 2009). Violation of these assumptions decreases model fitting and ability measurement accuracy (Uto and Ueno 2018).

To relax these assumptions, various extensions of the MFRM have recently been proposed (Uto and Ueno 2018; Jin and Wang 2018; Shin et al. 2019; Uto and Ueno 2020; Eckes 2015). This study introduces the generalized MFRM (Uto and Ueno 2020), which relaxes all three assumptions simultaneously. This model provides the probability that \(x_{ijr} = k\) as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ 1.7 \alpha _r\alpha _i(\theta _j -\beta _i-\beta _{r} - d_{rm}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l} \left[ 1.7 \alpha _r\alpha _i(\theta _j-\beta _i-\beta _{r} - d_{rm} )\right] }\, , \end{aligned}$$
(4)

where \(\alpha _i\) is a discrimination parameter for evaluation item i, \(\alpha _r\) is the consistency of rater r, and \(d_{rm}\) is a step parameter denoting the severity of rater r in the transition from rating category \(m-1\) to m. The rater-specific step parameter \(d_{rm}\) can represent the rating scale of each rater, meaning that the restriction of an equal-interval scale for raters is relaxed. For model identification, \(\prod _{i=1}^{I} \alpha _{i} = 1\), \(\sum _{i=1}^{I} \beta _{i} = 0\), \(d_{r1}=0\), \(\sum _{m=2}^{K} d_{rm} =0\), and a normal prior for the ability \(\theta _j\) are assumed.

The generalized MFRM can represent various rater effects and evaluation item characteristics, so it is expected to provide better model fitting and more accurate ability measurement than the MFRM, especially when various rater and evaluation item characteristics are assumed (Uto and Ueno 2020). Thus, this study develops a multidimensional model for rubric-based performance assessment based on the generalized MFRM.

Note that there are various approaches to dealing with rater effects, such as hierarchical rater models (Patz et al. 2002; DeCarlo et al. 2011), extensions based on signal detection models (Soo Park and Xing 2019; DeCarlo 2005), rater bundle models (Wilson and Hoskens 2001), and trifactor models (Shin et al. 2019). However, this study focuses on the MFRM-based approach because it is the most popular and traditional approach.

5 Multidimensional item response theory models

In objective testing contexts, the use of multidimensional IRT models that can measure examinee ability in multidimensional space has become widespread (Reckase 2009). Multidimensional IRT models can generally be classified into compensatory and non-compensatory models. Compensatory models assume that a low value on one sub-ability can be offset by high values on the others, so that performance depends on a weighted composite of the sub-abilities, whereas non-compensatory models assume that high performance requires all relevant sub-abilities concurrently. This study focuses on compensatory models for the following reasons. (1) The compensatory assumption would be suitable for performance assessment settings because performance on practical complex tasks generally reflects a composite of multiple abilities rather than a single specific ability. (2) Non-compensatory models require more complex model formulations, which increases the number of model parameters and makes interpretation and estimation of the parameters difficult.

A representative compensatory multidimensional IRT model for polytomous data is the multidimensional GPCM (Yao and Schwarz 2006). When test item parameters are regarded as evaluation item parameters, the model gives the probability that examinee j obtains score k for evaluation item i as

$$\begin{aligned} P_{ijk} = \frac{\exp \sum _{m=1}^{k} \left[ 1.7 \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i- d_{im} \right) \right] }{\sum _{k'=1}^{K} \exp \sum _{m=1}^{k'} \left[ 1.7 \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i-d_{im} \right) \right] }\, , \end{aligned}$$
(5)

where L indicates the number of assumed ability dimensions, \(\theta _{jl}\) is the ability of examinee j for dimension \(l \in {\mathcal {L}} = \{1,\ldots , L\}\), and \(\alpha _{il}\) indicates the discriminatory power of evaluation item i for the l-th ability dimension. For model identification, \(d_{i1}=0\), \(\sum _{m=2}^{K} d_{im} = 0\), and a normal prior for the ability of each dimension \(\theta _{jl}\) are assumed.

However, such conventional multidimensional IRT models have no rater parameters, which prevents not only estimation of examinee ability while considering the effects of rater characteristics, but also direct application to rubric-based performance assessment data. To address this limitation, this study proposes a new multidimensional IRT model that considers the characteristics of both the rubric’s evaluation items and the raters.

6 Proposed model

The proposed model is formulated as a multidimensional extension of the generalized MFRM based on the multidimensional GPCM approach. Specifically, this model gives the probability that \(x_{ijr} = k \in {\mathcal {K}}\) as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ 1.7 \alpha _r \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i- \beta _r-d_{im} \right) \right] }{\sum _{k'=1}^{K} \exp \sum _{m=1}^{k'}\left[ 1.7 \alpha _r \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i-\beta _r-d_{im} \right) \right] }\, . \end{aligned}$$
(6)

Unlike conventional models, which cannot consider rater characteristics or estimate ability on a multidimensional scale, the proposed model can estimate examinee ability on a multidimensional scale while considering the characteristics of both the evaluation items and the raters. Thus, the proposed model is expected to provide better model fitting and more accurate ability measurement than the conventional models when ability multidimensionality and rater effects exist. Furthermore, applying the proposed model to rubric-based performance assessment data provides the various characteristics of each evaluation item and rater, which helps in interpreting the quality of the evaluation items and the raters. In addition, the evaluation items’ discrimination parameters \(\alpha _{il}\) offer information for interpreting what each ability dimension measures, which makes objective analysis of rubric construct validity possible.

Note that the generalized MFRM incorporates a rater-specific step parameter \(d_{rm}\), assuming that the appearance tendency of each rating category m depends on the raters. In rubric-based performance assessment, however, the appearance tendency of each category is expected to depend more strongly on the rubric’s evaluation items than on the raters because the evaluation criteria for each rating category are defined by the rubric. Therefore, the proposed model defines the step parameter as an evaluation item parameter \(d_{im}\) rather than a rater parameter \(d_{rm}\).
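To make Eq. (6) concrete, the following sketch computes the category probabilities for a single examinee, evaluation item, and rater. It is an illustrative NumPy translation of the formula, not the authors’ Stan implementation (Appendix 1); setting \(\alpha _r = 1\) and \(\beta _r = 0\) recovers the multidimensional GPCM of Eq. (5).

```python
import numpy as np

def proposed_prob(theta_j, alpha_i, beta_i, d_i, alpha_r, beta_r, D=1.7):
    """Category probabilities P_{ijrk} of Eq. (6) for one examinee j,
    evaluation item i, and rater r.

    theta_j : (L,) ability vector of examinee j
    alpha_i : (L,) discrimination vector of evaluation item i
    beta_i  : difficulty of evaluation item i
    d_i     : (K,) step parameters of item i, with d_i[0] = 0
    alpha_r : consistency of rater r
    beta_r  : severity of rater r
    """
    # Per-category terms alpha_r * (sum_l alpha_il * theta_jl - beta_i - beta_r - d_im)
    xi = alpha_i @ theta_j - beta_i - beta_r - d_i      # length-K vector over m
    cum = np.cumsum(D * alpha_r * xi)                   # numerator exponents for k = 1..K
    cum -= cum.max()                                    # numerical stability
    p = np.exp(cum)
    return p / p.sum()

# Example with L = 2 dimensions and K = 4 categories
print(proposed_prob(theta_j=np.array([0.5, -0.2]),
                    alpha_i=np.array([1.2, 0.4]),
                    beta_i=0.1,
                    d_i=np.array([0.0, -0.8, 0.1, 0.7]),
                    alpha_r=0.9,
                    beta_r=0.3))
```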

6.1 Parameter interpretation

This subsection describes how to interpret parameters in the proposed model.

Figure 2 shows item response surfaces (IRSs) based on Eq. (6) for six pairs of raters and evaluation items with different characteristics, with the number of ability dimensions fixed to \(L=2\) and the number of rating categories fixed to \(K=4\). The parameters used for the IRSs are shown in Table 1. For example, in Fig. 2a, the IRS uses the parameter values for Evaluation item 1 and Rater 1 in Table 1. The x-axis indicates the ability in the first dimension \(\theta _{j1}\), the y-axis indicates the ability in the second dimension \(\theta _{j2}\), and the z-axis indicates the response probability \(P_{ijrk}\). The IRSs show that the probability of obtaining higher scores increases as the examinees’ abilities increase.

Fig. 2 Item response surfaces for evaluation items and raters with various characteristics

Table 1 Parameters used in Fig. 2

Figure 2b–f shows IRSs in which one model parameter is changed from the values used in Fig. 2a. How each parameter works is therefore explained below by comparing each panel with Fig. 2a.

Figure 2b shows the IRS in which the discrimination parameter for the second dimension, \(\alpha _{i2}\), is smaller than in Fig. 2a. The change in the response probabilities caused by a change in the second-dimension ability value is smaller than in Fig. 2a. This means that evaluation items with a small discrimination value \(\alpha _{il}\) for the l-th dimension do not distinguish the corresponding ability dimension, \(\theta _{jl}\), well. It also suggests that the ability dimension an evaluation item mainly measures can be interpreted based on the discrimination parameters, as we explain in the next subsection.

Figure 2c shows the IRS with a higher difficulty parameter, \(\beta _{i}\). An increase in the difficulty parameter \(\beta _{i}\) shifts the IRS toward higher ability values, reflecting that higher scores become harder to obtain as the evaluation item becomes more difficult.

Figure 2d shows the IRS in which the step difficulty parameters, \(d_{i2}\) and \(d_{i3}\), are changed. As the difference \(d_{i(m+1)}-d_{im}\) increases, the probability of obtaining category m increases over a wider range of ability values. In Fig. 2d, the probability for rating category 2 increases compared with that in Fig. 2a because \(d_{i3}-d_{i2}\) is large, whereas the probability for rating category 3 decreases because \(d_{i4}-d_{i3}\) is small.

Figure 2e shows the IRS for a rater with a lower consistency parameter, \(\alpha _{r}\). Compared with Fig. 2a, the differences in the response probabilities among the categories decrease, which reflects inconsistencies in rater scoring among examinees with the same ability level. In contrast, a higher consistency value \(\alpha _{r}\) results in large differences in the response probabilities among the categories. This observation reflects the tendency of consistent raters to consistently assign the same ratings to examinees with the same ability level and higher ratings to examinees with higher ability levels. This suggests that raters who are more consistent in scoring are generally desirable for accurate ability measurement.

Figure 2f shows the IRS for a rater with a high severity parameter, \(\beta _r\). Compared with Fig. 2a, the IRS shifts toward higher ability values as the rater severity increases, reflecting that severe raters assign higher ratings less readily.

6.2 Interpretation of ability dimensions

As noted above, the discrimination parameter \(\alpha _{il}\) for each dimension offers information for interpreting what each ability dimension measures. Specifically, by analyzing commonality in content among the evaluation items with higher discrimination values for each dimension, we can interpret what each ability dimension mainly measures.

Fig. 3 Discrimination parameters for five evaluation items

This analysis is explained in greater detail in Fig. 3, which depicts the discrimination parameters for five evaluation items. The horizontal axis indicates the index of the evaluation items, the vertical axis indicates the value of the discrimination parameters, and the colored bars correspond to the discrimination parameter for each dimension. The figure indicates that evaluation items 1 and 2 have larger discrimination values in the first dimension. This suggests that the first dimension mainly relates to a common ability that underlies evaluation items 1 and 2. Thus, by analyzing commonality in content between evaluation items 1 and 2, we can interpret the meaning of the first ability dimension. Similarly, the meaning of the second ability dimension can be interpreted by investigating commonality in content between evaluation items 3 and 4, which have higher discrimination parameters for the second dimension.
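A minimal sketch of this grouping step is shown below, using a hypothetical discrimination matrix: each item is assigned to the dimension on which it loads most strongly, and the commonality in content of the items within each group is then inspected manually.

```python
import numpy as np

# Hypothetical discrimination estimates alpha_hat[i, l] for five evaluation items
# and two ability dimensions.
alpha_hat = np.array([[1.4, 0.3],
                      [1.2, 0.2],
                      [0.3, 1.5],
                      [0.2, 1.3],
                      [0.7, 0.6]])

dominant_dim = alpha_hat.argmax(axis=1)          # dimension each item loads on most strongly
for l in range(alpha_hat.shape[1]):
    items = (np.where(dominant_dim == l)[0] + 1).tolist()   # 1-based item indices
    print(f"Dimension {l + 1}: evaluation items {items}")
```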

6.3 Optimal number of dimensions

In the proposed model, the number of dimensions, L, is a hyperparameter that must be determined in advance. In IRT studies, the optimal number of dimensions is generally explored based on principal component analysis. However, this method is not applicable to the three-way data assumed in the present study.

Dimensionality selection can also be regarded as a model selection task, a problem that is well studied in machine learning. Model selection is typically conducted using information criteria, such as the Akaike information criterion (AIC) (Akaike 1974), the Bayesian information criterion (BIC) (Schwarz 1978), the widely applicable information criterion (WAIC) (Watanabe 2010), and the widely applicable Bayesian information criterion (WBIC) (Watanabe 2013). AIC and BIC are applicable when maximum likelihood estimation is used to estimate model parameters, whereas WAIC and WBIC can be used with Bayesian estimation based on MCMC or variational inference methods. With the recent increase in complex statistical and machine learning models, various studies have used WAIC and WBIC because Bayesian estimation tends to provide more robust estimation for complex models (Vehtari et al. 2017; Almond 2014; Luo and Al-Harbi 2017). Because this study uses Bayesian estimation based on MCMC, as described in Sect. 6.5, it uses WAIC and WBIC to select the optimal number of dimensions for the proposed model. Specifically, the dimensionality that minimizes these criteria is regarded as optimal.
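As a sketch of how the WAIC could be computed from MCMC output (assuming a matrix of pointwise log-likelihoods evaluated at each posterior draw; this is not the authors' implementation), see below. The WBIC additionally requires a separate MCMC run from a tempered posterior with inverse temperature \(1/\log n\) (Watanabe 2013), so it is omitted here.

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC (deviance scale, smaller is better) from pointwise log-likelihoods.

    log_lik : (S, N) array of log p(x_n | parameter draw s) for S posterior draws
              and N observed ratings (one column per observed x_ijr).
    """
    S = log_lik.shape[0]
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))   # log pointwise predictive density
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))        # effective number of parameters
    return -2.0 * (lppd - p_waic)
```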

6.4 Model identifiability

The proposed model entails a non-identifiability problem whereby parameter values cannot be uniquely determined because different value sets can give the same response probability. For the proposed model without rater parameters, which coincides with the conventional multidimensional GPCM, the parameters are known to be identifiable by assuming a specific distribution (e.g., the standard normal distribution) for the ability and constraining \(d_{i1}=0\) and \(\sum _{m=2}^{K} d_{im} = 0\) for each i.

However, in the proposed model, the location for \(\beta _{i}+\beta _{r}\) is indeterminate, even when these constraints are given, because the response probability with \(\beta _{i}\) and \(\beta _{r}\) gives the same value of \(P_{ijrk}\) with \(\beta '_{i}=\beta _{i}+c\) and \(\beta '_{r} = \beta _{r}-c\) for any constant c (note that \(\beta '_{i}+\beta '_{r}=(\beta _{i}+c) +(\beta _{r}-c)=\beta _{i}+\beta _{r}\)). Such location indeterminacy can be solved by fixing one parameter or restricting the mean of some parameters (Uto and Ueno 2020; Fox 2010).

There is another indeterminacy in the scale of \(\alpha _r\). If we let the term \(\sum _{l=1}^{L} \alpha _{il} \theta _{jl} - \beta _i -\beta _r-d_{im}\) in Eq. (6) be \(\xi\), the response probability \(P_{ijrk}\) with \(\alpha _r\) and \(\xi\) gives the same value of \(P_{ijrk}\) with \(\alpha '_r = \alpha _r c\) and \(\xi ' = \frac{\xi }{c}\) for any constant c, because \(\alpha '_r \xi ' = (\alpha _r c) \frac{\xi }{c} = \alpha _r \xi\). Such scale indeterminacy can be removed by fixing one parameter or restricting the product of some parameters (Uto and Ueno 2020; Fox 2010).

This study therefore uses the restrictions \(\prod _{r=1}^{R} \alpha _{r} = 1\), \(\sum _{r=1}^{R} \beta _{r} = 0\) for model identification, in addition to \(d_{i1}=0\), \(\sum _{m=2}^{K} d_{im} =0\), and the standard normal prior for the ability of each dimension \(\theta _{jl}\).
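One common way to encode these restrictions in an estimation program is to declare only \(R-1\) free rater severities and \(R-1\) free log-consistencies and to define the last element from the others. The following sketch shows this reparameterization; it is an illustrative convention, not necessarily the one used in the authors' Stan code.

```python
import numpy as np

def constrained_rater_params(beta_free, log_alpha_free):
    """Map R-1 unconstrained rater severities and R-1 unconstrained log-consistencies
    to full-length vectors satisfying sum(beta_r) = 0 and prod(alpha_r) = 1."""
    beta_r = np.append(beta_free, -np.sum(beta_free))                 # sum-to-zero
    log_alpha_r = np.append(log_alpha_free, -np.sum(log_alpha_free))  # log-sum-to-zero
    return beta_r, np.exp(log_alpha_r)                                # product equals one
```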

6.5 Parameter estimation using MCMC

This subsection describes the parameter estimation method for the proposed model.

Marginal maximum likelihood estimation using an expectation–maximization algorithm is a widely used method for estimating IRT model parameters (Baker and Kim 2004). However, for complex models such as that proposed in this study, expected a posteriori (EAP) estimation, a type of Bayesian estimation, is known to provide more robust estimations (Uto and Ueno 2016; Fox 2010).

EAP estimates are calculated as the expected value of the marginal posterior distribution of each parameter. The posterior distribution in the proposed model is

$$\begin{aligned}&g(\varvec{\theta _{jl}},\log \varvec{\alpha _{il}},\log \varvec{\alpha _r},\varvec{\beta _i}, \varvec{\beta _r},\varvec{d_{im}}|\varvec{X}) \propto L(\varvec{X}|\varvec{\theta _{jl}}, \log \varvec{\alpha _{il}},\log \varvec{\alpha _r},\varvec{\beta _i},\varvec{\beta _r},\varvec{d_{im}}) \nonumber \\&\quad g(\varvec{\theta _{jl}} | \tau _{\theta _{jl}}) g(\log \varvec{\alpha _{il}} | \tau _{\alpha _{il}}) g(\log \varvec{\alpha _{r}} | \tau _{\alpha _r}) g(\varvec{\beta _i} | \tau _{\beta _i}) g(\varvec{\beta _r} | \tau _{\beta _r}) g(\varvec{d_{im}} | \tau _{d_{im}}), \end{aligned}$$
(7)

where \(\varvec{\theta _{jl}} = \{ \theta _{jl} \mid j \in {\mathcal {J}}, l \in {\mathcal {L}}\}\), \(\log \varvec{\alpha _{il}} = \{\log \alpha _{il} \mid i \in {\mathcal {I}}, l \in {\mathcal {L}}\}\), \(\log \varvec{\alpha _r} = \{\log \alpha _{r} \mid r \in {\mathcal {R}} \}\), \(\varvec{\beta _i}= \{ \beta _{i} \mid i \in {\mathcal {I}}\}\), \(\varvec{\beta _r}= \{ \beta _{r} \mid r \in {\mathcal {R}} \}\), and \(\varvec{d_{im}} = \{ d_{im} \mid i \in {\mathcal {I}}, m \in {\mathcal {K}}\}\). In addition, \(g(\varvec{S}|\tau _{s})=\prod _{s \in \varvec{S}} g(s |\tau _{s})\), where \(\varvec{S}\) is a set of parameters, \(g(s |\tau _{s})\) indicates a prior distribution for parameter s, and \(\tau _{s}\) denotes its hyperparameters. \(L(\varvec{X}|\varvec{\theta _{jl}},\log \varvec{\alpha _{il}},\log \varvec{\alpha _{r}},\varvec{\beta _i},\varvec{\beta _r}, \varvec{d_{im}})\) is the likelihood, calculated as \(\Pi _{j=1}^{J} \Pi _{i=1}^{I} \Pi _{r=1}^{R} \Pi _{k=1}^{K} (P_{ijrk})^{z_{ijrk}}\), where \(z_{ijrk}\) is a dummy variable that takes 1 if \(x_{ijr} = k\) and zero otherwise.
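For reference, this likelihood can be evaluated with a straightforward loop over the observed entries of \(\varvec{X}\); the sketch below reuses the `proposed_prob` function from the earlier example and treats \(-1\) entries as missing.

```python
import numpy as np

def log_likelihood(X, theta, alpha_il, beta_i, d, alpha_r, beta_r):
    """Log-likelihood of the ratings X (shape I x J x R, -1 = missing) under the
    proposed model, reusing proposed_prob() from the earlier sketch.

    theta: (J, L), alpha_il: (I, L), beta_i: (I,), d: (I, K), alpha_r: (R,), beta_r: (R,)
    """
    I, J, R = X.shape
    ll = 0.0
    for i in range(I):
        for j in range(J):
            for r in range(R):
                k = X[i, j, r]
                if k == -1:                       # skip missing ratings
                    continue
                p = proposed_prob(theta[j], alpha_il[i], beta_i[i], d[i],
                                  alpha_r[r], beta_r[r])
                ll += np.log(p[k - 1])            # rating categories are 1..K
    return ll
```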

The marginal posterior distribution for each parameter is derived by marginalizing across all parameters except the target parameter. For a complex IRT model, however, it is not generally feasible to derive or calculate the marginal posterior distribution due to high-dimensional multiple integrals. To address this problem, there is widespread use of MCMC, a random sampling-based estimation method, in various fields including IRT studies (Fox 2010; Uto 2019; van Lier et al. 2018; Fontanella et al. 2019; Zhang et al. 2011; Uto et al. 2017; Louvigné et al. 2018; Brooks et al. 2011).

The Metropolis-Hastings-within-Gibbs sampling method (Patz and Junker 1999) is a common MCMC algorithm for IRT models. The algorithm is simple and easy to implement but requires a long time to converge to the target distribution (Hoffman and Gelman 2014; Girolami and Calderhead 2011). As an efficient alternative MCMC algorithm, the NUT sampler (Hoffman and Gelman 2014), a variant of the HMC, has recently been proposed along with a software package called “Stan” (Carpenter et al. 2017), which makes implementation of a NUT-based HMC easy. Thus, there has been recent widespread use of NUT for parameter estimations in various statistical models, including IRT models (Uto and Ueno 2020; Luo and Jiao 2018; Jiang and Carter 2019).

We therefore use a NUT-based MCMC algorithm for parameter estimation in the proposed model. The estimation program was implemented in RStan (Stan Development Team 2018), and the developed Stan code is provided in Appendix 1. In this study, the standard normal distribution N(0.0, 1.0) is used as the prior distribution for each parameter: \(\theta _{jl}\), \(\log \alpha _{il}\), \(\log \alpha _{r}\), \(\beta _{i}\), \(\beta _{r}\), and \(d_{im}\). Furthermore, the EAP estimates are calculated as the mean of the parameter samples obtained from the 2000th to the 4000th iteration.
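Schematically (this is not the authors' RStan script), the EAP estimates are simply posterior means of the retained draws; the array shape and file name below are hypothetical.

```python
import numpy as np

# Hypothetical array of posterior draws of theta with shape (iterations, J, L),
# e.g., extracted from a fitted Stan model; the file name is a placeholder.
draws = np.load("theta_draws.npy")

kept = draws[2000:4000]             # samples from the 2000th to the 4000th iteration
theta_eap = kept.mean(axis=0)       # EAP estimate: posterior mean per examinee and dimension
theta_psd = kept.std(axis=0)        # posterior standard deviation as an uncertainty measure
```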

7 Simulation experiments

7.1 Accuracy of parameter recovery

In this subsection, we describe a parameter recovery experiment for the proposed model through simulations. We conducted the following experiments by changing the number of examinees, evaluation items, raters, and dimensions to \(J \in \{50, 100\}\), \(I \in \{5, 15\}\), \(R \in \{5, 15\}\), and \(L \in \{1, 2, 3\}\), respectively.

1. For J examinees, I evaluation items, R raters, and L dimensions, generate true model parameters randomly from the following distributions, which are the same as the prior distributions used in the MCMC algorithm:

   $$\begin{aligned} \theta _{jl}, \log \alpha _{il}, \log \alpha _{r}, \beta _{i}, \beta _{r}, d_{im} \sim N(0.0, 1.0). \end{aligned}$$
   (8)

   The number of rating categories, K, was fixed to 4 to match the condition of the actual data used in a later section.

2. Set skewed discrimination values for the first L evaluation items \(i \in \{1,\ldots ,L\}\) as follows:

   $$\begin{aligned} {\left\{ \begin{array}{ll} \alpha _{il} = 1.5 &{} i = l \\ \alpha _{il} = 0.2 &{} i \ne l \end{array}\right. }. \end{aligned}$$
   (9)

   The necessity of this procedure is explained in the paragraph following this list.

3. Given the true parameters, generate rating data from the proposed model randomly.

4. Estimate the model parameters from the generated data.

5. Sort the order of the estimated dimensions based on the discrimination parameter estimates. Specifically, using the discrimination parameter estimates for the first L evaluation items, we sorted the dimensions so that \(\sum _{l=1}^{L} \alpha _{ll}\) was maximized.

6. Calculate root mean square errors (RMSEs) and biases between the estimated and true parameters.

7. Repeat the above procedure 30 times, then calculate the average values of the RMSEs and biases.

Note that Procedures 2 and 5 are required because the ability dimensions in multidimensional IRT models, including the proposed model, are exchangeable, meaning that the dimension to which a sub-ability corresponds can change every time the parameter estimation is run. This dimension indeterminacy arises because interchanging \(\alpha _{il}\theta _{jl}\) and \(\alpha _{il'}\theta _{jl'}\) (\(l \in {\mathcal {L}}\), \(l' \in {\mathcal {L}}\), \(l \ne l'\)) leaves the term \(\sum _{l=1}^{L} \alpha _{il} \theta _{jl}\) unchanged. To appropriately calculate the RMSEs and biases between the estimated and true parameters, this experiment addressed the problem by setting extreme discrimination parameter values for the first L evaluation items in Procedure 2 and by sorting the estimated dimensions based on the discrimination parameter estimates in Procedure 5, as in Martin-Fernandez and Revuelta (2017); a sketch of this alignment step is shown below.
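The following sketch illustrates Procedures 5 and 6 under these assumptions, using a brute-force search over the L! dimension permutations, which is inexpensive for the small L considered here.

```python
import numpy as np
from itertools import permutations

def align_dimensions(alpha_hat, theta_hat, L):
    """Resolve dimension exchangeability (Procedure 5): permute the estimated
    dimensions so that the anchor items' diagonal discriminations
    sum_l alpha_hat[l, perm[l]] are maximized (rows 0..L-1 are the anchor items)."""
    best = max(permutations(range(L)),
               key=lambda p: sum(alpha_hat[l, p[l]] for l in range(L)))
    best = list(best)
    return alpha_hat[:, best], theta_hat[:, best]

def rmse_and_bias(est, true):
    """RMSE and bias between estimated and true parameters (Procedure 6)."""
    diff = np.asarray(est) - np.asarray(true)
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(diff))
```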

Table 2 RMSE values obtained from the parameter recovery experiment

Table 2 shows the RMSE results, which confirm the following tendencies:

1. The RMSEs for ability values tend to decrease as the number of evaluation items and/or raters increases. Similarly, the RMSEs for the rater and evaluation item parameters tend to decrease as the number of examinees increases. These tendencies are caused by the increase in the amount of data per parameter, which is consistent with previous studies (Uto and Ueno 2016, 2018).

2. An increase in the number of dimensions tends to lead to an increase in the RMSEs because the number of ability and discrimination parameters increases without an increase in the amount of data. This tendency is also consistent with previous research on multidimensional IRT (Martin-Fernandez and Revuelta 2017; Svetina et al. 2017; Kose and Demirtasli 2012).

Table 3 Average bias values obtained from the parameter recovery experiment

Moreover, Table 3 shows that the average bias was nearly zero in all cases, indicating no overestimation or underestimation of parameters. We also confirmed that the Gelman–Rubin statistic \(\hat{R}\) (Gelman and Rubin 1992; Gelman et al. 2013), a well-known convergence diagnostic index, was less than 1.1 in all cases, indicating that the MCMC runs converged.

From the above, we conclude that the parameter estimation for the proposed model can be appropriately conducted using the MCMC algorithm.

7.2 Validity of dimensionality selection using information criteria

This subsection describes a simulation experiment for evaluating the accuracy of the dimensionality selection using the WAIC and WBIC as information criteria. Concretely, we conducted the following experiments by changing the number of examinees, evaluation items, raters and dimensions to \(J \in \{50, 100\}\), \(I \in \{5, 15\}\), \(R \in \{5, 15\}\), and \(L \in \{1, 2, 3\}\), respectively.

1. For J examinees, I evaluation items, R raters, and L dimensions, generate rating data from the proposed model after the true model parameters are randomly generated from the distributions in Eq. (8). The number of categories, K, was fixed to 4, as in the parameter recovery experiment.

2. Using the generated data, estimate the parameters of the proposed model and calculate the WAIC and WBIC values while changing the number of dimensions \(L^e\) over \(\{1,2,3\}\).

3. Rank the WAIC and WBIC values for each \(L^e\), such that the \(L^e\) with the lowest WAIC and WBIC values is ranked first.

4. Repeat the above procedure 30 times, then calculate the average rank. Additionally, calculate the ratio of correct dimensionality identification for each setting.

Table 4 Results of model comparison using an information criterion

Table 4 shows the results. The \(L^e=1, 2, 3\) columns show the average estimated ranks, with the best (smallest) average rank for each setting shown in bold. The Acc column shows the ratio of correct dimensionality identification.

The results show that the WAIC can select the true dimensionality in many cases. The WBIC can also select the true dimensionality when the size of the data increases, although it tends to be inferior to the WAIC when the size of the data is small.

These results suggest that both information criteria can find the optimal number of dimensions in larger-scale settings, although the WAIC will be more accurate than the WBIC in relatively small-scale settings.

7.3 Accuracy of ability measurement

This subsection evaluates whether the consideration of the rater characteristics in the proposed model is effective in improving ability measurement accuracy. For the evaluation, we compare ability measurement accuracy between the proposed model and the conventional multidimensional GPCM. Note that the conventional multidimensional GPCM is consistent with the proposed model without rater parameters. We conducted the following experiments by changing the number of examinees, evaluation items, raters, and dimensions to \(J \in \{50, 100\}\), \(I \in \{5, 15\}\), \(R \in \{5, 15\}\), and \(L \in \{1, 2, 3\}\), respectively.

1. For J examinees, I evaluation items, R raters, and L dimensions, randomly generate true model parameters for the proposed model and for the conventional multidimensional GPCM. Here, the parameters were generated from the distributions in Eq. (8). The number of categories, K, was fixed to 4, as in the above experiments.

2. Generate rating data from the two models, respectively, given the parameters generated in Procedure 1.

3. Using each generated dataset, estimate the model parameters and examinee abilities with each model.

4. Calculate Pearson’s correlation between the ability estimates and the true ability values generated in Procedure 1.

5. Repeat the above procedure 30 times, then calculate the average correlation value.

Note that the conventional multidimensional GPCM cannot directly handle three-way data consisting of examinees, raters, and evaluation items. Thus, in this experiment, we applied the conventional multidimensional GPCM assuming that the probability for the three-way data \(P_{ijrk}\) is defined by \(P_{ijk}\) in Eq. (5) as follows:

$$\begin{aligned} P_{ijrk} = P_{ijk}= \frac{\exp \sum _{m=1}^{k} \left[ 1.7 \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i- d_{im} \right) \right] }{\sum _{k'=1}^{K} \exp \sum _{m=1}^{k'} \left[ 1.7 \left( \sum _{l=1}^{L} \alpha _{il} \theta _{jl}-\beta _i-d_{im} \right) \right] }. \end{aligned}$$
(10)
Table 5 Ability measurement accuracy for complete data under a fully crossed design

Table 5 shows the results. In the table, the Generation from Prop and Generation from Conv columns show the results when data are generated from the proposed model and from the conventional multidimensional GPCM, respectively. The sub-columns Prop and Conv show the average correlation values when the proposed model and the conventional multidimensional GPCM are applied to each dataset, respectively. Furthermore, the sub-column Diff indicates the difference in average correlation values between Prop and Conv, where a larger Diff value indicates that the proposed model is more accurate. We also conducted a paired t-test comparing the correlations obtained by the proposed model and the conventional multidimensional GPCM, and the resulting p values are shown in the p column.

The results show that the performance of the conventional model drops significantly when the data are generated from the proposed model, whereas the performance of the proposed model is almost equal to that of the conventional model when the data are generated from the conventional model. This suggests that ignoring rater characteristics deteriorates ability measurement accuracy when the raters have different characteristics. Note that the p values of the paired t-test depend on two factors, namely, (1) the mean of the difference between conditions and (2) the standard deviation of the difference between conditions. Thus, as seen in Table 5, a larger absolute Diff value does not necessarily result in a lower p value.

Although the results demonstrate the effectiveness of the proposed model, Table 5 shows that the improvement in correlation values achieved by the proposed model is small. This is because the data \(\varvec{X}\) were generated as complete data under a fully crossed design, in which all raters evaluate all examinees. In this case, because the data per examinee are large and dense, ability measurement accuracy tends to be extremely high in both models, making the difference in performance between the models small. In practice, however, the data will be sparser because few raters are often assigned to each examinee to reduce the raters’ assessment workload. In sparse data settings, we can expect the difference in performance between the models to be clearer.

Thus, we conducted the same experiment as described above assuming a practical situation in which only a few raters are assigned to each examinee. Concretely, in Procedure 2, we first assigned two raters to each examinee based on a systematic link design (Shin et al. 2019; Uto 2020; Wind and Jones 2019), and then we generated the data based on the rater assignment. Examples of a fully crossed design and a systematic link design are illustrated in Tables 6 and 7, where checkmarks indicate an assigned rater and blank cells indicate that no rater was assigned. The ratings without assigned raters are treated as missing data.
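As an illustration, the following sketch builds one simple linked assignment in which consecutive rater pairs overlap so that all raters are connected through shared examinees; the actual systematic link designs used in the cited studies may differ in detail.

```python
import numpy as np

def systematic_link_assignment(J, R, raters_per_examinee=2):
    """Build a J x R indicator matrix assigning `raters_per_examinee` raters to each
    examinee so that consecutive rater pairs overlap, linking all raters through
    shared examinees. A simple illustrative scheme, not necessarily the exact
    design used in the cited studies."""
    design = np.zeros((J, R), dtype=int)
    for j in range(J):
        for offset in range(raters_per_examinee):
            design[j, (j + offset) % R] = 1
    return design

# Example: 10 examinees, 5 raters, 2 raters each; unassigned cells become missing (-1) ratings.
print(systematic_link_assignment(10, 5))
```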

Table 6 Example of fully crossed design
Table 7 Example of systematic link design
Table 8 Ability measurement accuracy for sparse data under a systematic links design

Table 8 shows the results. The average correlation values of the conventional model drop substantially when the data are generated from the proposed model, whereas the high performance of the proposed model is maintained regardless of the data generation model.

From these experimental results, we can conclude that the consideration of the rater characteristics in the proposed model is effective in improving ability measurement accuracy.

8 Actual data experiments

This section describes the performance of the proposed model in experiments based on actual data.

8.1 Actual data

Table 9 Evaluation items and their characteristic parameters

In this experiment, actual rubric-based performance assessment data were gathered as follows:

1. We recruited 134 Japanese university students as participants.

2. The participants were asked to complete an essay-writing task that was translated from a task used in the National Assessment of Educational Progress (Persky et al. 2003; Salahu-Din et al. 2008). No specific or preliminary knowledge was needed to complete the task.

3. The written essays were evaluated by 18 raters using a rubric consisting of 9 evaluation items, each with 4 rating categories. We assigned four raters to each essay based on a systematic link design (Shin et al. 2019; Uto 2020; Wind and Jones 2019) to reduce the raters’ assessment workload. The Evaluation items column in Table 9 lists abstracts of the evaluation items in the rubric, which was created based on two writing assessment rubrics proposed by Matsushita et al. (2013) and Nakajima (2017) for Japanese university students. Furthermore, Appendix 2 presents the full rubric.

We evaluated the effectiveness of the proposed model using the obtained data.

8.2 Model comparison using information criteria

As explained above, the proposed model can estimate examinee ability on a multidimensional scale while considering the characteristics of both the raters and the rubric’s evaluation items. To evaluate the effectiveness of the consideration of the multidimensionality and rater characteristics, we conducted a model fitting evaluation based on information criteria. Specifically, we calculated the WAIC and WBIC for the proposed model and the conventional multidimensional GPCM, consistent with the proposed model without rater parameters, using the actual data for each dimensionality \(L \in \{1,\ldots ,5\}\).

Table 10 shows the results, with the minimum score for each criterion in bold. The table indicates that the WAIC and WBIC are minimized when \(L=2\) for both the proposed model and the conventional model. This means that the unidimensionality assumption is not satisfied in the data, suggesting that multidimensional models are required. Furthermore, comparison of the two models shows that the proposed model provides better model fitting than the conventional model in all cases. These results suggest that consideration of rater characteristics is effective in improving model fitting, which supports the effectiveness of the proposed model.

Table 10 Model comparison using actual data

8.3 Characteristic interpretation of the rubric’s evaluation items

In this subsection, we show the interpretation of the characteristics of the evaluation items. Table 9 shows the parameters of the evaluation items, which were estimated by the proposed model under \(L=2\). Here, \(L=2\) was used because it provided the best model fitting, as shown in the experiment above.

According to Table 9, the evaluation items reveal different patterns of discrimination parameters. For example, evaluation items 1–6 have larger discrimination values in the second dimension, whereas evaluation items 7–9 have larger discrimination values in the first dimension. Moreover, evaluation item 4 has relatively low discrimination values in both dimensions, meaning that it might not be suitable for distinguishing examinee ability. In contrast, evaluation item 6 has moderate discrimination values in both dimensions, meaning that it measures two-dimensional ability concurrently.

The discrimination parameters of each evaluation item enable us to interpret what is mainly measured by each ability dimension. Specifically, as described above, Table 9 shows that evaluation items 1–6 have larger discrimination values in the second dimension, and evaluation items 7–9 have larger values in the first dimension. These results suggest that the first ability dimension reflects a common ability underlying evaluation items 7–9, and the second dimension reflects a common ability underlying evaluation items 1–6. According to the contents of the evaluation items (see Appendix 2), evaluation items 7–9 relate to stylistic skills (such as typographical errors and word choice), whereas evaluation items 1–6 relate to logical skills (such as argumentation and organization). Indeed, the rubric was designed such that evaluation items 1–6 mainly measure argumentative skills and evaluation items 7–9 measure stylistic skills (Matsushita et al. 2013; Nakajima 2017). These results suggest that the rubric developers’ expectation is supported by the analysis based on the proposed model.

Fig. 4 Item response surfaces for two evaluation items with different step difficulty

Furthermore, Table 9 shows that the level of difficulty differs among the evaluation items. For example, evaluation item 4 is the most difficult, and evaluation item 7 is the easiest. These estimates are reasonable because evaluation item 4 requires sufficient discussion of opposing opinions, whereas evaluation item 7 requires only superficial typographical correctness.

The step difficulty parameters, \(d_{im}\), also show different patterns, meaning that the score distribution differs among the evaluation items. As examples, Fig. 4 depicts the IRSs for evaluation items 3 and 4, which have different step difficulty parameter patterns but relatively similar discrimination and difficulty values. The figure shows that a score of 2 tends to be avoided and a score of 3 tends to be preferred for evaluation item 4.

8.4 Rater parameter estimates and ability estimates

To confirm whether the rater characteristics differed, rater parameter estimates were obtained, as shown in Table 11. According to the table, severity and consistency differ among raters. For example, Raters 1 and 17 are highly inconsistent raters whose ratings might be unreliable, whereas Raters 8 and 15 are highly consistent raters. Furthermore, Raters 1, 4, and 8 have higher severity values, whereas Raters 3, 9 and 17 have lower severity values. The variety of rater characteristics is the reason why the proposed model provided better model fitting than the conventional multidimensional GPCM.

Table 11 Rater parameter estimates
Fig. 5 Ability estimates

Moreover, Fig. 5 shows the two-dimensional ability estimates for each examinee. The horizontal axis indicates the first-dimensional ability value \(\theta _{j1}\), the vertical axis indicates the second-dimensional ability value \(\theta _{j2}\), and each dot represents an examinee. The figure shows that the examinees have different ability patterns. Such multidimensional ability measurement cannot be realized by conventional unidimensional IRT models.

9 Conclusion

This study proposed a new IRT model for rubric-based performance assessment. The model was formulated as a multidimensional extension of the generalized MFRM. A NUT-based HMC algorithm for the proposed model was implemented using the software package Stan. Through simulation experiments, we demonstrated the following: (1) The MCMC algorithm appropriately estimates the model parameters. (2) An optimal number of dimensions for the proposed model can be determined using information criteria. (3) The consideration of rater characteristics in the proposed model is effective in improving ability measurement accuracy. We also conducted experiments with actual data to show examples of analyzing rubric quality and rubric construct validity by interpreting the dimensionality and the characteristics of the evaluation items. Furthermore, the actual data experiments showed that considering multidimensionality and rater characteristics in the proposed model improved model fitting.

In future studies, we plan to evaluate the effectiveness of the proposed model using more varied and larger datasets. Furthermore, we hope to extend the proposed model to four-way data consisting of examinees \(\times\) raters \(\times\) evaluation items \(\times\) performance tasks because practical tests often include several tasks.