1 Introduction

In various assessment contexts, there is an increasing need to measure practical, higher-order abilities such as problem solving, critical reasoning, and creative thinking skills (e.g., Muraki et al. 2000; Myford and Wolfe 2003; Kassim 2011; Bernardin et al. 2016; Uto and Ueno 2016). To measure such abilities, performance assessments, in which raters assess examinee outcomes or processes for performance tasks, have attracted much attention (Muraki et al. 2000; Palm 2008; Wren 2009). Performance assessments have been used in various formats such as essay writing, oral presentations, interview examinations, and group discussions.

In performance assessments, however, difficulty persists in that ability measurement accuracy strongly depends on rater and task characteristics, such as rater severity, consistency, range restriction, task difficulty, and discrimination (e.g., Saal et al. 1980; Myford and Wolfe 2003, 2004; Eckes 2005; Kassim 2011; Suen 2014; Shah et al. 2014; Nguyen et al. 2015; Bernardin et al. 2016). Therefore, improving measurement accuracy requires ability estimation considering the effects of those characteristics (Muraki et al. 2000; Suen 2014; Uto and Ueno 2016).

For this reason, item response theory (IRT) models that incorporate rater and task characteristic parameters have been proposed (e.g., Uto and Ueno 2016; Eckes 2015; Patz and Junker 1999; Linacre 1989). One representative model is the many-facet Rasch model (MFRM) (Linacre 1989). Although several MFRM variations exist (Myford and Wolfe 2003, 2004; Eckes 2015), the most common formulation is defined as a rating scale model (RSM) (Andrich 1978) that incorporates rater severity and task difficulty parameters. This model assumes a common equal-interval rating scale for all raters, but in practice rating scales are known to vary among raters due to the effects of range restriction, a common rater characteristic indicating the tendency for raters to overuse a limited number of rating categories (Myford and Wolfe 2003; Kassim 2011; Eckes 2005; Saal et al. 1980; Rahman et al. 2017). Therefore, this model does not fit the data well when raters with range restriction exist, which lowers ability measurement accuracy. To address this problem, another MFRM formulation that relaxes the equal-interval rating scale condition for raters has been proposed (Linacre 1989). This model, however, still makes assumptions that might not be satisfied in practice, namely the same rating consistency for all raters and the same discrimination power for all tasks (Uto and Ueno 2016; Patz et al. 2002). To relax these assumptions, an IRT model that incorporates parameters for rater consistency and task discrimination has also been proposed (Uto and Ueno 2016). Its performance declines when raters with range restriction exist, however, because, like the conventional MFRM, it assumes an equal-interval rating scale for all raters.

The three rater characteristics assumed in the conventional models (severity, range restriction, and consistency) are known to commonly occur when rater diversity increases (Myford and Wolfe 2003; Kassim 2011; Eckes 2005; Saal et al. 1980; Uto and Ueno 2016; Rahman et al. 2017; Uto and Ueno 2018a), and ignoring any one of them degrades model fit and measurement accuracy. However, no model capable of simultaneously considering all these characteristics has been proposed so far.

One obstacle to developing such a model is the difficulty of parameter estimation. The MFRM and its extensions conventionally use maximum likelihood estimation. However, this generally leads to unstable, inaccurate parameter estimates in complex models. For such models, a Bayesian estimation method called expected a posteriori (EAP) estimation generally provides more robust estimates (Uto and Ueno 2016; Fox 2010). EAP estimation involves solving high-dimensional multiple integrals and thus incurs high computational costs, but recent increases in computational capability and the development of efficient algorithms such as Markov chain Monte Carlo (MCMC) have made it feasible. In IRT studies, EAP estimation using MCMC has been used for hierarchical Bayesian IRT, multidimensional IRT, and multilevel IRT (Fox 2010).

We, therefore, propose a new IRT model that represents all three rater characteristics, together with a Bayesian estimation method for it using MCMC. Specifically, the proposed model is formulated as a generalization of the MFRM without equal-interval rating scales for raters. The proposed model has the following benefits:

  1. Model fitting is improved when the variety of raters increases, because the characteristics of each rater can be more flexibly represented.

  2. More accurate ability measurement is provided when the variety of raters increases, because abilities can be estimated while more precisely considering the effects of each rater's characteristics.

We also present a Bayesian estimation method for the proposed model using No-U-Turn Hamiltonian Monte Carlo, a state-of-the-art MCMC algorithm (Hoffman and Gelman 2014). We further demonstrate that the method can appropriately estimate model parameters even when the sample size is relatively small, such as the case of 30 examinees, 3 tasks, and 5 raters.

2 Data

This study assumes that performance assessment data \({\varvec{X}}\) consist of a rating \(x_{ijr} \in \mathcal{K} = \{1, 2, \ldots , K\}\) assigned by rater \(r\in \mathcal{R} = \{1, 2, \ldots ,R\}\) to performance of examinee \(j\in \mathcal{J}=\{1, 2, \ldots ,J\}\) for performance task \(i\in \mathcal{I}=\{1, 2, \ldots ,I\}\). Therefore, data \({\varvec{X}}\) are described as

$$\begin{aligned} {\varvec{X}} = \{ x_{ijr} | x_{ijr} \in \mathcal{K} \cup \{-1\}, i \in \mathcal{I}, j \in \mathcal{J}, r \in \mathcal{R}\}, \end{aligned}$$
(1)

where \(x_{ijr} = -1\) represents missing data.

This study aims to accurately estimate examinee ability from rating data \({\varvec{X}}\). In performance assessments, however, a difficulty persists in that ability measurement accuracy strongly depends on rater and task characteristics (e.g., Saal et al. 1980; Myford and Wolfe 2003; Eckes 2005; Kassim 2011; Suen 2014; Shah et al. 2014; Bernardin et al. 2016; DeCarlo et al. 2011; Crespo et al. 2005).

3 Common rater and task characteristics

The following are common rater characteristics on which ability measurement accuracy generally depends:

  1. Severity: The tendency to give consistently lower ratings than are justified by performance.

  2. Consistency: The extent to which the rater assigns similar ratings to performances of similar quality.

  3. Range restriction: The tendency to overuse a limited number of rating categories. Special cases of range restriction are the central tendency, namely a tendency to overuse the central categories, and the extreme response tendency, a tendency to prefer endpoints of the response scale (Elliott et al. 2009).

The following are typical task characteristics on which accuracy depends:

  1. Difficulty: More difficult tasks tend to receive lower ratings.

  2. Discrimination: The extent to which different levels of the ability to be measured are reflected in task outcome quality.

To estimate examinee abilities while considering these rater and task characteristics, item response theory (IRT) models that incorporate parameters representing those characteristics have been proposed (e.g., Uto and Ueno 2016; Eckes 2015; Patz and Junker 1999; Linacre 1989). Before introducing these models, the following section describes the conventional IRT model on which they are based.

4 Item response theory

IRT (Lord 1980), which is a test theory based on mathematical models, has been increasingly used with the widespread adoption of computer testing. IRT hypothesizes a functional relationship between observed examinee responses to test items and latent ability variables that are assumed to underlie the observed responses. IRT models provide an item response function that specifies the probability of a response to a given item as a function of latent examinee ability and the item’s characteristics. IRT offers the following benefits:

  1. It is possible to estimate examinee ability while considering characteristics of each test item.

  2. Examinee responses to different test items can be assessed on the same scale.

  3. Missing data can be easily estimated.

IRT has traditionally been applied to test items for which responses can be scored as correct or incorrect, such as multiple-choice items. In recent years, however, there have been attempts to apply polytomous IRT models to performance assessments (Muraki et al. 2000; Matteucci and Stracqualursi 2006; DeCarlo et al. 2011). The following subsections describe two representative polytomous IRT models: the generalized partial credit model (GPCM) (Muraki 1997) and the graded response model (GRM) (Samejima 1969).

4.1 Generalized partial credit model

The GPCM gives the probability that examinee j receives score k for test item i as

$$\begin{aligned} P_{ijk}= \frac{\exp \sum _{m=1}^{k}\left[ \alpha _i (\theta _j-\beta _{im}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ \alpha _i (\theta _j-\beta _{im}) \right] }~, \end{aligned}$$
(2)

where \(\alpha _i\) is a discrimination parameter for item i, \(\beta _{ik}\) is a step difficulty parameter denoting difficulty of transition between scores \(k-1\) and k in the item, and \(\theta _j\) is the latent ability of examinee j. Here, \(\beta _{i1}=0\) for each i is given for model identification.

Decomposing the step difficulty parameter \(\beta _{ik}\) to \(\beta _{i} + d_{ik}\), the GPCM is often described as

$$\begin{aligned} P_{ijk}= \frac{\exp \sum _{m=1}^{k}\left[ \alpha _i (\theta _j-\beta _{i} - d_{im}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ \alpha _i (\theta _j-\beta _{i} - d_{im}) \right] }~, \end{aligned}$$
(3)

where \(\beta _{i}\) is a positional parameter representing the difficulty of item i and \(d_{ik}\) is a step parameter denoting difficulty of transition between scores \(k-1\) and k for item i. Here, \(d_{i1}=0\) and \(\sum _{k=2}^{K} d_{ik} = 0\) for each i are given for model identification.

The GPCM is a generalization of the partial credit model (PCM) (Masters 1982) and the rating scale model (RSM) (Andrich 1978). The PCM is a special case of the GPCM, where \(\alpha _i=1.0\) for all items. Moreover, the RSM is a special case of PCM, where \(\beta _{ik}\) is decomposed to \(\beta _{i}+d_{k}\). Here, \(d_k\) is a category parameter that denotes difficulty of transition between categories \(k-1\) and k.
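To make the computation concrete, the following minimal Python sketch (using NumPy; the function name `gpcm_prob` and the parameter values are illustrative, not taken from the paper) evaluates the category probabilities of Eq. (3):

```python
import numpy as np

def gpcm_prob(theta, alpha, beta, d):
    """Category probabilities for one item under the GPCM of Eq. (3).

    theta : latent ability of the examinee
    alpha : discrimination parameter of the item
    beta  : positional difficulty parameter of the item
    d     : step parameters d_1..d_K (d_1 = 0, d_2 + ... + d_K = 0)
    Returns an array of length K with P(score = k) for k = 1..K.
    """
    d = np.asarray(d, dtype=float)
    # Numerator exponents: sum over m = 1..k of alpha * (theta - beta - d_m)
    exponents = np.cumsum(alpha * (theta - beta - d))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Illustrative parameter values (not those of Table 1)
d = [0.0, -1.0, -0.5, 0.5, 1.0]
print(gpcm_prob(theta=0.3, alpha=1.2, beta=-0.2, d=d))
```

The cumulative sum reproduces the nested summation in the numerator of Eq. (3), and the normalization over all K categories gives the denominator.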

4.2 Graded response model

The GRM is another polytomous IRT model that has item parameters similar to those of the GPCM. The GRM gives the probability that examinee j obtains score k for test item i as

$$\begin{aligned}&P_{ijk}=P^*_{ijk-1}-P^*_{ijk}, \end{aligned}$$
(4)
$$\begin{aligned}&{\left\{ \begin{array}{ll} P^*_{ijk} =\frac{1}{1+\exp {(-\alpha _i(\theta _j-b_{ik}))}} &{} k=1,\ldots ,K-1, \\ P^*_{ij0} =1, &{} \\ P^*_{ijK}= 0, &{} \end{array}\right. } \end{aligned}$$
(5)

In these equations, \(b_{ik}\) is the upper grade threshold parameter for category k of item i, indicating the difficulty of obtaining a category greater than or equal to k for item i. The order of difficulty parameters is \(b_{i1}< b_{i2}< \cdots < b_{iK-1}\).
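A corresponding sketch for the GRM of Eqs. (4) and (5), under the same assumptions (NumPy, illustrative values, and a hypothetical helper name `grm_prob`):

```python
import numpy as np

def grm_prob(theta, alpha, b):
    """Category probabilities for one item under the GRM of Eqs. (4) and (5).

    theta : latent ability of the examinee
    alpha : discrimination parameter of the item
    b     : K-1 ordered threshold parameters b_1 < ... < b_{K-1}
    Returns an array of length K with P(score = k) for k = 1..K.
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P*_k for k = 1..K-1 (Eq. 5)
    p_star = 1.0 / (1.0 + np.exp(-alpha * (theta - b)))
    # Boundary conditions P*_0 = 1 and P*_K = 0, then adjacent differences (Eq. 4)
    p_star = np.concatenate(([1.0], p_star, [0.0]))
    return p_star[:-1] - p_star[1:]

# Illustrative thresholds for K = 5 rating categories
print(grm_prob(theta=0.0, alpha=1.5, b=[-1.5, -0.5, 0.5, 1.5]))
```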

4.3 Interpretation of item parameters

Fig. 1 IRCs of the GPCM for four items with different parameters

This subsection describes the item characteristic parameters based on the Eq. (3) form of the GPCM, which has the most item parameters among the models described above.

Figure 1 depicts item response curves (IRCs) of the GPCM for four items with the parameters presented in Table 1, with the horizontal axis showing latent ability \(\theta \) and the vertical axis showing probability \(P_{ijk}\). The IRCs show that examinees with lower (higher) ability tend to obtain lower (higher) scores.

Table 1 Parameters used in Fig. 1

The difficulty parameter \(\beta _{i}\) controls the location of the IRC. As the value of this parameter increases, the IRC shifts to the right. Comparing the IRCs for Item 2 with those for Item 1 shows that obtaining higher scores is more difficult in items with higher difficulty parameter values.

Item discrimination parameter \(\alpha _i\) controls differences in response probabilities among the rating categories. The IRCs for Item 3 in Fig. 1 show that lower item discriminations indicate smaller differences. This trend implies increased randomness of ratings assigned to examinees for low-discrimination items. Low-discrimination items generally lower ability measurement accuracy, because observed data do not necessarily correlate with true ability.

Parameter \(d_{ik}\) represents the location on the \(\theta \) scale at which the adjacent categories k and \(k-1\) are equally likely to be observed (Sung and Kang 2006; Eckes 2015). Therefore, as the difference \(d_{i(k+1)} - d_{ik}\) increases, category k becomes the most probable response over a wider range of the ability scale. For Item 4, the response probability for category 4 is higher than for the other items because \(d_{i5} - d_{i4}\) is relatively large.

5 IRT models incorporating rater parameters

As described in Sect. 2, this study applies IRT models to three-way data \({\varvec{X}}\) comprising examinees \(\times \) tasks \(\times \) raters. However, the models introduced above are not directly applicable to such data. To address this problem, IRT models that incorporate rater characteristic parameters have been proposed (Ueno and Okamoto 2008; Uto and Ueno 2016; Patz et al. 2002; Patz and Junker 1999; Linacre 1989). In these models, item parameters are regarded as task parameters.

The MFRM (Linacre 1989) is the most common IRT model that incorporates rater parameters. The MFRM belongs to the family of Rasch models (Rasch 1980), including the RSM and the PCM introduced in Sect. 4.1. The MFRM has been conventionally used for analyzing various performance assessments (e.g., Myford and Wolfe 2003, 2004; Eckes 2005; Saal et al. 1980; Eckes 2015).

Several MFRM variations exist (Myford and Wolfe 2003, 2004; Eckes 2015), but the most common formulation is defined as an RSM that incorporates a rater severity parameter. This MFRM provides the probability that rater r responds in category k to examinee j’s performance for task i as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ \theta _j-\beta _i-\beta _{r} - d_{m} \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ \theta _j-\beta _i-\beta _{r} - d_{m} \right] }~, \end{aligned}$$
(6)

where \(\beta _{i}\) is a positional parameter representing the difficulty of task i, \(\beta _{r}\) denotes the severity of rater r, and \(\beta _{r=1}=0\), \(d_1=0\), and \(\sum _{k=2}^{K} d_{k} = 0\) are given for model identification.

A unique feature of this model is that it is defined using the fewest parameters among existing IRT models with rater parameters. The accuracy of parameter estimation generally increases as the number of parameters per data point decreases (Waller 1981; Bishop 2006; Reise and Revicki 2014; Uto and Ueno 2016). Consequently, this model is expected to provide accurate parameter estimates if it fits the given data well.

Because it assumes an equal-interval rating scale for all raters, however, this model does not fit the data well when rating scales vary across raters, lowering measurement accuracy. Differences in rating scales among raters are typically caused by the effects of range restriction (Myford and Wolfe 2003; Kassim 2011; Eckes 2005; Saal et al. 1980; Rahman et al. 2017). To relax the restriction of an equal-interval rating scale for all raters, another formulation of the MFRM has been proposed (Linacre 1989). That model provides probability \(P_{ijrk} \) as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ \theta _j-\beta _i-\beta _{r} - d_{rm} \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ \theta _j-\beta _i-\beta _{r} - d_{rm} \right] }~, \end{aligned}$$
(7)

where \(d_{rk}\) is the difficulty of transition between categories \(k-1\) and k for rater r, reflecting how rater r tends to use category k. Here, \(\beta _{r=1}=0\), \(d_{r1}=0\), and \(\sum _{k=2}^{K} d_{rk} = 0\) are given for model identification. For convenience, we refer to this model as “rMFRM” below.

This model, however, still assumes that rating consistency is the same for all raters and that all tasks have the same discriminatory power, assumptions that might not be satisfied in practice (Uto and Ueno 2016). To relax these constraints, an IRT model that allows differing rater consistency and task discrimination power has been proposed (Uto and Ueno 2016). The model is formulated as an extension of GRM, and provides the probability \(P_{ijrk} \) as

$$\begin{aligned}&P_{ijrk}=P^*_{ijrk-1}-P^*_{ijrk},\nonumber \\&{\left\{ \begin{array}{ll} P^*_{ijrk} =\frac{1}{1+\exp (-\alpha _i \alpha _r(\theta _j-b_{ik}-\varepsilon _r ))}&{} k=1,\ldots ,K-1, \\ P^*_{ijr0} =1, \\ P^*_{ijrK} =0, \end{array}\right. } \end{aligned}$$
(8)

where \(\alpha _i\) is a discrimination parameter for task i, \(\alpha _r\) reflects the consistency of rater r, \(\varepsilon _{r}\) represents the severity of rater r, and \(b_{ik}\) denotes the difficulty of obtaining score k for task i (with \(b_{i1}<b_{i2}<\cdots <b_{iK-1}\)). Here, \(\alpha _{r=1}=1\) and \(\varepsilon _1=0\) are assumed for model identification. For convenience, we refer to this model as “rGRM” below.

5.1 Interpretation of rater parameters

This subsection describes how the above models represent the typical rater characteristics introduced in Sect. 3.

Rater severity is represented as \(\beta _{r}\) in the MFRM and rMFRM and as \(\varepsilon _r\) in the rGRM. The IRC shifts to the right as the value of this parameter increases, indicating that such raters tend to consistently assign low scores. To illustrate this point, Fig. 2 shows IRCs of the MFRM for raters with different severity. Here, we used a low severity value \(\beta _{r}=-1.0\) for the left panel and a high value \(\beta _{r}=1.0\) for the right panel. The other model parameters were the same. Figure 2 shows that the IRC for the severe rater is farther right than that for the lenient rater.

Fig. 2 IRCs of MFRM for two raters with different severity

Only the rMFRM describes the range restriction characteristic, represented as \(d_{rk}\). When \(d_{r(k+1)}\) and \(d_{rk}\) are closer, the probability of responding with category k decreases. Conversely, as the difference \(d_{r(k+1)} - d_{rk}\) increases, the response probability for category k also increases. Figure 3 shows IRCs of the rMFRM for two raters with different \(d_{rk}\) values. We used \(d_{r2}=-1.5\), \(d_{r3}=0.0\), \(d_{r4}=0.5\), and \(d_{r5}=1.5\) for the left panel, and \(d_{r2}=-2.0\), \(d_{r3}=-1.0\), \(d_{r4}=1.0\), and \(d_{r5}=1.5\) for the right panel. The left-side rater has relatively larger values of \(d_{r3} - d_{r2}\) and \(d_{r5} - d_{r4}\), which increases the response probabilities for categories 2 and 4 in the IRC. For the right-side rater, the response probability for category 3 is increased, because \(d_{r4} - d_{r3}\) has a larger value. These points illustrate that parameter \(d_{rk}\) reflects the range restriction characteristic.

Fig. 3 IRCs of rMFRM for two raters with different range restriction characteristics

rGRM represents rater consistency as \(\alpha _r\), with lower values indicating smaller differences in response probabilities between the rating categories. This reflects that raters with a lower consistency parameter have stronger tendencies to assign different ratings to examinees with similar ability levels. Figure 4 shows IRCs of rGRM for two raters with different consistency levels. The left panel shows a high consistency value \(\alpha _r=2.0\) and the right panel shows a low value \(\alpha _r=0.8\). In the right-side IRC, differences in response probabilities among the categories are small.

Fig. 4 IRCs of rGRM for two raters with different consistency

The interpretation of task characteristics is similar to that of the item characteristic parameters described in Sect. 4.3.

5.2 Remaining problems

Table 2 Rater and task characteristics assumed in each model

Table 2 summarizes the rater and task characteristics considered in the conventional models. The table shows that all the models can represent task difficulty and rater severity, but the models differ in the following ways:

  1. MFRM is the simplest model that incorporates only task difficulty and rater severity parameters.

  2. rMFRM is the only model that can consider the range restriction characteristic.

  3. A unique feature of rGRM is its incorporation of rater consistency and task discrimination.

Table 2 also shows that none of these models can simultaneously consider all three rater parameters, which are known to commonly occur when rater diversity increases (Myford and Wolfe 2003; Kassim 2011; Eckes 2005; Saal et al. 1980; Uto and Ueno 2016; Rahman et al. 2017; Uto and Ueno 2018a). Thus, ignoring any one of them will degrade model fit and ability measurement accuracy. We therefore propose a new IRT model that incorporates all three rater parameters.

5.3 Other statistical models for performance assessment

The models described above are IRT models that directly incorporate rater parameters. A different model, the hierarchical rater model (HRM) (Patz et al. 2002; DeCarlo et al. 2011), introduces an ideal rating for each performance and models the data hierarchically. In the HRM, however, the number of ideal ratings, which must be estimated from the given rating data, rapidly increases as the number of examinees or tasks increases. Ability and parameter estimation accuracy is generally reduced when the number of parameters per data point increases. Therefore, accurate estimation under the HRM is more difficult than under the models introduced above.

Several statistical models similar to the HRM have been proposed without IRT (Piech et al. 2013; Goldin 2012; Desarkar et al. 2012; Ipeirotis et al. 2010; Lauw et al. 2007; Abdel-Hafez and Xu 2015; Chen et al. 2011; Baba and Kashima 2013). However, those models cannot estimate examinee ability, because they do not incorporate an ability parameter.

For these reasons, this study does not further consider the models described in this subsection.

6 Proposed model

To address the problems described in Sect. 5.2, we propose a new IRT model that incorporates the three rater characteristic parameters. The proposed model is formulated as an rMFRM that incorporates a rater consistency parameter and a task discrimination parameter like that in the rGRM. Specifically, the proposed model provides the probability that rater r assigns score k to examinee j’s performance for task i as

$$\begin{aligned} P_{ijrk} = \frac{\exp \sum _{m=1}^{k}\left[ \alpha _r\alpha _i(\theta _j-\beta _i-\beta _{r} - d_{rm}) \right] }{\sum _{l=1}^{K} \exp \sum _{m=1}^{l}\left[ \alpha _r\alpha _i(\theta _j-\beta _i-\beta _{r} - d_{rm} )\right] }~. \end{aligned}$$
(9)

In the proposed model, rater consistency, severity, and range restriction characteristics are, respectively, represented as \(\alpha _r\), \(\beta _{r}\), and \(d_{rk}\). Interpretations of these parameters are as described in Sect. 5.1.

The proposed model entails a non-identifiability problem, meaning that parameter values cannot be uniquely determined because different sets of values can give the same response probability. For the proposed model without task parameters, the parameters are identifiable by assuming a specific distribution for the ability and constraining \(d_{r1}=0\) and \(\sum _{k=2}^{K} d_{rk} = 0\) for each r, because this corresponds to the conventional GPCM with the item parameters regarded as rater parameters. However, the proposed model still has indeterminacy of scale for \(\alpha _r \alpha _i\) and of location for \(\beta _{i}+\beta _{r}\), even when these constraints are given. Specifically, the response probability \(P_{ijrk}\) with \(\alpha _r\) and \(\alpha _i\) takes the same value as with \(\alpha '_r = \alpha _r c\) and \(\alpha '_i = \frac{\alpha _i}{c}\) for any constant c, because \(\alpha '_r \alpha '_i = (\alpha _r c) \frac{\alpha _i}{c} = \alpha _r \alpha _i\). Similarly, the response probability with \(\beta _{i}\) and \(\beta _{r}\) takes the same value as with \(\beta '_{i}=\beta _{i}+c\) and \(\beta '_{r} = \beta _{r}-c\) for any constant c, because \(\beta '_{i}+\beta '_{r}=(\beta _{i}+c)+(\beta _{r}-c)=\beta _{i}+\beta _{r}\). Scale indeterminacy, as in the \(\alpha _r \alpha _i\) case, can be removed by fixing one parameter or by restricting the product of some parameters (Fox 2010). Location indeterminacy, as in the \(\beta _{i}+\beta _{r}\) case, can be resolved by fixing one parameter or by restricting the mean of some parameters (Fox 2010). This study, therefore, uses the restrictions \(\prod _{i=1}^{I} \alpha _i=1\), \(\sum _{i=1}^{I} \beta _i=0\), \(d_{r1}=0\), and \(\sum _{k=2}^{K} d_{rk} = 0\) for model identification, in addition to assuming a specific distribution for the ability.
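As an illustration, the following Python sketch (NumPy assumed; the helper name `proposed_prob` and the parameter values are hypothetical) computes the response probabilities of Eq. (9) for one examinee–task–rater combination and checks the identification constraints on \(d_{rk}\):

```python
import numpy as np

def proposed_prob(theta, alpha_i, alpha_r, beta_i, beta_r, d_r):
    """Response probabilities of the proposed model, Eq. (9).

    theta   : ability of examinee j
    alpha_i : discrimination of task i
    alpha_r : consistency of rater r
    beta_i  : difficulty of task i
    beta_r  : severity of rater r
    d_r     : rater-specific step parameters d_r1..d_rK
              (d_r1 = 0 and d_r2 + ... + d_rK = 0 for identification)
    Returns an array of length K with P(rating = k) for k = 1..K.
    """
    d_r = np.asarray(d_r, dtype=float)
    exponents = np.cumsum(alpha_r * alpha_i * (theta - beta_i - beta_r - d_r))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Illustrative values for a rater who overuses the middle category
# (a large d_r4 - d_r3 raises the response probability for category 3)
d_r = [0.0, -2.0, -1.0, 1.0, 2.0]
assert d_r[0] == 0.0 and abs(sum(d_r[1:])) < 1e-12
print(proposed_prob(theta=0.5, alpha_i=1.0, alpha_r=1.2,
                    beta_i=0.0, beta_r=0.3, d_r=d_r))
```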

The proposed model improves model fit when the variety of raters increases, because the characteristics of each rater can be more flexibly represented. It also measures ability more accurately when rater variety increases, because it can estimate ability while more precisely reflecting each rater's characteristics. Note that ability measurement is improved only when the reduction in model misfit obtained by adding parameters exceeds the increase in parameter estimation error caused by the resulting decrease in data per parameter. This property is known as the bias–accuracy tradeoff (van der Linden 2016a).

7 Parameter estimation

This section presents the parameter estimation method for the proposed model.

Marginal maximum likelihood estimation using an EM algorithm is a common method for estimating IRT model parameters (Baker and Kim 2004). However, for complex models like that used in this study, EAP estimation, a form of Bayesian estimation, is known to provide more robust estimations (Uto and Ueno 2016; Fox 2010).

EAP estimates are calculated as the expected value of the marginal posterior distribution of each parameter (Fox 2010; Bishop 2006). The posterior distribution in the proposed model is

$$\begin{aligned}&g({\varvec{\theta _j}},\log {\varvec{\alpha _i}},\log {\varvec{\alpha _r}},{\varvec{\beta _i}},{\varvec{\beta _r}},{\varvec{d_{rk}}}|{\varvec{X}}) \nonumber \\&\quad \propto L({\varvec{X}}|{\varvec{\theta _j}},\log {\varvec{\alpha _i}},\log {\varvec{\alpha _r}},{\varvec{\beta _i}},{\varvec{\beta _r}},{\varvec{d_{rk}}}) g({\varvec{\theta _j}} | \tau _{\theta })\nonumber \\&\quad g(\log {\varvec{\alpha _{i}}} | \tau _{\alpha _i}) g(\log {\varvec{\alpha _{r}}} | \tau _{\alpha _r}) g({\varvec{\beta _i}} | \tau _{\beta _i}) g({\varvec{\beta _r}} | \tau _{\beta _r}) g({\varvec{d_{rk}}} | \tau _{d}), \end{aligned}$$
(10)

where

$$\begin{aligned}&L({\varvec{X}}|{\varvec{\theta _j}},\log {\varvec{\alpha _{i}}},\log {\varvec{\alpha _{r}}},{\varvec{\beta _i}},{\varvec{\beta _r}},{\varvec{d_{rk}}}) = \Pi _{j=1}^{J} \Pi _{i=1}^{I} \Pi _{r=1}^{R} \Pi _{k=1}^{K} (P_{ijrk})^{z_{ijrk}},\end{aligned}$$
(11)
$$\begin{aligned}&z_{ijrk}= {\left\{ \begin{array}{ll} 1: \quad x_{ijr} = k,\\ 0: \quad {\text {otherwise}}. \end{array}\right. } \end{aligned}$$
(12)

Therein, \({\varvec{\theta _j}} = \{ \theta _j \mid j \in \mathcal{J}\}\), \(\log {\varvec{\alpha _i}} = \{\log \alpha _{i} \mid i \in \mathcal{I}\}\), \({\varvec{\beta _i}}= \{ \beta _{i} \mid i \in \mathcal{I}\}\), \(\log {\varvec{\alpha _r}} = \{\log \alpha _{r} \mid r \in \mathcal{R} \}\), \({\varvec{\beta _r}}= \{ \beta _{r} \mid r \in \mathcal{R} \}\), and \({\varvec{d_{rk}}} = \{ d_{rk} \mid r \in \mathcal{R}, k \in \mathcal{K}\}\). Here, \(g({\varvec{S}}|\tau _{S})=\prod _{s \in {\varvec{S}}} g(s |\tau _{S})\) (where \({\varvec{S}}\) is a set of parameters) indicates a prior distribution, and \(\tau _{s}\) is a hyperparameter for parameter s, which is determined arbitrarily to reflect the analyst's subjective beliefs.
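For concreteness, a minimal Python sketch of the log of the likelihood in Eq. (11) might look as follows; it assumes the ratings are stored in a NumPy array `X` of shape (I, J, R) with missing ratings coded as -1, and the triple loop is written for readability rather than speed:

```python
import numpy as np

def log_likelihood(X, theta, alpha_i, alpha_r, beta_i, beta_r, d):
    """Log of the likelihood in Eq. (11), skipping missing ratings (x = -1).

    X       : ratings of shape (I, J, R); entries are 1..K or -1 for missing
    theta   : abilities, shape (J,)
    alpha_i, beta_i : task discriminations and difficulties, shape (I,)
    alpha_r, beta_r : rater consistencies and severities, shape (R,)
    d       : rater step parameters, shape (R, K)
    """
    I, J, R = X.shape
    loglik = 0.0
    for i in range(I):
        for j in range(J):
            for r in range(R):
                k = X[i, j, r]
                if k == -1:                # missing rating contributes nothing
                    continue
                # Response probabilities of Eq. (9) for this (i, j, r) triple
                exponents = np.cumsum(
                    alpha_r[r] * alpha_i[i]
                    * (theta[j] - beta_i[i] - beta_r[r] - d[r]))
                p = np.exp(exponents)
                p /= p.sum()
                loglik += np.log(p[k - 1])  # categories are 1..K
    return loglik
```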

The marginal posterior distribution for each parameter is derived by marginalizing over all parameters except the target one. For a complex IRT model, however, it is generally infeasible to derive the marginal posterior distribution analytically or to calculate it using numerical methods such as Gaussian quadrature, because doing so requires solving high-dimensional multiple integrals. MCMC, a random sampling-based estimation method, can be used to address this problem. The effectiveness of MCMC has been demonstrated in various fields (Bishop 2006; Brooks et al. 2011; Uto et al. 2017; Louvigné et al. 2018). In IRT studies, MCMC has been used for complex models such as hierarchical Bayesian IRT, multidimensional IRT, and multilevel IRT (Fox 2010; Uto 2019).

7.1 MCMC algorithm

The Metropolis-Hastings-within-Gibbs sampling method (Gibbs/MH) (Patz and Junker 1999) has commonly been used as an MCMC algorithm for parameter estimation in IRT models. The algorithm is simple and easy to implement (Patz and Junker 1999; Zhang et al. 2011; Cai 2010), but it requires a long time to converge to the target distribution, because it explores the parameter space via an inefficient random walk (Hoffman and Gelman 2014; Girolami and Calderhead 2011).

Hamiltonian Monte Carlo (HMC) is an alternative MCMC algorithm with high efficiency (Brooks et al. 2011). HMC generally converges quickly to a target distribution in complex, high-dimensional problems if two hand-tuned parameters, namely the step size and the simulation length, are appropriately selected (Neal 2010; Hoffman and Gelman 2014; Girolami and Calderhead 2011). In recent years, the No-U-Turn (NUT) sampler (Hoffman and Gelman 2014), an extension of HMC that eliminates these hand-tuned parameters, has been proposed. The “Stan” software package (Carpenter et al. 2017) makes implementing NUT-based HMC easy. This algorithm has thus recently been used for parameter estimation in various statistical models, including IRT models (Luo and Jiao 2018; Jiang and Carter 2019).

We, therefore, use a NUT-based MCMC algorithm for parameter estimation in the proposed model. The estimation program was implemented in RStan (Stan Development Team 2018), and the developed Stan code is provided in an Appendix. In this study, the prior distributions are set as \(\theta _{j}\), \(\log \alpha _{i}\), \(\log \alpha _{r}\), \(\beta _{i}\), \(\beta _{r}\), and \(d_{rk}\) \(\sim N(0.0,1.0^2)\), where \(N(\mu ,\sigma ^2)\) is a normal distribution with mean \(\mu \) and standard deviation \(\sigma \). Furthermore, we calculate the EAP estimates as the means of the parameter samples obtained from iterations 500 through 1000 of three independent MCMC chains (i.e., the first 500 iterations of each chain are discarded as burn-in).
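To illustrate the EAP computation described above (this is not the Stan/RStan code in the Appendix), the following sketch assumes the raw MCMC output has been arranged into a NumPy array `draws` of shape (chains, iterations, parameters):

```python
import numpy as np

def eap_estimates(draws, burn_in=500):
    """EAP estimates as posterior means of the post-burn-in MCMC samples.

    draws   : array of shape (n_chains, n_iterations, n_parameters)
    burn_in : number of initial iterations discarded from each chain
    """
    kept = draws[:, burn_in:, :]                     # drop the warm-up period
    return kept.reshape(-1, kept.shape[-1]).mean(axis=0)

# With three chains of 1000 iterations each, the EAPs use iterations 500-1000:
# eap = eap_estimates(samples, burn_in=500)
```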

7.2 Accuracy of parameter recovery

This subsection evaluates parameter recovery accuracy under the proposed model using the MCMC algorithm. The experiments were conducted as follows:

  1. Randomly generate true parameters for the proposed model from the distributions described in Sect. 7.1.

  2. Randomly sample rating data given the generated parameters.

  3. Using the sampled data, estimate the model parameters by the MCMC algorithm.

  4. Calculate root mean squared errors (RMSEs) and biases between the estimated and true parameters (a minimal sketch of this computation is given below).

  5. Repeat the above procedure ten times, then calculate the average values of the RMSEs and biases.

The above experiment was conducted while varying the numbers of examinees, tasks, and raters as \(J \in \{30, 50, 100\}\), \(I \in \{3, 4, 5\}\), and \(R \in \{5, 10, 30\}\). The number of rating categories K was fixed at five.
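For reference, the RMSE and bias computation in Step 4 reduces to the following minimal sketch (`true_values` and `estimates` are assumed to be arrays holding the true and estimated values of one parameter type):

```python
import numpy as np

def rmse_and_bias(true_values, estimates):
    """RMSE and bias between true and estimated values of one parameter type."""
    diff = np.asarray(estimates) - np.asarray(true_values)
    return np.sqrt(np.mean(diff ** 2)), np.mean(diff)
```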

Table 3 Results of the parameter recovery experiment

Table 3 shows the results, which confirm the following tendencies:

  1. The accuracy of parameter estimation tends to increase with the number of examinees.

  2. The accuracy of ability estimation tends to increase with the number of tasks or raters.

These tendencies are consistent with those presented in previous studies (Uto and Ueno 2018a, 2016).

Furthermore, the average biases were nearly zero in all cases, indicating no systematic overestimation or underestimation of the parameters. We also checked the Gelman–Rubin statistic \({\hat{R}}\) (Gelman and Rubin 1992; Gelman et al. 2013), which is commonly used as a convergence diagnostic. Its values were less than 1.1 in all cases, indicating that the MCMC runs converged.

From the above, we conclude that the MCMC algorithm can appropriately estimate parameters for the proposed model.

8 Simulation experiments

This section describes a simulation experiment for evaluating the effectiveness of the proposed model.

This experiment compares model fit and ability estimation accuracy using simulation data created to imitate the behaviors of raters with specific characteristics. Specifically, we examine how rater consistency and range restriction affect the performance of each model. Rater severity is not examined in this experiment, because all the conventional models have this parameter. We compare the performance of the proposed model with that of the rMFRM and the rGRM. Note that the MFRM is not compared, because all characteristics assumed in that model are incorporated in the other models. To examine the effects of the rater consistency and range restriction parameters in the proposed model, we also compare two sub-models of the proposed model in which \(\alpha _r\) or \(d_{rk}\), respectively, is restricted to be constant over \(r \in \mathcal{R}\).

Table 4 Rules for creating rating data that imitate behaviors of raters with specific characteristics

The experiments were conducted using the following procedures:

  1. Setting \(J=30\), \(I=5\), \(R=10\), and \(K=5\), sample rating data from the MFRM (the simplest model) after randomly generating the true model parameters.

  2. For randomly selected 20%, 40%, and 60% of raters, transform the rating data to imitate the behaviors of raters with specific characteristics by applying a rule in Table 4.

  3. Estimate the parameters for each model from the transformed data using the MCMC algorithm.

  4. Calculate information criteria to compare model fit to the data. As the information criteria, we use the widely applicable information criterion (WAIC) (Watanabe 2010) and an approximated log marginal likelihood (log ML) (Newton and Raftery 1994), both of which have previously been used for IRT model comparison (Uto and Ueno 2016; Reise and Revicki 2014; van der Linden 2016b). The log ML is approximated as the harmonic mean of the likelihoods sampled during MCMC (Newton and Raftery 1994), because exact calculation of the ML is intractable due to the high-dimensional integrals involved (a sketch of this approximation follows the list). The model minimizing each criterion is regarded as the optimal model. After ordering the models by each information criterion, calculate the rank of each model.

  5. To evaluate the accuracy of ability estimation, calculate the RMSE and the correlation between the true ability values and the ability estimates calculated from the transformed data in Procedure 2. Note that the RMSE was calculated after standardizing both the true and the estimated ability values, because the scale of ability differs between the MFRM, from which the true values were generated, and a target model.

  6. Repeat the above procedures ten times, then calculate the average rank and correlation.
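The harmonic-mean approximation of the log ML mentioned in Procedure 4 could be computed from the log-likelihood values sampled during MCMC as in the following sketch (using SciPy's `logsumexp` for numerical stability; `loglik_draws` is an assumed array holding one total log-likelihood value per retained MCMC draw):

```python
import numpy as np
from scipy.special import logsumexp

def harmonic_mean_log_ml(loglik_draws):
    """Harmonic-mean approximation of the log marginal likelihood
    (Newton and Raftery 1994), computed stably in log space:
    log ML ~= -log( (1/S) * sum_s exp(-loglik_s) ).
    """
    loglik_draws = np.asarray(loglik_draws, dtype=float)
    S = len(loglik_draws)
    return -(logsumexp(-loglik_draws) - np.log(S))
```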

Table 5 Results of model comparison using information criteria (Values in parentheses are the standard deviation of the rank)

Tables 5 and 6 show the results. In these tables, bold text represents the best values (the highest values for rank and correlation and the lowest RMSEs), and underlined text represents the second-best values. The results show that model performance strongly depends on whether the model can represent the rater characteristics appearing in the assessment process. Specifically, the following findings were obtained from the results:

Table 6 Accuracy of ability estimation in the simulation experiment
  • For data with rating behavior pattern (A), in which raters with lower consistency exist, the models with the rater consistency parameter \(\alpha _r\) (namely, the rGRM and the proposed model with or without the constraint on \(d_{rk}\)) tend to fit well and provide high ability estimation accuracy.

  • For data with rating behavior pattern (B), in which raters with range restriction exist, the models with the \(d_{rk}\) parameter (namely, the rMFRM and the proposed model with or without the constraint on \(\alpha _r\)) provide high performance.

  • For data with rating behavior pattern (C), in which both raters with range restriction and raters with low consistency exist, the proposed model provides the highest performance, because it is the only model that incorporates both rater parameters.

These results confirm that the proposed model provides better model fitting and more accurate ability estimations than do the conventional models when assuming varying rater characteristics. Furthermore, these results demonstrate that rater parameters \(\alpha _r\) and \(d_{rk}\) appropriately reflect rater consistency and range restriction characteristics, as expected.

9 Actual data experiments

This section describes actual data experiments performed to evaluate performance of the proposed model.

9.1 Actual data

This experiment uses rating data obtained from a peer assessment activity among university students. We selected this situation because it is a typical example in which the existence of raters with various characteristics can be assumed (e.g., Nguyen et al. 2015; Uto and Ueno 2018b; Uto et al. n.d.). We gathered actual peer assessment data through the following procedures:

  1. Subjects were 34 university students majoring in various STEM fields, including statistics, materials, chemistry, engineering, robotics, and information science.

  2. Subjects were asked to complete four essay-writing tasks from the National Assessment of Educational Progress (NAEP) assessments in 2002 and 2007 (Persky et al. 2003; Salahu-Din et al. 2008). No specific or preliminary knowledge was needed to complete these tasks.

  3. After the subjects completed all tasks, they were asked to evaluate the essays of the other subjects for all four tasks. These assessments were conducted using a rubric based on assessment criteria for grade 12 NAEP writing (Salahu-Din et al. 2008), consisting of five rating categories with corresponding scoring criteria.

Table 7 Instructions given to ten raters to obtain responses for specific characteristics

In this experiment, we also collected rating data that simulate the behaviors of raters with specific characteristics. Specifically, we gathered ten additional university students and asked them to evaluate the 134 essays written by the initial 34 subjects, following the instructions in Table 7. The first three raters were expected to provide inconsistent ratings, the next four raters to imitate raters with range restriction, and the last three raters to simulate severe or lenient raters. For simplicity, we hereinafter refer to these raters as controlled raters.

We evaluate the effectiveness of the proposed model using these data.

9.2 Example of parameter estimates

This subsection presents an example of parameter estimation using the proposed model. From the rating data from peer raters and controlled raters, we used the MCMC algorithm to estimate parameters for the proposed model. Table 8 shows the estimated rater and task parameters.

Table 8 Parameter estimates
Fig. 5 IRCs for four representative peer raters with different characteristics

Fig. 6 IRCs for controlled raters with strong range restriction

Table 8 confirms the existence of peer raters with various rater characteristics. Figure 5 shows IRCs for four representative peer raters with different characteristics. Here, Rater 17 and Rater 24 are examples of a lenient rater and an inconsistent rater, respectively. Rater 4 and Rater 32 are raters with different range restriction characteristics: Rater 4 tended to overuse categories \(k=2\) and \(k=4\), whereas Rater 32 tended to overuse only \(k=4\).

We can also confirm that the controlled raters followed the provided instructions. Specifically, high severity values were estimated for controlled raters 8 and 9, and a low value for controlled rater 10, as expected. Figure 6 shows the IRCs of controlled raters 4, 5, 6, and 7, which confirm range restriction characteristics that comply with the instructions. Although we expected raters 1, 2, and 3 to be inconsistent because they needed to perform their assessments within a short time, their consistencies were not low.

Table 8 also shows that the tasks had different discrimination powers and difficulty values. However, parameter differences among tasks are smaller than those among raters.

This suggests that the proposed model is suitable for the data, because various rater characteristics are likely to exist.

9.3 Model comparison using information criteria

This subsection presents model comparisons using information criteria. We calculated the WAIC and log ML for each model using the peer-rater data and the data that also include the controlled raters.

Table 9 shows the results, with bold text indicating the minimum scores. The table shows that the proposed model gives the lowest values for both information criteria and both datasets, suggesting that the proposed model is the best model for the actual data. The table also shows that the performance of the proposed model decreases when the effects of rater consistency or range restriction are ignored, indicating that simultaneous consideration of both is important.

The experimental results show that the proposed model can improve the model fitting when raters with various characteristics exist. This is because consistency and range restriction characteristics differ among raters, as described in the previous subsection, and because the proposed model appropriately represents these effects (Fig. 6).

Table 9 Model comparison using actual data

9.4 Accuracy of ability estimation

This subsection compares ability measurement accuracies using the actual data. Specifically, we evaluate how strongly ability estimates are correlated when the abilities are estimated from data given by different raters. If a model appropriately reflects rater characteristics, ability values estimated from different raters' data will be highly correlated. We thus conducted the following experiment for each model and for the two datasets, namely the peer-rater data and the data including the controlled raters:

  1. Use MCMC to estimate the model parameters.

  2. Randomly select 5 or 10 ratings assigned to each examinee, then change the unselected ratings to missing data (a sketch of this masking step follows the list).

  3. Using the dataset with missing data, estimate examinee abilities \({\varvec{\theta }}\) given the rater and task parameters estimated in Procedure 1.

  4. Repeat the above procedure 100 times, then calculate the correlation between each pair of ability estimates obtained in Procedure 3, along with the average and standard deviation of those correlations.
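The random masking in Procedure 2 can be sketched as follows (a hypothetical helper `mask_ratings`; it assumes the rating array `X` of Sect. 2 with shape (I, J, R) and that each examinee has at least `n_keep` observed ratings):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_ratings(X, n_keep=5):
    """Keep n_keep randomly chosen ratings per examinee; mark the rest missing.

    X : ratings of shape (I, J, R); -1 denotes missing data.
    """
    X_masked = np.full_like(X, -1)
    I, J, R = X.shape
    for j in range(J):
        observed = np.argwhere(X[:, j, :] != -1)        # (task, rater) pairs
        chosen = rng.choice(len(observed), size=n_keep, replace=False)
        for i, r in observed[chosen]:
            X_masked[i, j, r] = X[i, j, r]
    return X_masked
```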

For comparison, we conducted the same experiment using a method in which the true score is given as the average rating. We designate this as the average score method. We also conducted multiple comparisons using Dunnett’s test to ascertain whether correlation values under the proposed model are significantly higher than those under the other models.

Table 10 Ability estimation accuracy using actual data (Values in parentheses are standard deviations)

Table 10 shows the results. All IRT models provide higher correlation values than does the average score method, indicating that the IRT models effectively improve the accuracy of ability measurement. The results also show that the proposed model provides significantly higher correlations than do the other models, indicating that the proposed model estimates abilities most accurately. We can also confirm that the performance of the proposed model decreases sharply when the effects of rater consistency or range restriction are ignored, suggesting the effectiveness of considering both characteristics to improve accuracy.

These results demonstrate that the proposed model provides the most accurate ability estimations when a large variety of rater characteristics is assumed.

10 Conclusion

We proposed a generalized MFRM that incorporates parameters for three common rater characteristics, namely severity, range restriction, and consistency. To address the difficulty of parameter estimation under such a complex model, we presented a Bayesian estimation method for the proposed model using an MCMC algorithm based on NUT-HMC. Simulation and actual data experiments demonstrated that model fit and ability measurement accuracy are improved when the variety of raters increases. We also demonstrated the importance of each rater parameter for improving performance. Through a parameter recovery experiment, we demonstrated that the developed MCMC algorithm can appropriately estimate the parameters of the proposed model even when the sample size is relatively small.

Although this study used peer assessment data in the actual data experiment, the proposed model would be effective in various assessment situations where raters with diverse characteristics are assumed to exist or where sufficient quality control of raters is difficult. Future studies should evaluate the effectiveness of the proposed model using more varied and larger datasets. While this study mainly focused on model fit and ability measurement accuracy, the proposed model is also applicable to other purposes, such as evaluating and training raters' assessment skills, detecting aberrant or heterogeneous raters, and selecting optimal raters for each examinee. Such applications are left as topics for future work.