Abstract
Performance assessments increasingly use onscreen or internet-based technology to collect human ratings. One of the benefits of onscreen ratings is the automatic recording of rating times along with the ratings. Considering rating times as an additional data source can provide a more detailed picture of the rating process and improve the psychometric quality of the assessment outcomes. However, currently available models for analyzing performance assessments do not incorporate rating times. The present research aims to fill this gap and advance a joint modeling approach, the “hierarchical facets model for ratings and rating times” (HFM-RT). The model includes two examinee parameters (ability and time intensity) and three rater parameters (severity, centrality, and speed). The HFM-RT successfully recovered examinee and rater parameters in a simulation study and yielded superior reliability indices. A real-data analysis of English essay ratings collected in a high-stakes assessment context revealed that raters differed considerably in their speed measures, spent more time on high-quality than on low-quality essays, and tended to rate essays faster with increasing severity. However, due to the pronounced heterogeneity of examinees’ writing proficiency, the improvement in the assessment’s reliability from using the HFM-RT was small in the real-data example. The discussion focuses on the advantages of accounting for rating times as a source of information in rating quality studies and highlights perspectives from the HFM-RT for future research on rater cognition.
Data availability
The JAGS codes for the HFM-RT and the FM-SC, along with a simulated dataset, are publicly accessible on the Open Science Framework (https://osf.io/nhprs/).
Notes
These settings were partially informed by the real-data study results reported later.
Thinning a Markov chain to reduce autocorrelation is statistically inefficient (Link & Eaton, 2012).
We also compared the parameter estimates before and after thinning the Markov chain (with a thinning interval of 10). The estimates proved highly concordant. We therefore retained the unthinned Markov chain, whose larger effective sample size yields more accurate inference.
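As an illustration of this kind of check, here is a minimal sketch (with hypothetical draws standing in for the actual MCMC output) that compares posterior summaries from a full chain with those from a chain thinned by 10:

```python
import numpy as np

# Hypothetical illustration of the check described in this note: compare
# posterior summaries from the full chain with those from a chain thinned
# by 10. `draws` stands in for the MCMC draws of one parameter
# (shape: chains x iterations); it is not the actual model output.
rng = np.random.default_rng(1)
draws = rng.normal(loc=0.5, scale=0.2, size=(3, 10000))
thinned = draws[:, ::10]  # keep every 10th draw in each chain

# Point estimates are essentially unchanged, but the full chain retains
# ten times as many draws, i.e., a larger effective sample size.
print(f"posterior mean (full):    {draws.mean():.4f}  (n = {draws.size})")
print(f"posterior mean (thinned): {thinned.mean():.4f}  (n = {thinned.size})")
```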
The rating scale and its categories are not publicly available.
The bivariate distributions between response time and rating scores are available at https://osf.io/nhprs/.
Due to the large size of the dataset, each analysis took about 240 hours on a workstation with 2.4-GHz CPUs and 384 GB of RAM.
Because true parameter values are unavailable in empirical studies, the assessment’s reliability was calculated as \(1-\overline{SE(\hat{\theta})^{2}}/\hat{\sigma}_{\theta}^{2}\). The high reliabilities from both models and the consequently small reliability difference can be explained by the pronounced heterogeneity of examinees’ English writing proficiency (see top left panel of Fig. 3).
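For illustration, a minimal sketch of this reliability calculation, with made-up arrays standing in for the ability estimates and their standard errors (the variable names are ours, not taken from the released code):

```python
import numpy as np

# Hypothetical illustration of the empirical reliability index
# 1 - mean(SE(theta_hat)^2) / sigma_theta_hat^2. The arrays below are
# made-up stand-ins for the posterior means and standard errors of the
# examinee ability estimates, not values from the reported analyses.
rng = np.random.default_rng(2024)
theta_hat = rng.normal(0.0, 1.5, size=5000)    # estimated abilities
se_theta = rng.uniform(0.25, 0.40, size=5000)  # their standard errors

# Estimated ability variance, here approximated by the variance of the
# point estimates.
sigma2_theta = np.var(theta_hat, ddof=1)
reliability = 1.0 - np.mean(se_theta**2) / sigma2_theta
print(f"empirical reliability: {reliability:.3f}")
```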
The expected rating time can be calculated as follows: \(\hat{T}_{nk}=\exp\left(\hat{\beta}_{n}-\hat{\zeta}_{k}+\hat{\sigma}_{\epsilon}^{2}/2\right)\).
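For example, with illustrative (not estimated) parameter values, the formula can be evaluated as follows:

```python
import numpy as np

# Hypothetical example of the expected rating time
# T_hat_nk = exp(beta_n - zeta_k + sigma_eps^2 / 2).
# The parameter values are illustrative, not estimates from the study.
beta_n = 5.8      # time intensity of examinee n (log-seconds scale)
zeta_k = 0.4      # speed of rater k (larger values = faster rater)
sigma_eps = 0.5   # residual SD of the log rating times

expected_time = np.exp(beta_n - zeta_k + sigma_eps**2 / 2)
print(f"expected rating time: {expected_time:.0f} seconds")
# exp(5.8 - 0.4 + 0.125) = exp(5.525), roughly 251 seconds (about 4 minutes)
```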
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
de Ayala, R. J. (2022). The theory and practice of item response theory (2nd ed.). Guilford Press.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9. https://doi.org/10.1111/j.1745-3992.2012.00238.x
Bennett, R. E. (2003). Online assessment and the comparability of score meaning (Research Memorandum No. RM-03-05). Educational Testing Service. https://www.ets.org/Media/Research/pdf/RM-03-05-Bennett.pdf
Bolsinova, M., & Tijmstra, J. (2018). Improving precision of ability estimation: Getting more from response times. British Journal of Mathematical and Statistical Psychology, 71(1), 13–38. https://doi.org/10.1111/bmsp.12104
Casabianca, J. M., Junker, B. W., & Patz, R. J. (2016). Hierarchical rater models. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 449–465). Chapman & Hall/CRC.
Cheng, Y., & Shao, C. (2022). Application of change point analysis of response time data to detect test speededness. Educational and Psychological Measurement, 82(5), 1031–1062. https://doi.org/10.1177/00131644211046392
Coniam, D. (2010). Validating onscreen marking in Hong Kong. Asian Pacific Education Review, 11(3), 423–431. https://doi.org/10.1007/s12564-009-9068-2
Coniam, D., & Falvey, P. (Eds.). (2016). Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong. Springer. https://doi.org/10.1007/978-981-10-0434-6
Cooze, M. (2011). Assessing writing tests on scoris®: The introduction of online marking. Research Notes, 43, 12–15. https://www.cambridgeenglish.org/Images/23161-research-notes-43.pdf
De Boeck, P., & Jeon, M. (2019). An overview of models for response times and processes in cognitive tests. Frontiers in Psychology, 10, 102. https://doi.org/10.3389/fpsyg.2019.00102
DeCarlo, L. T., Kim, Y. K., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48(3), 333–356. https://doi.org/10.1111/j.1745-3984.2011.00143.x
Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270–292. https://doi.org/10.1080/15434303.2011.649381
Eckes, T. (2017). Rater effects: Advances in item response modeling of human ratings - Part I [Editorial]. Psychological Test and Assessment Modeling, 59(4), 443–452. https://www.psychologie-aktuell.com/fileadmin/download/ptam/4-2017_20171218/03_Eckes.pdf
Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang. https://doi.org/10.3726/b20875
Eckes, T., & Jin, K.-Y. (2021a). Examining severity and centrality effects in TestDaF writing and speaking assessments: An extended Bayesian many-facet Rasch analysis. International Journal of Testing, 21(3–4), 131–153. https://doi.org/10.1080/15305058.2021.1963260
Eckes, T., & Jin, K.-Y. (2021b). Measuring rater centrality effects in writing assessment: A Bayesian facets modeling approach. Psychological Test and Assessment Modeling, 63(1), 65–94. https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2021/Seiten_aus_PTAM_2021-1_ebook_4.pdf
Eckes, T., & Jin, K.-Y. (2022). Detecting illusory halo effects in rater-mediated assessment: A mixture Rasch facets modeling approach. Psychological Test and Assessment Modeling, 64(1), 87–111. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2022-1/PTAM__1-2022_5_kor.pdf
Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge. https://doi.org/10.4324/9781315766829
Falvey, P., & Coniam, D. (2010). A qualitative study of the response of raters towards onscreen and paper-based marking. Melbourne Papers in Language Testing, 15(1), 1–26. https://arts.unimelb.edu.au/__data/assets/pdf_file/0003/3518706/15_1_1_Falvey-and-Coniam.pdf
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 169–193). Oxford University Press.
Glazer, N., & Wolfe, E. W. (2020). Understanding and interpreting human scoring. Applied Measurement in Education, 33(3), 191–197. https://doi.org/10.1080/08957347.2020.1750402
Goudie, R. J. B., Turner, R. M., De Angelis, D., & Thomas, A. (2020). MultiBUGS: A parallel implementation of the BUGS modelling framework for faster Bayesian inference. Journal of Statistical Software, 95(7), 1–20. https://doi.org/10.18637/jss.v095.i07
International Test Commission (ITC) and Association of Test Publishers (ATP). (2022). Guidelines for technology-based assessment. https://www.intestcom.org/upload/media-library/guidelines-for-technology-based-assessment-v20221108-16684036687NAG8.pdf
Jackman, S. (2009). Bayesian analysis for the social sciences. Wiley. https://doi.org/10.1002/9780470686621
Jin, K.-Y., & Chiu, M. M. (2022). A mixture Rasch facets model for rater’s illusory halo effects. Behavior Research Methods, 54(6), 2750–2764. https://doi.org/10.3758/s13428-021-01721-3
Jin, K.-Y., & Eckes, T. (2022a). Detecting differential rater functioning in severity and centrality: The dual DRF facets model. Educational and Psychological Measurement, 82(4), 757–781. https://doi.org/10.1177/00131644211043207
Jin, K.-Y., & Eckes, T. (2022b). Detecting rater centrality effects in performance assessments: A model-based comparison of centrality indices. Measurement: Interdisciplinary Research and Perspectives, 20(4), 228–247. https://doi.org/10.1080/15366367.2021.1972654
Jin, K.-Y., & Eckes, T. (2023). Measuring the impact of peer interaction in group oral assessments with an extended many-facet Rasch model. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12375
Jin, K.-Y., & Wang, W.-C. (2017). Assessment of differential rater functioning in latent classes with new mixture facets models. Multivariate Behavioral Research, 52(3), 391–402. https://doi.org/10.1080/00273171.2017.1299615
Jin, K.-Y., & Wang, W.-C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55(4), 543–563. https://doi.org/10.1111/jedm.12191
Jin, K.-Y., Hsu, C.-L., Chiu, M. M., & Chen, P.-H. (2023). Modeling rapid guessing behaviors in computer-based testlet items. Applied Psychological Measurement, 47(1), 19–33. https://doi.org/10.1177/0146621622112517
Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. Guilford Press.
Knoch, U., Fairbairn, J., & Jin, Y. (2021). Scoring second language spoken and written performance: Issues, options and directions. Equinox.
Lane, S. (2019). Modeling rater response processes in evaluating score meaning. Journal of Educational Measurement, 56(3), 653–663. https://doi.org/10.1111/jedm.12229
Lee, C. (2016a). The role of the Hong Kong Examinations and Assessment Authority. In D. Coniam & P. Falvey (Eds.), Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong (pp. 9–21). Springer. https://doi.org/10.1007/978-981-10-0434-6_2
Lee, C. (2016b). Onscreen marking system. In D. Coniam & P. Falvey (Eds.), Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong (pp. 23–41). Springer. https://doi.org/10.1007/978-981-10-0434-6_3
Lee, Y.-H., & Chen, H. (2011). A review of recent response-time analyses in educational testing. Psychological Test and Assessment Modeling, 53(3), 359–379. https://www.psychologie-aktuell.com/fileadmin/download/ptam/3-2011_20110927/06_Lee.pdf
Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Chapman & Hall/CRC. https://doi.org/10.1201/9781315374604
Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
Ling, G., Williams, J., O’Brien, S., & Cavalle, C. S. (2022). Scoring essays on an iPad versus a desktop computer: An exploratory study. Educational Testing Service. https://www.ets.org/research/policy_research_reports/publications/report/2022/kelf.html
Link, W. A., & Eaton, M. J. (2012). On thinning of chains in MCMC. Methods in Ecology and Evolution, 3(1), 112–115. https://doi.org/10.1111/j.2041-210X.2011.00131.x
Man, K., Harring, J. R., Jiao, H., & Zhan, P. (2019). Joint modeling of compensatory multidimensional item responses and response times. Applied Psychological Measurement, 43(8), 639–654. https://doi.org/10.1177/0146621618824853
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
Molenaar, D., & De Boeck, P. (2018). Response mixture modeling: Accounting for heterogeneity in item characteristics across response times. Psychometrika, 83(2), 279–297. https://doi.org/10.1007/s11336-017-9602-9
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. http://jampress.org/pubs.htm
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227. http://jampress.org/pubs.htm
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341–384. https://doi.org/10.3102/10769986027004341
Plummer, M. (2017). JAGS version 4.3.0 user manual. https://sourceforge.net/projects/mcmc-jags/files/Manuals/4.x/jags_user_manual.pdf
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. University of Chicago Press. (Original work published 1960)
Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138. https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2018_20180323/6_PTAM_IRMHR_Main__2018-03-13_1416.pdf
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12(4), 1151–1172. https://doi.org/10.1214/aos/1176346785
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4), 583–639. https://doi.org/10.1111/1467-9868.00353
Uto, M. (2021). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53(4), 1440–1454. https://doi.org/10.3758/s13428-020-01498-x
Uto, M. (2022). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods. Advance online publication. https://doi.org/10.3758/s13428-022-01997-z
van der Linde, A. (2005). DIC in variable selection. Statistica Neerlandica, 59(1), 45–56. https://doi.org/10.1111/j.1467-9574.2005.00278.x
van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. https://doi.org/10.1007/s11336-006-1478-z
van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46(3), 247–272. https://doi.org/10.1111/j.1745-3984.2009.00080.x
van der Linden, W. J. (2011). Modeling response times with latent variables: Principles and applications. Psychological Test and Assessment Modeling, 53(3), 334–358. https://www.psychologie-aktuell.com/fileadmin/download/ptam/3-2011_20110927/05_vanderLinden.pdf
van der Linden, W. J. (Ed.). (2016a). Handbook of item response theory (Vol. 1). Chapman & Hall/CRC. https://doi.org/10.1201/9781315374512
van der Linden, W. J. (2016b). Lognormal response-time model. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 261–282). Chapman & Hall/CRC.
van Rijn, P. W., & Ali, U. S. (2017). A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing. British Journal of Mathematical and Statistical Psychology, 70(2), 317–345. https://doi.org/10.1111/bmsp.12101
van Rijn, P. W., & Ali, U. S. (2018). A generalized speed–accuracy response model for dichotomous items. Psychometrika, 83(1), 109–131. https://doi.org/10.1007/s11336-017-9590-9
Wang, W.-C., & Liu, C.-Y. (2007). Formulation and application of the generalized multilevel facets model. Educational and Psychological Measurement, 67(4), 583–605. https://doi.org/10.1177/0013164406296974
Wind, S. A., & Ge, Y. (2021). Detecting rater biases in sparse rater-mediated assessment networks. Educational and Psychological Measurement, 81(5), 996–1022. https://doi.org/10.1177/0013164420988108
Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999