Abstract
Performance assessments increasingly use onscreen or internet-based technology to collect human ratings. One of the benefits of onscreen ratings is the automatic recording of rating times along with the ratings. Considering rating times as an additional data source can provide a more detailed picture of the rating process and improve the psychometric quality of the assessment outcomes. However, currently available models for analyzing performance assessments do not incorporate rating times. The present research aims to fill this gap and advance a joint modeling approach, the “hierarchical facets model for ratings and rating times” (HFM-RT). The model includes two examinee parameters (ability and time intensity) and three rater parameters (severity, centrality, and speed). The HFM-RT successfully recovered examinee and rater parameters in a simulation study and yielded superior reliability indices. A real-data analysis of English essay ratings collected in a high-stakes assessment context revealed that raters differed considerably in their speed measures, spent more time on high-quality than on low-quality essays, and tended to rate essays faster with increasing severity. However, due to the pronounced heterogeneity of examinees’ writing proficiency, the improvement in the assessment’s reliability from using the HFM-RT was small in the real-data example. The discussion focuses on the advantages of accounting for rating times as a source of information in rating quality studies and highlights perspectives from the HFM-RT for future research on rater cognition.
Data availability
The JAGS codes for the HFM-RT and the FM-SC, along with a simulated dataset, are publicly accessible on the Open Science Framework (https://osf.io/nhprs/).
Notes
These settings were partially informed by the real-data study results reported later.
Thinning a Markov chain to reduce autocorrelation is statistically inefficient (Link & Eaton, 2012).
We also compared the parameter estimates before and after thinning the Markov chain (with a thinning interval of 10). The estimates proved highly concordant. We therefore retained the unthinned Markov chain, whose larger effective sample size yields more accurate inference.
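As an illustration of this kind of check, here is a minimal sketch (with hypothetical draws standing in for the actual MCMC output) that compares posterior summaries from a full chain with those from a chain thinned by 10:

```python
import numpy as np

# Hypothetical illustration of the check described in this note: compare
# posterior summaries from the full chain with those from a chain thinned
# by 10. `draws` stands in for the MCMC draws of one parameter
# (shape: chains x iterations); it is not the actual model output.
rng = np.random.default_rng(1)
draws = rng.normal(loc=0.5, scale=0.2, size=(3, 10000))
thinned = draws[:, ::10]  # keep every 10th draw in each chain

# Point estimates are essentially unchanged, but the full chain retains
# ten times as many draws, i.e., a larger effective sample size.
print(f"posterior mean (full):    {draws.mean():.4f}  (n = {draws.size})")
print(f"posterior mean (thinned): {thinned.mean():.4f}  (n = {thinned.size})")
```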
The rating scale and its categories are not publicly available.
The bivariate distributions between response time and rating scores are available at https://osf.io/nhprs/.
Due to the large size of the dataset, each analysis took about 240 hours on a workstation with 2.4-GHz CPUs and 384 GB of RAM.
Because true parameter values are unavailable in empirical studies, the assessment’s reliability was calculated as \(1-\overline{SE(\hat{\theta})^{2}}/\hat{\sigma}_{\theta}^{2}\). The high reliabilities from both models and the consequently small reliability difference can be explained by the pronounced heterogeneity of examinees’ English writing proficiency (see top left panel of Fig. 3).
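For illustration, a minimal sketch of this reliability calculation, with made-up arrays standing in for the ability estimates and their standard errors (the variable names are ours, not taken from the released code):

```python
import numpy as np

# Hypothetical illustration of the empirical reliability index
# 1 - mean(SE(theta_hat)^2) / sigma_theta_hat^2. The arrays below are
# made-up stand-ins for the posterior means and standard errors of the
# examinee ability estimates, not values from the reported analyses.
rng = np.random.default_rng(2024)
theta_hat = rng.normal(0.0, 1.5, size=5000)    # estimated abilities
se_theta = rng.uniform(0.25, 0.40, size=5000)  # their standard errors

# Estimated ability variance, here approximated by the variance of the
# point estimates.
sigma2_theta = np.var(theta_hat, ddof=1)
reliability = 1.0 - np.mean(se_theta**2) / sigma2_theta
print(f"empirical reliability: {reliability:.3f}")
```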
The expected rating time can be calculated as follows: \(\hat{T}_{nk}=\exp\left(\hat{\beta}_{n}-\hat{\zeta}_{k}+\hat{\sigma}_{\epsilon}^{2}/2\right)\).
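For example, with illustrative (not estimated) parameter values, the formula can be evaluated as follows:

```python
import numpy as np

# Hypothetical example of the expected rating time
# T_hat_nk = exp(beta_n - zeta_k + sigma_eps^2 / 2).
# The parameter values are illustrative, not estimates from the study.
beta_n = 5.8      # time intensity of examinee n (log-seconds scale)
zeta_k = 0.4      # speed of rater k (larger values = faster rater)
sigma_eps = 0.5   # residual SD of the log rating times

expected_time = np.exp(beta_n - zeta_k + sigma_eps**2 / 2)
print(f"expected rating time: {expected_time:.0f} seconds")
# exp(5.8 - 0.4 + 0.125) = exp(5.525), roughly 251 seconds (about 4 minutes)
```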
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
de Ayala, R. J. (2022). The theory and practice of item response theory (2nd ed.). Guilford Press.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9. https://doi.org/10.1111/j.1745-3992.2012.00238.x
Bennett, R. E. (2003). Online assessment and the comparability of score meaning (Research Memorandum No. RM-03-05). Educational Testing Service. https://www.ets.org/Media/Research/pdf/RM-03-05-Bennett.pdf
Bolsinova, M., & Tijmstra, J. (2018). Improving precision of ability estimation: Getting more from response times. British Journal of Mathematical and Statistical Psychology, 71(1), 13–38. https://doi.org/10.1111/bmsp.12104
Casabianca, J. M., Junker, B. W., & Patz, R. J. (2016). Hierarchical rater models. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 449–465). Chapman & Hall/CRC.
Cheng, Y., & Shao, C. (2022). Application of change point analysis of response time data to detect test speededness. Educational and Psychological Measurement, 82(5), 1031–1062. https://doi.org/10.1177/00131644211046392
Coniam, D. (2010). Validating onscreen marking in Hong Kong. Asian Pacific Education Review, 11(3), 423–431. https://doi.org/10.1007/s12564-009-9068-2
Coniam, D., & Falvey, P. (Eds.). (2016). Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong. Springer. https://doi.org/10.1007/978-981-10-0434-6
Cooze, M. (2011). Assessing writing tests on scoris®: The introduction of online marking. Research Notes, 43, 12–15. https://www.cambridgeenglish.org/Images/23161-research-notes-43.pdf
De Boeck, P., & Jeon, M. (2019). An overview of models for response times and processes in cognitive tests. Frontiers in Psychology, 10, 102. https://doi.org/10.3389/fpsyg.2019.00102
DeCarlo, L. T., Kim, Y. K., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48(3), 333–356. https://doi.org/10.1111/j.1745-3984.2011.00143.x
Eckes, T. (2012). Operational rater types in writing assessment: Linking rater cognition to rater behavior. Language Assessment Quarterly, 9(3), 270–292. https://doi.org/10.1080/15434303.2011.649381
Eckes, T. (2017). Rater effects: Advances in item response modeling of human ratings - Part I [Editorial]. Psychological Test and Assessment Modeling, 59(4), 443–452. https://www.psychologie-aktuell.com/fileadmin/download/ptam/4-2017_20171218/03_Eckes.pdf
Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang. https://doi.org/10.3726/b20875
Eckes, T., & Jin, K.-Y. (2021a). Examining severity and centrality effects in TestDaF writing and speaking assessments: An extended Bayesian many-facet Rasch analysis. International Journal of Testing, 21(3–4), 131–153. https://doi.org/10.1080/15305058.2021.1963260
Eckes, T., & Jin, K.-Y. (2021b). Measuring rater centrality effects in writing assessment: A Bayesian facets modeling approach. Psychological Test and Assessment Modeling, 63(1), 65–94. https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2021/Seiten_aus_PTAM_2021-1_ebook_4.pdf
Eckes, T., & Jin, K.-Y. (2022). Detecting illusory halo effects in rater-mediated assessment: A mixture Rasch facets modeling approach. Psychological Test and Assessment Modeling, 64(1), 87–111. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2022-1/PTAM__1-2022_5_kor.pdf
Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge. https://doi.org/10.4324/9781315766829
Falvey, P., & Coniam, D. (2010). A qualitative study of the response of raters towards onscreen and paper-based marking. Melbourne Papers in Language Testing, 15(1), 1–26. https://arts.unimelb.edu.au/__data/assets/pdf_file/0003/3518706/15_1_1_Falvey-and-Coniam.pdf
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 169–193). Oxford University Press.
Glazer, N., & Wolfe, E. W. (2020). Understanding and interpreting human scoring. Applied Measurement in Education, 33(3), 191–197. https://doi.org/10.1080/08957347.2020.1750402
Goudie, R. J. B., Turner, R. M., De Angelis, D., & Thomas, A. (2020). MultiBUGS: A parallel implementation of the BUGS modelling framework for faster Bayesian inference. Journal of Statistical Software, 95(7), 1–20. https://doi.org/10.18637/jss.v095.i07
International Test Commission (ITC) and Association of Test Publishers (ATP). (2022). Guidelines for technology-based assessment. https://www.intestcom.org/upload/media-library/guidelines-for-technology-based-assessment-v20221108-16684036687NAG8.pdf
Jackman, S. (2009). Bayesian analysis for the social sciences. Wiley. https://doi.org/10.1002/9780470686621
Jin, K.-Y., & Chiu, M. M. (2022). A mixture Rasch facets model for rater’s illusory halo effects. Behavior Research Methods, 54(6), 2750–2764. https://doi.org/10.3758/s13428-021-01721-3
Jin, K.-Y., & Eckes, T. (2022a). Detecting differential rater functioning in severity and centrality: The dual DRF facets model. Educational and Psychological Measurement, 82(4), 757–781. https://doi.org/10.1177/00131644211043207
Jin, K.-Y., & Eckes, T. (2022b). Detecting rater centrality effects in performance assessments: A model-based comparison of centrality indices. Measurement: Interdisciplinary Research and Perspectives, 20(4), 228–247. https://doi.org/10.1080/15366367.2021.1972654
Jin, K.-Y., & Eckes, T. (2023). Measuring the impact of peer interaction in group oral assessments with an extended many-facet Rasch model. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12375
Jin, K.-Y., & Wang, W.-C. (2017). Assessment of differential rater functioning in latent classes with new mixture facets models. Multivariate Behavioral Research, 52(3), 391–402. https://doi.org/10.1080/00273171.2017.1299615
Jin, K.-Y., & Wang, W.-C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55(4), 543–563. https://doi.org/10.1111/jedm.12191
Jin, K.-Y., Hsu, C.-L., Chiu, M. M., & Chen, P.-H. (2023). Modeling rapid guessing behaviors in computer-based testlet items. Applied Psychological Measurement, 47(1), 19–33. https://doi.org/10.1177/0146621622112517
Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. Guilford Press.
Knoch, U., Fairbairn, J., & Jin, Y. (2021). Scoring second language spoken and written performance: Issues, options and directions. Equinox.
Lane, S. (2019). Modeling rater response processes in evaluating score meaning. Journal of Educational Measurement, 56(3), 653–663. https://doi.org/10.1111/jedm.12229
Lee, C. (2016a). The role of the Hong Kong Examinations and Assessment Authority. In D. Coniam & P. Falvey (Eds.), Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong (pp. 9–21). Springer. https://doi.org/10.1007/978-981-10-0434-6_2
Lee, C. (2016b). Onscreen marking system. In D. Coniam & P. Falvey (Eds.), Validating technological innovation: The introduction and implementation of onscreen marking in Hong Kong (pp. 23–41). Springer. https://doi.org/10.1007/978-981-10-0434-6_3
Lee, Y.-H., & Chen, H. (2011). A review of recent response-time analyses in educational testing. Psychological Test and Assessment Modeling, 53(3), 359–379. https://www.psychologie-aktuell.com/fileadmin/download/ptam/3-2011_20110927/06_Lee.pdf
Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Chapman & Hall/CRC. https://doi.org/10.1201/9781315374604
Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
Ling, G., Williams, J., O’Brien, S., & Cavalle, C. S. (2022). Scoring essays on an iPad versus a desktop computer: An exploratory study. Educational Testing Service. https://www.ets.org/research/policy_research_reports/publications/report/2022/kelf.html
Link, W. A., & Eaton, M. J. (2012). On thinning of chains in MCMC. Methods in Ecology and Evolution, 3(1), 112–115. https://doi.org/10.1111/j.2041-210X.2011.00131.x
Man, K., Harring, J. R., Jiao, H., & Zhan, P. (2019). Joint modeling of compensatory multidimensional item responses and response times. Applied Psychological Measurement, 43(8), 639–654. https://doi.org/10.1177/0146621618824853
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
Molenaar, D., & De Boeck, P. (2018). Response mixture modeling: Accounting for heterogeneity in item characteristics across response times. Psychometrika, 83(2), 279–297. https://doi.org/10.1007/s11336-017-9602-9
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. http://jampress.org/pubs.htm
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227. http://jampress.org/pubs.htm
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341–384. https://doi.org/10.3102/10769986027004341
Plummer, M. (2017). JAGS version 4.3.0 user manual. https://sourceforge.net/projects/mcmc-jags/files/Manuals/4.x/jags_user_manual.pdf
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. University of Chicago Press. (Original work published 1960)
Robitzsch, A., & Steinfeld, J. (2018). Item response models for human ratings: Overview, estimation methods and implementation in R. Psychological Test and Assessment Modeling, 60(1), 101–138. https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2018_20180323/6_PTAM_IRMHR_Main__2018-03-13_1416.pdf
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12(4), 1151–1172. https://doi.org/10.1214/aos/1176346785
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4), 583–639. https://doi.org/10.1111/1467-9868.00353
Uto, M. (2021). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53(4), 1440–1454. https://doi.org/10.3758/s13428-020-01498-x
Uto, M. (2022). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods. Advance online publication. https://doi.org/10.3758/s13428-022-01997-z
van der Linde, A. (2005). DIC in variable selection. Statistica Neerlandica, 59(1), 45–56. https://doi.org/10.1111/j.1467-9574.2005.00278.x
van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. https://doi.org/10.1007/s11336-006-1478-z
van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46(3), 247–272. https://doi.org/10.1111/j.1745-3984.2009.00080.x
van der Linden, W. J. (2011). Modeling response times with latent variables: Principles and applications. Psychological Test and Assessment Modeling, 53(3), 334–358. https://www.psychologie-aktuell.com/fileadmin/download/ptam/3-2011_20110927/05_vanderLinden.pdf
van der Linden, W. J. (Ed.). (2016a). Handbook of item response theory (Vol. 1). Chapman & Hall/CRC. https://doi.org/10.1201/9781315374512
van der Linden, W. J. (2016b). Lognormal response-time model. In W. J. van der Linden (Ed.), Handbook of item response theory (Vol. 1, pp. 261–282). Chapman & Hall/CRC.
van Rijn, P. W., & Ali, U. S. (2017). A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing. British Journal of Mathematical and Statistical Psychology, 70(2), 317–345. https://doi.org/10.1111/bmsp.12101
van Rijn, P. W., & Ali, U. S. (2018). A generalized speed–accuracy response model for dichotomous items. Psychometrika, 83(1), 109–131. https://doi.org/10.1007/s11336-017-9590-9
Wang, W.-C., & Liu, C.-Y. (2007). Formulation and application of the generalized multilevel facets model. Educational and Psychological Measurement, 67(4), 583–605. https://doi.org/10.1177/0013164406296974
Wind, S. A., & Ge, Y. (2021). Detecting rater biases in sparse rater-mediated assessment networks. Educational and Psychological Measurement, 81(5), 996–1022. https://doi.org/10.1177/0013164420988108
Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999