Abstract
Accurate assessment of a student’s ability is the key task of a test. Assessments based on final responses are the standard approach. As testing infrastructure advances, substantially more information can be observed. One such instance is process data, which are collected by computer-based interactive items and record a student’s detailed interaction with the item. In this paper, we show theoretically, and with both simulated and empirical data, that appropriately incorporating such information into the assessment substantially improves the precision of ability estimation.
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. The Journal of Technology, Learning and Assessment, 4(3). Retrieved from https://ejournals.bc.edu/index.php/jtla/article/view/1650
Bejar, I. I., Mislevy, R. J., & Zhang, M. (2016). Automated scoring with validity in mind. In A. A. Rupp & J. P. Leighton (Eds.), The Wiley handbook of cognition and assessment (pp. 226–246). https://doi.org/10.1002/9781118956588.ch10
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics, 18(1), 105–110.
Bolsinova, M., & Tijmstra, J. (2018). Improving precision of ability estimation: Getting more from response times. British Journal of Mathematical and Statistical Psychology, 71(1), 13–38.
Casella, G., & Berger, R. L. (2002). Statistical inference (Vol. 2). Duxbury.
Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability of scores for a performance assessment scored with a computer-automated scoring system. Journal of Educational Measurement, 37(3), 245–261.
Evanini, K., Heilman, M., Wang, X., & Blanchard, D. (2015). Automated scoring for the TOEFL Junior® comprehensive writing and speaking test. ETS Research Report Series, 2015(1), 1–11.
Fife, J. H. (2013). Automated scoring of mathematics tasks in the common core era: Enhancements to m-rater in support of CBAL\(^{\rm TM}\) mathematics and the Common Core assessments. ETS Research Report Series, 2013(2), i–35.
Foltz, P. W., Laham, D., & Landauer, T. K. (1999). Automated essay scoring: Applications to educational technology. In B. Collis & R. Oliver (Eds.), Proceedings of EdMedia + Innovate Learning 1999 (pp. 939–944). Association for the Advancement of Computing in Education (AACE).
Frey, A., Spoden, C., Goldhammer, F., & Wenzel, S. F. C. (2018). Response time-based treatment of omitted responses in computer-based testing. Behaviormetrika, 45(2), 505–526.
He, Q., Veldkamp, B. P., Glas, C. A., & de Vries, T. (2017). Automated assessment of patients’ self-narratives for posttraumatic stress disorder screening using natural language processing and text mining. Assessment, 24(2), 157–172.
He, Q., Veldkamp, B. P., Glas, C. A., & van den Berg, S. M. (2019). Combining text mining of long constructed responses and item-based measures: A hybrid test design to screen for posttraumatic stress disorder (PTSD). Frontiers in Psychology, 10, 2358.
He, Q., & von Davier, M. (2016). Analyzing process data from problem-solving items with N-grams: Insights from a computer-based large-scale assessment. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of research on technology tools for real-world skill development (pp. 750–777). IGI Global. https://doi.org/10.4018/978-1-4666-9441-5.ch029
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58(4), 587–599.
LaMar, M. M. (2018). Markov decision process measurement model. Psychometrika, 83(1), 67–88.
Lehmann, E. L., & Romano, J. P. (2005). Testing statistical hypotheses (3rd ed.). Springer.
Liu, H., Liu, Y., & Li, M. (2018). Analysis of process data of PISA 2012 computer-based problem solving: Application of the modified multilevel mixture IRT model. Frontiers in Psychology, 9, 1372.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i–30.
OECD. (2012). Literacy, numeracy and problem solving in technology-rich environments: Framework for the OECD survey of adult skills. OECD Publishing.
Page, E. B. (1966). The imminence of grading essays by computer. The Phi Delta Kappan, 47(5), 238–243.
Qiao, X., & Jiao, H. (2018). Data mining techniques in analyzing process data: A didactic. Frontiers in Psychology, 9, 2231.
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests. Danish Institute for Educational Research.
Rose, N., von Davier, M., & Nagengast, B. (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82(3), 795–819.
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric\(^{\rm TM}\) essay scoring system. The Journal of Technology, Learning and Assessment, 4(4). Retrieved from https://ejournals.bc.edu/index.php/jtla/article/view/1651
Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions. Applied Measurement in Education, 31(3), 191–214.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.
Schleicher, A. (2008). PIAAC: A new strategy for assessing adult competencies. International Review of Education, 54(5–6), 627–650.
Tang, X., Wang, Z., Liu, J., & Ying, Z. (2021a). An exploratory analysis of the latent structure of process data via action sequence autoencoders. British Journal of Mathematical and Statistical Psychology, 74(1), 1–33.
Tang, X., Zhang, S., Wang, Z., Liu, J., & Ying, Z. (2021b). ProcData: An R package for process data analysis. Psychometrika, 86(4), 1058–1083.
Tang, X., Wang, Z., He, Q., Liu, J., & Ying, Z. (2020). Latent feature extraction for process data via multidimensional scaling. Psychometrika, 85(2), 378–397.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. V. H. Winston & Sons.
Ulitzsch, E., von Davier, M., & Pohl, S. (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. British Journal of Mathematical and Statistical Psychology, 73, 83–112.
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308.
von Davier, M., Sinharay, S., Oranje, A., & Beaton, A. (2006). The statistical procedures used in National Assessment of Educational Progress: Recent developments and future directions. Handbook of Statistics, 26, 1039–1055.
Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer. Routledge.
Xu, H., Fang, G., Chen, Y., Liu, J., & Ying, Z. (2018). Latent class analysis of recurrent events in problem-solving items. Applied Psychological Measurement. https://doi.org/10.1177/0146621617748325
Zumbo, B. D., & Hubley, A. M. (2017). Understanding and investigating response processes in validation research (Vol. 26). Springer.
Acknowledgements
This research was supported in part by NSF Grants SES-1826540, SES-2119938, DMS-2015417 and 1633360. The authors would like to thank Educational Testing Service for providing the data.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Proofs of Theorem 1 and Theorem 2
To prove Theorem 1, we establish the following lemma.
Lemma 1
Let X be a nonconstant random variable, and \(f(\cdot )\) and \(g(\cdot )\) be strictly increasing functions. Suppose that f(X) and g(X) have finite second moments. Then, \({{\,\mathrm{Cov}\,}}\left( f(X), g(X) \right) >0\) .
Proof of lemma 1
Let Y be an independent and identically distributed (i.i.d.) copy of X. It is straightforward to verify the identity
\[ 2{{\,\mathrm{Cov}\,}}\left( f(X), g(X) \right) = E\left[ \left( f(X) - f(Y)\right) \left( g(X) - g(Y)\right) \right] . \]
Clearly, for any x and y, \( (f(x) -f(y) ) (g(x) - g(y))\ge 0\), with equality if and only if \(x=y\), because f and g are strictly increasing. Since X is nonconstant, \(P(X\not =Y)>0\), so the right-hand side of the identity above must be positive. \(\square \)
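Lemma 1 admits a quick Monte Carlo illustration (a sketch only, not part of the formal argument; the transforms \(f=\arctan\) and \(g=\tanh\) and the normal distribution for X are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                 # a nonconstant random variable X
y = rng.normal(size=n)                 # an i.i.d. copy Y of X

f = np.arctan                          # a strictly increasing function f
g = np.tanh                            # a strictly increasing function g

# Lemma 1: Cov(f(X), g(X)) > 0 for strictly increasing f and g.
cov = np.cov(f(x), g(x))[0, 1]
print(cov > 0)                         # True

# The i.i.d.-copy identity behind the proof:
# 2 Cov(f(X), g(X)) = E[(f(X) - f(Y)) (g(X) - g(Y))];
# the two Monte Carlo estimates below should closely agree.
lhs = 2 * cov
rhs = np.mean((f(x) - f(y)) * (g(x) - g(y)))
```

Any other nonconstant distribution for X and any other pair of strictly increasing transforms would serve equally well.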
Proof of Theorem 1
By Assumption A2 (local independence),
Due to Assumption A3 (exponential family), the posterior distribution of \(\theta \) given \({\mathbf {X}}_{-j}\) depends on \(\mathbf{X}_{-j}\) only through the sufficient statistic \(T_j({\mathbf {X}}_{-j})\). In fact,
where \(G_j(t) =E \left[ m_j (\theta ) |T_j({\mathbf {X}}_{-j})=t\right] \). Furthermore, by making use of the exponential family form in Assumption A3 and a standard interchange of the order of differentiation and integration, we can show that
Since both \(m_j\) and \(\eta _j\) are strictly monotone, Lemma 1 implies that \(G_j'(t)\) is strictly positive or negative for all t and, therefore, \(G_j\) is strictly monotone. In other words, there is a one-to-one mapping between \(T_{{\mathbf {X}}_j}\) and \(T_j({\mathbf {X}}_{-j})\). \(\square \)
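The sufficiency argument used above can be checked numerically. The sketch below (an illustration only, assuming a Rasch model with a standard normal prior and hypothetical item difficulties) verifies that the posterior of \(\theta \) depends on the response pattern only through the total score, the sufficient statistic in this exponential family:

```python
import numpy as np

# Hypothetical item difficulties for a 4-item Rasch model.
b = np.array([-1.0, 0.0, 0.5, 1.5])
grid = np.linspace(-4, 4, 801)            # grid approximation of theta
prior = np.exp(-grid ** 2 / 2)            # standard normal prior (unnormalized)

def posterior_mean(pattern):
    # Rasch success probabilities at each grid point (shape: 801 x 4).
    p = 1 / (1 + np.exp(-(grid[:, None] - b)))
    # Likelihood of the observed response pattern.
    lik = np.prod(np.where(pattern, p, 1 - p), axis=1)
    w = prior * lik
    return np.sum(grid * w) / np.sum(w)

# Two different response patterns with the same total score (= 2).
m1 = posterior_mean(np.array([1, 1, 0, 0], dtype=bool))
m2 = posterior_mean(np.array([0, 0, 1, 1], dtype=bool))
print(abs(m1 - m2) < 1e-8)                # True: identical posteriors
```

The pattern-specific factor \(\exp (-\sum _j x_j b_j)\) cancels in the posterior, so any two patterns with equal total score yield the same posterior, exactly as the sufficiency argument requires.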
Proof of Theorem 2
From Theorem 1, we know that \(T_{\mathbf{X}_{-j}}\) is a sufficient statistic of \({\mathbf {X}}_{-j}\) for each j. Since \({{\hat{\theta }}}_{{\mathbf {Y}}}\) is a function of \({\mathbf {Y}}\) and \(\sigma ({\mathbf {Y}}_{-j}) \subseteq \sigma ({\mathbf {X}}_{-j})\), the conditional distribution \({{\hat{\theta }}}_{{\mathbf {Y}}} | T_{\mathbf{X}_{-j}}, Y_j\) is free of \(\theta \). Therefore, we have \(E[{{\hat{\theta }}}_{{\mathbf {Y}}} | T_{{\mathbf {X}}_{-j}}, Y_j, \theta ] = E[{{\hat{\theta }}}_{{\mathbf {Y}}} | T_{{\mathbf {X}}_{-j}}, Y_j ] = \hat{\theta }_{{\mathbf {X}}_{-j}}.\) It follows from the well-known Rao–Blackwell theorem (Casella & Berger, 2002) that \({{\hat{\theta }}}_{{\mathbf {X}}_{-j}}\) reduces the conditional variance and
holds for every j and \(\theta \).
By the Cauchy–Schwarz inequality, we obtain
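The Rao–Blackwell variance reduction invoked in the proof of Theorem 2 can be illustrated with a minimal simulation (a toy Bernoulli example for illustration only, not the latent trait setting of the paper): conditioning a crude unbiased estimator on a sufficient statistic yields an estimator with no larger variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 20, 50_000

# reps independent samples of size n from Bernoulli(theta).
x = rng.binomial(1, theta, size=(reps, n))

# Crude unbiased estimator of theta: the first observation alone.
crude = x[:, 0].astype(float)

# Rao-Blackwellized estimator: E[X_1 | sum(X)] = sum(X)/n, the sample mean,
# since the total sum(X) is sufficient for theta.
rb = x.mean(axis=1)

var_crude, var_rb = crude.var(), rb.var()
print(var_crude > var_rb)  # True: conditioning reduces variance
```

Here the variance drops from roughly \(\theta (1-\theta )\) to \(\theta (1-\theta )/n\), a concrete instance of the inequality established by the Rao–Blackwell theorem.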
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Zhang, S., Wang, Z., Qi, J. et al. Accurate Assessment via Process Data. Psychometrika 88, 76–97 (2023). https://doi.org/10.1007/s11336-022-09880-8