Abstract
Educational assessments can have far-reaching consequences for individuals. To allow test users to make valid decisions, it is important to provide evidence about the uncertainties in the observed scores on which the individual decisions are based. In this chapter we examine standard errors of measurement defined for specific score groups, which are referred to as conditional standard errors of measurement. In particular, we study the foundations of the ANOVA method proposed by Feldt et al. (Appl Psychol Meas 9:351–361, 1985) within the context of classical test theory. In addition, we suggest some variations and study their practical usefulness including sample size requirements.
References
Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Waveland Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Addison-Wesley.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in modern psychometrics. Cambridge University Press. https://doi.org/10.1017/CBO9780511490026
Brennan, R. L. (1998). Raw-score conditional standard error of measurement in generalizability theory. Applied Psychological Measurement, 22(4), 307–331. https://doi.org/10.1177/014662169802200401
Brennan, R. L. (2001). Generalizability theory. Springer. https://doi.org/10.1007/978-1-4757-3456-0
Ellis, J. L., & Van den Wollenberg, A. L. (1993). Local homogeneity in latent trait models: A characterization of the homogeneous monotone IRT model. Psychometrika, 58(3), 417–429. https://doi.org/10.1007/BF02294649
Evers, A., Lucassen, W., Meijer, R., & Sijtsma, K. (2010). COTAN review system for evaluating test quality. Nederlands Instituut van Psychologen. https://www.psynip.nl/wp-content/uploads/2019/05/NIP-Brochure-Cotan-2018-correctie-1.pdf
Feldt, L. S., Steffen, M., & Gupta, N. C. (1985). A comparison of five methods for estimating the standard error of measurement at specific score levels. Applied Psychological Measurement, 9(4), 351–361. https://doi.org/10.1177/014662168500900402
Gu, Z., Emons, W. H. M., & Sijtsma, K. (2021). Precision and sample size requirements for regression-based norming methods for change scores. Assessment, 28(2), 503–517. https://doi.org/10.1177/1073191120913607
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Kluwer Academic Publishers.
Harvill, L. M. (1991). Standard error of measurement. Educational Measurement: Issues and Practice, 10(2), 33–41. https://doi.org/10.1111/j.1745-3992.1991.tb00195.x
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55(4), 577–601. https://doi.org/10.1007/BF02294609
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: Application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149. https://doi.org/10.1007/BF02296657
Hopster-Den Otter, D., Muilenburg, S. N., Wools, S., Veldkamp, B. P., & Eggen, T. J. H. M. (2019). Comparing the influence of various measurement error presentations in test score reports on educational decision making. Assessment in Education: Principles, Policy & Practice, 26(2), 123–142. https://doi.org/10.1080/0969594X.2018.1447908
Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6(3), 153–160. https://doi.org/10.1007/BF02289270
Jabrayilov, R., Emons, W. H. M., & Sijtsma, K. (2016). Comparison of classical test theory and item response theory in individual change assessment. Applied Psychological Measurement, 40(8), 559–572. https://doi.org/10.1177/0146621616664046
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to define meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59(1), 12–19. https://doi.org/10.1037//0022-006x.59.1.12
Jarjoura, D. (1986). An estimator of examinee-level measurement error variance that considers test form difficulty adjustments. Applied Psychological Measurement, 10(2), 175–186. https://doi.org/10.1177/014662168601000209
Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. Springer.
Kuijpers, R. E., Van der Ark, L. A., & Croon, M. (2013). Testing hypotheses involving Cronbach’s alpha using marginal models. British Journal of Mathematical and Statistical Psychology, 66(3), 503–520. https://doi.org/10.1111/bmsp.12010
Lee, W.-C., Brennan, R. L., & Kolen, M. J. (2000). Estimators of conditional scale-score standard errors of measurement: A simulation study. Journal of Educational Measurement, 37(1), 1–20. https://doi.org/10.1111/j.1745-3984.2000.tb01073.x
Lek, K. M., & Van De Schoot, R. (2018). A comparison of the single, conditional and person-specific standard error of measurement: What do they measure and when to use them? Frontiers in Applied Mathematics and Statistics, 4(1). https://doi.org/10.3389/fams.2018.00040
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum. https://doi.org/10.4324/9780203056615
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31(1), 19–26. https://doi.org/10.1111/j.2044-8317.1978.tb00568.x
Maassen, G. H. (2004). The standard error in the Jacobson and Truax reliable change index: The classical approach to the assessment of reliable change. Journal of the International Neuropsychological Society, 10(6), 888–893. https://doi.org/10.1017/s1355617704106097
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data. A model comparison perspective (2nd ed.). Lawrence Erlbaum. https://doi.org/10.4324/9781315642956
Mellenbergh, G. J. (1996). Measurement precision in test scores and item response models. Psychological Methods, 1(3), 293–299. https://doi.org/10.1037/1082-989X.1.3.293
Nicewander, A. (2019). Conditional precision of measurement for test scores: Are conditional standard errors sufficient. Educational and Psychological Measurement, 79(1), 5–18. https://doi.org/10.1177/0013164418758373
Oosterwijk, P. (2016). Statistical properties and practical use of classical test-score reliability methods [Unpublished doctoral dissertation]. Tilburg University.
Oosterwijk, P., Van der Ark, L. A., & Sijtsma, K. (2016). Numerical differences between Guttman’s reliability coefficients and the GLB. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & M. Wiberg (Eds.), Quantitative psychology research. Springer. https://doi.org/10.1007/978-3-319-38759-8_12
Oosterwijk, P., Van der Ark, L. A., & Sijtsma, K. (2019). Using confidence intervals for assessing reliability of real tests. Assessment, 26(7), 1207–1216. https://doi.org/10.1177/1073191117737375
Payton, J., Weissberg, R. P., Durlak, J. A., Dymnicki, A. B., Taylor, R. D., Schellinger, K. B., et al. (2008). The positive impact of social and emotional learning for kindergarten to eighth-grade students: Findings from three scientific reviews. Collaborative for Academic, Social, and Emotional Learning. https://files.eric.ed.gov/fulltext/ED505370.pdf
Qualls-Payne, A. L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29(3), 213–225. https://doi.org/10.1111/j.1745-3984.1992.tb00374.x
Raju, N. S., Price, L. R., Oshima, T. C., & Nering, M. L. (2007). Standardized conditional SEM: A case for conditional reliability. Applied Psychological Measurement, 31(3), 169–180. https://doi.org/10.1177/0146621606291569
Reise, S. P., & Haviland, M. G. (2005). Item response theory and the measurement of clinical change. Journal of Personality Assessment, 84(3), 228–238. https://doi.org/10.1207/s15327752jpa8403_02
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. ETS Research Bulletin Series, 1968(1), 1–169.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
Sijtsma, K. (2012). Future of psychometrics: Ask what psychometrics can do for psychology. Psychometrika, 77(1), 4–20. https://doi.org/10.1007/s11336-011-9242-4
Sijtsma, K., & Emons, W. H. M. (2011). Advice on total-score reliability issues in psychosomatic measurement. Journal of Psychosomatic Research, 70(6), 565–572. https://doi.org/10.1016/j.jpsychores.2010.11.002
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Sage.
Sijtsma, K., & Van der Ark, L. A. (2020). Measurement models for psychological attributes. CRC Press. https://doi.org/10.1201/9780429112447
Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39–49. https://doi.org/10.1177/014662169501900105
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Sage.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement. Addison-Wesley.
Van der Linden, W. J. (2016). Handbook of item response theory. CRC Press. https://doi.org/10.1201/9781315374512
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1997). Learning influences. In H. J. Walberg & G. D. Haertel (Eds.), Psychology and educational practice (pp. 199–211). McCutchan.
Woodhouse, B., & Jackson, P. H. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: II: A search for the greatest lower bound. Psychometrika, 42(4), 579–591. https://doi.org/10.1007/BF02295980
Woodruff, D. (1990). Conditional standard error of measurement in prediction. Journal of Educational Measurement, 27(3), 191–208. https://doi.org/10.1111/j.1745-3984.1990.tb00743.x
Woodruff, D., Traynor, A., Cui, Z., & Fang, Y. (2013). A comparison of three methods for computing scale score conditional standard errors of measurement. ACT. https://files.eric.ed.gov/fulltext/ED555593.pdf
Appendix
1.1 Proof of Eq. 11.8
We start from the well-known definition of the standard error of measurement, written here in squared form; that is,
\( SE{M}^2={S}_X^2\left(1-\alpha \right), \) (11.A1)
where α is coefficient alpha and \( {S}_X^2 \) the variance of the total scores across persons. Because \( \alpha \equiv ICC\left(3,J\right)=1-\frac{M{S}_{N\times J}}{M{S}_s} \) (Eq. 11.7), substituting the definition of ICC(3, J) for α gives
\( SE{M}^2=\frac{M{S}_{N\times J}}{M{S}_s}\cdot {S}_X^2. \) (11.A2)
Furthermore, we have \( M{S}_s=\frac{J{\sum}_v{\overline{X}}_v^2- nJ{\overline{X}}^2}{n-1} \) (e.g., Brennan, 2001, p. 41), where \( {\overline{X}}_v^2 \) is the squared average test score for an arbitrary person v and \( \overline{X} \) is the grand mean. Writing each person's average as the total score \( {X}_{+} \) divided by J, and the grand mean as the mean total score \( {\overline{X}}_{+} \) divided by J, it can be shown after some algebra that \( M{S}_s \) is equal to \( \frac{\sum_v{X}_{+}^2-n{\overline{X}}_{+}^2}{J\left(n-1\right)}=\frac{S_X^2}{J} \), showing that \( M{S}_s \) can be conceived as the average variance of subjects across items. Substituting \( \frac{S_X^2}{J} \) for \( M{S}_s \) in Eq. (11.A2) gives \( \frac{M{S}_{N\times J}}{\frac{S_X^2}{J}}\cdot {S}_X^2= JM{S}_{N\times J} \), which completes the proof.
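The identities used in this proof can also be checked numerically. The following is a minimal sketch, not part of the original chapter: it simulates dichotomous item scores under an arbitrary, purely illustrative response model, computes the mean squares of a persons-by-items ANOVA, and compares \( {S}_X^2\left(1-\alpha \right) \) with \( JM{S}_{N\times J} \). The function names (simulate_scores, anova_mean_squares, coefficient_alpha) and the simulation settings are assumptions introduced here for illustration only.

```python
import random
from statistics import mean, variance


def simulate_scores(n=200, J=10, seed=1):
    """Simulate an n-by-J matrix of 0/1 item scores (illustrative data only)."""
    rng = random.Random(seed)
    difficulties = [-1 + 2 * j / (J - 1) for j in range(J)]
    scores = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        scores.append([1 if ability + rng.gauss(0, 1) > b else 0
                       for b in difficulties])
    return scores


def anova_mean_squares(scores):
    """Return (MS_s, MS_NxJ) from a two-way persons-by-items ANOVA."""
    n, J = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    person_means = [mean(row) for row in scores]
    item_means = [mean([row[j] for row in scores]) for j in range(J)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = J * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    # With one observation per cell, the interaction is the residual.
    ss_interaction = ss_total - ss_persons - ss_items
    ms_s = ss_persons / (n - 1)
    ms_nxj = ss_interaction / ((n - 1) * (J - 1))
    return ms_s, ms_nxj


def coefficient_alpha(scores):
    """Coefficient alpha from item variances and total-score variance."""
    J = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(J)]
    total_var = variance([sum(row) for row in scores])
    return J / (J - 1) * (1 - sum(item_vars) / total_var)


scores = simulate_scores()
J = len(scores[0])
ms_s, ms_nxj = anova_mean_squares(scores)
s2_x = variance([sum(row) for row in scores])  # variance of total scores
alpha = coefficient_alpha(scores)

sem2_classical = s2_x * (1 - alpha)  # S_X^2 (1 - alpha)
sem2_anova = J * ms_nxj              # J * MS_NxJ

print("S_X^2 (1 - alpha):", sem2_classical)
print("J * MS_NxJ:       ", sem2_anova)
print("MS_s vs S_X^2 / J:", ms_s, s2_x / J)
```

The two printed error variances should agree up to floating-point rounding for any complete persons-by-items score matrix, because the equality is algebraic and does not depend on the model that generated the data.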
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Emons, W.H.M. (2023). Methods for Estimating Conditional Standard Errors of Measurement and Some Critical Reflections. In: van der Ark, L.A., Emons, W.H.M., Meijer, R.R. (eds) Essays on Contemporary Psychometrics. Methodology of Educational Measurement and Assessment. Springer, Cham. https://doi.org/10.1007/978-3-031-10370-4_11
DOI: https://doi.org/10.1007/978-3-031-10370-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10369-8
Online ISBN: 978-3-031-10370-4
eBook Packages: Education (R0)