Methods for Estimating Conditional Standard Errors of Measurement and Some Critical Reflections

Emons, Wilco H. M.

doi:10.1007/978-3-031-10370-4_11

Part of the book series: Methodology of Educational Measurement and Assessment (MEMA)

Abstract

Educational assessments can have far-reaching consequences for individuals. To allow test users to make valid decisions, it is important to provide evidence about the uncertainty in the observed scores on which individual decisions are based. In this chapter we examine standard errors of measurement defined for specific score groups, which are referred to as conditional standard errors of measurement. In particular, we study the foundations of the ANOVA method proposed by Feldt et al. (1985) within the context of classical test theory. In addition, we suggest some variations and study their practical usefulness, including sample size requirements.

References

Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Waveland Press.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Addison-Wesley.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in modern psychometrics. Cambridge University Press. https://doi.org/10.1017/CBO9780511490026
Brennan, R. L. (1998). Raw-score conditional standard error of measurement in generalizability theory. Applied Psychological Measurement, 22(4), 307–331. https://doi.org/10.1177/014662169802200401
Brennan, R. L. (2001). Generalizability theory. Springer. https://doi.org/10.1007/978-1-4757-3456-0
Evers, A., Lucassen, W., Meijer, R., & Sijtsma, K. (2010). COTAN review system for evaluating test quality. Nederlands Instituut van Psychologen. https://www.psynip.nl/wp-content/uploads/2019/05/NIP-Brochure-Cotan-2018-correctie-1.pdf
Ellis, J. L., & Van den Wollenberg, A. L. (1993). Local homogeneity in latent trait models: A characterization of the homogeneous monotone IRT model. Psychometrika, 58(3), 417–429. https://doi.org/10.1007/BF02294649
Feldt, L. S., Steffen, M., & Gupta, N. C. (1985). A comparison of five methods for estimating the standard error of measurement at specific score levels. Applied Psychological Measurement, 9(4), 351–361. https://doi.org/10.1177/014662168500900402
Gu, Z., Emons, W. H. M., & Sijtsma, K. (2021). Precision and sample size requirements for regression-based norming methods for change scores. Assessment, 28(2), 503–517. https://doi.org/10.1177/1073191120913607
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Kluwer Academic Publishers.
Harvill, L. M. (1991). Standard error of measurement. Educational Measurement: Issues and Practice, 10(2), 33–41. https://doi.org/10.1111/j.1745-3992.1991.tb00195.x
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55(4), 577–601. https://doi.org/10.1007/BF02294609
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: Application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149. https://doi.org/10.1007/BF02296657
Hopster-Den Otter, D., Muilenburg, S. N., Wools, S., Veldkamp, B. P., & Eggen, T. J. H. M. (2019). Comparing the influence of various measurement error presentations in test score reports on educational decision making. Assessment in Education: Principles, Policy & Practice, 26(2), 123–142. https://doi.org/10.1080/0969594X.2018.1447908
Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6(3), 153–160. https://doi.org/10.1007/BF02289270
Jabrayilov, R., Emons, W. H. M., & Sijtsma, K. (2016). Comparison of classical test theory and item response theory in individual change assessment. Applied Psychological Measurement, 40(8), 559–572. https://doi.org/10.1177/0146621616664046
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to define meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59(1), 12–19. https://doi.org/10.1037//0022-006x.59.1.12
Jarjoura, D. (1986). An estimator of examinee-level measurement error variance that considers test form difficulty adjustments. Applied Psychological Measurement, 10(2), 175–186. https://doi.org/10.1177/014662168601000209
Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. Springer.
Kuijpers, R. E., Van der Ark, L. A., & Croon, M. (2013). Testing hypotheses involving Cronbach’s alpha using marginal models. British Journal of Mathematical and Statistical Psychology, 66(3), 503–520. https://doi.org/10.1111/bmsp.12010
Lee, W.-C., Brennan, R. L., & Kolen, M. J. (2000). Estimators of conditional scale-score standard errors of measurement: A simulation study. Journal of Educational Measurement, 37(1), 1–20. https://doi.org/10.1111/j.1745-3984.2000.tb01073.x
Lek, K. M., & Van De Schoot, R. (2018). A comparison of the single, conditional and person-specific standard error of measurement: What do they measure and when to use them? Frontiers in Applied Mathematics and Statistics, 4(1). https://doi.org/10.3389/fams.2018.00040
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum. https://doi.org/10.4324/9780203056615
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409
Lumsden, J. (1978). Tests are perfectly reliable. British Journal of Mathematical and Statistical Psychology, 31(1), 19–26. https://doi.org/10.1111/j.2044-8317.1978.tb00568.x
Maassen, G. H. (2004). The standard error in the Jacobson and Truax reliable change index: The classical approach to the assessment of reliable change. Journal of the International Neuropsychological Society, 10(6), 888–893. https://doi.org/10.1017/s1355617704106097
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data. A model comparison perspective (2nd ed.). Lawrence Erlbaum. https://doi.org/10.4324/9781315642956
Mellenbergh, G. J. (1996). Measurement precision in test scores and item response models. Psychological Methods, 1(3), 293–299. https://doi.org/10.1037/1082-989X.1.3.293
Nicewander, A. (2019). Conditional precision of measurement for test scores: Are conditional standard errors sufficient. Educational and Psychological Measurement, 79(1), 5–18. https://doi.org/10.1177/0013164418758373
Oosterwijk, P. (2016). Statistical properties and practical use of classical test-score reliability methods [Unpublished doctoral dissertation]. Tilburg University.
Oosterwijk, P., Van der Ark, L. A., & Sijtsma, K. (2016). Numerical differences between Guttman’s reliability coefficients and the GLB. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & M. Wiberg (Eds.), Quantitative psychology research. Springer. https://doi.org/10.1007/978-3-319-38759-8_12
Oosterwijk, P., Van der Ark, L. A., & Sijtsma, K. (2019). Using confidence intervals for assessing reliability of real tests. Assessment, 26(7), 1207–1216. https://doi.org/10.1177/1073191117737375
Payton, J., Weissberg, R. P., Durlak, J. A., Dymnicki, A. B., Taylor, R. D., Schellinger, K. B., et al. (2008). The positive impact of social and emotional learning for kindergarten to eighth-grade students: Findings from three scientific reviews. Collaborative for Academic, Social, and Emotional Learning. https://files.eric.ed.gov/fulltext/ED505370.pdf
Qualls-Payne, A. L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29(3), 213–225. https://doi.org/10.1111/j.1745-3984.1992.tb00374.x
Raju, N. S., Price, L. R., Oshima, T. C., & Nering, M. L. (2007). Standardized conditional SEM: A case for conditional reliability. Applied Psychological Measurement, 31(3), 169–180. https://doi.org/10.1177/0146621606291569
Reise, S. P., & Haviland, M. G. (2005). Item response theory and the measurement of clinical change. Journal of Personality Assessment, 84(3), 228–238. https://doi.org/10.1207/s15327752jpa8403_02
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. ETS Research Bulletin Series, 1968(1), 1–169.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
Sijtsma, K. (2012). Future of psychometrics: Ask what psychometrics can do for psychology. Psychometrika, 77(1), 4–20. https://doi.org/10.1007/s11336-011-9242-4
Sijtsma, K., & Emons, W. H. M. (2011). Advice on total-score reliability issues in psychosomatic measurement. Journal of Psychosomatic Research, 70(6), 565–572. https://doi.org/10.1016/j.jpsychores.2010.11.002
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Sage.
Sijtsma, K., & Van der Ark, L. A. (2020). Measurement models for psychological attributes. CRC Press. https://doi.org/10.1201/9780429112447
Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39–49. https://doi.org/10.1177/014662169501900105
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Sage.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement. Addison-Wesley.
Van der Linden, W. J. (2016). Handbook of item response theory. CRC Press. https://doi.org/10.1201/9781315374512
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1997). Learning influences. In H. J. Walberg & G. D. Haertel (Eds.), Psychology and educational practice (pp. 199–211). McCutchan.
Woodhouse, B., & Jackson, P. H. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: II: A search for the greatest lower bound. Psychometrika, 42(4), 579–591. https://doi.org/10.1007/BF02295980
Woodruff, D. (1990). Conditional standard error of measurement in prediction. Journal of Educational Measurement, 27(3), 191–208. https://doi.org/10.1111/j.1745-3984.1990.tb00743.x
Woodruff, D., Traynor, A., Cui, Z., & Fang, Y. (2013). A comparison of three methods for computing scale score conditional standard errors of measurement. ACT. https://files.eric.ed.gov/fulltext/ED555593.pdf

Author information

Authors and Affiliations

Department of Methodology and Statistics, Tilburg University, Tilburg, The Netherlands
Wilco H. M. Emons

Corresponding author

Correspondence to Wilco H. M. Emons.

Editor information

Editors and Affiliations

Research Institute of Child Development and Education, University of Amsterdam, Amsterdam, The Netherlands
L. Andries van der Ark
Department of Methodology and Statistics, Tilburg University, Tilburg, The Netherlands
Wilco H. M. Emons
The expertise group Psychometrics and Statistics, University of Groningen, Groningen, The Netherlands
Rob R. Meijer

Appendix

1.1 Proof of Eq. 11.8

We start from the well-known definition of the squared standard error of measurement (that is, the error variance) based on coefficient alpha:

$$ {S}_E^2\left({\lambda}_3\right)=\left(1-\alpha \right)\cdot {S}_X^2, $$

(11.A1)

where α is coefficient alpha and $ {S}_X^2 $ is the variance of the total scores across persons. Because $ \alpha \equiv ICC\left(3,J\right)=1-\frac{M{S}_{N\times J}}{M{S}_s} $ (Eq. 11.7), substituting the definition of ICC(3, J) for α gives

$$ {S}_E^2\left({\lambda}_3\right)=\frac{M{S}_{N\times J}}{M{S}_s}\cdot {S}_X^2. $$

(11.A2)
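The step from Eq. (11.A1) to Eq. (11.A2) rests on the identity between coefficient alpha and ICC(3, J). The following is a minimal numerical sketch of that identity, not the chapter's own code. It assumes a complete persons-by-items score matrix, variances with an n − 1 denominator, and simulated dichotomous data; the function names `anova_mean_squares` and `coefficient_alpha` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def anova_mean_squares(scores):
    """Return (MS_s, MS_NxJ) from a two-way ANOVA without replication.

    `scores` is an n-persons-by-J-items matrix (assumed complete).
    """
    scores = np.asarray(scores, dtype=float)
    n, J = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)
    ss_persons = J * np.sum((person_means - grand) ** 2)
    ss_items = n * np.sum((item_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_resid = ss_total - ss_persons - ss_items  # person-by-item interaction
    ms_s = ss_persons / (n - 1)
    ms_nxj = ss_resid / ((n - 1) * (J - 1))
    return ms_s, ms_nxj


def coefficient_alpha(scores):
    """Coefficient alpha from item variances and total-score variance."""
    scores = np.asarray(scores, dtype=float)
    J = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return J / (J - 1) * (1 - item_vars.sum() / total_var)


# Simulated data (an assumption for illustration): 200 persons, 10 items.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
scores = (ability + rng.normal(size=(200, 10)) > 0).astype(float)

ms_s, ms_nxj = anova_mean_squares(scores)
alpha = coefficient_alpha(scores)
s2_x = scores.sum(axis=1).var(ddof=1)  # S_X^2, total-score variance

err_a1 = (1 - alpha) * s2_x       # right-hand side of Eq. (11.A1)
err_a2 = ms_nxj / ms_s * s2_x     # right-hand side of Eq. (11.A2)

print("alpha == 1 - MS_NxJ / MS_s:", np.isclose(alpha, 1 - ms_nxj / ms_s))
print("(11.A1) == (11.A2):        ", np.isclose(err_a1, err_a2))
```

Both comparisons should agree up to floating-point error, because the identity is algebraic and does not depend on the simulated data.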

Furthermore, we have $ M{S}_s=\frac{J{\sum}_v{\overline{X}}_v^2- nJ{\overline{X}}^2}{n-1} $ (e.g., Brennan, 2001, p. 41), where $ {\overline{X}}_v^2 $ is the squared mean score across the J items for an arbitrary person v. It can be shown, after some tedious algebra, that $ M{S}_s $ is equivalent to $ \frac{\sum_v{X}_{+}^2-n{\overline{X}}_{+}^2}{J\left(n-1\right)}=\frac{S_X^2}{J} $, showing that $ M{S}_s $ can be conceived of as the average variance of subjects across items. Substituting $ \frac{S_X^2}{J} $ for $ M{S}_s $ in Eq. (11.A2) gives $ {S}_E^2\left({\lambda}_3\right)=\frac{M{S}_{N\times J}}{\frac{S_X^2}{J}}\cdot {S}_X^2= JM{S}_{N\times J} $, which completes the proof.
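The algebra behind the equivalence of $ M{S}_s $ and $ \frac{S_X^2}{J} $ can be sketched as follows. The sketch rests on assumptions the text leaves implicit: a person's mean score is the total score divided by the number of items, $ {\overline{X}}_v={X}_{+}/J $ (with the person index v suppressed on $ {X}_{+} $), the grand mean is $ \overline{X}={\overline{X}}_{+}/J $, and $ {S}_X^2 $ is computed with denominator n − 1.

```latex
\begin{align*}
MS_s
  &= \frac{J\sum_v \overline{X}_v^2 - nJ\,\overline{X}^2}{n-1}
   = \frac{J\sum_v \left(X_{+}/J\right)^2 - nJ\left(\overline{X}_{+}/J\right)^2}{n-1} \\
  &= \frac{\sum_v X_{+}^2 - n\,\overline{X}_{+}^2}{J\,(n-1)}
   = \frac{S_X^2}{J},
\qquad\text{because}\qquad
S_X^2 = \frac{\sum_v X_{+}^2 - n\,\overline{X}_{+}^2}{n-1}.
\end{align*}
```

Each squared term carries a factor $ 1/J^2 $, which combines with the leading J to leave the single factor $ 1/J $ in the final expression.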

About this chapter

Cite this chapter

Emons, W.H.M. (2023). Methods for Estimating Conditional Standard Errors of Measurement and Some Critical Reflections. In: van der Ark, L.A., Emons, W.H.M., Meijer, R.R. (eds) Essays on Contemporary Psychometrics. Methodology of Educational Measurement and Assessment. Springer, Cham. https://doi.org/10.1007/978-3-031-10370-4_11

DOI: https://doi.org/10.1007/978-3-031-10370-4_11
Published: 16 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10369-8
Online ISBN: 978-3-031-10370-4
eBook Packages: Education, Education (R0)