Abstract
Comparability is an umbrella term used in educational assessment to refer, in a general sense, to the methodologies used for score linking and measurement invariance analysis. For instance, using comparability methodologies, testing agencies can estimate achievement trends measured with different tests across several years, or ensure that test booklets composed of different items produce equivalent scores. Comparability methodologies can also be used to assess whether the psychometric properties of a test are equivalent across examinees from different ethnic or linguistic backgrounds. With these methodologies, it is possible to place the results obtained from different tests onto a common score scale, so that scores can be used interchangeably to make valid inferences about examinees' performance. This chapter addresses the concept of score comparability and its importance for measurement validity. The most widely used methodologies for obtaining comparable scores are discussed, and practical recommendations for the design of comparability studies are provided.
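To make concrete what it means to place scores from two test forms onto a common scale, the sketch below implements linear equating under a random-groups design in base R. It is a minimal illustration only: the linear_equate function and all scores are invented for this example, and operational equating studies rely on much larger samples and carefully chosen data-collection designs.

```r
# Minimal sketch of linear equating under a random-groups design.
# A Form X score is mapped to the Form Y score occupying the same
# standardized position (z-score) in its group's distribution:
#   eq(x) = mean(Y) + sd(Y)/sd(X) * (x - mean(X))
linear_equate <- function(score_x, x_scores, y_scores) {
  mean(y_scores) + sd(y_scores) / sd(x_scores) * (score_x - mean(x_scores))
}

# Hypothetical raw scores from two randomly equivalent groups,
# each group taking a different form of the same test.
form_x <- c(12, 15, 18, 20, 22, 25, 27, 30, 31, 34)
form_y <- c(10, 14, 16, 19, 21, 24, 26, 28, 30, 33)

# A raw score of 25 on Form X corresponds, approximately,
# to this score on the Form Y scale:
round(linear_equate(25, form_x, form_y), 2)
```

Under this transformation the equated Form X scores share the mean and standard deviation of the Form Y scores, which is the sense in which the two sets of scores can be treated as interchangeable.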
Acknowledgments
Jorge González was partially funded by the FONDECYT grant 1201129. René Gempp was partially funded by the FONDECYT grant 1151313.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
González, J., & Gempp, R. (2021). Test Comparability and Measurement Validity in Educational Assessment. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of Educational Assessments in Chile and Latin America. Springer, Cham. https://doi.org/10.1007/978-3-030-78390-7_8
DOI: https://doi.org/10.1007/978-3-030-78390-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78389-1
Online ISBN: 978-3-030-78390-7
eBook Packages: Education, Education (R0)