Abstract
Comparability is an umbrella term used in educational assessment to refer, in a general sense, to the methodologies used for score linking and measurement invariance analysis. For instance, using comparability methodologies, testing agencies can estimate achievement trends measured with different tests across several years, or ensure that test booklets composed of different items produce equivalent scores. Comparability methodologies can also be used to assess whether the psychometric properties of a test are equivalent across examinees from different ethnic or linguistic backgrounds. With these methodologies, it is possible to place the results obtained from different tests onto a common score scale, so that scores can be used interchangeably to make valid inferences about examinees' performance. This chapter addresses the concept of score comparability and its importance for measurement validity. The most widely used methodologies for obtaining comparable scores are discussed, and practical recommendations for the design of comparability studies are provided.
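To make concrete what it means to place scores from two test forms onto a common scale, the sketch below implements linear equating under a random-groups design in base R. It is a minimal illustration only: the linear_equate function and all scores are invented for this example, and operational equating studies rely on much larger samples and carefully chosen data-collection designs.

```r
# Minimal sketch of linear equating under a random-groups design.
# A Form X score is mapped to the Form Y score occupying the same
# standardized position (z-score) in its group's distribution:
#   eq(x) = mean(Y) + sd(Y)/sd(X) * (x - mean(X))
linear_equate <- function(score_x, x_scores, y_scores) {
  mean(y_scores) + sd(y_scores) / sd(x_scores) * (score_x - mean(x_scores))
}

# Hypothetical raw scores from two randomly equivalent groups,
# each group taking a different form of the same test.
form_x <- c(12, 15, 18, 20, 22, 25, 27, 30, 31, 34)
form_y <- c(10, 14, 16, 19, 21, 24, 26, 28, 30, 33)

# A raw score of 25 on Form X corresponds, approximately,
# to this score on the Form Y scale:
round(linear_equate(25, form_x, form_y), 2)
```

Under this transformation the equated Form X scores share the mean and standard deviation of the Form Y scores, which is the sense in which the two sets of scores can be treated as interchangeable.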
Acknowledgments
Jorge González was partially funded by the FONDECYT grant 1201129. René Gempp was partially funded by the FONDECYT grant 1151313.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
González, J., & Gempp, R. (2021). Test Comparability and Measurement Validity in Educational Assessment. In J. Manzi, M. R. García, & S. Taut (Eds.), Validity of Educational Assessments in Chile and Latin America. Springer, Cham. https://doi.org/10.1007/978-3-030-78390-7_8
DOI: https://doi.org/10.1007/978-3-030-78390-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78389-1
Online ISBN: 978-3-030-78390-7
eBook Packages: Education, Education (R0)