
Test Comparability and Measurement Validity in Educational Assessment

A chapter in Validity of Educational Assessments in Chile and Latin America

Abstract

Comparability is an umbrella term sometimes used in educational assessment to refer, in a very general sense, to the methodologies used for score linking or measurement invariance analysis. For instance, using comparability methodologies, testing agencies can estimate achievement trends measured with different tests across several years, or ensure that test booklets with different items produce equivalent scores. Comparability methodologies can also be used to assess whether the psychometric properties of a test are equivalent across examinees from different ethnic or linguistic backgrounds. With these methodologies it is possible to put the results obtained from different tests onto a common score scale, such that scores can be used interchangeably to make valid inferences about the performance of examinees. This chapter addresses the concept of score comparability and its importance for measurement validity. The most widely used methodologies for obtaining comparable scores are discussed, and some practical recommendations for the design of comparability studies are provided.
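To make the idea of a common score scale concrete, the sketch below (not from the chapter; all names and data are hypothetical) implements a crude equipercentile linking under an equivalent-groups design: a score on form X is mapped to the form-Y score with the same percentile rank, i.e. phi(x) = F_Y^{-1}(F_X(x)). Operational equating would add continuization and smoothing of the discrete score distributions, which this sketch omits.

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, x):
    """Map score x from form X onto the form-Y scale by matching percentile ranks."""
    scores_x = np.asarray(scores_x)
    scores_y = np.asarray(scores_y)
    # Percentile rank of x in the form-X distribution (empirical CDF)
    p = np.mean(scores_x <= x)
    # Form-Y score with the same percentile rank (empirical inverse CDF)
    return np.quantile(scores_y, p)

# Hypothetical data: two 40-item forms administered to equivalent groups,
# with form X slightly harder than form Y.
rng = np.random.default_rng(2021)
form_x = rng.binomial(n=40, p=0.55, size=2000)
form_y = rng.binomial(n=40, p=0.60, size=2000)

# A raw score of 22 on the harder form X maps to a somewhat higher
# score (roughly 24) on the easier form Y.
print(equipercentile_equate(form_x, form_y, 22))
```

The equated scores, rather than the raw scores, are what can be reported interchangeably across the two forms.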




Acknowledgments

Jorge González was partially funded by the FONDECYT grant 1201129. René Gempp was partially funded by the FONDECYT grant 1151313.

Author information

Correspondence to Jorge González.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

González, J., Gempp, R. (2021). Test Comparability and Measurement Validity in Educational Assessment. In: Manzi, J., García, M.R., Taut, S. (eds) Validity of Educational Assessments in Chile and Latin America. Springer, Cham. https://doi.org/10.1007/978-3-030-78390-7_8


  • DOI: https://doi.org/10.1007/978-3-030-78390-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78389-1

  • Online ISBN: 978-3-030-78390-7

  • eBook Packages: Education, Education (R0)
