Scoring and Scaling Educational Tests

Statistical Models for Test Equating, Scaling, and Linking

Abstract

The numbers that are associated with examinee performance on educational or psychological tests are defined through the process of scaling. This process produces a score scale, and the scores that are reported to examinees are referred to as scale scores. Kolen (2006) used the term primary score scale, which is the focus of this chapter, for the scale that underlies the psychometric properties of a test.

A key component in the process of developing a score scale is the raw score for an examinee on a test, which is a function of the item scores for that examinee. Raw scores can be as simple as a sum of the item scores or be so complicated that they depend on the entire pattern of item responses.
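The simplest case described above, a raw score formed as a sum of item scores, can be sketched as follows. This is only an illustration: the function name `summed_raw_score`, the optional weights, and the response data are hypothetical and do not come from the chapter.

```python
from typing import Optional, Sequence


def summed_raw_score(item_scores: Sequence[float],
                     weights: Optional[Sequence[float]] = None) -> float:
    """Return the raw score as a (optionally weighted) sum of item scores."""
    if weights is None:
        # Unit weights give the ordinary summed score.
        weights = [1.0] * len(item_scores)
    if len(weights) != len(item_scores):
        raise ValueError("weights and item_scores must have the same length")
    return sum(w * u for w, u in zip(weights, item_scores))


# Hypothetical examinee: four dichotomous items (0/1) and one item scored 0-3.
responses = [1, 0, 1, 1, 2]
print(summed_raw_score(responses))                   # unit-weighted sum
print(summed_raw_score(responses, [1, 1, 1, 1, 2]))  # last item double-weighted
```

Raw scores that depend on the entire pattern of item responses need a psychometric model and are not covered by this sketch.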

Raw scores are transformed to scale scores to make the scores more meaningful to test users. For example, raw scores might be transformed to scale scores so that they have predefined distributional properties for a particular group of examinees, referred to as a norm group. Normative information might be incorporated by constructing scale scores to be approximately normally distributed with a mean of 50 and a standard deviation of 10 for a national population of examinees. In addition, procedures can be used for incorporating content and score precision information into score scales.
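One common way to obtain scale scores that are approximately normal with a mean of 50 and a standard deviation of 10 is to normalize the raw scores through their percentile ranks in the norm group. The sketch below assumes that approach; it is not necessarily the procedure used in the chapter, and the function name `normalized_scale_scores` and the norm-group data are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def normalized_scale_scores(raw_scores, mean=50.0, sd=10.0):
    """Build a raw-to-scale conversion table from a norm group.

    Steps: (1) midpoint percentile rank of each distinct raw score,
    (2) inverse standard-normal transform of that rank,
    (3) linear rescaling to the requested mean and standard deviation.
    """
    if not raw_scores:
        raise ValueError("the norm group must contain at least one score")
    n = len(raw_scores)
    counts = Counter(raw_scores)
    standard_normal = NormalDist()
    table = {}
    below = 0
    for x in sorted(counts):
        # Proportion below x plus half the proportion at x; always in (0, 1).
        p = (below + 0.5 * counts[x]) / n
        table[x] = mean + sd * standard_normal.inv_cdf(p)
        below += counts[x]
    return table


# Hypothetical norm group of ten raw scores.
norm_group = [3, 5, 5, 6, 7, 7, 7, 8, 9, 10]
for raw, scale in normalized_scale_scores(norm_group).items():
    print(raw, round(scale))
```

In practice the converted values are usually rounded to integers before reporting, so the reported scale scores only approximate the target distribution.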

References

  • ACT. (2001). EXPLORE technical manual. Iowa City, IA: Author.

  • Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.

  • Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: ETS. (Reprinted from Educational measurement, 2nd ed., pp. 508–600, by R. L. Thorndike, Ed., 1971, Washington, DC: American Council on Education)

  • Ban, J.-C., & Lee, W.-C. (2007). Defining a score scale in relation to measurement error for mixed format tests (CASMA Research Report Number 24). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.

  • Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34(3), 197–211.

  • Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport, CT: American Council on Education and Praeger.

  • Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22(1), 15–25.

  • Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.

  • Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and square root. Annals of Mathematical Statistics, 21(4), 607–611.

  • Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on Education and Praeger.

  • Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187–220). Westport, CT: American Council on Education and Praeger.

  • Iowa Tests of Educational Development. (1958). Manual for the school administrator (Rev. ed.). Iowa City: State University of Iowa.

  • Kolen, M. J. (1988). Defining score scales in relation to measurement error. Journal of Educational Measurement, 25(2), 97–110.

  • Kolen, M. J. (2006). Scaling and norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 155–186). Westport, CT: American Council on Education and Praeger.

  • Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York, NY: Springer-Verlag.

  • Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29(4), 285–307.

  • Kolen, M. J., Zeng, L., & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33(2), 129–140.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

  • Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8(4), 453–461.

  • McCall, W. A. (1939). Measurement: A revision of how to measure in education. New York, NY: Macmillan.

  • Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17(4), 351–363.

  • Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York, NY: Macmillan.

  • Pommerich, M., Nicewander, W. A., & Hanson, B. A. (1999). Estimating average domain scores. Journal of Educational Measurement, 36(3), 199–216.

  • Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184.

  • Rosa, K., Swygert, K. A., Nelson, L., & Thissen, D. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Scale scores for patterns of summed scores. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 253–292). Mahwah, NJ: Erlbaum.

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 1).

  • Thissen, D., Nelson, L., & Swygert, K. A. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Approximation methods for scale scores. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 293–341). Mahwah, NJ: Erlbaum.

  • Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ: Erlbaum.

  • Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39–49.

  • Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Erlbaum.

  • Tong, Y., & Kolen, M. J. (2005). Assessing equating results on different equating criteria. Applied Psychological Measurement, 29(6), 418–432.

  • Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227–253.

  • van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York, NY: Springer-Verlag.

  • Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103–118.

  • Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37(2), 141–162.

  • Yen, W., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). Westport, CT: American Council on Education and Praeger.

  • Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15–25.

Corresponding author

Correspondence to Michael J. Kolen.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Kolen, M.J., Tong, Y., Brennan, R.L. (2009). Scoring and Scaling Educational Tests. In: von Davier, A. (eds) Statistical Models for Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-0-387-98138-3_3
