Psychometric Principles in Student Assessment

  • Robert J. Mislevy
  • Mark R. Wilson
  • Kadriye Ercikan
  • Naomi Chudowsky
Part of the Kluwer International Handbooks of Education book series (SIHE, volume 9)

Abstract

“Validity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (Messick, 1994, p. 2).

Keywords

Measurement Model • Differential Item Functioning • Item Response Theory • True Score • Item Response Theory Model

References

  1. Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
  2. Almond, R.G., & Mislevy, R.J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–237.
  3. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  4. Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modeling and intelligent tutoring. Artificial Intelligence, 42, 7–49.
  5. Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5). Retrieved from http://epaa.asu.edu/epaa/v9n5.html
  6. Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
  7. Brennan, R.L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.
  8. Brennan, R.L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317.
  9. Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
  10. Bryk, A.S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
  11. Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
  12. Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  13. Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
  14. Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
  15. Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
  16. Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
  17. Dayton, C.M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage.
  18. DiBello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
  19. Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
  20. Embretson, S.E. (1998). A cognitive design systems approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
  21. Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543–553.
  22. Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269–294.
  23. Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1–23.
  24. Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
  25. Gelman, A., Carlin, J., Stern, H., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman & Hall.
  26. Greeno, J.G., Collins, A.M., & Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner & R.C. Calfee (Eds.), Handbook of educational psychology (pp. 15–46). New York: Macmillan.
  27. Gulliksen, H. (1950/1987). Theory of mental tests. New York: Wiley / Hillsdale, NJ: Lawrence Erlbaum.
  28. Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement test items. Journal of Educational Measurement, 26, 301–321.
  29. Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.
  30. Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 147–200). Phoenix, AZ: American Council on Education/Oryx Press.
  31. Hambleton, R.K., & Slater, S.C. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10, 19–39.
  32. Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
  33. Holland, P.W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
  34. Jöreskog, K.G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
  35. Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
  36. Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
  37. Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.
  38. Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151–160.
  39. Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4), 21–27, 31.
  40. Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, & J.A. Clausen (Eds.), Measurement and prediction (pp. 362–412). Princeton, NJ: Princeton University Press.
  41. Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42–56.
  42. Linacre, J.M. (1989). Many-facet Rasch measurement. Doctoral dissertation, University of Chicago.
  43. Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
  44. Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
  45. Martin, J.D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141–165). Hillsdale, NJ: Erlbaum.
  46. Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education/Macmillan.
  47. Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
  48. Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era (NAEP Report 83-1). Princeton, NJ: National Assessment of Educational Progress.
  49. Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
  50. Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the roles of task model variables in assessment design. In S. Irvine & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.
  51. Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points for improving educational assessment. In B. Means & G. Haertel (Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.
  52. Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335–374.
  53. Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessment. Applied Measurement in Education.
  54. Myford, C.M., & Mislevy, R.J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
  55. National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning; J.D. Bransford, A.L. Brown, & R.R. Cocking (Eds.). Washington, DC: National Academy Press.
  56. National Research Council (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment; J. Pellegrino, N. Chudowsky, & R. Glaser (Eds.). Washington, DC: National Academy Press.
  57. O’Neil, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Erlbaum.
  58. Patz, R.J., & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
  59. Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: American Council on Education/Macmillan.
  60. Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105, 58–82.
  61. Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research / Chicago: University of Chicago Press (reprint).
  62. Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412.
  63. Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157–252.
  64. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34(4, Pt. 2).
  65. Samejima, F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203–219.
  66. Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
  67. Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
  68. SEPUP (1995). Issues, evidence, and you: Teacher’s guide. Berkeley, CA: Lawrence Hall of Science.
  69. Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
  70. Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
  71. Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271–295.
  72. Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
  73. Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.
  74. Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
  75. Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.
  76. Traub, R.E., & Rowley, G.L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517–545.
  77. van der Linden, W.J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195–202.
  78. van der Linden, W.J., & Hambleton, R.K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
  79. Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
  80. Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195–201.
  81. Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75–107). Hillsdale, NJ: Erlbaum.
  82. Willingham, W.W., & Cole, N.S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
  83. Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.
  84. Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In G. Grant (Ed.), Review of Research in Education, Vol. 17 (pp. 31–74). Washington, DC: American Educational Research Association.
  85. Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
  86. Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Robert J. Mislevy¹
  • Mark R. Wilson²
  • Kadriye Ercikan³
  • Naomi Chudowsky⁴
  1. University of Maryland, USA
  2. University of California, Berkeley, USA
  3. University of British Columbia, Canada
  4. National Research Council, USA
