Volume 59, Issue 4, pp 439–483

Evidence and inference in educational assessment

  • Robert J. Mislevy


Educational assessment concerns inference about students' knowledge, skills, and accomplishments. Because data are never so comprehensive and unequivocal as to ensure certitude, test theory evolved in part to address questions of weight, coverage, and import of data. The resulting concepts and techniques can be viewed as applications of more general principles for inference in the presence of uncertainty. Issues of evidence and inference in educational assessment are discussed from this perspective.

Key words

Bayesian inference networks, cognitive psychology, evidence, inference, performance assessment, probability, psychometrics, test theory





Copyright information

© The Psychometric Society 1994

Authors and Affiliations

  • Robert J. Mislevy (Educational Testing Service, Princeton)
