Using Evidence Centered Design to Think About Assessments

  • Russell G. Almond


Evidence-centered assessment design (ECD) provides a simple principle as the basis of assessment design: assessment tasks should be designed to provide evidence for the claims that the assessment designers wish to make about the examinees. This chapter looks at the Bayesian model of evidence that underlies much of the ECD philosophy. It then explores how the ECD principle can help assessment designers think about three important issues in the future of assessment: (1) How can we organize evidence about student performance gathered from diverse sources across multiple time points? (2) How should we balance information gathered about multiple aspects of proficiency? (3) How should we collect evidence from complex tasks? The chapter illustrates these ideas with examples of advanced assessments that have used ECD.


Keywords: Evidence-centered assessment design · Decision analysis · Constructed response · Diagnostic assessment



Evidence-centered assessment design was originally a three-way collaboration among myself, Bob Mislevy, and Linda Steinberg. Although this work represents my perspective on ECD, my perspective has become so thoroughly mixed with ideas that originated with Bob or Linda that individual attribution is no longer possible. Similarly, my perspective has been expanded by discussions with too many colleagues to mention. Malcolm Bauer, Dan Eignor, Yoon-Jeon Kim, and Thomas Quinlan made numerous suggestions that helped improve the clarity of this chapter.

Any opinions expressed in this chapter are those of the author and not necessarily of Educational Testing Service.



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Research and Development, Educational Testing Service, Princeton, NJ, USA
