
Perspectives on Methodological Issues

Abstract

In this chapter the authors survey the methodological perspectives seen as important for assessing twenty-first century skills. Some of those issues are specific to twenty-first century skills, but the majority apply more generally to the assessment of other psychological and educational variables. The narrative of the paper initially follows the logic of assessment development: defining the constructs to be assessed, designing tasks that can be used to generate informative student responses, coding/valuing those responses, delivering the tasks and gathering the responses, and modeling the responses in accordance with the constructs. The paper continues with a survey of the strands of validity evidence that need to be established, and a discussion of specific issues that are prominent in this context, such as the need to resolve issues of generality versus contextual specificity, the relationship of classroom to large-scale assessments, and the possible roles for technological advances in assessing these skills. There is also a brief segment discussing issues that arise with respect to specific types of variables involved in the assessment of twenty-first century skills. The chapter concludes with a listing of particular challenges that are regarded as prominent at the time of writing. An annex describes specific approaches to assessment design that are useful in the development of new assessments.

Notes

  1. We will not specify a comprehensive list of the 21st century skills here. That is provided in Chap. 2.

  2. Although the emphasis in a cognitive perspective is often taken to be synonymous with information-processing views of cognition, this is by no means necessary. Alternative theoretical frameworks, such as sociocultural perspectives (Valsiner and Veer 2000) or embodied cognition approaches (Clark 1999), can be used to develop educational assessments.

  3. Note that the term "developmental" is not intended to imply that there is a biological inevitability to the process of development but that there are specific paths (not necessarily unique) that are seen as leading to more sophisticated learning.

  4. The following section was adapted from the NRC 2006 report Systems for state science assessment, edited by Wilson & Bertenthal.

  5. Note that this is only a partial list of what is in the original.

  6. ftp://ftp.cordis.europa.eu/pub/ist/docs/ka3/eat/FREETEXT.pdf

  7. The following section has been adapted from Wilson 2009.

References

  • Abidi, S. S. R., Chong, Y., & Abidi, S. R. (2001). Patient empowerment via ‘pushed’ delivery of customized healthcare educational content over the Internet. Paper presented at the 10th World Congress on Medical Informatics, London.


  • Ackerman, T., Zhang, W., Henson, R., & Templin, J. (2006, April). Evaluating a third grade science benchmark test using a skills assessment model: Q-matrix evaluation. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), San Francisco


  • Adams, W. K., Reid, S., LeMaster, R., McKagan, S., Perkins, K., & Dubson, M. (2008). A study of educational simulations part 1—engagement and learning. Journal of Interactive Learning Research, 19(3), 397–419.


  • Aleinikov, A. G., Kackmeister, S., & Koenig, R. (Eds.). (2000). 101 Definitions: Creativity. Midland: Alden B Dow Creativity Center Press.


  • Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2002). Enhancing the design and delivery of assessment systems: A four-process architecture. Journal of Technology, Learning, and Assessment in Education, 1(5). Available from http://www.jtla.org

  • Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2003). A four-process architecture for assessment delivery, with connections to assessment design (Vol. 616). Los Angeles: University of California Los Angeles Center for Research on Evaluations, Standards and Student Testing (CRESST).


  • American Association for the Advancement of Science (AAAS). (1993). Benchmarks for science literacy. New York: Oxford University Press.


  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (AERA, APA, NCME, 1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.


  • Autor, D. H., Levy, F., & Murnane, R. J. (2003). The skill content of recent technological change: An empirical exploration. Quarterly Journal of Economics, 118(4), 1279–1333.


  • Ball, S. J. (1985). Participant observation with pupils. In R. Burgess (Ed.), Strategies of educational research: Qualitative methods (pp. 23–53). Lewes: Falmer.


  • Behrens, J. T., Frezzo, D. C., Mislevy, R. J., Kroopnick, M., & Wise, D. (2007). Structural, functional, and semiotic symmetries in simulation-based games and assessments. In E. Baker, J. Dickieson, W. Wulfeck, & H. F. O’Neil (Eds.), Assessment of problem solving using simulations (pp. 59–80). New York: Erlbaum.


  • Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2003). A feasibility study of on-the-fly item generation in adaptive testing. Journal of Technology, Learning, and Assessment, 2(3). Available from http://www.jtla.org

  • Bejar, I. I., Braun, H., & Tannenbaum, R. (2007). A prospective, predictive and progressive approach to standard setting. In R. W. Lissitz (Ed.), Assessing and modeling cognitive development in school: Intellectual growth and standard setting (pp. 1–30). Maple Grove: JAM Press.


  • Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–16.


  • Bennett, R. E., & Gitomer, D. H. (2009). Transforming K-12 assessment: Integrating accountability testing, formative assessment and professional support. In C. Wyatt-Smith & J. Cumming (Eds.), Educational assessment in the 21st century (pp. 43–61). New York: Springer.


  • Bennett, R. E., Goodman, M., Hessinger, J., Kahn, H., Ligget, J., & Marshall, G. (1999). Using multimedia in large-scale computer-based testing programs. Computers in Human Behaviour, 15(3–4), 283–294.


  • Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning: The SOLO taxonomy. New York: Academic.


  • Binkley, M., Erstad, O., Herman, J., Raizen, S., Ripley, M., & Rumble, M. (2009). Developing 21st century skills and assessments. White Paper from the Assessment and Learning of 21st Century Skills Project.


  • Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003). Assessment for learning. London: Open University Press.


  • Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: The classification of educational goals: Handbook I, cognitive domain. New York/Toronto: Longmans, Green.


  • Bourque, M. L. (2009). A history of NAEP achievement levels: Issues, implementation, and impact 1989–2009 (No. Paper Commissioned for the 20th Anniversary of the National Assessment Governing Board 1988–2008). Washington, DC: NAGB. Downloaded from http://www.nagb.org/who-we-are/20-anniversary/bourque-achievement-levels-formatted.pdf

  • Braun, H. I., & Qian, J. (2007). An enhanced method for mapping state standards onto the NAEP scale. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 313–338). New York: Springer.


  • Braun, H., Bejar, I. I., & Williamson, D. M. (2006). Rule-based methods for automated scoring: Applications in a licensing context. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 83–122). Mahwah: Lawrence Erlbaum.


  • Brown, A. L., & Reeve, R. A. (1987). Bandwidths of competence: The role of supportive contexts in learning and development. In L. S. Liben (Ed.), Development and learning: Conflict or congruence? (pp. 173–223). Hillsdale: Erlbaum.


  • Brown, N. J. S., Furtak, E. M., Timms, M., Nagashima, S. O., & Wilson, M. (2010a). The ­evidence-based reasoning framework: Assessing scientific reasoning. Educational Assessment, 15(3–4), 123–141.


  • Brown, N. J. S., Nagashima, S. O., Fu, A., Timms, M., & Wilson, M. (2010b). A framework for analyzing scientific reasoning in assessments. Educational Assessment, 15(3–4), 142–174.


  • Brown, N., Wilson, M., Nagashima, S., Timms, M., Schneider, A., & Herman, J. (2008, March 24–28). A model of scientific reasoning. Paper presented at the Annual Meeting of the American Educational Research Association, New York.


  • Brusilovsky, P., Sosnovsky, S., & Yudelson, M. (2006). Addictive links: The motivational value of adaptive link annotation in educational hypermedia. In V. Wade, H. Ashman, & B. Smyth (Eds.), Adaptive hypermedia and adaptive Web-based systems, 4th International Conference, AH 2006. Dublin: Springer.


  • Carnevale, A. P., Gainer, L. J., & Meltzer, A. S. (1990). Workplace basics: The essential skills employers want. San Francisco: Jossey-Bass.


  • Carpenter, T. P., & Lehrer, R. (1999). Teaching and learning mathematics with understanding. In E. Fennema & T. R. Romberg (Eds.), Mathematics classrooms that promote understanding (pp. 19–32). Mahwah: Lawrence Erlbaum Associates.


  • Case, R., & Griffin, S. (1990). Child cognitive development: The role of central conceptual structures in the development of scientific and social thought. In E. A. Hauert (Ed.), Developmental psychology: Cognitive, perceptuo-motor, and neurological perspectives (pp. 193–230). North-Holland: Elsevier.


  • Catley, K., Lehrer, R., & Reiser, B. (2005). Tracing a prospective learning progression for developing understanding of evolution. Paper Commissioned by the National Academies Committee on Test Design for K-12 Science Achievement. http://www7.nationalacademies.org/bota/Evolution.pdf

  • Center for Continuous Instructional Improvement (CCII). (2009). Report of the CCII Panel on learning progressions in science (CPRE Research Report). New York: Columbia University.


  • Center for Creative Learning. (2007). Assessing creativity index. Retrieved August 27, 2009, from http://www.creativelearning.com/Assess/index.htm

  • Chedrawy, Z., & Abidi, S. S. R. (2006). An adaptive personalized recommendation strategy featuring context sensitive content adaptation. Paper presented at the Adaptive Hypermedia and Adaptive Web-Based Systems, 4th International Conference, AH 2006, Dublin, Ireland.


  • Chen, Z.-L., & Raghavan, S. (2008). Tutorials in operations research: State-of-the-art decision-making tools in the information-intensive age, personalization and recommender systems. Paper presented at the INFORMS Annual Meeting. Retrieved from http://books.google.com/books?hl=en&lr=&id=4c6b1_emsyMC&oi=fnd&pg=PA55&dq=personalisation+online+entertainment+netflix&ots=haYV26Glyf&sig=kqjo5t1C1lNLlP3QG-R0iGQCG3o#v=onepage&q=&f=false

  • Claesgens, J., Scalise, K., Wilson, M., & Stacy, A. (2009). Mapping student understanding in chemistry: The perspectives of chemists. Science Education, 93(1), 56–85.


  • Clark, A. (1999). An embodied cognitive science? Trends in Cognitive Sciences, 3(9), 345–351.


  • Conlan, O., O’Keeffe, I., & Tallon, S. (2006). Combining adaptive hypermedia techniques and ontology reasoning to produce Dynamic Personalized News Services. Paper presented at the Adaptive Hypermedia and Adaptive Web-based Systems, Dublin, Ireland.


  • Crick, R. D. (2005). Being a Learner: A Virtue for the 21st Century. British Journal of Educational Studies, 53(3), 359–374.


  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.


  • Dagger, D., Wade, V., & Conlan, O. (2005). Personalisation for all: Making adaptive course composition easy. Educational Technology & Society, 8(3), 9–25.


  • Dahlgren, L. O. (1984). Outcomes of learning. In F. Martin, D. Hounsell, & N. Entwistle (Eds.), The experience of learning. Edinburgh: Scottish Academic Press.


  • DocenteMas. (2009). The Chilean teacher evaluation system. Retrieved from http://www.docentemas.cl/

  • Drasgow, F., Luecht, R., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport: Praeger Publishers.


  • Duncan, R. G., & Hmelo-Silver, C. E. (2009). Learning progressions: Aligning curriculum, instruction, and assessment. Journal of Research in Science Teaching, 46(6), 606–609.


  • Frazier, E., Greiner, S., & Wethington, D. (Producer). (2004, August 14, 2009) The use of biometrics in education technology assessment. Retrieved from http://www.bsu.edu/web/elfrazier/TechnologyAssessment.htm

  • Frezzo, D. C., Behrens, J. T., & Mislevy, R. J. (2010). Design patterns for learning and assessment: Facilitating the introduction of a complex simulation-based learning environment into a community of instructors. Journal of Science Education and Technology, 19(2), 105–114.


  • Frezzo, D. C., Behrens, J. T., Mislevy, R. J., West, P., & DiCerbo, K. E. (2009, April). Psychometric and evidentiary approaches to simulation assessment in Packet Tracer software. Paper presented at the Fifth International Conference on Networking and Services (ICNS), Valencia, Spain.


  • Gao, X., Shavelson, R. J., & Baxter, G. P. (1994). Generalizability of large-scale performance assessments in science: Promises and problems. Applied Measurement in Education, 7(4), 323–342.


  • Gellersen, H.-W. (1999). Handheld and ubiquitous computing: First International Symposium. Paper presented at the HUC ‘99, Karlsruhe, Germany.


  • Gifford, B. R. (2001). Transformational instructional materials, settings and economics. In The Case for the Distributed Learning Workshop, Minneapolis.


  • Giles, J. (2005). Wisdom of the crowd. Decision makers, wrestling with thorny choices, are tapping into the collective foresight of ordinary people. Nature, 438, 281.


  • Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. The American Psychologist, 18, 519–521.


  • Graesser, A. C., Jackson, G. T., & McDaniel, B. (2007). AutoTutor holds conversations with learners that are responsive to their cognitive and emotional state. Educational Technology, 47, 19–22.


  • Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–438.


  • Haladyna, T. M. (1994). Cognitive taxonomies. In T. M. Haladyna (Ed.), Developing and validating multiple-choice test items (pp. 104–110). Hillsdale: Lawrence Erlbaum Associates.


  • Hartley, D. (2009). Personalisation: The nostalgic revival of child-centred education? Journal of Education Policy, 24(4), 423–434.


  • Hattie, J. (2009, April 16). Visibly learning from reports: The validity of score reports. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), San Diego, CA.


  • Hawkins, D. T. (2007, November). Trends, tactics, and truth in the information industry: The fall 2007 ASIDIC meeting. InformationToday, p. 34.


  • Hayes, J. R. (1985). Three problems in teaching general skills. In S. F. Chipman, J. W. Segal, & R. Glaser (Eds.), Thinking and learning skills: Research and open questions (Vol. 2, pp. 391–406). Hillsdale: Erlbaum.


  • Henson, R., & Templin, J. (2008, March). Implementation of standards setting for a geometry end-of-course exam. Paper presented at the 2008 American Educational Research Association conference in New York, New York.


  • Hernández, J. A., Ochoa Ortiz, A., Andaverde, J., & Burlak, G. (2008). Biometrics in online assessments: A study case in high school student. Paper presented at the 8th International Conference on Electronics, Communications and Computers (conielecomp 2008), Puebla.


  • Hirsch, E. D. (2006, 26 April). Reading-comprehension skills? What are they really? Education Week, 25(33), 57, 42.


  • Hopkins, D. (2004). Assessment for personalised learning: The quiet revolution. Paper presented at the Perspectives on Pupil Assessment, New Relationships: Teaching, Learning and Accountability, General Teaching Council Conference, London.


  • Howe, J. (2008, Winter). The wisdom of the crowd resides in how the crowd is used. Nieman Reports, New Venues, 62(4), 47–50.


  • International Organization for Standardization. (2009). International standards for business, government and society, JTC 1/SC 37—Biometrics. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_tc_browse.htm?commid=313770&development=on

  • Kanter, R. M. (1994). Collaborative advantage: The Art of alliances. Harvard Business Review, 72(4), 96–108.


  • Kelleher, K. (2006). Personalize it. Wired Magazine, 14(7), 1.


  • Kyllonen, P. C., Walters, A. M., & Kaufman, J. C. (2005). Noncognitive constructs and their assessment in graduate education: A review. Educational Assessment, 10(3), 143–184.


  • Lawton, D. L. (1970). Social class, language and education. London: Routledge and Kegan Paul.


  • Lesgold, A. (2009). Better schools for the 21st century: What is needed and what will it take to get improvement. Pittsburgh: University of Pittsburgh.


  • Levy, F., & Murnane, R. (2006, May 31). How computerized work and globalization shape human skill demands. Retrieved August 23, 2009, from http://web.mit.edu/flevy/www/computers_offshoring_and_skills.pdf

  • Linn, R. L., & Baker, E. L. (1996). Can performance-based student assessments be psychometrically sound? In J. B. Baron & D. P. Wolf (Eds.), Performance-based student assessment: Challenges and possibilities (pp. 84–103). Chicago: University of Chicago Press.


  • Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.


  • Lord, F. M. (1971). Tailored testing, an application of stochastic approximation. Journal of the American Statistical Association, 66, 707–711.


  • Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical performance assessment. In D. M. Williamson, I. J. Bejar, & R. J. Mislevy (Eds.), Automated scoring of complex tasks in computer based testing. Mahwah: Lawrence Erlbaum Associates.


  • Martinez, M. (2002). What is personalized learning? Are we there yet? E-Learning Developer’s Journal. E-Learning Guild (www.elarningguild.com). http://www.elearningguild.com/pdf/2/050702dss-h.pdf

  • Marton, F. (1981). Phenomenography—Describing conceptions of the world around us. Instructional Science, 10, 177–200.


  • Marton, F. (1983). Beyond individual differences. Educational Psychology, 3, 289–303.


  • Marton, F. (1986). Phenomenography—A research approach to investigating different understandings of reality. Journal of Thought, 21, 29–49.


  • Marton, F. (1988). Phenomenography—Exploring different conceptions of reality. In D. Fetterman (Ed.), Qualitative approaches to evaluation in education (pp. 176–205). New York: Praeger.


  • Marton, F., Hounsell, D., & Entwistle, N. (Eds.). (1984). The experience of learning. Edinburgh: Scottish Academic Press.


  • Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.


  • Masters, G.N. & Wilson, M. (1997). Developmental assessment. Berkeley, CA: BEAR Research Report, University of California.


  • Mayer, R. E. (1983). Thinking, problem-solving and cognition. New York: W H Freeman.


  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education/Macmillan.


  • Messick, S. (1995). Validity of psychological assessment. Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. The American Psychologist, 50(9), 741–749.


  • Microsoft. (2009). Microsoft Certification Program. Retrieved from http://www.microsoft.com/learning/

  • Miliband, D. (2003). Opportunity for all, targeting disadvantage through personalised learning. New Economy, 1070–3535/03/040224(5), 224–229.


  • Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003a). A brief introduction to evidence centred design (Vol. RR-03–16). Princeton: Educational Testing Service.


  • Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003b). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.


  • Mislevy, R. J., Bejar, I. I., Bennett, R. E., Haertel, G. D., & Winters, F. I. (2008). Technology supports for assessment design. In B. McGaw, E. Baker, & P. Peterson (Eds.), International encyclopedia of education (3rd ed.). Oxford: Elsevier.


  • Mitchell, W. J. (1990). The logic of architecture. Cambridge: MIT Press.


  • National Research Council, Bransford, J. D., Brown, A. L., & Cocking, R. R. (2000). How people learn: Brain, mind, experience, and school: Expanded edition. Washington, DC: National Academy Press.


  • National Research Council, Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.


  • National Research Council, Wilson, M., & Bertenthal, M. (Eds.). (2006). Systems for state science assessment. Committee on Test Design for K-12 Science Achievement. Washington, DC: National Academy Press.


  • National Research Council, Duschl, R. A., Schweingruber, H. A., & Shouse, A. W. (Eds.). (2007). Taking science to school: Learning and teaching science in Grades K-8. Committee on Science Learning, Kindergarten through Eighth Grade. Washington, DC: National Academy Press.


  • Newell, A., Simon, H. A., & Shaw, J. C. (1958). Elements of a theory of human problem solving. Psychological Review, 65, 151–166.


  • Oberlander, J. (2006). Adapting NLP to adaptive hypermedia. Paper presented at the Adaptive Hypermedia and Adaptive Web-Based Systems, 4th International Conference, AH 2006, Dublin, Ireland.


  • OECD. (2005). PISA 2003 Technical Report. Paris: Organisation for Economic Co-operation and Development.


  • Palm, T. (2008). Performance assessment and authentic assessment: A conceptual analysis of the literature. Practical Assessment, Research & Evaluation, 13(4), 4.


  • Parshall, C. G., Stewart, R., & Ritter, J. (1996, April). Innovations: Sound, graphics, and alternative response modes. Paper presented at the National Council on Measurement in Education, New York.


  • Parshall, C. G., Davey, T., & Pashley, P. J. (2000). Innovative item types for computerized testing. In W. Van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129–148). Norwell: Kluwer Academic Publisher.


  • Parshall, C. G., Spray, J., Kalohn, J., & Davey, T. (2002). Issues in innovative item types. In Practical considerations in computer-based testing (pp. 70–91). New York: Springer.


  • Patton, M. Q. (1980). Qualitative evaluation methods. Beverly Hills: Sage.


  • Pellegrino, J., Jones, L., & Mitchell, K. (Eds.). (1999). Grading the Nation’s report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.


  • Perkins, D. (1998). What is understanding? In M. S. Wiske (Ed.), Teaching for understanding: Linking research with practice. San Francisco: Jossey-Bass Publishers.


  • Pirolli, P. (2007). Information foraging theory: Adaptive interaction with information. Oxford: Oxford University Press.


  • Popham, W. J. (1997). Consequential validity: Right concern—Wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.


  • Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58–93.


  • Reiser, R. A. (2002). A history of instructional design and technology. In R. A. Reiser & J. V. Dempsey (Eds.), Trends and issues in instructional design and technology. Upper Saddle River: Merrill/Prentice Hall.


  • Reiser, B., Krajcik, J., Moje, E., & Marx, R. (2003, March). Design strategies for developing science instructional materials. Paper presented at the National Association for Research in Science Teaching, Philadelphia, PA.


  • Robinson, K. (2009). Out of our minds: Learning to be creative. Chichester: Capstone.


  • Rosenbaum, P. R. (1988). Item Bundles. Psychometrika, 53, 349–359.


  • Rupp, A. A., & Templin, J. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspectives, 6, 219–262.


  • Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.


  • Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119–144.


  • Scalise, K. (2004). A new approach to computer adaptive assessment with IRT construct-modeled item bundles (testlets): An application of the BEAR assessment system. Paper presented at the 2004 International Meeting of the Psychometric Society, Pacific Grove.


  • Scalise, K. (submitted). Personalised learning taxonomy: Characteristics in three dimensions for ICT. British Journal of Educational Technology.


  • Scalise, K., & Gifford, B. (2006). Computer-based assessment in E-Learning: A framework for constructing “Intermediate Constraint” questions and tasks for technology platforms. Journal of Technology, Learning, and Assessment, 4(6) [online journal]. http://escholarship.bc.edu/jtla/vol4/6.

  • Scalise, K., & Wilson, M. (2006). Analysis and comparison of automated scoring approaches: Addressing evidence-based assessment principles. In D. M. Williamson, I. J. Bejar, & R. J. Mislevy (Eds.), Automated scoring of complex tasks in computer based testing. Mahwah: Lawrence Erlbaum Associates.


  • Scalise, K., & Wilson, M. (2007). Bundle models for computer adaptive testing in e-learning assessment. Paper presented at the 2007 GMAC Conference on Computerized Adaptive Testing (Graduate Management Admission Council), Minneapolis, MN.


  • Schum, D. A. (1987). Evidence and inference for the intelligence analyst. Lanham: University Press of America.


  • Searle, J. (1969). Speech acts. Cambridge: Cambridge University Press.


  • Shute, V., Ventura, M., Bauer, M., & Zapata-Rivera, D. (2009). Melding the power of serious games and embedded assessment to monitor and foster learning, flow and grow melding the power of serious games. New York: Routledge.


  • Shute, V., Maskduki, I., Donmez, O., Dennen, V. P., Kim, Y. J., & Jeong, A. C. (2010). Modeling, assessing, and supporting key competencies within game environments. In D. Ifenthaler, P. Pirnay-Dummer, & N. M. Seel (Eds.), Computer-based diagnostics and systematic analysis of knowledge. New York: Springer.


  • Simon, H. A. (1980). Problem solving and education. In D. T. Tuma, & R. Reif, (Eds.), Problem solving and education: Issues in teaching and research (pp. 81–96). Hillsdale: Erlbaum.


  • Smith, C., Wiser, M., Anderson, C. W., Krajcik, J. & Coppola, B. (2004). Implications of research on children’s learning for assessment: matter and atomic molecular theory. Paper Commissioned by the National Academies Committee on Test Design for K-12 Science Achievement. Washington DC.


  • Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). Implications of research on Children’s learning for standards and assessment: A proposed learning progression for matter and the atomic molecular theory. Measurement: Interdisciplinary Research and Perspectives, 4(1 & 2).


  • Stiggins, R. J. (2002). Assessment crisis: The absence of assessment for learning. Phi Delta Kappan, 83(10), 758–765.


  • Templin, J., & Henson, R. A. (2008, March). Understanding the impact of skill acquisition: relating diagnostic assessments to measureable outcomes. Paper presented at the 2008 American Educational Research Association conference in New York, New York.


  • Treffinger, D. J. (1996). Creativity, creative thinking, and critical thinking: In search of definitions. Sarasota: Center for Creative Learning.


  • Valsiner, J., & Veer, R. V. D. (2000). The social mind. Cambridge: Cambridge University Press.


  • Van der Linden, W. J., & Glas, C. A. W. (2007). Statistical aspects of adaptive testing. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 801–838). New York: Elsevier.


  • Wainer, H., & Dorans, N. J. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah: Lawrence Erlbaum Associates.


  • Wainer, H., Brown, L., Bradlow, E., Wang, X., Skorupski, W. P., & Boulet, J. (2006). An application of testlet response theory in the scoring of a complex certification exam. In D. M. Williamson, I. J. Bejar, & R. J. Mislevy (Eds.), Automated scoring of complex tasks in computer based testing. Mahwah: Lawrence Erlbaum Associates.


  • Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126–149.


  • Weekley, J. A., & Ployhart, R. E. (2006). Situational judgment tests: Theory, measurement, and application. Mahwah: Lawrence Erlbaum Associates.


  • Weiss, D. J. (Ed.). (2007). Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. Available at http://www.psych.umn.edu/psylabs/catcentral/

  • Wiley, D. (2008). Lying about personalized learning, iterating toward openness. Retrieved from http://opencontent.org/blog/archives/655

  • Wiliam, D., & Thompson, M. (2007). Integrating assessment with instruction: What will it take to make it work? In C. A. Dwyer (Ed.), The future of assessment: Shaping teaching and learning (pp. 53–82). Mahwah: Lawrence Erlbaum Associates.


  • Williamson, D. M., Mislevy, R. J., & Bejar, I. I. (2006). Automated scoring of complex tasks in computer-based testing. Mahwah: Lawrence Erlbaum Associates.


  • Wilson, M. (Ed.). (2004). Towards coherence between classroom assessment and accountability. Chicago: Chicago University Press.


  • Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah: Lawrence Erlbaum Associates.


  • Wilson, M., & Adams, R. J. (1995). Rasch models for item bundles. Psychometrika, 60(2), 181–198.


  • Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.


  • Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716–730.


  • Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43, 19–38.


  • Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163–183.


  • Wolf, D. P., & Reardon, S. F. (1996). Access to excellence through new forms of student assessment. In D. P. Wolf, & J. B. Baron (Eds.), Performance-based student assessment: Challenges and possibilities. Ninety-fifth yearbook of the national society for the study of education, part I. Chicago: University of Chicago Press.


  • Zechner, K., Higgins, D., Xi, X., & Williamson, D. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895.



Acknowledgement

We thank the members of the Working Group who have contributed ideas and made suggestions in support of the writing of this paper, in particular, Chris Dede and his group at Harvard, John Hattie, Detlev Leutner, André Rupp, and Hans Wagemaker.

Annex: Assessment Design Approaches

Evidence-Centered Design

Design, in general, is a prospective activity; it is an evolving plan for creating an object with desired functionality or esthetic value. It is prospective because it takes place prior to the creation of the object. That is, a design and the resulting object are two different things (Mitchell 1990, pp. 37–38):

when we describe the forms of buildings we refer to extant constructions of physical materials in physical space, but when we describe designs we make claims about something else—constructions of imagination. More precisely, we refer to some sort of model—a drawing, physical scale model, structure of information in computer memory, or even a mental model—rather than to a real building.

The idea of design is, of course, equally applicable to assessments, and Mitchell’s distinction just noted is equally applicable. The design of an assessment and the resulting assessment-as-implemented are different entities. Under the best of circumstances, the design is sound and the resulting assessment satisfies the design, as evidenced empirically through the administration of the assessment. Under less ideal circumstances, the design may not be sound—in which case only by a miracle will the resulting assessment be sound or useful—or the implementation of the assessment is less than ideal. In short, merely using a design process in no way guarantees that the resulting assessment will be satisfactory, but it would be foolish to implement an assessment without a thorough design effort as a preamble.

An approach to assessment design that is gaining momentum is evidence-centered design (ECD; Mislevy et al. 2003b). The approach is based on the idea that the design of an assessment can be facilitated or optimized by taking into consideration the argument we wish to make in support of the proposed score interpretation or inference from the assessment. In its barest form, a proposed score interpretation takes the following form: given that the student has obtained score X, it follows that the student knows and can do Y.

There is no reason for anyone to accept such an assertion at face value. It would be sensible to expect an elaboration of the reasons, an argument, before we accept the conclusion or, if necessary, challenge it. A Toulminian argument, whereby the reasons for the above interpretation are explicated and potential counterarguments are addressed, is at the heart of ECD. ECD focuses primarily on that argument, up to the test score level, by explicating what the intended conclusions or inferences based on scores will be and, given those inferences as the goal of the assessment, determining the observations of student performance that would lead us to those conclusions. Such an approach is in line with current thinking about validation, where a distinction is made between (1) a validity argument, the supporting reasoning for a particular score interpretation, and (2) the appraisal of that argument. ECD turns the validity argument on its head to find out what needs to be the case, what must be true of the assessment—what should the design of the assessment be—so that the score interpretations that we would like to reach in the end will have a better chance of being supported.

For example, suppose we are developing an assessment to characterize students’ command of information technology. If we wish to reach conclusions about this, we need to carefully define what we mean by “command of information technology,” including what behavior on the part of students would convince us that they have acquired it. With that definition in hand, we can then proceed to devise a series of tasks that will elicit student behavior or performance indicative of different levels of command of information technology, as we have defined it. Then, as the assessment is implemented, trials need to be conducted to verify that the items produced according to the design do indeed elicit the evidence needed to support that interpretation.

Approaching assessment development this way means that we have well-defined expectations of what the data from the assessment will look like: for example, how difficult the items will be, how strongly they will intercorrelate, and how the scores will relate to other test scores and background variables. Those expectations are informed by the knowledge about student learning and the developmental considerations that were the basis of the design of the assessment; if they are not met, there will be work to be done to find out where the design is lacking or whether the theoretical information used in the design was inadequate.
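One such expectation can be examined with a simple check: items designed to be harder should show lower proportions correct in a field trial. The sketch below is a hypothetical illustration of that check, not a procedure from the chapter; the item difficulties are invented and the "trial data" are simulated under a Rasch model rather than drawn from real examinees.

```python
import numpy as np

rng = np.random.default_rng(0)
designed_difficulty = np.array([-1.0, 0.0, 1.0])   # design intent: easy, medium, hard (logits)
abilities = rng.normal(0.0, 1.0, size=500)         # simulated field-trial examinees

# Simulate Rasch-model responses and compare observed difficulty with the design intent.
p_correct = 1.0 / (1.0 + np.exp(-(abilities[:, None] - designed_difficulty[None, :])))
responses = rng.uniform(size=p_correct.shape) < p_correct

observed = responses.mean(axis=0)
print("observed proportions correct:", np.round(observed, 2))
assert observed[0] > observed[1] > observed[2], "observed ordering contradicts the design"
```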

The process of reconciling design expectations with empirical reality parallels the scientific method’s emphasis on hypothesis testing aided by suitable experimental designs. It should be pointed out, however, that an argument based solely on positive confirmatory evidence is not sufficiently compelling. Ruling out alternative interpretations of positive confirmatory evidence would add considerable weight to an argument, as would a failed attempt to challenge the argument. Such challenges can take a variety of forms in an assessment context. For example, Loevinger (1957) argued that items that explicitly aim to measure a different construct should be included, at least experimentally, to ensure that performance in those items is not explained equally well by the postulated construct.

ECD is highly prospective about the process for implementing the assessment so that the desired score interpretations can be supported in the end. Essentially, ECD prescribes an order of design events. First, the purpose of the assessment needs to be explicated to make it clear what sort of inferences need to be drawn from performance on the test. Once those target inferences are enumerated, the second step is to identify the types of evidence needed to support them. Finally, the third step is to conceive of means of eliciting the evidence needed to support the target inferences. These three steps are associated with corresponding models: a student model, an evidence model, and a series of task models. Note that, according to ECD, the task models, from which items would be produced, are the last to be formulated. This is an important design principle, especially since, when undertaking the development of an assessment, there is a strong temptation to “start writing items” before we have a good grasp of what the goals of the assessment are. Writing items without first having identified the target inferences, and the evidence required to support them, risks producing many items that are not optimal, or even failing to produce the items that are needed to support score interpretation (see, e.g., Pellegrino et al. 1999, Chap. 5). For example, producing overly hard or easy items may be suboptimal if decisions or inferences are desired for students spanning a broad range of proficiency. Under the best of circumstances, starting to write items before we have a firm conception of the goals of the assessment leads to many wasted items that, in the end, do not fit well into the assessment. Under the worst of circumstances, producing items in this manner can permanently hobble the effectiveness of an assessment, because we have to make do with the items that are available.
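To make the prescribed ordering concrete, here is a minimal sketch of the three models as simple data structures; it is not drawn from the chapter or from any ECD software, and all class names, fields, and the information-technology example content are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StudentModel:          # step 1: the construct and the inferences we want to draw
    construct: str
    target_inferences: List[str]

@dataclass
class EvidenceModel:         # step 2: observations and scoring rules that would support them
    observable_behaviors: List[str]
    scoring_rules: List[str]

@dataclass
class TaskModel:             # step 3: situations designed to elicit that evidence
    description: str
    stimulus_features: List[str]
    expected_evidence: List[str]

ict_student = StudentModel(
    construct="Command of information technology",
    target_inferences=["Can locate, evaluate, and synthesize online information"],
)
ict_evidence = EvidenceModel(
    observable_behaviors=["Chooses credible sources", "Cross-checks conflicting claims"],
    scoring_rules=["Credit for each independent, credible source used appropriately"],
)
ict_task = TaskModel(
    description="Open web-research task on a contested science topic",
    stimulus_features=["Seeded mix of credible and non-credible pages"],
    expected_evidence=ict_evidence.observable_behaviors,
)
print(ict_student, ict_evidence, ict_task, sep="\n")
```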

The importance of a design perspective has grown as a result of the shift to so-called standards-based reporting. Standards-based reporting evolved from earlier efforts at criterion-referenced testing (Glaser 1963), which were intended to attach specific interpretations to test scores, especially scores that would define different levels of achievement. Since the early 1990s, the National Assessment of Educational Progress (NAEP) in the USA has relied on achievement levels (Bourque 2009). In the USA, tests oriented to inform accountability decisions have followed in NAEP’s footsteps in reporting scores in terms of achievement or performance levels. This, however, does not imply that achievement levels are defined equivalently across jurisdictions (Braun and Qian 2007). While the definition of achievement levels need not, for legitimate policy reasons, be equivalent across jurisdictions, in practice in the USA there has not been a good accounting of the variability across states. A likely reason is that the achievement levels are defined by cutscores that are typically arrived at by an expert panel after the assessment has been implemented (Bejar et al. 2007). However, unless the achievement levels have been defined as part of the design effort, rather than being left to be based on the assessment as implemented, there is a good chance that there will be a lack of alignment between the intended achievement levels and the levels that emerge from the cutscore-setting process. The cutscore-setting panel has the duty to produce the most sensible cutscores it can. However, if the assessment was developed without these cutscores in mind, the panel will still need to produce a set of cutscores to fit the assessment as it exists. The fact that the panel is comprised of subject matter experts cannot possibly compensate for an assessment that was not designed to specifically support the desired inferences.
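As a toy illustration of how achievement-level reporting ultimately hinges on cutscores, the sketch below maps scale scores to levels through a set of invented boundaries; the scale, cutscore values, and level labels are assumptions made for the example and are not taken from NAEP or any state system.

```python
from bisect import bisect_right

CUTSCORES = [40, 55, 70]    # hypothetical boundaries on an invented 0-100 scale
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def achievement_level(scale_score: float) -> str:
    """Return the achievement level whose band contains the scale score."""
    return LEVELS[bisect_right(CUTSCORES, scale_score)]

for score in (35, 50, 69, 88):
    print(score, "->", achievement_level(score))
# 35 -> Below Basic, 50 -> Basic, 69 -> Proficient, 88 -> Advanced
# (a score exactly at a cutscore falls into the higher band under this convention)
```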

Whether the assessment outcomes are achievement levels or scores, an important further consideration is the temporal span assumed by the assessment. In a K-12 context, the assessment is typically focused on a single grade and administered toward the end of the year. A drawback of a single end-of-year assessment is that there is no opportunity to use the assessment information to improve student achievement, at least not directly (Stiggins 2002). An alternative is to distribute assessments during the year (Bennett and Gitomer 2009); a major advantage of this is the opportunity it gives to act upon the assessment results that arrive earlier in the year. Some subjects, notably mathematics and the language arts, can extend over several years, and the yearly end-of-year assessments could then be viewed as interim assessments. Consider first the simpler case where instruction is completed within a year and there is an end-of-year assessment. In this case, achievement levels can be unambiguously defined as the levels of knowledge expected after 1 year of instruction. For subjects that require a multiyear sequence, or for subjects that distribute the assessment across several measurement occasions within a year, at least two approaches are available. One of these defines the achievement levels in a bottom-up fashion: the achievement levels for the first measurement occasion are defined first, followed by the definitions for subsequent measurement occasions. So long as the process is carried out in a coordinated fashion, the resulting sets of achievement levels should exhibit what has been called coherence (Wilson 2004). The alternative approach is top-down; in this case, the achievement levels at the terminal point of instruction are defined first. For example, in the USA, it is common to define so-called “exit criteria” for mathematics and language arts subjects that, in principle, define what students should have learned by, say, Grade 10. With those exit definitions at hand, it is possible to work backwards and define achievement levels for earlier measurement occasions in a coherent manner.

Operationalization Issues

The foregoing considerations provide some of the critical information for determining achievement levels, which, according to Fig. 3.18, are the foundation on which the assessment rests, along with background knowledge about student learning and developmental considerations. For clarity, Fig. 3.18 outlines the “work flow” for assessment at one point in time, but in reality, at least for some subject matters, the design of “an” assessment really entails the simultaneous design of several. That complexity is captured in Fig. 3.18 under developmental considerations; as the figure shows, achievement levels are set by those developmental considerations and a competency model, which summarizes what we know about how students learn in the domain to be assessed (NRC 2001).

Fig. 3.18 The ECD framework

The achievement levels are fairly abstract characterizations of what students are expected to achieve. Those expectations need to be recast to make them more concrete, by means of evidence models and task models. Evidence models spell out the student behavior that would be evidence of having acquired the skills and knowledge called for by the achievement levels. Task models are, in turn, specifications for the tasks or items that will actually elicit the evidence called for. Once the achievement levels, task models, and evidence models are established, the design proceeds by defining task specifications and performance level descriptors (PLDs), which contain all the preceding information in a form that lends itself to formulating the test specifications. These three components should be seen as parts of an iterative process. As the name implies, task specifications are very specific descriptions of the tasks that will potentially comprise the assessment. It would be prudent to produce specifications for more tasks than can possibly be used in the assessment, to allow for the possibility that some of them will not work out well. PLDs are (tentative) narratives of what students at each achievement level can be said to know and be able to do.

A change to any of these components requires revisiting the other two; in practice, test specifications cannot be finalized without information about pragmatic constraints, such as budgets, available testing time, and so on. A requirement to shorten testing time would trigger changes to the test specifications, which in turn could trigger changes to the task specifications. Utmost care is needed in this process. Test specifications determine test-level attributes like reliability and decision consistency and need to be carefully thought through. An assessment that does not classify students into achievement levels with sufficient consistency is a failure, no matter how soundly and carefully the achievement levels have been defined, since the uncertainty that will necessarily be attached to student-level and policy-level decisions based on such an assessment will diminish its value.

The design process is thus iterative, aiming at an optimal design subject to the relevant pragmatic and psychometric constraints; among the psychometric constraints is the goal of achieving maximal discrimination in the region of the scale where the eventual cutscores are likely to be located. The iterations are ideally supplemented by field trials. Once the array of available tasks or task models is known and the constraints are agreed upon, a test blueprint can be formulated, which should be sufficiently detailed that preliminary cutscores corresponding to the performance standards can be formulated. After the assessment is administered, it will be possible to evaluate whether the preliminary cutscores are well supported or need adjustment in light of the data that are available. At that juncture, the role of the standard-setting panel is to accept the preliminary cutscores or to adjust them in the light of new information.
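One way the "maximal discrimination near the cutscores" constraint might be operationalized is sketched below, using the Fisher information of Rasch items; this is an illustration under assumed values, not part of the ECD framework itself, and the provisional cutscore and item difficulties are invented.

```python
import math

def rasch_item_information(theta: float, b: float) -> float:
    """Fisher information of a Rasch item with difficulty b at ability theta: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

provisional_cutscore = 0.5                                   # assumed location on the logit scale
item_bank = {"item_A": -1.2, "item_B": 0.4, "item_C": 0.6, "item_D": 2.0}

ranked = sorted(item_bank.items(),
                key=lambda kv: rasch_item_information(provisional_cutscore, kv[1]),
                reverse=True)
for name, b in ranked:
    info = rasch_item_information(provisional_cutscore, b)
    print(f"{name}: difficulty={b:+.1f}, information at cutscore={info:.3f}")
# Items whose difficulties sit closest to the cutscore (B and C here) carry the most
# information there and would be favored when assembling the blueprint.
```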

The BEAR Assessment System

As mentioned before, the assessment structure plays a key role in the study and the educational implementation of learning progressions. Although several alternative approaches could be used to model them, this section focuses on the BEAR Assessment System (BAS; Wilson 2005; Wilson and Sloane 2000), a measurement approach that allows us to represent one of the various forms in which learning progressions could be conceived or measured.

The BEAR Assessment System is based on the idea that good assessment addresses the need for sound measurement by way of four principles: (1) a developmental perspective; (2) the match between instruction and assessment; (3) management by instructors to allow appropriate feedback, feedforward, and follow-up; and (4) generation of quality evidence. These four principles, with the four building blocks that embody them, are shown in Fig. 3.19. They serve as the basis of a model that is rooted in our knowledge of cognition and learning in each domain and that supports the alignment of instruction, curriculum, and assessment—all aspects recommended by the NRC (2001) as important components of educational assessment.

Fig. 3.19 The principles and building blocks of the BEAR Assessment System

Principle 1: A Developmental Perspective

A “developmental perspective” on student learning highlights two crucial ideas: (a) the need to characterize the evolution of learners over time and (b) the need for assessments that are “tailored” to the characteristics of different learning theories and learning domains.

The first element, portraying the evolution of learners over time, emphasizes the definition of relevant constructs based on the development of student mastery of particular concepts and skills over time, as opposed to making a single measurement at some final or supposedly significant point in time. Additionally, it promotes assessments based on “psychologically plausible” pathways of increasing proficiency, as opposed to attempting to assess content based on logical approaches to the structure of disciplinary knowledge.

Much of the strength of the BEAR Assessment System lies in the second element, the emphasis on providing tools to model many different kinds of learning theories and learning domains, which avoids the “one-size-fits-all” approach to assessment development that has rarely satisfied educational needs. What is to be measured, and how it is to be valued, in each BEAR assessment application is drawn from the expertise and learning theories of the teachers, curriculum developers, and assessment developers involved in the process of creating the assessments.

The developmental perspective assumes that student performance on a given learning progression can be traced over the course of instruction, facilitating a more developmental perspective on student learning. Assessing the growth of students’ understanding of particular concepts and skills requires a model of how student learning develops over a certain period of (instructional) time; this growth perspective helps one to move away from “one shot” testing situations and cross-sectional approaches to defining student performance, toward an approach that focuses on the process of learning and on an individual’s progress through that process. Clear definitions of what students are expected to learn and a theoretical framework of how that learning is expected to unfold as the student progresses through the instructional material (i.e., in terms of learning performances) are necessary to establish the construct validity of an assessment system.

Building Block 1: Construct Maps

Construct maps (Wilson 2005) embody the first of the four principles: a developmental perspective on assessing student achievement and growth. A construct map is a well thought-out and researched ordering of qualitatively different levels of performance, focused on a single characteristic, that organizes clear definitions of the expected student progress. Thus, a construct map defines what is to be measured or assessed in terms general enough to be interpretable within a curriculum, and potentially across curricula, but specific enough to guide the development of the other components. When instructional practices are linked to the construct map, the construct map also indicates the aims of the teaching.
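A construct map can be represented very simply in software, for instance as an ordered list of level descriptions; the "argumentation" construct and its level wordings below are invented for illustration and are not taken from any BEAR application.

```python
# Levels are ordered from least to most sophisticated performance on one characteristic.
ARGUMENTATION_CONSTRUCT_MAP = [
    (0, "No relevant claim"),
    (1, "Unsupported claim"),
    (2, "Claim supported by a single piece of evidence"),
    (3, "Claim supported by coordinated evidence and reasoning"),
    (4, "Claim that also rules out plausible counterarguments"),
]

def describe_level(level: int) -> str:
    """Look up the qualitative description attached to a scored level."""
    return dict(ARGUMENTATION_CONSTRUCT_MAP)[level]

print(describe_level(3))
```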

Construct maps are derived in part from research into the underlying cognitive structure of the domain and in part from professional judgments about what constitutes higher and lower levels of performance or competence, but are also informed by empirical research into how students respond to instruction or perform in practice (NRC 2001).

Construct maps are one model of how assessments can be integrated with instruction and accountability. They provide a way for large-scale assessments to be linked in a principled way to what students are learning in classrooms, while having the potential at least to remain independent of the content of a specific curriculum.

The idea of using construct maps as the basis for assessments offers the possibility of gaining significant efficiency in assessment: although each new curriculum prides itself on bringing something new to the subject matter, the truth is that most curricula are composed of a common stock of content. And, as the influence of national and state standards increases, this will become even more true, and that common stock will become easier to codify. Thus, we might expect innovative curricula to have one, or perhaps even two, variables that do not overlap with typical curricula, but the rest will form a fairly stable set of variables that will be common across many curricula.

Principle 2: Match Between Instruction and Assessment

The main motivation for the progress variables so far developed is that they serve as a framework for the assessments and a method for making measurement possible. However, this second principle makes clear that the framework for the assessments and the framework for the curriculum and instruction must be one and the same. This emphasis is consistent with research in the design of learning environments, which suggests that instructional settings should coordinate their focus on the learner (incorporated in Principle 1) with both knowledge-centered and assessment-centered environments (NRC 2000).

Building Block 2: The Items Design

The items design process governs the coordination between classroom instruction and assessment. The critical element to ensure this in the BEAR Assessment System is that each assessment task and typical student response is matched to particular levels of proficiency within at least one construct map.

When using this assessment system within a curriculum, a particularly effective mode of assessment is what is called embedded assessment. This means that opportunities to assess student progress and performance are integrated into the instructional materials and are (from the student’s point of view) virtually indistinguishable from the day-to-day classroom activities.

It is useful to think of the metaphor of a stream of instructional activity and student learning, with the teacher dipping into the stream of learning from time to time to evaluate student progress and performance. In this model or metaphor, assessment then becomes part of the teaching and learning process, and we can think of it as being assessment for learning (AfL; Black et al. 2003).

If assessment is also a learning event, then it does not take time away from instruction unnecessarily, and the number of assessment tasks can be more readily increased so as to improve the reliability of the results (Linn and Baker 1996). Nevertheless, for assessment to become fully and meaningfully embedded in the teaching and learning process, the assessment must be linked to the curriculum and not be seen as curriculum-independent as is the rhetoric for traditional norm-referenced tests (Wolf and Reardon 1996).
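The reliability gain from adding embedded tasks can be gauged with the Spearman-Brown prophecy formula, in which lengthening a test of reliability rho by a factor k projects a reliability of k*rho / (1 + (k - 1)*rho). The starting reliability and lengthening factors in the sketch below are assumed values used only to show the calculation.

```python
def spearman_brown(rho: float, k: float) -> float:
    """Projected reliability when a test of reliability rho is lengthened by factor k."""
    return k * rho / (1.0 + (k - 1.0) * rho)

current_reliability = 0.60          # assumed reliability of a short set of embedded tasks
for k in (2, 3, 4):                 # doubling, tripling, quadrupling the number of tasks
    print(f"k={k}: projected reliability = {spearman_brown(current_reliability, k):.2f}")
# k=2 -> 0.75, k=3 -> 0.82, k=4 -> 0.86
```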

Principle 3: Management by Teachers

For information from the assessment tasks and the BEAR analysis to be useful to instructors and students, it must be couched in terms that are directly related to the instructional goals behind the progress variables. Open-ended tasks, if used, must all be scorable—quickly, readily, and reliably.

Building Block 3: The Outcome Space

The outcome space is the set of categorical outcomes into which student performances are categorized, for all the items associated with a particular progress variable. In practice, these are presented as scoring guides for student responses to assessment tasks, which are meant to help make the performance criteria for the assessments clear and explicit (or “transparent and open” to use Glaser’s (1963) terms)—not only to the teachers but also to the students and parents, administrators, or other “consumers” of assessment results. In fact, we strongly recommend to teachers that they share the scoring guides with administrators, parents, and students, as a way of helping them understand what types of cognitive performance are expected and to model the desired processes.

Scoring guides are the primary means by which the essential element of teacher professional judgment is implemented in the BEAR Assessment System. These are supplemented by “exemplars” of student work at every scoring level for each task and variable combination, and “blueprints,” which provide the teachers with a layout indicating opportune times in the curriculum to assess the students on the different variables.
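A scoring guide can likewise be captured as a small data structure that pairs each outcome category with its criteria and an exemplar; the categories and exemplar responses below are invented placeholders for illustration, not actual BEAR materials.

```python
# Each outcome category pairs scoring criteria with an exemplar student response.
SCORING_GUIDE = {
    0: {"criteria": "Off-task response or no claim", "exemplar": "I don't know."},
    1: {"criteria": "Claim with no supporting evidence", "exemplar": "Plants need light."},
    2: {"criteria": "Claim plus a single observation",
        "exemplar": "Plants need light; the covered plant wilted."},
    3: {"criteria": "Claim, evidence, and reasoning linking them",
        "exemplar": "Both plants got water, so the difference must be the light."},
}

def guide_entry(judged_category: int) -> dict:
    """Return the criteria and exemplar a rater would use to justify a score."""
    return SCORING_GUIDE[judged_category]

print(guide_entry(2)["criteria"])
```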

Principle 4: Evidence of High Quality Assessment

Technical issues of reliability and validity, fairness, consistency, and bias can quickly sink any attempt to measure along a progress variable, as described above, or even to develop a reasonable framework that can be supported by evidence. To ensure comparability of results across time and context, procedures are needed to (a) examine the coherence of information gathered using different formats, (b) map student performances onto the progress variables, (c) describe the structural elements of the accountability system—tasks and raters—in terms of the achievement variables, and (d) establish uniform levels of system functioning, in terms of quality control indices such as reliability.
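As one example of such a quality-control index, the sketch below computes Cronbach's alpha from a simulated item-score matrix; in practice the index would be computed from real response data, and other indices (for example, rater agreement) would be examined alongside it.

```python
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(size=(200, 1))                                  # simulated examinees
scores = (rng.normal(size=(200, 6)) + ability > 0).astype(float)     # 6 dichotomous items

k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_variances.sum() / total_variance)    # Cronbach's alpha
print(f"Cronbach's alpha for the {k} simulated items: {alpha:.2f}")
```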

Building Block 4: Wright Maps

Wright maps represent this principle of evidence of high quality. Wright maps are graphical and empirical representations of a construct map, showing how it unfolds or evolves in terms of increasingly sophisticated student performances.

They are derived from empirical analyses of student data on sets of assessment tasks, and they show an ordering of these assessment tasks from relatively easy tasks to more difficult ones. A key feature of these maps is that both students and tasks can be located on the same scale, giving student proficiency the possibility of substantive interpretation, in terms of what the student knows and can do and where the student is having difficulty. The maps can be used to interpret the progress of one particular student or the pattern of achievement of groups of students ranging from classes to nations.

Wright maps can be very useful in large-scale assessments, providing information that is not readily available through numerical score averages and other traditional summary information—they are used extensively, for example, in reporting on the PISA assessments (OECD 2005). Moreover, Wright maps can be seamlessly interpreted as representations of learning progressions, quickly mapping the statistical results back to the initial construct, providing the necessary evidence to explore questions about the structure of the learning progression, serving as the basis for improved versions of the original constructs.
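A rough sense of what a Wright map conveys can be had from a purely text-based rendering; in the sketch below the person abilities and item difficulties are assumed to come from a prior Rasch calibration and are invented for illustration.

```python
import numpy as np

# Assumed outputs of a prior Rasch calibration; values are invented.
person_abilities = np.array([-1.6, -0.9, -0.3, 0.1, 0.4, 0.8, 1.5])
item_difficulties = {"i1": -1.2, "i2": -0.4, "i3": 0.3, "i4": 1.1}

for upper in np.arange(2.0, -2.5, -0.5):            # logit bands, printed from high to low
    lower = upper - 0.5
    persons = "X" * int(((person_abilities >= lower) & (person_abilities < upper)).sum())
    items = " ".join(name for name, b in item_difficulties.items() if lower <= b < upper)
    print(f"{lower:+.1f} to {upper:+.1f} | {persons:<4}| {items}")
# Because persons and items share one scale, a student located near item i3 has roughly
# a 50% chance of succeeding on it, which supports substantive interpretation.
```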

Copyright information

© 2012 Springer Science+Business Media B.V.

Cite this chapter

Wilson, M., Bejar, I., Scalise, K., Templin, J., Wiliam, D., Irribarra, D.T. (2012). Perspectives on Methodological Issues. In: Griffin, P., McGaw, B., Care, E. (eds) Assessment and Teaching of 21st Century Skills. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-2324-5_3
