Item Development Research and Practice

A chapter in the Handbook of Accessible Instruction and Testing Practices

Abstract

Recent legislation and federal regulations in education have heightened attention to issues of inclusion, fairness, equity, and access in achievement testing, resulting in a growing literature on item writing for accessibility. This chapter provides an overview of the fundamentals of item development, especially as they pertain to accessible assessments. Constructed-response and selected-response items are first introduced and compared, with examples. Next, the item development process and guidelines for effective item writing are presented. Empirical research examining the item development process is then reviewed, both for general education items and for items modified for accessibility. Methods for evaluating item quality with respect to accessibility are summarized. Finally, recent innovations and technological enhancements in item development, administration, and scoring are discussed.

Author information

Corresponding author

Correspondence to Anthony D. Albano.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Albano, A.D., Rodriguez, M.C. (2018). Item Development Research and Practice. In: Elliott, S., Kettler, R., Beddow, P., Kurz, A. (eds) Handbook of Accessible Instruction and Testing Practices. Springer, Cham. https://doi.org/10.1007/978-3-319-71126-3_12
