Abstract
Multiple choice (MC) questions from a graduate physiology course were evaluated by cognitive psychology (but not physiology) experts and analyzed statistically to test whether content expertise and cognitive complexity ratings of MC items are independent. Integrating higher order thinking into MC exams is important but widely known to be challenging, perhaps especially when content experts must think like novices; expertise in the domain (content) may actually impede the creation of higher-complexity items. Three cognitive psychology experts independently rated the cognitive complexity of 252 multiple-choice physiology items using a six-level cognitive complexity matrix synthesized from the literature. Rasch modeling estimated item difficulties. The complexity ratings and difficulty estimates were then analyzed together to determine the relative contributions (and independence) of complexity and difficulty to the likelihood of a correct answer on each item. Cognitive complexity was found to be statistically independent of difficulty estimates for 88% of items. Using the complexity matrix, modifications were identified that would increase some items' complexity by one level without affecting item difficulty. Cognitive complexity can therefore be rated effectively by non-content experts. The six-level complexity matrix, if applied by faculty peer groups trained in cognitive complexity but without domain-specific expertise, could improve the complexity targeted in item writing and revision. Targeting higher order thinking with MC questions can be achieved without changing item difficulties or other test characteristics, but this may be less likely if the content expert is left to assess items within their own domain of expertise.
References
American Psychological Association, National Council on Measurement in Education, American Educational Research Association. (1999). Standards for educational and psychological testing, 2E. Washington, DC: American Educational Research Association.
Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., et al. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. New York: Longman.
Anderson, J. R. (2005). Cognitive psychology and its implications, 6E. New York, NY: Worth Publishers.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals, by a committee of college and university examiners. Handbook I: Cognitive domain. New York: David McKay.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch Model: Fundamental measurement in the human sciences, 2E. Mahwah, NJ: Lawrence Erlbaum Associates.
Bruff, D. (2009). Teaching with classroom response systems: Creating active learning environments. San Francisco, CA: Jossey Bass.
Buckles, S., & Siegfried, J. J. (2006). Using multiple-choice questions to evaluate in-depth learning of economics. The Journal of Economic Education, 37(1), 48–57.
Case, S. M., & Swanson, D. B. (2002). Constructing written test questions for the basic and clinical sciences, 3E-Revised. Philadelphia: National Board of Medical Examiners.
Cizek, G. J., & Bunch, M. B. (2008). Standard setting: A guide to establishing and evaluating performance standards on tests. Newbury Park, CA: Sage Publications.
Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Belmont, CA: Wadsworth Group.
Custers, E. J. F. M., & Boshuizen, H. P. A. (2002). The psychology of learning. In G. R. Norman, C. P. M. van der Vleuten, & D. L. Newble (Eds.), International handbook of research in medical education (Vol. 1, pp. 163–203). Dordrecht: Kluwer.
Dimitrov, D. (2007). Least squares distance method of cognitive validation and analysis for binary items using their item response theory parameters. Applied Psychological Measurement, 31, 367–387.
Downing, S. M. (2002). Assessment of knowledge with written test forms. In G. R. Norman, C. P. M. van der Vleuten, & D. L. Newble (Eds.), International handbook of research in medical education (Vol. 2, pp. 647–672). Dordrecht: Kluwer.
Ericsson, K. A. (2004). Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Academic Medicine, 79(10 Suppl), S70–S81.
Gierl, M. J., Leighton, J. P., & Hunka, S. M. (2000). Exploring the logic of Tatsuoka’s rule-space model for test development and analysis. An NCME instructional module. Educational Measurement: Issues and Practice, 19(3), 34–44.
Gruppen, L. D., & Frohna, A. Z. (2002). Clinical Reasoning. In G. R. Norman, C. P. M. van der Vleuten, & D. L. Newble (Eds.), International handbook of research in medical education (Vol. 1, pp. 205–230). Dordrecht: Kluwer.
Gushta, M. M., Yumoto, F., & Williams, A. (2009). Separating item difficulty and cognitive complexity in educational achievement testing. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Needham Heights, MA: Allyn & Bacon.
Linacre, J. M. (2007). A User’s guide to WINSTEPS® Rasch-model computer program. Chicago, IL: Author. Downloaded 10 October 2007 from http://www.winsteps.com/winsteps.htm.
Mislevy, R. J., & Huang, C.-W. (2007). Measurement models as narrative structures. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions & applications (pp. 16–35). New York: Springer.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S., Miller, J., et al. (2005). Frameworks for thinking. Cambridge, UK: Cambridge University Press.
Rupp, A. A., & Mislevy, R. J. (2007). Cognitive foundations of structured item response models. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment: Theories and applications (pp. 205–241). Cambridge: Cambridge University Press.
Shelton, S. W. (1999). The effect of experience on the use of irrelevant evidence in auditor judgment. The Accounting Review, 74(2), 217–224.
Smith, R. M., Schumacker, R. E., & Bush, J. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66–78.
Tardieu, H., Ehrlich, M.-F., & Gyselinck, V. (1992). Levels of representation and domain-specific knowledge in comprehension of scientific texts. Language and Cognitive Processes, 7(3–4), 335–351. doi:10.1080/01690969208409390.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20(4), 345–354.
van de Watering, G., & van der Rijt, J. (2006). Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1(2), 133–147.
van Hoeij, M. J. W., Haarhuis, J. C. M., Wierstra, R. F. A., & van Beukelen, P. (2004). Developing a classification tool based on Bloom’s Taxonomy to assess the cognitive level of short essay questions. Journal of Veterinary Medical Education, 31(3), 261–267.
Williams, R. D., & Haladyna, T. M. (1982). Logical operations for generating intended questions (LOGIQ): A typology for higher level test items. In G. H. Roid & T. M. Haladyna (Eds.), A technology for test-item writing (pp. 161–186). New York: Academic Press.
Zheng, A. Y., Lawhorn, J. K., Lumley, T., & Freeman, S. (2008). Application of Bloom’s taxonomy debunks the “MCAT Myth”. Science, 319, 414–415. doi:10.1126/science.1147852.
Acknowledgments
This work was supported by a Curricular Innovation, Research, and Creativity in Learning Environment (CIRCLE) grant (intramural, GUMC) to RET.
Conflict of interest
The authors have no conflicts of interest to declare.
Appendix
The Least Squares Distance Model (LSDM; Dimitrov 2007) uses existing IRT item parameter estimates, obtained from a separate procedure or program, together with an appropriate Q-matrix to model attribute probabilities at fixed levels of ability. These probability estimates are calculated as intact units for each fixed level of theta according to the following equations:
\[ P_{ij} = \prod_{k=1}^{K} P\left( A_{k} = 1|\theta_{i} \right)^{q_{jk}}, \qquad \ln P_{ij} = \sum_{k=1}^{K} q_{jk} \ln P\left( A_{k} = 1|\theta_{i} \right), \]
similar to the Rasch model, where \( P_{ij} \) is the probability of a correct response on item j by person i given ability \( \theta_{i} \); \( P\left( A_{k} = 1|\theta_{i} \right) \) is the probability of correct performance on attribute \( A_{k} \) for a person with ability level \( \theta_{i} \); and \( q_{jk} \) is the Q-matrix element (0, 1) associated with item j and attribute \( A_{k} \).
With n binary items, this generates a system of n linear equations in K unknowns, \( \ln P\left( A_{k} = 1|\theta_{i} \right) \), at each fixed level of ability. In matrix form the system is L = QX, where L is the known vector of elements \( \ln P_{ij} \); Q is the known Q-matrix; and X is the unknown vector of elements \( X_{k} = \ln P\left( A_{k} = 1|\theta_{i} \right) \).
Minimizing the Euclidean norm \( \left\| QX - L \right\| \) yields the unknown vector X and the least squares distance (LSD), and the probability of a correct response for a student with ability \( \theta_{i} \) on an item associated with attribute \( A_{k} \) is \( P\left( A_{k} = 1|\theta_{i} \right) = \exp (X_{k}) \). The LSDM item probabilities are then obtained as the product of the attribute probabilities at each ability level; these approximate the probabilities calculated under the Rasch model.
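To make the algebra above concrete, the following is a minimal computational sketch (in Python with NumPy) of the LSDM least-squares solve at fixed ability levels. The Q-matrix, Rasch difficulties, and ability grid are toy values chosen for illustration, not data from the study.

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model probability of a correct response: exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Toy inputs (illustrative only): 4 items, 2 attributes.
Q = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [1, 0]], dtype=float)    # q_jk = 1 if item j requires attribute k
b = np.array([-0.5, 0.0, 0.8, 1.2])    # Rasch item difficulty estimates (logits)
theta_levels = np.linspace(-3, 3, 13)  # fixed ability levels

for theta in theta_levels:
    # Known vector L: logs of the Rasch probabilities ln P_ij for each item at this theta.
    L = np.log(rasch_prob(theta, b))

    # Solve the overdetermined system QX = L in the least-squares sense;
    # the solution elements are X_k = ln P(A_k = 1 | theta).
    X, *_ = np.linalg.lstsq(Q, L, rcond=None)
    attr_probs = np.exp(X)             # P(A_k = 1 | theta), interpretable when in (0, 1)

    # LSDM-recovered item probabilities: prod_k P(A_k | theta)^{q_jk} = exp(QX).
    item_probs = np.exp(Q @ X)

    # Least squares distance ||QX - L|| indexes how well the Q-matrix reproduces
    # the Rasch item probabilities at this ability level.
    lsd = np.linalg.norm(Q @ X - L)
    print(f"theta={theta:+.2f}  attribute P={np.round(attr_probs, 3)}  "
          f"item P={np.round(item_probs, 3)}  LSD={lsd:.3f}")
```

In the study itself, the item parameters came from the WINSTEPS Rasch estimates and the Q-matrix from the raters' cognitive complexity codes; the sketch above illustrates only the least-squares step, not the full LSDM validation procedure described by Dimitrov (2007).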
About this article
Cite this article
Tractenberg, R.E., Gushta, M.M., Mulroney, S.E. et al. Multiple choice questions can be designed or revised to challenge learners’ critical thinking. Adv in Health Sci Educ 18, 945–961 (2013). https://doi.org/10.1007/s10459-012-9434-4