The purpose of this study was to examine eighth-grade students’ science performance in terms of two test design components: item format and cognitive domain. The Taiwanese data came from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. An item difficulty analysis was first applied to compute the proportion of correct responses for each item. A regression-based cumulative link mixed modeling (CLMM) approach was then utilized to estimate the effects of item format, cognitive domain, and their interaction on the students’ science scores. The proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that items in the reasoning cognitive domain were more difficult than those in the applying and knowing domains. In the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and their interaction term were included together, the directions and magnitudes of the predictors’ effects on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirically based evidence to help test developers, teachers, and other stakeholders recognize the differential functioning of item format, cognitive domain, and their interaction in students’ science performance.
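The proportion-correct (p-value) item difficulty statistic mentioned above is simply the share of students who answer each item correctly, which can then be averaged by item format or cognitive domain. The following Python sketch illustrates the idea with a small, entirely hypothetical response matrix and item metadata; the item names, formats, and domain labels are illustrative assumptions, not values from the TIMSS 2011 database.

```python
import numpy as np
import pandas as pd

# Hypothetical response matrix: rows = students, columns = items,
# entries = 1 (correct) or 0 (incorrect). Generated at random here
# purely for illustration.
rng = np.random.default_rng(0)
responses = pd.DataFrame(
    rng.integers(0, 2, size=(100, 4)),
    columns=["item1", "item2", "item3", "item4"],
)

# Illustrative item metadata: MC = multiple-choice, CR = constructed-response.
item_meta = pd.DataFrame({
    "item": ["item1", "item2", "item3", "item4"],
    "format": ["MC", "MC", "CR", "CR"],
    "domain": ["knowing", "applying", "applying", "reasoning"],
})

# Proportion-correct statistic per item: the fraction of students who
# answered correctly. Lower values indicate harder items.
p_values = responses.mean(axis=0).rename("p_correct")
summary = item_meta.merge(p_values, left_on="item", right_index=True)

# Mean difficulty by item format and by cognitive domain.
print(summary.groupby("format")["p_correct"].mean())
print(summary.groupby("domain")["p_correct"].mean())
```

Note that this covers only the descriptive difficulty analysis; the CLMM stage reported in the study was fit with the R `ordinal` package cited by the authors and models the ordinal item scores with student-level random effects, which plain proportion-correct statistics do not capture.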
This work was supported by the Ministry of Science and Technology, Taiwan (R.O.C.) under Grants 104-2629-S-008-001- and 106-2511-S-008-011-.
Cite this article
Liou, PY., Bulut, O. The Effects of Item Format and Cognitive Domain on Students’ Science Performance in TIMSS 2011. Res Sci Educ 50, 99–121 (2020). https://doi.org/10.1007/s11165-017-9682-7
Keywords
- Cognitive domain
- Item format
- Science achievement