The Effects of Item Format and Cognitive Domain on Students’ Science Performance in TIMSS 2011

Abstract

The purpose of this study was to examine eighth-grade students’ science performance in terms of two test-design components: item format and cognitive domain. The data were the Taiwanese portion of the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS), one of the major international large-scale assessments in science. Item difficulty analysis was first applied to obtain the proportion-correct statistic for each item. A regression-based cumulative link mixed modeling (CLMM) approach was then used to estimate the effects of item format, cognitive domain, and their interaction on students’ science scores. The proportion-correct statistics showed that constructed-response items were more difficult than multiple-choice items, and that items in the reasoning cognitive domain were more difficult than items in the applying and knowing domains. According to the CLMM results, students tended to obtain higher scores when answering constructed-response items as well as items in the applying cognitive domain. When the two predictors and their interaction term were included together, the directions and magnitudes of the predictors’ effects on student science performance changed substantially. Plausible explanations for the complex nature of the effects of the two test-design predictors on student science performance are discussed. The results provide practical, empirically based evidence that can help test developers, teachers, and stakeholders recognize the differential functioning of item format, cognitive domain, and their interaction in students’ science performance.
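
The CLMM approach described above can be illustrated with a minimal, hypothetical sketch in R using the ordinal package's clmm() function. The data frame and its column names (score, format, domain, student, item), as well as the simulated values, are illustrative assumptions for exposition only, not the authors' actual data or code; the authors' full specification (e.g., scoring categories and sampling design) follows the article itself rather than this sketch.

  # Minimal, hypothetical sketch of a cumulative link mixed model (CLMM)
  # with item format, cognitive domain, and their interaction as fixed effects.
  # Simulated data and variable names are illustrative, not the authors' own.
  library(ordinal)

  set.seed(1)
  n <- 1000
  dat <- data.frame(
    score   = ordered(sample(0:2, n, replace = TRUE)),                         # item score as an ordered category (0, 1, 2)
    format  = factor(sample(c("MC", "CR"), n, replace = TRUE)),                # multiple-choice vs. constructed-response
    domain  = factor(sample(c("Knowing", "Applying", "Reasoning"), n, replace = TRUE)),
    student = factor(sample(1:200, n, replace = TRUE)),                        # student identifier
    item    = factor(sample(1:50, n, replace = TRUE))                          # item identifier
  )

  # Fixed effects: format, domain, and their interaction;
  # crossed random intercepts for students and items.
  fit <- clmm(score ~ format * domain + (1 | student) + (1 | item),
              data = dat, link = "logit")
  summary(fit)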

Funding

This work was supported by the Ministry of Science and Technology, Taiwan (R.O.C.) under Grants 104-2629-S-008-001- and 106-2511-S-008-011-.

Author information

Corresponding author

Correspondence to Pey-Yan Liou.

About this article

Cite this article

Liou, PY., Bulut, O. The Effects of Item Format and Cognitive Domain on Students’ Science Performance in TIMSS 2011. Res Sci Educ 50, 99–121 (2020). https://doi.org/10.1007/s11165-017-9682-7

Keywords

  • Cognitive domain
  • Item format
  • Science achievement
  • TIMSS