The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues

Abstract

Psychometric modeling has become a frequently used statistical tool in research on scientific reasoning. We review psychometric modeling practices in this field, including model choice, model testing, and the inferences researchers draw from their psychometric analyses. A review of 11 empirical research studies reveals that the predominant psychometric approach is Rasch modeling with a focus on item fit statistics, applied in a way that closely resembles practices in national and international large-scale educational assessment programs. This approach is common in the educational assessment community and rooted in subtle philosophical views on measurement. We find, however, that researchers tend to draw interpretations that lie outside the inferential domain of this specific approach and that do not accord with its practices and inferential purposes. In some of the reviewed articles, researchers put emphasis on item infit statistics for dimensionality assessment. Item infit statistics, however, cannot be regarded as a valid indicator of the dimensionality of scientific reasoning. Using simulations as illustration, we argue that this practice is limited in delivering psychological insights; in fact, various recent inferences about the structure, cognitive basis, and correlates of scientific reasoning might be unwarranted. To harness the full potential of psychometric modeling, we make suggestions for adjusting modeling practices to the psychological and educational questions at hand.

Notes

  1. The research in the focus of this review uses the Rasch model for dichotomous data; the model is written out formally after these notes. Readers interested in the theory and application of Rasch models for polytomous data are referred to Anderson et al. (2007).

  2. Further item response models include additional parameters, representing, for example, item-specific guessing probabilities (giving the right answer by chance on, for example, multiple-choice tests) and item-specific slipping probabilities (not giving the right answer by chance because, for example, items have different distracting elements; Revelle 2004; Thissen and Steinberg 1986). The corresponding four-parameter model is written out after these notes.

  3. Sometimes the two schools are referred to as two paradigms because they differ so strongly in their theoretical assumptions (Andrich 2004). Here, we prefer to call the two positions schools: their similarities and differences are not as clear-cut as sometimes assumed (Robitzsch 2016), and Kuhn's (T. S. Kuhn 1970) concept of paradigms and their relation to scientific development has been contested (Bird 2013; Toulmin 1974). We therefore deem the concept of the two paradigms described by Andrich (2004) not yet sufficiently elaborated and critically examined to be adopted here.

  4. Classical test theory does not formally explicate a model that explains the relation between person and item characteristics and item responses. Psychometric modeling does explicate such a model, for example the Rasch model, and thus allows testing its underlying assumptions. The distinction between classical test theory and modern psychometric modeling is, however, not clear-cut (Holland and Hoskens 2003); factor analysis, which used to be regarded as an instrument of classical test theory, is in its confirmatory versions closely related to the psychometric models discussed here (Gebhardt 2016).

  5. One could argue that large-scale assessment programs also represent a particular type of research, especially because the data gathered in these studies are often used for ancillary research. However, the major aim of all of these programs is policy-informing assessment; their methodology is therefore aligned with this aim rather than with advancing scientific theory.

  6. Research related to scientific reasoning is also conducted under the labels scientific inquiry and scientific thinking. For terminological coherence, in the description of the reviewed studies, these and related terms are subsumed under scientific reasoning.

  7. Notably, these packages are all commercial software; in none of the studies were free software packages used, such as those available in the R software environment (for example, the TAM package, which encompasses a broad variety of psychometric models; Kiefer et al. 2016). A minimal fitting example using TAM follows these notes.

  8. Items with low infit are not damaging to the estimation of a person's ability; rather, they are removed when constructing an instrument to reduce the item pool from, say, 100 to 20 items (Linacre and Wright 1994). We thank a reviewer for pointing this out.

  9. It should be noted that other researchers have evaluated and discussed item infit statistics, although with different aims (Christensen and Kreiner 2013; Heene et al. 2014; Smith et al. 2008; Smith et al. 1998; R. M. Smith and Suh 2003). The present simulations serve a demonstration purpose rather than the purpose of exhibiting generalizable statistical insights; the simulation conditions were therefore tailored towards reflecting the main characteristics of the reviewed studies.

  10. The number of items was simulated to be equal across dimensions; under deviations from this scenario, item fit statistics likely flag more items. However, an equal or almost equal number of items across dimensions is the regular scenario in psychometric studies and in the reviewed literature. A sketch of this kind of simulation follows these notes.
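
For reference (see note 1), the dichotomous Rasch model (Rasch 1960) specifies the probability that person p answers item i correctly solely through the person ability θ_p and the item difficulty β_i:

```latex
P(X_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}
```

All items share this functional form with a common slope; the more general item response models of note 2 relax this restriction.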
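
The guessing and slipping probabilities mentioned in note 2 correspond to the lower and upper asymptotes of the four-parameter logistic model (cf. Thissen and Steinberg 1986), in which item i additionally carries a discrimination a_i, a guessing parameter c_i, and a slipping-related upper asymptote d_i:

```latex
P(X_{pi} = 1 \mid \theta_p) = c_i + (d_i - c_i)\,\frac{\exp\bigl(a_i(\theta_p - \beta_i)\bigr)}{1 + \exp\bigl(a_i(\theta_p - \beta_i)\bigr)}
```

Setting a_i = 1, c_i = 0, and d_i = 1 for all items recovers the Rasch model.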
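
As an illustration of the free alternatives mentioned in note 7, the following minimal R sketch simulates Rasch-conforming responses and fits the model with the TAM package. It assumes the tam.mml()/tam.fit() interface documented by Kiefer et al. (2016); the data-generating code is our own and not taken from any of the reviewed studies.

```r
library(TAM)

set.seed(1)
n_persons <- 500
n_items   <- 20
theta <- rnorm(n_persons)                  # person abilities
beta  <- seq(-2, 2, length.out = n_items)  # item difficulties

# Rasch model: P(X = 1) = plogis(theta - beta)
prob <- plogis(outer(theta, beta, `-`))
resp <- 1 * (matrix(runif(n_persons * n_items), n_persons, n_items) < prob)

mod <- tam.mml(resp = resp)  # Rasch model via marginal maximum likelihood
fit <- tam.fit(mod)          # item infit/outfit mean-square statistics
fit$itemfit[, c("Outfit", "Infit")]
```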
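
The sketch below illustrates the type of simulation referred to in notes 9 and 10: responses are generated from two equally sized, moderately correlated dimensions, a unidimensional Rasch model is then fit to them, and the item infit statistics are inspected. The specific settings (a latent correlation of .50, ten items per dimension) are our own demonstration assumptions, not the exact conditions of the reported simulations.

```r
library(MASS)  # mvrnorm() for correlated person abilities
library(TAM)

set.seed(2)
n_persons <- 1000
theta <- MASS::mvrnorm(n_persons, mu = c(0, 0),
                       Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

beta     <- rep(seq(-1.5, 1.5, length.out = 10), times = 2)  # difficulties
item_dim <- rep(1:2, each = 10)  # items 1-10 on dimension 1, items 11-20 on dimension 2

prob <- plogis(sweep(theta[, item_dim], 2, beta, `-`))
resp <- 1 * (matrix(runif(n_persons * 20), n_persons, 20) < prob)

mod   <- tam.mml(resp = resp)  # misspecified unidimensional Rasch model
infit <- tam.fit(mod)$itemfit$Infit
range(infit)  # infit values typically stay near 1, within conventional cutoffs
```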

References

  1. Ainley, J., Fraillon, J., & Freeman, C. (2007). National assessment program—ICT literacy years 6 & 10 report, 2005. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).

  2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.

  3. Andersen, E. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38(1), 123–140. https://doi.org/10.1007/BF02291180.

  4. Anderson, C. J., Li, Z., & Vermunt, J. K. (2007). Estimation of models in a Rasch family for polytomous items and multiple latent variables. Journal of Statistical Software, 20(6), 1–36. https://doi.org/10.18637/jss.v020.i06.

  5. Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42(Supplement), I–7. https://doi.org/10.1097/01.mlr.0000103528.48582.7c.

  6. Andrich, D. (2011). Rating scales and Rasch measurement. Expert Review of Pharmacoeconomics & Outcomes Research, 11(5), 571–585. https://doi.org/10.1586/erp.11.59.

  7. Baird, J.-A., Andrich, D., Hopfenbeck, T. N., & Stobart, G. (2017). Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3), 317–350. https://doi.org/10.1080/0969594X.2017.1319337.

  8. Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). A new lease of life for Thomson's bonds model of intelligence. Psychological Review, 116(3), 567–579.

  9. Bartolucci, F., Bacci, S., & Gnaldi, M. (2014). MultiLCIRT: an R package for multidimensional latent class item response models. Computational Statistics & Data Analysis, 71, 971–985. https://doi.org/10.1016/j.csda.2013.05.018.

  10. Bird, A. (2013). Thomas Kuhn. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Fall 2013). Metaphysics Research Lab, Stanford University.

  11. Bond, T., & Fox, C. M. (2015). Applying the Rasch model: fundamental measurement in the human sciences. Routledge.

  12. Bonifay, W., Lane, S. P., & Reise, S. P. (2016). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186. https://doi.org/10.1177/2167702616657069.

  13. Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. Dordrecht: Springer Netherlands.

  14. Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research & Perspective, 6(1-2), 25–53. https://doi.org/10.1080/15366360802035497.

  15. Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3), 345–370. https://doi.org/10.1007/BF02294361.

  16. Brown, N. J., & Wilson, M. (2011). A model of cognition: the missing cornerstone of assessment. Educational Psychology Review, 23(2), 221–234.

  17. Brown, N. J., Furtak, E. M., Timms, M., Nagashima, S. O., & Wilson, M. (2010). The evidence-based reasoning framework: assessing scientific reasoning. Educational Assessment, 15(3-4), 123–141.

  18. Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: John Wiley.

  19. Bürkner, P. C. (2017). brms: an R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01.

  20. Cano, F. (2005). Epistemological beliefs and approaches to learning: their change through secondary school and their influence on academic performance. British Journal of Educational Psychology, 75(2), 203–221. https://doi.org/10.1348/000709904X22683.

  21. Carey, S. (1992). The origin and evolution of everyday concepts. Minneapolis: University of Minnesota Press.

  22. Caspi, A., Houts, R. M., Belsky, D. W., Goldman-Mellor, S. J., Harrington, H., Israel, S., Meier, M. H., Ramrakha, S., Shalev, I., Poulton, R., & Moffitt, T. E. (2014). The p factor: one general psychopathology factor in the structure of psychiatric disorders? Clinical Psychological Science, 2(2), 119–137. https://doi.org/10.1177/2167702613497473.

  23. Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered reports: realigning incentives in scientific publishing. Cortex, 66(3), A1–A2. https://doi.org/10.1016/j.cortex.2012.12.016.

  24. Chen, Z., & Klahr, D. (1999). All other things being equal: acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098–1120. https://doi.org/10.1111/1467-8624.00081.

  25. Christensen, K. B., & Kreiner, S. (2013). Item fit statistics. In Rasch models in health (pp. 83–104). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch5.

  26. Conway, A. R., & Kovacs, K. (2015). New and emerging models of human intelligence. Wiley Interdisciplinary Reviews: Cognitive Science, 6(5), 419–426. https://doi.org/10.1002/wcs.1356.

  27. Cullen, L. T. (2012). Rasch models: foundations, recent developments, and applications. Springer.

  28. von Davier, M., & Carstensen, C. H. (2007). Multivariate and mixture distribution Rasch models: extensions and applications. New York: Springer.

  29. De Groot, A. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han Lj Van Der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001.

  30. de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice options. Applied Psychological Measurement, 33(3), 163–183. https://doi.org/10.1177/0146621608320523.

  31. Deary, I. J., Wilson, J. A., Carding, P. N., MacKenzie, K., & Watson, R. (2010). From dysphonia to dysphoria: Mokken scaling shows a strong, reliable hierarchy of voice symptoms in the Voice Symptom Scale questionnaire. Journal of Psychosomatic Research, 68(1), 67–71. https://doi.org/10.1016/j.jpsychores.2009.06.008.

  32. Dewey, J. (1910). How we think. Boston, MA: DC Heath.

  33. Dickison, P., Luo, X., Kim, D., Woo, A., Muntean, W., & Bergstrom, B. (2016). Assessing higher-order cognitive constructs by using an information-processing framework. Journal of Applied Testing Technology, 17, 1–19.

  34. Divgi, D. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283–298. https://doi.org/10.1111/j.1745-3984.1986.tb00251.x.

  35. Donovan, J., Hutton, P., Lennon, M., O’Connor, G., & Morrissey, N. (2008a). National assessment program—science literacy year 6 school release materials, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).

  36. Donovan, J., Lennon, M., O'Connor, G., & Morrissey, N. (2008b). National assessment program—science literacy year 6 report, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).

  37. Engelhard Jr, G. (2013). Invariant measurement: using Rasch models in the social, behavioral, and health sciences. New York: Routledge.

  38. Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.

  39. Esswein, J. L. (2010). Critical thinking and reasoning in middle school science education (Doctoral dissertation, The Ohio State University).

  40. Finkelstein, L. (2003). Widely, strongly and weakly defined measurement. Measurement, 34(1), 39–48. https://doi.org/10.1016/S0263-2241(03)00018-6.

  41. Fischer, F., Kollar, I., Ufer, S., Sodian, B., Hussmann, H., Pekrun, R., et al. (2014). Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education. Frontline Learning Research, 4, 28–45. https://doi.org/10.14786/flr.v2i2.96.

  42. Fox, J.-P. (2010). Bayesian item response modeling: theory and applications. Springer Science & Business Media.

  43. Gebhardt, E. (2016). Latent path models within an IRT framework (Doctoral dissertation).

  44. Gelman, A., & Loken, E. (2014). The statistical crisis in science: data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don't hold up. American Scientist, 102(6), 460.

  45. Gignac, G. E. (2016). On the evaluation of competing theories: a reply to van der Maas and Kan. Intelligence, 57, 84–86. https://doi.org/10.1016/j.intell.2016.03.006.

  46. Glas, C. A., & Verhelst, N. D. (1995). Testing the Rasch model. In Rasch models (pp. 69–95). Springer.

  47. Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10(4), 544–565. https://doi.org/10.1207/S15328007SEM1004_4.

  48. Grube, C. R. (2010). Kompetenzen naturwissenschaftlicher Erkenntnisgewinnung [Competencies of scientific inquiry] (Doctoral dissertation, Universität Kassel).

  49. Hambleton, R. K. (2000). Response to Hays et al. and McHorney and Cohen: emergence of item response modeling in instrument development and data analysis. Medical Care, 38, II–60. https://doi.org/10.1097/00005650-200009002-00009.

  50. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.

  51. Hartig, J., & Frey, A. (2013). Sind Modelle der Item-Response-Theorie (IRT) das “Mittel der Wahl” für die Modellierung von Kompetenzen? [Are item response theory (IRT) models the “method of choice” for modeling competencies?] Zeitschrift für Erziehungswissenschaft, 16(S1), 47–51. https://doi.org/10.1007/s11618-013-0386-0.

  52. Hartig, J., Klieme, E., & Leutner, D. (2008). Assessment of competencies in educational contexts. Hogrefe Publishing.

  53. Hartmann, S., Upmeier zu Belzen, A., Kroeger, D., & Pant, H. A. (2015). Scientific reasoning in higher education: constructing and evaluating the criterion-related validity of an assessment of preservice science teachers' competencies. Zeitschrift für Psychologie, 223(1), 47–53. https://doi.org/10.1027/2151-2604/a000199.

  54. Heene, M. (2006). Konstruktion und Evaluation eines Studierendenauswahlverfahrens für Psychologie an der Universität Heidelberg [Construction and evaluation of a student selection procedure for psychology at the University of Heidelberg]. Unpublished doctoral dissertation, University of Heidelberg.

  55. Heene, M., Bollmann, S., & Bühner, M. (2014). Much ado about nothing, or much to do about something: effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245–249. https://doi.org/10.1027/1614-0001/a000146.

  56. Heene, M., Kyngdon, A., & Sckopke, P. (2016). Detecting violations of unidimensionality by order-restricted inference methods. Frontiers in Applied Mathematics and Statistics, 2, 3.

  57. Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149.

  58. Humphry, S. (2011). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9(1), 1–24. https://doi.org/10.1080/15366367.2011.558442.

  59. Jeon, M., Draney, K., & Wilson, M. (2015). A general Saltus LLTM-R for cognitive assessments. In Quantitative psychology research (pp. 73–90). Springer. https://doi.org/10.1007/978-3-319-07503-7_5.

  60. Kiefer, T., Robitzsch, A., & Wu, M. (2016). Package TAM. R software package.

  Kitchener, K. S. (1983). Cognition, metacognition, and epistemic cognition. Human Development, 26, 222–232.

  61. Klahr, D. (2002). Exploring science: the cognition and development of discovery processes. The MIT Press.

  62. Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48. https://doi.org/10.1207/s15516709cog1201_1.

  63. Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216(2), 61–73.

  64. Koller, I., Maier, M. J., & Hatzinger, R. (2015). An empirical power analysis of quasi-exact tests for the Rasch model. Methodology, 11(2), 45–54. https://doi.org/10.1027/1614-2241/a000090.

  65. Körber, S., Mayer, D., Osterhaus, C., Schwippert, K., & Sodian, B. (2014). The development of scientific thinking in elementary school: a comprehensive inventory. Child Development, 86(1), 327–336. https://doi.org/10.1111/cdev.12298.

  66. Körber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33(1), 57–72. https://doi.org/10.1111/bjdp.12067.

  67. Kreiner, S., & Christensen, K. B. (2013). Overall tests of the Rasch model. In Rasch models in health (pp. 105–110). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch6.

  68. Kremer, K., Specht, C., Urhahne, D., & Mayer, J. (2014). The relationship in biology between the nature of science and scientific inquiry. Journal of Biological Education, 48(1), 1–8. https://doi.org/10.1080/00219266.2013.788541.

  69. Kuhn, D. (1989). Children and adults as intuitive scientists. Psychological Review, 96(4), 674–689. https://doi.org/10.1037/0033-295X.96.4.674.

  70. Kuhn, D. (1991). The skills of argument. Cambridge University Press.

  71. Kuhn, D., Iordanou, K., Pease, M., & Wirkala, C. (2008). Beyond control of variables: what needs to develop to achieve skilled scientific thinking? Cognitive Development, 23(4), 435–451. https://doi.org/10.1016/j.cogdev.2008.09.006.

  72. Kuhn, D., & Pease, M. (2008). What needs to develop in the development of inquiry skills? Cognition and Instruction, 26(4), 512–559. https://doi.org/10.1080/07370000802391745.

  73. Kuhn, D., Ramsey, S., & Arvidsson, T. S. (2015). Developing multivariable thinkers. Cognitive Development, 35, 92–110. https://doi.org/10.1016/j.cogdev.2014.11.003.

  74. Kuhn, D., & Udell, W. (2003). The development of argument skills. Child Development, 74(5), 1245–1260. https://doi.org/10.1111/1467-8624.00605.

  75. Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed., enlarged). International encyclopedia of unified science: foundations of the unity of science (Vol. 2, No. 2). Chicago: University of Chicago Press.

  76. Kuo, C.-Y., Wu, H.-K., Jen, T.-H., & Hsu, Y.-S. (2015). Development and validation of a multimedia-based assessment of scientific inquiry abilities. International Journal of Science Education, 37(14), 2326–2357. https://doi.org/10.1080/09500693.2015.1078521.

  77. Lehrer, R., & Schauble, L. (2000). Modeling in mathematics and science. In R. Glaser (Ed.), Advances in instructional psychology, Volume 5: Educational Design and Cognitive Science (pp. 100–159). New Jersey: Lawrence Erlbaum.

  78. Linacre, J. M. (2010). Two perspectives on the application of Rasch models. European Journal of Physical and Rehabilitation Medicine, 46, 309–310.

  79. Linacre, J. M. (2012). A user's guide to FACETS: Rasch-model computer programs.

  80. Linacre, J. M., & Wright, B. D. (1994). Dichotomous infit and outfit mean-square fit statistics. Rasch Measurement Transactions, 8(2), 260.

  81. Linacre, J. M., & Wright, B. D. (2000). Winsteps [Computer software]. http://www.winsteps.com/index.htm. Accessed 1 Jan 2017.

  82. Lou, Y., Blanchard, P., & Kennedy, E. (2015). Development and validation of a science inquiry skills assessment. Journal of Geoscience Education, 63(1), 73–85. https://doi.org/10.5408/14-028.1.

  83. MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199.

  84. Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: the eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1–20. https://doi.org/10.18637/jss.v020.i09.

  85. Manlove, S., Lazonder, A. W., & de Jong, T. (2006). Regulative support for collaborative scientific inquiry learning. Journal of Computer Assisted Learning, 22(2), 87–98.

  86. Mari, L., Maul, A., Irribarra, D. T., & Wilson, M. (2016). A meta-structural understanding of measurement. Journal of Physics: Conference Series, 772, 012009. IOP Publishing.

  87. Mari, L., Maul, A., Torres Irribarra, D., & Wilson, M. (2017). Quantities, quantification, and the necessary and sufficient conditions for measurement. Measurement, 100, 115–121.

  88. Masters, G. N. (1988). Item discrimination: when more is worse. Journal of Educational Measurement, 25(1), 15–29. https://doi.org/10.1111/j.1745-3984.1988.tb00288.x.

  89. Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51–69. https://doi.org/10.1080/15366367.2017.1348108.

  90. Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680.

  91. Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075.

  92. Mayer, D., Sodian, B., Körber, S., & Schwippert, K. (2014). Scientific reasoning in elementary school children: assessment and relations with cognitive abilities. Learning and Instruction, 29, 43–55. https://doi.org/10.1016/j.learninstruc.2013.07.005.

  93. Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement, 14(3), 283–298. https://doi.org/10.1177/014662169001400306.

  94. Michell, J. (2000). Normal science, pathological science and psychometrics. Theory & Psychology, 10(5), 639–667.

  95. Mokken, R. J. (1971). A theory and procedure of scale analysis: with applications in political research. Walter de Gruyter.

  96. Molenaar, I. W. (2001). Thirty years of nonparametric item response theory. Applied Psychological Measurement, 25(3), 295–299. https://doi.org/10.1177/01466210122032091.

  97. Morris, B. J., Croker, S., Masnick, A., & Zimmerman, C. (2012). The emergence of scientific reasoning. In Current topics in children’s learning and cognition. Rijeka, Croatia: InTech.

  98. Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks.

  99. Mullis, I. V., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., … O’Connor, K. M. (2003). TIMSS trends in mathematics and science study: assessment frameworks and specifications 2003.

  100. Musek, J. (2007). A general factor of personality: evidence for the big one in the five-factor model. Journal of Research in Personality, 41(6), 1213–1233. https://doi.org/10.1016/j.jrp.2007.02.003.

  101. National Assessment Governing Board. (2007). Science assessment and item specifications for the 2009 national assessment of educational progress. Washington: National Assessment Governing Board.

  102. Nowak, K. H., Nehring, A., Tiemann, R., & Upmeier zu Belzen, A. (2013). Assessing students’ abilities in processes of scientific inquiry in biology using a paper-and-pencil test. Journal of Biological Education, 47(3), 182–188. https://doi.org/10.1080/00219266.2013.822747.

  103. OECD. (2006). Assessing scientific, reading and mathematical literacy: a framework for PISA 2006. Paris: Organisation for Economic Co-operation and Development.

  104. Opitz, A., Heene, M., & Fischer, F. (2017). Measuring scientific reasoning—a review of test instruments. Educational Research and Evaluation, 23(3-4), 78–101.

  105. Pant, H. A., Stanat, P., Schroeders, U., Roppelt, A., Siegle, T., Pohlmann, C., & Institut zur Qualitätsentwicklung im Bildungswesen (Eds.). (2013). IQB-Ländervergleich 2012: Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I [IQB national assessment study 2012: competencies in mathematics and science at the end of lower secondary level]. Münster: Waxmann.

  106. Peirce, C. S. (2012). Philosophical writings of Peirce. Courier Corporation.

  107. Piaget, J., & Inhelder, B. (1958). The growth of logical thinking from childhood to adolescence: an essay on the construction of formal operational structures. Abingdon, Oxon: Routledge.

  108. Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45(1), 45–72. https://doi.org/10.1080/00273170903504729.

  109. Raiche, G., & Raiche, M. G. (2009). The irtProb package.

  110. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.

  111. Raykov, T. & Marcoulides, G. A. (2011). Introduction to psychometric theory. Routledge.

  112. R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. Accessed 15 Sept 2013.

  113. Reckase, M. (2009). Multidimensional item response theory. Springer.

  114. Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555.

  115. Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95(2), 129–140. https://doi.org/10.1080/00223891.2012.725437.

  116. Renkl, A. (2012). Modellierung von Kompetenzen oder von interindividuellen Kompetenzunterschieden [Modeling of competencies or of interindividual differences in competencies]. Psychologische Rundschau, 63(1), 50–53.

  117. Revelle, W. (2004). An introduction to psychometric theory with applications in R. Springer.

  118. Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367.

  119. Robitzsch, A. (2016). Essays zu methodischen Herausforderungen im Large-Scale Assessment [Essays on methodological challenges in large-scale assessment]. Humboldt-Universität zu Berlin.

  120. Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2014). CDM: cognitive diagnosis modeling. R package version 3.

  121. Rosseel, Y., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., ... Barendse, M., et al. (2017). Package lavaan.

  122. Rost, J., Carstensen, C., & von Davier, M. (1997). Applying the mixed Rasch model to personality questionnaires. In Applications of latent trait and latent class models in the social sciences (pp. 324–332).

  123. Schommer, M., Calvert, C., Gariglietti, G., & Bajaj, A. (1997). The development of epistemological beliefs among secondary students: a longitudinal study. Journal of Educational Psychology, 89(1), 37–40. https://doi.org/10.1037/0022-0663.89.1.37.

  124. Siersma, V., & Eusebi, P. (2013). Analysis with repeatedly measured binary item response data by ad hoc Rasch scales. In Rasch models in health (pp. 257–276). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch14.

  125. Sijtsma, K. (2011). Review. Measurement, 44(7), 1209–1219. https://doi.org/10.1016/j.measurement.2011.03.019.

  126. Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809. https://doi.org/10.1177/0959354312454353.

  127. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

  128. Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33. https://doi.org/10.1186/1471-2288-8-33.

  129. Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66–78.

  130. Smith, R. M., & Suh, K. K. (2003). Rasch fit statistics as a test of the invariance of item parameter estimates. Journal of Applied Measurement, 4(2), 153–163.

  131. Sodian, B., & Bullock, M. (2008). Scientific reasoning: where are we now? Cognitive Development, 23(4), 431–434. https://doi.org/10.1016/j.cogdev.2008.09.003.

  132. Sodian, B., Zaitchik, D., & Carey, S. (1991). Young children’s differentiation of hypothetical beliefs from evidence. Child Development, 62(4), 753–766. https://doi.org/10.1111/j.1467-8624.1991.tb01567.x.

  133. Stewart, I. (2008). Nature’s numbers: the unreal reality of mathematics. NY: Basic Books.

  134. Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: a new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. https://doi.org/10.1007/s11336-013-9388-3.

  135. Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51(4), 567–577. https://doi.org/10.1007/BF02295596.

  136. Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554.

  137. Thurstone, L. L., & Chave, E. J. (1954). The measurement of attitude. Chicago: University of Chicago Press.

  138. Toulmin, S. (1974). Human understanding (Vol. 1).

  139. Van der Ark, L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20, 1–19.

  140. van der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M., & Raijmakers, M. E. (2006). A dynamical model of general intelligence: the positive manifold of intelligence by mutualism. Psychological Review, 113(4), 842–861. https://doi.org/10.1037/0033-295X.113.4.842.

  141. van Bork, R., Epskamp, S., Rhemtulla, M., Borsboom, D., & van der Maas, H. L. (2017). What is the p-factor of psychopathology? Some risks of general factor modeling. Theory & Psychology, 27(6), 759–773.

  142. Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle of parsimony. In The Oxford handbook of computational and mathematical psychology (p. 300).

  143. Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16(1), 44–62.

  144. von Davier, M. (2001). WINMIRA 2001 [Computer software]. St. Paul, MN: Assessment Systems Corporation.

  145. Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: a study of conceptual change in childhood. Cognitive Psychology, 24(4), 535–585. https://doi.org/10.1016/0010-0285(92)90018-W.

  146. Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105.

  147. Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.

  148. Whitely, S. E. (1983). Construct validity: construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.

  149. Wilkening, F., & Sodian, B. (2005). Scientific reasoning in young children: introduction. Swiss Journal of Psychology, 64(3), 137–139. https://doi.org/10.1024/1421-0185.64.3.137.

  150. Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: comparison with the classical test theory approach. Health Education Research, 21(Supplement 1), i19–i32.

  151. Wright, B. D. (1979). Best test design. Chicago, IL: MESA Press.

  152. Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.

  153. Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18, 976–978.

  154. Wu, M. L. (2007). ACER ConQuest version 2.0: generalised item response modelling software. Camberwell, Vic.: ACER Press.

  155. Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20(1), 99–149. https://doi.org/10.1006/drev.1999.0497.

  156. Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27(2), 172–223. https://doi.org/10.1016/j.dr.2006.12.001.

  157. Zimmerman, C., & Klahr, D. (2018). Development of scientific thinking. In J. T. Wixted (Ed.), Stevens' handbook of experimental psychology and cognitive neuroscience (pp. 1–25). Hoboken: John Wiley & Sons, Inc.

Author information

Corresponding author

Correspondence to Peter A. Edelsbrunner.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Edelsbrunner, P.A., Dablander, F. The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues. Educ Psychol Rev 31, 1–34 (2019). https://doi.org/10.1007/s10648-018-9455-5

Keywords

  • Scientific reasoning
  • Psychometrics
  • Review
  • Rasch model
  • Item response theory