
The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues

  • REVIEW ARTICLE
  • Published in Educational Psychology Review

Abstract

Psychometric modeling has become a frequently used statistical tool in research on scientific reasoning. We review psychometric modeling practices in this field, including model choice, model testing, and the inferences researchers draw from their models. A review of 11 empirical research studies reveals that the predominant psychometric approach is Rasch modeling with a focus on item fit statistics, applied in a manner closely resembling the practices of national and international large-scale educational assessment programs. This approach is common in the educational assessment community and rooted in subtle philosophical views on measurement. We find, however, that researchers tend to draw interpretations that lie outside the inferential domain of this approach and do not accord with its practices and inferential purposes. In some of the reviewed articles, researchers rely on item infit statistics to assess dimensionality. Item infit statistics, however, cannot be regarded as a valid indicator of the dimensionality of scientific reasoning. Using simulations as illustration, we argue that this practice is limited in delivering psychological insights; in fact, various recent inferences about the structure, cognitive basis, and correlates of scientific reasoning might be unwarranted. To harness the full potential of psychometric modeling, we make suggestions towards adjusting psychometric modeling practices to the psychological and educational questions at hand.


Notes

  1. The research reviewed here uses the Rasch model for dichotomous data; the model is sketched formally after these notes. Readers interested in the theory and application of Rasch models for polytomous data are referred to Anderson et al. (2007).

  2. Further item response models include additional parameters, representing for example item-specific guessing probabilities (giving the right answer by chance on, for example, multiple-choice tests) and item-specific slipping probabilities (failing to give the right answer despite sufficient ability because, for example, items contain distracting elements; Revelle 2004; Thissen and Steinberg 1986). These extensions are included in the model sketch after these notes.

  3. Sometimes the two schools are referred to as two paradigms because they differ so strongly in their theoretical assumptions (Andrich 2004). Here, we prefer to call the two positions schools because their similarities and differences are not as clear-cut as sometimes assumed (Robitzsch 2016). In addition, Kuhn’s (T. S. Kuhn 1970) concept of paradigms and its relation to the development of science has been contested (Bird 2013; Toulmin 1974); we therefore deem the concept of the two paradigms described by Andrich (2004) not yet sufficiently elaborated and critically examined to be adopted.

  4. Classical test theory does not formally explicate a model relating person and item characteristics to item responses. Psychometric modeling does explicate such a model, for example the Rasch model, and thus allows testing its underlying assumptions. The distinction between classical test theory and modern psychometric modeling is, however, not clear-cut (Holland and Hoskens 2003); factor analysis used to be regarded as an instrument of classical test theory, yet in its confirmatory versions it is closely related to the psychometric models discussed here (Gebhardt 2016).

  5. One could argue that large-scale assessment programs also represent a particular type of research, especially because the data gathered in these studies are often used for ancillary research. However, the major aim of all of these programs is policy-informing assessment; their methodology is thus aligned with this aim rather than with advancing scientific theory.

  6. Research related to scientific reasoning is also conducted under the labels scientific inquiry and scientific thinking. For terminological coherence, these and related terms are subsumed under scientific reasoning in the description of the reviewed studies.

  7. Notably, these packages are all commercial software; none of the studies used free software packages such as those available in the R software environment (for example the TAM package, which encompasses a broad variety of psychometric models; Kiefer et al. 2016). A brief example of such an analysis is sketched after these notes.

  8. Items with low infit are not damaging to the estimation of a person’s ability; rather, they are removed when constructing an instrument to reduce the item pool from, say, 100 to 20 items (Linacre and Wright 1994). We thank a reviewer for pointing this out.

  9. It should be noted that other researchers have evaluated and discussed item infit statistics, albeit with different aims (Christensen and Kreiner 2013; Heene et al. 2014; Smith et al. 2008; Smith et al. 1998; R. M. Smith and Suh 2003). The present simulations serve as a demonstration rather than as generalizable statistical evidence; the simulation conditions were therefore tailored to reflect the main characteristics of the reviewed studies. A minimal sketch of this simulation logic follows these notes.

  10. The number of items was simulated to be equal across dimensions; under deviations from this scenario, item fit statistics would likely flag more items. However, an equal or almost equal number of items across dimensions is the regular scenario in psychometric studies and in the reviewed literature.
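To make the models referred to in Notes 1 and 2 concrete, the following equations sketch the dichotomous Rasch model and its common extensions. The notation (ability θ_p, difficulty b_i, discrimination a_i, guessing c_i, slipping d_i) follows general IRT conventions and is our addition, not taken from any reviewed study.

    % Dichotomous Rasch model: the probability that person p solves item i
    % depends only on the difference between ability and item difficulty.
    P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}

    % Extensions described in Note 2 add item-specific discrimination a_i (2PL),
    % guessing c_i (3PL), and slipping d_i (4PL) parameters:
    P(X_{pi} = 1 \mid \theta_p) = c_i + (d_i - c_i)\,
      \frac{\exp\{a_i(\theta_p - b_i)\}}{1 + \exp\{a_i(\theta_p - b_i)\}}

The Rasch model is the special case a_i = 1, c_i = 0, d_i = 1 for all items, which is what makes its item and person parameters separable.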
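Note 7 mentions the freely available TAM package for R. As a minimal, hedged sketch of what such an analysis could look like (toy simulated data, not data from any reviewed study), a dichotomous Rasch model with item infit statistics can be fitted as follows:

    # Minimal sketch: Rasch analysis with the TAM package (Kiefer et al. 2016).
    # The response matrix `resp` is simulated toy data, not from a reviewed study.
    library(TAM)

    set.seed(123)
    n_persons <- 500
    n_items   <- 20
    theta <- rnorm(n_persons)                  # person abilities
    b     <- seq(-2, 2, length.out = n_items)  # item difficulties
    prob  <- plogis(outer(theta, b, "-"))      # Rasch response probabilities
    resp  <- matrix(rbinom(n_persons * n_items, 1, prob), n_persons, n_items)

    mod <- TAM::tam.mml(resp)   # marginal maximum likelihood estimation
    fit <- TAM::tam.fit(mod)    # item fit statistics
    fit$itemfit                 # infit and outfit mean squares per item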
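Notes 9 and 10 describe the article’s simulations only verbally. The sketch below is our own minimal reconstruction of their general logic (two uncorrelated dimensions with equal numbers of items, analyzed with a unidimensional Rasch model), not the authors’ simulation code:

    # Hedged sketch of the demonstration logic in Notes 9-10: data are generated
    # from TWO uncorrelated dimensions but analyzed with a ONE-dimensional Rasch
    # model; item infit is then inspected as a (weak) dimensionality check.
    library(TAM)

    set.seed(42)
    n_persons <- 1000
    n_per_dim <- 10                              # equal items per dimension (Note 10)
    theta1 <- rnorm(n_persons)                   # dimension 1 abilities
    theta2 <- rnorm(n_persons)                   # dimension 2, uncorrelated with 1
    b <- seq(-1.5, 1.5, length.out = n_per_dim)  # same difficulties per dimension

    prob <- cbind(plogis(outer(theta1, b, "-")),  # items 1-10 load on dimension 1
                  plogis(outer(theta2, b, "-")))  # items 11-20 load on dimension 2
    resp <- matrix(rbinom(length(prob), 1, prob), nrow = n_persons)

    mod <- TAM::tam.mml(resp)                    # misspecified unidimensional model
    fit <- TAM::tam.fit(mod)
    fit$itemfit                                  # inspect infit/outfit per item

In runs of sketches like this, infit values tend to remain close to 1 despite the two-dimensional data, which is the pattern underlying the article’s argument that item infit cannot serve as a valid indicator of dimensionality.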

References

  • Ainley, J., Fraillon, J., & Freeman, C. (2007). National assessment program—ICT literacy years 6 & 10 report, 2005. Ministerial Council on Education, Employment, Training and Youth Affairs.

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.

  • Andersen, E. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38(1), 123–140. https://doi.org/10.1007/BF02291180.

  • Anderson, C. J., Li, Z., & Vermunt, J. K. (2007). Estimation of models in a Rasch family for polytomous items and multiple latent variables. Journal of Statistical Software, 20(6), 1–36. https://doi.org/10.18637/jss.v020.i06.

  • Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42(Supplement), I-7. https://doi.org/10.1097/01.mlr.0000103528.48582.7c.

  • Andrich, D. (2011). Rating scales and Rasch measurement. Expert Review of Pharmacoeconomics & Outcomes Research, 11(5), 571–585. https://doi.org/10.1586/erp.11.59.

  • Baird, J.-A., Andrich, D., Hopfenbeck, T. N., & Stobart, G. (2017). Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3), 317–350. https://doi.org/10.1080/0969594X.2017.1319337.

  • Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). A new lease of life for Thomson’s bonds model of intelligence. Psychological Review, 116(3), 567–579.

  • Bartolucci, F., Bacci, S., & Gnaldi, M. (2014). MultiLCIRT: an R package for multidimensional latent class item response models. Computational Statistics & Data Analysis, 71, 971–985. https://doi.org/10.1016/j.csda.2013.05.018.

  • Bird, A. (2013). Thomas Kuhn. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Fall 2013 ed.). Metaphysics Research Lab, Stanford University.

  • Bond, T., & Fox, C. M. (2015). Applying the Rasch model: fundamental measurement in the human sciences. Routledge.

  • Bonifay, W., Lane, S. P., & Reise, S. P. (2016). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186. https://doi.org/10.1177/2167702616657069.

  • Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. Dordrecht: Springer Netherlands.

  • Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research & Perspective, 6(1-2), 25–53. https://doi.org/10.1080/15366360802035497.

  • Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3), 345–370. https://doi.org/10.1007/BF02294361.

  • Brown, N. J., & Wilson, M. (2011). A model of cognition: the missing cornerstone of assessment. Educational Psychology Review, 23(2), 221–234.

  • Brown, N. J., Furtak, E. M., Timms, M., Nagashima, S. O., & Wilson, M. (2010). The evidence-based reasoning framework: assessing scientific reasoning. Educational Assessment, 15(3-4), 123–141.

  • Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: John Wiley.

  • Bürkner, P. C. (2017). brms: an R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01.

  • Cano, F. (2005). Epistemological beliefs and approaches to learning: their change through secondary school and their influence on academic performance. British Journal of Educational Psychology, 75(2), 203–221. https://doi.org/10.1348/000709904X22683.

  • Carey, S. (1992). The origin and evolution of everyday concepts. Minneapolis: University of Minnesota Press.

  • Caspi, A., Houts, R. M., Belsky, D. W., Goldman-Mellor, S. J., Harrington, H., Israel, S., Meier, M. H., Ramrakha, S., Shalev, I., Poulton, R., & Moffitt, T. E. (2014). The p factor: one general psychopathology factor in the structure of psychiatric disorders? Clinical Psychological Science, 2(2), 119–137. https://doi.org/10.1177/2167702613497473.

  • Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered reports: realigning incentives in scientific publishing. Cortex, 66(3), A1–A2. https://doi.org/10.1016/j.cortex.2012.12.016.

  • Chen, Z., & Klahr, D. (1999). All other things being equal: acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098–1120. https://doi.org/10.1111/1467-8624.00081.

  • Christensen, K. B., & Kreiner, S. (2013). Item fit statistics. In Rasch models in health (pp. 83–104). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch5.

  • Conway, A. R., & Kovacs, K. (2015). New and emerging models of human intelligence. Wiley Interdisciplinary Reviews: Cognitive Science, 6(5), 419–426. https://doi.org/10.1002/wcs.1356.

  • Cullen, L. T. (2012). Rasch models: foundations, recent developments, and applications. Springer.

  • Davier, M. von, & Carstensen, C. H. (2007). Multivariate and mixture distribution Rasch models: extensions and applications. New York: Springer.

  • De Groot, A. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001.

  • de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice options. Applied Psychological Measurement, 33(3), 163–183. https://doi.org/10.1177/0146621608320523.

  • Deary, I. J., Wilson, J. A., Carding, P. N., MacKenzie, K., & Watson, R. (2010). From dysphonia to dysphoria: Mokken scaling shows a strong, reliable hierarchy of voice symptoms in the Voice Symptom Scale questionnaire. Journal of Psychosomatic Research, 68(1), 67–71. https://doi.org/10.1016/j.jpsychores.2009.06.008.

  • Dewey, J. (1910). How we think. Boston, MA: DC Heath.

  • Dickison, P., Luo, X., Kim, D., Woo, A., Muntean, W., & Bergstrom, B. (2016). Assessing higher-order cognitive constructs by using an information-processing framework. Journal of Applied Testing Technology, 17, 1–19.

  • Divgi, D. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283–298. https://doi.org/10.1111/j.1745-3984.1986.tb00251.x.

  • Donovan, J., Hutton, P., Lennon, M., O’Connor, G., & Morrissey, N. (2008a). National assessment program—science literacy year 6 school release materials, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs.

  • Donovan, J., Lennon, M., O’Connor, G., & Morrissey, N. (2008b). National assessment program—science literacy year 6 report, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs.

  • Engelhard, G., Jr. (2013). Invariant measurement: using Rasch models in the social, behavioral, and health sciences. New York: Routledge.

  • Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.

  • Esswein, J. L. (2010). Critical thinking and reasoning in middle school science education (Doctoral dissertation, The Ohio State University).

  • Finkelstein, L. (2003). Widely, strongly and weakly defined measurement. Measurement, 34(1), 39–48. https://doi.org/10.1016/S0263-2241(03)00018-6.

  • Fischer, F., Kollar, I., Ufer, S., Sodian, B., Hussmann, H., Pekrun, R., et al. (2014). Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education. Frontline Learning Research, 4, 28–45. https://doi.org/10.14786/flr.v2i2.96.

  • Fox, J.-P. (2010). Bayesian item response modeling: theory and applications. Springer Science & Business Media.

  • Gebhardt, E. (2016). Latent path models within an IRT framework (Doctoral dissertation).

  • Gelman, A., & Loken, E. (2014). The statistical crisis in science: data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American Scientist, 102(6), 460.

  • Gignac, G. E. (2016). On the evaluation of competing theories: a reply to van der Maas and Kan. Intelligence, 57, 84–86. https://doi.org/10.1016/j.intell.2016.03.006.

  • Glas, C. A., & Verhelst, N. D. (1995). Testing the Rasch model. In Rasch models (pp. 69–95). Springer.

  • Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10(4), 544–565. https://doi.org/10.1207/S15328007SEM1004_4.

  • Grube, C. R. (2010). Kompetenzen naturwissenschaftlicher Erkenntnisgewinnung [Competencies of scientific inquiry] (Doctoral dissertation, Universität Kassel).

  • Hambleton, R. K. (2000). Response to Hays et al. and McHorney and Cohen: emergence of item response modeling in instrument development and data analysis. Medical Care, 38, II-60. https://doi.org/10.1097/00005650-200009002-00009.

  • Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.

  • Hartig, J., & Frey, A. (2013). Sind Modelle der Item-Response-Theorie (IRT) das “Mittel der Wahl” für die Modellierung von Kompetenzen? [Are item response theory (IRT) models the “method of choice” for modeling competencies?] Zeitschrift für Erziehungswissenschaft, 16(S1), 47–51. https://doi.org/10.1007/s11618-013-0386-0.

  • Hartig, J., Klieme, E., & Leutner, D. (2008). Assessment of competencies in educational contexts. Hogrefe Publishing.

  • Hartmann, S., Upmeier zu Belzen, A., Krüger, D., & Pant, H. A. (2015). Scientific reasoning in higher education: constructing and evaluating the criterion-related validity of an assessment of preservice science teachers’ competencies. Zeitschrift für Psychologie, 223(1), 47–53. https://doi.org/10.1027/2151-2604/a000199.

  • Heene, M. (2006). Konstruktion und Evaluation eines Studierendenauswahlverfahrens für Psychologie an der Universität Heidelberg [Construction and evaluation of a student selection procedure for psychology at the University of Heidelberg] (Unpublished doctoral dissertation, University of Heidelberg).

  • Heene, M., Bollmann, S., & Bühner, M. (2014). Much ado about nothing, or much to do about something: effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245–249. https://doi.org/10.1027/1614-0001/a000146.

  • Heene, M., Kyngdon, A., & Sckopke, P. (2016). Detecting violations of unidimensionality by order-restricted inference methods. Frontiers in Applied Mathematics and Statistics, 2, 3.

  • Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149.

  • Humphry, S. (2011). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9(1), 1–24. https://doi.org/10.1080/15366367.2011.558442.

  • Jeon, M., Draney, K., & Wilson, M. (2015). A general saltus LLTM-R for cognitive assessments. In Quantitative psychology research (pp. 73–90). Springer. https://doi.org/10.1007/978-3-319-07503-7_5.

  • Kiefer, T., Robitzsch, A., & Wu, M. (2016). TAM: test analysis modules. R software package.

  • Kitchener, K. S. (1983). Cognition, metacognition, and epistemic cognition. Human Development, 26, 222–232.

  • Klahr, D. (2002). Exploring science: the cognition and development of discovery processes. The MIT Press.

  • Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48. https://doi.org/10.1207/s15516709cog1201_1.

  • Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216(2), 61–73.

  • Koller, I., Maier, M. J., & Hatzinger, R. (2015). An empirical power analysis of quasi-exact tests for the Rasch model. Methodology, 11(2), 45–54. https://doi.org/10.1027/1614-2241/a000090.

  • Körber, S., Mayer, D., Osterhaus, C., Schwippert, K., & Sodian, B. (2014). The development of scientific thinking in elementary school: a comprehensive inventory. Child Development, 86(1), 327–336. https://doi.org/10.1111/cdev.12298.

  • Körber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33(1), 57–72. https://doi.org/10.1111/bjdp.12067.

  • Kreiner, S., & Christensen, K. B. (2013). Overall tests of the Rasch model. In Rasch models in health (pp. 105–110). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch6.

  • Kremer, K., Specht, C., Urhahne, D., & Mayer, J. (2014). The relationship in biology between the nature of science and scientific inquiry. Journal of Biological Education, 48(1), 1–8. https://doi.org/10.1080/00219266.2013.788541.

  • Kuhn, D. (1989). Children and adults as intuitive scientists. Psychological Review, 96(4), 674–689. https://doi.org/10.1037/0033-295X.96.4.674.

  • Kuhn, D. (1991). The skills of argument. Cambridge University Press.

  • Kuhn, D., Iordanou, K., Pease, M., & Wirkala, C. (2008). Beyond control of variables: what needs to develop to achieve skilled scientific thinking? Cognitive Development, 23(4), 435–451. https://doi.org/10.1016/j.cogdev.2008.09.006.

  • Kuhn, D., & Pease, M. (2008). What needs to develop in the development of inquiry skills? Cognition and Instruction, 26(4), 512–559. https://doi.org/10.1080/07370000802391745.

  • Kuhn, D., Ramsey, S., & Arvidsson, T. S. (2015). Developing multivariable thinkers. Cognitive Development, 35, 92–110. https://doi.org/10.1016/j.cogdev.2014.11.003.

  • Kuhn, D., & Udell, W. (2003). The development of argument skills. Child Development, 74(5), 1245–1260. https://doi.org/10.1111/1467-8624.00605.

  • Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed., enlarged). International encyclopedia of unified science, Vol. 2, No. 2. Chicago: University of Chicago Press.

  • Kuo, C.-Y., Wu, H.-K., Jen, T.-H., & Hsu, Y.-S. (2015). Development and validation of a multimedia-based assessment of scientific inquiry abilities. International Journal of Science Education, 37(14), 2326–2357. https://doi.org/10.1080/09500693.2015.1078521.

  • Lehrer, R., & Schauble, L. (2000). Modeling in mathematics and science. In R. Glaser (Ed.), Advances in instructional psychology, Volume 5: Educational design and cognitive science (pp. 100–159). New Jersey: Lawrence Erlbaum.

  • Linacre, J. M. (2010). Two perspectives on the application of Rasch models. European Journal of Physical and Rehabilitation Medicine, 46, 309–310.

  • Linacre, J. M. (2012). A user’s guide to FACETS Rasch-model computer programs.

  • Linacre, J. M., & Wright, B. D. (1994). Dichotomous infit and outfit mean-square fit statistics. Rasch Measurement Transactions, 8(2), 260.

  • Linacre, J. M., & Wright, B. D. (2000). Winsteps. http://www.winsteps.com/index.htm. Accessed 1 Jan 2017.

  • Lou, Y., Blanchard, P., & Kennedy, E. (2015). Development and validation of a science inquiry skills assessment. Journal of Geoscience Education, 63(1), 73–85. https://doi.org/10.5408/14-028.1.

  • MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199.

  • Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: the eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1–20. https://doi.org/10.18637/jss.v020.i09.

  • Manlove, S., Lazonder, A. W., & de Jong, T. (2006). Regulative support for collaborative scientific inquiry learning. Journal of Computer Assisted Learning, 22(2), 87–98.

  • Mari, L., Maul, A., Irribarra, D. T., & Wilson, M. (2016). A meta-structural understanding of measurement. Journal of Physics: Conference Series, 772, 012009. IOP Publishing.

  • Mari, L., Maul, A., Torres Irribarra, D., & Wilson, M. (2017). Quantities, quantification, and the necessary and sufficient conditions for measurement. Measurement, 100, 115–121.

  • Masters, G. N. (1988). Item discrimination: when more is worse. Journal of Educational Measurement, 25(1), 15–29. https://doi.org/10.1111/j.1745-3984.1988.tb00288.x.

  • Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51–69. https://doi.org/10.1080/15366367.2017.1348108.

  • Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680.

  • Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075.

  • Mayer, D., Sodian, B., Körber, S., & Schwippert, K. (2014). Scientific reasoning in elementary school children: assessment and relations with cognitive abilities. Learning and Instruction, 29, 43–55. https://doi.org/10.1016/j.learninstruc.2013.07.005.

  • Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement, 14(3), 283–298. https://doi.org/10.1177/014662169001400306.

  • Michell, J. (2000). Normal science, pathological science and psychometrics. Theory & Psychology, 10(5), 639–667.

  • Mokken, R. J. (1971). A theory and procedure of scale analysis: with applications in political research. Walter de Gruyter.

  • Molenaar, I. W. (2001). Thirty years of nonparametric item response theory. Applied Psychological Measurement, 25(3), 295–299. https://doi.org/10.1177/01466210122032091.

  • Morris, B. J., Croker, S., Masnick, A., & Zimmerman, C. (2012). The emergence of scientific reasoning. In Current topics in children’s learning and cognition. Rijeka, Croatia: InTech.

  • Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O’Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks.

  • Mullis, I. V., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., … O’Connor, K. M. (2003). TIMSS trends in mathematics and science study: assessment frameworks and specifications 2003.

  • Musek, J. (2007). A general factor of personality: evidence for the big one in the five-factor model. Journal of Research in Personality, 41(6), 1213–1233. https://doi.org/10.1016/j.jrp.2007.02.003.

  • National Assessment Governing Board. (2007). Science assessment and item specifications for the 2009 national assessment of educational progress. Washington: National Assessment Governing Board.

  • Nowak, K. H., Nehring, A., Tiemann, R., & Upmeier zu Belzen, A. (2013). Assessing students’ abilities in processes of scientific inquiry in biology using a paper-and-pencil test. Journal of Biological Education, 47(3), 182–188. https://doi.org/10.1080/00219266.2013.822747.

  • OECD. (2006). Assessing scientific, reading and mathematical literacy: a framework for PISA 2006. Paris: Organisation for Economic Co-operation and Development.

  • Opitz, A., Heene, M., & Fischer, F. (2017). Measuring scientific reasoning—a review of test instruments. Educational Research and Evaluation, 23(3-4), 78–101.

  • Pant, H. A., Stanat, P., Schroeders, U., Roppelt, A., Siegle, T., Pohlmann, C., & Institut zur Qualitätsentwicklung im Bildungswesen (Eds.). (2013). IQB-Ländervergleich 2012: mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I [IQB national assessment study 2012: mathematical and scientific competencies at the end of lower secondary education]. Münster: Waxmann.

  • Peirce, C. S. (2012). Philosophical writings of Peirce. Courier Corporation.

  • Piaget, J., & Inhelder, B. (1958). The growth of logical thinking from childhood to adolescence: an essay on the construction of formal operational structures. Abingdon, Oxon: Routledge.

  • Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45(1), 45–72. https://doi.org/10.1080/00273170903504729.

  • Raiche, G., & Raiche, M. G. (2009). The irtProb package.

  • Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen & Lydiche.

  • Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. Routledge.

  • R Core Team (2013). R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

  • Reckase, M. (2009). Multidimensional item response theory. Springer.

  • Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555.

  • Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95(2), 129–140. https://doi.org/10.1080/00223891.2012.725437.

  • Renkl, A. (2012). Modellierung von Kompetenzen oder von interindividuellen Kompetenzunterschieden [Modeling competencies or interindividual differences in competencies]. Psychologische Rundschau, 63(1), 50–53.

  • Revelle, W. (2004). An introduction to psychometric theory with applications in R. Springer.

  • Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367.

  • Robitzsch, A. (2016). Essays zu methodischen Herausforderungen im Large-Scale Assessment [Essays on methodological challenges in large-scale assessment] (Doctoral dissertation, Humboldt-Universität zu Berlin).

  • Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2014). CDM: cognitive diagnosis modeling. R package version 3.

  • Rosseel, Y., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., … Barendse, M., et al. (2017). Package lavaan.

  • Rost, J., Carstensen, C., & von Davier, M. (1997). Applying the mixed Rasch model to personality questionnaires. In Applications of latent trait and latent class models in the social sciences (pp. 324–332).

  • Schommer, M., Calvert, C., Gariglietti, G., & Bajaj, A. (1997). The development of epistemological beliefs among secondary students: a longitudinal study. Journal of Educational Psychology, 89(1), 37–40. https://doi.org/10.1037/0022-0663.89.1.37.

  • Siersma, V., & Eusebi, P. (2013). Analysis with repeatedly measured binary item response data by ad hoc Rasch scales. In Rasch models in health (pp. 257–276). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch14.

  • Sijtsma, K. (2011). Review. Measurement, 44(7), 1209–1219. https://doi.org/10.1016/j.measurement.2011.03.019.

  • Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809. https://doi.org/10.1177/0959354312454353.

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

  • Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33. https://doi.org/10.1186/1471-2288-8-33.

  • Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66–78.

  • Smith, R. M., & Suh, K. K. (2003). Rasch fit statistics as a test of the invariance of item parameter estimates. Journal of Applied Measurement, 4(2), 153–163.

  • Sodian, B., & Bullock, M. (2008). Scientific reasoning: where are we now? Cognitive Development, 23(4), 431–434. https://doi.org/10.1016/j.cogdev.2008.09.003.

  • Sodian, B., Zaitchik, D., & Carey, S. (1991). Young children’s differentiation of hypothetical beliefs from evidence. Child Development, 62(4), 753–766. https://doi.org/10.1111/j.1467-8624.1991.tb01567.x.

  • Stewart, I. (2008). Nature’s numbers: the unreal reality of mathematics. NY: Basic Books.

  • Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: a new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. https://doi.org/10.1007/s11336-013-9388-3.

  • Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51(4), 567–577. https://doi.org/10.1007/BF02295596.

  • Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554.

  • Thurstone, L. L., & Chave, E. J. (1954). The measurement of attitude. Chicago: University of Chicago Press.

  • Toulmin, S. (1974). Human understanding (Vol. 1).

  • Van der Ark, L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20, 1–19.

  • Van der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M., & Raijmakers, M. E. (2006). A dynamical model of general intelligence: the positive manifold of intelligence by mutualism. Psychological Review, 113(4), 842–861. https://doi.org/10.1037/0033-295X.113.4.842.

  • van Bork, R., Epskamp, S., Rhemtulla, M., Borsboom, D., & van der Maas, H. L. (2017). What is the p-factor of psychopathology? Some risks of general factor modeling. Theory & Psychology, 27(6), 759–773.

  • Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle of parsimony. In The Oxford handbook of computational and mathematical psychology (p. 300).

  • Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16(1), 44–62.

  • von Davier, M. (2001). WINMIRA 2001 [Computer software]. St. Paul, MN: Assessment Systems Corporation.

  • Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: a study of conceptual change in childhood. Cognitive Psychology, 24(4), 535–585. https://doi.org/10.1016/0010-0285(92)90018-W.

  • Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105.

  • Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.

  • Whitely, S. E. (1983). Construct validity: construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.

  • Wilkening, F., & Sodian, B. (2005). Scientific reasoning in young children: introduction. Swiss Journal of Psychology, 64(3), 137–139. https://doi.org/10.1024/1421-0185.64.3.137.

  • Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: comparison with the classical test theory approach. Health Education Research, 21(Supplement 1), i19–i32.

  • Wright, B. D. (1979). Best test design. Chicago, IL: MESA Press.

  • Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.

  • Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18, 976–978.

  • Wu, M. L. (2007). ACER ConQuest version 2.0: generalised item response modelling software. Camberwell, Vic.: ACER Press.

  • Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20(1), 99–149. https://doi.org/10.1006/drev.1999.0497.

  • Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27(2), 172–223. https://doi.org/10.1016/j.dr.2006.12.001.

  • Zimmerman, C., & Klahr, D. (2018). Development of scientific thinking. In J. T. Wixted (Ed.), Stevens’ handbook of experimental psychology and cognitive neuroscience (pp. 1–25). Hoboken: John Wiley & Sons, Inc.


Author information


Correspondence to Peter A. Edelsbrunner.



About this article


Cite this article

Edelsbrunner, P.A., Dablander, F. The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues. Educ Psychol Rev 31, 1–34 (2019). https://doi.org/10.1007/s10648-018-9455-5

