Psychometric modeling has become a frequently used statistical tool in research on scientific reasoning. We review psychometric modeling practices in this field, including model choice, model testing, and researchers’ inferences based on their psychometric practices. A review of 11 empirical research studies reveals that the predominant psychometric approach is Rasch modeling with a focus on itemfit statistics, applied in a way strongly similar to practices in national and international large-scale educational assessment programs. This approach is common in the educational assessment community and rooted in subtle philosophical views on measurement. However, we find that based on this approach, researchers tend to draw interpretations that are not within the inferential domain of this specific approach and not in accordance with the related practices and inferential purposes. In some of the reviewed articles, researchers put emphasis on item infit statistics for dimensionality assessment. Item infit statistics, however, cannot be regarded as a valid indicator of the dimensionality of scientific reasoning. Using simulations as illustration, we argue that this practice is limited in delivering psychological insights; in fact, various recent inferences about the structure, cognitive basis, and correlates of scientific reasoning might be unwarranted. In order to harness its full potential, we make suggestions towards adjusting psychometric modeling practices to the psychological and educational questions at hand.
This is a preview of subscription content, log in to check access.
Buy single article
Instant access to the full article PDF.
Tax calculation will be finalised during checkout.
Subscribe to journal
Immediate online access to all issues from 2019. Subscription will auto renew annually.
Tax calculation will be finalised during checkout.
The research in focus of this review uses the Rasch model for dichotomous data. Readers interested in the theory and application of Rasch models for polytomous data are referred to Anderson et al. (2007).
Further item response models include yet more parameters, which represent for example item-specific guessing probabilities (giving the right answer by chance on, for example, multiple choice tests) and item-specific slipping probabilities (not giving the right answer by chance because, for example, items have different distracting elements; Revelle 2004; Thissen and Steinberg 1986).
Sometimes the two schools are referred to as two paradigms because they differ so strongly in their theoretical assumptions (Andrich 2004). Here, we prefer to call the two positions schools because their similarities and differences are not as clear-cut as sometimes assumed (Robitzsch 2016), and in addition, Kuhn’s (T. S. Kuhn 1970) concept of paradigms and their relation to development in science has been contested (Bird 2013; Toulmin 1974); thus, we deem the concept of the two paradigms described by Andrich (2004) not yet sufficiently elaborated and critically reflected to be accepted.
Classical test theory does not formally explicate a model to explain the relation between person and item characteristics and item responses. Psychometric modeling does explicate such a model, for example the Rasch model, and thus allows testing its underlying assumptions. The distinction between classical test theory and modern psychometric modeling is, however, not clear-cut (Holland and Hoskens 2003); factor analysis used to be regarded as an instrument of classical test theory, which, in its confirmatory versions, is closely related to the psychometric models discussed here (Gebhardt 2016).
One could argue that large-scale assessment programs also represent a particular type of research, particularly because data gathered in these studies are often used for ancillary research. However, the major aim of all of these programs is policy-informing assessment. Thus, their methodology is aligned with this aim instead of advancing scientific theory.
Research related to scientific reasoning is also conducted under the terms of scientific inquiry and scientific thinking. For terminological coherence, in the description of the reviewed studies, these and related terms are described as scientific reasoning.
Notably, these packages are all commercial software, and in none of the studies, free software packages such as those available in the R software environment (for example the TAM package, which encompasses a broad variety of psychometric models; Kiefer et al. 2016) have been used.
Items with low infit are not damaging to estimating a person’s ability, but rather are removed when constructing an instrument to reduce the item pool from, say, 100 to 20 items (Linacre and Wright 1994). We thank a reviewer for pointing this out.
It should be noted that other researchers have evaluated and discussed item infit statistics, however, with different aims (Christensen and Kreiner 2013; Heene et al. 2014; Smith et al. 2008; Smith et al. 1998; R. M. Smith and Suh 2003). The present simulations have a demonstration purpose, rather than the purpose of exhibiting generalizable statistical insights. The simulation conditions were therefore tailored towards reflecting the main characteristics of the reviewed studies.
The number of items was simulated to be equal across dimensions; under deviations from this scenario itemfit statistics likely tag more items. However, an equal or almost equal number of items across dimensions is the regular scenario in psychometric studies and the reviewed literature.
Ainley, J., Fraillon, J., & Freeman, C. (2007). National assessment program—ICT literacy years 6 & 10 report, 2005. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.
Andersen, E. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38(1), 123–140. https://doi.org/10.1007/BF02291180.
Anderson, C. J., Li, Z., & Vermunt, J. K. (2007). Estimation of models in a Rasch family for polytomous items and multiple latent variables. Journal of Statistical Software, 20(6), 1–36. https://doi.org/10.18637/jss.v020.i06.
Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42(Supplement), I–7. https://doi.org/10.1097/01.mlr.0000103528.48582.7c.
Andrich, D. (2011, October). Rating scales and Rasch measurement. Expert Review of Pharmacoeconomics & Outcomes Research, 11(5), 571–585. https://doi.org/10.1586/erp.11.59.
Baird, J.-A., Andrich, D., Hopfenbeck, T. N., & Stobart, G. (2017). Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3), 317–350. https://doi.org/10.1080/0969594X.2017.1319337.
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). A new lease of life for thomson’s bonds model of intelligence. Psychological Review, 116(3), 567–579.
Bartolucci, F., Bacci, S., & Gnaldi, M. (2014). MultiLCIRT: an R package for multidimensional latent class item response models. Computational Statistics & Data Analysis, 71, 971–985. https://doi.org/10.1016/j.csda.2013.05.018.
Bird, A. (2013). Thomas kuhn. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Fall 2013). Metaphysics Research Lab, Stanford University.
Bond, T. & Fox, C. M. (2015). Applying the rasch model: fundamental measurement in the human sciences. Routledge.
Bonifay, W., Lane, S. P., & Reise, S. P. (2016). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186. https://doi.org/10.1177/2167702616657069.
Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. Dordrecht: Springer Netherlands.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research & Perspective, 6(1-2), 25–53. https://doi.org/10.1080/15366360802035497.
Bozdogan, H. (1987). Model selection and akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3), 345–370. https://doi.org/10.1007/BF02294361.
Brown, N. J., & Wilson, M. (2011). A model of cognition: the missing cornerstone of assessment. Educational Psychology Review, 23(2), 221–234.
Brown, N. J., Furtak, E. M., Timms, M., Nagashima, S. O., & Wilson, M. (2010). The evidence-based reasoning framework: assessing scientific reasoning. Educational Assessment, 15(3-4), 123–141.
Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. 1956. New York: John Wiley.
Bürkner, P. C. (2017). brms: an R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01.
Cano, F. (2005). Epistemological beliefs and approaches to learning: their change through secondary school and their influence on academic performance. British Journal of Educational Psychology, 75(2), 203–221. https://doi.org/10.1348/000709904X22683.
Carey, S. (1992). The origin and evolution of everyday concepts. University of Minnesota Press, Minneapolis.
Caspi, A., Houts, R. M., Belsky, D. W., Goldman-Mellor, S. J., Harrington, H., Israel, S., Meier, M. H., Ramrakha, S., Shalev, I., Poulton, R., & Moffitt, T. E. (2014). The p factor: one general psychopathology factor in the structure of psychiatric disorders? Clinical Psychological Science, 2(2), 119–137. https://doi.org/10.1177/2167702613497473.
Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered reports: realigning incentives in scientific publishing. Cortex, 66(3), A1–A2. https://doi.org/10.1016/j.cortex.2012.12.016.
Chen, Z., & Klahr, D. (1999). All other things being equal: acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098–1120. https://doi.org/10.1111/1467-8624.00081.
Christensen, K. B. & Kreiner, S. (2013). Item fit statistics. In Rasch models in health (pp. 83–104). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch5.
Conway, A. R., & Kovacs, K. (2015). New and emerging models of human intelligence. Wiley Interdisciplinary Reviews: Cognitive Science, 6(5), 419–426. https://doi.org/10.1002/wcs.1356.
Cullen, L. T. (2012). Rasch models: foundations, recent developments, and applications. [S.l.]: Springer.
Davier, M. v., & Carstensen, C. H. (2007). Multivariate and mixture distribution rasch models extensions and applications. New York: Springer.
De Groot, A. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han Lj Van Der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001.
de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice options. Applied Psychological Measurement, 33(3), 163–183. https://doi.org/10.1177/0146621608320523.
Deary, I. J., Wilson, J. A., Carding, P. N., MacKenzie, K., & Watson, R. (2010). From dysphonia to dysphoria: Mokken scaling shows a strong, reliable hierarchy of voice symptoms in the Voice Symptom Scale questionnaire. Journal of Psychosomatic Research, 68(1), 67–71. https://doi.org/10.1016/j.jpsychores.2009.06.008.
Dewey, J. (1910). How we think. Boston, MA: DC Heath.
Dickison, P., Luo, X., Kim, D., Woo, A., Muntean, W., & Bergstrom, B. (2016). Assessing higher-order cognitive constructs by using an information-processing framework. Journal of Applied Testing Technology, 17, 1–19.
Divgi, D. (1986). Does the rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283–298. https://doi.org/10.1111/j.1745-3984.1986.tb00251.x.
Donovan, J., Hutton, P., Lennon, M., O’Connor, G., & Morrissey, N. (2008a). National assessment program—science literacy year 6 school release materials, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
Donovan, J., Lennon, M., O’connor, G., & Morrissey, N. (2008b). National assessment program–science literacy year 6 report, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
Engelhard Jr, G. (2013). Invariant measurement: using Rasch models in the social, behavioral, and health sciences. New York: Routledge.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.
Esswein, J. L. (2010). Critical thinking and reasoning in middle school science education (Doctoral dissertation, The Ohio State University).
Finkelstein, L. (2003). Widely, strongly and weakly defined measurement. Measurement, 34(1), 39–48. https://doi.org/10.1016/S0263-2241(03)00018-6.
Fischer, F., Kollar, I., Ufer, S., Sodian, B., Hussmann, H., Pekrun, R., et al. (2014). Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education. Frontline Learning Research, 4, 28–45. https://doi.org/10.14786/flr.v2i2.96.
Fox, J.-P. (2010). Bayesian item response modeling: theory and applications. Springer Science & Business Media.
Gebhardt, E. (2016). Latent path models within an irt framework (Doctoral dissertation).
Gelman, A., & Loken, E. (2014). The statistical crisis in science data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American Scientist, 102(6), 460.
Gignac, G. E. (2016, July). On the evaluation of competing theories: a reply to van der Maas and Kan. Intelligence, 57, 84–86. https://doi.org/10.1016/j.intell.2016.03.006.
Glas, C. A. & Verhelst, N. D. (1995). Testing the Rasch model. In Rasch models (pp. 69–95). Springer.
Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10(4), 544–565. https://doi.org/10.1207/S15328007SEM1004_4.
Grube, C. R. (2010). Kompetenzen naturwissenschaftlicher Erkenntnisgewinnung [Competencies of scientific inquiry] (Doctoral dissertation, Universität Kassel).
Hambleton, R. K. (2000). Response to hays et al and McHorney and Cohen: emergence of item response modeling in instrument development and data analysis. Medical Care, 38, II–60. https://doi.org/10.1097/00005650-200009002-00009.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
Hartig, J., & Frey, A. (2013). Sind Modelle der Item-Response-Theorie (IRT) das “Mittel der Wahl ”für die Modellierung von Kompetenzen? Zeitschrift für Erziehungswissenschaft, 16(S1), 47–51. https://doi.org/10.1007/s11618-013-0386-0.
Hartig, J., Klieme, E., & Leutner, D. (2008). Assessment of competencies in educational contexts. Hogrefe Publishing.
Hartmann, S., Upmeier zu Belzen, A., Kroeger, D., & Pant, H. A. (2015, January). Scientific reasoning in higher education: constructing and evaluating the criterion-related validity of an assessment of preservice science teachers’ competencies. Zeitschrift fuer Psychologie, 223(1), 47–53. https://doi.org/10.1027/2151-2604/a000199.
Heene, M. (2006). Konstruktion und Evaluation eines Studierendenauswahlverfahrens für Psychologie an der Universität Heidelberg. Unpublished Doctoral Dissertation, University of Heidelberg.
Heene, M., Bollmann, S., & Buhner, M. (2014). Much ado about nothing, or much to do about something: effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245–249. https://doi.org/10.1027/1614-0001/a000146Heene,M.
Heene, M., Kyngdon, A., & Sckopke, P. (2016). Detecting violations of unidimensionality by order-restricted inference methods. Frontiers in Applied Mathematics and Statistics, 2, 3.
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149.
Humphry, S. (2011, January). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9(1), 1–24. https://doi.org/10.1080/15366367.2011.558442.
Jeon, M., Draney, K., & Wilson, M. (2015). A general saltus lltm-r for cognitive assessments. In Quantitative psychology research (pp. 73–90). Springer. https://doi.org/10.1007/978-3-319-07503-7_5.
Kiefer, T., Robitzsch, A., Wu, M., & Robitzsch, A. (2016). Package tam. R software package. Kitchner, K. S. (1983). Cognition, metacognition, and epistemic cognition. Human Development, 26, 222–232.
Klahr, D. (2002). Exploring science: the cognition and development of discovery processes. The MIT Press.
Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48. https://doi.org/10.1207/s15516709cog1201_1.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216(2), 61–73.
Koller, I., Maier, M. J., & Hatzinger, R. (2015). An empirical power analysis of quasi-exact tests for the rasch model. Methodology, 11(2), 45–54. https://doi.org/10.1027/1614-2241/a000090.
Körber, S., Mayer, D., Osterhaus, C., Schwippert, K., & Sodian, B. (2014, September). The development of scientific thinking in elementary school: a comprehensive inventory. Child Development, 86(1), 327–336. https://doi.org/10.1111/cdev.12298.
Körber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33(1), 57–72. https://doi.org/10.1111/bjdp.12067.
Kreiner, S. & Christensen, K. B. (2013). Overall tests of the rasch model. In Rasch models in health (pp. 105–110). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch6.
Kremer, K., Specht, C., Urhahne, D., & Mayer, J. (2014, January 2). The relationship in biology between the nature of science and scientific inquiry. Journal of Biological Education, 48(1), 1–8. https://doi.org/10.1080/00219266.2013.788541.
Kuhn, D. (1989). Children and adults as intuitive scientists. Psychological Review, 96(4), 674–689. https://doi.org/10.1037/0033-295X.96.4.674.
Kuhn, D. (1991). The skills of argument. Cambridge University Press.
Kuhn, D., Iordanou, K., Pease, M., & Wirkala, C. (2008). Beyond control of variables: what needs to develop to achieve skilled scientific thinking? Cognitive Development, 23(4), 435–451. https://doi.org/10.1016/j.cogdev.2008.09.006.
Kuhn, D., & Pease, M. (2008). What needs to develop in the development of inquiry skills? Cognition and Instruction, 26(4), 512–559. https://doi.org/10.1080/07370000802391745.
Kuhn, D., Ramsey, S., & Arvidsson, T. S. (2015, July). Developing multivariable thinkers. Cognitive Development, 35, 92–110. https://doi.org/10.1016/j.cogdev.2014.11.003.
Kuhn, D., & Udell, W. (2003). The development of argument skills. Child Development, 74(5), 1245–1260. https://doi.org/10.1111/1467-8624.00605.
Kuhn, T. S. (1970). The structure of scientific revolutions ([2d ed., enl). International encyclopedia of unified science. Foundations of the unity of science, v. 2, no. 2. Chicago: University of Chicago Press.
Kuo, C.-Y., Wu, H.-K., Jen, T.-H., & Hsu, Y.-S. (2015, September 22). Development and validation of a multimedia-based assessment of scientific inquiry abilities. International Journal of Science Education, 37(14), 2326–2357. https://doi.org/10.1080/09500693.2015.1078521.
Lehrer, R., & Schauble, L. (2000). Modeling in mathematics and science. In R. Glaser (Ed.), Advances in instructional psychology, Volume 5: Educational Design and Cognitive Science (pp. 100–159). New Jersey: Lawrence Erlbaum.
Linacre, J. M. (2010). Two perspectives on the application of rasch models. European Journal of Phsyciological Rehabilitaiton Medicine, 46, 309–310.
Linacre, J. M. (2012). A user’s guide to facets rasch-model computer programs.
Linacre, J. M., & Wright, B. D. (1994). Dichotomous infit and outfit mean-square fit statistics. Rasch Measurement Transactions, 8(2), 260.
Linacre, J. M. & Wright, B. D. (2000). Winsteps. URL: http://www.winsteps.com/index.htm [accessed 2017-01-01].
Lou, Y., Blanchard, P., & Kennedy, E. (2015). Development and validation of a science inquiry skills assessment. Journal of Geoscience Education, 63(1), 73–85. https://doi.org/10.5408/14-028.1.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199.
Mair, P., & Hatzinger, R. (2007). Extended rasch modeling: the erm package for the application of irt models in r. Journal of Statistical Software, 20(9), 1–20. https://doi.org/10.18637/jss.v020.i09.
Manlove, S., Lazonder, A. W., & Jong, T. D. (2006). Regulative support for collaborative scientific inquiry learning. Journal of Computer Assisted Learning, 22(2), 87–98.
Mari, L., Maul, A., Irribarra, D. T., & Wilson, M. (2016). A meta-structural understanding of measurement. In Journal of physics: conference series (Vol. 772, p. 012009). IOP Publishing.
Mari, L., Maul, A., Torres Irribarra, D., & Wilson, M. (2017). Quantities, Quantification, and the Necessary and Sufficient Conditions for Measurement. Measurement, 100, 115–121
Masters, G. N. (1988). Item discrimination: when more is worse. Journal of Educational Measurement, 25(1), 15–29. https://doi.org/10.1111/j.1745-3984.1988.tb00288.x.
Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51–69. https://doi.org/10.1080/15366367.2017.1348108.
Maydeu-Olivares, A. (2013, July). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680.
Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075.
Mayer, D., Sodian, B., Körber, S., & Schwippert, K. (2014, February). Scientific reasoning in elementary school children: assessment and relations with cognitive abilities. Learning and Instruction, 29, 43–55. https://doi.org/10.1016/j.learninstruc.2013.07.005.
Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement, 14(3), 283–298. https://doi.org/10.1177/014662169001400306.
Michell, J. (2000). Normal science, pathological science and psychometrics. Theory & Psychology, 10(5), 639–667.
Mokken, R. J. (1971). A theory and procedure of scale analysis: with applications in political research. Walter de Gruyter.
Molenaar, I. W. (2001). Thirty years of nonparametric item response theory. Applied Psychological Measurement, 25(3), 295–299. https://doi.org/10.1177/01466210122032091.
Morris, B. J., Croker, S., Masnick, A., & Zimmerman, C. (2012). The emergence of scientific reasoning. In Current topics in children’s learning and cognition. Rijeka, Croatia: InTech.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O’Sullivan, C. Y., & Preuschoff, C. (2009). Timss 2011 assessment frameworks.
Mullis, I. V., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., … O’Connor, K. M. (2003). TIMSS trends in mathematics and science study: assessment frameworks and specifications 2003.
Musek, J. (2007). A general factor of personality: evidence for the big one in the five-factor model. Journal of Research in Personality, 41(6), 1213–1233. https://doi.org/10.1016/j.jrp.2007.02.003.
National Assessment Governing Board. (2007). Science assessment and item specifications for the 2009 national assessment of educational progress. Washington: National Assessment Governing Board.
Nowak, K. H., Nehring, A., Tiemann, R., & Upmeier zu Belzen, A. (2013). Assessing students’ abilities in processes of scientific inquiry in biology using a paper-and-pencil test. Journal of Biological Education, 47(3), 182–188. https://doi.org/10.1080/00219266.2013.822747.
OECD. (2006). Assessing scientific, reading and mathematical literacy: a framework for PISA 2006. Paris: Organisation for Economic Co-operation and Development.
Opitz, A., Heene, M., & Fischer, F. (2017). Measuring scientific reasoning—a review of test instruments. Educational Research and Evaluation, 23(3-4), 78–101.
Pant, H. A., Stanat, P., Schroeders, U., Roppelt, A., Siegle, T., Pohlmann, C., & Institut zur Qualitätsentwicklung im Bildungswesen (Eds.). (2013). IQB ländervergleich 2012: mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe i. Munster: Waxmann.
Peirce, C. S. (2012). Philosophical writings of Peirce. Courier Corporation.
Piaget, J. & Inhelder, B. (1958). The growth of logical thinking from childhood to adolescence: an essay on the construction of formal operational structures. Abingdon, Oxon: Routledge.
Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45(1), 45–72. https://doi.org/10.1080/00273170903504729.
Raiche, G. & Raiche, M. G. (2009). The irtprob package.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.
Raykov, T. & Marcoulides, G. A. (2011). Introduction to psychometric theory. Routledge.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. Accessed 15 Sept 2013.
Reckase, M. (2009). Multidimensional item response theory. Springer.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555.
Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95(2), 129–140. https://doi.org/10.1080/00223891.2012.725437.
Renkl, A. (2012). Modellierung von Kompetenzen oder von interindividuellen Kompetenzunterschieden. Psychologische Rundschau., 63(1), 50–53.
Revelle, W. (2004). An introduction to psychometric theory with applications in r. Springer.
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367.
Robitzsch, A. (2016). Essays zu methodischen herausforderungen im large-scale assessment. Humboldt-Universität zu Berlin.
Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2014). Cdm: cognitive diagnosis modeling. R package version, 3.
Rosseel, Y., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., ... Barendse, M., et al. (2017). Package lavaan.
Rost, J., Carstensen, C., & Von Davier, M. (1997). Applying the mixed rasch model to personality questionnaires. Applications of latent trait and latent class models in the social sciences, 324–332.
Schommer, M., Calvert, C., Gariglietti, G., & Bajaj, A. (1997). The development of epistemological beliefs among secondary students: a longitudinal study. Journal of Educational Psychology, 89(1), 37–40. https://doi.org/10.1037/0022-06126.96.36.199.
Siersma, V. & Eusebi, P. (2013). Analysis with repeatedly measured binary item response data by ad hoc rasch scales. In Rasch models in health (pp. 257–276). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch14
Sijtsma, K. (2011). Review. Measurement, 44(7), 1209–1219. https://doi.org/10.1016/j.measurement.2011.03.019.
Sijtsma, K. (2012, December 1). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809. https://doi.org/10.1177/0959354312454353.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33. https://doi.org/10.1186/1471-2288-8-33.
Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66–78.
Smith, R. M., & Suh, K. K. (2003). Rasch fit statistics as a test of the invariance of item parameter estimates. Journal of Applied Measurement, 4(2), 153–163.
Sodian, B., & Bullock, M. (2008, October). Scientific reasoning where are we now? Cognitive Development, 23(4), 431–434. https://doi.org/10.1016/j.cogdev.2008.09.003.
Sodian, B., Zaitchik, D., & Carey, S. (1991). Young children’s differentiation of hypothetical beliefs from evidence. Child Development, 62(4), 753–766. https://doi.org/10.1111/j.1467-8624.1991.tb01567.x.
Stewart, I. (2008). Nature’s numbers: the unreal reality of mathematics. NY: Basic Books.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: a new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. https://doi.org/10.1007/s11336-013-9388-3.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51(4), 567–577. https://doi.org/10.1007/BF02295596.
Thurstone, L. L. [Louis L]. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529–554, 4.
Thurstone, L. L. Louis Leon & Chave, E. J. (1954). Chicago: Chicago University Press.
Toulmin, S. (1974). Human understanding, volume i.
Van der Ark, L. A., et al. (2007). Mokken scale analysis in r. Journal of Statistical Software, 20, 1–19.
Van Der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M., & Raijmakers, M. E. (2006). A dynamical model of general intelligence: the positive manifold of intelligence by mutualism. Psychological Review, 113(4), 842–861. https://doi.org/10.1037/0033-295X.113.4.842.
van Bork, R., Epskamp, S., Rhemtulla, M., Borsboom, D., & van der Maas, H. L. (2017). What is the p-factor of psychopathology? Some risks of general factor modeling. Theory & Psychology, 27(6), 759–773.
Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle. The Oxford handbook of computational and mathematical psychology, 300.
Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16(1), 44–62.
von Davier, M. (2001). Winmira 2001. Computer software]. St. Paul, MN: Assessment Systems Corporation.
Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: a study of conceptual change in childhood. Cognitive Psychology, 24(4), 535–585. https://doi.org/10.1016/0010-0285(92)90018-W.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.
Whitely, S. E. (1983) Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197
Wilkening, F., & Sodian, B. (2005). Scientific reasoning in young children: introduction. Swiss Journal of Psychology, 64(3), 137–139. https://doi.org/10.1024/1421-0188.8.131.52.
Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: comparison with the classical test theory approach. Health Education Research, 21(Supplement 1), i19–i32.
Wright, B. D. (1979). Best test design. Chicago, IL: MESA Press.
Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18, 976–978.
Wu, M. L. (2007). ACER ConQuest version 2.0: generalised item response modelling software. Camberwell, Vic.: ACER Press.
Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20(1), 99–149. https://doi.org/10.1006/drev.1999.0497.
Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27(2), 172–223. https://doi.org/10.1016/j.dr.2006.12.001.
Zimmerman, C., & Klahr, D. (2018). Development of scientific thinking. In J. T. Wixted (Ed.), Stevens’ handbook of experimental psychology and cognitive neuroscience (pp. 1–25). Hoboken: John Wiley & Sons, Inc..
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Edelsbrunner, P.A., Dablander, F. The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues. Educ Psychol Rev 31, 1–34 (2019). https://doi.org/10.1007/s10648-018-9455-5
- Scientific reasoning
- Rasch model
- Item response theory