The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues

Edelsbrunner, Peter A.; Dablander, Fabian

doi:10.1007/s10648-018-9455-5

REVIEW ARTICLE
Published: 23 November 2018

Educational Psychology Review, Volume 31, pages 1–34 (2019)

Abstract

Psychometric modeling has become a frequently used statistical tool in research on scientific reasoning. We review psychometric modeling practices in this field, including model choice, model testing, and the inferences researchers draw from their psychometric practices. A review of 11 empirical research studies reveals that the predominant psychometric approach is Rasch modeling with a focus on item fit statistics, applied in a way that closely resembles practices in national and international large-scale educational assessment programs. This approach is common in the educational assessment community and rooted in subtle philosophical views on measurement. However, we find that researchers using this approach tend to draw interpretations that lie outside its inferential domain and do not accord with its related practices and inferential purposes. In some of the reviewed articles, researchers emphasize item infit statistics for dimensionality assessment. Item infit statistics, however, cannot be regarded as a valid indicator of the dimensionality of scientific reasoning. Using simulations as illustration, we argue that this practice is limited in delivering psychological insights; in fact, various recent inferences about the structure, cognitive basis, and correlates of scientific reasoning might be unwarranted. To harness the full potential of psychometric modeling, we suggest adjusting psychometric modeling practices to the psychological and educational questions at hand.

Notes

The research in focus of this review uses the Rasch model for dichotomous data. Readers interested in the theory and application of Rasch models for polytomous data are referred to Anderson et al. (2007).
Further item response models include yet more parameters, which represent for example item-specific guessing probabilities (giving the right answer by chance on, for example, multiple choice tests) and item-specific slipping probabilities (not giving the right answer by chance because, for example, items have different distracting elements; Revelle 2004; Thissen and Steinberg 1986).
Sometimes the two schools are referred to as two paradigms because they differ so strongly in their theoretical assumptions (Andrich 2004). Here, we prefer to call the two positions schools because their similarities and differences are not as clear-cut as sometimes assumed (Robitzsch 2016), and in addition, Kuhn’s (T. S. Kuhn 1970) concept of paradigms and their relation to development in science has been contested (Bird 2013; Toulmin 1974); thus, we deem the concept of the two paradigms described by Andrich (2004) not yet sufficiently elaborated and critically reflected to be accepted.
Classical test theory does not formally explicate a model to explain the relation between person and item characteristics and item responses. Psychometric modeling does explicate such a model, for example the Rasch model, and thus allows testing its underlying assumptions. The distinction between classical test theory and modern psychometric modeling is, however, not clear-cut (Holland and Hoskens 2003); factor analysis, which used to be regarded as an instrument of classical test theory, is in its confirmatory versions closely related to the psychometric models discussed here (Gebhardt 2016).
One could argue that large-scale assessment programs also represent a particular type of research, particularly because data gathered in these studies are often used for ancillary research. However, the major aim of all of these programs is policy-informing assessment. Thus, their methodology is aligned with this aim rather than with the aim of advancing scientific theory.
Research related to scientific reasoning is also conducted under the terms of scientific inquiry and scientific thinking. For terminological coherence, in the description of the reviewed studies, these and related terms are referred to as scientific reasoning.
Notably, these packages are all commercial software; none of the studies used free software packages such as those available in the R software environment (for example the TAM package, which encompasses a broad variety of psychometric models; Kiefer et al. 2016).
Items with low infit are not damaging to estimating a person’s ability, but rather are removed when constructing an instrument to reduce the item pool from, say, 100 to 20 items (Linacre and Wright 1994). We thank a reviewer for pointing this out.
It should be noted that other researchers have evaluated and discussed item infit statistics, albeit with different aims (Christensen and Kreiner 2013; Heene et al. 2014; Smith et al. 2008; Smith et al. 1998; R. M. Smith and Suh 2003). The present simulations serve a demonstration purpose rather than aiming to provide generalizable statistical insights. The simulation conditions were therefore tailored to reflect the main characteristics of the reviewed studies.
The number of items was simulated to be equal across dimensions; under deviations from this scenario, item fit statistics likely flag more items. However, an equal or almost equal number of items across dimensions is the regular scenario in psychometric studies and the reviewed literature.
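
As a rough illustration of the quantities mentioned in the notes above, the following minimal Python sketch shows the dichotomous Rasch response probability, its extension with item-specific guessing and slipping parameters, and the information-weighted mean-square (infit) statistic for a single item. It is not the simulation code of the article: the function names, the sample size, and the random seed are hypothetical, and for simplicity the infit is computed from the true person and item parameters rather than from estimates, which applied software would use instead.

```python
import math
import random


def irt_probability(theta, difficulty, guessing=0.0, slipping=0.0):
    """Probability of a correct response for ability theta.

    With guessing=0 and slipping=0 this is the dichotomous Rasch model.
    A nonzero guessing value raises the lower asymptote; a nonzero
    slipping value lowers the upper asymptote to 1 - slipping.
    """
    logistic = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return guessing + (1.0 - slipping - guessing) * logistic


def item_infit(responses, thetas, difficulty):
    """Infit mean-square for one item: summed squared residuals
    divided by summed model variances (Rasch model assumed)."""
    squared_residuals = 0.0
    information = 0.0
    for x, theta in zip(responses, thetas):
        p = irt_probability(theta, difficulty)
        squared_residuals += (x - p) ** 2
        information += p * (1.0 - p)
    return squared_residuals / information


# Hypothetical demonstration: data generated from the Rasch model itself,
# so the infit value should lie close to its expected value of 1.
rng = random.Random(1)
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
responses = [1 if rng.random() < irt_probability(t, 0.0) else 0 for t in thetas]
infit = item_infit(responses, thetas, 0.0)
```

An infit value near 1 only indicates that the residual variation matches what the Rasch model expects for that item; as the abstract argues, it should not be read as evidence about dimensionality.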

References

Ainley, J., Fraillon, J., & Freeman, C. (2007). National assessment program—ICT literacy years 6 & 10 report, 2005. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.
Andersen, E. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38(1), 123–140. https://doi.org/10.1007/BF02291180.
Anderson, C. J., Li, Z., & Vermunt, J. K. (2007). Estimation of models in a Rasch family for polytomous items and multiple latent variables. Journal of Statistical Software, 20(6), 1–36. https://doi.org/10.18637/jss.v020.i06.
Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42(Supplement), I–7. https://doi.org/10.1097/01.mlr.0000103528.48582.7c.
Andrich, D. (2011, October). Rating scales and Rasch measurement. Expert Review of Pharmacoeconomics & Outcomes Research, 11(5), 571–585. https://doi.org/10.1586/erp.11.59.
Baird, J.-A., Andrich, D., Hopfenbeck, T. N., & Stobart, G. (2017). Assessment and learning: fields apart? Assessment in Education: Principles, Policy & Practice, 24(3), 317–350. https://doi.org/10.1080/0969594X.2017.1319337.
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). A new lease of life for Thomson’s bonds model of intelligence. Psychological Review, 116(3), 567–579.
Bartolucci, F., Bacci, S., & Gnaldi, M. (2014). MultiLCIRT: an R package for multidimensional latent class item response models. Computational Statistics & Data Analysis, 71, 971–985. https://doi.org/10.1016/j.csda.2013.05.018.
Bird, A. (2013). Thomas Kuhn. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Fall 2013). Metaphysics Research Lab, Stanford University.
Bond, T. & Fox, C. M. (2015). Applying the Rasch model: fundamental measurement in the human sciences. Routledge.
Bonifay, W., Lane, S. P., & Reise, S. P. (2016). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186. https://doi.org/10.1177/2167702616657069.
Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. Dordrecht: Springer Netherlands.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research & Perspective, 6(1-2), 25–53. https://doi.org/10.1080/15366360802035497.
Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3), 345–370. https://doi.org/10.1007/BF02294361.
Brown, N. J., & Wilson, M. (2011). A model of cognition: the missing cornerstone of assessment. Educational Psychology Review, 23(2), 221–234.
Brown, N. J., Furtak, E. M., Timms, M., Nagashima, S. O., & Wilson, M. (2010). The evidence-based reasoning framework: assessing scientific reasoning. Educational Assessment, 15(3-4), 123–141.
Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: John Wiley.
Bürkner, P. C. (2017). brms: an R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01.
Cano, F. (2005). Epistemological beliefs and approaches to learning: their change through secondary school and their influence on academic performance. British Journal of Educational Psychology, 75(2), 203–221. https://doi.org/10.1348/000709904X22683.
Carey, S. (1992). The origin and evolution of everyday concepts. University of Minnesota Press, Minneapolis.
Caspi, A., Houts, R. M., Belsky, D. W., Goldman-Mellor, S. J., Harrington, H., Israel, S., Meier, M. H., Ramrakha, S., Shalev, I., Poulton, R., & Moffitt, T. E. (2014). The p factor: one general psychopathology factor in the structure of psychiatric disorders? Clinical Psychological Science, 2(2), 119–137. https://doi.org/10.1177/2167702613497473.
Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered reports: realigning incentives in scientific publishing. Cortex, 66(3), A1–A2. https://doi.org/10.1016/j.cortex.2012.12.016.
Chen, Z., & Klahr, D. (1999). All other things being equal: acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098–1120. https://doi.org/10.1111/1467-8624.00081.
Christensen, K. B. & Kreiner, S. (2013). Item fit statistics. In Rasch models in health (pp. 83–104). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch5.
Conway, A. R., & Kovacs, K. (2015). New and emerging models of human intelligence. Wiley Interdisciplinary Reviews: Cognitive Science, 6(5), 419–426. https://doi.org/10.1002/wcs.1356.
Cullen, L. T. (2012). Rasch models: foundations, recent developments, and applications. [S.l.]: Springer.
Davier, M. v., & Carstensen, C. H. (2007). Multivariate and mixture distribution Rasch models: extensions and applications. New York: Springer.
De Groot, A. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han Lj Van Der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001.
de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice options. Applied Psychological Measurement, 33(3), 163–183. https://doi.org/10.1177/0146621608320523.
Deary, I. J., Wilson, J. A., Carding, P. N., MacKenzie, K., & Watson, R. (2010). From dysphonia to dysphoria: Mokken scaling shows a strong, reliable hierarchy of voice symptoms in the Voice Symptom Scale questionnaire. Journal of Psychosomatic Research, 68(1), 67–71. https://doi.org/10.1016/j.jpsychores.2009.06.008.
Dewey, J. (1910). How we think. Boston, MA: DC Heath.
Dickison, P., Luo, X., Kim, D., Woo, A., Muntean, W., & Bergstrom, B. (2016). Assessing higher-order cognitive constructs by using an information-processing framework. Journal of Applied Testing Technology, 17, 1–19.
Divgi, D. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283–298. https://doi.org/10.1111/j.1745-3984.1986.tb00251.x.
Donovan, J., Hutton, P., Lennon, M., O’Connor, G., & Morrissey, N. (2008a). National assessment program—science literacy year 6 school release materials, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
Donovan, J., Lennon, M., O’connor, G., & Morrissey, N. (2008b). National assessment program–science literacy year 6 report, 2006. Ministerial Council on Education, Employment, Training and Youth Affairs (NJ1).
Engelhard Jr, G. (2013). Invariant measurement: using Rasch models in the social, behavioral, and health sciences. New York: Routledge.
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93–112.
Esswein, J. L. (2010). Critical thinking and reasoning in middle school science education (Doctoral dissertation, The Ohio State University).
Finkelstein, L. (2003). Widely, strongly and weakly defined measurement. Measurement, 34(1), 39–48. https://doi.org/10.1016/S0263-2241(03)00018-6.
Fischer, F., Kollar, I., Ufer, S., Sodian, B., Hussmann, H., Pekrun, R., et al. (2014). Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education. Frontline Learning Research, 4, 28–45. https://doi.org/10.14786/flr.v2i2.96.
Fox, J.-P. (2010). Bayesian item response modeling: theory and applications. Springer Science & Business Media.
Gebhardt, E. (2016). Latent path models within an IRT framework (Doctoral dissertation).
Gelman, A., & Loken, E. (2014). The statistical crisis in science data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American Scientist, 102(6), 460.
Gignac, G. E. (2016, July). On the evaluation of competing theories: a reply to van der Maas and Kan. Intelligence, 57, 84–86. https://doi.org/10.1016/j.intell.2016.03.006.
Glas, C. A. & Verhelst, N. D. (1995). Testing the Rasch model. In Rasch models (pp. 69–95). Springer.
Glockner-Rist, A., & Hoijtink, H. (2003). The best of both worlds: factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10(4), 544–565. https://doi.org/10.1207/S15328007SEM1004_4.
Grube, C. R. (2010). Kompetenzen naturwissenschaftlicher Erkenntnisgewinnung [Competencies of scientific inquiry] (Doctoral dissertation, Universität Kassel).
Hambleton, R. K. (2000). Response to Hays et al. and McHorney and Cohen: emergence of item response modeling in instrument development and data analysis. Medical Care, 38, II–60. https://doi.org/10.1097/00005650-200009002-00009.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
Hartig, J., & Frey, A. (2013). Sind Modelle der Item-Response-Theorie (IRT) das “Mittel der Wahl” für die Modellierung von Kompetenzen? Zeitschrift für Erziehungswissenschaft, 16(S1), 47–51. https://doi.org/10.1007/s11618-013-0386-0.
Hartig, J., Klieme, E., & Leutner, D. (2008). Assessment of competencies in educational contexts. Hogrefe Publishing.
Hartmann, S., Upmeier zu Belzen, A., Kroeger, D., & Pant, H. A. (2015, January). Scientific reasoning in higher education: constructing and evaluating the criterion-related validity of an assessment of preservice science teachers’ competencies. Zeitschrift fuer Psychologie, 223(1), 47–53. https://doi.org/10.1027/2151-2604/a000199.
Heene, M. (2006). Konstruktion und Evaluation eines Studierendenauswahlverfahrens für Psychologie an der Universität Heidelberg. Unpublished Doctoral Dissertation, University of Heidelberg.
Heene, M., Bollmann, S., & Buhner, M. (2014). Much ado about nothing, or much to do about something: effects of scale shortening on criterion validity and mean differences. Journal of Individual Differences, 35(4), 245–249. https://doi.org/10.1027/1614-0001/a000146.
Heene, M., Kyngdon, A., & Sckopke, P. (2016). Detecting violations of unidimensionality by order-restricted inference methods. Frontiers in Applied Mathematics and Statistics, 2, 3.
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: application to true-score prediction from a possibly nonparallel test. Psychometrika, 68(1), 123–149.
Humphry, S. (2011, January). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9(1), 1–24. https://doi.org/10.1080/15366367.2011.558442.
Jeon, M., Draney, K., & Wilson, M. (2015). A general Saltus LLTM-R for cognitive assessments. In Quantitative psychology research (pp. 73–90). Springer. https://doi.org/10.1007/978-3-319-07503-7_5.
Kiefer, T., Robitzsch, A., & Wu, M. (2016). Package TAM. R software package.
Kitchener, K. S. (1983). Cognition, metacognition, and epistemic cognition. Human Development, 26, 222–232.
Klahr, D. (2002). Exploring science: the cognition and development of discovery processes. The MIT Press.
Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48. https://doi.org/10.1207/s15516709cog1201_1.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216(2), 61–73.
Koller, I., Maier, M. J., & Hatzinger, R. (2015). An empirical power analysis of quasi-exact tests for the Rasch model. Methodology, 11(2), 45–54. https://doi.org/10.1027/1614-2241/a000090.
Körber, S., Mayer, D., Osterhaus, C., Schwippert, K., & Sodian, B. (2014, September). The development of scientific thinking in elementary school: a comprehensive inventory. Child Development, 86(1), 327–336. https://doi.org/10.1111/cdev.12298.
Körber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33(1), 57–72. https://doi.org/10.1111/bjdp.12067.
Kreiner, S. & Christensen, K. B. (2013). Overall tests of the Rasch model. In Rasch models in health (pp. 105–110). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch6.
Kremer, K., Specht, C., Urhahne, D., & Mayer, J. (2014, January 2). The relationship in biology between the nature of science and scientific inquiry. Journal of Biological Education, 48(1), 1–8. https://doi.org/10.1080/00219266.2013.788541.
Kuhn, D. (1989). Children and adults as intuitive scientists. Psychological Review, 96(4), 674–689. https://doi.org/10.1037/0033-295X.96.4.674.
Kuhn, D. (1991). The skills of argument. Cambridge University Press.
Kuhn, D., Iordanou, K., Pease, M., & Wirkala, C. (2008). Beyond control of variables: what needs to develop to achieve skilled scientific thinking? Cognitive Development, 23(4), 435–451. https://doi.org/10.1016/j.cogdev.2008.09.006.
Kuhn, D., & Pease, M. (2008). What needs to develop in the development of inquiry skills? Cognition and Instruction, 26(4), 512–559. https://doi.org/10.1080/07370000802391745.
Kuhn, D., Ramsey, S., & Arvidsson, T. S. (2015, July). Developing multivariable thinkers. Cognitive Development, 35, 92–110. https://doi.org/10.1016/j.cogdev.2014.11.003.
Kuhn, D., & Udell, W. (2003). The development of argument skills. Child Development, 74(5), 1245–1260. https://doi.org/10.1111/1467-8624.00605.
Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed., enl.). International encyclopedia of unified science. Foundations of the unity of science, v. 2, no. 2. Chicago: University of Chicago Press.
Kuo, C.-Y., Wu, H.-K., Jen, T.-H., & Hsu, Y.-S. (2015, September 22). Development and validation of a multimedia-based assessment of scientific inquiry abilities. International Journal of Science Education, 37(14), 2326–2357. https://doi.org/10.1080/09500693.2015.1078521.
Lehrer, R., & Schauble, L. (2000). Modeling in mathematics and science. In R. Glaser (Ed.), Advances in instructional psychology, Volume 5: Educational Design and Cognitive Science (pp. 100–159). New Jersey: Lawrence Erlbaum.
Linacre, J. M. (2010). Two perspectives on the application of Rasch models. European Journal of Physical and Rehabilitation Medicine, 46, 309–310.
Linacre, J. M. (2012). A user’s guide to FACETS Rasch-model computer programs.
Linacre, J. M., & Wright, B. D. (1994). Dichotomous infit and outfit mean-square fit statistics. Rasch Measurement Transactions, 8(2), 260.
Linacre, J. M. & Wright, B. D. (2000). Winsteps. URL: http://www.winsteps.com/index.htm [accessed 2017-01-01].
Lou, Y., Blanchard, P., & Kennedy, E. (2015). Development and validation of a science inquiry skills assessment. Journal of Geoscience Education, 63(1), 73–85. https://doi.org/10.5408/14-028.1.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114(1), 185–199.
Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: the eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1–20. https://doi.org/10.18637/jss.v020.i09.
Manlove, S., Lazonder, A. W., & Jong, T. D. (2006). Regulative support for collaborative scientific inquiry learning. Journal of Computer Assisted Learning, 22(2), 87–98.
Mari, L., Maul, A., Irribarra, D. T., & Wilson, M. (2016). A meta-structural understanding of measurement. In Journal of physics: conference series (Vol. 772, p. 012009). IOP Publishing.
Mari, L., Maul, A., Torres Irribarra, D., & Wilson, M. (2017). Quantities, quantification, and the necessary and sufficient conditions for measurement. Measurement, 100, 115–121.
Masters, G. N. (1988). Item discrimination: when more is worse. Journal of Educational Measurement, 25(1), 15–29. https://doi.org/10.1111/j.1745-3984.1988.tb00288.x.
Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51–69. https://doi.org/10.1080/15366367.2017.1348108.
Maydeu-Olivares, A. (2013, July). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680.
Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075.
Mayer, D., Sodian, B., Körber, S., & Schwippert, K. (2014, February). Scientific reasoning in elementary school children: assessment and relations with cognitive abilities. Learning and Instruction, 29, 43–55. https://doi.org/10.1016/j.learninstruc.2013.07.005.
Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement, 14(3), 283–298. https://doi.org/10.1177/014662169001400306.
Michell, J. (2000). Normal science, pathological science and psychometrics. Theory & Psychology, 10(5), 639–667.
Mokken, R. J. (1971). A theory and procedure of scale analysis: with applications in political research. Walter de Gruyter.
Molenaar, I. W. (2001). Thirty years of nonparametric item response theory. Applied Psychological Measurement, 25(3), 295–299. https://doi.org/10.1177/01466210122032091.
Morris, B. J., Croker, S., Masnick, A., & Zimmerman, C. (2012). The emergence of scientific reasoning. In Current topics in children’s learning and cognition. Rijeka, Croatia: InTech.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O’Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks.
Mullis, I. V., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., … O’Connor, K. M. (2003). TIMSS trends in mathematics and science study: assessment frameworks and specifications 2003.
Musek, J. (2007). A general factor of personality: evidence for the big one in the five-factor model. Journal of Research in Personality, 41(6), 1213–1233. https://doi.org/10.1016/j.jrp.2007.02.003.
National Assessment Governing Board. (2007). Science assessment and item specifications for the 2009 national assessment of educational progress. Washington: National Assessment Governing Board.
Nowak, K. H., Nehring, A., Tiemann, R., & Upmeier zu Belzen, A. (2013). Assessing students’ abilities in processes of scientific inquiry in biology using a paper-and-pencil test. Journal of Biological Education, 47(3), 182–188. https://doi.org/10.1080/00219266.2013.822747.
OECD. (2006). Assessing scientific, reading and mathematical literacy: a framework for PISA 2006. Paris: Organisation for Economic Co-operation and Development.
Opitz, A., Heene, M., & Fischer, F. (2017). Measuring scientific reasoning—a review of test instruments. Educational Research and Evaluation, 23(3-4), 78–101.
Pant, H. A., Stanat, P., Schroeders, U., Roppelt, A., Siegle, T., Pohlmann, C., & Institut zur Qualitätsentwicklung im Bildungswesen (Eds.). (2013). IQB Ländervergleich 2012: mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I. Münster: Waxmann.
Peirce, C. S. (2012). Philosophical writings of Peirce. Courier Corporation.
Piaget, J. & Inhelder, B. (1958). The growth of logical thinking from childhood to adolescence: an essay on the construction of formal operational structures. Abingdon, Oxon: Routledge.
Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45(1), 45–72. https://doi.org/10.1080/00273170903504729.
Raiche, G. & Raiche, M. G. (2009). The irtprob package.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.
Raykov, T. & Marcoulides, G. A. (2011). Introduction to psychometric theory. Routledge.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. Accessed 15 Sept 2013.
Reckase, M. (2009). Multidimensional item response theory. Springer.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555.
Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95(2), 129–140. https://doi.org/10.1080/00223891.2012.725437.
Renkl, A. (2012). Modellierung von Kompetenzen oder von interindividuellen Kompetenzunterschieden. Psychologische Rundschau., 63(1), 50–53.
Revelle, W. (2004). An introduction to psychometric theory with applications in R. Springer.
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367.
Robitzsch, A. (2016). Essays zu methodischen herausforderungen im large-scale assessment. Humboldt-Universität zu Berlin.
Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2014). CDM: cognitive diagnosis modeling. R package version, 3.
Rosseel, Y., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., ... Barendse, M., et al. (2017). Package lavaan.
Rost, J., Carstensen, C., & Von Davier, M. (1997). Applying the mixed Rasch model to personality questionnaires. Applications of latent trait and latent class models in the social sciences, 324–332.
Schommer, M., Calvert, C., Gariglietti, G., & Bajaj, A. (1997). The development of epistemological beliefs among secondary students: a longitudinal study. Journal of Educational Psychology, 89(1), 37–40. https://doi.org/10.1037/0022-0663.89.1.37.
Siersma, V. & Eusebi, P. (2013). Analysis with repeatedly measured binary item response data by ad hoc Rasch scales. In Rasch models in health (pp. 257–276). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118574454.ch14.
Sijtsma, K. (2011). Review. Measurement, 44(7), 1209–1219. https://doi.org/10.1016/j.measurement.2011.03.019.
Sijtsma, K. (2012, December 1). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809. https://doi.org/10.1177/0959354312454353.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33. https://doi.org/10.1186/1471-2288-8-33.
Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66–78.
Smith, R. M., & Suh, K. K. (2003). Rasch fit statistics as a test of the invariance of item parameter estimates. Journal of Applied Measurement, 4(2), 153–163.
Sodian, B., & Bullock, M. (2008, October). Scientific reasoning where are we now? Cognitive Development, 23(4), 431–434. https://doi.org/10.1016/j.cogdev.2008.09.003.
Sodian, B., Zaitchik, D., & Carey, S. (1991). Young children’s differentiation of hypothetical beliefs from evidence. Child Development, 62(4), 753–766. https://doi.org/10.1111/j.1467-8624.1991.tb01567.x.
Stewart, I. (2008). Nature’s numbers: the unreal reality of mathematics. NY: Basic Books.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: a new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. https://doi.org/10.1007/s11336-013-9388-3.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51(4), 567–577. https://doi.org/10.1007/BF02295596.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554.
Thurstone, L. L. Louis Leon & Chave, E. J. (1954). Chicago: Chicago University Press.
Toulmin, S. (1974). Human understanding, volume i.
Van der Ark, L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20, 1–19.
Van Der Maas, H. L., Dolan, C. V., Grasman, R. P., Wicherts, J. M., Huizenga, H. M., & Raijmakers, M. E. (2006). A dynamical model of general intelligence: the positive manifold of intelligence by mutualism. Psychological Review, 113(4), 842–861. https://doi.org/10.1037/0033-295X.113.4.842.
van Bork, R., Epskamp, S., Rhemtulla, M., Borsboom, D., & van der Maas, H. L. (2017). What is the p-factor of psychopathology? Some risks of general factor modeling. Theory & Psychology, 27(6), 759–773.
Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2015). Model comparison and the principle. The Oxford handbook of computational and mathematical psychology, 300.
Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16(1), 44–62.
von Davier, M. (2001). WINMIRA 2001 [Computer software]. St. Paul, MN: Assessment Systems Corporation.
Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: a study of conceptual change in childhood. Cognitive Psychology, 24(4), 535–585. https://doi.org/10.1016/0010-0285(92)90018-W.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.
Whitely, S. E. (1983). Construct validity: construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.
Wilkening, F., & Sodian, B. (2005). Scientific reasoning in young children: introduction. Swiss Journal of Psychology, 64(3), 137–139. https://doi.org/10.1024/1421-0185.64.3.137.
Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: comparison with the classical test theory approach. Health Education Research, 21(Supplement 1), i19–i32.
Wright, B. D. (1979). Best test design. Chicago, IL: MESA Press.
Wright, B. D. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18, 976–978.
Wu, M. L. (2007). ACER ConQuest version 2.0: generalised item response modelling software. Camberwell, Vic.: ACER Press.
Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20(1), 99–149. https://doi.org/10.1006/drev.1999.0497.
Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27(2), 172–223. https://doi.org/10.1016/j.dr.2006.12.001.
Zimmerman, C., & Klahr, D. (2018). Development of scientific thinking. In J. T. Wixted (Ed.), Stevens’ handbook of experimental psychology and cognitive neuroscience (pp. 1–25). Hoboken: John Wiley & Sons, Inc.

Author information

Authors and Affiliations

ETH Zürich, Clausiusstrasse 59, RZ H16, 8092, Zürich, Switzerland
Peter A. Edelsbrunner
University of Amsterdam, Amsterdam, Netherlands
Fabian Dablander

Corresponding author

Correspondence to Peter A. Edelsbrunner.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Edelsbrunner, P.A., Dablander, F. The Psychometric Modeling of Scientific Reasoning: a Review and Recommendations for Future Avenues. Educ Psychol Rev 31, 1–34 (2019). https://doi.org/10.1007/s10648-018-9455-5

Published: 23 November 2018
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s10648-018-9455-5
