Inter-rater reliability and validity of peer reviews in an interdisciplinary field
Peer review is an integral part of science. Devised to ensure and enhance the quality of scientific work, it is a crucial step that influences the publication of papers, the awarding of grants and, as a consequence, the careers of scientists. To meet the challenges of this responsibility, a certain shared understanding of scientific quality seems necessary. Yet previous studies have shown that inter-rater reliability in peer reviews is relatively low. However, most of these studies did not take the ill-structured measurement design of their data into account. Moreover, no prior (quantitative) study has analyzed inter-rater reliability in an interdisciplinary field. Finally, issues of validity have hardly ever been addressed. Therefore, the three major research goals of this paper are (1) to analyze inter-rater agreement on different rating dimensions (e.g., relevance and soundness) in an interdisciplinary field, (2) to account for ill-structured designs by applying state-of-the-art methods, and (3) to examine the construct and criterion validity of reviewers’ evaluations. A total of 443 reviews were analyzed. These reviews were provided by m = 130 reviewers for n = 145 submissions to an interdisciplinary conference. Our findings demonstrate the urgent need for improvement of scientific peer review. Inter-rater reliability was rather poor, and evaluations by reviewers from the same scientific discipline as the paper under review did not differ significantly from evaluations by reviewers from other disciplines. These findings extend beyond those of prior research. Furthermore, convergent and discriminant construct validity of the rating dimensions were low as well. Nevertheless, a multidimensional model yielded a better fit than a unidimensional model.
Our study also shows that the citation rate of accepted papers was positively associated with the relevance ratings made by reviewers from the same discipline as the paper they were reviewing. In addition, high novelty ratings from same-discipline reviewers were negatively associated with citation rate.
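To make the notion of inter-rater reliability under an unbalanced review assignment concrete, the following is a minimal sketch and not the method used in this study. It assumes a simple one-way random-effects intraclass correlation, ICC(1), computed from ANOVA mean squares with the usual adjustment for unequal numbers of reviews per submission. The function name `icc_oneway_unbalanced` and the toy ratings are hypothetical and serve illustration only.

```python
from collections import defaultdict


def icc_oneway_unbalanced(ratings):
    """Estimate a one-way random-effects ICC(1) for single ratings.

    ratings: iterable of (submission_id, score) pairs. Submissions may
    have different numbers of reviews (unbalanced design). Reviewer
    identity is ignored, which is a simplifying assumption of this sketch.
    """
    groups = defaultdict(list)
    for submission_id, score in ratings:
        groups[submission_id].append(float(score))

    n_groups = len(groups)
    n_total = sum(len(scores) for scores in groups.values())
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least 2 submissions and some repeated reviews")

    grand_mean = sum(sum(scores) for scores in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in scores)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Adjusted average number of reviews per submission for unequal group sizes.
    sum_sq_sizes = sum(len(scores) ** 2 for scores in groups.values())
    k0 = (n_total - sum_sq_sizes / n_total) / (n_groups - 1)

    denominator = ms_between + (k0 - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variance in the ratings; ICC is undefined")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical toy data: three submissions with 2, 3 and 2 reviews.
    toy = [("s1", 4), ("s1", 5), ("s2", 2), ("s2", 3), ("s2", 2), ("s3", 5), ("s3", 3)]
    print(round(icc_oneway_unbalanced(toy), 3))
```

A value near 1 would indicate that most score variance lies between submissions, while a value near 0 would indicate that reviewers of the same submission disagree about as much as reviewers of different submissions. Estimators designed for ill-structured designs additionally model reviewer effects, which this sketch deliberately leaves out.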
Keywords: Peer review; Inter-rater reliability; Construct validity; Criterion validity; Interdisciplinary research; Citation rate