Scientometrics, Volume 113, Issue 2, pp 1059–1092

Inter-rater reliability and validity of peer reviews in an interdisciplinary field

  • Jens Jirschitzka
  • Aileen Oeberst
  • Richard Göllner
  • Ulrike Cress
Article

Abstract

Peer review is an integral part of science. Devised to ensure and enhance the quality of scientific work, it is a crucial step that influences the publication of papers, the provision of grants and, as a consequence, the careers of scientists. In order to meet the challenges of this responsibility, a certain shared understanding of scientific quality seems necessary. Yet previous studies have shown that inter-rater reliability in peer reviews is relatively low. However, most of these studies did not take the ill-structured measurement design of the data into account. Moreover, no prior (quantitative) study has analyzed inter-rater reliability in an interdisciplinary field. And finally, issues of validity have hardly ever been addressed. Therefore, the three major research goals of this paper are (1) to analyze inter-rater agreement on different rating dimensions (e.g., relevance and soundness) in an interdisciplinary field, (2) to account for ill-structured designs by applying state-of-the-art methods, and (3) to examine the construct and criterion validity of reviewers’ evaluations. A total of 443 reviews were analyzed. These reviews were provided by m = 130 reviewers for n = 145 submissions to an interdisciplinary conference. Our findings demonstrate the urgent need for improvement of scientific peer review. Inter-rater reliability was rather poor, and evaluations by reviewers from the same scientific discipline as the paper under review did not differ significantly from evaluations by reviewers from other disciplines. These findings extend beyond those of prior research. Furthermore, convergent and discriminant construct validity of the rating dimensions were low as well. Nevertheless, a multidimensional model yielded a better fit than a unidimensional model. Our study also shows that the citation rate of accepted papers was positively associated with the relevance ratings made by reviewers from the same discipline as the paper under review, whereas high novelty ratings from same-discipline reviewers were negatively associated with citation rate.
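
To make the notion of inter-rater reliability concrete, the sketch below computes a conventional one-way random-effects intraclass correlation, ICC(1,1) in the sense of Shrout and Fleiss (1979), for a small, purely hypothetical rating matrix. This is only an illustration of the kind of agreement index at stake: it assumes complete, well-structured data and does not reproduce the estimators for ill-structured measurement designs (e.g., Putka et al., 2008) on which the study itself relies.

```python
# Minimal sketch with hypothetical data; not the paper's dataset or analysis method.
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an n_submissions x k_raters matrix."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-submission and within-submission mean squares (one-way ANOVA)
    ms_between = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    # Reliability of a single rater's score
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical example: 5 submissions, each rated by 3 reviewers on one dimension
scores = np.array([[4, 3, 4],
                   [2, 2, 3],
                   [5, 4, 4],
                   [1, 2, 1],
                   [3, 3, 2]])
print(round(icc_oneway(scores), 3))
```

In a real conference setting each submission is rated by a different, partly overlapping subset of reviewers, which is exactly the ill-structured situation that such a fully crossed formula does not handle; the paper's point is that reliability estimation must account for that structure.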

Keywords

Peer review · Inter-rater reliability · Construct validity · Criterion validity · Interdisciplinary research · Citation rate

Supplementary material

Supplementary material 1: 11192_2017_2516_MOESM1_ESM.docx (DOCX, 198 kb)

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2017

Authors and Affiliations

  • Jens Jirschitzka (1)
  • Aileen Oeberst (2, 3)
  • Richard Göllner (1)
  • Ulrike Cress (1, 3)

  1. Eberhard Karls Universität Tübingen, Tübingen, Germany
  2. Johannes Gutenberg-Universität Mainz, Mainz, Germany
  3. Leibniz-Institut für Wissensmedien, Tübingen, Germany
