Abstract
Peer review is an integral part of science. Devised to ensure and enhance the quality of scientific work, it is a crucial step that influences the publication of papers, the awarding of grants and, as a consequence, the careers of scientists. To meet this responsibility, a certain shared understanding of scientific quality seems necessary. Yet previous studies have shown that inter-rater reliability in peer review is relatively low. However, most of these studies did not take the ill-structured measurement design of the data into account. Moreover, no prior quantitative study has analyzed inter-rater reliability in an interdisciplinary field, and issues of validity have hardly been addressed. The three major research goals of this paper are therefore (1) to analyze inter-rater agreement on different rating dimensions (e.g., relevance and soundness) in an interdisciplinary field, (2) to account for ill-structured designs by applying state-of-the-art methods, and (3) to examine the construct and criterion validity of reviewers’ evaluations. We analyzed a total of 443 reviews, provided by m = 130 reviewers for n = 145 submissions to an interdisciplinary conference. Our findings demonstrate an urgent need for improvement of scientific peer review: inter-rater reliability was rather poor, and evaluations did not differ significantly between reviewers from the same scientific discipline as the paper under review and reviewers from other disciplines. These findings extend those of prior research. Furthermore, convergent and discriminant construct validity of the rating dimensions were low as well, although a multidimensional model yielded a better fit than a unidimensional model. Our study also shows that the citation rate of accepted papers was positively associated with the relevance ratings of same-discipline reviewers. In addition, high novelty ratings from same-discipline reviewers were negatively associated with citation rate.
Notes
With regard to dichotomous nominal data (e.g., “accepted” vs. “rejected”), it should be noted that Cohen’s Kappa (Cohen 1960), although often used, is far from a reliable measure of agreement, especially in cases of imbalanced marginal totals (e.g., see Baethge et al. 2013; Feinstein and Cicchetti 1990; Gwet 2008, 2014; Uebersax 1982–1983). Accordingly, Baethge et al. (2013) applied the agreement coefficient AC1 for two raters proposed by Gwet (2008) to dichotomized reviewer evaluations and found a chance-corrected agreement estimate of .63, whereas Cohen’s Kappa reached a value of only .16 in the same study.
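To make this divergence concrete, the following R sketch computes Cohen’s Kappa and Gwet’s AC1 for a two-rater table with strongly imbalanced marginals. The counts are purely hypothetical and are not taken from Baethge et al. (2013) or any other study cited here; they merely illustrate the high-agreement/low-kappa paradox described by Feinstein and Cicchetti (1990).

```r
# Hypothetical 2 x 2 table of two reviewers' accept/reject recommendations
tab <- matrix(c(85, 7,
                 6, 2), nrow = 2, byrow = TRUE,
              dimnames = list(rater1 = c("accept", "reject"),
                              rater2 = c("accept", "reject")))
n  <- sum(tab)
po <- sum(diag(tab)) / n  # observed agreement (here .87)

# Cohen's (1960) chance agreement: sum of products of marginal proportions
pe_kappa <- sum((rowSums(tab) / n) * (colSums(tab) / n))
kappa    <- (po - pe_kappa) / (1 - pe_kappa)

# Gwet's (2008) chance agreement for two categories: 2q(1 - q),
# with q the mean marginal proportion of the first category
q      <- as.numeric(rowSums(tab)[1] + colSums(tab)[1]) / (2 * n)
pe_ac1 <- 2 * q * (1 - q)
ac1    <- (po - pe_ac1) / (1 - pe_ac1)

round(c(observed = po, kappa = kappa, AC1 = ac1), 2)
# Despite raw agreement of .87, kappa is only about .16 while AC1 is about .85
```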
The investigated international conference took place within the last two decades. All reviewers were aware that others could access their evaluations of the papers. For the present study, the reviewers and their evaluations were fully anonymized and analyzed in aggregated form. Moreover, in order to protect the reviewers’ privacy and anonymity as far as possible, we have omitted the name and year of the conference; the same applies to the conference proceedings.
The Spearman–Brown prophecy formula can be used to predict the reliability of a test or target score after increasing (or decreasing) the corresponding number of items, observations, or raters. It can also be used to determine the necessary number of items, observations, or raters for obtaining a certain reliability value (e.g., see Shrout and Fleiss 1979, p. 426).
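For reference, the formula itself (a standard psychometric result, not specific to this study): if $\rho_1$ denotes the reliability of a single rater, the reliability of the mean across $m$ parallel raters, and the number of raters needed for a target reliability $\rho^{*}$, are

$$\rho_m = \frac{m\,\rho_1}{1 + (m-1)\,\rho_1}, \qquad m = \frac{\rho^{*}\,(1-\rho_1)}{\rho_1\,(1-\rho^{*})}.$$

For example, with a single-rater reliability of $\rho_1 = .20$, a composite reliability of $\rho^{*} = .80$ would require $m = (.80 \times .80)/(.20 \times .20) = 16$ raters.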
It seems noteworthy that inspection of the “cannot judge” responses did not indicate, at least after applying the correction proposed by Holm (1979), that reviewers from disciplines other than that of the paper (different-discipline reviewers) used this category more often than same-discipline reviewers (all Holm-adjusted ps > .085; see Online Resource 1).
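As a brief illustration of the Holm (1979) step-down correction used in these notes, here is a minimal R sketch with made-up p-values (not the p-values from our analyses):

```r
# Holm's (1979) step-down correction, illustrated with made-up p-values
p_raw <- c(0.004, 0.020, 0.030, 0.150)
p.adjust(p_raw, method = "holm")  # base R implementation

# Equivalent by hand: multiply the i-th smallest p-value by (k - i + 1),
# enforce monotonicity via the running maximum, and cap at 1
k <- length(p_raw)
o <- order(p_raw)
p_holm <- pmin(1, cummax((k - seq_len(k) + 1) * p_raw[o]))[order(o)]
p_holm  # identical to the p.adjust() result
```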
Two chains per model were used, with a minimum of 30,000 and a maximum of 200,000 iterations specified for each chain. The convergence criterion was assessed every 100 iterations, based on the final half of all iterations per chain. After the criterion was reached, the first half of all iterations was discarded (burn-in phase), and the posterior distributions were constructed from the remaining post-burn-in iterations (Brown 2015; Muthén and Muthén 2012). Convergence was determined using the Gelman–Rubin criterion (Gelman and Rubin 1992; Muthén and Muthén 2012). The parameter b in the formula of the potential scale reduction (PSR) was set to 0.001, which defines a very strict criterion (Brown 2015; Gelman et al. 2013; Muthén and Muthén 2012; van de Schoot et al. 2014; Zyphur and Oswald 2015).
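For reference, a standard textbook form of the potential scale reduction (Gelman and Rubin 1992; Gelman et al. 2013) is given below; the exact variant implemented in Mplus, including the role of the parameter b, may differ in detail. For $m$ chains with $n$ post-burn-in draws each,

$$\widehat{\mathrm{PSR}} = \sqrt{\frac{\frac{n-1}{n}\,W + \frac{1}{n}\,B}{W}}, \qquad B = \frac{n}{m-1}\sum_{j=1}^{m}\left(\bar{\theta}_{j} - \bar{\theta}\right)^{2},$$

where $W$ is the average within-chain variance of the parameter $\theta$, $\bar{\theta}_j$ is the mean of chain $j$, and $\bar{\theta}$ is the overall mean. Convergence is indicated by PSR values close to 1 for all parameters.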
We also compared the variance components (e.g., the paper component) for each rating dimension between the same-discipline and the different-discipline paper × reviewer combinations, as suggested by O’Neill et al. (2012). However, the LRTs based on the REML log-likelihoods in our study yielded several negative Chi square statistics, which cannot be regarded as trustworthy (for a similar phenomenon in another context, see Satorra and Bentler 2010).
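For readers who wish to set up this kind of comparison, here is a minimal sketch of a cross-classified variance-components model in R using the lme4 package. The data frame reviews and its columns rating, paper, and reviewer are hypothetical placeholders; this is one plausible way to obtain the components and the REML-based LRT, not our exact analysis script.

```r
library(lme4)

# Hypothetical long-format data: one row per review, with a rating
# (e.g., the relevance score) plus paper and reviewer identifiers
fit <- lmer(rating ~ 1 + (1 | paper) + (1 | reviewer),
            data = reviews, REML = TRUE)
VarCorr(fit)  # variance components for paper, reviewer, and residual

# REML-based likelihood-ratio test against a model without the
# paper component; refit = FALSE keeps the REML (rather than ML) fits
fit0 <- update(fit, . ~ . - (1 | paper))
anova(fit, fit0, refit = FALSE)
```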
We found no significant differences between (a) the paper scores from the same-discipline reviewers and (b) the paper scores from the different-discipline reviewers (all Holm-adjusted ps > .147; see Online Resource 4).
References
Adams, K. M. (1991). Peer review: An unflattering picture. Behavioral and Brain Sciences, 14(1), 135–136.
Akerlof, G. A. (2003). Writing the “The Market for ‘Lemons’”: A personal and interpretive essay. https://www.nobelprize.org/nobel_prizes/economic-sciences/laureates/2001/akerlof-article.html. Accessed 4 Sept 2017.
Aksnes, D. W. (2003). Characteristics of highly cited papers. Research Evaluation, 12(3), 159–170.
Altman, D. G., & Bland, J. M. (2011). How to obtain the P value from a confidence interval. BMJ, 343, d2304.
Anderson, K. (2012). The problems with calling comments “Post-Publication Peer-Review” [Web log message]. Retrieved from http://scholarlykitchen.sspnet.org/2012/03/26/the-problems-with-calling-comments-post-publication-peer-review.
Asparouhov, T., & Muthén, B. (2010). Bayesian analysis of latent variable models using Mplus. http://www.statmodel.com/download/BayesAdvantages18.pdf. Accessed 30 Mar 2017.
Baethge, C., Franklin, J., & Mertens, S. (2013). Substantial agreement of referee recommendations at a general medical journal—A peer review evaluation at Deutsches Ärzteblatt International. PLoS ONE, 8(5), e61401.
Bailar, J. C., & Patterson, K. (1985). Journal peer review—The need for a research agenda. The New England Journal of Medicine, 312(10), 654–657.
Benda, W. G. G., & Engels, T. C. E. (2011). The predictive validity of peer review: A selective review of the judgmental forecasting qualities of peers, and implications for innovation in science. International Journal of Forecasting, 27(1), 166–182.
Beyer, J. M., Chanove, R. G., & Fox, W. B. (1995). Review process and the fates of manuscripts submitted to AMJ. Academy of Management Journal, 38(5), 1219–1260.
Blackburn, J. L., & Hakel, M. D. (2006). An examination of sources of peer-review bias. Psychological Science, 17(5), 378–382.
Bornmann, L., & Daniel, H.-D. (2005). Selection of research fellowship recipients by committee peer review. Reliability, fairness and predictive validity of Board of Trustees’ decisions. Scientometrics, 63(2), 297–320.
Bornmann, L., & Daniel, H.-D. (2008a). Selecting manuscripts for a high-impact journal through peer review: A citation analysis of communications that were accepted by Angewandte Chemie International Edition, or rejected but published elsewhere. Journal of the American Society for Information Science and Technology, 59(11), 1841–1852.
Bornmann, L., & Daniel, H.-D. (2008b). The effectiveness of the peer review process: Inter-referee agreement and predictive validity of manuscript refereeing at Angewandte Chemie. Angewandte Chemie-International Edition, 47(38), 7173–7178.
Bornmann, L., & Daniel, H.-D. (2008c). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1), 45–80.
Bornmann, L., Mutz, R., & Daniel, H.-D. (2010). A reliability-generalization study of journal peer reviews: A multilevel meta-analysis of inter-rater reliability and its determinants. PLoS ONE, 5(12), e14331.
Bortz, J., & Döring, N. (2006). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler [Research methods and evaluation for human and social scientists] (4th ed.). Heidelberg, DE: Springer.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer.
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York, NY: Guilford Press.
Burdock, E. I., Fleiss, J. L., & Hardesty, A. S. (1963). A new view of inter-observer agreement. Personnel Psychology, 16(4), 373–384.
Callaham, M. L., & Tercier, J. (2007). The relationship of previous training and experience of journal peer reviewers to subsequent review quality. PLoS Medicine, 4(1), e40.
Campanario, J. M. (1998). Peer review for journals as it stands today—Part 1. Science Communication, 19(3), 181–211.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.
Campion, M. A. (1993). Article review checklist: A criterion checklist for reviewing research articles in applied psychology. Personnel Psychology, 46(3), 705–718.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
Cattell, R. B., & Jaspers, J. (1967). A general plasmode (No. 30-10-5-2) for factor analytic exercises and research. Multivariate Behavioral Research Monographs, 67, 1–212.
Chase, J. M. (1970). Normative criteria for scientific publication. American Sociologist, 5(3), 262–265.
Church, R. M., Crystal, J. D., & Collyer, C. E. (1996). Correction of errors in scientific research. Behavior Research Methods, Instruments, & Computers, 28(2), 305–310.
Cicchetti, D. V. (1991). The reliability of peer review for manuscript and grant submissions: A cross-disciplinary investigation. Behavioral and Brain Sciences, 14(1), 119–135.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.
Cicchetti, D. V., & Conn, H. O. (1976). A statistical analysis of reviewer agreement and bias in evaluating medical abstracts. Yale Journal of Biology and Medicine, 49(4), 373–383.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohrs, J. C., Moschner, B., Maes, J., & Kielmann, S. (2005). The motivational bases of right-wing authoritarianism and social dominance orientation: Relations to values and attitudes in the aftermath of September 11, 2001. Personality and Social Psychology Bulletin, 31(10), 1425–1434.
Cole, S., Cole, J. R., & Simon, G. A. (1981). Chance and consensus in peer review. Science, 214(4523), 881–886.
Cornforth, J. W. (1974). Referees. New Scientist, 62(892), 39.
Crowe, M., & Sheppard, L. (2011a). A general critical appraisal tool: An evaluation of construct validity. International Journal of Nursing Studies, 48(12), 1505–1516.
Crowe, M., & Sheppard, L. (2011b). A review of critical appraisal tools show they lack rigor: Alternative tool structure is proposed. Journal of Clinical Epidemiology, 64(1), 79–89.
de Winter, J. C. F., Zadpoor, A. A., & Dodou, D. (2014). The expansion of Google Scholar versus Web of Science: A longitudinal study. Scientometrics, 98(2), 1547–1565.
DeCoursey, T. (2006). The pros and cons of open peer review. Nature. Retrieved from http://www.nature.com/nature/peerreview/debate/nature04991.html.
Donner, A. (1986). A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review, 54(1), 67–82.
Dziuban, C. D., & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor analysis? Some decision rules. Psychological Bulletin, 81(6), 358–361.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65(2), 241–261.
Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8(1), 38–60.
Enders, C. K. (2001). The performance of the full information maximum likelihood estimator in multiple regression models with missing data. Educational and Psychological Measurement, 61(5), 713–740.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage.
Fisher, R. A. (1934). Statistical methods for research workers (5th ed.). Edinburgh: Oliver and Boyd.
Fiske, D. W., & Fogg, L. (1990). But the reviewers are making different criticisms of my paper! Diversity and uniqueness in reviewer comments. American Psychologist, 45(5), 591–598.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman and Hall/CRC.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.
Gilliland, S. W., & Cortina, J. M. (1997). Reviewer and editor decision making in the journal review process. Personnel Psychology, 50(2), 427–452.
Gottfredson, S. D. (1978). Evaluating psychological research reports: Dimensions, reliability, and correlates of quality judgments. American Psychologist, 33(10), 920–934.
Groves, T. (2010). Is open peer review the fairest system? Yes. BMJ, 341, c6424.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Gwet, K. L. (2014). The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.
Halatchliyski, I., & Cress, U. (2014). How structure shapes dynamics: Knowledge development in Wikipedia—A network multilevel modeling approach. PLoS ONE, 9(11), e111958.
Hardwig, J. (1985). Epistemic dependence. The Journal of Philosophy, 82(7), 335–349.
Harrison, C. (2004). Peer review, politics and pluralism. Environmental Science & Policy, 7(5), 357–368.
Hassebrauck, M. (1983). Die Beurteilung der physischen Attraktivität: Konsens unter Urteilern? [Judging physical attractiveness: Consensus among judges?]. Zeitschrift für Sozialpsychologie, 14(2), 152–161.
Hassebrauck, M. (1993). Die Beurteilung der physischen Attraktivität [The assessment of physical attractiveness]. In M. Hassebrauck & R. Niketta (Eds.), Physische Attraktivität [Physical attractiveness] (1st ed., pp. 29–59). Göttingen, DE: Hogrefe.
Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7(2), 191–205.
Hemlin, S., & Montgomery, H. (1990). Scientists’ conceptions of scientific quality: An interview study. Science Studies, 3(1), 73–81.
Hemlin, S., & Rasmussen, S. B. (2006). The shift in academic quality control. Science, Technology and Human Values, 31(2), 173–198.
Henss, R. (1992). “Spieglein, Spieglein an der Wand …”: Geschlecht, Alter und physische Attraktiviät [“Mirror, mirror on the wall…”: Sex, age, and physical attractiveness]. Weinheim, DE: PVU.
Herzog, H. A., Podberscek, A. L., & Docherty, A. (2005). The reliability of peer review in anthrozoology. Anthrozoos, 18(2), 175–182.
Hilbe, J. M. (2011). Negative binomial regression (2nd ed.). Cambridge: Cambridge University Press.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Hönekopp, J. (2006). Once more: Is beauty in the eye of the beholder? Relative contributions of private and shared taste to judgments of facial attractiveness. Journal of Experimental Psychology: Human Perception and Performance, 32(2), 199–209.
Hönekopp, J., Becker, B. J., & Oswald, F. L. (2006). The meaning and suitability of various effect sizes for structured Rater x Ratee designs. Psychological Methods, 11(1), 72–86.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179–185.
Houry, D., Green, S., & Callaham, M. (2012). Does mentoring new peer reviewers improve review quality? A randomized trial. BMC Medical Education, 12.
Howard, L., & Wilkinson, G. (1998). Peer review and editorial decision-making. British Journal of Psychiatry, 173, 110–113.
Hutcheson, G. D., & Sofroniou, N. (1999). The multivariate social scientist. Thousand Oaks, CA: Sage.
IBM Corp. (2011). IBM SPSS Statistics for windows (version 20.0) [computer software]. Armonk, NY: IBM Corp.
Jayasinghe, U. W., Marsh, H. W., & Bond, N. (2003). A multilevel cross-classified modelling approach to peer review of grant proposals: The effects of assessor and researcher attributes on assessor ratings. Journal of the Royal Statistical Society A, 166(3), 279–300.
Jayasinghe, U. W., Marsh, H. W., & Bond, N. (2006). A new reader trial approach to peer review in funding research grants: An Australian experiment. Scientometrics, 69(3), 591–606.
Kaiser, H. F. (1970). A second generation Little Jiffy. Psychometrika, 35(4), 401–415.
Kaiser, H. F., & Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measurement, 34(1), 111–117.
Kaplan, D., & Depaoli, S. (2013). Bayesian statistical methods. In T. D. Little (Ed.), The Oxford handbook of quantitative methods (Vol. 1, pp. 407–437). New York, NY: Oxford University Press.
Kemper, K. J., McCarthy, P. L., & Cicchetti, D. V. (1996). Improving participation and interrater agreement in scoring ambulatory pediatric association abstracts: How well have we succeeded? Archives of Pediatrics and Adolescent Medicine, 150(4), 380–383.
Khan, K. (2010). Is open peer review the fairest system? No. BMJ, 341, c6425.
Kirk, S. A., & Franke, T. M. (1997). Agreeing to disagree: A study of the reliability of manuscript reviews. Social Work Research, 21(2), 121–126.
Kitcher, P. (1990). The division of cognitive labor. The Journal of Philosophy, 87(1), 5–22.
Langfeldt, L. (2001). The decision-making constraints and processes of grant peer review, and their effects on the review outcome. Social Studies of Science, 31(6), 820–841.
Lee, C. J., Sugimoto, C. R., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the American Society for Information Science and Technology, 64(1), 2–17.
Li, D., & Agha, L. (2015). Big names or big ideas: Do peer-review panels select the best science proposals? Science, 348, 434–438.
Lindsey, D. (1988). Assessing precision in the manuscript review process: A little better than a dice roll. Scientometrics, 14(1–2), 75–82.
Lindsey, D. (1989). Using citation counts as a measure of quality in science: Measuring what’s measurable rather than what’s valid. Scientometrics, 15(3–4), 189–203.
List, B. (2017). Crowd-based peer review can be good and fast. Nature, 546(7656), 9.
Lord, C. G., Ross, L., & Lepper, M. R. (1979). Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37(11), 2098–2109.
Luce, R. D. (1993). Reliability is neither to be expected nor desired in peer review. Behavioral and Brain Sciences, 16(2), 399–400.
Marsh, H. W., & Ball, S. (1981). Interjudgmental reliability of reviews for the Journal of Educational Psychology. Journal of Educational Psychology, 73(6), 872–880.
Marsh, H. W., & Ball, S. (1989). The peer review process used to evaluate manuscripts submitted to academic journals: Interjudgmental reliability. The Journal of Experimental Education, 57(2), 151–169.
Marsh, H. W., Bond, N. W., & Jayasinghe, U. W. (2007). Peer review process: Assessments by applicant-nominated referees are biased, inflated, unreliable and invalid. Australian Psychologist, 42(1), 33–38.
Marsh, H. W., Jayasinghe, U. W., & Bond, N. W. (2008). Improving the peer-review process for grant applications: Reliability, validity, bias, and generalizability. American Psychologist, 63(3), 160–168.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Montgomery, A. A., Graham, A., Evans, P. H., & Fahey, T. (2002). Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. BMC Health Services Research, 2.
Muthén, B. (2010). Bayesian analysis in Mplus: A brief introduction [manuscript]. http://www.statmodel.com/download/IntroBayesVersion%203.pdf. Accessed 30 Mar 2017.
Muthén, B., & Asparouhov, T. (2011). Bayesian SEM: A more flexible representation of substantive theory [manuscript]. http://www.statmodel.com/download/BSEMv4REVISED. Accessed 30 Mar 2017.
Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
Mutz, R., Bornmann, L., & Daniel, H.-D. (2012). Heterogeneity of inter-rater reliabilities of grant peer reviews and its determinants: A general estimating equations approach. PLoS ONE, 7(10), e48509.
O’Brien, R. M. (1991). The reliability of composites of referee assessments of manuscripts. Social Science Research, 20(3), 319–328.
O’Neill, T. A., Goffin, R. D., & Gellatly, I. R. (2012). The use of random coefficient modeling for understanding and predicting job performance ratings: An application with field data. Organizational Research Methods, 15(3), 436–462.
Opthof, T., Coronel, R., & Janse, M. J. (2002). The significance of the peer review process against the background of bias: Priority ratings of reviewers and editors and the prediction of citation, the role of geographical bias. Cardiovascular Research, 56(3), 339–346.
Oxman, A. D., Guyatt, G. H., Singer, J., Goldsmith, C. H., Hutchison, B. G., et al. (1991). Agreement among reviewers of review articles. Journal of Clinical Epidemiology, 44(1), 91–98.
Petty, R. E., Fleming, M. A., & Fabrigar, L. R. (1999). The review process at PSPB: Correlates of interreviewer agreement and manuscript acceptance. Personality and Social Psychology Bulletin, 25(2), 188–203.
Platt, J. R. (1964). Strong inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science, 146(3642), 347–353.
Popper, K. R. (1968). Epistemology without a knowing subject. Studies in Logic and the Foundations of Mathematics, 52, 333–373.
Pulakos, E. D., Schmitt, N., & Ostroff, C. (1986). A warning about the use of a standard deviation across dimensions within ratees to measure halo. Journal of Applied Psychology, 71(1), 29–32.
Putka, D. J. (2002). The variance architecture approach to the study of constructs in organizational contexts (Doctoral dissertation, Ohio University). http://etd.ohiolink.edu/. Accessed 30 Mar 2017.
Putka, D. J., Lance, C. E., Le, H., & McCloy, R. A. (2011). A cautionary note on modeling multitrait–multirater data arising from ill-structured measurement designs. Organizational Research Methods, 14(3), 503–529.
Putka, D. J., Le, H., McCloy, R. A., & Diaz, T. (2008). Ill-structured measurement designs in organizational research: Implications for estimating interrater reliability. Journal of Applied Psychology, 93(5), 959–981.
Qiu, L. (1992). A study of interdisciplinary research collaboration. Research Evaluation, 2(3), 169–175.
R Core Team. (2016). R: A language and environment for statistical computing (Version 3.3.1) [computer software]. Vienna, AT: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org.
Ramasundarahettige, C. F., Donner, A., & Zou, G. Y. (2009). Confidence interval construction for a difference between two dependent intraclass correlation coefficients. Statistics in Medicine, 28(7), 1041–1053.
Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York, NY: Routledge.
Revelle, W. (2016). Psych: Procedures for personality and psychological research (Version 1.6.9) [computer software]. Evanston, IL: Northwestern University. http://cran.r-project.org/web/packages/psych/. Accessed 30 Mar 2017.
Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354–373.
Rosa, H. (2016). Resonanz - Eine Soziologie der Weltbeziehung [Resonance—A sociology of the relationship to the world]. Berlin, DE: Suhrkamp.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, H. R., Redelmeier, D. A., Wu, A. W., & Steinberg, E. P. (1993). How reliable is peer review of scientific abstracts? Looking back at the 1991 annual meeting of the Society of General Internal Medicine. Journal of General Internal Medicine, 8(5), 255–258.
Satorra, A., & Bentler, P. M. (2010). Ensuring positiveness of the scaled Chi square test statistic. Psychometrika, 75(2), 243–248.
Scarr, S., & Weber, B. L. R. (1978). The reliability of reviews for the American Psychologist. American Psychologist, 33(10), 935.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Scott, W. A. (1974). Interreferee agreement on some characteristics of manuscripts submitted to Journal of Personality and Social Psychology. American Psychologist, 29(9), 698–702.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York, NY: Wiley.
Serlin, R. C. (1993). Confidence intervals and the scientific method: A case for Holm on the range. Journal of Experimental Education, 61(4), 350–360.
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46(1), 561–584.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Smith, R. (2003). The future of peer review. http://pdfs.semanticscholar.org/7c06/8fcda6956132db6732e6c353ffe5fe6b6f62.pdf?_ga=1.116839174.1674370711.1490806067. Accessed 29 Mar 2017.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64(4), 583–639.
Stephan, P., Veugelers, R., & Wang, J. (2017). Reviewers are blinkered by bibliometrics. Nature, 544(7651), 411–412.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and methodology. Annual Review of Clinical Psychology, 5, 1–25.
Tahamtan, I., Afshar, A. S., & Ahamdzadeh, K. (2016). Factors affecting number of citations: A comprehensive review of the literature. Scientometrics, 107(3), 1195–1225.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4(1), 25–29.
Uebersax, J. S. (1982–1983). A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 17(4), 335–342.
van Dalen, H. P., & Henkens, K. (2012). Intended and unintended consequences of a publish-or-perish culture: A worldwide survey. Journal of the American Society for Information Science and Technology, 63(7), 1282–1293.
Van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & van Aken, M. A. G. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842–860.
van Noorden, R. (2015). Interdisciplinary research by the numbers: An analysis reveals the extent and impact of research that bridges disciplines. Nature, 525(7569), 306–307.
Walsh, E., Rooney, M., Appleby, L., & Wilkinson, G. (2000). Open peer review: A randomised controlled trial. The British Journal of Psychiatry, 176(1), 47–51.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Whitehurst, G. J. (1983). Interrater agreement for reviews for Developmental Review. Developmental Review, 3(1), 73–78.
Wirtz, M., & Caspar, F. (2002). Beurteilerübereinstimmung und Beurteilerreliabilität: Methoden zur Bestimmung und Verbesserung der Zuverlässigkeit von Einschätzungen mittels Kategoriensystemen und Ratingskalen [Inter-rater agreement and inter-rater reliability: Methods for determining and improving the reliability of assessments using category systems and rating scales]. Göttingen, DE: Hogrefe.
Wood, M., Roberts, M., & Howell, B. (2004). The reliability of peer reviews of papers on information systems. Journal of Information Science, 30(1), 2–11.
Yates, A. (1987). Multivariate exploratory data analysis: A perspective on exploratory factor analysis. Albany, NY: State University of New York Press.
Yousfi, S. (2005). Mythen und Paradoxien der klassischen Testtheorie (I): Testlänge und Gütekriterien [Myths and paradoxes of classical test theory (I): About test length, reliability, and validity]. Diagnostica, 51(1), 1–11.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30(1), 165–200.
Zyphur, M. J., & Oswald, F. L. (2015). Bayesian estimation and inference: A user’s guide. Journal of Management, 41(2), 390–420.
Additional information
Jens Jirschitzka and Aileen Oeberst have shared first authorship.
Cite this article
Jirschitzka, J., Oeberst, A., Göllner, R. et al. Inter-rater reliability and validity of peer reviews in an interdisciplinary field. Scientometrics 113, 1059–1092 (2017). https://doi.org/10.1007/s11192-017-2516-6