
Neglected considerations in the analysis of agreement among journal referees

  • Published in: Scientometrics

Abstract

Studies of representative samples of submissions to scientific journals show statistically significant associations between referees' recommendations. These associations are moderately large given the multidimensional and unstable character of scientists' evaluations of papers, and composites of referees' recommendations can significantly aid editors in selecting manuscripts for publication, especially when there is great variability in the quality of submissions and acceptance rates are low. Assessments of the value of peer-review procedures in journal manuscript evaluation should take into account features of the entire scholarly communications system present in a field.


Notes and references

  1. V. Bakanic, C. McPhail, R. J. Simon, The manuscript review and decision-making process, American Sociological Review, 52 (1987) 631.

  2. G. J. Whitehurst, Interrater agreement for journal manuscript reviews, American Psychologist, 39 (1984) 22.

  3. S. Lock, A Delicate Balance: Editorial Peer Review in Medicine, Philadelphia, Penn., ISI Press, 1986.

  4. E. Garfield, Refereeing and peer review. Part 1. Opinion and conjecture on the effectiveness of refereeing, Current Contents, (1986) No. 31, 3.

  5. E. Garfield, Refereeing and peer review. Part 2. The research on refereeing and alternatives to the present system, Current Contents, (1986) No. 32, 3.

  6. D. P. Peters, S. J. Ceci, Peer review practices of psychological journals: The fate of published articles, submitted again, Behavioural and Brain Sciences, 5 (1982) 187. Also see the commentary on this paper in the same issue of Behavioural and Brain Sciences.

  7. P. McReynolds, Reliability of ratings of research papers, American Psychologist, 26 (1971) 400.

  8. C. M. Bowen, R. Perloff, J. Jacoby, Improving manuscript evaluation procedures, American Psychologist, 27 (1972) 22.

  9. S. Cole, J. R. Cole, G. Simon, Chance and consensus in peer review, Science, 214 (1981) 881.

  10. D. Klahr, Insiders, outsiders and efficiency in a National Science Foundation panel, American Psychologist, 40 (1985) 148.

  11. Of course, referees' evaluations may show agreement because they reflect other variables besides merit; “reliability is not the same as validity.” To our knowledge, researchers have reported only two studies of the association between scientists' evaluations of papers and independent indicators of those papers' merit. Small reported data on original referees' assessments of a sample of highly cited papers in chemistry, which showed a nonsignificant correlation between the referee assessments and the subsequent citation levels (see H. G. Small, Characteristics of Frequently Cited Papers in Chemistry. Final Report on NSF Contract NSF-C795, Philadelphia, 1974). In contrast, Gottfredson found more substantial positive correlations between citations to published psychology papers and overall judgments of those papers' quality and impact made by experts nominated by the papers' authors (see S. D. Gottfredson, Evaluating psychological research reports: Dimensions, reliability, and correlates of quality judgments, American Psychologist, 33 (1978) 920).

  12. M. J. Mahoney, Scientist as Subject: The Psychological Imperative, Cambridge, Mass., Ballinger, 1976.

  13. D. Lindsey, The Scientific Publication System in Social Science, San Francisco, Jossey-Bass, 1978.

  14. L. L. Hargens, Scholarly consensus and journal rejection rates, American Sociological Review, 53 (1988) 139.

  15. D. Lindsey, Assessing precision in the manuscript review process: A little better than chance, Scientometrics, 14 (1988) 75.

  16. See, for example, H. M. Blalock, Jr., Social Statistics, New York, McGraw-Hill, 1979.

  17. A. W. Ward, B. W. Hall, C. F. Schram, Evaluation of published educational research: A national study, American Educational Research Journal, 12 (1975) 109.

  18. Unrepresentatively homogeneous samples of papers are also produced when editors summarily reject a large proportion of submissions. To the extent that editors screen out manuscripts that referees would judge to be of poor quality, studies based on the remaining papers that receive referee evaluations will tend to show low levels of agreement between referees. High-prestige multidisciplinary journals, high-prestige medical journals, and social science journals are most likely to exhibit high summary rejection rates. See M. D. Gordon, A Study of the Evaluation of Research Papers by Primary Journals in the U.K., Leicester, England, Primary Communications Research Centre, University of Leicester, 1978.

  19. W. A. Scott, Interreferee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology, American Psychologist, 29 (1974) 698.

  20. L. L. Hargens, J. R. Herting, A new approach to referees' assessments of manuscripts, Social Science Research (forthcoming).

  21. Lindsey, op. cit. reference 17 above, suggests that they are likely to be lower, but seems to base his judgment on results from the numerous studies that have been subject to truncated variation rather than on those studies that have been based on more representative samples of manuscripts.

  22. H. L. Roediger III, The role of journal editors in the scientific process, in D. N. Jackson, J. P. Rushton (Eds), Scientific Excellence: Origins and Assessment, Beverly Hills, CA, Sage, 1987, 222.

  23. B. C. Griffith, Judging document content versus social functions of refereeing: Possible and impossible tasks, Behavioural and Brain Sciences, 5 (1982) 214.

  24. T. Saracevic, Relevance: A review of and framework for thinking on the notion in information science, Journal of the American Society for Information Science, 26 (1975) 321.

  25. J. C. Nunnally, Psychometric Theory, New York, McGraw-Hill, 1967.

  26. H. E. A. Tinsley, D. J. Weiss, Interrater reliability and agreement of subjective judgments, Journal of Counseling Psychology, 22 (1975) 358.

  27. A. L. Stinchcombe, R. Ofshe, On journal editing as a probabilistic process, American Sociologist, 5 (1969) 19.

  28. Hargens (footnote 12 in op. cit. in reference 16 above) also made this error.

  29. Op. cit. reference 15, p. 37; op. cit. reference 17, D. Lindsey, Assessing precision in the manuscript review process: A little better than chance, Scientometrics, 14 (1988) p. 78.

  30. These results illustrate the point that measures with low reliability, and therefore low validity, can be valuable when selection ratios are low and there is substantial variation among cases being evaluated (see L. J. Cronbach, Essentials of Psychological Testing (3rd Ed.), New York, Harper and Row, 1970). Lindsey and others have argued that referees' evaluations of manuscripts are more likely to be unreliable for behavioural science journals than for natural science journals, but the former are also more likely to exhibit the two conditions that enhance the practical value of even fairly unreliable evaluations. Highly selective and prestigious medical journals also exhibit these two conditions, and Lock, op. cit., estimates that the British Medical Journal accepts 80 percent of the top quality papers submitted to it.
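Cronbach's point in note 30 can be illustrated with a small simulation. All numbers below are invented for illustration: a rating whose correlation with true quality is only about 0.7 (reliability near 0.5), combined with a 10 percent acceptance rate and substantial variation among submissions, still selects papers well above the average of the pool.

```python
import random
import statistics

random.seed(42)

N = 5000  # hypothetical pool of submissions
true_quality = [random.gauss(0, 1) for _ in range(N)]

# A noisy referee rating: equal parts signal and error, so the
# rating-quality correlation is about 0.7 (reliability ~ 0.5).
ratings = [q + random.gauss(0, 1) for q in true_quality]

# Accept the top 10 percent by rating (a low selection ratio).
k = N // 10
accepted = sorted(range(N), key=lambda i: ratings[i], reverse=True)[:k]

mean_accepted = statistics.mean(true_quality[i] for i in accepted)
mean_overall = statistics.mean(true_quality)
print(f"mean quality, accepted: {mean_accepted:.2f}")
print(f"mean quality, overall:  {mean_overall:.2f}")
```

With these assumed parameters the accepted set averages roughly a full standard deviation above the pool, despite the rating's low reliability; raising the acceptance rate or shrinking the quality variance erodes that advantage.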

  31. Lindsey's recommendation that journals solicit the opinions of at least three initial referees for each submission also exaggerates the benefits of such a policy by neglecting the peer-review system used by journals. For most behavioural-science journals, which are the focus of Lindsey's discussion, a substantial minority of manuscripts (those receiving a split decision from the two initial referees and even some of those receiving two positive evaluations) already receive three referee evaluations. Using three initial referees for all papers will increase the reliability of the composite evaluations of only those papers that would receive two unfavourable evaluations under the current system. Unfortunately, using three initial referees for these papers would also slow down their evaluation, and authors appear to be more concerned about the speed of the journal review process than about the reliability of referees' evaluations (see Y. Brackbill, F. Korten, Journal reviewing practices: Authors' and APA members' suggestions for revision, American Psychologist, 27 (1972) 22). Using three initial referees might speed up the evaluation of the remaining manuscripts somewhat (because editors would not wait until the first two referees returned recommendations before soliciting the opinion of the third), but the fact that these constitute a minority of submissions would probably not allow the time savings experienced for them to counterbalance the longer lags experienced in evaluating the very large proportion of manuscripts that receive only two evaluations under the current system.
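The marginal reliability gain from adding a third referee can be gauged with the standard Spearman-Brown formula for the reliability of a composite of parallel ratings; the single-referee reliability of 0.3 below is an illustrative assumption, not a figure from the text.

```python
def composite_reliability(r1: float, k: int) -> float:
    """Spearman-Brown reliability of the mean of k parallel ratings,
    given single-rating reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)

r1 = 0.3  # assumed single-referee reliability (illustrative)
for k in (1, 2, 3):
    print(f"{k} referee(s): composite reliability {composite_reliability(r1, k):.3f}")
```

Under this assumption the step from two referees to three adds only about 0.10 of reliability, which is the kind of modest gain note 31 weighs against the added delay.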

  32. See Blalock, op. cit., pp. 282–290.

  33. Lindsey (op. cit. reference 17) reports a non-significant chi-squared value for a “quasi-independence” model applied to data from one of these journals, Personality and Social Psychology Bulletin. Unfortunately, Lindsey does not specify which model of quasi-independence he tested. We have been able to obtain the chi-squared value he reports only by (1) treating the Personality and Social Psychology Bulletin data in Lindsey's Table 2 as frequencies (they are actually percentages) and (2) constraining the model to reproduce the entries along the diagonal of Lindsey's Table 2 (some of these entries represent disagreement and others represent agreement). Thus, it is doubtful that Lindsey's analysis tested any meaningful hypothesis, much less the null hypothesis that referees' judgments are statistically independent.
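For contrast with the model note 33 criticizes, the null hypothesis that two referees' verdicts are statistically independent can be tested directly with an ordinary Pearson chi-squared statistic on their cross-classification. The 2×2 counts below are invented for illustration and are not Lindsey's data.

```python
# Hypothetical 2x2 cross-classification of two referees' verdicts
# (rows: referee A accept/reject; columns: referee B accept/reject).
table = [[30, 20],
         [15, 35]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-squared against the independence model E_ij = r_i * c_j / n.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
print(round(chi2, 2))  # 1 degree of freedom; values above 3.84 reject at p < 0.05
```

Note that this uses the observed counts themselves; running the same computation on percentages, as note 33 says Lindsey effectively did, changes the statistic and invalidates the test.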

  34. L. A. Goodman, New methods for analyzing the intrinsic character of qualitative variables using cross-classified data, American Journal of Sociology, 93 (1987) 529.

  35. C. C. Clogg, Using association models in sociological research: Some examples, American Journal of Sociology, 88 (1982) 114.

  36. See J. R. Cole, S. Cole, Which researcher will get the grant?, Nature, 279 (1979) 575–576, and Gordon, op. cit. These measures include various estimates of the proportion of the total variance in referees' assessments that is between- or within-manuscript (or proposal) variance.
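A between-/within-manuscript variance split of the kind note 36 describes can be computed directly from paired referee ratings; the five manuscripts and their 1-5 scores below are invented for illustration.

```python
import statistics

# Hypothetical data: each manuscript scored by two referees on a 1-5 scale.
ratings = {"m1": (4, 5), "m2": (2, 2), "m3": (5, 4), "m4": (1, 3), "m5": (3, 3)}

all_scores = [s for pair in ratings.values() for s in pair]
grand_mean = statistics.mean(all_scores)

# Between-manuscript sum of squares: spread of manuscript means
# around the grand mean.
between = sum(len(pair) * (statistics.mean(pair) - grand_mean) ** 2
              for pair in ratings.values())

# Within-manuscript sum of squares: disagreement between the two referees.
within = sum((s - statistics.mean(pair)) ** 2
             for pair in ratings.values() for s in pair)

share_between = between / (between + within)
print(f"between-manuscript share of total variance: {share_between:.3f}")
```

A share near 1 means referees largely agree in ranking manuscripts; a share near 0.5 or below means referee disagreement swamps real differences among manuscripts.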

  37. See S. Cole, G. Simon, J. R. Cole, Do journal rejection rates index consensus?, American Sociological Review, 53 (1988) 152, and L. L. Hargens, Further evidence on field differences in consensus from the NSF peer review studies, American Sociological Review, 53 (1988) 157.

  38. H. A. Zuckerman, R. K. Merton, Patterns of evaluation in science: institutionalization, structure and functions of the referee system, Minerva, 9 (1971) 66.

  39. R. E. Stevens, Characteristics of Subject Literatures, ACRL Monograph No. 6, Chicago, Association of College and Reference Libraries, 1953.

  40. C. H. Brown, Scientific Serials, ACRL Monograph No. 16, Chicago, Association of College and Reference Libraries, 1956.

  41. W. D. Garvey, N. Lin, C. E. Nelson, Some comparisons of communication activities in the physical and social sciences, in C. E. Nelson, D. K. Pollock (Eds), Communication among Scientists and Engineers, Lexington, Mass., Heath, 1970, p. 61.

  42. Op. cit. reference 16. One reason that studies of referee reliability are relatively rare for physical-science journals is that such journals often use the single-initial-referee system. Thus, data on pairs of referee assessments of all submissions are unavailable for these journals. Those manuscripts that do receive at least two independent referee evaluations under this system are an unrepresentative subset of all manuscripts. Thus, nonexperimental data on referee agreement for these journals, such as the evidence reported by Zuckerman and Merton, should be viewed with caution.

  43. W. D. Garvey, Communication: The Essence of Science, Oxford, Pergamon, 1979.

Cite this article

Hargens, L.L., Herting, J.R. Neglected considerations in the analysis of agreement among journal referees. Scientometrics 19, 91–106 (1990). https://doi.org/10.1007/BF02130467
