V. Bakanic, C. McPhail, R. J. Simon, The manuscript review and decision-making process, American Sociological Review, 52 (1987) 631.
G. J. Whitehurst, Interrater agreement for journal manuscript reviews, American Psychologist, 39 (1984) 22.
S. Lock, A Delicate Balance: Editorial Peer Review in Medicine, Philadelphia, Penn., ISI Press, 1986.
E. Garfield, Refereeing and peer review. Part 1. Opinion and conjecture on the effectiveness of refereeing, Current Contents, (1986) No. 31, 3.
E. Garfield, Refereeing and peer review. Part 2. The research on refereeing and alternatives to the present system, Current Contents, (1986) No. 32, 3.
D. P. Peters, S. J. Ceci, Peer review practices of psychological journals: The fate of published articles, submitted again, Behavioural and Brain Sciences, 5 (1982) 187. Also see the commentary on this paper in the same issue of Behavioural and Brain Sciences.
P. McReynolds, Reliability of ratings of research papers, American Psychologist, 26 (1971) 400.
C. M. Bowen, R. Perloff, J. Jacoby, Improving manuscript evaluation procedures, American Psychologist, 27 (1972) 22.
S. Cole, J. R. Cole, G. Simon, Chance and consensus in peer review, Science, 214 (1981) 881.
D. Klahr, Insiders, outsiders and efficiency in a National Science Foundation panel, American Psychologist, 40 (1985) 148.
Of course, referees' evaluations may show agreement because they reflect other variables besides merit; “reliability is not the same as validity.” To our knowledge, researchers have reported only two studies of the association between scientists' evaluations of papers and independent indicators of those papers' merit. Small reported data on original referees' assessments of a sample of highly cited papers in chemistry, which showed a nonsignificant correlation between the referee assessments and the subsequent citation levels (see H. G. Small, Characteristics of Frequently Cited Papers in Chemistry. Final Report on NSF Contract NSF-C795, Philadelphia, 1974). In contrast, Gottfredson found more substantial positive correlations between citations to published psychology papers and overall judgments of those papers' quality and impact made by experts nominated by the papers' authors (see S. D. Gottfredson, Evaluating psychological research reports: Dimensions, reliability, and correlates of quality judgments, American Psychologist, 33 (1978) 920).
M. J. Mahoney, Scientist as Subject: The Psychological Imperative, Cambridge, Mass., Ballinger, 1976.
D. Lindsey, The Scientific Publication System in Social Science, San Francisco, Jossey-Bass, 1978.
L. L. Hargens, Scholarly consensus and journal rejection rates, American Sociological Review, 53 (1988) 139.
D. Lindsey, Assessing precision in the manuscript review process: A little better than chance, Scientometrics, 14 (1988) 75.
See, for example, H. M. Blalock, Jr., Social Statistics, New York, McGraw-Hill, 1979.
A. W. Ward, B. W. Hall, C. F. Schram, Evaluation of published educational research: A national study, American Educational Research Journal, 12 (1975) 109.
Unrepresentatively homogeneous samples of papers are also produced when editors summarily reject a large proportion of submissions. To the extent that editors screen out manuscripts that referees would judge to be of poor quality, studies based on the remaining papers that receive referee evaluations will tend to show low levels of agreement between referees. High-prestige multidisciplinary journals, high-prestige medical journals, and social science journals are most likely to exhibit high summary rejection rates. See M. D. Gordon, A Study of the Evaluation of Research Papers by Primary Journals in the U.K., Leicester, England, Primary Communications Research Centre, University of Leicester, 1978.
W. A. Scott, Interreferee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology, American Psychologist, 29 (1974) 698.
L. L. Hargens, J. R. Herting, A new approach to referees' assessments of manuscripts, Social Science Research (forthcoming).
Lindsey, op. cit. reference 17 above, suggests that they are likely to be lower, but seems to base his judgment on results from the numerous studies that have been subject to truncated variation rather than on those studies based on more representative samples of manuscripts.
H. L. Roediger III, The role of journal editors in the scientific process, in D. N. Jackson, J. P. Rushton (Eds.), Scientific Excellence: Origins and Assessment, Beverly Hills, CA, Sage, 1987, 222.
B. C. Griffith, Judging document content versus social functions of refereeing: Possible and impossible tasks, Behavioural and Brain Sciences, 5 (1982) 214.
T. Saracevic, Relevance: A review of and framework for thinking on the notion in information science, Journal of the American Society for Information Science, 26 (1975) 321.
J. C. Nunnally, Psychometric Theory, New York, McGraw-Hill, 1967.
H. E. A. Tinsley, D. J. Weiss, Interrater reliability and agreement of subjective judgments, Journal of Counseling Psychology, 22 (1975) 358.
A. L. Stinchcombe, R. Ofshe, On journal editing as a probabilistic process, American Sociologist, 5 (1969) 19.
Hargens (footnote 12 of op. cit. reference 16 above) also made this error.
Op. cit. reference 15, p. 37; op. cit. reference 17, D. Lindsey, Assessing precision in the manuscript review process: A little better than chance, Scientometrics, 14 (1988) p. 78.
These results illustrate the point that measures with low reliability, and therefore low validity, can be valuable when selection ratios are low and there is substantial variation among cases being evaluated (see L. J. Cronbach, Essentials of Psychological Testing (3rd Ed.), New York, Harper and Row, 1970). Lindsey and others have argued that referees' evaluations of manuscripts are more likely to be unreliable for behavioural science journals than for natural science journals, but the former are also more likely to exhibit the two conditions that enhance the practical value of even fairly unreliable evaluations. Highly selective and prestigious medical journals also exhibit these two conditions, and Lock, op. cit., estimates that the British Medical Journal accepts 80 percent of the top quality papers submitted to it.
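Cronbach's point can be illustrated with a small simulation (a hypothetical sketch with an assumed reliability of 0.3, not an analysis of any journal's data): even a noisy evaluation identifies manuscripts of above-average merit when the acceptance ratio is low and true merit varies widely.

```python
import random
import statistics

random.seed(1)

def mean_merit_of_selected(reliability, selection_ratio, n=20_000):
    """Simulate n manuscripts with standard-normal 'true merit' and a
    noisy referee score whose squared correlation with merit equals
    `reliability`; return the mean true merit of the accepted papers."""
    papers = []
    for _ in range(n):
        merit = random.gauss(0, 1)
        score = (reliability ** 0.5) * merit + \
                ((1 - reliability) ** 0.5) * random.gauss(0, 1)
        papers.append((score, merit))
    papers.sort(reverse=True)                     # best scores first
    accepted = papers[:int(n * selection_ratio)]
    return statistics.mean(m for _, m in accepted)

# With reliability of only 0.3, a journal accepting 10% of submissions
# still selects papers well above the population mean merit of zero,
# and more so than a journal accepting 50%.
print(mean_merit_of_selected(0.3, 0.10))
print(mean_merit_of_selected(0.3, 0.50))
```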
Lindsey's recommendation that journals solicit the opinions of at least three initial referees for each submission also exaggerates the benefits of such a policy by neglecting the peer-review system used by journals. For most behavioural-science journals, which are the focus of Lindsey's discussion, a substantial minority of manuscripts (those receiving a split decision from the two initial referees, and even some of those receiving two positive evaluations) already receive three referee evaluations. Using three initial referees for all papers will increase the reliability of the composite evaluations of only those papers that would receive two unfavourable evaluations under the current system. Unfortunately, using three initial referees for these papers would also slow down their evaluation, and authors appear to be more concerned about the speed of the journal review process than about the reliability of referees' evaluations (see Y. Brackbill, F. Korten, Journal reviewing practices: Authors' and APA members' suggestions for revision, American Psychologist, 27 (1972) 22). Using three initial referees might speed up the evaluation of the remaining manuscripts somewhat (because editors would not wait until the first two referees returned recommendations before soliciting the opinion of the third), but because these constitute a minority of submissions, the time savings for them would probably not counterbalance the longer lags experienced in evaluating the very large proportion of manuscripts that receive only two evaluations under the current system.
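The expected gain from a third referee can be gauged with the standard Spearman-Brown formula for the reliability of a mean of k parallel ratings (the single-referee reliability of 0.25 used below is an illustrative assumption, not an estimate from the studies cited):

```python
def composite_reliability(r1, k):
    """Spearman-Brown prophecy formula: reliability of the mean of
    k parallel ratings, given single-rating reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)

# Assumed single-referee reliability of 0.25:
print(composite_reliability(0.25, 2))   # two referees  -> 0.4
print(composite_reliability(0.25, 3))   # three referees -> 0.5
```

The third referee raises composite reliability only modestly, which is consistent with the note's argument that the policy's benefits are easily overstated.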
See Blalock, op. cit., pp. 282–290.
Lindsey (op. cit. reference 17) reports a non-significant chi-squared value for a “quasi-independence” model applied to data from one of these journals, Personality and Social Psychology Bulletin. Unfortunately, Lindsey does not specify which model of quasi-independence he tested. We have been able to obtain the chi-squared value he reports only by (1) treating the PSPB data in Lindsey's Table 2 as frequencies (they are actually percentages) and (2) constraining the model to reproduce the entries along the diagonal of Lindsey's Table 2 (some of these entries represent disagreement and others represent agreement). Thus, it is doubtful that Lindsey's analysis tested any meaningful hypothesis, much less the null hypothesis that referees' judgments are statistically independent.
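For reference, the null hypothesis that two referees' verdicts are statistically independent is tested by comparing observed cell frequencies with the products of the table's margins; a quasi-independence model applies the same fit with designated cells (usually the diagonal) set aside as structural. A minimal sketch of the ordinary independence test, using hypothetical frequencies rather than Lindsey's data:

```python
def independence_chi_squared(table):
    """Pearson chi-squared statistic for the independence model in a
    two-way contingency table (a list of rows of observed counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (observed - expected) ** 2 / expected
    return chi_sq

# Hypothetical accept/reject verdicts from two referees on 80 papers:
# rows = referee 1 (accept, reject), columns = referee 2 (accept, reject).
verdicts = [[30, 10],
            [10, 30]]
print(independence_chi_squared(verdicts))   # 20.0: independence fits poorly
```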
L. A. Goodman, New methods for analyzing the intrinsic character of qualitative variables using cross-classified data, American Journal of Sociology, 93 (1987) 529.
C. C. Clogg, Using association models in sociological research: Some examples, American Journal of Sociology, 88 (1982) 114.
See J. R. Cole, S. Cole, Which researcher will get the grant?, Nature, 279 (1979) 575–576, and Gordon, op. cit. These measures include various estimates of the proportion of the total variance in referees' assessments that is between- or within-manuscript (or proposal) variance.
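One such estimate is the one-way ANOVA intraclass correlation computed from pairs of referee ratings; a sketch with hypothetical ratings (not data from the studies cited):

```python
def between_manuscript_share(pairs):
    """One-way ANOVA intraclass correlation: the estimated share of
    total rating variance attributable to differences between
    manuscripts, given two independent referee ratings per manuscript."""
    n, k = len(pairs), 2
    means = [(a + b) / k for a, b in pairs]
    grand = sum(means) / n
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for (a, b), m in zip(pairs, means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical 5-point ratings from two referees on five manuscripts:
ratings = [(1, 2), (2, 2), (3, 4), (4, 3), (5, 5)]
print(between_manuscript_share(ratings))
```

A value near 1 indicates that nearly all rating variation lies between manuscripts (high referee agreement); a value near 0 indicates that referees disagree about as much as the manuscripts differ.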
See S. Cole, G. Simon, J. R. Cole, Do journal rejection rates index consensus?, American Sociological Review, 53 (1988) 152, and L. L. Hargens, Further evidence on field differences in consensus from the NSF peer review studies, American Sociological Review, 53 (1988) 157.
H. A. Zuckerman, R. K. Merton, Patterns of evaluation in science: Institutionalization, structure and functions of the referee system, Minerva, 9 (1971) 66.
R. E. Stevens, Characteristics of Subject Literatures, ACRL Monograph No. 6, Chicago, Association of College and Reference Libraries, 1953.
C. H. Brown, Scientific Serials, ACRL Monograph No. 16, Chicago, Association of College and Reference Libraries, 1956.
W. D. Garvey, N. Lin, C. E. Nelson, Some comparisons of communication activities in the physical and social sciences, in C. E. Nelson, D. K. Pollock (Eds.), Communication among Scientists and Engineers, Lexington, Mass., Heath, 1970, p. 61.
Op. cit. reference 16. One reason that studies of referee reliability are relatively rare for physical-science journals is that such journals often use the single initial referee system. Thus, data on pairs of referee assessments of all submissions are unavailable for these journals. Those manuscripts that do receive at least two independent referee evaluations under this system are an unrepresentative subset of all manuscripts. Consequently, nonexperimental data on referee agreement for these journals, such as the evidence reported by Zuckerman and Merton, should be viewed with caution.
W. D. Garvey, Communication: The Essence of Science, Oxford, Pergamon, 1979.