Abstract
In this paper we focus on the analysis of peer reviews and reviewer behaviour in a number of different review processes. More specifically, we report on the development, definition and rationale of a theoretical model of peer review processes that supports the identification of appropriate metrics for assessing the main characteristics of such processes, with the aim of rendering peer review more transparent and understandable. Together with known metrics and techniques, we introduce new ones to assess the overall quality (i.e. reliability, fairness, validity) and efficiency of peer review processes, e.g. the robustness of the process, the degree of agreement or disagreement among reviewers, and positive or negative bias in the reviewers’ decision making. We also examine the ability of peer review to predict the impact of papers in subsequent years. We apply the proposed model and analysis framework to a large review dataset from ten computer science conferences, comprising ca. 9,000 reviews of ca. 2,800 submitted contributions. We discuss the implications of the results and their potential use in improving the analysed peer review processes. A number of interesting results were found, in particular: (1) a low correlation between the peer review outcome and the impact over time of the accepted contributions; (2) an influence of the assessment scale on the way reviewers assign marks; (3) the effect and impact of rating bias, i.e. reviewers who consistently give lower or higher marks than all other reviewers; and (4) the effectiveness of statistical approaches for optimizing some process parameters (e.g. the number of papers per reviewer) to improve the overall quality of the process while keeping the overall effort under control. Based on the lessons learned, we suggest ways to improve the overall quality of peer review through procedures that can be easily implemented in current editorial management systems.
Notes
“With the authors’ consent, a paper already peer reviewed and accepted for publication by BMJ was altered to introduce eight weaknesses in design, analysis, or interpretation” (Godlee et al. 1998).
Note that this pragmatic choice does not imply that the authors blindly trust citation count as the only measure of impact. Indeed, prior work has shown that it has some flaws (Krapivin et al. 2010) and could be complemented by other novel metrics such as the number of downloads (Li et al. 2012) or other alternative metrics (Bollen et al. 2005). However, we adopt it because it is a commonly accepted and accessible metric.
When the second ranking is random, the formula for the divergence can be expressed analytically as \(NDiv_{\rho_i,\rho_{\rm a}}(t,n,\mathcal{C}) = \sum_{j=0}^{t} p_{t}(j,n)\, w_{j},\) where \(p_{t}(j,n) = \frac{\binom{t}{j}\binom{n-t}{t-j}}{\binom{n}{t}}\) and \(w_{j} = \frac{t-j}{t}.\)
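For illustration, the following is a minimal Python sketch (our own, not the authors' code; the function names and the Monte Carlo cross-check are ours). It evaluates the analytic expectation above, reading \(p_t(j,n)\) as the hypergeometric probability that a fixed top-\(t\) list and the top-\(t\) of a random ranking of \(n\) items share exactly \(j\) items, and \(w_j\) as the fraction of the top-\(t\) items that are missed.

```python
import math
import random

def ndiv_random_analytic(t: int, n: int) -> float:
    """Analytic expectation: sum over the possible overlaps j of the
    hypergeometric probability p_t(j, n) times the weight w_j = (t - j) / t."""
    total = 0.0
    for j in range(t + 1):
        p = math.comb(t, j) * math.comb(n - t, t - j) / math.comb(n, t)
        total += p * (t - j) / t
    return total

def ndiv_random_montecarlo(t: int, n: int, runs: int = 20_000) -> float:
    """Monte Carlo check: draw random rankings, take their top t, and
    average the fraction of a fixed reference top t that they miss."""
    reference = set(range(t))      # top-t items of the first ranking
    items = list(range(n))
    acc = 0.0
    for _ in range(runs):
        random.shuffle(items)
        acc += len(reference - set(items[:t])) / t
    return acc / runs

if __name__ == "__main__":
    t, n = 10, 40
    print(ndiv_random_analytic(t, n))     # analytic value
    print(ndiv_random_montecarlo(t, n))   # should agree up to sampling noise
```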
In Fig. 2 the different scales have been normalized in the x-axis.
See Table 1 for the nominal acceptance rate for all conferences.
In these numerical experiments we also repeated the simulations for a number of runs (typically 10) in order to collect proper statistics (i.e. mean value and standard deviation) for each experiment.
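As a hedged illustration of this methodological step (a sketch under our own assumptions, not the paper's code; simulate() is a hypothetical stand-in for any one of the numerical experiments), repeating a simulation and collecting the mean and standard deviation can be done as follows:

```python
import random
import statistics

def simulate() -> float:
    """Hypothetical stand-in for a single numerical experiment
    (e.g. one simulated review process); returns a scalar outcome."""
    return random.gauss(0.5, 0.1)

def repeat_experiment(runs: int = 10) -> tuple[float, float]:
    """Repeat the simulation `runs` times and return the mean and the
    standard deviation of the collected outcomes."""
    outcomes = [simulate() for _ in range(runs)]
    return statistics.mean(outcomes), statistics.stdev(outcomes)

if __name__ == "__main__":
    mean, std = repeat_experiment(runs=10)
    print(f"mean = {mean:.3f}, std = {std:.3f}")
```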
Please note that in our reviews dataset the reviewers did not have access to other reviewers’ marks, so they could not have been influenced by previous reviews.
Although Google Scholar has been criticized in the literature (e.g. Jacso 2010), mainly for the noise (spurious documents and citations) that it includes, it is nevertheless one of the few publicly available sources of citations and offers a high degree of coverage.
Old conferences are those that took place in the period from 2003 to 2006 and are therefore “old” enough to check the number of citations received during the subsequent years.
This is the rationale behind journals such as PLoS ONE, among others.
Before the computation, the marks were normalized to the [0,1] scale.
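For concreteness, here is one plausible way to perform such a normalization, assuming a simple linear (min-max) mapping from each conference's assessment scale to [0,1]; the scale bounds in the example are illustrative and not taken from the dataset.

```python
def normalize_mark(mark: float, scale_min: float, scale_max: float) -> float:
    """Linearly map a mark from [scale_min, scale_max] to [0, 1]."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (mark - scale_min) / (scale_max - scale_min)

# Examples: a mark of 2 on a -3..3 scale and a mark of 7 on a 1..10 scale.
print(normalize_mark(2, -3, 3))    # ~0.83
print(normalize_mark(7, 1, 10))    # ~0.67
```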
We recall again that in our work we focus only on the quantitative aspect of peer review (i.e. marks) and not on the other important dimension of providing constructive feedback to authors.
Both in C1 and C3 the cluster with minimal probability was the “immature” cluster.
References
Akst, J. (2010). I hate your paper. The Scientist, 24(8), 36–41.
Barnes, J. (1981). Proof and the syllogism. In E. Berti (Ed.), Aristotle on science: The posterior analytics (pp. 17–59). Padua: Antenore.
Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19, 2–11.
Bartko, J. J. (1974). Corrective note to “the intraclass correlation coefficient as a measure of reliability”. Psychological Reports, 34, 418.
Benos, D. J., Bashari, E., Chaves, J. M., Gaggar, A., et al. (2007). The ups and downs of peer review. Advances in Physiology Education, 31(2), 145–152.
Birman, K., & Schneider, F. (2009). Program committee overload in systems. Communications of the ACM, 52(5), 34–37.
Bollen, J., Van de Sompel, H., Smith, J., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing & Management, 41(6), 1419–1440.
Bornmann, L. (2007). Bias cut: Women, it seems, often get a raw deal in science—So how can discrimination be tackled? Nature, 445, 566.
Bornmann, L., & Daniel, H. D. (2005a). Committee peer review at an international research foundation: Predictive validity and fairness of selection decisions on post-graduate fellowship applications. Research Evaluation, 14(1), 15–20.
Bornmann, L., & Daniel, H. D. (2005b). Selection of research fellowship recipients by committee peer review. Reliability, fairness and predictive validity of board of trustees’ decisions. Scientometrics, 63(2), 297–320.
Bornmann, L., & Daniel, H. D. (2010a). Reliability of reviewers’ ratings when using public peer review: A case study. Learned Publishing, 23(2), 124–131.
Bornmann, L., & Daniel, H. D. (2010b). The validity of staff editors’ initial evaluations of manuscripts: A case study of Angewandte Chemie International Edition. Scientometrics, 85(3), 681–687.
Bornmann, L., Mutz, R., & Daniel, H. D. D. (2008a). How to detect indications of potential sources of bias in peer review: A generalized latent variable modeling approach exemplified by a gender study. Journal of Informetrics, 2(4), 280–287.
Bornmann, L., Wallon, G., & Ledin, A. (2008b). Does the committee peer review select the best applicants for funding? An investigation of the selection process for two European Molecular Biology Organization Programmes. PLoS ONE, 3. doi:10.1371/journal.pone.0003480.
Bornmann, L., Wolf, M., & Daniel, H. D. (2012). Closed versus open reviewing of journal manuscripts: How far do comments differ in language use? Scientometrics, 91(3), 843–856. doi:10.1007/s11192-011-0569-5. http://www.akademiai.com/content/0436287611KJ2063.
Brink, D. (2008). Statistics. Fredriksberg: Ventus Publishing ApS.
Cabanac, G., & Preuss, T. (2013). Capitalizing on order effects in the bids of peer-reviewed conferences to secure reviews by expert referees. Journal of the American Society for Information Science and Technology. doi:10.1002/asi.22747.
Ceci, S., & Williams, W. (2011). Understanding current causes of women’s underrepresentation in science. Proceedings of the National Academy of Sciences, 108(8), 3157–3162.
Ceci, S. J., & Peters, D. P. (1982). Peer review: A study of reliability. Change: The Magazine of Higher Learning, 14(6), 44–48.
Chen, J., & Konstan, J. A. (2010). Conference paper selectivity and impact. Communications of the ACM, 53(6), 79–83. doi:10.1145/1743546.1743569.
Cicchetti, D., & Sparrow, S. (1981). Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 86, 127–137.
Cicchetti, D. V., Lord, C., Koenig, K., Klin, A., & Volkmar, F. R. (2008). Reliability of the autism diagnostic interview: Multiple examiners evaluate a single case. Journal of Autism and Developmental Disorders, 36(4), 764–770.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Davidoff, F., DeAngelis, C., Drazen, J., et al. (2001). Sponsorship, authorship, and accountability. JAMA, 286(10), 1232–1234. doi:10.1001/jama.286.10.1232.
Donner, A. (1986). A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review, 54(1), 67–82.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16(4), 407–424.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Freyne, J., Coyle, L., Smyth, B., & Cunningham, P. (2010). Relative status of journal and conference publications in computer science. Communications of the ACM, 53(11), 124–132. doi:10.1145/1839676.1839701.
Godlee, F., Gale, C. R., & Martyn, C. N. (1998). Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: A randomized controlled trial. JAMA, 280(3), 237–240.
Goodman, S. N., Berlin, J., Fletcher, S. W., & Fletcher, R. H. (1994). Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Annals of Internal Medicine, 121(1), 11–21.
Grudin, J. (2010). Conferences, community, and technology: Avoiding a crisis. In iConference 2010.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. Chichester: Wiley.
Ingelfinger, F. J. (1974). Peer review in biomedical publication. American Journal of Medicine, 56(5), 686–692.
Jacso, P. (2010). Metadata mega mess in Google Scholar. Online Information Review, 34(1), 175–191.
Jefferson, T., Alderson, P., Wager, E., & Davidoff, F. (2002a). Effects of editorial peer review: A systematic review. JAMA, 287(21), 2784–2786.
Jefferson, T., Wager, E., & Davidoff, F. (2002b). Measuring the quality of editorial peer review. JAMA, 287(21), 2786–2790.
Kassirer, J. P., & Campion, E. W. (1994). Peer review: Crude and understudied, but indispensable. JAMA, 272(2), 96–97.
Katz, D. S., Proto, A. V., & Olmsted, W. W. (2002). Incidence and nature of unblinding by authors: Our experience at two radiology journals with double-blinded peer review policies. The American Journal of Roentgenology, 179, 1415–1417.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1–2), 81–93.
Krapivin, M., Marchese, M., & Casati, F. (2010). Exploring and understanding citation-based scientific metrics. Advances in Complex Systems, 13(1), 59–81.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621.
Li, X., Thelwall, M., & Giustini, D. (2012). Validating online reference managers for scholarly impact measurement. Scientometrics 91(2), 461–471. doi:10.1007/s11192-011-0580-x. http://www.akademiai.com/content/35146TH23T1J1284.
Link, A. M. (1998). US and non-US submissions: An analysis of reviewer bias. JAMA, 280(3), 246–247.
Lock, S. (1994). Does editorial peer review work? Annals of Internal Medicine, 121(1), 60–61.
Lokker, C., McKibbon, K. A., McKinlay, R. J., Wilczynski, N. L., & Haynes, R. B. (2008). Prediction of citation counts for clinical articles at two years using data available within three weeks of publication: Retrospective cohort study. British Medical Journal, 336(7645), 655–657.
Madden, S., & DeWitt, D. (2006). Impact of double-blind reviewing on SIGMOD publication rates. ACM SIGMOD Record, 35(2), 29–32.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
Montgomery, A., Graham, A., Evans, P., & Fahey, T. (2002). Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. BMC Health Services Research, 2(1), 8.
Ragone, A., Mirylenka, K., Casati, F., & Marchese, M. (2011). A quantitative analysis of peer review. In E. Noyons & P. Ngulube (Eds.), Proceedings of ISSI 2011, the 13th International Conference on Scientometrics and Informetrics, Durban, South Africa, July 4–7 (pp. 724–746).
Reinhart, M. (2009). Peer review of grant applications in biology and medicine. Reliability, fairness, and validity. Scientometrics, 81(3), 789–809.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Smith, R. (2006). Peer review: A flawed process at the heart of science and journals. JRSM, 99(4), 178.
Spier, R. (2002). The history of the peer-review process. Trends in Biotechnology, 20(8), 357–358.
Tung, A. K. H. (2006). Impact of double blind reviewing on SIGMOD publication: A more detail analysis. SIGMOD Record, 35(3), 6–7.
van Rooyen, S., Godlee, F., Evans, S., Black, N., & Smith, R. (1999). Effect of open peer review on quality of reviews and on reviewers’ recommendations: A randomised trial. British Medical Journal, 318, 23–27.
Walsh, E., Rooney, M., Appleby, L., & Wilkinson, G. (2000). Open peer review: A randomised controlled trial. The British Journal of Psychiatry, 176, 47–51.
Welch, B. L. (1947). The generalization of Student’s problem when several different population variances are involved. Biometrika, 34(1/2), 28–35.
Wenneras, C., & Wold, A. (1997). Nepotism and sexism in peer-review. Nature, 387, 341–343.
Zuckerman, H., & Merton, R. (1971). Patterns of evaluation in science: Institutionalisation, structure and functions of the referee system. Minerva, 9, 66–100. doi:10.1007/BF01553188.
Acknowledgements
This paper is an extended version of the 12-page paper titled “A Quantitative Analysis of Peer Review” presented at the 13th Conference of the International Society for Scientometrics and Informetrics, Durban (South Africa), 4–7 July 2011 (Ragone et al. 2011). We acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission for the LIQUIDPUB project, under FET-Open grant number 213360. We also want to thank the anonymous reviewers of our manuscript. Their comments have greatly helped us to improve our work, underlining something that we already knew (and mention in our work): peer review is not only about filtering and selecting manuscripts for publication but also about providing constructive feedback to authors.
Cite this article
Ragone, A., Mirylenka, K., Casati, F. et al. On peer review in computer science: analysis of its effectiveness and suggestions for improvement. Scientometrics 97, 317–356 (2013). https://doi.org/10.1007/s11192-013-1002-z