On peer review in computer science: analysis of its effectiveness and suggestions for improvement

Abstract

In this paper we focus on the analysis of peer reviews and reviewers’ behaviour in a number of different review processes. More specifically, we report on the development, definition and rationale of a theoretical model for peer review processes that supports the identification of appropriate metrics for assessing the processes’ main characteristics, with the aim of rendering peer review more transparent and understandable. Together with known metrics and techniques, we introduce new ones to assess the overall quality (i.e., reliability, fairness, validity) and efficiency of peer review processes, e.g. the robustness of the process, the degree of agreement/disagreement among reviewers, or positive/negative bias in the reviewers’ decision-making process. We also check the ability of peer review to assess the impact of papers in subsequent years. We apply the proposed model and analysis framework to a large review data set from ten different conferences in computer science, for a total of ca. 9,000 reviews on ca. 2,800 submitted contributions. We discuss the implications of the results and their potential use toward improving the analysed peer review processes. A number of interesting results were found, in particular: (1) a low correlation between peer review outcome and the impact over time of the accepted contributions; (2) the influence of the assessment scale on the way reviewers gave marks; (3) the effect and impact of rating bias, i.e. reviewers who consistently give lower/higher marks w.r.t. all other reviewers; (4) the effectiveness of statistical approaches to optimize some process parameters (e.g., number of papers per reviewer) to improve the overall quality of the process while keeping the overall effort under control. Based on the lessons learned, we suggest ways to improve the overall quality of peer review through procedures that can be easily implemented in current editorial management systems.

Notes

  1. “With the authors’ consent, a paper already peer reviewed and accepted for publication by BMJ was altered to introduce eight weaknesses in design, analysis, or interpretation” (Godlee et al. 1998).

  2. Notice that this pragmatic choice does not imply that the authors believe blindly in citation count as the only measure of impact. Indeed, prior work has shown that it has some flaws (Krapivin et al. 2010) and that it could be extended with other novel metrics such as the number of downloads (Li et al. 2012) or other alternative metrics (Bollen et al. 2005). However, we adopt it because it is a commonly accepted and accessible metric.

  3. When the second ranking is random, the formula for the divergence can be expressed analytically as \(NDiv_{\rho_i,\rho_{\rm a}}(t,n,\mathcal{C}) = \sum_{i=0}^{t} p_{t}(i,n)\, w_{i},\) where \(p_{t}(i,n) = \frac{C_{i}^{t}\, C_{t-i}^{n-t}}{C_{t}^{n}}\) and \(w_{i} = \frac{t-i}{t}.\)

  4. In Fig. 2 the different scales have been normalized on the x-axis.

  5. See Table 1 for the nominal acceptance rate for all conferences.

  6. In these numerical experiments, too, we repeated the simulations for a number of runs (typically 10) to collect proper statistical data (i.e. mean value and standard deviation) for each experiment.

  7. Please note that in our review dataset the reviewers did not have access to other reviewers’ marks, so they could not have been influenced by previous reviews.

  8. Although Google Scholar has been criticized in the literature (e.g. Jacso 2010), mainly for the noise (spurious documents and citations) that it includes, it is one of the few publicly available sources of citations and it has a high degree of coverage.

  9. Old conferences are those that took place in the period from 2003 to 2006, and are therefore “old” enough for checking the number of citations received during the subsequent years.

  10. This is the rationale behind some journals, such as PLoS ONE.

  11. The marks were normalized to the scale [0,1] before the computation.

  12. We recall again that in our work we focus only on the quantitative aspect of peer review (i.e. marks) and not on the other important dimension of providing constructive feedback to authors.

  13. Both in C1 and C3 the cluster with minimal probability was the “immature” cluster.
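
The analytic expression in note 3 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation. It assumes that \(C_{i}^{t}\) denotes the binomial coefficient "t choose i" (the note does not define it explicitly); the function name `random_divergence` is hypothetical. Under this reading \(p_{t}(i,n)\) is a hypergeometric probability, so the sum should reduce to \(1 - t/n\), which the usage lines below compare against.

```python
from math import comb


def random_divergence(t: int, n: int) -> float:
    """Expected normalized divergence at the top t of n items
    when the second ranking is random (formula of note 3).

    Assumption: C_i^t in the note is read as comb(t, i).
    """
    if not 0 < t <= n:
        raise ValueError("require 0 < t <= n")
    total = 0.0
    for i in range(t + 1):
        # p_t(i, n): probability that exactly i of the top-t items
        # of one ranking also appear in the top t of a random ranking.
        p = comb(t, i) * comb(n - t, t - i) / comb(n, t)
        # w_i: fraction of the top-t items that do NOT coincide.
        w = (t - i) / t
        total += p * w
    return total


# Compare the sum with the closed form 1 - t/n implied by the
# hypergeometric mean (an observation added here, not from the paper).
for t, n in [(1, 4), (10, 100), (5, 5)]:
    print(t, n, random_divergence(t, n), 1 - t / n)
```

Read this way, a random second ranking diverges almost completely from the first when only a small fraction of the items is considered (t much smaller than n), and the divergence vanishes when t = n.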

References

  1. Akst, J. (2010). I hate your paper. The Scientist, 24(8), 36–41.

  2. Barnes, J. (1981). Proof and the syllogism. In E. Berti (Ed.), Aristotle on science: The posterior analytics (pp. 17–59). Padua: Antenore.

  3. Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19, 2–11.

  4. Bartko, J. J. (1974). Corrective note to “the intraclass correlation coefficient as a measure of reliability”. Psychological Reports, 34, 418.

  5. Benos, D. J., Bashari, E., Chaves, J. M., Gaggar, A., et al. (2007). The ups and downs of peer review. Advances in Physiology Education, 31(2), 145–152.

  6. Birman, K., & Schneider, F. (2009). Program committee overload in systems. Communications of the ACM, 52(5), 34–37.

  7. Bollen, J., Van de Sompel, H., Smith, J., & Luce, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing & Management, 41(6), 1419–1440.

  8. Bornmann, L. (2007). Bias cut: Women, it seems, often get a raw deal in science—So how can discrimination be tackled?. Nature, 445, 566.

  9. Bornmann, L., & Daniel, H. D. (2005a). Committee peer review at an international research foundation: Predictive validity and fairness of selection decisions on post-graduate fellowship applications. Research Evaluation, 14(1), 15–20.

  10. Bornmann, L., & Daniel, H. D. (2005b). Selection of research fellowship recipients by committee peer review. Reliability, fairness and predictive validity of board of trustees’ decisions. Scientometrics, 63(2), 297–320.

  11. Bornmann, L., & Daniel, H. D. (2010a). Reliability of reviewers’ ratings when using public peer review: A case study. Learned Publishing, 23(2), 124–131.

  12. Bornmann, L., & Daniel, H. D. (2010b). The validity of staff editors’ initial evaluations of manuscripts: A case study of Angewandte Chemie International Edition. Scientometrics, 85(3), 681–687.

  13. Bornmann, L., Mutz, R., & Daniel, H. D. (2008a). How to detect indications of potential sources of bias in peer review: A generalized latent variable modeling approach exemplified by a gender study. Journal of Informetrics, 2(4), 280–287.

  14. Bornmann, L., Wallon, G., & Ledin, A. (2008b). Does the committee peer review select the best applicants for funding? An investigation of the selection process for two European Molecular Biology Organization Programmes. PLoS ONE, 3. doi:10.1371/journal.pone.0003480.

  15. Bornmann, L., Wolf, M., & Daniel, H. D. (2012). Closed versus open reviewing of journal manuscripts: How far do comments differ in language use? Scientometrics, 91(3), 843–856. doi:10.1007/s11192-011-0569-5. http://www.akademiai.com/content/0436287611KJ2063.

  16. Brink, D. (2008). Statistics. Frederiksberg: Ventus Publishing ApS.

  17. Cabanac, G., & Preuss, T. (2013). Capitalizing on order effects in the bids of peer-reviewed conferences to secure reviews by expert referees. Journal of the American Society for Information Science and Technology. doi:10.1002/asi.22747.

  18. Ceci, S., & Williams, W. (2011). Understanding current causes of women’s underrepresentation in science. Proceedings of the National Academy of Sciences, 108(8), 3157–3162.

  19. Ceci, S. J., & Peters, D. P. (1982). Peer review: A study of reliability. Change, 14(6), 44–48.

  20. Chen, J., & Konstan, J. A. (2010). Conference paper selectivity and impact. Communications of the ACM, 53(6), 79–83. doi:10.1145/1743546.1743569.

  21. Cicchetti, D., & Sparrow, S. (1981). Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 86, 127–137.

  22. Cicchetti, D. V., Lord, C., Koenig, K., Klin, A., & Volkmar, F. R. (2008). Reliability of the autism diagnostic interview: Multiple examiners evaluate a single case. Journal of Autism and Developmental Disorders, 36(4), 764–770.

  23. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  24. Davidoff, F., DeAngelis, C., Drazen, J., et al. (2001). Sponsorship, authorship, and accountability. JAMA, 286(10), 1232–1234. doi:10.1001/jama.286.10.1232.

  25. Donner, A. (1986). A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review, 54(1), 67–82.

  26. Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16(4), 407–424.

  27. Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.

  28. Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.

  29. Freyne, J., Coyle, L., Smyth, B., & Cunningham, P. (2010). Relative status of journal and conference publications in computer science. Communications of the ACM, 53(11), 124–132. doi:10.1145/1839676.1839701.

  30. Godlee, F., Gale, C. R., & Martyn, C. N. (1998). Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: A randomized controlled trial. JAMA, 280(3), 237–240.

  31. Goodman, S. N., Berlin, J., Fletcher, S. W., & Fletcher, R. H. (1994). Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Annals of Internal Medicine, 121(1), 11–21.

  32. Grudin, J. (2010). Conferences, community, and technology: Avoiding a crisis. In iConference 2010.

  33. Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. Chichester: Wiley.

  34. Ingelfinger, F. J. (1974). Peer review in biomedical publication. American Journal of Medicine, 56(5), 686–692.

  35. Jacso, P. (2010). Metadata mega mess in Google Scholar. Online Information Review, 34(1), 175–191.

  36. Jefferson, T., Alderson, P., Wager, E., & Davidoff, F. (2002a). Effects of editorial peer review: A systematic review. JAMA, 287(21), 2784–2786.

  37. Jefferson, T., Wager, E., & Davidoff, F. (2002b). Measuring the quality of editorial peer review. JAMA, 287(21), 2786–2790.

  38. Kassirer, J. P., & Campion, E. W. (1994). Peer review: Crude and understudied, but indispensable. Journal of the American Medical Association, 272(2), 96–97.

  39. Katz, D. S., Proto, A. V., & Olmsted, W. W. (2002). Incidence and nature of unblinding by authors: Our experience at two radiology journals with double-blinded peer review policies. The American Journal of Roentgenology, 179, 1415–1417.

  40. Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1–2), 81–93.

  41. Krapivin, M., Marchese, M., & Casati, F. (2010). Exploring and understanding citation-based scientific metrics. Advances in Complex Systems, 13(1), 59–81.

  42. Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621.

  43. Li, X., Thelwall, M., & Giustini, D. (2012). Validating online reference managers for scholarly impact measurement. Scientometrics, 91(2), 461–471. doi:10.1007/s11192-011-0580-x. http://www.akademiai.com/content/35146TH23T1J1284.

  44. Link, A. M. (1998). US and non-US submissions: An analysis of reviewer bias. JAMA, 280(3), 246–247.

  45. Lock, S. (1994). Does editorial peer review work? Annals of Internal Medicine, 121(1), 60–61.

  46. Lokker, C., McKibbon, K. A., McKinlay, R. J., Wilczynski, N. L., & Haynes, R. B. (2008). Prediction of citation counts for clinical articles at two years using data available within three weeks of publication: Retrospective cohort study. British Medical Journal, 336(7645), 655–657.

  47. Madden, S., & DeWitt, D. (2006). Impact of double-blind reviewing on SIGMOD publication rates. ACM SIGMOD Record, 35(2), 29–32.

  48. McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.

  49. Montgomery, A., Graham, A., Evans, P., & Fahey, T. (2002). Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. BMC Health Services Research, 2(1), 8.

  50. Ragone, A., Mirylenka, K., Casati, F., & Marchese, M. (2011). A quantitative analysis of peer review. In E. Noyons & P. Ngulube (Eds.), Proceedings of ISSI 2011: The 13th International Conference on Scientometrics and Informetrics, Durban, South Africa, July 4–7, pp. 724–746.

  51. Reinhart, M. (2009). Peer review of grant applications in biology and medicine. Reliability, fairness, and validity. Scientometrics, 81(3), 789–809.

  52. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

  53. Smith, R. (2006). Peer review: A flawed process at the heart of science and journals. JRSM, 99(4), 178.

  54. Spier, R. (2002). The history of the peer-review process. Trends in Biotechnology, 20(8), 357–358.

  55. Tung, A. K. H. (2006). Impact of double blind reviewing on SIGMOD publication: A more detail analysis. SIGMOD Record, 35(3), 6–7.

  56. van Rooyen, S., Godlee, F., Evans, S., Black, N., & Smith, R. (1999). Effect of open peer review on quality of reviews and on reviewers’ recommendations: A randomised trial. British Medical Journal, 318, 23–27.

  57. Walsh, E., Rooney, M., Appleby, L., & Wilkinson, G. (2000). Open peer review: A randomised controlled trial. The British Journal of Psychiatry, 176, 47–51.

  58. Welch, B. L. (1947). The generalization of “Student’s” problem when several different population variances are involved. Biometrika, 34(1/2), 28–35.

  59. Wenneras, C., & Wold, A. (1997). Nepotism and sexism in peer-review. Nature, 387, 341–343.

  60. Zuckerman, H., & Merton, R. (1971). Patterns of evaluation in science: Institutionalisation, structure and functions of the referee system. Minerva, 9, 66–100. doi:10.1007/BF01553188.

Acknowledgements

This paper is an extended version of the 12-page paper titled “A Quantitative Analysis of Peer Review” presented at the 13th Conference of the International Society for Scientometrics and Informetrics, Durban (South Africa), 4–7 July 2011 (Ragone et al. 2011). We acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission for the LIQUIDPUB project under FET-Open grant number 213360. We also want to acknowledge the anonymous reviewers of our manuscript. Their comments have really helped us to improve our work, underlining something that we already knew (and mention in our work): peer review is focused not only on filtering and selecting manuscripts to publish but also on providing constructive feedback to authors.

Author information

Corresponding author

Correspondence to Maurizio Marchese.

About this article

Cite this article

Ragone, A., Mirylenka, K., Casati, F. et al. On peer review in computer science: analysis of its effectiveness and suggestions for improvement. Scientometrics 97, 317–356 (2013). https://doi.org/10.1007/s11192-013-1002-z

Keywords

  • Peer review
  • Quality metrics
  • Reliability
  • Fairness
  • Validity
  • Efficiency

Mathematics Subject Classification (2000)

  • 62-07
  • 62P25
  • 91C99