
Do they agree? Bibliometric evaluation versus informed peer review in the Italian research assessment exercise


During the Italian research assessment exercise, the national agency ANVUR performed an experiment to assess agreement between grades attributed to journal articles by informed peer review (IR) and by bibliometrics. A sample of articles was evaluated by using both methods and agreement was analyzed by weighted Cohen’s kappas. ANVUR presented results as indicating an overall “good” or “more than adequate” agreement. This paper re-examines the experiment results according to the available statistical guidelines for interpreting kappa values, by showing that the degree of agreement (always in the range 0.09–0.42) has to be interpreted, for all research fields, as unacceptable, poor or, in a few cases, as, at most, fair. The only notable exception, confirmed also by a statistical meta-analysis, was a moderate agreement for economics and statistics (Area 13) and its sub-fields. We show that the experiment protocol adopted in Area 13 was substantially modified with respect to all the other research fields, to the point that results for economics and statistics have to be considered as fatally flawed. The evidence of a poor agreement supports the conclusion that IR and bibliometrics do not produce similar results, and that the adoption of both methods in the Italian research assessment possibly introduced systematic and unknown biases in its final results. The conclusion reached by ANVUR must be reversed: the available evidence does not justify at all the joint use of IR and bibliometrics within the same research assessment exercise.
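The analysis throughout rests on weighted Cohen's kappa and on verbal scales such as Landis and Koch's for reading its values. As a rough illustration of how such an agreement figure is computed and labeled, here is a minimal, self-contained Python sketch; the grades and the two raters below are hypothetical, not data from the VQR experiment.

```python
from collections import Counter

def weighted_kappa(r1, r2, categories, weight="linear"):
    """Cohen's weighted kappa for two raters over ordered categories."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)

    def w(i, j):
        # Disagreement weight: linear |i - j| / (k - 1), or its square.
        d = abs(i - j) / (k - 1)
        return d if weight == "linear" else d * d

    obs = Counter(zip(r1, r2))          # observed joint frequencies
    p1, p2 = Counter(r1), Counter(r2)   # marginal frequencies per rater
    observed = sum(w(idx[a], idx[b]) * c for (a, b), c in obs.items()) / n
    expected = sum(w(idx[a], idx[b]) * p1[a] * p2[b]
                   for a in categories for b in categories) / n ** 2
    return 1 - observed / expected

def landis_koch(kappa):
    """Verbal bands of Landis and Koch (1977); kappa <= 0 treated as poor."""
    for upper, label in [(0.0, "poor"), (0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.01, "almost perfect")]:
        if kappa <= upper:
            return label

# Hypothetical merit-class grades (A best to D worst) from two raters,
# e.g. informed peer review versus a bibliometric rule.
peer   = ["A", "B", "B", "C", "D", "A", "C", "B"]
biblio = ["B", "B", "C", "C", "D", "A", "D", "A"]
kw = weighted_kappa(peer, biblio, ["A", "B", "C", "D"])
print(round(kw, 2), landis_koch(kw))  # 0.58 moderate
```

With linear weights, adjacent-grade disagreements (A vs. B) are penalized less than distant ones (A vs. D), which is why weighted kappa, rather than simple kappa, is appropriate for ordered merit classes.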




  1.

    Data were requested from the President of ANVUR by an email sent on 10 February 2014. We have not yet received a reply.

  2.

    The Final Report and all the Area Reports are in Italian. Quotations from these documents are translated by the authors. Appendix A of the Area 13 Report is in English.

  3.

    The 14 areas are: Mathematics and informatics (Area 1); Physics (Area 2); Chemistry (Area 3); Earth Sciences (Area 4); Biology (Area 5); Medicine (Area 6); Agricultural and Veterinary Sciences (Area 7); Civil Engineering and Architecture (Area 8); Industrial and Information Engineering (Area 9); Antiquities, Philology, Literary studies, Art History (Area 10); History, Philosophy, Pedagogy and Psychology (Area 11); Law (Area 12); Economics and Statistics (Area 13); Political and Social sciences (Area 14).

  4.

    Minister of Education, decree n. 17, 2011/07/15.

  5.

    The resulting university and department rankings were disseminated through a booklet (AA.VV. 2013).

  6.

    The complete list of the Italian scientific classification is available with the official English translation:

  7.

    Minister of Education, decree n. 17, 2011/07/15, art. 8 comma 4.

  8.

    This scale of values appears to be operationally meaningless (Baccini 2016); similar criticisms have also been leveled at the British Research Assessment (McNay 2011).

  9.

    Area 9 used quartiles instead of the percentiles of the VQR distribution rule for defining the journal segments.

  10.

    The procedure is described in the Area 13 report and reproduced in Bertocchi et al. (2015).

  11.

    ANVUR published estimates of the bias in the Appendix A of the Final Report.

  12.

    At the same time, ANVUR issued a press release comparing the average scores of the 14 areas, which was extensively used by newspapers in their coverage of the VQR.

  13.

  14.

    The exposition is based on Appendice B, par. B.1 of the Final Report.

  15.

    In Areas 1, 3, 5, 6, 7 and 8, the sub-areas corresponded to the sub-GEVs in which the panel organized the evaluation. In Area 2 (physics), the seven sub-areas were defined directly by the GEV by modifying the Web of Science classification; Area 4 was partitioned into four sub-areas corresponding to an administrative classification called “settori concorsuali”, a classification adopted by the Italian government for the recruitment of professors; Area 9 was partitioned into nine macro-fields defined directly by the GEV; finally, Area 13 adopted a fourfold classification developed directly by the GEV. These pieces of information are drawn from the Area reports, but they are not properly disclosed either in the Final Report or in Cicero et al. (2013).

  16.

    It is worth noting that many articles were dropped from the experiment because the bibliometric evaluation resulted in an inconclusive “IR” value. This induced a distortion in the sample that ANVUR did not consider in its analysis.

  17.

    ANVUR stressed the importance of significance tests for Cohen’s kappa, improperly interpreting statistical significance as agreement. In the section of an article reproducing the results of the experiment, the statistical significance of the kappa values was even mistaken for agreement: “kappa is always statistically different from zero, showing that there is a fundamental agreement” (Ancaiani et al. 2015).

  18.

    For the whole sample and VQR-weights, a value of 0.3441 is reported in (Ancaiani et al. 2015).

  19.

    For Area 13 and VQR-weights, a value of 0.6104 is reported in the Appendix B of the ANVUR Final Report and also in Cicero et al. (2013) and Ancaiani et al. (2015). The value of 0.61 appears inconsistent when the other kappas calculated for the sub-areas of Area 13, reported in Table 2, are considered. The value of 0.54 that we used in this paper is drawn directly from the Area 13 report and reproduced also in Bertocchi et al. (2015).

  20.

    In the conclusion of the Area 9 report, that phrase is followed by the contradictory statement that “the degree of concordance between peer evaluations and bibliometric evaluations is moderate (in Italian: ‘moderato’) in nearly all the sub-areas, while it is rather high (in Italian: ‘piuttosto elevato’) for informatic engineering” (ANVUR 2013).

  21.

    Bertocchi et al. (2015) introduced references to the relevant literature that were not present in the working-paper versions of the paper. They wrote: “Since the most common scales to subjectively assess the value of kappa mention “adequate” and “fair to good”, these are the terms we use in the paper.” In fact, the term “adequate” is not used in the relevant literature; it is adopted only by ANVUR in its reports.

  22.

    This point is clearly stated in Appendix B of the Area Report: “The sample selection shall take account of any specific request for peer review reported via the CINECA electronic form for highly specialized and multidisciplinary products” (p. 64). This information is reported neither in Bertocchi et al. (2015) nor in Bertocchi et al. (2013a, b, c, d, e) or Ancaiani et al. (2015).

  23.

    As seen above, in Areas 1 to 9 the use of matrices may not give a definite result, because of a discordance between the number of citations and the impact factor. As a consequence, in these areas many journal articles were evaluated by IR. Referees in Areas 1-9 did not know whether an article was sent to them because of an uncertain bibliometric evaluation or because it was part of the random sample for the experiment. Indeed, in Areas 1-9, recognizing that a research product belonged to the experiment sample required comparing its citations with thresholds whose official values were never published by ANVUR. In Area 13, instead, no journal articles were evaluated through IR except those in the experiment. So in Area 13, a referee asked to review a journal article could immediately know that he or she was taking part in the experiment.

  24.

    This information was not disclosed in the ANVUR’s Final Report.

  25.

    If the opinions of the two referees coincided, the final evaluation was promptly defined. If they diverged, a complex process started. This process is described in Appendix A of the Area 13 Report: “The opinion of the external referees was then summarized by the internal consensus group: in case of disagreement between [the two referee’s reports], the [final score] is not simply the average of [the referee’s scores], but also reflects the opinion of two (and occasionally three) members of the GEV13 (as described in detail in the documents devoted to the peer review process)” (ANVUR 2013). The work of the consensus groups is described as follows: “The Consensus Groups will give an overall evaluation of the research product by using the informed peer review method, by considering the evaluation of the two external referees, the available indicators for quality and relevance of the research product, and the Consensus Group competences.” (ANVUR 2013). In some cases the consensus groups also evaluated the competences of the two referees, and gave “more importance to the most expert referee in the research field” (Area Report, p. 15; translation from Italian by the authors). It is worth noting that in the overall Italian research assessment exercise the notion of “informed peer review” covers at least two very different processes (ANVUR 2013). The first refers to the evaluation made by external referees who know the metadata and bibliometric indicators of the refereed articles. The second consists in the evaluation made by a consensus group, that is, by two or more members of a GEV, who used not only the informed peer reviews produced by the two referees, but also the bibliometric indicators available for the article and a personal judgement about the competences of the referees (ANVUR 2013).

  26.

    It is worth recalling that the Area 13 panelists had decided the journal ranking used for the bibliometric evaluation of the articles submitted to the research assessment; the same panelists decided the final score of the IR process. These modifications of the protocol, specific to the experiment performed in Area 13, may have introduced a substantial bias toward agreement between bibliometrics and IR. Indeed, the relatively high kappas calculated for Area 13 can be interpreted as indicating a “fair to good” agreement between the evaluation based on the journal ranking developed by the panelists and the IR performed by the panelists on the basis of the referee reports.

  27.

    Economic history is a short label for economic history and history of economic thought.

  28.

    As a rule, evaluations for each sub-area were conducted by a specific sub-area panel. Journal articles classified as pertaining to “economic history”, however, were evaluated by the panelists of the “economics” sub-area. It is worth noting that in the experiment the best agreement between IR and bibliometrics was reached exactly in the sub-area (economics) where a subset of observations with lower agreement, those of economic history, was treated separately from the others.

  29.

    In the Area 13 panel, only one panelist is enrolled as a full professor of economic history, and only one can properly be considered an expert in the history of economic thought.


  1. AA.VV. (2013). I voti all’università. La Valutazione della qualità della ricerca in Italia. Milano: Corriere della Sera.

  2. Abramo, G., & D’Angelo, C. A. (2015). The VQR, Italy’s second national research assessment: Methodological failures and ranking distortions. Journal of the Association for Information Science and Technology. doi:10.1002/asi.23323.


  3. Aksnes, D. W., & Taxt, R. E. (2004). Peer reviews and bibliometric indicators: A comparative study at a Norwegian University. Research Evaluation, 13, 33–41. doi:10.3152/147154404781776563.


  4. Allen, L., Jones, C., Dolby, K., Lynn, D., & Walport, M. (2009). Looking for landmarks: The role of expert review and bibliometric analysis in evaluating scientific publication outputs. PLoS ONE, 4(6), e5910. doi:10.1371/journal.pone.0005910.


  5. Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.


  6. Ancaiani, A., Anfossi, A. F., Barbara, A., Benedetto, S., Blasi, B., Carletti, V., et al. (2015). Evaluating scientific research in Italy: The 2004–10 research evaluation exercise. Research Evaluation, 24(3), 242–255. doi:10.1093/reseval/rvv008.


  7. ANVUR. (2013). Rapporto finale. Valutazione della qualità della ricerca 2004-2010 (VQR 2004-2010). Roma.

  8. Baccini, A. (2014a). La VQR di Area 13: una riflessione di sintesi. Statistica & Società, 3(3), 32–37.


  9. Baccini, A. (2014b). Lo strano caso delle concordanze della VQR.

  10. Baccini, A. (2016). Napoléon et l’évaluation bibliométrique de la recherche. Considérations sur la réforme de l’université et sur l’action de l’agence national d’évaluation en Italie. Canadian Journal of Information and Library Science-Revue Canadienne des Sciences de l’Information et de Bibliotheconomie.

  11. Berghmans, T., Meert, A. P., Mascaux, C., Paesmans, M., Lafitte, J. J., & Sculier, J. P. (2003). Citation indexes do not reflect methodological quality in lung cancer randomised trials. Annals of Oncology, 14(5), 715–721. doi:10.1093/annonc/mdg203.


  12. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2013a). Bibliometric evaluation vs. informed peer review: Evidence from Italy. Department of Economics DEMB. University of Modena and Reggio Emilia, Department of Economics Marco Biagi.

  13. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2013b). Bibliometric evaluation vs. informed peer review: Evidence from Italy. ReCent WP. Center for Economic Research, University of Modena and Reggio Emilia, Dept. of Economics Marco Biagi.

  14. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2013c). Bibliometric evaluation vs. informed peer review: Evidence from Italy. IZA Discussion paper. Institute for the Study of Labour (IZA), Bonn.

  15. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2013d). Bibliometric evaluation vs. informed peer review: Evidence from Italy. CEPR Discussion papers.

  16. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2013e). Bibliometric evaluation vs. informed peer review: Evidence from Italy. CSEF working papers. Naples: Centre for Studies in Economics and Finance (CSEF).

  17. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2014). Assessing Italian research quality: A comparison between bibliometric evaluation and informed peer review. VoxEU, CEPR’s Policy Portal. CEPR (Centre for Economic Policy Research).

  18. Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi, F. (2015). Bibliometric evaluation vs. informed peer review: Evidence from Italy. Research Policy, 44(2), 451–466. doi:10.1016/j.respol.2014.08.004.


  19. Cicero, T., Malgarini, M., Nappi, C. A., & Peracchi, F. (2013). Bibliometric and peer review methods for research evaluation: a methodological appraisement (in Italian). MPRA (Munich Personal REPEc Archive). Munich.

  20. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104.


  21. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. doi:10.1037/h0026256.


  22. De Nicolao, G. (2014). VQR da buttare? Persino ANVUR cestina i voti usati per l’assegnazione FFO 2013.

  23. Fleiss, J. L., Levin, B., & Myunghee, C. P. (2003). Statistical methods for rates and proportions. Hoboken, NJ: Wiley.


  24. George, D., & Mallery, P. (2003). SPSS for windows step by step: A simple guide and reference (4th ed.). Boston: Allys & Bacon.


  25. HEFCE. (2015). The metric tide: Correlation analysis of REF2014 scores and metrics (Supplementary Report II to the Independent Review of the Role of Metrics in Research Assessment and Management).

  26. Koenig, M. E. D. (1983). Bibliometric indicators versus expert opinion in assessing research performance. Journal of the American Society for Information Science, 34, 136–145.


  27. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.


  28. Lee, F. S. (2007). The research assessment exercise, the state and the dominance of mainstream economics in British universities. Cambridge Journal of Economics, 31(2), 309–325.


  29. Lovegrove, B. G., & Johnson, S. D. (2008). Assessment of research performance in biology: How well do peer review and bibliometry correlate? BioScience, 58(2), 160–164. doi:10.1641/B580210.


  30. McNay, I. (2011). Research assessment: Work in progress, or ‘la lutta continua’. In M. Saunders, P. Trowler, & V. Bamber (Eds.), Reconceptualising evaluation in higher education: The practice turn (pp. 51–57). New York: McGraw Hill.


  31. Mryglod, O., Kenna, R., Holovatch, Y., & Berche, B. (2015). Predicting results of the research excellence framework using departmental h-index. Scientometrics, 102(3), 2165–2180. doi:10.1007/s11192-014-1512-3.


  32. RAE. (2005). RAE 2008. Guidance to panels. London: HEFCE.

  33. Rinia, E. J., van Leeuwen, T. N., van Vuren, H. G., & van Raan, A. F. J. (1998). Comparative analysis of a set of bibliometric indicators and central peer review criteria: Evaluation of condensed matter physics in the Netherlands. Research Policy, 27(1), 95–107.


  34. Sheskin, D. J. (2003). Handbook of parametric and nonparametric statistical procedures. London: Chapman & Hall.


  35. Spiegelhalter, D. J. (2005). Funnel plots for comparing institutional performance. Statistics in Medicine, 24(8), 1185–1202. doi:10.1002/sim.1970.


  36. Stemler, S. E., & Tsai, J. (2008). Best practices in interrater reliability: Three common approaches. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 29–49). Thousand Oaks: Sage.


  37. Sun, S. (2011). Meta-analysis of Cohen’s kappa. Health Services and Outcomes Research Methodology, 11(3–4), 145–163. doi:10.1007/s10742-011-0077-3.


  38. van Raan, A. F. J. (2006). Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups. Scientometrics, 67(3), 491–502. doi:10.1556/Scient.67.2006.3.10.


  39. Wouters, P., Thelwall, M., Kousha, K., Waltman, L., de Rijcke, S., Rushforth, A., et al. (2015). The metric tide: Literature review (Supplementary Report I to the Independent Review of the Role of Metrics in Research Assessment and Management). HEFCE.


Author information



Corresponding author

Correspondence to Alberto Baccini.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 587 kb)



In Area 13, 590 journal articles were selected for the experiment. Each article was assigned to two GEV members responsible for the IR process. Each of the two GEV members chose an external referee. After receiving the referees’ scores, the two GEV members met in a Consensus Group in charge of defining the final IR evaluation of the paper. If we suppose (Hypothesis 1) that none of the articles was evaluated at first sight as D (limited) by both GEV members (in that case, an article was not submitted to the IR process), then the 590 articles can be treated as if they had been evaluated by two referees. Suppose now (Hypothesis 2) that the Consensus Group formed by the two GEV members never modified a concordant judgement, i.e. a judgement on which both referees agreed. We know (ANVUR 2013) that the referees gave concordant judgements for 264 articles. This means that the evaluation of at least 326 articles, that is 55.3 % of the total, was decided by Consensus Groups composed of GEV members. Table 4 summarizes specific estimates for the merit classes.
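The arithmetic behind this lower bound is simple enough to check directly; the short Python sketch below merely reproduces the figures quoted above (590 articles in the experiment, 264 concordant referee pairs).

```python
# Lower bound, under Hypotheses 1 and 2, on the number of Area 13 articles
# whose final IR score was decided by a GEV Consensus Group rather than by
# two concordant referees. Figures are those quoted in the text.
total_articles = 590        # articles in the Area 13 experiment
concordant_referees = 264   # articles with two concordant referee reports

min_consensus_decided = total_articles - concordant_referees
share = 100 * min_consensus_decided / total_articles
print(min_consensus_decided, round(share, 1))  # 326 55.3
```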

Table 4 Distribution of articles per merit classes and estimate of the number of articles evaluated directly by the Area 13 panelists

The columns report the merit classes. Rows 1 and 2 contain, respectively, the distribution of articles per merit class as judged by bibliometrics and by IR. Row 3 shows the number of articles for which the two referees expressed a concordant evaluation. Row 4 reports the number of articles for which IR agreed with the bibliometric evaluation. Row 5 contains the estimate, under Hypotheses 1 and 2, of the minimum number of articles whose final IR evaluation was decided by the GEV Consensus Groups. We can see that 54.3 % of the articles classified as A through IR were evaluated directly by the GEV Consensus Groups; this percentage rises to 58 % for articles classified as B, and at least 83.7 % of the articles classified as C were decided by Consensus Groups. The percentage decreases to 31.6 % for products evaluated as D.

It is worth noting that Hypotheses 1 and 2 tend to lower the estimate of the number of articles directly evaluated by the Consensus Groups. In particular, this estimate assumes that in Area 13 the two GEV members never agreed to evaluate an article directly as D, in which case it was scored D without being sent out for peer review. Consider that in the other areas the percentage of articles that received a concordant D score from two referees is 21.1 % (705/3441). In Area 13 this percentage is more than doubled: 44.3 % (117/264). It is impossible to establish, from publicly available data, how much this result is due to a higher level of agreement between the two referees or to an initial agreement of the two GEV members, both evaluating an article as D. On this basis, the figure of 54 is likely an underestimate of the number of articles evaluated as D by the Consensus Groups.

A third hypothesis can also be formulated: each time the referees expressed a concordant evaluation, it coincided with the bibliometric evaluation. For example, every time two referees agreed to judge an article as A, the article was also classified as A by bibliometrics. Under this hypothesis it is possible to estimate (row 7 of the table) the minimum number of articles for which the Consensus Group decided an evaluation coinciding with the one reached by bibliometrics. At least 64 of the 311 articles (21 %) for which the IR and bibliometric evaluations coincided were directly evaluated by the Consensus Groups. This value is a strong underestimate. Indeed, for articles evaluated as B, the referees agreed on 73 articles, but bibliometrics and peer review coincided for just 56.
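The per-class lower bound in row 7 can plausibly be reproduced as follows. The helper below is our reconstruction of that computation, not code published by ANVUR, and the class-B figures are the ones quoted above.

```python
def min_consensus_coincident(coincident, concordant):
    """Per merit class: minimum number of articles whose Consensus-Group
    evaluation coincided with bibliometrics yet cannot be explained by two
    concordant referees (Hypothesis 3). A reconstruction, not ANVUR code."""
    return max(0, coincident - concordant)

# Class B, with the figures quoted in the text: 56 coincident evaluations
# against 73 concordant referee pairs. The bound is 0 here, even though some
# of the 73 concordant B judgements did not coincide with bibliometrics;
# this is why the overall bound of 64 articles is itself an underestimate.
print(min_consensus_coincident(56, 73))  # 0
```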


About this article


Cite this article

Baccini, A., De Nicolao, G. Do they agree? Bibliometric evaluation versus informed peer review in the Italian research assessment exercise. Scientometrics 108, 1651–1671 (2016).



  • Informed peer review
  • Research assessment
  • Meta-analysis
  • Bibliometric evaluation
  • Italian VQR
  • Peer review
  • Cohen’s kappa