
The effect of database dirty data on h-index calculation



Like all databases, bibliometric ones (e.g. Scopus, Web of Knowledge and Google Scholar) are not free of errors, such as missing or wrong records, which can affect publication and citation statistics and, more generally, the bibliometric indicators derived from them. This paper addresses the question “What is the effect of database uncertainty on the evaluation of the h-index?”, breaking with the paradigm of deterministic database analysis and treating responses to database queries as random variables. Specifically, an informetric model of the h-index is used to quantify the variability of this indicator caused by errors in database records. Some preliminary results are presented and discussed.
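The idea of treating a database response as a random variable can be illustrated with a small simulation. The sketch below is not the informetric model used in the paper: it assumes, purely for illustration, that each citation is independently missing from the database with a fixed probability, and the names `h_index`, `perturbed_h_samples` and `miss_prob` are hypothetical.

```python
import random


def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def perturbed_h_samples(citations, miss_prob, trials, seed=0):
    """Sample the h-index under a simple, assumed error model.

    Assumption (not from the paper): every citation is dropped from the
    database independently with probability miss_prob.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # Keep each of the `count` citations with probability 1 - miss_prob.
        observed = [
            sum(1 for _ in range(count) if rng.random() >= miss_prob)
            for count in citations
        ]
        samples.append(h_index(observed))
    return samples
```

Calling `perturbed_h_samples(counts, 0.05, 1000)` on a list of per-paper citation counts gives a sample of h values whose spread (for example via `statistics.pstdev`) is a rough indicator of how sensitive h is to this kind of error. Other error types, such as phantom or duplicated records, would need a different perturbation step.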





Author information

Correspondence to Fiorenzo Franceschini.


About this article

Cite this article

Franceschini, F., Maisano, D. & Mastrogiacomo, L. The effect of database dirty data on h-index calculation. Scientometrics 95, 1179–1188 (2013). https://doi.org/10.1007/s11192-012-0871-x



Keywords

  • Citations
  • h-index
  • h-index robustness
  • Uncertain data
  • Dirty database