Assessing author self-citation as a mechanism of relevant knowledge diffusion


Author self-citation has long been surrounded by controversy. Although the prevalence of self-citations in different scientific fields has been thoroughly analysed, there is a lack of large-scale quantitative research on how useful they are at guiding readers to new relevant scientific knowledge. In this work we address this issue empirically. Using the entire set of research articles published in PLOS journals as our main corpus, we train a topic discovery model able to capture semantic dissimilarity between pairs of articles. We divide the pairs of articles involved in intra-PLOS citations into self-citations (articles linked by a citation that share at least one author) and non-self-citations (articles linked by a citation that share no author), and compare the distribution of semantic dissimilarity between citing and cited papers in the two groups. We find that the typical semantic distance between articles involved in self-citations is significantly smaller than that observed for articles involved in non-self-citations. Additionally, we find that our results are not driven by the fact that authors tend to specialize in particular areas of research, use specific research methodologies, or simply have particular writing styles. Overall, assuming shared content is an indicator of the relevance and pertinence of citations, our results indicate that self-citations are, in general, useful as a mechanism of knowledge diffusion.
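
As a rough illustration of the comparison described above, the sketch below assumes that each article has already been reduced to a topic distribution (for instance by a latent Dirichlet allocation model) and uses the Jensen–Shannon divergence as the dissimilarity measure. The function names, the toy author sets, and the toy topic vectors are hypothetical and are not taken from the article. Matching authors by exact name is also a simplifying assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) using natural logarithms.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With natural logarithms the value lies between 0 and ln(2).
    """
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def is_self_citation(citing_authors, cited_authors):
    """A citation counts as a self-citation if the two articles share an author."""
    return bool(set(citing_authors) & set(cited_authors))


def split_distances(articles, citations):
    """Return (self_citation_distances, non_self_citation_distances)."""
    self_distances, other_distances = [], []
    for citing_id, cited_id in citations:
        citing, cited = articles[citing_id], articles[cited_id]
        distance = jensen_shannon_divergence(citing["topics"], cited["topics"])
        if is_self_citation(citing["authors"], cited["authors"]):
            self_distances.append(distance)
        else:
            other_distances.append(distance)
    return self_distances, other_distances


# Hypothetical toy corpus: three articles over three topics.
ARTICLES = {
    "A": {"authors": {"Smith", "Lee"}, "topics": [0.7, 0.2, 0.1]},
    "B": {"authors": {"Lee"}, "topics": [0.6, 0.3, 0.1]},
    "C": {"authors": {"Garcia"}, "topics": [0.1, 0.1, 0.8]},
}
# (citing article, cited article) pairs, also hypothetical.
CITATIONS = [("A", "B"), ("A", "C")]

if __name__ == "__main__":
    self_d, other_d = split_distances(ARTICLES, CITATIONS)
    print("self-citation distances:", self_d)
    print("non-self-citation distances:", other_d)
```

On a real corpus the two lists would be compared as distributions rather than as individual values.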


  1. Nevertheless, it should also be noted that there is evidence suggesting that the exclusion of self-citations has a small or null effect in evaluative bibliometrics at the macro level (Glänzel and Thijs 2004).

  2. The detailed construction of this last field is described in Public Library of Science (2015).

  3. KDE functions are a non-parametric way of estimating the probability density function of a random variable. In Fig. 2 we were unable to present the data using KDE functions, because the JSD metric for the not-related group is highly concentrated around 0.7. In Fig. 3 we do not plot the distribution for this group, so we chose KDE functions over empirical distribution functions because they are more straightforward to interpret.
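
The two ways of summarising a sample mentioned in note 3 can be sketched as follows. This is a minimal standard-library illustration, assuming a Gaussian kernel with a hand-picked bandwidth. The function names, the bandwidth, and the sample values are hypothetical and do not reproduce the article's figures.

```python
import bisect
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x.

    Each sample contributes a Gaussian bump of width `bandwidth`.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def empirical_cdf(samples):
    """Return a function giving the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        return bisect.bisect_right(ordered, x) / n

    return cdf


# Hypothetical dissimilarity values, for illustration only.
SAMPLES = [0.12, 0.18, 0.25, 0.31, 0.33, 0.40, 0.52, 0.58]

if __name__ == "__main__":
    density = gaussian_kde(SAMPLES, bandwidth=0.05)
    cdf = empirical_cdf(SAMPLES)
    for x in (0.1, 0.3, 0.5):
        print(f"x={x:.1f}  kde={density(x):.3f}  ecdf={cdf(x):.3f}")
```

The empirical distribution function needs no smoothing parameter, whereas the shape of the KDE curve depends on the chosen bandwidth.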


  • Aksnes, D. (2003). A macro study of self-citation. Scientometrics, 56(2), 235–246.

  • Anauati, V., Galiani, S., & Gálvez, R. H. (2016). Quantifying the life cycle of scholarly articles across fields of economic research. Economic Inquiry, 54(2), 1339–1355.

  • Ball, P. (2005). Index aims for fair ranking of scientists. Nature, 436(7053), 900–900.

  • Bartneck, C., & Kokkelmans, S. (2011). Detecting h-index manipulation through self-citation analysis. Scientometrics, 87(1), 85–98.

  • Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, COLING-ACL ’06 (pp. 69–72). Stroudsburg, PA: Association for Computational Linguistics.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

  • Bornmann, L., & Daniel, H.-D. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1), 45–80.

  • Briët, J., & Harremoës, P. (2009). Properties of classical and quantum Jensen–Shannon divergence. Physical Review A, 79(5), 052311.

  • Chamberlain, S., Boettiger, C., & Ram, K. (2015). rplos: Interface to the Search ‘API’ for ‘PLoS’ Journals. R package version 0.5.4.

  • Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., et al. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science, 26(2), 159–169.

  • Engqvist, L., & Frommen, J. (2008). The h-index and self-citations. Proceedings of the National Academy of Sciences of the United States of America, 99, 11270–11274.

  • Estabrooks, C. A., Derksen, L., Winther, C., Lavis, J. N., Scott, S. D., Wallin, L., et al. (2008). The intellectual structure and substance of the knowledge utilization field: A longitudinal author co-citation analysis, 1945 to 2004. Implementation Science, 3(1), 49.

  • Fagerberg, J., Srholec, M., & Verspagen, B. (2010). Chapter 20—innovation and economic development. In B. H. Hall & N. Rosenberg (Eds.), Handbook of the economics of innovation (Vol. 2, pp. 833–872). Amsterdam: North-Holland.

  • Glänzel, W., & Thijs, B. (2004). The influence of author self-citations on bibliometric macro indicators. Scientometrics, 59(3), 281–310.

  • Hall, D., Jurafsky, D., & Manning, C. D. (2008). Studying the history of ideas using topic models. In Proceedings of the conference on empirical methods in natural language processing, EMNLP ’08 (pp. 363–371). Stroudsburg, PA: Association for Computational Linguistics.

  • Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572.

  • Hu, D. J. (2009). Latent Dirichlet allocation for text, images, and music. Last checked on January 16, 2017.

  • Hudson, J. (2007). Be known by the company you keep: Citations–quality or chance? Scientometrics, 71(2), 231–238.

  • Hyland, K. (2003). Self-citation and self-reference: Credibility and promotion in academic publication. Journal of the American Society for Information Science and Technology, 54(3), 251–259.

  • Knorr-Cetina, K. (1981). The manufacture of knowledge. An essay on the constructivist and contextual nature of science. Oxford: Pergamon Press.

  • Kulkarni, A. V., Aziz, B., Shams, I., & Busse, J. W. (2011). Author self-citation in the general medicine literature. PLoS ONE, 6(6), e20885.

  • Lawani, S. M. (1982). On the heterogeneity and classification of author self-citations. Journal of the American Society for Information Science, 33(5), 281.

  • MacRoberts, M. H., & MacRoberts, B. R. (1989). Problems of citation analysis: A critical review. Journal of the American Society for Information Science, 40(5), 342–349.

  • Maliniak, D., Powers, R., & Walter, B. F. (2013). The gender citation gap in international relations. International Organization, 67(4), 889–922.

  • Merton, R. K. (1973). Sociology of science: Theoretical and empirical investigations. Chicago: University of Chicago Press.

  • Motamed, M., Mehta, D., Basavaraj, S., & Fuad, F. (2002). Self citations and impact factors in otolaryngology journals. Clinical Otolaryngology & Allied Sciences, 27(5), 318–320.

  • Park, H. W., Hong, H. D., & Leydesdorff, L. (2005). A comparison of the knowledge-based innovation systems in the economies of South Korea and the Netherlands using Triple Helix indicators. Scientometrics, 65(1), 3–27.

  • Public Library of Science. (2015). PLOS subject area thesaurus. Last checked on January 16, 2017.

  • Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks (pp. 45–50). Valletta: ELRA.

  • Rogers, E. M. (2003). Diffusion of innovations (5th ed.). New York, NY: Free Press.

  • Schreiber, M. (2007). Self-citation corrections for the Hirsch index. Europhysics Letters, 78(3), 30002.

  • Seglen, P. O. (1992). The skewness of science. Journal of the American Society for Information Science, 43(9), 628–638.

  • Snyder, H., & Bonzi, S. (1998). Patterns of self-citation across disciplines (1980–1989). Journal of Information Science, 24(6), 431–435.

  • Loria, S., Keen, P., Honnibal, M., Yankovsky, R., Karesh, D., Dempsey, E., et al. (2013). TextBlob: Simplified text processing. Last checked on January 16, 2017.

  • Tagliacozzo, R. (1977). Self-citations in scientific literature. Journal of Documentation, 33(4), 251–265.

  • Wojick, D. E., Warnick, W. L., Carroll, B. C., & Crowe, J. (2006). The digital road to scientific knowledge diffusion. D-Lib Magazine, 12(6).

  • Yu, G., Wang, M.-Y., & Yu, D.-R. (2010). Characterizing knowledge diffusion of nanoscience & nanotechnology by citation analysis. Scientometrics, 84(1), 81–97.

Author information


Corresponding author

Correspondence to Ramiro H. Gálvez.

About this article

Cite this article

Gálvez, R.H. Assessing author self-citation as a mechanism of relevant knowledge diffusion. Scientometrics 111, 1801–1812 (2017).


Keywords

  • Author self-citation
  • Latent Dirichlet allocation
  • Semantic dissimilarity
  • Knowledge diffusion