Assessing author self-citation as a mechanism of relevant knowledge diffusion
Author self-citation is a practice that has historically been surrounded by controversy. Although the prevalence of self-citations in different scientific fields has been thoroughly analysed, there is a lack of large-scale quantitative research on how useful they are in guiding readers to new, relevant scientific knowledge. In this work we address this issue empirically. Using the entire set of research articles published in PLOS journals as our main corpus, we train a topic discovery model able to capture semantic dissimilarity between pairs of articles. We divide the pairs of articles involved in intra-PLOS citations into self-citations (articles linked by a citation that share at least one author) and non-self-citations (articles linked by a citation that share no author), and examine the distribution of semantic dissimilarity between citing and cited papers in each group. We find that the typical semantic distance between articles involved in self-citations is significantly smaller than that observed for articles involved in non-self-citations. Additionally, we find that our results are not driven by the fact that authors tend to specialize in particular areas of research, make use of specific research methodologies, or simply have particular styles of writing. Overall, assuming that shared content is an indicator of the relevance and pertinence of a citation, our results indicate that self-citations are, in general, useful as a mechanism of knowledge diffusion.
Keywords: Author self-citation; Latent Dirichlet allocation; Semantic dissimilarity; Knowledge diffusion
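The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each article already has a topic distribution inferred by a topic model such as Latent Dirichlet allocation, and it uses the Hellinger distance as the dissimilarity measure. Both the distance measure and the toy data (article identifiers, author names, topic vectors) are assumptions made for illustration; the abstract does not specify them, and the helper names `is_self_citation`, `hellinger` and `split_distances` are hypothetical.

```python
import math
from statistics import median


def is_self_citation(citing_authors, cited_authors):
    """Return True if the two articles share at least one author."""
    return bool(set(citing_authors) & set(cited_authors))


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions, 1 means disjoint support.
    Assumed here for illustration; the abstract names no specific measure.
    """
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)


# Hypothetical toy corpus: author sets and topic distributions over 3 topics.
articles = {
    "A": {"authors": {"Smith", "Lee"}, "topics": [0.7, 0.2, 0.1]},
    "B": {"authors": {"Lee", "Chen"}, "topics": [0.6, 0.3, 0.1]},
    "C": {"authors": {"Garcia"}, "topics": [0.1, 0.2, 0.7]},
    "D": {"authors": {"Smith"}, "topics": [0.7, 0.2, 0.1]},
}

# Hypothetical citation links as (citing, cited) pairs.
citations = [("A", "B"), ("A", "C"), ("D", "A"), ("B", "C")]


def split_distances(articles, citations):
    """Group citing-cited distances into self- and non-self-citations."""
    self_d, other_d = [], []
    for citing, cited in citations:
        a, b = articles[citing], articles[cited]
        d = hellinger(a["topics"], b["topics"])
        if is_self_citation(a["authors"], b["authors"]):
            self_d.append(d)
        else:
            other_d.append(d)
    return self_d, other_d


self_d, other_d = split_distances(articles, citations)
print("median distance, self-citations:    ", median(self_d))
print("median distance, non-self-citations:", median(other_d))
```

On real data the two groups would be compared with a statistical test over their full distributions rather than by medians alone; the sketch only shows how citation pairs are split and scored.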