Successful fish go with the flow: citation impact prediction based on centrality measures for term–document networks

Scientometrics

Abstract

In this work we address the challenge of identifying, within a given set of texts, those documents that are most likely to have substantial impact in the future. To this end we develop a purely content-based methodology for ranking a given set of documents, for example abstracts of scientific publications, according to their potential to generate impact, as measured by the number of citations the articles will receive in the future. We construct a bipartite network in which documents are linked to the keywords and terms they contain. We study recursive centrality measures for such networks that quantify how many different terms a document contains and how these terms are related to each other. From this we derive a novel indicator, document centrality, that is shown to be highly predictive of citation impact in six different case studies. We compare these results to findings from a multivariable regression model and from conventional network-based centrality measures to show that document centrality indeed offers a comparably high performance in identifying those articles that contain a large number of high-impact keywords. Our findings suggest that articles which conform to the mainstream within a given research field tend to receive higher numbers of citations than highly original and innovative articles.
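To make the idea of a recursive centrality on a term–document network concrete, here is a minimal Python sketch. It is not the document centrality defined in the article, whose exact formula is not given in the abstract. It assumes a simple averaging recursion in the style of the "method of reflections" known from the economic complexity literature. The function names (`build_bipartite`, `reflections`), the whitespace tokenizer, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Build both sides of a term-document network.

    Assumption: terms are lowercase whitespace-separated tokens; a real
    pipeline would likely add stemming and stop-word removal.
    """
    doc_terms = {d: set(text.lower().split()) for d, text in docs.items()}
    term_docs = defaultdict(set)
    for d, terms in doc_terms.items():
        for t in terms:
            term_docs[t].add(d)
    return doc_terms, dict(term_docs)


def reflections(doc_terms, term_docs, steps=1):
    """Alternate neighbor averages between documents and terms.

    Order 0 is the plain degree (terms per document, documents per term).
    Each step replaces a node's score by the mean score of its neighbors
    on the other side of the network. This recursion is an illustrative
    stand-in, not the article's indicator.
    """
    k_doc = {d: float(len(ts)) for d, ts in doc_terms.items()}
    k_term = {t: float(len(ds)) for t, ds in term_docs.items()}
    for _ in range(steps):
        new_doc = {
            d: (sum(k_term[t] for t in ts) / len(ts)) if ts else 0.0
            for d, ts in doc_terms.items()
        }
        new_term = {
            t: sum(k_doc[d] for d in ds) / len(ds)
            for t, ds in term_docs.items()
        }
        k_doc, k_term = new_doc, new_term
    return k_doc, k_term


# Toy corpus (hypothetical): two documents share terms, one is isolated.
docs = {
    "a": "network centrality citation",
    "b": "network centrality",
    "c": "quantum gravity",
}
doc_terms, term_docs = build_bipartite(docs)
k_doc, k_term = reflections(doc_terms, term_docs, steps=1)
for d in sorted(k_doc, key=k_doc.get, reverse=True):
    print(d, round(k_doc[d], 3))
```

After one step, a document's score is the average number of documents that use each of its terms, so in this toy corpus documents built from widely shared terms score higher than the one built from terms that appear nowhere else. Whether such a recursion predicts citations is an empirical question that this sketch does not address.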

Notes

  1. http://wikibon.org/blog/big-data-statistics/, retrieved 07/29/2015.

  2. http://apps.webofknowledge.com/.

Acknowledgments

PK acknowledges financial support from the European Commission, EU FP7 Project MULTIPLEX, No. 317532. We thank the anonymous referees for providing extremely helpful comments and suggestions.

Author information

Corresponding author

Correspondence to Aleksandar S. Jovanovic.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (TXT 2 kb)

About this article

Cite this article

Klimek, P., Jovanovic, A. S., Egloff, R. et al. Successful fish go with the flow: citation impact prediction based on centrality measures for term–document networks. Scientometrics 107, 1265–1282 (2016). https://doi.org/10.1007/s11192-016-1926-1

  • DOI: https://doi.org/10.1007/s11192-016-1926-1
