A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks

Scientometrics

Abstract

Understanding the evolution of paper and author citations is of paramount importance for the design of research policies and evaluation criteria that can promote and accelerate scientific discoveries. Many recent studies on the evolution of science have been conducted in the context of the emergent Science of Science field. While many studies have probed the link prediction problem in citation networks, only a few works have analyzed the temporal nature of link prediction in author citation networks. In this study we compared the performance of 10 well-known local network similarity measurements with four machine learning models in predicting future links in author citation networks. Unlike traditional link prediction methods, our approach takes the temporal nature of the predicted links into account. Our analysis revealed that the Jaccard coefficient was among the most relevant measurements. The preferential attachment measurement, conversely, displayed the worst performance. We also found that extending the local measurements to their weighted versions did not significantly improve the performance of predicting citations. Finally, we found that XGBoost and neural network approaches that summarize the information from all 10 considered similarity measurements provided the highest AUC performance and competitive precision values.
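As a rough illustration of the kind of comparison the abstract describes, the sketch below scores unconnected author pairs in a past snapshot with two of the local measurements named above (Jaccard coefficient and preferential attachment) and evaluates each score by AUC against links that appear later. It is not the authors' implementation: the toy graph, the undirected and unweighted treatment of citations, and the helper names (`neighbors`, `jaccard`, `preferential_attachment`, `auc`) are all assumptions made for this example, and the toy numbers say nothing about the results reported in the paper.

```python
from itertools import combinations

def neighbors(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def preferential_attachment(adj, u, v):
    """Product of the two node degrees."""
    return len(adj[u]) * len(adj[v])

def auc(scores, positives):
    """Probability that a random future link outscores a random non-link
    (ties count as 0.5)."""
    pos = [s for pair, s in scores.items() if pair in positives]
    neg = [s for pair, s in scores.items() if pair not in positives]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: links observed in the past, and links that appear later.
past_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
future_edges = {("A", "D"), ("B", "D")}

adj = neighbors(past_edges)

# Candidate pairs are all node pairs not yet connected in the past snapshot.
candidates = [p for p in combinations(sorted(adj), 2) if p[1] not in adj[p[0]]]

jac = {p: jaccard(adj, *p) for p in candidates}
pa = {p: preferential_attachment(adj, *p) for p in candidates}

print("Jaccard AUC:", auc(jac, future_edges))
print("Preferential attachment AUC:", auc(pa, future_edges))
```

In a supervised setting such as the one the abstract mentions, the per-pair scores would become feature columns for a classifier instead of being ranked one at a time.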

References

  • Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230.

  • Amancio, D. R., Nunes, M. G. V., Oliveira, O. N., Jr., Pardo, T. A. S., Antiqueira, L., & Costa, L. F. (2011). Using metrics from complex networks to evaluate machine translation. Physica A: Statistical Mechanics and its Applications, 390(1), 131–142.

  • Amancio, D. R., Nunes, M. G. V., Oliveira, O. N., Jr., & Costa, L. F. (2012). Using complex networks concepts to assess approaches for citations in scientific papers. Scientometrics, 91(3), 827–842.

  • Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2012). Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index. Journal of Informetrics, 6(3), 427–434.

  • Amancio, D. R., Comin, C. H., Casanova, D., Travieso, G., Bruno, O. M., Rodrigues, F. A., & Costa, L. F. (2014). A systematic comparison of supervised classifiers. PLoS ONE, 9(4), e94137.

  • Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2015). Topological-collaborative approach for disambiguating authors’ names in collaborative networks. Scientometrics, 102(1), 465–485.

  • Bai, X., Xia, F., Lee, I., Zhang, J., & Ning, Z. (2016). Identifying anomalous citations for objective evaluation of scholarly article impact. PloS One, 11(9), e0162.

  • Bai, X., Zhang, F., & Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), 407–418.

  • Bai, X., Zhang, F., Ni, J., Shi, L., & Lee, I. (2020). Measure the impact of institution and paper via institution-citation network. IEEE Access, 8, 548–555.

  • Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3–4), 590–614. https://doi.org/10.1016/s0378-4371(02)00736-7

  • Bornmann, L., & Daniel, H. D. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation.

  • Bradley, A. P. (1997). The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Chacon, X. S., Silva, T. C., & Amancio, D. R. (2020). Comparing the impact of subfields in scientific journals. Scientometrics, 125(1), 625–639.

  • Chen, S., Dang, D., Macy, R., & Rockwell, C. (2019). Link prediction on the patent citation network. https://crockwell.github.io/data/LP_patent.pdf

  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794

  • Cui, P., Wang, X., Pei, J., & Zhu, W. (2018). A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 31(5), 833–852.

  • Daud, A., Ahmed, W., Amjad, T., Nasir, J. A., Aljohani, N. R., Abbasi, R. A., & Ahmad, I. (2017). Who will cite you back? Reciprocal link prediction in citation networks. Library Hi Tech.

  • Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning, pp 233–240

  • Edwards, M. A., & Roy, S. (2017). Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1), 51–61.

  • Eom, Y. H., & Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLoS ONE, 6(9), e24926.

  • Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., Petersen, A. M., Radicchi, F., Sinatra, R., Uzzi, B., Vespignani, A., Waltman, L., Wang, D., & Barabási, A. L. (2018). Science of science. Science. https://doi.org/10.1126/science.aao0185

  • Hennemann, S., Rybski, D., & Liefner, I. (2012). The myth of global science collaboration-collaboration patterns in epistemic communities. Journal of Informetrics, 6(2), 217–225.

  • Hug, S. E., & Brändle, M. P. (2017). The coverage of Microsoft Academic: Analyzing the publication output of a university. CoRR abs/1703.05539

  • Hung, S. W., & Wang, A. P. (2010). Examining the small world phenomenon in the patent citation network: a case study of the radio frequency identification (rfid) network. Scientometrics, 82(1), 121–134.

  • Jain, A., Mao, J., & Mohiuddin, K. (1996). Artificial neural networks: a tutorial. Computer, 29(3), 31–44. https://doi.org/10.1109/2.485891

  • Katz, J. (1994). Geographical proximity and scientific collaboration. Scientometrics, 31(1), 31–43.

  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995, 1137–1145.

  • Krumov, L., Fretter, C., Müller-Hannemann, M., Weihe, K., & Hütt, M. T. (2011). Motifs in co-authorship networks and their relation to the impact of scientific publications. The European Physical Journal B, 84(4), 535–540.

  • Lande, D., Fu, M., Guo, W., Balagura, I., Gorbov, I., & Yang, H. (2020). Link prediction of scientific collaboration networks based on information retrieval. World Wide Web pp 1–19

  • Li, W., Aste, T., Caccioli, F., & Livan, G. (2019). Reciprocity and impact in academic careers. EPJ Data Science, 8(1), 20.

  • Liu, X. F., Chen, H. J., & Sun, W. J. (2021). Adaptive topological coevolution of interdependent networks: Scientific collaboration-citation networks as an example. Physica A: Statistical Mechanics and its Applications, 564, 125518.

  • Lü, L., & Zhou, T. (2010). Link prediction in weighted networks: The role of weak ties. EPL (Europhysics Letters), 89(1), 18001.

  • Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6), 1150–1170.

  • Martinčić-Ipšić, S., Močibob, E., & Perc, M. (2017). Link prediction on twitter. PLoS ONE, 12(7), 1–21. https://doi.org/10.1371/journal.pone.0181079

  • Milojević, S. (2013). Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4), 767–773.

  • Molléri, J. S., Petersen, K., & Mendes, E. (2018). Towards understanding the relation between citations and research quality in software engineering studies. Scientometrics, 117(3), 1453–1478.

  • Newman, M. (2001). Clustering and preferential attachment in growing networks. Physical Review E, 64(2), 025102.

  • Nie, Z., Liu, Y., Yang, L., Li, S., & Pan, F. (2021). Construction and application of materials knowledge graph based on author disambiguation: Revisiting the evolution of lifepo4. Advanced Energy Materials p 2003580

  • Nielsen, M. A. (2015). Neural networks and deep learning (Vol. 25). CA: Determination press San Francisco.

  • Nielsen, M. W., & Andersen, J. P. (2021). Global citation inequality is on the rise. Proceedings of the National Academy of Sciences, 118(7), 2012208118.

  • Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12), 1565–1567.

  • Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.

  • Powell, W. W., White, D. R., Koput, K. W., & Owen-Smith, J. (2005). Network dynamics and field evolution: The growth of interorganizational collaboration in the life sciences. American Journal of Sociology, 110(4), 1132–1205.

  • Radicchi, F., Fortunato, S., Markines, B., & Vespignani, A. (2009). Diffusion of scientific credits and the ranking of scientists. Physical Review E, 80(5), 056103.

  • Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Encyclopedia of Database Systems, 5, 532–538.

  • de Sá, H., & Prudencio, R. (2011). Supervised link prediction in weighted networks. In: Neural Networks (IJCNN), The 2011 International Joint Conference on, IEEE, pp 2281–2288

  • Sebo, P., de Lucia, S., & Vernaz, N. (2021). Accuracy of pubmed-based author lists of publications and use of author identifiers to address author name ambiguity: a cross-sectional study. Scientometrics pp 1–15

  • Shibata, N., Kajikawa, Y., & Sakata, I. (2012). Link prediction in citation networks. Journal of the American Society for Information Science and Technology, 63(1), 78–85.

  • Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. F., & Oliveira, O. N., Jr. (2016). Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2), 487–502.

  • Silva, F. N., Tandon, A., Amancio, D. R., Flammini, A., Menczer, F., Milojević, S., & Fortunato, S. (2020). Recency predicts bursts in the evolution of author citations. Quantitative Science Studies, 1(3), 1298–1308.

  • Stella, M. (2019). Modelling early word acquisition through multiplex lexical networks and machine learning. Big Data and Cognitive Computing, 3(1), 10.

  • Stella, M. (2020). Multiplex networks quantify robustness of the mental lexicon to catastrophic concept failures, aphasic degradation and ageing. Physica A: Statistical Mechanics and its Applications, 554, 124382.

  • Vital, A., Jr., & Amancio, D. R. (2021). A comparative analysis of local network similarity measurements: application to author citation networks. arXiv:2103.13946

  • Wang, K., Shen, Z., Huang, C., Wu, C. H., Dong, Y., & Kanakia, A. (2020). Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1), 396–413.

  • Wang, M., Yu, G., & Yu, D. (2008). Measuring the preferential attachment mechanism in citation networks. Physica A: Statistical Mechanics and its Applications, 387(18), 4692–4698.

  • Wang, P., Xu, B., Wu, Y., & Zhou, X. (2014). Link prediction in social networks: the state-of-the-art

  • Wright, R. E. (1995). Logistic regression.

  • Wuestman, M. L., Hoekman, J., & Frenken, K. (2019). The geography of scientific citations. Research Policy, 48(7), 1771–1780.

  • Yegnanarayana, B. (2009). Artificial neural networks. Delhi: PHI Learning Pvt. Ltd.

  • Zhang, G., Ding, Y., & Milojević, S. (2013). Citation content analysis (cca): A framework for syntactic and semantic analysis of citation content. Journal of the American Society for Information Science and Technology, 64(7), 1490–1503.

  • Zhang, L., & Ban, Z. (2020). Author name disambiguation based on rule and graph model. In: CCF International Conference on Natural Language Processing and Chinese Computing, Springer, pp 617–628

  • Zhou, T., Lü, L., & Zhang, Y. C. (2009). Predicting missing links via local information. The European Physical Journal B, 71(4), 623–630.

Acknowledgements

A preprint version of this manuscript is available at arXiv (Vital and Amancio 2021). D.R.A. acknowledges financial support from São Paulo Research Foundation (FAPESP Grant No. 2020/06271-0) and CNPq-Brazil (Grant No. 304026/2018-2 and 311074/2021-9). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.

Author information

Corresponding author

Correspondence to Diego R. Amancio.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 279 kb)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vital, A., Amancio, D.R. A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks. Scientometrics 127, 6011–6028 (2022). https://doi.org/10.1007/s11192-022-04484-6

  • DOI: https://doi.org/10.1007/s11192-022-04484-6
