Abstract
Understanding the evolution of paper and author citations is of paramount importance for the design of research policies and evaluation criteria that can promote and accelerate scientific discoveries. Recently, many studies on the evolution of science have been conducted in the context of the emergent Science of Science field. While many studies have probed the link prediction problem in citation networks, only a few works have analyzed the temporal nature of link prediction in author citation networks. In this study we compared the performance of 10 well-known local network similarity measurements with four machine learning models to predict future links in author citation networks. Unlike traditional link prediction methods, our approach takes into account the temporal nature of the predicted links. Our analysis revealed that the Jaccard coefficient was among the most relevant measurements. The preferential attachment measurement, conversely, displayed the worst performance. We also found that extending the local measurements to their weighted versions does not significantly improve the performance of predicting citations. Finally, we found that XGBoost and neural network approaches summarizing the information from all 10 considered similarity measurements provided the highest AUC performance and competitive precision values.
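To make the evaluation setup concrete, the following is a minimal sketch, not the authors' implementation, of how two of the local measurements named above (the Jaccard coefficient and preferential attachment) can be scored on unlinked pairs of an earlier snapshot and then compared by AUC against links that appear later. Several things here are assumptions made for illustration: the toy edge lists, the function names, and the treatment of the network as undirected (author citation networks are directed). The numbers it prints come from a hand-built toy graph and are not results from the study.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build undirected neighbor sets from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u or v (0.0 if none)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj.get(u, set())) * len(adj.get(v, set()))


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy data: links observed in an earlier period ...
past_edges = [("h", "x1"), ("h", "x2"), ("h", "x3"), ("a", "c"), ("b", "c")]
# ... and links that first appear in a later period (the prediction target).
future_edges = {("a", "b")}

adj = build_adjacency(past_edges)

# Candidate pairs are all node pairs not yet linked in the earlier period.
candidates = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
positives = [pair for pair in candidates if pair in future_edges]
negatives = [pair for pair in candidates if pair not in future_edges]

results = {}
for name, score in (
    ("jaccard", jaccard),
    ("preferential_attachment", preferential_attachment),
):
    pos = [score(adj, u, v) for u, v in positives]
    neg = [score(adj, u, v) for u, v in negatives]
    results[name] = auc(pos, neg)
    print(f"{name}: AUC = {results[name]:.3f}")
```

The temporal aspect enters only through the split: similarity scores are computed from the earlier snapshot, while the labels come from links formed afterwards. A supervised variant along the lines described in the abstract would feed several such scores per pair into a classifier instead of ranking by a single score.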
References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230.
Amancio, D. R., Nunes, M. G. V., Oliveira, O. N., Jr., Pardo, T. A. S., Antiqueira, L., & Costa, L. F. (2011). Using metrics from complex networks to evaluate machine translation. Physica A: Statistical Mechanics and its Applications, 390(1), 131–142.
Amancio, D. R., Nunes, M. G. V., Oliveira, O. N., Jr., & Costa, L. F. (2012). Using complex networks concepts to assess approaches for citations in scientific papers. Scientometrics, 91(3), 827–842.
Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2012). Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index. Journal of Informetrics, 6(3), 427–434.
Amancio, D. R., Comin, C. H., Casanova, D., Travieso, G., Bruno, O. M., Rodrigues, F. A., & Costa, L. F. (2014). A systematic comparison of supervised classifiers. PLoS ONE, 9(4), e94137.
Amancio, D. R., Oliveira, O. N., Jr., & Costa, L. F. (2015). Topological-collaborative approach for disambiguating authors’ names in collaborative networks. Scientometrics, 102(1), 465–485.
Bai, X., Xia, F., Lee, I., Zhang, J., & Ning, Z. (2016). Identifying anomalous citations for objective evaluation of scholarly article impact. PLoS ONE, 11(9), e0162.
Bai, X., Zhang, F., & Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), 407–418.
Bai, X., Zhang, F., Ni, J., Shi, L., & Lee, I. (2020). Measure the impact of institution and paper via institution-citation network. IEEE Access, 8, 548–555.
Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3–4), 590–614. https://doi.org/10.1016/s0378-4371(02)00736-7
Bornmann, L., & Daniel, H. D. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chacon, X. S., Silva, T. C., & Amancio, D. R. (2020). Comparing the impact of subfields in scientific journals. Scientometrics, 125(1), 625–639.
Chen, S., Dang, D., Macy, R., & Rockwell, C. (2019). Link prediction on the patent citation network. https://crockwell.github.io/data/LP_patent.pdf
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
Cui, P., Wang, X., Pei, J., & Zhu, W. (2018). A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 31(5), 833–852.
Daud, A., Ahmed, W., Amjad, T., Nasir, J. A., Aljohani, N. R., Abbasi, R. A., & Ahmad, I. (2017). Who will cite you back? Reciprocal link prediction in citation networks. Library Hi Tech
Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp 233–240
Edwards, M. A., & Roy, S. (2017). Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34(1), 51–61.
Eom, Y. H., & Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLoS ONE, 6(9), e24926.
Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., Petersen, A. M., Radicchi, F., Sinatra, R., Uzzi, B., Vespignani, A., Waltman, L., Wang, D., & Barabási, A. L. (2018). Science of science. Science. https://doi.org/10.1126/science.aao0185
Hennemann, S., Rybski, D., & Liefner, I. (2012). The myth of global science collaboration-collaboration patterns in epistemic communities. Journal of Informetrics, 6(2), 217–225.
Hug, S. E., & Brändle, M. P. (2017). The coverage of Microsoft Academic: Analyzing the publication output of a university. CoRR, abs/1703.05539
Hung, S. W., & Wang, A. P. (2010). Examining the small world phenomenon in the patent citation network: a case study of the radio frequency identification (RFID) network. Scientometrics, 82(1), 121–134.
Jain, A., Mao, J., & Mohiuddin, K. (1996). Artificial neural networks: a tutorial. Computer, 29(3), 31–44. https://doi.org/10.1109/2.485891
Katz, J. (1994). Geographical proximity and scientific collaboration. Scientometrics, 31(1), 31–43.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995, 1137–1145.
Krumov, L., Fretter, C., Müller-Hannemann, M., Weihe, K., & Hütt, M. T. (2011). Motifs in co-authorship networks and their relation to the impact of scientific publications. The European Physical Journal B, 84(4), 535–540.
Lande, D., Fu, M., Guo, W., Balagura, I., Gorbov, I., & Yang, H. (2020). Link prediction of scientific collaboration networks based on information retrieval. World Wide Web pp 1–19
Li, W., Aste, T., Caccioli, F., & Livan, G. (2019). Reciprocity and impact in academic careers. EPJ Data Science, 8(1), 20.
Liu, X. F., Chen, H. J., & Sun, W. J. (2021). Adaptive topological coevolution of interdependent networks: Scientific collaboration-citation networks as an example. Physica A: Statistical Mechanics and its Applications, 564, 125518.
Lü, L., & Zhou, T. (2010). Link prediction in weighted networks: The role of weak ties. EPL (Europhysics Letters), 89(1), 18001.
Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6), 1150–1170.
Martinčić-Ipšić, S., Močibob, E., & Perc, M. (2017). Link prediction on Twitter. PLoS ONE, 12(7), 1–21. https://doi.org/10.1371/journal.pone.0181079
Milojević, S. (2013). Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4), 767–773.
Molléri, J. S., Petersen, K., & Mendes, E. (2018). Towards understanding the relation between citations and research quality in software engineering studies. Scientometrics, 117(3), 1453–1478.
Newman, M. (2001). Clustering and preferential attachment in growing networks. Physical Review E, 64(2), 025102.
Nie, Z., Liu, Y., Yang, L., Li, S., & Pan, F. (2021). Construction and application of materials knowledge graph based on author disambiguation: Revisiting the evolution of LiFePO4. Advanced Energy Materials, 2003580.
Nielsen, M. A. (2015). Neural networks and deep learning (Vol. 25). San Francisco, CA: Determination Press.
Nielsen, M. W., & Andersen, J. P. (2021). Global citation inequality is on the rise. Proceedings of the National Academy of Sciences, 118(7), e2012208118.
Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12), 1565–1567.
Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.
Powell, W. W., White, D. R., Koput, K. W., & Owen-Smith, J. (2005). Network dynamics and field evolution: The growth of interorganizational collaboration in the life sciences. American Journal of Sociology, 110(4), 1132–1205.
Radicchi, F., Fortunato, S., Markines, B., & Vespignani, A. (2009). Diffusion of scientific credits and the ranking of scientists. Physical Review E, 80(5), 056103.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Encyclopedia of Database Systems, 5, 532–538.
de Sá, H., & Prudencio, R. (2011). Supervised link prediction in weighted networks. In: The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 2281–2288
Sebo, P., de Lucia, S., & Vernaz, N. (2021). Accuracy of PubMed-based author lists of publications and use of author identifiers to address author name ambiguity: a cross-sectional study. Scientometrics pp 1–15
Shibata, N., Kajikawa, Y., & Sakata, I. (2012). Link prediction in citation networks. Journal of the American Society for Information Science and Technology, 63(1), 78–85.
Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. F., & Oliveira, O. N., Jr. (2016). Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2), 487–502.
Silva, F. N., Tandon, A., Amancio, D. R., Flammini, A., Menczer, F., Milojević, S., & Fortunato, S. (2020). Recency predicts bursts in the evolution of author citations. Quantitative Science Studies, 1(3), 1298–1308.
Stella, M. (2019). Modelling early word acquisition through multiplex lexical networks and machine learning. Big Data and Cognitive Computing, 3(1), 10.
Stella, M. (2020). Multiplex networks quantify robustness of the mental lexicon to catastrophic concept failures, aphasic degradation and ageing. Physica A: Statistical Mechanics and its Applications, 554, 124382.
Vital, A., Jr., & Amancio, D. R. (2021). A comparative analysis of local network similarity measurements: application to author citation networks. arXiv:2103.13946
Wang, K., Shen, Z., Huang, C., Wu, C. H., Dong, Y., & Kanakia, A. (2020). Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1), 396–413.
Wang, M., Yu, G., & Yu, D. (2008). Measuring the preferential attachment mechanism in citation networks. Physica A: Statistical Mechanics and its Applications, 387(18), 4692–4698.
Wang, P., Xu, B., Wu, Y., & Zhou, X. (2014). Link prediction in social networks: the state-of-the-art
Wright, R. E. (1995). Logistic regression.
Wuestman, M. L., Hoekman, J., & Frenken, K. (2019). The geography of scientific citations. Research Policy, 48(7), 1771–1780.
Yegnanarayana, B. (2009). Artificial neural networks. Delhi: PHI Learning Pvt. Ltd.
Zhang, G., Ding, Y., & Milojević, S. (2013). Citation content analysis (CCA): A framework for syntactic and semantic analysis of citation content. Journal of the American Society for Information Science and Technology, 64(7), 1490–1503.
Zhang, L., & Ban, Z. (2020). Author name disambiguation based on rule and graph model. In: CCF International Conference on Natural Language Processing and Chinese Computing, Springer, pp 617–628
Zhou, T., Lü, L., & Zhang, Y. C. (2009). Predicting missing links via local information. The European Physical Journal B, 71(4), 623–630.
Acknowledgements
A preprint version of this manuscript is available at arXiv (Vital and Amancio 2021). D.R.A. acknowledges financial support from São Paulo Research Foundation (FAPESP Grant No. 2020/06271-0) and CNPq-Brazil (Grant No. 304026/2018-2 and 311074/2021-9). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vital, A., Amancio, D.R. A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks. Scientometrics 127, 6011–6028 (2022). https://doi.org/10.1007/s11192-022-04484-6