Formational bounds of link prediction in collaboration networks

Abstract

Link prediction in collaboration networks is often solved by identifying structural properties of existing nodes that are disconnected at one point in time but share a link later on. The maximum possible recall rate, or upper bound, of this approach is capped by the proportion of links that are formed among existing nodes that exhibit these properties. Consequently, sustained links as well as links that involve one or two new network participants are typically not predicted. The purpose of this study is to highlight formational constraints that need to be considered to increase the practical value of link prediction methods targeted at collaboration networks. In this study, we identify the distribution of basic link formation types based on four large-scale, over-time collaboration networks, showing that, roughly speaking, 25% of links represent continued collaborations, 25% of links are new collaborations between existing authors, and 50% are formed between an existing author and a new network member. This implies that for collaboration networks, increasing the accuracy of computational link prediction solutions may not be a reasonable goal when the ratio of collaboration links that are eligible for the classic link prediction process is low.
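
The link formation types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the category labels, and the toy edge lists are assumptions made for illustration, links are treated as undirected with no self-loops, and the recall bound is taken to be the share of present-period links that are new links between authors already present in the past network.

```python
from collections import Counter


def classify_links(past_edges, present_edges):
    """Count present-period links by how they formed relative to the past network.

    Hypothetical helper: edges are (author, author) pairs, undirected, no self-loops.
    """
    past = {frozenset(edge) for edge in past_edges}
    past_nodes = set().union(*past) if past else set()
    counts = Counter()
    for edge in {frozenset(e) for e in present_edges}:
        known = sum(1 for node in edge if node in past_nodes)
        if edge in past:
            counts["continued"] += 1              # same pair collaborated before
        elif known == 2:
            counts["new_between_existing"] += 1   # both authors existed, link is new
        elif known == 1:
            counts["existing_with_new"] += 1      # one author is a newcomer
        else:
            counts["new_with_new"] += 1           # both authors are newcomers
    return counts


def recall_upper_bound(counts):
    """Share of present links that a classic predictor could reach at all."""
    total = sum(counts.values())
    return counts["new_between_existing"] / total if total else 0.0


# Toy data, invented for illustration only.
past_edges = [("a", "b"), ("b", "c")]
present_edges = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
counts = classify_links(past_edges, present_edges)
print(dict(counts), recall_upper_bound(counts))
```

In this toy case only one of the four present-period links is a new link between existing authors, so even a perfect predictor of that type would recover a quarter of all links.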

Notes

  1. In this study, ‘collaboration’ means ‘coauthorship’ of a research paper, and the two terms are used interchangeably.

  2. https://www.nlm.nih.gov/bsd/licensee/medpmmenu.html.

  3. https://databank.illinois.edu/datasets/IDB-4222651.

  4. http://dblp.org/xml/release/; for this study, we downloaded the April 2015 release.

  5. A list of 392 journals was obtained from the Thomson Reuters Journal Citation Reports 2012 for the category “Computer Science”. We retrieved records of papers published in these journals from DBLP.

  6. http://journals.aps.org/datasets; for this study, we obtained the APS 2014 release version with the permission of the American Physical Society.

  7. Mark E. J. Newman at the University of Michigan Department of Physics kindly provided the disambiguation code.

  8. http://scholar.ndsl.kr/index.do; for this study, we obtained the KISTI 2016 version under a research agreement with the Korea Institute of Science and Technology Information.

  9. This demonstrates why varying past–present network time frames matters for this study. The idea of using different past–present network periods was suggested by one of the reviewers of this paper.

  10. In some fields, such as natural language processing, recall and precision are often inversely related and therefore an average score such as the F metric (e.g., harmonic mean of precision and recall) is calculated.

  11. For a detailed explanation of the Degree Product predictor, see “Appendix”.

  12. This does not mean that all preferential attachment models are designed to explain power-law obeying networks. However, many studies on preferential attachment have attempted to model power-law obeying networks.

  13. https://cran.r-project.org/web/packages/poweRlaw/index.html.

  14. Many studies on power-law distributions in collaboration networks have fitted distribution tails (i.e., the distribution of a certain x value and above) to power-law slopes to assess the performance of proposed network generation models. Several studies have divided a degree distribution into two parts (below and above a certain x value) and fitted them separately to different power-law slopes (e.g., Wagner and Leydesdorff 2005). A few others have tested power-law distributions with cut-offs (below a certain x value) (e.g., Newman 2001b).

References

  1. Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211–230. https://doi.org/10.1016/S0378-8733(03)00009-1.

  2. Barabási, A. L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and Its Applications, 311(3–4), 590–614. https://doi.org/10.1016/s0378-4371(02)00736-7.

  3. Braun, T., Glänzel, W., & Schubert, A. (2001). Publication and cooperation patterns of the authors of neuroscience journals. Scientometrics, 51(3), 499–510. https://doi.org/10.1023/A:1019643002560.

  4. Cabanac, G., Hubert, G., & Milard, B. (2015). Academic careers in Computer Science: Continuance and transience of lifetime co-authorships. Scientometrics, 102(1), 135–150. https://doi.org/10.1007/s11192-014-1426-0.

  5. Chen, D.-B., Xiao, R., & Zeng, A. (2014). Predicting the evolution of spreading on complex networks. Scientific Reports. https://doi.org/10.1038/srep06108

  6. Chen, H., Li, X., & Huang, Z. (2005). Link prediction approach to collaborative filtering. Paper presented at the proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL ’05).

  7. Choudhury, N., & Uddin, S. (2017). Mining actor-level structural and neighborhood evolution for link prediction in dynamic networks. Paper presented at the Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, Sydney, Australia.

  8. Choudhury, N., & Uddin, S. (2018). Evolutionary community mining for link prediction in dynamic networks. Paper presented at the complex networks & their applications VI, Lyon, France.

  9. Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703. https://doi.org/10.1137/070710111.

  10. Fegley, B. D., & Torvik, V. I. (2013). Has large-scale named-entity network analysis been resting on a flawed assumption? PLoS ONE, 8(7), 1–16. https://doi.org/10.1371/journal.pone.0070299.

  11. Guns, R. (2014). Link prediction. In Measuring scholarly impact (pp. 35–55). Springer.

  12. Guns, R., & Rousseau, R. (2014). Recommending research collaborations using link prediction and random forest classifiers. Scientometrics, 101(2), 1461–1473. https://doi.org/10.1007/s11192-013-1228-9.

  13. Kim, J. (2018). Evaluating author name disambiguation for digital libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886. https://doi.org/10.1007/s11192-018-2824-5.

  14. Kim, J., & Diesner, J. (2015). The effect of data pre-processing on understanding the evolution of collaboration networks. Journal of Informetrics, 9(1), 226–236. https://doi.org/10.1016/j.joi.2015.01.002.

  15. Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446–1461.

  16. Kim, J., & Diesner, J. (2017). Over-time measurement of triadic closure in coauthorship networks. Social Network Analysis and Mining, 7(1), 1–12. https://doi.org/10.1007/s13278-017-0428-3.

  17. Kim, J., Tao, L., Lee, S.-H., & Diesner, J. (2016). Evolution and structure of scientific co-publishing network in Korea between 1948–2011. Scientometrics, 107(1), 27–41. https://doi.org/10.1007/s11192-016-1878-5.

  18. Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the precision and recall of author-ity among NIH-funded scientists. PLoS ONE, 11(7), e0158731.

  19. Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019–1031. https://doi.org/10.1002/asi.20591.

  20. Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and Its Applications, 390(6), 1150–1170.

  21. Martin, T., Ball, B., Karrer, B., & Newman, M. E. J. (2013). Coauthorship and citation patterns in the Physical Review. Physical Review E, 88(1), 012814. https://doi.org/10.1103/physreve.88.012814.

  22. Milojević, S. (2010). Modes of collaboration in modern science: Beyond power laws and preferential attachment. Journal of the American Society for Information Science and Technology, 61(7), 1410–1423. https://doi.org/10.1002/asi.21331.

  23. Mohdeb, D., Boubetra, A., & Charikhi, M. (2016). Tie persistence in academic social networks. Informatica, 40(3), 353.

  24. Mollenhorst, G., Volker, B., & Flap, H. (2011). Shared contexts and triadic closure in core discussion networks. Social Networks, 33(4), 292–302. https://doi.org/10.1016/j.socnet.2011.09.001.

  25. Newman, D., Karimi, S., & Cavedon, L. (2009). Using topic models to interpret MEDLINE’s medical subject headings. In A. Nicholson, & X. Li (Eds.), AI 2009: Advances in artificial intelligence (Vol. 5866, pp. 270–279). Berlin, Heidelberg: Springer.

  26. Newman, M. E. J. (2001a). Clustering and preferential attachment in growing networks. Physical Review E. https://doi.org/10.1103/physreve.64.025102.

  27. Newman, M. E. J. (2001b). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404–409. https://doi.org/10.1073/pnas.021544898.

  28. Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J., & Giles, C. L. (2002). Winners don’t take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences of the United States of America, 99(8), 5207–5211. https://doi.org/10.1073/pnas.032085699.

  29. Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface. https://doi.org/10.1098/rsif.2014.0378.

  30. Price, D., & Gürsey, S. (1976). Studies in scientometrics. 1. Transience and continuance in scientific authorship. Paper presented at the international forum on information and documentation.

  31. Reitz, F., & Hoffmann, O. (2011). Did they notice? A case-study on the community contribution to data quality in DBLP. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), Research and advanced technology for digital libraries, TPDL 2011 (Vol. 6966, pp. 204–215). Berlin: Springer.

  32. Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.

  33. Schubert, A., & Glänzel, W. (1991). Publication dynamics—Models and indicators. Scientometrics, 20(1), 317–331. https://doi.org/10.1007/BF02018161.

  34. Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. Paper presented at the advances in neural information processing systems.

  35. Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304.

  36. Wagner, C. S., & Leydesdorff, L. (2005). Network structure, self-organization, and the growth of international collaboration in science. Research Policy, 34(10), 1608–1618. https://doi.org/10.1016/j.respol.2005.08.002.

  37. Yan, E., & Guns, R. (2014). Predicting and recommending collaborations: An author-, institution-, and country-level analysis. Journal of Informetrics, 8(2), 295–309. https://doi.org/10.1016/j.joi.2014.01.008.

Acknowledgements

This work is supported, in part, by Korea Institute of Science and Technology Information (KISTI). We would like to thank Vetle Torvik (University of Illinois at Urbana-Champaign), the American Physical Society, DBLP, and KISTI for providing datasets. We are also grateful to Mark E. J. Newman (University of Michigan) for providing code for disambiguating author names in APS data and Raf Guns (University of Antwerp) for comments on link prediction processes in LinkPred.

Author information

Corresponding author

Correspondence to Jinseok Kim.

Appendix

Degree Product: Barabási et al. (2002) showed that if links in a network are formed through preferential attachment, the probability that two nodes form a link is proportional to the product of the degrees of those two nodes. This product is frequently used to predict link formation among nodes present in both past and present networks. In the following equation, S(x, y) is the prediction score for a pair of nodes x and y, and Γ(x) is the set of nodes connected to x, so |Γ(x)| is the degree of x.

$$S(x, y) = |\Gamma(x)| \times |\Gamma(y)| \tag{2}$$
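
As a minimal sketch of how the Degree Product score might be computed, the Python below builds neighbor sets from an edge list and scores every currently unlinked pair of nodes. It is an illustration under stated assumptions, not the implementation used in the study or in LinkPred: the function name and the toy edge list are hypothetical, and links are treated as undirected.

```python
from itertools import combinations


def degree_product_scores(edges):
    """Score each unlinked node pair by |Γ(x)| × |Γ(y)| (hypothetical helper)."""
    neighbors = {}
    for x, y in edges:
        # Γ(x): the set of nodes connected to x in the past network.
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    linked = {frozenset(edge) for edge in edges}
    scores = {}
    for x, y in combinations(sorted(neighbors), 2):
        if frozenset((x, y)) not in linked:  # only candidate (not yet linked) pairs
            scores[(x, y)] = len(neighbors[x]) * len(neighbors[y])
    return scores


# Toy past network, invented for illustration only.
past_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
scores = degree_product_scores(past_edges)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

Because the score is defined only for nodes that already have neighbors in the past network, pairs involving a newcomer never appear among the candidates, which is the formational constraint the paper describes.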

About this article

Cite this article

Kim, J., Diesner, J. Formational bounds of link prediction in collaboration networks. Scientometrics 119, 687–706 (2019). https://doi.org/10.1007/s11192-019-03055-6

Keywords

  • Collaboration network
  • Link prediction
  • Network evolution
  • Link formation primitives
  • Preferential attachment