Hierarchical Word Mover Distance for Collaboration Recommender System

  • Chao Sun
  • King Tao Jason Ng
  • Philip Henville
  • Roman Marchant
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 996)


Natural Language Processing (NLP) techniques enable automated analysis of large document collections, which makes it possible to compare researcher profiles quantitatively on the basis of their publications. This paper proposes a novel system for measuring researcher similarity that combines topic modelling, Word2vec and Word Mover's Distance calculations on publication abstracts. The proposed method, implemented in Python, matches researchers on the text of their documents by evaluating the semantic meanings of words and topics. The distances between researchers are calculated over various text features in a hierarchical structure. Results show that the system successfully identifies existing co-authorships in sample data even after co-authorship properties have been removed, and that it suggests valid potential academic collaboration links between related research areas irrespective of previous collaboration activity.
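As a rough illustration of how word-level distances can be lifted to researcher-level distances, the following minimal Python sketch may help. It is not the authors' implementation. It assumes a tiny hand-made embedding table (`EMBEDDINGS`, hypothetical values standing in for a trained Word2vec model), it uses the relaxed lower bound of the Word Mover's Distance described by Kusner et al. instead of solving the full transport problem, and the second-level aggregation rule in `researcher_distance` is an assumption chosen for simplicity. Topic modelling is omitted.

```python
import math
from collections import Counter

# Hypothetical 2-D vectors standing in for trained Word2vec embeddings.
EMBEDDINGS = {
    "topic": (1.0, 0.0),
    "model": (0.9, 0.1),
    "word": (0.0, 1.0),
    "vector": (0.1, 0.9),
    "protein": (-1.0, 0.0),
    "cell": (-0.9, -0.1),
}


def nbow(tokens):
    """Normalised bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no in-vocabulary tokens")
    return {word: count / total for word, count in counts.items()}


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD (a lower bound on the exact WMD).

    Each word sends all of its mass to the nearest word of the other
    document; the larger of the two directional costs is returned.
    """
    a, b = nbow(tokens_a), nbow(tokens_b)

    def one_way(src, dst):
        return sum(
            weight * min(math.dist(EMBEDDINGS[s], EMBEDDINGS[d]) for d in dst)
            for s, weight in src.items()
        )

    return max(one_way(a, b), one_way(b, a))


def researcher_distance(abstracts_a, abstracts_b):
    """Second level (assumed rule): match every abstract to the closest
    abstract of the other researcher, average, then symmetrise."""

    def one_way(src, dst):
        return sum(min(relaxed_wmd(x, y) for y in dst) for x in src) / len(src)

    return 0.5 * (one_way(abstracts_a, abstracts_b) + one_way(abstracts_b, abstracts_a))


# Each researcher is a list of tokenised abstracts (toy data).
alice = [["topic", "model"], ["word", "vector"]]
bob = [["model", "topic", "vector"]]
carol = [["protein", "cell"]]

print(researcher_distance(alice, bob))
print(researcher_distance(alice, carol))
```

With these toy vectors the first printed distance is smaller than the second, reflecting that `alice` and `bob` share vocabulary while `carol` works in an unrelated area. A fuller pipeline would replace `EMBEDDINGS` with a trained model and could replace `relaxed_wmd` with an exact optimal transport solver.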



Dr Joel Nothman from the Sydney Informatics Hub provided valuable suggestions and feedback on this work.

Prof. Nick Enfield, director of SSSHARC, Faculty of Arts and Social Sciences, the University of Sydney, initiated the question and supported this work.

The major development was conducted under the Capstone student project program initiated by the School of IT, the University of Sydney.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Faculty of Arts and Social Sciences, The University of Sydney, Sydney, Australia
  2. Centre for Translational Data Science, The University of Sydney, Sydney, Australia
  3. Faculty of Engineering and Information Technologies, The University of Sydney, Sydney, Australia
