Volume 121, Issue 1, pp 399–432

P2V: large-scale academic paper embedding

  • Yi Zhang
  • Fen Zhao
  • Jianguo Lu


Academic papers contain not only text but also citation links. Representing such data is crucial for many tasks, such as classification, disambiguation, duplicate detection, recommendation, and influence prediction. The success of the skip-gram model has inspired many algorithms for learning embeddings of words, documents, and networks. However, there is limited research on learning representations of linked documents such as academic papers. In this paper, we propose a new neural-network-based algorithm, called P2V (paper2vector), to learn high-quality embeddings for academic papers on large-scale datasets. We compare our model with traditional non-neural algorithms and state-of-the-art neural network methods on four datasets of various sizes. The largest dataset we used contains 46.64 million papers and 528.68 million citation links. Experimental results show that P2V achieves state-of-the-art performance in paper classification, paper similarity, and paper influence prediction tasks.


Embedding, Data Representation, Academic Paper



Funding was provided by Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2014-04463).



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2019

Authors and Affiliations

  1. School of Computer Science, University of Windsor, Windsor, Canada
