P2V: large-scale academic paper embedding

Scientometrics, volume 121, pages 399–432 (2019)

Abstract

Academic papers contain not only text but also citation links. Representing such data is crucial for many tasks, such as classification, disambiguation, duplicate detection, recommendation, and influence prediction. The success of the skip-gram model has inspired many algorithms for learning embeddings for words, documents, and networks. However, there is limited research on learning representations of linked documents such as academic papers. In this paper, we propose a new neural-network-based algorithm, called P2V (paper2vector), to learn high-quality embeddings for academic papers on large-scale datasets. We compare our model with traditional non-neural-network algorithms and state-of-the-art neural network methods on four datasets of various sizes. The largest dataset we used contains 46.64 million papers and 528.68 million citation links. Experimental results show that P2V achieves state-of-the-art performance in paper classification, paper similarity, and paper influence prediction tasks.

Notes

  1. http://scholar.google.ca/.

  2. https://www.semanticscholar.org/.

  3. DeepWalk uses the CBOW model (Mikolov et al. 2013b) to learn embeddings in the original paper. However, the authors switched to SGNS in the latest release of the DeepWalk implementation. In our work, we use SGNS to learn embeddings.

  4. http://zhang18f.myweb.cs.uwindsor.ca/p2v/.

  5. https://github.com/thunlp/OpenNE.

  6. http://gist.nju.edu.cn/.

  7. http://zhang18f.myweb.cs.uwindsor.ca/ss.

References

  • Ai, Q., Yang, L., Guo, J., & Croft, W. B. (2016a). Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM on international conference on the theory of information retrieval, ACM (pp. 133–142).

  • Ai, Q., Yang, L., Guo, J., & Croft, W. B. (2016b). Improving language estimation with the paragraph vector model for ad-hoc retrieval. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, ACM (pp. 869–872).

  • Bai, X., Zhang, F., & Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), 407–418. https://doi.org/10.1016/j.joi.2019.01.010.

  • Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 1, 238–247.

  • Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, association for computational linguistics (pp. 69–72).

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. https://doi.org/10.1016/S0169-7552(98)00110-X.

  • Cai, D., He, X., & Han, J. (2008). Training linear discriminant analysis in linear time. In 2008 IEEE 24th international conference on data engineering, IEEE, Cancun, Mexico, (pp. 209–217), https://doi.org/10.1109/ICDE.2008.4497429.

  • Cao, S., Lu, W., & Xu, Q. (2015). GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’15, pp. 891–900, https://doi.org/10.1145/2806416.2806512.

  • Dong, Y., Chawla, N. V., & Swami, A. (2017). Metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp. 135–144

  • Dumais, S. T. (2005). Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 188–230. https://doi.org/10.1002/aris.1440380105.

  • Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2014). Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166

  • Fu, L. D., & Aliferis, C. F. (2010). Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 85(1), 257–270. https://doi.org/10.1007/s11192-010-0160-5.

  • Ganguly, S., & Pudi, V. (2017). Paper2vec: Combining graph and text information for scientific paper representation. In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in information retrieval (pp. 383–395). Berlin: Springer.

  • Gao, Y., Zhang, C., Peng, J., & Parameswaran, A. (2018). Low-norm graph embedding. arXiv preprint arXiv:1802.03560.

  • Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94. https://doi.org/10.1016/j.knosys.2018.03.022.

  • Grover, A., & Leskovec, J. (2016). Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, (pp. 855–864).

  • Heffernan, K., & Teufel, S. (2018). Identifying problems and solutions in scientific text. Scientometrics, 116(2), 1367–1382. https://doi.org/10.1007/s11192-018-2718-6.

  • Jia, Y., Wang, Y., Jin, X., Lin, H., & Cheng, X. (2017). Knowledge graph embedding: A locally and temporally adaptive translation-based approach. ACM Transactions on the Web, 12(2), 8:1–8:33. https://doi.org/10.1145/3132733.

  • Kawamura, T., Watanabe, K., Matsumoto, N., Egami, S., & Jibu, M. (2018). Funding map using paragraph embedding based on semantic diversity. Scientometrics, 116(2), 941–958. https://doi.org/10.1007/s11192-018-2783-x.

  • Kong, X., Mao, M., Wang, W., Liu, J., & Xu, B. (2018). VOPRec: vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing. https://doi.org/10.1109/TETC.2018.2830698.

  • Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.

  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. ICML, 14, 1188–1196.

  • Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems (pp. 2177–2185).

  • Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., et al. (2018). Computational linguistics literature and citations oriented citation linkage, classification and summarization. International Journal on Digital Libraries, 19(2), 173–190. https://doi.org/10.1007/s00799-017-0219-5.

  • Mesnil, G., Mikolov, T., Ranzato, M., & Bengio, Y. (2014). Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv preprint arXiv:1412.5335.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  • Moraes, L., Baki, S., Verma, R., & Lee, D. (2018). Identifying reference spans: Topic modeling and word embeddings help IR. International Journal on Digital Libraries, 19(2), 191–202. https://doi.org/10.1007/s00799-017-0220-z.

  • Mu, C., Yang, G., & Yan, Z. (2018). Revisiting skip-gram negative sampling model with rectification. arXiv preprint arXiv:1804.00306.

  • Müller, M. C. (2017). Semantic author name disambiguation with word embeddings. In J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis, & I. Karydis (Eds.), Research and advanced technology for digital libraries (pp. 300–311). Berlin: Springer. Lecture Notes in Computer Science.

  • Palumbo, E., Rizzo, G., & Troncy, R. (2017). Entity2Rec: Learning user-item relatedness from knowledge graphs for top-n item recommendation. In Proceedings of the eleventh ACM conference on recommender systems, ACM, New York, NY, USA, RecSys ’17 (pp. 32–36), https://doi.org/10.1145/3109859.3109889.

  • Palumbo, E., Rizzo, G., Troncy, R., Baralis, E., Osella, M., & Ferro, E. (2018). Knowledge graph embeddings with node2vec for item recommendation. In The semantic web: ESWC 2018 satellite events, Springer, Cham, Lecture Notes in Computer Science (pp. 117–120), https://doi.org/10.1007/978-3-319-98192-5_22.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  • Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

  • Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 701–710.

  • Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137. https://doi.org/10.1108/eb046814.

  • Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., & Tang, J. (2017). Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. arXiv:1710.02971 [cs, stat].

  • Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge: Cambridge University Press.

  • Řehuřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, ELRA, Valletta, Malta, pp. 45–50.

  • Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, pp. 338–342.

  • Schlötterer, J., Seifert, C., & Granitzer, M. (2017). On joint representation learning of network structure and document content. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine learning and knowledge extraction (pp. 237–251). Lecture Notes in Computer Science. Berlin: Springer.

  • Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B. J. P., & Wang, K. (2015). An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th international conference on World Wide Web, ACM, WWW ’15 Companion, pp. 243–246, https://doi.org/10.1145/2740908.2742839.

  • Smiley, D., Pugh, E., & Parisa, K. (2015). Apache solr enterprise search server (3rd ed.). Birmingham: Packt Publishing Ltd.

  • Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD ’08, pp 990–998, https://doi.org/10.1145/1401890.1402008.

  • Tang, J., Qu, M., & Mei, Q. (2015a). Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1165–1174.

  • Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015b). Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, ACM, pp 1067–1077.

  • Tian, H., & Zhuo, H. H. (2017). Paper2vec: Citation-context based document distributed representation for scholar recommendation. arXiv preprint arXiv:1703.06587.

  • Van Der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1), 3221–3245.

  • Wang, D., Cui, P., & Zhu, W. (2016a). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1225–1234.

  • Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., Zhang, W., & Wang, X. (2018). AceKG: A Large-scale Knowledge Graph for Academic Data Mining. In Proceedings of the 27th ACM international conference on information and knowledge management, ACM, CIKM ’18 (pp. 1487–1490), https://doi.org/10.1145/3269206.3269252.

  • Wang, S., Tang, J., Aggarwal, C., & Liu, H. (2016b). Linked document embedding for classification. In Proceedings of the 25th ACM international on conference on information and knowledge management, ACM (pp. 115–124).

  • Wu, Q., & Wolfram, D. (2011). The influence of effects and phenomena on citations: A comparative analysis of four citation perspectives. Scientometrics, 89(1), 245. https://doi.org/10.1007/s11192-011-0456-0.

  • Yang, C., Liu, Z., Zhao, D., Sun, M., & Chang, E. Y. (2015). Network representation learning with rich text information. In IJCAI (pp. 2111–2117).

  • Zhang, D., Yin, J., Zhu, X., & Zhang, C. (2016). Homophily, structure, and content augmented network representation learning. In 2016 IEEE 16th international conference on data mining (ICDM), IEEE, Barcelona, Spain (pp. 609–618), https://doi.org/10.1109/ICDM.2016.0072.

  • Zhang, Y., & Lu, J. (2016). Near-duplicated Documents in CiteSeerX. In Proceedings of the IJCAI 2016 workshop on scholarly big data (pp. 22–28).

  • Zhang, Y., Lu, J., & Shai, O. (2018a). Improve network embeddings with regularization. In Proceedings of the 27th ACM international conference on information and knowledge management, ACM, CIKM ’18 (pp. 1643–1646), https://doi.org/10.1145/3269206.3269320

  • Zhang, Z., Yang, H., Bu, J., Zhou, S., Yu, P., Zhang, J., et al. (2018b). ANRL: Attributed network representation learning via deep neural networks. In IJCAI (pp. 3155–3161).

  • Zhao, F., Zhang, Y., Lu, J., & Shai, O. (2019). Measuring academic influence using heterogeneous author-citation networks. Scientometrics, 118(3), 1119–1140. https://doi.org/10.1007/s11192-019-03010-5.

  • Zhao, S., Zhang, D., Duan, Z., Chen, J., Zhang, Yp, & Tang, J. (2018). A novel classification method for paper-reviewer recommendation. Scientometrics, 115(3), 1293–1313. https://doi.org/10.1007/s11192-018-2726-6.

  • Zhou, T., Zhang, Y., & Lu, J. (2016). Identifying Academic papers in computer science based on text classification. In Proceedings of the IJCAI 2016 workshop on scholarly big data (pp. 16–21).

  • Zhu, D., Dai, X. Y., & Chen, J. (2019). Representing anything from scholar papers. Journal of Web Semantics p S1570826819300150, https://doi.org/10.1016/j.websem.2019.02.001

  • Zhu, S., Yu, K., Chi, Y., & Gong, Y. (2007). Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’07, ACM Press, Amsterdam, The Netherlands, p 487, https://doi.org/10.1145/1277741.1277825.

Acknowledgements

Funding was provided by Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2014-04463).

Author information

Corresponding author

Correspondence to Yi Zhang.

About this article

Cite this article

Zhang, Y., Zhao, F. & Lu, J. P2V: large-scale academic paper embedding. Scientometrics 121, 399–432 (2019). https://doi.org/10.1007/s11192-019-03206-9

  • DOI: https://doi.org/10.1007/s11192-019-03206-9
