P2V: large-scale academic paper embedding

Zhang, Yi; Zhao, Fen; Lu, Jianguo

doi:10.1007/s11192-019-03206-9

Published: 10 August 2019

Scientometrics, volume 121, pages 399–432 (2019)

Abstract

Academic papers contain not only text but also links to one another through citations. Representing such data is crucial for many tasks, such as classification, disambiguation, duplicate detection, recommendation, and influence prediction. The success of the skip-gram model has inspired many algorithms for learning embeddings of words, documents, and networks. However, there is limited research on learning representations of linked documents such as academic papers. In this paper, we propose a new neural-network-based algorithm, called P2V (paper2vector), to learn high-quality embeddings for academic papers on large-scale datasets. We compare our model with traditional non-neural-network-based algorithms and state-of-the-art neural network methods on four datasets of various sizes. The largest dataset we used contains 46.64 million papers and 528.68 million citation links. Experimental results show that P2V achieves state-of-the-art performance in paper classification, paper similarity, and paper influence prediction tasks.

Notes

http://scholar.google.ca/.
https://www.semanticscholar.org/.
DeepWalk uses the CBOW model (Mikolov et al. 2013b) to learn embeddings in the original paper. However, the authors switched to SGNS in the latest released version of the DeepWalk implementation. In our work, we use SGNS to learn embeddings.
http://zhang18f.myweb.cs.uwindsor.ca/p2v/.
https://github.com/thunlp/OpenNE.
http://gist.nju.edu.cn/.
http://zhang18f.myweb.cs.uwindsor.ca/ss.
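
The note on DeepWalk above refers to SGNS (skip-gram with negative sampling; Mikolov et al. 2013b). As a rough illustration of that general objective, and not of the P2V implementation itself, the sketch below computes the loss for one (center, context) pair with a few sampled negatives and applies one plain SGD update. The function names `sgns_pair_loss` and `sgns_sgd_step`, the learning rate, and the toy vectors are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_pair_loss(center, context, negatives):
    """Negative log-likelihood of one positive pair plus its sampled negatives."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


def sgns_sgd_step(center, context, negatives, lr=0.05):
    """One in-place SGD update on the center, context, and negative vectors."""
    grad_center = [0.0] * len(center)
    # Label 1.0 marks the observed context, 0.0 marks a sampled negative.
    targets = [(context, 1.0)] + [(neg, 0.0) for neg in negatives]
    for vec, label in targets:
        g = sigmoid(dot(center, vec)) - label
        for i in range(len(center)):
            grad_center[i] += g * vec[i]   # accumulate with the old vec[i]
            vec[i] -= lr * g * center[i]   # center is only updated afterwards
    for i in range(len(center)):
        center[i] -= lr * grad_center[i]


# Toy usage: a "paper" vector, one observed context (for instance a word or a
# cited paper), and one randomly drawn negative context.
paper = [0.1, 0.2]
observed = [0.3, -0.1]
sampled = [[0.2, 0.1]]
before = sgns_pair_loss(paper, observed, sampled)
sgns_sgd_step(paper, observed, sampled)
after = sgns_pair_loss(paper, observed, sampled)
```

In this toy run the loss after the update is lower than before it, which is the expected behaviour of a small gradient step. A practical system would draw negatives from a noise distribution and iterate over many pairs.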

References

Ai, Q., Yang, L., Guo, J., & Croft, W. B. (2016a). Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM on international conference on the theory of information retrieval, ACM (pp. 133–142).
Ai, Q., Yang, L., Guo, J., & Croft, W. B. (2016b). Improving language estimation with the paragraph vector model for ad-hoc retrieval. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, ACM (pp. 869–872).
Bai, X., Zhang, F., & Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), 407–418. https://doi.org/10.1016/j.joi.2019.01.010.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 1, 238–247.
Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, association for computational linguistics (pp. 69–72).
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. https://doi.org/10.1016/S0169-7552(98)00110-X.
Cai, D., He, X., & Han, J. (2008). Training linear discriminant analysis in linear time. In 2008 IEEE 24th international conference on data engineering, IEEE, Cancun, Mexico, (pp. 209–217), https://doi.org/10.1109/ICDE.2008.4497429.
Cao, S., Lu, W., & Xu, Q. (2015). GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’15, pp. 891–900, https://doi.org/10.1145/2806416.2806512.
Dong, Y., Chawla, N. V., & Swami, A. (2017). Metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp. 135–144
Dumais, S. T. (2005). Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 188–230. https://doi.org/10.1002/aris.1440380105.
Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2014). Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166
Fu, L. D., & Aliferis, C. F. (2010). Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 85(1), 257–270. https://doi.org/10.1007/s11192-010-0160-5.
Ganguly, S., & Pudi, V. (2017). Paper2vec: Combining graph and text information for scientific paper representation. In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in information retrieval (pp. 383–395). Berlin: Springer.
Gao, Y., Zhang, C., Peng, J., & Parameswaran, A. (2018). Low-norm graph embedding. arXiv preprint arXiv:1802.03560.
Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94. https://doi.org/10.1016/j.knosys.2018.03.022.
Grover, A., & Leskovec, J. (2016). Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, (pp. 855–864).
Heffernan, K., & Teufel, S. (2018). Identifying problems and solutions in scientific text. Scientometrics, 116(2), 1367–1382. https://doi.org/10.1007/s11192-018-2718-6.
Jia, Y., Wang, Y., Jin, X., Lin, H., & Cheng, X. (2017). Knowledge graph embedding: A locally and temporally adaptive translation-based approach. ACM Transactions on the Web, 12(2), 8:1–8:33. https://doi.org/10.1145/3132733.
Kawamura, T., Watanabe, K., Matsumoto, N., Egami, S., & Jibu, M. (2018). Funding map using paragraph embedding based on semantic diversity. Scientometrics, 116(2), 941–958. https://doi.org/10.1007/s11192-018-2783-x.
Kong, X., Mao, M., Wang, W., Liu, J., & Xu, B. (2018). VOPRec: vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing. https://doi.org/10.1109/TETC.2018.2830698.
Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. ICML, 14, 1188–1196.
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems (pp. 2177–2185).
Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., et al. (2018). Computational linguistics literature and citations oriented citation linkage, classification and summarization. International Journal on Digital Libraries, 19(2), 173–190. https://doi.org/10.1007/s00799-017-0219-5.
Mesnil, G., Mikolov, T., Ranzato, M., & Bengio, Y. (2014). Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv preprint arXiv:1412.5335.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Moraes, L., Baki, S., Verma, R., & Lee, D. (2018). Identifying reference spans: Topic modeling and word embeddings help IR. International Journal on Digital Libraries, 19(2), 191–202. https://doi.org/10.1007/s00799-017-0220-z.
Mu, C., Yang, G., & Yan, Z. (2018). Revisiting skip-gram negative sampling model with rectification. arXiv preprint arXiv:1804.00306.
Müller, M. C. (2017). Semantic author name disambiguation with word embeddings. In J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis, & I. Karydis (Eds.), Research and advanced technology for digital libraries (pp. 300–311). Berlin: Springer. Lecture Notes in Computer Science.
Palumbo, E., Rizzo, G., & Troncy, R. (2017). Entity2Rec: Learning user-item relatedness from knowledge graphs for top-n item recommendation. In Proceedings of the eleventh ACM conference on recommender systems, ACM, New York, NY, USA, RecSys ’17 (pp. 32–36), https://doi.org/10.1145/3109859.3109889.
Palumbo, E., Rizzo, G., Troncy, R., Baralis, E., Osella, M., & Ferro, E. (2018). Knowledge graph embeddings with node2vec for item recommendation. In The semantic web: ESWC 2018 satellite events, Springer, Cham, Lecture Notes in Computer Science (pp. 117–120), https://doi.org/10.1007/978-3-319-98192-5_22.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 701–710.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137. https://doi.org/10.1108/eb046814.
Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., & Tang, J. (2017). Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. arXiv:1710.02971 [cs, stat].
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge: Cambridge University Press.
Řehuřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, ELRA, Valletta, Malta, pp. 45–50.
Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, pp. 338–342.
Schlötterer, J., Seifert, C., & Granitzer, M. (2017). On joint representation learning of network structure and document content. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine learning and knowledge extraction (pp. 237–251). Berlin: Springer. Lecture Notes in Computer Science.
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B. J. P., & Wang, K. (2015). An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th international conference on World Wide Web, ACM, WWW ’15 Companion, pp. 243–246, https://doi.org/10.1145/2740908.2742839.
Smiley, D., Pugh, E., & Parisa, K. (2015). Apache solr enterprise search server (3rd ed.). Birmingham: Packt Publishing Ltd.
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008). ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, KDD ’08, pp 990–998, https://doi.org/10.1145/1401890.1402008.
Tang, J., Qu, M., & Mei, Q. (2015a). Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1165–1174.
Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015b). Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, ACM, pp 1067–1077.
Tian, H., & Zhuo, H. H. (2017). Paper2vec: Citation-Context Based Document Distributed Representation for Scholar Recommendation. arXiv preprint arXiv:1703.06587.
Van Der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1), 3221–3245.
Wang, D., Cui, P., & Zhu, W. (2016a). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1225–1234.
Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., Zhang, W., & Wang, X. (2018). AceKG: A Large-scale Knowledge Graph for Academic Data Mining. In Proceedings of the 27th ACM international conference on information and knowledge management, ACM, CIKM ’18 (pp. 1487–1490), https://doi.org/10.1145/3269206.3269252.
Wang, S., Tang, J., Aggarwal, C., & Liu, H. (2016b). Linked document embedding for classification. In Proceedings of the 25th ACM international on conference on information and knowledge management, ACM (pp. 115–124).
Wu, Q., & Wolfram, D. (2011). The influence of effects and phenomena on citations: A comparative analysis of four citation perspectives. Scientometrics, 89(1), 245. https://doi.org/10.1007/s11192-011-0456-0.
Yang, C., Liu, Z., Zhao, D., Sun, M., & Chang, E. Y. (2015). Network representation learning with rich text information. In IJCAI (pp. 2111–2117).
Zhang, D., Yin, J., Zhu, X., & Zhang, C. (2016). Homophily, structure, and content augmented network representation learning. In 2016 IEEE 16th international conference on data mining (ICDM), IEEE, Barcelona, Spain (pp. 609–618), https://doi.org/10.1109/ICDM.2016.0072.
Zhang, Y., & Lu, J. (2016). Near-duplicated Documents in CiteSeerX. In Proceedings of the IJCAI 2016 workshop on scholarly big data (pp. 22–28).
Zhang, Y., Lu, J., & Shai, O. (2018a). Improve network embeddings with regularization. In Proceedings of the 27th ACM international conference on information and knowledge management, ACM, CIKM ’18 (pp. 1643–1646), https://doi.org/10.1145/3269206.3269320
Zhang, Z., Yang, H., Bu, J., Zhou, S., Yu, P., Zhang, J., et al. (2018b). ANRL: Attributed network representation learning via deep neural networks. In IJCAI (pp. 3155–3161).
Zhao, F., Zhang, Y., Lu, J., & Shai, O. (2019). Measuring academic influence using heterogeneous author-citation networks. Scientometrics, 118(3), 1119–1140. https://doi.org/10.1007/s11192-019-03010-5.
Zhao, S., Zhang, D., Duan, Z., Chen, J., Zhang, Yp, & Tang, J. (2018). A novel classification method for paper-reviewer recommendation. Scientometrics, 115(3), 1293–1313. https://doi.org/10.1007/s11192-018-2726-6.
Zhou, T., Zhang, Y., & Lu, J. (2016). Identifying Academic papers in computer science based on text classification. In Proceedings of the IJCAI 2016 workshop on scholarly big data (pp. 16–21).
Zhu, D., Dai, X. Y., & Chen, J. (2019). Representing anything from scholar papers. Journal of Web Semantics p S1570826819300150, https://doi.org/10.1016/j.websem.2019.02.001
Zhu, S., Yu, K., Chi, Y., & Gong, Y. (2007). Combining content and link for classification using matrix factorization. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’07, ACM Press, Amsterdam, The Netherlands, p 487, https://doi.org/10.1145/1277741.1277825.

Acknowledgements

Funding was provided by Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2014-04463).

Author information

Authors and Affiliations

School of Computer Science, University of Windsor, Windsor, ON, Canada
Yi Zhang, Fen Zhao & Jianguo Lu

Corresponding author

Correspondence to Yi Zhang.

About this article

Cite this article

Zhang, Y., Zhao, F. & Lu, J. P2V: large-scale academic paper embedding. Scientometrics 121, 399–432 (2019). https://doi.org/10.1007/s11192-019-03206-9

Received: 18 March 2019
Published: 10 August 2019
Issue Date: October 2019
DOI: https://doi.org/10.1007/s11192-019-03206-9
