A multi-view method of scientific paper classification via heterogeneous graph embeddings

Abstract

Scientific papers can be classified based on their contents or their citations. To improve performance on this task, we represent papers as nodes and integrate their contents and citations into a heterogeneous graph with two types of edges. One type represents the semantic similarity between papers, derived from their titles and abstracts. The other type represents the citation relationships between papers and the journals or conference proceedings of their references. We use a contrastive learning method to embed the nodes of the heterogeneous graph into a vector space, and then feed the paper node vectors into classifiers such as decision trees and multilayer perceptrons. We conduct experiments on three datasets of scientific papers: the Microsoft Academic Graph, with 63,211 papers in 20 classes; the Proceedings of the National Academy of Sciences, with 38,243 papers in 18 classes; and the American Physical Society, with 443,845 papers in 5 classes. The experimental results on the multi-class task show that our multi-view method achieves classification accuracy of up to 98%, outperforming state-of-the-art methods.
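
For illustration, the sketch below builds such a two-view heterogeneous graph in Python with networkx and scikit-learn. The paper texts, venue names, the edge threshold, and the TF-IDF stand-in for semantic similarity are all assumptions made for the example; the paper's own contrastive encoder is only indicated by a comment.

```python
# A minimal sketch (not the authors' code) of the two-view heterogeneous
# graph described in the abstract: paper nodes linked by semantic-similarity
# edges, plus paper-venue edges for the venues of each paper's references.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = {  # illustrative toy data
    "p1": {"text": "graph neural networks for text classification",
           "ref_venues": ["AAAI", "ICLR"]},
    "p2": {"text": "contrastive learning of graph embeddings",
           "ref_venues": ["ICLR"]},
    "p3": {"text": "protein folding with deep learning",
           "ref_venues": ["PNAS"]},
}

G = nx.Graph()
G.add_nodes_from(papers, node_type="paper")
G.add_nodes_from({v for p in papers.values() for v in p["ref_venues"]},
                 node_type="venue")

# Edge type 1: semantic similarity between papers, here approximated by
# TF-IDF cosine similarity over titles/abstracts.
ids = list(papers)
tfidf = TfidfVectorizer().fit_transform(p["text"] for p in papers.values())
sim = cosine_similarity(tfidf)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        if sim[i, j] > 0.1:  # illustrative threshold
            G.add_edge(ids[i], ids[j], edge_type="semantic",
                       weight=float(sim[i, j]))

# Edge type 2: citation links from each paper to the journals or
# conference proceedings of its references.
for pid, p in papers.items():
    for venue in p["ref_venues"]:
        G.add_edge(pid, venue, edge_type="citation")

# A contrastive encoder (not shown) would embed each paper node into a
# vector; those vectors are then fed to standard classifiers such as
# sklearn.neural_network.MLPClassifier or a decision tree.
```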

Notes

  1. https://www.webofscience.com/.

  2. https://dblp.org/statistics/publicationsperyear.html.

  3. https://www.microsoft.com/en-us/research/project/open-academic-graph/.

  4. https://www.pnas.org/.

  5. https://journals.aps.org/.

References

  • Achakulvisut, T., Acuna, D. E., Ruangrong, T., & Kording, K. (2016). Science concierge: A fast content-based recommendation system for scientific publications. PLoS ONE, 11(7), e0158423.

  • Alsmadi, K. M., Omar, K., Noah, A. S., & Almarashdah, I. (2009). Performance comparison of multi-layer perceptron (back propagation, delta rule and perceptron) algorithms in neural networks. IEEE International Advance Computing Conference (pp. 296–299).

  • Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. EMNLP (pp. 3615–3620).

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. SIGKDD (pp. 785–794).

  • Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). SPECTER: Document-level representation learning using citation-informed transformers. ACL (pp. 2270–2282).

  • Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.

  • Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL (pp. 4171–4186).

  • Ding, K., Wang, J., Li, J., Li, D., & Liu, H. (2020). Be more with less: Hypergraph attention networks for inductive text classification. EMNLP (pp. 4927–4936).

  • Dogan, T., & Uysal, A. K. (2020). A novel term weighting scheme for text classification: TF-MONO. Journal of Informetrics, 14(4), 101076.

  • Ech-Chouyyekh, M., Omara, H., & Lazaar, M. (2019). Scientific paper classification using convolutional neural networks. Proceedings of the 4th International Conference on Big Data and Internet of Things (pp. 1–6).

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

  • Ganguly, S., & Pudi, V. (2017). Paper2vec: Combining graph and text information for scientific paper representation. Advances in Information Retrieval (pp. 383–395). Berlin: Springer.

  • Gao, M., Chen, L., He, X., & Zhou, A. (2018). BiNE: Bipartite network embedding. SIGIR (pp. 715–724).

  • Han, E., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. PAKDD (pp. 53–65).

  • Jacovi, A., Shalom, O., & Goldberg, Y. (2018). Understanding convolutional neural networks for text classification. EMNLP (pp. 56–65).

  • Jin, R., Lu, L., Lee, J., & Usman, A. (2019). Multi-representational convolutional neural networks for text classification. Computational Intelligence, 35(3), 599–609.

  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98 (pp. 137–142).

  • Jones, K. S. (2004). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 60, 493–502.

  • Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. EACL (pp. 427–431).

  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. ICLR.

  • Kong, X., Mao, M., Wang, W., Liu, J., & Xu, B. (2018). VOPRec: Vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing, 9, 226–237.

  • Kozlowski, D., Dusdal, J., Pang, J., & Zilian, A. (2021). Semantic and relational spaces in science of science: Deep learning models for article vectorisation. Scientometrics, 126, 5881–5910.

  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. ICML (pp. 1188–1196).

  • Li, X., Ding, D., Kao, B., Sun, Y., & Mamoulis, N. (2021). Leveraging meta-path contexts for classification in heterogeneous information networks. ICDE (pp. 912–923).

  • Lu, Y., Luo, J., Xiao, Y., & Zhu, H. (2021). Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment. Scientometrics, 126, 6937–6963.

  • Luo, X. (2021). Efficient English text classification using selected machine learning techniques. Alexandria Engineering Journal, 60(3), 3401–3409.

  • Maron, M. E. (1961). Automatic indexing: An experimental inquiry. Journal of the ACM, 8(3), 404–417.

  • Masmoudi, A., Bellaaj, H., Drira, K., & Jmaiel, M. (2021). A co-training-based approach for the hierarchical multi-label classification of research papers. Expert Systems, 38, e12613.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. NeurIPS (pp. 3111–3119).

  • Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. SIGKDD (pp. 701–710).

  • Quan, J., Li, Q., & Li, M. (2014). Computer science paper classification for CSAR. ICWL (pp. 34–43).

  • Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

  • Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.

  • Ramesh, B., & Sathiaseelan, J. G. R. (2015). An advanced multi class instance selection based support vector machine for text classification. Procedia Computer Science, 57, 1124–1130.

  • Sajid, N. A., Ahmad, M., Afzal, M. T., & Atta-ur-Rahman. (2021). Exploiting papers' reference's section for multi-label computer science research papers' classification. Journal of Information and Knowledge Management, 20(01), 2150004.

  • Sun, Y., Han, J., Yan, X., Yu, P. S., & Wu, T. (2011). PathSim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB, 4(11), 992–1003.

  • Tan, Z., Chen, J., Kang, Q., Zhou, M., Abusorrah, A., & Sedraoui, K. (2022). Dynamic embedding projection-gated convolutional neural networks for text classification. IEEE Transactions on Neural Networks and Learning Systems, 33(3), 973–982.

  • Tosi, M. D. L., & dos Reis, J. C. (2021). SciKGraph: A knowledge graph approach to structure a scientific field. Journal of Informetrics, 15(1), 101109.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. NeurIPS (pp. 6000–6010).

  • Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. ICLR.

  • Wang, R., Li, Z., Cao, J., Chen, T., & Wang, L. (2019). Convolutional recurrent neural networks for text classification. IJCNN (pp. 1–6).

  • Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., & Yu, P. S. (2019). Heterogeneous graph attention network. WWW '19 (pp. 2022–2032).

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., & Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

  • Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks? ICLR.

  • Yao, L., Mao, C., & Luo, Y. (2019). Graph convolutional networks for text classification. AAAI (pp. 7370–7377).

  • Zhang, C., Song, D., Huang, C., Swami, A., & Chawla, N. V. (2019). Heterogeneous graph neural network. KDD (pp. 793–803).

  • Zhang, M., Gao, X., Cao, D. M., & Ma, Y. (2006). Modelling citation networks for improving scientific paper classification performance. PRICAI (pp. 413–422).

  • Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. ICLR.

  • Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., & Wang, L. (2020). Every document owns its structure: Inductive text classification via graph neural networks. ACL.

  • Zhang, Y., Zhao, F., & Lu, J. (2019). P2V: Large-scale academic paper embedding. Scientometrics, 121(1), 399–432.

Author information

Corresponding author

Correspondence to Zheng Xie.

Appendix: Fine-grained study on the PNAS

We divide the classes of the PNAS dataset into coarse-grained and fine-grained types. Physical sciences and Social sciences are regarded as coarse-grained subjects, while the sub-disciplines of Biological sciences are fine-grained. Here we compare the classification performance at the two granularities using the MLPClassifier (Table 9). We calculate the increments of GATs, the concatenation method, and our method relative to LDA. GATs outperform LDA on all three classification indicators (precision, recall, and F1), which shows that the semantic features captured by GATs are more useful than the topic distributions obtained by LDA. The direct concatenation method outperforms GATs, with the precision on Genetics even 32% higher, reflecting the advantage of using citations. Our method in turn outperforms direct concatenation, which shows the advantage of using contrastive learning.

Table 9 The fine-grained study results on the PNAS
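
For concreteness, the following sketch shows how such per-label increments relative to the LDA baseline can be computed with scikit-learn. The labels, ground truth, and predictions are invented placeholders, not the PNAS data or the authors' evaluation code.

```python
# Sketch of the per-label increment computation: precision/recall/F1 per
# class for each method, minus the LDA baseline. All data below is made up.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

labels = ["Genetics", "Neuroscience", "Physical sciences", "Social sciences"]

def per_class_scores(y_true, y_pred):
    """Return a (3, n_labels) array: rows are precision, recall, F1."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    return np.stack([p, r, f1])

y_true = ["Genetics", "Neuroscience", "Genetics",
          "Social sciences", "Physical sciences"]
preds = {  # hypothetical outputs of two of the compared methods
    "LDA":        ["Genetics", "Genetics", "Neuroscience",
                   "Social sciences", "Physical sciences"],
    "our method": ["Genetics", "Neuroscience", "Genetics",
                   "Social sciences", "Physical sciences"],
}

baseline = per_class_scores(y_true, preds["LDA"])
for name, y_pred in preds.items():
    if name != "LDA":
        increment = per_class_scores(y_true, y_pred) - baseline
        print(f"{name}: increment over LDA\n{increment}")
```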

To compare the coarse and fine granularities, we average the increments over labels. Table 10 shows the average increments for the three classification indicators. Our method yields the largest improvement among the compared methods, and the increments for fine-grained subjects are much higher than those for coarse-grained subjects. This indicates that our method can distinguish between similar disciplines, which are hard to classify through semantics alone; the citations therefore contribute substantially.

Table 10 The average increment of different granularities
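
A brief sketch of the granularity comparison: averaging the per-label increments separately over the coarse-grained and fine-grained labels. The increment values here are hypothetical placeholders (the 0.32 echoes the Genetics precision gain mentioned above); only the averaging logic is the point.

```python
# Average per-label increments by granularity, as in Table 10.
# Rows of `increment`: precision, recall, F1; columns follow `labels`.
import numpy as np

labels = ["Genetics", "Neuroscience", "Physical sciences", "Social sciences"]
coarse = {"Physical sciences", "Social sciences"}  # assumed split

increment = np.array([[0.32, 0.12, 0.04, 0.05],   # hypothetical values
                      [0.25, 0.10, 0.03, 0.06],
                      [0.28, 0.11, 0.04, 0.05]])

for group, keep in [("coarse", coarse), ("fine", set(labels) - coarse)]:
    idx = [i for i, n in enumerate(labels) if n in keep]
    print(group, increment[:, idx].mean(axis=1))  # mean per indicator
```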

Cite this article

Lv, Y., Xie, Z., Zuo, X. et al. A multi-view method of scientific paper classification via heterogeneous graph embeddings. Scientometrics 127, 4847–4872 (2022). https://doi.org/10.1007/s11192-022-04419-1
