Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning

Pan, Zhengtong; Soong, Patrick; Rafatirad, Setareh

doi:10.1007/978-981-19-2600-6_10

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies ((LNDECT,volume 137))

Included in the following conference series:

International Conference on Data Management, Analytics & Innovation

The original version of this chapter was revised: The Author name has been corrected from “Patrck Soong” to “Patrick Soong”. The correction to this chapter is available at https://doi.org/10.1007/978-981-19-2600-6_52

Abstract

The rapid growth of scientific literature in computer engineering (CE) and computer science (CS) makes it difficult for researchers to explore publication records by standard scientific category. This creates a need for context-aware, automatic classification of text documents into such categories. Document classification is a major application of supervised learning, which requires a labeled dataset to train the classifier. However, the research publication records available on Google Scholar and dblp are not labeled, and manual annotation of a large body of scientific work using standard scientific terminology requires domain expertise and is extremely time-consuming. Moreover, labeling records hierarchically enables more effective and context-aware retrieval of documents. In this paper, we propose an ontology-driven classification technique based on zero-shot learning in conjunction with agglomerative clustering to automatically label a scientific literature dataset related to CE and CS. We further study and compare the effectiveness of multiple text classifiers: logistic regression (LR), support vector machines (SVM), gradient boosting with Word2vec and bag-of-words (BOW) embedding, recurrent neural networks (RNN) with GloVe embedding, and feed-forward neural networks with BOW embedding. Our study shows that the RNN with GloVe embedding outperforms the other models, with an F1 score above 0.85 at all granularity levels. The proposed technique can help both junior and experienced researchers identify emerging technologies and domains relevant to their research.
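As a rough illustration of the agglomerative clustering step named in the abstract, the sketch below groups a few paper titles using bag-of-words vectors and average-linkage merging with cosine distance. It is not the authors' pipeline: the titles, the stopword list, the linkage choice, and the cluster count are all assumptions made for the example, and the zero-shot labeling and ontology steps are omitted entirely.

```python
# Illustrative sketch only: bag-of-words vectors + average-linkage
# agglomerative clustering. Titles, stopwords, linkage, and cluster
# count are hypothetical and not taken from the chapter.
import math
from collections import Counter
from itertools import combinations

STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "with", "on"}

def bow(text):
    """Return a bag-of-words term-count vector for one title."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return 1.0 - dot / norm if norm else 1.0

def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]

titles = [
    "deep neural network training",
    "neural network pruning",
    "database query optimization",
    "query optimization for distributed database",
]
clusters = agglomerative([bow(t) for t in titles], n_clusters=2)
print(clusters)
```

In a setting like the one the abstract describes, each resulting cluster would still need a category label; the chapter attributes that step to zero-shot learning guided by an ontology, which this sketch does not attempt.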

Change history

18 December 2022
In the original version of the book, the second author’s first name was inadvertently published with a typo. The name has now been corrected from “Patrck Soong” to “Patrick Soong” in the chapter “Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning” updated version.
The chapter and book have been updated with the changes.

Author information

Authors and Affiliations

Department of Computer Science, University of California, Davis, USA
Zhengtong Pan, Patrick Soong & Setareh Rafatirad

Corresponding author

Correspondence to Setareh Rafatirad.

Editor information

Editors and Affiliations

Bangabasi Morning College, Kolkata, West Bengal, India
Saptarsi Goswami
Vara Technology, Saket, Delhi, India
Inderjit Singh Barara
Society for Data Science, Pune, Maharashtra, India
Amol Goje
National University of Singapore, Singapore, Singapore
C. Mohan
Department of Computer Science, Technion—Israel Institute of Technology, Haifa, Israel
Alfred M. Bruckstein

Cite this paper

Pan, Z., Soong, P., Rafatirad, S. (2023). Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_10

DOI: https://doi.org/10.1007/978-981-19-2600-6_10
Published: 22 September 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2599-3
Online ISBN: 978-981-19-2600-6
eBook Packages: Intelligent Technologies and Robotics (R0)