Abstract
The rapid growth of scientific literature in computer engineering (CE) and computer science (CS) makes it difficult for researchers to explore publication records by standard scientific category. This creates a need for context-aware, automatic classification of text documents into standard scientific categories. Document classification is a significant application of supervised learning, which requires a labeled dataset for training the classifier. However, the research publication records available on Google Scholar and dblp are not labeled. Manually annotating a large body of scientific research work using standard scientific terminology requires domain expertise and is extremely time-consuming. In addition, hierarchical labeling of records enables more effective, context-aware retrieval of documents. In this paper, we propose an ontology-driven classification technique based on zero-shot learning in conjunction with agglomerative clustering to automatically label a scientific literature dataset related to CE and CS. We further study and compare the effectiveness of multiple text classifiers: logistic regression (LR), support vector machines (SVM), and gradient boosting with Word2vec and bag-of-words (BOW) embeddings, recurrent neural networks (RNN) with GloVe embedding, and feed-forward neural networks with BOW embedding. Our study showed that the RNN with GloVe embedding outperforms the other models, with an F1 score above 0.85 at all granularity levels. Our proposed technique will help both junior and experienced researchers identify emerging technologies and domains for their research.
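The idea of labeling unlabeled records against a hierarchical ontology can be illustrated with a small sketch. This is not the chapter's implementation: it swaps the zero-shot model for plain bag-of-words cosine similarity, omits the agglomerative clustering step, and uses a hypothetical two-level ontology (`ONTOLOGY`) and hypothetical helper names (`bow`, `cosine`, `label`) invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical two-level ontology: category -> subcategory -> description.
# These names and descriptions are illustrative assumptions, not from the chapter.
ONTOLOGY = {
    "machine learning": {
        "neural networks": "deep neural network training layers backpropagation",
        "clustering": "unsupervised clustering grouping similar data points",
    },
    "computer architecture": {
        "memory systems": "cache memory hierarchy dram latency bandwidth",
        "processors": "processor pipeline instruction cpu core microarchitecture",
    },
}


def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label(title):
    """Assign a (category, subcategory) pair by descending the ontology."""
    doc = bow(title)

    # Level 1: score each category against its name plus all child descriptions.
    def category_score(cat):
        text = cat + " " + " ".join(ONTOLOGY[cat].values())
        return cosine(doc, bow(text))

    category = max(ONTOLOGY, key=category_score)

    # Level 2: score only the subcategories under the chosen category.
    def subcategory_score(sub):
        return cosine(doc, bow(sub + " " + ONTOLOGY[category][sub]))

    subcategory = max(ONTOLOGY[category], key=subcategory_score)
    return category, subcategory


print(label("Reducing DRAM latency with a smarter cache hierarchy"))
# ('computer architecture', 'memory systems')
```

Descending the ontology level by level restricts each decision to the children of the label already chosen, which is one simple way to obtain labels at several granularity levels. A real zero-shot setup would replace `cosine` with a model that scores how well a label description matches the text.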
Change history
18 December 2022
In the original version of the book, the second author’s first name was inadvertently published with a typo. The name has now been corrected from “Patrck Soong” to “Patrick Soong” in the updated version of the chapter “Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning”.
The chapter and book have been updated with the changes.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pan, Z., Soong, P., Rafatirad, S. (2023). Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_10
DOI: https://doi.org/10.1007/978-981-19-2600-6_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2599-3
Online ISBN: 978-981-19-2600-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)