Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning

  • Conference paper
Data Management, Analytics and Innovation (ICDMAI 2022)
  • The original version of this chapter was revised: The Author name has been corrected from “Patrck Soong” to “Patrick Soong”. The correction to this chapter is available at https://doi.org/10.1007/978-981-19-2600-6_52

Abstract

The rapid growth of scientific literature in computer engineering (CE) and computer science (CS) makes it difficult for researchers to explore publication records by standard scientific categories. This creates a need for context-aware, automatic classification of text documents into those categories. Document classification is a significant application of supervised learning, which requires a labeled dataset for training the classifier. However, the research publication records available on Google Scholar and dblp are not labeled. Two considerations follow. First, manually annotating a large body of scientific work with standard scientific terminology requires domain expertise and is extremely time-consuming. Second, hierarchical labeling of records enables more effective and context-aware retrieval of documents. In this paper, we propose an ontology-driven classification technique that combines zero-shot learning with agglomerative clustering to automatically label a scientific literature dataset related to CE and CS. We further study and compare the effectiveness of multiple text classifiers: logistic regression (LR), support vector machines (SVM), and gradient boosting with Word2vec and bag-of-words (BOW) embeddings; recurrent neural networks (RNN) with GloVe embeddings; and feed-forward neural networks with BOW embeddings. Our study showed that the RNN with GloVe embeddings outperforms the other models, with an F1 score above 0.85 at all granularity levels. Our proposed technique will help junior and experienced researchers identify emerging technologies and domains for their research.
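As a rough illustration of the labeling idea described in the abstract (cluster unlabeled records, then assign each cluster a category from an ontology), the following is a minimal sketch and not the authors' implementation. It assumes a toy two-category ontology, made-up paper titles, and plain bag-of-words cosine similarity as a stand-in for the zero-shot classifier; all function and variable names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the most similar pair of clusters.
        _, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def label_clusters(clusters, vectors, ontology):
    """Give every record the ontology category closest to its cluster."""
    labels = {}
    for members in clusters:
        centroid = Counter()
        for i in members:
            centroid.update(vectors[i])
        # Stand-in for zero-shot classification: nearest category description.
        best = max(ontology, key=lambda name: cosine(centroid, bow(ontology[name])))
        for i in members:
            labels[i] = best
    return labels


# Toy data (assumed for illustration only).
titles = [
    "deep neural networks for image recognition",
    "training neural networks with gradient descent",
    "query optimization in relational database systems",
    "transaction processing in distributed database systems",
]
ontology = {
    "Machine Learning": "neural networks learning training gradient",
    "Databases": "database query transaction relational systems",
}

vectors = [bow(t) for t in titles]
clusters = agglomerate(vectors, n_clusters=2)
labels = label_clusters(clusters, vectors, ontology)
for i, title in enumerate(titles):
    print(f"{labels[i]}: {title}")
```

The resulting automatically labeled records could then serve as training data for a supervised classifier. In practice the similarity step would be replaced by a pretrained zero-shot model and the flat category dictionary by a hierarchical ontology.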

Change history

  • 18 December 2022

    In the original version of the book, the second author’s first name was inadvertently published with a typo. The name has now been corrected from “Patrck Soong” to “Patrick Soong” in the chapter “Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning”.

    The chapter and book have been updated with the changes.

Author information

Corresponding author

Correspondence to Setareh Rafatirad.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Pan, Z., Soong, P., Rafatirad, S. (2023). Ontology-Driven Scientific Literature Classification Using Clustering and Self-supervised Learning. In: Goswami, S., Barara, I.S., Goje, A., Mohan, C., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 137. Springer, Singapore. https://doi.org/10.1007/978-981-19-2600-6_10
