An Approach for Document Clustering Using Semantic Similarity and Whale Optimization

  • Conference paper
Artificial Intelligence Systems and the Internet of Things in the Digital Era (EAMMIS 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 239)

Abstract

Document ontology matching is one of the fastest-growing fields in Semantic Web technologies. Clustering documents by their contents remains difficult because existing algorithms are under-optimized. The proposed method uses TF-IDF for feature selection, analysing the documents and the word occurrences within them. Several variants of TF-IDF are evaluated as feature selection methods to find the one that works best in combination with the similarity measures and the optimization algorithm, with the aim of devising a precise and accurate algorithm for discovering patterns in the documents. The dataset is the English-language CHIC heritage dataset, which contains over 1000 documents. WebPMI and KL divergence are used to measure the semantic similarity between two queries. Finally, the Whale Optimization Algorithm (WOA) is used to optimize the whole framework and produce high-accuracy results.
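As a rough illustration of two of the building blocks named in the abstract, the sketch below computes TF-IDF weights and a smoothed KL divergence between the term distributions of two documents. It is not the authors' implementation: the three toy documents, whitespace tokenization, the `tf * log(N / df)` weighting variant, the add-alpha smoothing, and the function names `tf_idf` and `kl_divergence` are all assumptions made for this example. WebPMI (which depends on web search page counts) and the WOA optimization step are not covered.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: relative term frequency times log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        if total == 0:
            weights.append({})
            continue
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def kl_divergence(p_tokens, q_tokens, vocab, alpha=1.0):
    """KL(P || Q) between two term distributions over a shared vocabulary.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability non-zero, so the logarithm is always defined.
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy corpus (hypothetical; not taken from the CHIC dataset).
docs = [
    "roman coin found in museum",
    "ancient roman coin collection",
    "whale migration in the ocean",
]
tokens = [d.lower().split() for d in docs]
vocab = sorted({t for doc_tokens in tokens for t in doc_tokens})

weights = tf_idf(docs)
print(sorted(weights[0].items(), key=lambda kv: -kv[1])[:3])
print(kl_divergence(tokens[0], tokens[1], vocab))
print(kl_divergence(tokens[0], tokens[2], vocab))
```

On this toy corpus the divergence from the first document to the second (which shares "roman" and "coin") comes out smaller than the divergence to the third. Note that KL divergence is asymmetric, so a clustering pipeline would typically symmetrize it or fix a direction before using it as a distance.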

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sawarn, S., Deepak, G. (2021). An Approach for Document Clustering Using Semantic Similarity and Whale Optimization. In: Musleh Al-Sartawi, A.M., Razzaque, A., Kamal, M.M. (eds) Artificial Intelligence Systems and the Internet of Things in the Digital Era. EAMMIS 2021. Lecture Notes in Networks and Systems, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-030-77246-8_31
