An Approach for Document Clustering Using Semantic Similarity and Whale Optimization

Sawarn, Shivam; Deepak, Gerard

doi:10.1007/978-3-030-77246-8_31

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 239)

Included in the following conference series:

European, Asian, Middle Eastern, North African Conference on Management & Information Systems

Abstract

Document ontology matching is one of the fastest-growing areas of Semantic Web technologies. Clustering documents by their content remains difficult because existing algorithms are under-optimized. The proposed method uses TF-IDF for feature selection, analysing the documents and the word occurrences within them. Multiple TF-IDF variants are compared as feature selection methods, each combined with other similarity measures and optimization algorithms, to arrive at a precise and accurate algorithm for discovering patterns in documents. The dataset is the English-language CHIC heritage dataset, which contains over 1000 documents. WebPMI and KL divergence are used to measure the semantic similarity between two queries. Finally, the Whale Optimization Algorithm (WOA) is used to optimize the whole framework and produce high-accuracy results.
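
The abstract names TF-IDF and KL divergence without giving formulas, so the following is only a minimal Python sketch of how the two measures might be computed on tokenized documents. The function names `tf_idf` and `kl_divergence`, the plain `tf * log(N / df)` weighting, the add-alpha smoothing, and the three toy documents are all assumptions made for illustration. They are not taken from the chapter, which evaluates several TF-IDF variants and does not state here how KL divergence is applied.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: relative term frequency times log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL(P || Q) in nats over add-alpha smoothed term distributions.

    Smoothing (an assumption) keeps every probability non-zero,
    so the logarithm is always defined.
    """
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (p_counts.get(term, 0) + alpha) / p_total
        q = (q_counts.get(term, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical toy corpus, not drawn from the CHIC dataset.
docs = [
    "roman mosaic floor museum".split(),
    "roman coin museum collection".split(),
    "medieval castle stone wall".split(),
]
weights = tf_idf(docs)
vocab = sorted({term for doc in docs for term in doc})
counts = [Counter(doc) for doc in docs]

near = kl_divergence(counts[0], counts[1], vocab)  # two shared terms
far = kl_divergence(counts[0], counts[2], vocab)   # no shared terms
```

In this toy corpus, a term found in only one document ("mosaic") gets a higher TF-IDF weight than one shared by two documents ("roman"), and the first document diverges less from the second than from the third. KL divergence is asymmetric, so a clustering pipeline would typically need to pick a direction or symmetrize it; which choice the chapter makes is not stated in the abstract.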

Author information

Authors and Affiliations

School of Computing, DIT University, Dehradun, India
Shivam Sawarn
Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, Tiruchirappalli, India
Gerard Deepak

Editor information

Editors and Affiliations

Accounting, Finance and Banking Department, Ahlia University, Manama, Bahrain
Abdalmuttaleb M.A Musleh Al-Sartawi
College of Business and Finance, Ahlia University, Manama, Bahrain
Anjum Razzaque
School of Strategy and Leadership, Coventry University, Coventry, UK
Muhammad Mustafa Kamal

About this paper

Cite this paper

Sawarn, S., Deepak, G. (2021). An Approach for Document Clustering Using Semantic Similarity and Whale Optimization. In: Musleh Al-Sartawi, A.M., Razzaque, A., Kamal, M.M. (eds) Artificial Intelligence Systems and the Internet of Things in the Digital Era. EAMMIS 2021. Lecture Notes in Networks and Systems, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-030-77246-8_31

DOI: https://doi.org/10.1007/978-3-030-77246-8_31
Published: 29 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77245-1
Online ISBN: 978-3-030-77246-8
eBook Packages: Intelligent Technologies and Robotics (R0)
