Abstract
Document ontology matching is one of the fastest-growing fields in Semantic Web technologies. Clustering documents by their content remains difficult because existing algorithms are under-optimized. In the proposed method, TF-IDF is used to select features for analysing the documents and the word occurrences within them. Multiple variations of TF-IDF are evaluated as feature selection methods, in combination with other measures and optimization algorithms, to devise a high-precision, accurate algorithm for discovering patterns in the documents. The dataset used is the CHIC heritage dataset in English, which contains over 1000 documents. WebPMI and KL divergence are used to measure semantic similarity between two queries. Finally, the Whale Optimization Algorithm (WOA) is used to optimize the whole framework and produce high-accuracy results.
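As an illustration of two of the building blocks named in the abstract, the following is a minimal Python sketch of TF-IDF term weighting and of KL divergence between the term distributions of two documents. It is not the authors' implementation: the particular TF-IDF variant (relative term frequency times log(N/df)), the add-alpha smoothing, the function names `tf_idf` and `kl_divergence`, and the three toy documents are all assumptions made for this example. WebPMI and the Whale Optimization Algorithm are not covered here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: relative term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def kl_divergence(p_tokens, q_tokens, vocab, alpha=1.0):
    """KL(P || Q) over add-alpha smoothed unigram distributions.

    Smoothing (an assumption of this sketch) keeps every probability
    non-zero so the logarithm is always defined.
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical toy corpus, not taken from the CHIC dataset.
docs = [
    "roman coin found near the river".split(),
    "bronze coin from the roman period".split(),
    "oil painting of the river at dusk".split(),
]
vocab = sorted({term for doc in docs for term in doc})
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document, such as "the", receives a weight of zero, while a term confined to one document receives a positive weight. The KL divergence of a document from itself is zero and grows as the two term distributions diverge; note that it is asymmetric, so KL(P || Q) and KL(Q || P) generally differ.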
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sawarn, S., Deepak, G. (2021). An Approach for Document Clustering Using Semantic Similarity and Whale Optimization. In: Musleh Al-Sartawi, A.M., Razzaque, A., Kamal, M.M. (eds) Artificial Intelligence Systems and the Internet of Things in the Digital Era. EAMMIS 2021. Lecture Notes in Networks and Systems, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-030-77246-8_31
DOI: https://doi.org/10.1007/978-3-030-77246-8_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77245-1
Online ISBN: 978-3-030-77246-8
eBook Packages: Intelligent Technologies and Robotics (R0)