Semantic Similarity Extraction on Corpora Using Natural Language Processing Techniques and Text Analytics Algorithms

Varghese, Nisha; Punithavalli, M.

doi:10.1007/978-981-16-6332-1_16

Part of the book series: Algorithms for Intelligent Systems (AIS)


Abstract

Extracting semantic similarity and relevant information from a corpus is one of the more elusive tasks in text mining because of unstructured data, uneven patterns, multiple resolutions, concealed meaning, and other ambiguities. Semantic similarity analysis focuses on meaning: the sense of a word is determined by the arrangement of its context words and the other words in the sentence within a given window size. One hurdle in extracting exact semantic similarity from paraphrased statements is corpus length. A longer corpus contains more words and therefore has a better chance of matching any query statement, which gives rise to the over-penalization problem. To alleviate this problem, over-penalization is avoided through length normalization. The objective of the study is to capture semantic similarity and pertinent information more efficiently by increasing term frequency saturation and the impact of document normalization while penalizing less. The study introduces a novel method, the Perfect Matching Algorithm (PMA), developed to reduce over-penalization on the context corpus by applying length normalization that takes into account the lengths of both the query and the context documents.
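The abstract does not give the PMA formula, so the following is only an illustrative sketch of the two ideas it names, term frequency saturation and document length normalization, using the standard Okapi BM25 term weight as a stand-in. The function name `bm25_tf` and the default values `k1=1.2` and `b=0.75` are assumptions chosen for illustration and are not taken from the chapter.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 term-frequency weight (a stand-in, not the PMA).

    tf          -- raw count of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    k1          -- controls how quickly the weight saturates as tf grows
    b           -- strength of length normalization (0 = none, 1 = full)
    """
    # Length normalization: documents longer than average get norm > 1,
    # which lowers the weight of each term occurrence.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the weight grows with tf but never exceeds k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # Same term count, increasingly long documents: the weight drops.
    for length in (100, 200, 400):
        print(length, round(bm25_tf(3, length, avg_doc_len=100), 3))
    # Same document length, increasing term count: the weight levels off.
    for count in (1, 5, 50):
        print(count, round(bm25_tf(count, 100, avg_doc_len=100), 3))
```

In this stand-in, a large `b` penalizes long documents heavily, which is the kind of over-penalization the abstract refers to; how the PMA balances query and context document lengths is not shown in the preview and is not reproduced here.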



Author information

Authors and Affiliations

Department of Computer Applications, Bharathiar University, Coimbatore, 641046, India
Nisha Varghese & M. Punithavalli

Authors

Nisha Varghese
M. Punithavalli

Corresponding author

Correspondence to Nisha Varghese.

Editor information

Editors and Affiliations

Poornima College of Engineering, Jaipur, India
Garima Mathur
Poornima College of Engineering, Jaipur, India
Mahesh Bundele
Rajasthan Technical University, Kota, Rajasthan, India
Mahendra Lalwani
Polish Academy of Sciences, Warsaw, Poland
Marcin Paprzycki


About this paper

Cite this paper

Varghese, N., Punithavalli, M. (2022). Semantic Similarity Extraction on Corpora Using Natural Language Processing Techniques and Text Analytics Algorithms. In: Mathur, G., Bundele, M., Lalwani, M., Paprzycki, M. (eds) Proceedings of 2nd International Conference on Artificial Intelligence: Advances and Applications. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-6332-1_16

DOI: https://doi.org/10.1007/978-981-16-6332-1_16
Published: 14 February 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6331-4
Online ISBN: 978-981-16-6332-1
eBook Packages: Intelligent Technologies and Robotics (R0)
