Semantic Similarity Extraction on Corpora Using Natural Language Processing Techniques and Text Analytics Algorithms

  • Conference paper
Proceedings of 2nd International Conference on Artificial Intelligence: Advances and Applications

Abstract

Extracting semantic similarity and relevant information from a corpus is one of the elusive tasks in text mining because of unstructured data, uneven patterns, multiple resolutions, concealed meaning, and other ambiguities. Semantic similarity analysis focuses on word sense, which is determined by the arrangement of the context words and the other words in the sentence within a given window size. One hurdle to extracting exact semantic similarity from paraphrased statements is corpus length. A longer corpus contains more words and therefore has a better chance of matching any query statement, which gives rise to the over-penalization problem. Length normalization alleviates this problem by avoiding over-penalization. The objective of this study is to capture semantic similarity and pertinent information more efficiently through increased term frequency saturation and increased impact of document normalization, using a method that penalizes less. The study introduces a novel method, the Perfect Matching Algorithm (PMA), developed to reduce over-penalization of the context corpus by taking into account the lengths of both the query and the context documents through length normalization.
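The abstract names two controls, term frequency saturation and document length normalization, but does not give the PMA formula. The sketch below therefore does not implement PMA; it uses the standard Okapi BM25 term-frequency component purely to illustrate how those two controls interact. The function name and the default values k1 = 1.2 and b = 0.75 are conventional choices assumed here, not values taken from the paper.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term-frequency component (IDF factor omitted).

    Illustrative only: this is standard BM25, not the paper's PMA.
    tf          -- raw count of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    k1          -- saturation control (assumed default)
    b           -- length-normalization strength in [0, 1] (assumed default)
    """
    # Length normalization: b = 0 ignores length, b = 1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the weight rises with tf but stays below k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # Same term count in an average-length and a five-times-longer document.
    print(bm25_term_weight(3, 100, 100))
    print(bm25_term_weight(3, 500, 100))
```

With the same term count, the longer document receives the lower weight. When b is large and a document is very long, that weight can shrink toward zero even though the term is present, which is the over-penalization effect the abstract describes. Setting b to 0 switches length normalization off entirely.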

Author information

Corresponding author

Correspondence to Nisha Varghese.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Varghese, N., Punithavalli, M. (2022). Semantic Similarity Extraction on Corpora Using Natural Language Processing Techniques and Text Analytics Algorithms. In: Mathur, G., Bundele, M., Lalwani, M., Paprzycki, M. (eds) Proceedings of 2nd International Conference on Artificial Intelligence: Advances and Applications. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-6332-1_16
