Semantic similarity based approach for reducing Arabic texts dimensionality

  • Special Issue Article
International Journal of Speech Technology

Abstract

An efficient method is introduced to represent large Arabic texts in a comparatively smaller size without losing significant information. The proposed method uses distributional semantics to build a word-context matrix that represents the distribution of words across contexts, and to transform the text into a vector space model (VSM) representation based on word semantic similarity. The linguistic features of the Arabic language, together with semantic information extracted from lexical-semantic resources such as Arabic WordNet and named-entity gazetteers, are used to improve the text representation and to create clusters of similar and related words. Distributional similarity measures are used to capture the semantic similarity between words and to form these clusters. The experiments show that the proposed method reduces the size of the text representation by about 27% compared with the stem-based VSM and by about 50% compared with the traditional bag-of-words model. The results also show that the amount of dimension reduction depends on the size and shape of the analysis windows as well as on the content of the text.
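The general idea in the abstract (a word-context matrix, a distributional similarity measure, and clusters of similar words that replace individual words as dimensions) can be illustrated with a minimal sketch. This is not the author's implementation: the symmetric window, the cosine measure, the greedy clustering rule, the 0.8 threshold, and all function names are assumptions chosen for illustration, and the toy tokens are English placeholders. The Arabic-specific steps mentioned in the abstract (stemming, Arabic WordNet, named-entity gazetteers) are not modelled.

```python
from collections import Counter, defaultdict
import math


def word_context_matrix(tokens, window=1):
    """Count how often each word co-occurs with its neighbours
    inside a symmetric window (hypothetical window shape)."""
    matrix = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[word][tokens[j]] += 1
    return matrix


def cosine(u, v):
    """Cosine similarity between two sparse context vectors (Counters)."""
    dot = sum(count * v[ctx] for ctx, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_words(matrix, threshold=0.8):
    """Greedy single-pass clustering (an assumed rule): a word joins the
    first cluster whose first member is similar enough, else starts a new one."""
    clusters = []
    for word in sorted(matrix):
        for cluster in clusters:
            if cosine(matrix[word], matrix[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def reduced_vector(tokens, clusters):
    """Represent the text with one dimension per cluster instead of per word."""
    index = {w: k for k, cluster in enumerate(clusters) for w in cluster}
    vector = [0] * len(clusters)
    for token in tokens:
        vector[index[token]] += 1
    return vector


if __name__ == "__main__":
    tokens = ["the", "cat", "sat", "the", "dog", "sat"]
    matrix = word_context_matrix(tokens, window=1)
    clusters = cluster_words(matrix, threshold=0.8)
    print(clusters)
    print(reduced_vector(tokens, clusters))
```

In this toy input, "cat" and "dog" occur in identical contexts, so they fall into one cluster and the four distinct words are represented with three dimensions. Changing `window` changes the context counts and therefore the clusters, which mirrors the abstract's remark that the reduction depends on the analysis window.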

Author information

Corresponding author

Correspondence to Arafat Awajan.

About this article

Cite this article

Awajan, A. Semantic similarity based approach for reducing Arabic texts dimensionality. Int J Speech Technol 19, 191–201 (2016). https://doi.org/10.1007/s10772-015-9284-6

  • DOI: https://doi.org/10.1007/s10772-015-9284-6