Combining Words and Concepts for Automatic Arabic Text Classification

  • Alaa Alahmadi
  • Arash Joorabchi
  • Abdulhussain E. Mahdi (Email author)
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 782)

Abstract

This paper examines combining words and concepts for text representation in Arabic Automatic Text Classification (ATC), and the impact of this combination on classification accuracy when used with various stemming methods and classifiers. An experimental Arabic ATC system was developed, and the effects of its main components on classification accuracy were assessed. First, variants of the standard Bag-of-Words model with different stemming methods were examined and compared. Next, Arabic Wikipedia and WordNet were compared as sources of concepts for an effective Bag-of-Concepts representation. Based on this comparison, Wikipedia was used to provide concepts, and different strategies for combining words and concepts, including two new approaches developed in-house, were evaluated in terms of their impact on overall classification accuracy. The experimental results show that text representation is a key element in the performance of Arabic ATC, and that combining words and concepts to represent Arabic text improves classification accuracy compared with using words or concepts alone.
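As a rough illustration of what combining words and concepts can mean in practice, the sketch below builds a Bag-of-Words vector, a Bag-of-Concepts vector, and a simple union of the two. It is a minimal sketch under stated assumptions and not the paper's method: the `TOY_CONCEPTS` lookup is a hypothetical stand-in for a Wikipedia-derived concept annotator, the English tokens are placeholders for stemmed Arabic tokens, and the "add concepts to words" strategy shown is a generic one rather than either of the in-house approaches mentioned in the abstract.

```python
from collections import Counter

# Hypothetical phrase-to-concept lookup (assumption for illustration only).
# A real system would derive such mappings from a knowledge source such as
# Wikipedia article titles or WordNet synsets.
TOY_CONCEPTS = {
    ("football",): "Sport",
    ("world", "cup"): "FIFA_World_Cup",
    ("stock", "market"): "Stock_market",
}


def bag_of_words(tokens):
    """Count each token; the 'w:' prefix keeps word features distinct."""
    return Counter(f"w:{t}" for t in tokens)


def bag_of_concepts(tokens, concepts=TOY_CONCEPTS, max_len=2):
    """Count concepts found by greedy longest-phrase matching."""
    found = Counter()
    i = 0
    while i < len(tokens):
        # Prefer the longest phrase that starts at position i.
        for n in range(max_len, 0, -1):
            phrase = tuple(tokens[i:i + n])
            if len(phrase) == n and phrase in concepts:
                found[f"c:{concepts[phrase]}"] += 1
                i += n
                break
        else:
            # No concept starts here; move on to the next token.
            i += 1
    return found


def combine(tokens):
    """Generic 'add concepts' strategy: union of word and concept features."""
    return bag_of_words(tokens) + bag_of_concepts(tokens)


doc = "the world cup is a football event".split()
features = combine(doc)
```

The prefixes `w:` and `c:` keep the two feature spaces separate, so a classifier can weight a concept independently of a word that happens to share its surface form. Other strategies, such as replacing matched words with their concepts, would change only the `combine` function.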

Keywords

Arabic text classification, Text representation models, Bag of words, Bag of concepts, Wikipedia, WordNet

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Alaa Alahmadi (1)
  • Arash Joorabchi (1)
  • Abdulhussain E. Mahdi (1, Email author)
  1. Electronic and Computer Engineering Department, University of Limerick, Limerick, Ireland
