Multi-Word Expressions Annotations Effect in Document Classification Task

  • Dhekra Najar
  • Slim Mesfar
  • Henda Ben Ghezela
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10859)


Document classification is a necessary task for most Natural Language Processing tools, since it organizes document content in a helpful and meaningful way. This paper investigates the impact of using multi-word expressions for text representation on the performance of text classification. Two text classification strategies are proposed and compared for robustness. First, we review the existing linguistic resources for the Arabic language. Second, we present a classification method based on candidate simple terms for each domain; these terms are automatically extracted from multiple specialized corpora according to their frequency of occurrence. Then, we give a detailed description of a classification method based on a dictionary of multi-word expressions: CompounDic, an Arabic multi-word expressions dictionary, is used to automatically annotate multi-word expressions and their variations in text. Finally, we carry out a series of experiments on classifying specialized texts using simple words and multi-word expressions, for comparison purposes. Our experiments show that the use of multi-word expression annotations improves text classification results.
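The two strategies described above can be illustrated with a minimal sketch. It is not the authors' NooJ-based implementation: the function names (`annotate_mwes`, `domain_terms`, `classify`), the toy English corpora, the small `mwes` set standing in for CompounDic, and the overlap-count scoring are all assumptions made for illustration. The sketch also matches dictionary entries exactly, so it does not cover the recognition of multi-word expression variations.

```python
from collections import Counter


def annotate_mwes(tokens, mwe_dict, max_len=4):
    """Greedy longest-match: merge token runs found in mwe_dict into one unit."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in mwe_dict:
                out.append(candidate)
                i += n
                break
        else:
            # No multi-word match starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


def domain_terms(corpus_docs, top_k, mwe_dict=frozenset()):
    """Return the top_k most frequent units of one specialized corpus."""
    counts = Counter()
    for doc in corpus_docs:
        counts.update(annotate_mwes(doc.lower().split(), mwe_dict))
    return {term for term, _ in counts.most_common(top_k)}


def classify(text, profiles, mwe_dict=frozenset()):
    """Pick the domain whose term profile overlaps most with the text.

    Ties (including all-zero scores) go to the first domain in `profiles`.
    """
    units = annotate_mwes(text.lower().split(), mwe_dict)
    scores = {d: sum(u in terms for u in units) for d, terms in profiles.items()}
    return max(scores, key=scores.get)


# Hypothetical toy data; the paper works on Arabic specialized corpora.
corpora = {
    "finance": ["central bank raised interest rate",
                "interest rate cuts help stock market"],
    "medicine": ["high blood pressure raises heart attack risk",
                 "doctor measured blood pressure"],
}
mwes = {"interest rate", "central bank", "stock market",
        "blood pressure", "heart attack"}

# Strategy 1: profiles built from simple words only.
simple_profiles = {d: domain_terms(docs, 3) for d, docs in corpora.items()}
# Strategy 2: profiles built after multi-word expression annotation.
profiles = {d: domain_terms(docs, 3, mwes) for d, docs in corpora.items()}

print(classify("bank announced new interest rate", profiles, mwes))
```

Passing an empty dictionary, as in `simple_profiles`, reduces the pipeline to the simple-term strategy, which lets both representations be compared on the same documents.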


Keywords: Text classification, Topic identification, Multi-Word Expressions, Natural language processing, NooJ, Arabic language, MWEs variation



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. RIADI, University of Manouba, Manouba, Tunisia
