Improving Native Language Identification Model with Syntactic Features: Case of Arabic

  • Seifeddine Mechti
  • Nabil Khoufi
  • Lamia Hadrich Belguith
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 941)


In this paper, we present a machine learning method for the Arabic native language identification task. The method is hybrid: it combines surface analysis of texts with an automatic learning method. Unlike the few techniques found in the state of the art, our method benefits from a feature selection phase, which improved performance. We also show the impact of syntactic features on native language identification. The results obtained outperform those of the best methods previously used for Arabic native language detection.


Arabic native language identification; Machine learning; Syntactic features



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Seifeddine Mechti (1)
  • Nabil Khoufi (2, corresponding author)
  • Lamia Hadrich Belguith (3)
  1. LARODEC Laboratory, ISG of Tunis, University of Tunis, Tunis, Tunisia
  2. ANLP Research Group, MIRACL Laboratory, IHE of Sfax, University of Sfax, Sfax, Tunisia
  3. ANLP Research Group, MIRACL Laboratory, FSEG of Sfax, University of Sfax, Sfax, Tunisia
