A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer

  • Mohammed Attia
  • Pavel Pecina
  • Antonio Toral
  • Lamia Tounsi
  • Josef van Genabith
Part of the Communications in Computer and Information Science book series (CCIS, volume 100)

Abstract

Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable, interoperable, and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the Lexical Markup Framework (LMF), the ISO standard for lexical resource representation. This lexical database is used in developing an open-source finite-state morphological processing toolkit. We also build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
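As a rough illustration of what an LMF-formatted entry can look like, the following minimal sketch builds a single lexical entry as XML. It is not the AraComLex schema: the helper names (`feat`, `build_entry`), the sample lemma, and the feature names `gender` and `number` are assumptions chosen for illustration. Only the general LMF convention of `LexicalEntry` and `Lemma` elements carrying `feat` attribute-value pairs is relied upon.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> child to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(lemma, pos, features):
    """Build one LexicalEntry with a Lemma and extra features.

    The feature names passed in are illustrative assumptions, not the
    attribute inventory of the database described in the abstract.
    """
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)
    for att, val in features.items():
        feat(entry, att, val)
    return entry


# Hypothetical sample entry for the noun "kitab" (book).
entry = build_entry("كتاب", "noun", {"gender": "masculine", "number": "singular"})
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Because an entry of this kind is plain attribute-value data, it can be read by any XML-aware tool, which is consistent with the abstract's point that the database does not depend on a particular analyser design or programming language.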

Keywords

Arabic Lexical Database; Modern Standard Arabic; Arabic morphology; Arabic Morphological Transducer

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohammed Attia (1)
  • Pavel Pecina (1)
  • Antonio Toral (1)
  • Lamia Tounsi (1)
  • Josef van Genabith (1)
  1. School of Computing, Dublin City University, Dublin, Ireland
