Language Resources and Evaluation, Volume 46, Issue 4, pp 543–563

A real time Named Entity Recognition system for Arabic text mining

  • Harith Al-Jumaily
  • Paloma Martínez
  • José L. Martínez-Fernández
  • Erik Van der Goot
Original Paper

Abstract

Arabic is the most widely spoken language in the Arab World, and most people in the Islamic World understand Classical Arabic because it is the language of the Qur’an. Although the number of Arabic-speaking Internet users (in the Middle East and in North and East Africa) has increased considerably over the last decade, systems that automatically analyze Arabic digital resources are not as readily available as they are for English. This work therefore attempts to build a real time Named Entity Recognition system that can be used in web applications to detect the appearance of specific named entities and events in news written in Arabic. Because Arabic is a highly inflectional language, we try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. These patterns are built up by processing and integrating different gazetteers, from DBPedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php).
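
The idea of reducing the effect of affixes before a gazetteer lookup can be illustrated with a minimal sketch. It is not the system described in the paper: the prefix list, the three-entry gazetteer, and the function names candidate_forms and tag_tokens are all assumptions made for illustration, and real gazetteers such as those named above are far larger.

```python
# Illustrative assumption: a short list of common Arabic proclitics
# (conjunctions, prepositions, the definite article and combinations).
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و", "ب", "ل", "ف", "ك")

# Illustrative assumption: a toy single-token gazetteer
# (Baghdad, Egypt, Damascus), not the paper's resources.
GAZETTEER = {
    "بغداد": "LOCATION",
    "مصر": "LOCATION",
    "دمشق": "LOCATION",
}


def candidate_forms(token):
    """Return the token followed by variants with one listed prefix removed."""
    forms = [token]
    for prefix in PREFIXES:
        # Keep at least two letters so short words are not stripped away.
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            forms.append(token[len(prefix):])
    return forms


def tag_tokens(tokens):
    """Pair each token with a gazetteer label, or None when nothing matches."""
    tagged = []
    for token in tokens:
        label = None
        for form in candidate_forms(token):
            if form in GAZETTEER:
                label = GAZETTEER[form]
                break
        tagged.append((token, label))
    return tagged
```

The unmodified token is tried first, so a name that happens to begin with a prefix letter (such as بغداد) still matches directly. A simple scheme like this can over-strip, which is one reason fuller morphological analysis is usually preferred.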

Keywords

Arabic language; Text mining; Named Entity Recognition; Event detection; Morphological analysis; Root extraction

Notes

Acknowledgments

This work has been partially supported by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade), through the BUSCAMEDIA Project (CEN-20091026), and also by the Spanish research projects: MA2VICMR: Improving the access, analysis and visibility of the multilingual and multimedia information in web for the Region of Madrid (S2009/TIC-1542), and MULTIMEDICA: Multilingual Information Extraction in Health domain and application to scientific and informative documents (TIN2010-20644-C03-01). The authors would also like to thank the IPSC of the European Commission’s Joint Research Centre for allowing us to include the EMM search engine in our system.

References

  1. Abuleil, S. (2004). Extracting names from Arabic text for question-answering systems. RIAO’04, Proceedings of the 7th international conference on Coupling approaches, coupling media, and coupling languages for information retrieval, April 26–28, 2004 (pp. 638–647). France: University of Avignon (Vaucluse).
  2. Afify, M., Sarikaya, R., Kuo, H.-K. J., Besacier, L., & Gao, Y. (2006). On the use of morphological analysis for dialectal Arabic speech recognition. Interspeech-2006, Pittsburgh, PA, September 2006.
  3. Alqatta Alsaqly, I. (1999). Building nouns, verbs, and Gerunds, reviewed by Dr. Ahmed Mohamed Abdel-Dayem. Egypt: Dar Al Kutub.
  4. Al-Sughaiyer, I., & Al-Kharashi, I. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189–213.
  5. Al-Zoghby, A., Eldin, A. S., Ismail, N. A., & Hamza, T. (2007). Mining Arabic text using soft-matching association rules. In International conference on Computer engineering & systems, 2007. ICCES ’07 (pp. 421–426). November 2007.
  6. Attia, M. (2008). Handling Arabic morphological and syntactic ambiguities within the LFG framework with a view to machine translation. PhD Dissertation, University of Manchester.
  7. Benajiba, Y. (2009). Arabic Named Entity Recognition. PhD dissertation, Universidad Politécnica de Valencia.
  8. Benajiba, Y., Diab, M., & Rosso, P. (October, 2008). Arabic Named Entity Recognition using Optimized Feature Sets. In Proceedings of international conference on Empirical methods in natural language processing, EMNLP-2008 (pp. 284–293). Honolulu: Waikiki.
  9. Benajiba, Y., Diab, M. T., & Rosso, P. (2009). Arabic Named Entity Recognition: A feature-driven study. IEEE Transactions on Audio, Speech & Language Processing, 17(5), 926–934.
  10. Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic Named Entity Recognition system based on maximum entropy. Computational linguistics and intelligent text processing, 8th international conference, February 18–24, 2007 (pp. 143–153). Mexico City: CICLing.
  11. Benajiba, Y., Zitouni, I., Diab, M., & Rosso, P. (2010). Arabic Named Entity Recognition: Using features extracted from noisy data. In Proceedings of ACL 2010, Uppsala, Sweden, July 2010.
  12. Best, C., Steinberger, R., & Halkia, S. (2007). Web mining and intelligence. IPSC Institute for the Protection and the Security of the Citizen. Council of Europe. http://globesec.jrc.ec.europa.eu/publications/brochures/brochures/LB7606422ENC.pdf.
  13. Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02.
  14. DBpedia. (2009). http://dbpedia.org/About.
  15. Diab, M. (2009). Second Generation Tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and base phrase chunking. In MEDAR 2nd international conference on Arabic language resources and Tools. Egypt: Cairo.
  16. El potencial de la Red en árabe. Accessed April 27, 2010, from http://www.elmundo.es.
  17. EMM search engine of the JRC. (2009). http://langtech.jrc.it/.
  18. Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing, 8(4), Article 14.
  19. Freund, Y., Seung, H., Shamir, E., & Tishby, N. (1997). Selective sampling using the Query by Committee algorithm. Machine Learning, 28, 133–168.
  20. GATE. (2009). A general architecture for text engineering. http://gate.ac.uk/.
  21. Goweder, A., Poesio, M., & Roeck, A. (2004). Broken plural detection for Arabic Information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK.
  22. Habash, N. Y. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1), 1–187.
  23. Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A Toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.
  24. Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. In A. van den Bosch & A. Soudi (Eds.), Arabic computational morphology: Knowledge-based and empirical methods. Berlin: Springer.
  25. Halpern, J. (2007). The challenges and pitfalls of Arabic Romanization and Arabization. In Second Workshop on Computational approaches to Arabic Script-based Languages (CAASL2). Stanford: Stanford University.
  26. Internet World Stats, Usage and Population Statistics. http://www.internetworldstats.com/.
  27. Kaye, A. S. (1991). The Hamzat al-Wasl in Contemporary modern standard Arabic. Journal of the American Oriental Society, 111(3), 572–574. http://www.jstor.org/stable/604273.
  28. Khoja, S., & Garside, R. (1999). Stemming Arabic text. Computing department. Lancaster U.K.: Lancaster University. Accessed September 22, 1999, from http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps.
  29. Khoja, S., Garside, R., & Knowles, G. (2001). A tagset for the morphosyntactic tagging of Arabic. In Paper given at the Corpus Linguistics 2001 conference, Lancaster.
  30. Lieberman, H. (Ed.). (2001). Your wish is my command: Programming by example. San Francisco, CA: Morgan Kaufmann.
  31. Linguistic Data Consortium (LDC). (2011). http://www.ldc.upenn.edu/.
  32. Maamouri, M., et al. (2009) LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01.
  33. Martin, A., & Van der Goot, E. (2009). Near real time information mining in multilingual news. In Proceedings of the 18th international World Wide Web conference (WWW’2009), Madrid, 20–24 April 2009 (pp. 1153–1154). New York: ACM.
  34. Natural Language Engineering Lab. http://users.dsic.upv.es/grupos/nle/?file=kop4.php.
  35. Pouliquen, B., Steinberger, R., & Best, C. (2007). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in Natural language processing (RANLP’2007), 27–29 September 2007 (pp. 25–32). Borovets, Bulgaria.
  36. Sawalha, M., & Atwell, E. (2008). Comparative evaluation of Arabic language morphological analysers and stemmers. In Proceedings of COLING 2008 22nd international conference on Computational linguistics.
  37. Shaalan, K. F., & Raza, H. (2008). Arabic Named Entity Recognition from diverse text types. In Advances in natural language processing, 6th international conference, GoTAL 2008, Gothenburg, Sweden, August 25–27, 2008, Proceedings. Lecture Notes in Computer Science 5221 (pp. 440–451). Berlin: Springer, ISBN 978-3-540-85286-5.
  38. Silberztein, M. (2002). NOOJ: A cooperative object oriented architecture for NLP. In 5th INTEX Workshop, May 2002, Marseille, France.
  39. Steinberger, R., Pouliquen, B., & Ignat, C. (2008). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, P. Domenico, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam: John Benjamins Publishers.
  40. The Arab League Educational, Cultural and Scientific Organization (Alecso). (2007). http://www.alecso.org.tn/.
  41. The World Wide Web Consortium (W3C). (2009). http://www.w3.org/.
  42. Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2), 67–88.

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Harith Al-Jumaily (1)
  • Paloma Martínez (1)
  • José L. Martínez-Fernández (2)
  • Erik Van der Goot (3)
  1. Computer Science Department, Carlos III University of Madrid, Leganés, Madrid, Spain
  2. DAEDALUS – Data, Decisions and Language S.A., Madrid, Spain
  3. EC Joint Research Centre, Ispra, Italy