A real time Named Entity Recognition system for Arabic text mining
Arabic is the most widely spoken language in the Arab World, and most people in the Islamic World understand Classical Arabic because it is the language of the Qur’an. Although the number of Arabic-speaking Internet users (in the Middle East and in North and East Africa) has increased considerably over the last decade, systems that automatically analyze Arabic digital resources are not as readily available as they are for English. In this work we therefore attempt to build a real-time Named Entity Recognition system that can be used in web applications to detect specific named entities and events in news written in Arabic. Because Arabic is a highly inflectional language, we try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. The patterns are built by processing and integrating different gazetteers, from DBpedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php).
Keywords: Arabic language; Text mining; Named Entity Recognition; Event detection; Morphological analysis; Root extraction
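The idea of matching gazetteer entries while reducing the effect of Arabic affixes can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny gazetteer, the prefix list, and the function names `candidate_forms` and `tag_entities` are all assumptions made for illustration, and simple prefix stripping stands in for the morphological analysis and root extraction that the keywords mention.

```python
# Hypothetical mini-gazetteer; real entries would come from resources such as
# DBpedia, GATE and ANERGazet. The entries below are illustrative only.
GAZETTEER = {
    "القاهرة": "LOCATION",  # Cairo (stored with the definite article)
    "مصر": "LOCATION",      # Egypt
    "محمد": "PERSON",       # Mohamed
}

# Assumed list of common attached prefixes (conjunctions, prepositions,
# definite article), longest first. Not taken from the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ل", "ك"]


def candidate_forms(token):
    """Yield the token itself, then variants with one leading prefix removed."""
    yield token
    for prefix in PREFIXES:
        # Keep at least two characters so very short words are not over-stripped.
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            yield token[len(prefix):]


def tag_entities(text):
    """Return (surface token, matched gazetteer form, label) for each hit."""
    hits = []
    for token in text.split():
        for form in candidate_forms(token):
            label = GAZETTEER.get(form)
            if label is not None:
                hits.append((token, form, label))
                break  # stop at the first matching form for this token
    return hits


if __name__ == "__main__":
    # "Mohamed", "and Egypt", "in Cairo" written as three whitespace-separated tokens.
    for hit in tag_entities("محمد ومصر بالقاهرة"):
        print(hit)
```

Trying the unmodified token first keeps exact gazetteer matches cheap, and the prefixed variants are only consulted when the surface form is not found. A production system would also need to handle suffixes, orthographic variants and multi-word entities, which this sketch ignores.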
This work has been partially supported by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade) through the BUSCAMEDIA Project (CEN-20091026), and by the Spanish research projects MA2VICMR: Improving the access, analysis and visibility of the multilingual and multimedia information in web for the Region of Madrid (S2009/TIC-1542), and MULTIMEDICA: Multilingual Information Extraction in Health domain and application to scientific and informative documents (TIN2010-20644-C03-01). The authors would also like to thank the IPSC of the European Commission’s Joint Research Centre for allowing us to include the EMM search engine in our system.
- Abuleil, S. (2004). Extracting names from Arabic text for question-answering systems. RIAO’04, Proceedings of the 7th international conference on Coupling approaches, coupling media and coupling languages for information retrieval, April 26–28, 2004 (pp. 638–647). France: University of Avignon (Vaucluse).
- Afify, M., Sarikaya, R., Kuo, H.-K. J., Besacier, L., & Gao, Y. (2006). On the use of morphological analysis for dialectal Arabic speech recognition. Interspeech-2006, Pittsburgh PA, September 2006.
- Alqatta Alsaqly, I. (1999). Building nouns, verbs, and Gerunds, reviewed by Dr. Ahmed Mohamed Abdel-Dayem. Egypt: Dar Al Kutub.
- Al-Zoghby, A., Eldin, A. S., Ismail, N. A., & Hamza, T. (2007). Mining Arabic text using soft-matching association rules. In International conference on Computer engineering & systems, 2007. ICCES ’07 (pp. 421–426). November 2007.
- Attia, M. (2008). Handling Arabic morphological and syntactic ambiguities within the LFG framework with a view to machine translation. PhD Dissertation, University of Manchester.
- Benajiba, Y. (2009). Arabic Named Entity Recognition. PhD dissertation, Universidad Politécnica de Valencia.
- Benajiba, Y., Diab, M., & Rosso, P. (October, 2008). Arabic Named Entity Recognition using Optimized Feature Sets. In Proceedings of international conference on Empirical methods in natural language processing, EMNLP-2008 (pp. 284–293). Honolulu: Waikiki.
- Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic Named Entity Recognition system based on maximum entropy. Computational linguistics and intelligent text processing, 8th international conference, February 18–24, 2007 (pp. 143–153). Mexico City: CICLing.
- Benajiba, Y., Zitouni, I., Diab, M., & Rosso, P. (2010). Arabic Named Entity Recognition: Using features extracted from noisy data. In Proceedings of ACL 2010, Uppsala, Sweden, July 2010.
- Best, C., Steinberger, R., & Halkia, S. (2007). Web mining and intelligence. IPSC Institute for the Protection and the Security of the Citizen. Council of Europe. http://globesec.jrc.ec.europa.eu/publications/brochures/brochures/LB7606422ENC.pdf.
- Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02.
- DBpedia. (2009). http://dbpedia.org/About.
- Diab, M. (2009). Second Generation Tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and base phrase chunking. In MEDAR 2nd international conference on Arabic language resources and Tools. Egypt: Cairo.
- El potencial de la Red en árabe. Accessed April 27, 2010, from http://www.elmundo.es.
- EMM search engine of the JRC. (2009). http://langtech.jrc.it/.
- Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing, 8(4), Article 14.
- GATE. (2009). A general architecture for text engineering. http://gate.ac.uk/.
- Goweder, A., Poesio, M., & De Roeck, A. (2004). Broken plural detection for Arabic Information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK.
- Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A Toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.
- Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. In A. van den Bosch & A. Soudi (Eds.), Arabic computational morphology: Knowledge-based and empirical methods. Berlin: Springer.
- Halpern, J. (2007). The challenges and pitfalls of Arabic Romanization and Arabization. In Second Workshop on Computational approaches to Arabic Script-based Languages (CAASL2). Stanford: Stanford University.
- Internet World Stats, Usage and Population Statistics. http://www.internetworldstats.com/.
- Khoja, S., & Garside, R. (1999). Stemming Arabic text. Computing department. Lancaster U.K.: Lancaster University. Accessed September 22, 1999, from http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps.
- Khoja, S., Garside, R., & Knowles, G. (2001). A tagset for the morphosyntactic tagging of Arabic. In Paper given at the Corpus Linguistics 2001 conference, Lancaster.
- Lieberman, H. (Ed.). (2001). Your wish is my command: Programming by example. San Francisco, CA: Morgan Kaufmann.
- Linguistic Data Consortium (LDC). (2011). http://www.ldc.upenn.edu/.
- Maamouri, M., et al. (2009). LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01.
- Martin, A., & Van der Goot, E. (2009). Near real time information mining in multilingual news. In Proceedings of the 18th international World Wide Web conference (WWW’2009), Madrid, 20–24 April 2009 (pp. 1153–1154). New York: ACM.
- Natural Language Engineering Lab. http://users.dsic.upv.es/grupos/nle/?file=kop4.php.
- Pouliquen, B., Steinberger, R., & Best, C. (2007). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in Natural language processing (RANLP’2007), 27–29 September 2007 (pp. 25–32). Borovets, Bulgaria.
- Sawalha, M., & Atwell, E. (2008). Comparative evaluation of Arabic language morphological analysers and stemmers. In Proceedings of COLING 2008 22nd international conference on Computational linguistics.
- Shaalan, K. F., & Raza, H. (2008). Arabic Named Entity Recognition from diverse text types. In Advances in natural language processing, 6th international conference, GoTAL 2008, Gothenburg, Sweden, August 25–27, 2008, Proceedings. Lecture Notes in Computer Science 5221 (pp. 440–451). Berlin: Springer, ISBN 978-3-540-85286-5.
- Silberztein, M. (2002). NOOJ: A cooperative object oriented architecture for NLP. In 5th INTEX Workshop, May 2002, Marseille, France.
- Steinberger, R., Pouliquen, B., & Ignat, C. (2008). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, P. Domenico, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam: John Benjamins Publishers.
- The Arab League Educational, Cultural and Scientific Organization (Alecso). (2007). http://www.alecso.org.tn/.
- The World Wide Web Consortium (W3C). (2009). http://www.w3.org/.