A real time Named Entity Recognition system for Arabic text mining

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

Arabic is the most widely spoken language in the Arab World, and most people in the Islamic World understand Classical Arabic because it is the language of the Qur’an. Although the number of Arabic-speaking Internet users (in the Middle East and in North and East Africa) has increased considerably over the last decade, systems that automatically analyze Arabic digital resources are not as readily available as they are for English. This work therefore attempts to build a real-time Named Entity Recognition system that can be used in web applications to detect the appearance of specific named entities and events in news written in Arabic. Because Arabic is a highly inflectional language, we try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. The patterns are built by processing and integrating different gazetteers, from DBPedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php).
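
The general idea of matching tokens against a gazetteer while tolerating attached affixes can be illustrated with a minimal sketch. This is not the authors' implementation: the prefix list, the two gazetteer entries, and the function names `candidate_forms` and `tag_entities` are all assumptions made for illustration, and only single-token names and prefixes (not suffixes) are handled.

```python
# Hypothetical sketch: gazetteer lookup that tolerates common Arabic prefixes.
# PREFIXES and GAZETTEER are illustrative assumptions, not data from the paper.

# Common proclitics (conjunctions, prepositions, definite article),
# listed longest first so compound prefixes are tried before their parts.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و", "ف", "ب", "ك", "ل")

# Toy gazetteer mapping a surface form to an entity type.
GAZETTEER = {
    "بغداد": "LOCATION",
    "مصر": "LOCATION",
}


def candidate_forms(token):
    """Yield the token itself, then each form with one known prefix removed."""
    yield token
    for prefix in PREFIXES:
        # Require at least two remaining characters to limit over-stripping.
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            yield token[len(prefix):]


def tag_entities(text):
    """Return (token_index, matched_form, entity_type) for each gazetteer hit."""
    found = []
    for index, token in enumerate(text.split()):
        for form in candidate_forms(token):
            if form in GAZETTEER:
                found.append((index, form, GAZETTEER[form]))
                break  # keep the first match for this token
    return found
```

In this sketch a token such as وبغداد (the conjunction و attached to the name) still matches the entry for بغداد. A real system would also need to handle suffixes, multi-word names, and ambiguity between names and ordinary words.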

Notes

  1. In this paper, Arabic words are presented using the HSB transliteration scheme (Habash et al. 2007), as follows: ء ', آ Ā, أ Â, ؤ ŵ, إ Ǎ, ئ ŷ, ا A, ب b, ة ħ, ت t, ث θ, ج j, ح H, خ x, د d, ذ ð, ر r, ز z, س s, ش š, ص S, ض D, ط T, ظ Ď, ع ς, غ γ, ف f, ق q, ك k, ل l, م m, ن n, ه h, و w, ى ý, ي y, َ a, ُ u, ِ i, ً ã, ٌ ũ, ٍ ĩ, ّ ~, ْ ., ـ _.

  2. ACE 2004, 2005 Multilingual Training Corpus http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T09, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06.

References

  • Abuleil, S. (2004). Extracting names from Arabic text for question-answering systems. RIAO’04, Proceedings of the 7th international conference on Coupling approaches, coupling media, and coupling languages for information retrieval, April 26–28, 2004 (pp. 638–647). France: University of Avignon (Vaucluse).

  • Afify, M., Sarikaya, R., Kuo, H.-K. J., Besacier, L., & Gao, Y. (2006). On the use of morphological analysis for dialectal Arabic speech recognition. Interspeech-2006, Pittsburgh, PA, September 2006.

  • Alqatta Alsaqly, I. (1999). Building nouns, verbs, and Gerunds, reviewed by Dr. Ahmed Mohamed Abdel-Dayem. Egypt: Dar Al Kutub.

  • Al-Sughaiyer, I., & Al-Kharashi, I. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189–213.

  • Al-Zoghby, A., Eldin, A. S., Ismail, N. A., & Hamza, T. (2007). Mining Arabic text using soft-matching association rules. In International conference on Computer engineering & systems, 2007. ICCES ’07 (pp. 421–426). November 2007.

  • Attia, M. (2008). Handling Arabic morphological and syntactic ambiguities within the LFG framework with a view to machine translation. PhD Dissertation, University of Manchester.

  • Benajiba, Y. (2009). Arabic Named Entity Recognition. PhD dissertation, Universidad Politécnica de Valencia.

  • Benajiba, Y., Diab, M., & Rosso, P. (2008). Arabic Named Entity Recognition using optimized feature sets. In Proceedings of international conference on Empirical methods in natural language processing, EMNLP-2008, October 2008 (pp. 284–293). Honolulu: Waikiki.

  • Benajiba, Y., Diab, M. T., & Rosso, P. (2009). Arabic Named Entity Recognition: A feature-driven study. IEEE Transactions on Audio, Speech & Language Processing, 17(5), 926–934.

  • Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic Named Entity Recognition system based on maximum entropy. Computational linguistics and intelligent text processing, 8th international conference, February 18–24, 2007 (pp. 143–153). Mexico City: CICLing.

  • Benajiba, Y., Zitouni, I., Diab, M., & Rosso, P. (2010). Arabic Named Entity Recognition: Using features extracted from noisy data. In Proceedings of ACL 2010, Uppsala, Sweden, July 2010.

  • Best, C., Steinberger, R., & Halkia, S. (2007). Web mining and intelligence. IPSC Institute for the Protection and the Security of the Citizen. Council of Europe. http://globesec.jrc.ec.europa.eu/publications/brochures/brochures/LB7606422ENC.pdf.

  • Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02.

  • DBpedia. (2009). http://dbpedia.org/About.

  • Diab, M. (2009). Second Generation Tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and base phrase chunking. In MEDAR 2nd international conference on Arabic language resources and Tools. Egypt: Cairo.

  • El potencial de la Red en árabe. Accessed April 27, 2010, from http://www.elmundo.es.

  • EMM search engine of the JRC. (2009). http://langtech.jrc.it/.

  • Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing, 8(4), Article 14.

  • Freund, Y., Seung, H., Shamir, E., & Tishby, N. (1997). Selective sampling using the Query by Committee algorithm. Machine Learning, 28, 133–168.

  • GATE. (2009). A general architecture for text engineering. http://gate.ac.uk/.

  • Goweder, A., Poesio, M., & De Roeck, A. (2004). Broken plural detection for Arabic information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK.

  • Habash, N. Y. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1), 1–187.

  • Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A Toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.

  • Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. In A. van den Bosch & A. Soudi (Eds.), Arabic computational morphology: Knowledge-based and empirical methods. Berlin: Springer.

  • Halpern, J. (2007). The challenges and pitfalls of Arabic Romanization and Arabization. In Second Workshop on Computational approaches to Arabic Script-based Languages (CAASL2). Stanford: Stanford University.

  • Internet World Stats, Usage and Population Statistics. http://www.internetworldstats.com/.

  • Kaye, A. S. (1991). The Hamzat al-Wasl in Contemporary modern standard Arabic. Journal of the American Oriental Society, 111(3), 572–574. http://www.jstor.org/stable/604273.

  • Khoja, S., & Garside, R. (1999). Stemming Arabic text. Computing department. Lancaster U.K.: Lancaster University. Accessed September 22, 1999, from http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps.

  • Khoja, S., Garside, R., & Knowles, G. (2001). A tagset for the morphosyntactic tagging of Arabic. In Paper given at the Corpus Linguistics 2001 conference, Lancaster.

  • Lieberman, H. (Ed.). (2001). Your wish is my command: Programming by example. San Francisco, CA: Morgan Kaufmann.

  • Linguistic Data Consortium (LDC). (2011). http://www.ldc.upenn.edu/.

  • Maamouri, M., et al. (2009). LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01.

  • Martin, A., & Van der Goot, E. (2009). Near real time information mining in multilingual news. In Proceedings of the 18th international World Wide Web conference (WWW’2009), Madrid, 20–24 April 2009 (pp. 1153–1154). New York: ACM.

  • Natural Language Engineering Lab. http://users.dsic.upv.es/grupos/nle/?file=kop4.php.

  • Pouliquen, B., Steinberger, R., & Best, C. (2007). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in Natural language processing (RANLP’2007), 27–29 September 2007 (pp. 25–32). Borovets, Bulgaria.

  • Sawalha, M., & Atwell, E. (2008). Comparative evaluation of Arabic language morphological analysers and stemmers. In Proceedings of COLING 2008 22nd international conference on Computational linguistics.

  • Shaalan, K. F., & Raza, H. (2008). Arabic Named Entity Recognition from diverse text types. In Advances in natural language processing, 6th international conference, GoTAL 2008, Gothenburg, Sweden, August 25–27, 2008, Proceedings. Lecture Notes in Computer Science 5221 (pp. 440–451). Berlin: Springer, ISBN 978-3-540-85286-5.

  • Silberztein, M. (2002). NOOJ: A cooperative object oriented architecture for NLP. In 5th INTEX Workshop, May 2002, Marseille, France.

  • Steinberger, R., Pouliquen, B., & Ignat, C. (2008). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, P. Domenico, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam: John Benjamins Publishers.

  • The Arab League Educational, Cultural and Scientific Organization (Alecso). (2007). http://www.alecso.org.tn/.

  • The World Wide Web Consortium (W3C). (2009). http://www.w3.org/.

  • Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2), 67–88.

Acknowledgments

This work has been partially supported by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade) through the BUSCAMEDIA Project (CEN-20091026), and by the Spanish research projects MA2VICMR: Improving the access, analysis and visibility of the multilingual and multimedia information in web for the Region of Madrid (S2009/TIC-1542) and MULTIMEDICA: Multilingual Information Extraction in Health domain and application to scientific and informative documents (TIN2010-20644-C03-01). The authors would also like to thank the IPSC of the European Commission’s Joint Research Centre for allowing the EMM search engine to be included in the system.

Author information

Corresponding author

Correspondence to Harith Al-Jumaily.

About this article

Cite this article

Al-Jumaily, H., Martínez, P., Martínez-Fernández, J.L. et al. A real time Named Entity Recognition system for Arabic text mining. Lang Resources & Evaluation 46, 543–563 (2012). https://doi.org/10.1007/s10579-011-9146-z

  • DOI: https://doi.org/10.1007/s10579-011-9146-z
