A real time Named Entity Recognition system for Arabic text mining

Al-Jumaily, Harith; Martínez, Paloma; Martínez-Fernández, José L.; Van der Goot, Erik

doi:10.1007/s10579-011-9146-z

Original Paper
Published: 01 May 2011

Volume 46, pages 543–563, (2012)

Language Resources and Evaluation

Harith Al-Jumaily¹,
Paloma Martínez¹,
José L. Martínez-Fernández² &
Erik Van der Goot³

Abstract

Arabic is the most widely spoken language in the Arab World. Most people in the Islamic World understand Classical Arabic because it is the language of the Qur’an. Although the number of Arabic-speaking Internet users (in the Middle East and in North and East Africa) has increased considerably over the last decade, systems that automatically analyze Arabic digital resources are not as readily available as they are for English. This work therefore attempts to build a real time Named Entity Recognition system that can be used in web applications to detect specific named entities and events in news written in Arabic. Because Arabic is a highly inflectional language, we try to minimize the impact of Arabic affixes on the quality of the pattern recognition model applied to identify named entities. These patterns are built by processing and integrating different gazetteers, from DBPedia (http://dbpedia.org/About, 2009) to GATE (A general architecture for text engineering, 2009) and ANERGazet (http://users.dsic.upv.es/grupos/nle/?file=kop4.php).
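
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of affix-tolerant gazetteer lookup, not the authors' pattern model. The prefix list, the function names `candidate_forms` and `tag_tokens`, and the toy gazetteer are all assumptions made for illustration.

```python
# Illustrative clitic prefixes (conjunctions, prepositions, definite article).
# This list is an assumption for the sketch, not the paper's inventory.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و", "ف", "ب", "ك", "ل")

def candidate_forms(token):
    """Yield the token, then each form with one listed prefix removed."""
    yield token
    for prefix in PREFIXES:
        remainder = token[len(prefix):]
        # Keep at least two letters so short words are not stripped to nothing.
        if token.startswith(prefix) and len(remainder) >= 2:
            yield remainder

def tag_tokens(tokens, gazetteer):
    """Label each token with the type of the first candidate form found."""
    tagged = []
    for token in tokens:
        label = next(
            (gazetteer[form] for form in candidate_forms(token) if form in gazetteer),
            "O",  # "O" marks a token outside any named entity
        )
        tagged.append((token, label))
    return tagged

# Toy gazetteer: Madrid and Cairo as locations (hypothetical entries).
GAZETTEER = {"مدريد": "LOC", "القاهرة": "LOC"}
```

Under these assumptions, `tag_tokens(["ومدريد", "بالقاهرة", "في"], GAZETTEER)` labels the first two tokens ("and Madrid", "in Cairo") as `LOC` and leaves the preposition `في` untagged. A real system would need a far more careful treatment of prefixes, suffixes, and ambiguity than this single-prefix lookup.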

Notes

In this paper the Arabic words are presented using the HSB transliteration schema (Habash et al. 2007) as follows: ء ', آ Ā, أ Â, ؤ ŵ, إ Ǎ, ئ ŷ, ا A, ب b, ة ħ, ت t, ث θ, ج j, ح H, خ x, د d, ذ ð, ر r, ز z, س s, ش š, ص S, ض D, ط T, ظ Ď, ع ς, غ γ, ف f, ق q, ك k, ل l, م m, ن n, ه h, و w, ى ý, ي y, َ a, ُ u, ِ i, ً ã, ٌ ũ, ٍ ĩ, ّ ~, ْ ., ـ _.
ACE 2004, 2005 Multilingual Training Corpus http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T09, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06.
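
The HSB schema listed in the first note is a character-for-character mapping, so it can be sketched as a lookup table. The table below copies the pairs from that note; the names `HSB_MAP` and `to_hsb` are hypothetical, and passing unmapped characters through unchanged is an assumption of this sketch rather than part of the schema.

```python
# HSB transliteration pairs as listed in the note (Habash et al. 2007).
HSB_MAP = {
    "ء": "'", "آ": "Ā", "أ": "Â", "ؤ": "ŵ", "إ": "Ǎ", "ئ": "ŷ",
    "ا": "A", "ب": "b", "ة": "ħ", "ت": "t", "ث": "θ", "ج": "j",
    "ح": "H", "خ": "x", "د": "d", "ذ": "ð", "ر": "r", "ز": "z",
    "س": "s", "ش": "š", "ص": "S", "ض": "D", "ط": "T", "ظ": "Ď",
    "ع": "ς", "غ": "γ", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ى": "ý", "ي": "y",
    # Diacritics and tatweel are written as escapes so they stay visible.
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u064B": "ã",  # fathatan
    "\u064C": "ũ",  # dammatan
    "\u064D": "ĩ",  # kasratan
    "\u0651": "~",  # shadda
    "\u0652": ".",  # sukun
    "\u0640": "_",  # tatweel
}

def to_hsb(text: str) -> str:
    """Transliterate Arabic script to HSB, one character at a time.

    Characters without an entry in HSB_MAP (spaces, digits, Latin
    letters) are passed through unchanged; that is this sketch's choice.
    """
    return "".join(HSB_MAP.get(ch, ch) for ch in text)
```

For example, `to_hsb("كتاب")` gives `ktAb` and `to_hsb("مدرسة")` gives `mdrsħ`.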

References

Abuleil, S. (2004). Extracting names from Arabic text for question-answering systems. RIAO’04, Proceedings of the 7th international conference on Coupling approaches, coupling media, and coupling languages for information retrieval, April 26–28, 2004 (pp. 638–647). France: University of Avignon (Vaucluse).
Afify, M., Sarikaya, R., Kuo, H.-K. J., Besacier, L., & Gao, Y. (2006). On the use of morphological analysis for dialectal Arabic speech recognition. Interspeech-2006, Pittsburgh, PA, September 2006.
Alqatta Alsaqly, I. (1999). Building nouns, verbs, and Gerunds, reviewed by Dr. Ahmed Mohamed Abdel-Dayem. Egypt: Dar Al Kutub.
Al-Sughaiyer, I., & Al-Kharashi, I. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189–213.
Al-Zoghby, A., Eldin, A. S., Ismail, N. A., & Hamza, T. (2007). Mining Arabic text using soft-matching association rules. In International conference on Computer engineering & systems, 2007. ICCES ’07 (pp. 421–426). November 2007.
Attia, M. (2008). Handling Arabic morphological and syntactic ambiguities within the LFG framework with a view to machine translation. PhD Dissertation, University of Manchester.
Benajiba, Y. (2009). Arabic Named Entity Recognition. PhD dissertation, Universidad Politécnica de Valencia.
Benajiba, Y., Diab, M., & Rosso, P. (2008). Arabic Named Entity Recognition using optimized feature sets. In Proceedings of international conference on Empirical methods in natural language processing, EMNLP-2008, October 2008 (pp. 284–293). Honolulu: Waikiki.
Benajiba, Y., Diab, M. T., & Rosso, P. (2009). Arabic Named Entity Recognition: A feature-driven study. IEEE Transactions on Audio, Speech & Language Processing, 17(5), 926–934.
Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic Named Entity Recognition system based on maximum entropy. Computational linguistics and intelligent text processing, 8th international conference, February 18–24, 2007 (pp. 143–153). Mexico City: CICLing.
Benajiba, Y., Zitouni, I., Diab, M., & Rosso, P. (2010). Arabic Named Entity Recognition: Using features extracted from noisy data. In Proceedings of ACL 2010, Uppsala, Sweden, July 2010.
Best, C., Steinberger, R., & Halkia, S. (2007). Web mining and intelligence. IPSC Institute for the Protection and the Security of the Citizen. Council of Europe. http://globesec.jrc.ec.europa.eu/publications/brochures/brochures/LB7606422ENC.pdf.
Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. LDC. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02.
DBpedia. (2009). http://dbpedia.org/About.
Diab, M. (2009). Second Generation Tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and base phrase chunking. In MEDAR 2nd international conference on Arabic language resources and Tools. Egypt: Cairo.
El potencial de la Red en árabe. Accessed April 27, 2010, from http://www.elmundo.es.
EMM search engine of the JRC. (2009). http://langtech.jrc.it/.
Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing, 8(4), Article 14.
Freund, Y., Seung, H., Shamir, E., & Tishby, N. (1997). Selective sampling using the Query by Committee algorithm. Machine Learning, 28, 133–168.
GATE. (2009). A general architecture for text engineering. http://gate.ac.uk/.
Goweder, A., Poesio, M., & De Roeck, A. (2004). Broken plural detection for Arabic information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK.
Habash, N. Y. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1), 1–187.
Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A Toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.
Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. In A. van den Bosch & A. Soudi (Eds.), Arabic computational morphology: Knowledge-based and empirical methods. Berlin: Springer.
Halpern, J. (2007). The challenges and pitfalls of Arabic Romanization and Arabization. In Second Workshop on Computational approaches to Arabic Script-based Languages (CAASL2). Stanford: Stanford University.
Internet World Stats, Usage and Population Statistics. http://www.internetworldstats.com/.
Kaye, A. S. (1991). The Hamzat al-Wasl in Contemporary modern standard Arabic. Journal of the American Oriental Society, 111(3), 572–574. http://www.jstor.org/stable/604273.
Khoja, S., & Garside, R. (1999). Stemming Arabic text. Computing department. Lancaster U.K.: Lancaster University. Accessed September 22, 1999, from http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps.
Khoja, S., Garside, R., & Knowles, G. (2001). A tagset for the morphosyntactic tagging of Arabic. In Paper given at the Corpus Linguistics 2001 conference, Lancaster.
Lieberman, H. (Ed.). (2001). Your wish is my command: Programming by example. San Francisco, CA: Morgan Kaufmann.
Linguistic Data Consortium (LDC). (2011). http://www.ldc.upenn.edu/.
Maamouri, M., et al. (2009). LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01.
Martin, A., & Van der Goot, E. (2009). Near real time information mining in multilingual news. In Proceedings of the 18th international World Wide Web conference (WWW’2009), Madrid, 20–24 April 2009 (pp. 1153–1154). New York: ACM.
Natural Language Engineering Lab. http://users.dsic.upv.es/grupos/nle/?file=kop4.php.
Pouliquen, B., Steinberger, R., & Best, C. (2007). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in Natural language processing (RANLP’2007), 27–29 September 2007 (pp. 25–32). Borovets, Bulgaria.
Sawalha, M., & Atwell, E. (2008). Comparative evaluation of Arabic language morphological analysers and stemmers. In Proceedings of COLING 2008 22nd international conference on Computational linguistics.
Shaalan, K. F., & Raza, H. (2008). Arabic Named Entity Recognition from diverse text types. In Advances in natural language processing, 6th international conference, GoTAL 2008, Gothenburg, Sweden, August 25–27, 2008, Proceedings. Lecture Notes in Computer Science 5221 (pp. 440–451). Berlin: Springer, ISBN 978-3-540-85286-5.
Silberztein, M. (2002). NOOJ: A cooperative object oriented architecture for NLP. In 5th INTEX Workshop, May 2002, Marseille, France.
Steinberger, R., Pouliquen, B., & Ignat, C. (2008). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, P. Domenico, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam: John Benjamins Publishers.
The Arab League Educational, Cultural and Scientific Organization (Alecso). (2007). http://www.alecso.org.tn/.
The World Wide Web Consortium (W3C). (2009). http://www.w3.org/.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2), 67–88.

Acknowledgments

This work has been partially supported by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade), through the BUSCAMEDIA Project (CEN-20091026), and also by the Spanish research projects MA2VICMR: Improving the access, analysis and visibility of the multilingual and multimedia information in web for the Region of Madrid (S2009/TIC-1542), and MULTIMEDICA: Multilingual Information Extraction in Health domain and application to scientific and informative documents (TIN2010-20644-C03-01). The authors would also like to thank the IPSC of the European Commission’s Joint Research Centre for allowing us to include the EMM search engine in our system.

Author information

Authors and Affiliations

Computer Science Department, Carlos III University of Madrid, Av. Universidad 30, 28911, Leganés, Madrid, Spain
Harith Al-Jumaily & Paloma Martínez
DAEDALUS – Data, Decisions and Language S.A., Avda. de la Albufera, 321, 28031, Madrid, Spain
José L. Martínez-Fernández
EC Joint Research Centre, Via E. Fermi, 27549, Ispra, Italy
Erik Van der Goot

Corresponding author

Correspondence to Harith Al-Jumaily.

About this article

Cite this article

Al-Jumaily, H., Martínez, P., Martínez-Fernández, J.L. et al. A real time Named Entity Recognition system for Arabic text mining. Lang Resources & Evaluation 46, 543–563 (2012). https://doi.org/10.1007/s10579-011-9146-z

Published: 01 May 2011
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10579-011-9146-z
