Named Entity Recognition

  • Chapter
Natural Language Processing of Semitic Languages

Abstract

Named entity recognition (NER) is the problem of locating and categorizing important nouns and proper nouns in a text. In this chapter, we review the general state of research on entity recognition, the relevant challenges, and current state-of-the-art work on named entity recognition for Semitic languages. Specifically, we look into two case studies, for Arabic and Hebrew. We also review Semitic NLP tasks that overlap with named entity recognition. We close with an overview of the available resources for Semitic named entity recognition and some of the open research questions.

Notes

  1. Ratinov and Roth [59] have shown that, with a small linear expansion of the parameters, the BILOU representation results in better NER performance.

  2. Temporal and numerical expressions are other examples of named entities that are not proper nouns.

  3. Gazetteer is a term that is commonly used to refer to a domain-specific lexicon. For example, there are gazetteers for country and city names.

  4. The example is borrowed from [60].

  5. Two well-explained uses of the above HMM framework can be found in [37, 48].

  6. http://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html

  7. The term named entity was first introduced at MUC-6 [54].

  8. The 2002 shared task was conducted on Dutch and Spanish [67]. The 2003 shared task was conducted on English and German [68].

  9. Per the CoNLL definition, any named entity that does not belong to the person, location, or organization classes is considered to be MISC.

  10. See [43] and [45] for an overview of biomedical NER.

  11. Arabic samples are shown using the Buckwalter romanization [18], and Hebrew samples are shown using the romanization scheme in [40].

  12. Capitalization is not used consistently among Latin-scripted languages. Capitalization typically applies to proper nouns in English, to all nouns in German, and to any important noun in Italian.

  13. For example, foreign person names such as Hayato (Japanese) or Tahvo (Finnish) can be mapped to different Arabic spellings.

  14. Other relevant works on Arabic NER: [25, 47, 50, 62].

  15. Published in Hebrew.

  16. For more information about Arabic transliteration, see [29].

  17. See [42] for more information about the Arabic ACE dataset.

  18. The fourth release of OntoNotes includes named entity annotation for a corpus of 300,000 words.

  19. Currently at http://www1.ccls.columbia.edu/~ybenajiba/downloads.html

  20. The work presented in [55, 64] also reports large-scale annotation of named entity information. However, the datasets were not released publicly.

  21. According to Wikipedia statistics, Amharic Wikipedia has more than 10,000 articles, which is a promising resource for gazetteer construction.
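
As a rough illustration of the BILOU representation mentioned in Note 1, the sketch below converts entity spans into per-token tags under both the BIO scheme and the BILOU scheme (Begin, Inside, Last, Outside, Unit-length). The function name `encode_tags`, the span format, and the sample sentence are assumptions made for this illustration; they are not taken from the chapter or from [59].

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def encode_tags(num_tokens: int, spans: List[Span], scheme: str = "BILOU") -> List[str]:
    """Turn non-overlapping entity spans into one tag per token."""
    if scheme not in ("BIO", "BILOU"):
        raise ValueError("scheme must be 'BIO' or 'BILOU'")
    tags = ["O"] * num_tokens  # every token starts outside any entity
    for start, end, label in spans:
        length = end - start
        if length <= 0:
            continue  # skip empty spans
        if scheme == "BILOU" and length == 1:
            tags[start] = f"U-{label}"  # single-token entity
            continue
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens
        if scheme == "BILOU":
            tags[end - 1] = f"L-{label}"  # last token gets its own tag
    return tags


# Invented example sentence with one person name and one location.
tokens = ["Noam", "Chomsky", "visited", "Cairo"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
bio = encode_tags(len(tokens), spans, scheme="BIO")
bilou = encode_tags(len(tokens), spans, scheme="BILOU")
print(list(zip(tokens, bio)))
print(list(zip(tokens, bilou)))
```

BILOU uses four tags per entity type plus O, where BIO uses two plus O, so the label set grows only linearly with the number of entity types. This is one plausible reading of the "small linear expansion of the parameters" in Note 1.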

References

  1. Abdul-Hamid A, Darwish K (2010) Simplified feature set for Arabic named entity recognition. In: Proceedings of the 2010 named entities workshop. Association for Computational Linguistics, Uppsala, pp 110–115

  2. Al-Onaizan Y, Knight K (2002a) Machine transliteration of names in Arabic texts. In: Proceedings of the ACL-02 workshop on computational approaches to Semitic languages, Philadelphia. Association for Computational Linguistics

  3. Al-Onaizan Y, Knight K (2002b) Translating named entities using monolingual and bilingual resources. In: Proceedings of 40th annual meeting of the Association for Computational Linguistics, Philadelphia. Association for Computational Linguistics, pp 400–408

  4. Alotaibi F, Lee M (2012) Mapping Arabic Wikipedia into the named entities taxonomy. In: Proceedings of COLING 2012: posters, Mumbai. The COLING 2012 Organizing Committee, pp 43–52

  5. Attia M, Toral A, Tounsi L, Monachini M, van Genabith J (2010) An automatically built named entity lexicon for Arabic. In: Calzolari N, Choukri K, Maegaard B, Mariani J, Odijk J, Piperidis S, Rosner M, Tapias D (eds) Proceedings of the seventh conference on international language resources and evaluation (LREC’10), Valletta. European Language Resources Association (ELRA)

  6. Azab M, Bouamor H, Mohit B, Oflazer K (2013) Dudley North visits North London: learning when to transliterate to Arabic. In: Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT 2013), Atlanta. Association for Computational Linguistics

  7. Babych B, Hartley A (2003) Improving machine translation quality with automatic named entity recognition. In: Proceedings of the 7th international EAMT workshop on MT and other language technology tools, EAMT ’03, Dublin

  8. Balasuriya D, Ringland N, Nothman J, Murphy T, Curran JR (2009) Named entity recognition in Wikipedia. In: Proceedings of the 2009 workshop on the people’s web meets NLP: collaboratively constructed Semantic resources. Association for Computational Linguistics, Suntec, pp 10–18

  9. Benajiba Y, Zitouni I (2009) Morphology-based segmentation combination for Arabic mention detection. ACM Trans Asian Lang Inf Process (TALIP) 8:16:1–16:18

  10. Benajiba Y, Zitouni I (2010) Enhancing mention detection using projection via aligned corpora. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Cambridge. Association for Computational Linguistics, pp 993–1001

  11. Benajiba Y, Rosso P, Benedí Ruiz JM (2007) ANERsys: an Arabic named entity recognition system based on maximum entropy. In: Gelbukh A (ed) Proceedings of CICLing, Mexico City. Springer, pp 143–153

  12. Benajiba Y, Diab M, Rosso P (2008) Arabic named entity recognition using optimized feature sets. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 284–293

  13. Bender O, Och FJ, Ney H (2003) Maximum entropy models for named entity recognition. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 148–151

  14. Ben Mordecai N, Elhadad M (2005) Hebrew named entity recognition. Master’s thesis, Department of Computer Science, Ben Gurion University of the Negev

  15. Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22:39–71

  16. Bikel D, Miller S, Schwartz R, Weischedel R (1997) Nymble: a high-performance learning name-finder. In: Proceedings of the applied natural language processing, Tzigov Chark

  17. Borthwick A (1999) A maximum entropy approach to named entity recognition. PhD thesis, Computer Science Department, New York University

  18. Buckwalter T (2002) Buckwalter Arabic morphological analyzer version 1.0

  19. Chieu HL, Ng HT (2002) Named entity recognition: a maximum entropy approach using global information. In: Proceedings of the 19th international conference on computational linguistics – vol 1, COLING ’02. Taipei

  20. Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with Perceptron algorithms. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing (EMNLP), Philadelphia, pp 1–8

  21. Daumé III H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the Association of Computational Linguistics, Prague. Association for Computational Linguistics, pp 256–263

  22. Diab M (2009) Second generation tools (AMIRA 2.0): fast and robust tokenization, pos tagging, and base phrase chunking. In: Proceedings of the 2nd international conference on Arabic language resources and tools, Cairo

  23. Doddington G, Mitchell A, Przybocki M, Rambow L, Strassel S, Weischedel R (2004) The automatic content extraction (ACE) program-tasks, data and evaluation. In: Proceedings of LREC 2004, Lisbon, pp 837–840

  24. Farber B, Freitag D, Habash N, Rambow O (2008) Improving NER in Arabic using a morphological tagger. In: Calzolari N, Choukri K, Maegaard B, Mariani J, Odijk J, Piperidis S, Tapias D (eds) Proceedings of the sixth international language resources and evaluation (LREC’08), Marrakech. European Language Resources Association (ELRA), pp 2509–2514

  25. Fehri H, Haddar K, Ben Hamadou A (2011) Recognition and translation of Arabic named entities with NooJ using a new representation model. In: Proceedings of the 9th international workshop on finite state methods and natural language processing, Blois. Association for Computational Linguistics, pp 134–142

  26. Florian R, Ittycheriah A, Jing H, Zhang T (2003) Named entity recognition through classifier combination. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 168–171

  27. Florian R, Hassan H, Ittycheriah A, Jing H, Kambhatla N, Luo X, Nicolov N, Roukos S (2004) A statistical model for multilingual entity detection and tracking. In: Dumais S, Marcu D, Roukos S (eds) Proceedings of the human language technology conference of the North American chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston. Association for Computational Linguistics

  28. Goldberg Y, Elhadad M (2008) Identification of transliterated foreign words in Hebrew script. In: Proceedings of the 9th international conference on Computational linguistics and intelligent text processing, CICLing’08, Haifa, pp 466–477

  29. Habash N, Soudi A, Buckwalter T (2007) On Arabic transliteration. Text Speech Lang Technology 38:15–22

  30. Habash N, Rambow O, Roth R (2009) MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In: Choukri K, Maegaard B (eds) Proceedings of the second international conference on Arabic language resources and tools, the MEDAR consortium, Cairo

  31. Hassan A, Fahmy H, Hassan H (2007) Improving named entity translation by exploiting comparable and parallel corpora. In: Proceedings of the conference on recent advances in natural language processing (RANLP ’07), Borovets

  32. Hermjakob U, Knight K, Daumé III H (2008) Name translation in statistical machine translation – learning when to transliterate. In: Proceedings of ACL-08: HLT, Columbus. Association for Computational Linguistics, pp 389–397

  33. Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2006) OntoNotes: the 90% solution. In: Proceedings of the human language technology conference of the NAACL (HLT-NAACL), New York City. Association for Computational Linguistics, pp 57–60

  34. Huang F, Emami A, Zitouni I (2008) When Harry met Harri: cross-lingual name spelling normalization. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 391–399

  35. Itai A, Wintner S (2008) Language resources for Hebrew. Lang Resour Eval 42:75–98

  36. Jiang J, Zhai C (2006) Exploiting domain structure for named entity recognition. In: Proceedings of the human language technology conference of the NAACL (HLT-NAACL), New York City. Association for Computational Linguistics, pp 74–81

  37. Jurafsky D, Martin JH (2008) Speech and language processing. Pearson Prentice Hall, Upper Saddle River

  38. Karimi S, Scholer F, Turpin A (2011) Machine transliteration survey. ACM Comput Surv 43:17:1–17:46

  39. Khalid MA, Jijkoun V, De Rijke M (2008) The impact of named entity normalization on information retrieval for question answering. In: Proceedings of the IR research, 30th European conference on advances in information retrieval, Glasgow, Springer, pp 705–710

  40. Kirschenbaum A, Wintner S (2009) Lightly supervised transliteration for machine translation. In: Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009), Athens. Association for Computational Linguistics, pp 433–441

  41. Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning, ICML ’01, Williamstown. Morgan Kaufmann, pp 282–289

  42. LDC (2005) ACE (automatic content extraction) Arabic annotation guidelines for entities, version 5.3.3. Linguistic Data Consortium, Philadelphia

  43. Leaman R, Gonzalez G (2008) Banner: an executable survey of advances in biomedical named entity recognition. In: Proceedings of pacific symposium on biocomputing, Kohala Coast, pp 652–663

  44. Lemberski G (2003) Named entity recognition in Hebrew. Master’s thesis, Department of Computer Science, Ben Gurion University

  45. Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief. Bioinform 6:357–369

  46. Maamouri M, Graff D, Bouziri B, Krouna S, Bies A, Kulick S (2010) LDC standard Arabic morphological analyzer (SAMA) version 3.1, LDC2004L02. Linguistic Data Consortium, Philadelphia

  47. Maloney J, Niv M (1998) TAGARAB: a fast, accurate Arabic name recognizer using high precision morphological analysis. In: Proceedings of the workshop on computational approaches to Semitic languages, Montreal

  48. Malouf R (2002) Markov models for language-independent named entity recognition. In: Proceedings of the 6th conference on natural language learning – vol 20, COLING-02, Stroudsburg. Association for Computational Linguistics, pp 1–4

  49. McCallum A, Li W (2003) Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Daelemans W, Osborne M (eds) Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 188–191

  50. Mesfar S (2007) Named entity recognition for Arabic using syntactic grammars. In: Kedad Z, Lammari N, Métais E, Meziane F, Rezgui Y (eds) Natural language processing and information systems. Lecture notes in computer science, vol 4592. Springer, Berlin, pp 305–316

  51. Mikheev A, Moens M, Grover C (1999) Named entity recognition without gazetteers. In: Proceedings of the ninth conference of the European chapter of the Association for Computational Linguistics (EACL-99), Bergen. Association for Computational Linguistics

  52. Miller S, Crystal M, Fox H, Ramshaw L, Schwartz R, Stone R, Weischedel R, The Annotation Group (1998) Algorithms that learn to extract information BBN: description of the sift system as used for MUC-7. In: Proceedings of the seventh message understanding conference (MUC-7), Fairfax

  53. Mohit B, Schneider N, Bhowmick R, Oflazer K, Smith NA (2012) Recall-oriented learning of named entities in Arabic Wikipedia. In: Proceedings of the 13th Conference of the European Chapter of the ACL (EACL 2012), Avignon. Association for Computational Linguistics

  54. Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30:3–26

  55. Nezda L, Hickl A, Lehmann J, Fayyaz S (2006) What in the world is a Shahab? Wide coverage named entity recognition for Arabic. In: Proceedings of LREC, Genoa, pp 41–46

  56. Nothman J, Murphy T, Curran JR (2009) Analysing Wikipedia and gold-standard corpora for NER training. In: Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009), Athens. Association for Computational Linguistics, pp 612–620

  57. Oudah M, Shaalan K (2012) A pipeline Arabic named entity recognition using a hybrid approach. In: Proceedings of COLING 2012, Mumbai. The COLING 2012 Organizing Committee, pp 2159–2176

  58. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286

  59. Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the thirteenth conference on computational natural language learning (CoNLL-2009), Boulder. Association for Computational Linguistics, pp 147–155

  60. Riloff E, Jones R (1999) Learning dictionaries for information extraction by multi-level bootstrapping. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference innovative applications of artificial intelligence, Orlando. American Association for Artificial Intelligence, pp 474–479

  61. Riloff EM, Phillips W (2004) Introduction to the sundance and autoslog systems. Technical report, University of Utah

  62. Samy D, Moreno A, Guirao JM (2005) A proposal for an Arabic named entity tagger leveraging a parallel corpus. In: Proceedings of the conference of the recent advances in natural language processing (RANLP-05), Borovets

  63. Sekine S, Sudo K, Nobata C (2002) Extended named entity hierarchy. In: Proceedings of LREC, Las Palmas

  64. Shaalan K, Raza H (2009) NERA: named entity recognition for Arabic. J Am Soc Inf Sci Technol 60(8):1652–1663

  65. Sintayehu Z (2001) Automatic classification of Amharic news items: the case of Ethiopian news agency. Master’s thesis, School of Information Studies for Africa, Addis Ababa University

  66. Sundheim BM (1995) Named entity task definition, version 2.1. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia

  67. Tjong Kim Sang EF (2002) Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In: Proceedings of the sixth conference on natural language learning (CoNLL-2002), Taipei

  68. Tjong Kim Sang EF, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Daelemans W, Osborne M (eds) Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 142–147

  69. Toral A, Noguera E, Llopis F, Muñoz R (2005) Improving question answering using named entity recognition. In: Natural language processing and information systems, vol 3513/2005. Springer, Berlin/New York, pp 181–191

  70. Walker C, Strassel S, Medero J, Maeda K (2006) ACE 2005 multilingual training corpus. LDC2006T06, Linguistic Data Consortium, Philadelphia

  71. Wu D, Lee WS, Ye N, Chieu HL (2009) Domain adaptive bootstrapping for named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing, Singapore. Association for Computational Linguistics, pp 1523–1532

  72. Yarowsky D, Ngai G, Wicentowski R (2001) Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the first international conference on human language technology research, HLT ’01, Stroudsburg. Association for Computational Linguistics, pp 1–8

  73. Zaghouani W (2012) RENAR: A rule-based Arabic named entity recognition system. ACM Trans Asian Lang Inf Process (TALIP) 11:1–13

  74. Zitouni I, Florian R (2008) Mention detection crossing the language barrier. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 600–609

Acknowledgements

I am grateful to Kemal Oflazer, Houda Bouamor, Emily Alp and two anonymous reviewers for their comments and feedback. This publication was made possible by grant YSREP-1-018-1-004 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Author information

Corresponding author

Correspondence to Behrang Mohit.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Mohit, B. (2014). Named Entity Recognition. In: Zitouni, I. (ed) Natural Language Processing of Semitic Languages. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45358-8_7

  • DOI: https://doi.org/10.1007/978-3-642-45358-8_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45357-1

  • Online ISBN: 978-3-642-45358-8

  • eBook Packages: Computer Science (R0)
