Named Entity Recognition

Mohit, Behrang

doi:10.1007/978-3-642-45358-8_7

Behrang Mohit

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Named entity recognition (NER) is the problem of locating and categorizing important nouns and proper nouns in a text. In this chapter, we review the general state of research on entity recognition, the relevant challenges, and the current state-of-the-art work on named entity recognition for Semitic languages. Specifically, we look into two case studies, for Arabic and Hebrew. We also review Semitic NLP tasks that overlap with named entity recognition. We close with an overview of the available resources for Semitic named entity recognition and some of the open research questions.

Notes

1. Ratinov and Roth [59] have shown that, with a small linear expansion of the parameters, the BILOU representation results in better NER performance.
2. Temporal and numerical expressions are other examples of named entities that are not proper nouns.
3. Gazetteer is a term commonly used to refer to a domain-specific lexicon. For example, there are gazetteers for country and city names.
4. The example is borrowed from [60].
5. Two well-explained uses of the above HMM framework can be found in [37, 48].
6. http://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html
7. The term named entity was first introduced at MUC-6 [54].
8. The 2002 shared task was conducted on Dutch and Spanish [67]. The 2003 shared task was conducted on English and German [68].
9. Per the CoNLL definition, any named entity that does not belong to the person, location or organization classes is considered to be MISC.
10. See [43] and [45] for an overview of biomedical NER.
11. Samples for Arabic are shown using the Buckwalter romanization [18] and samples for Hebrew are shown using the romanization scheme in [40].
12. Capitalization is not used consistently among Latin-scripted languages. Capitalization typically applies to proper nouns in English, to all nouns in German, and to any important noun in Italian.
13. For example, foreign person names such as Hayato (Japanese) or Tahvo (Finnish) can be mapped to different Arabic spellings.
14. Other relevant works on Arabic NER: [25, 47, 50, 62].
15. Published in Hebrew.
16. For more information about Arabic transliteration, see: [29].
17. See [42] for more information about the Arabic ACE dataset.
18. The fourth release of OntoNotes includes named entity annotation for a corpus of 300,000 words.
19. Currently at http://www1.ccls.columbia.edu/~ybenajiba/downloads.html
20. Work presented in [55, 64] also reports large-scale annotation of named entity information. However, the datasets were not released publicly.
21. According to Wikipedia statistics, Amharic Wikipedia has more than 10,000 articles, which is a promising resource for gazetteer construction.
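
The BILOU representation mentioned in Note 1 tags each token as the Beginning, Inside or Last token of a multi-token entity, as Outside any entity, or as a Unit-length entity. The following is a minimal illustrative sketch, not the chapter's or Ratinov and Roth's implementation; the function names, the span format (start, exclusive end, label) and the example sentence are assumptions made here for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, exclusive end index, entity label).
Span = Tuple[int, int, str]


def spans_to_bilou(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as one BILOU tag per token."""
    tags = ["O"] * n_tokens  # every token starts as Outside
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token (Unit-length) entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens strictly inside the entity
            tags[end - 1] = f"L-{label}"  # last token of the entity
    return tags


def bilou_to_bio(tags: List[str]) -> List[str]:
    """Collapse BILOU tags into the coarser BIO scheme (U -> B, L -> I)."""
    bio = []
    for tag in tags:
        if tag.startswith("U-"):
            bio.append("B-" + tag[2:])
        elif tag.startswith("L-"):
            bio.append("I-" + tag[2:])
        else:
            bio.append(tag)
    return bio


tokens = ["John", "Smith", "visited", "New", "York", "City", "and", "Doha"]
spans = [(0, 2, "PER"), (3, 6, "LOC"), (7, 8, "LOC")]
bilou = spans_to_bilou(len(tokens), spans)
for token, tag in zip(tokens, bilou):
    print(f"{token}\t{tag}")
```

Compared with BIO, the scheme adds two tag types per entity class (L and U), which is the kind of modest growth in the label set that Note 1 refers to.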

References

Abdul-Hamid A, Darwish K (2010) Simplified feature set for Arabic named entity recognition. In: Proceedings of the 2010 named entities workshop. Association for Computational Linguistics, Uppsala, pp 110–115
Al-Onaizan Y, Knight K (2002a) Machine transliteration of names in Arabic texts. In: Proceedings of the ACL-02 workshop on computational approaches to Semitic languages, Philadelphia. Association for Computational Linguistics
Al-Onaizan Y, Knight K (2002b) Translating named entities using monolingual and bilingual resources. In: Proceedings of 40th annual meeting of the Association for Computational Linguistics, Philadelphia. Association for Computational Linguistics, pp 400–408
Alotaibi F, Lee M (2012) Mapping Arabic Wikipedia into the named entities taxonomy. In: Proceedings of COLING 2012: posters, Mumbai. The COLING 2012 Organizing Committee, pp 43–52
Attia M, Toral A, Tounsi L, Monachini M, van Genabith J (2010) An automatically built named entity lexicon for Arabic. In: Calzolari N, Choukri K, Maegaard B, Mariani J, Odijk J, Piperidis S, Rosner M, Tapias D (eds) Proceedings of the seventh conference on international language resources and evaluation (LREC’10), Valletta. European Language Resources Association (ELRA)
Azab M, Bouamor H, Mohit B, Oflazer K (2013) Dudley North visits North London: learning when to transliterate to Arabic. In: Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT 2013), Atlanta. Association for Computational Linguistics
Babych B, Hartley A (2003) Improving machine translation quality with automatic named entity recognition. In: Proceedings of the 7th international EAMT workshop on MT and other language technology tools, EAMT ’03, Dublin
Balasuriya D, Ringland N, Nothman J, Murphy T, Curran JR (2009) Named entity recognition in Wikipedia. In: Proceedings of the 2009 workshop on the people’s web meets NLP: collaboratively constructed Semantic resources. Association for Computational Linguistics, Suntec, pp 10–18
Benajiba Y, Zitouni I (2009) Morphology-based segmentation combination for Arabic mention detection. ACM Trans Asian Lang Inf Process (TALIP) 8:16:1–16:18
Benajiba Y, Zitouni I (2010) Enhancing mention detection using projection via aligned corpora. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Cambridge. Association for Computational Linguistics, pp 993–1001
Benajiba Y, Rosso P, Benedí Ruiz JM (2007) ANERsys: an Arabic named entity recognition system based on maximum entropy. In: Gelbukh A (ed) Proceedings of CICLing, Mexico City. Springer, pp 143–153
Benajiba Y, Diab M, Rosso P (2008) Arabic named entity recognition using optimized feature sets. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 284–293
Bender O, Och FJ, Ney H (2003) Maximum entropy models for named entity recognition. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 148–151
Ben Mordecai N, Elhadad M (2005) Hebrew named entity recognition. Master’s thesis, Department of Computer Science, Ben Gurion University of the Negev
Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22:39–71
Bikel D, Miller S, Schwartz R, Weischedel R (1997) Nymble: a high-performance learning name-finder. In: Proceedings of the applied natural language processing, Tzigov Chark
Borthwick A (1999) A maximum entropy approach to named entity recognition. PhD thesis, Computer Science Department, New York University
Buckwalter T (2002) Buckwalter Arabic morphological analyzer version 1.0
Chieu HL, Ng HT (2002) Named entity recognition: a maximum entropy approach using global information. In: Proceedings of the 19th international conference on computational linguistics – vol 1, COLING ’02. Taipei
Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with Perceptron algorithms. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing (EMNLP), Philadelphia, pp 1–8
Daumé III H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the Association of Computational Linguistics, Prague. Association for Computational Linguistics, pp 256–263
Diab M (2009) Second generation tools (AMIRA 2.0): fast and robust tokenization, pos tagging, and base phrase chunking. In: Proceedings of the 2nd international conference on Arabic language resources and tools, Cairo
Doddington G, Mitchell A, Przybocki M, Rambow L, Strassel S, Weischedel R (2004) The automatic content extraction (ACE) program-tasks, data and evaluation. In: Proceedings of LREC 2004, Lisbon, pp 837–840
Farber B, Freitag D, Habash N, Rambow O (2008) Improving NER in Arabic using a morphological tagger. In: Calzolari N, Choukri K, Maegaard B, Mariani J, Odijk J, Piperidis S, Tapias D (eds) Proceedings of the sixth international language resources and evaluation (LREC’08), Marrakech. European Language Resources Association (ELRA), pp 2509–2514
Fehri H, Haddar K, Ben Hamadou A (2011) Recognition and translation of Arabic named entities with NooJ using a new representation model. In: Proceedings of the 9th international workshop on finite state methods and natural language processing, Blois. Association for Computational Linguistics, pp 134–142
Florian R, Ittycheriah A, Jing H, Zhang T (2003) Named entity recognition through classifier combination. In: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 168–171
Florian R, Hassan H, Ittycheriah A, Jing H, Kambhatla N, Luo X, Nicolov N, Roukos S (2004) A statistical model for multilingual entity detection and tracking. In: Dumais S, Marcu D, Roukos S (eds) Proceedings of the human language technology conference of the North American chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston. Association for Computational Linguistics
Goldberg Y, Elhadad M (2008) Identification of transliterated foreign words in Hebrew script. In: Proceedings of the 9th international conference on Computational linguistics and intelligent text processing, CICLing’08, Haifa, pp 466–477
Habash N, Soudi A, Buckwalter T (2007) On Arabic transliteration. Text Speech Lang Technology 38:15–22
Habash N, Rambow O, Roth R (2009) MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In: Choukri K, Maegaard B (eds) Proceedings of the second international conference on Arabic language resources and tools, the MEDAR consortium, Cairo
Hassan A, Fahmy H, Hassan H (2007) Improving named entity translation by exploiting comparable and parallel corpora. In: Proceedings of the conference on recent advances in natural language processing (RANLP ’07), Borovets
Hermjakob U, Knight K, Daumé III H (2008) Name translation in statistical machine translation – learning when to transliterate. In: Proceedings of ACL-08: HLT, Columbus. Association for Computational Linguistics, pp 389–397
Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2006) OntoNotes: the 90% solution. In: Proceedings of the human language technology conference of the NAACL (HLT-NAACL), New York City. Association for Computational Linguistics, pp 57–60
Huang F, Emami A, Zitouni I (2008) When Harry met Harri: cross-lingual name spelling normalization. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 391–399
Itai A, Wintner S (2008) Language resources for Hebrew. Lang Resour Eval 42:75–98
Jiang J, Zhai C (2006) Exploiting domain structure for named entity recognition. In: Proceedings of the human language technology conference of the NAACL (HLT-NAACL), New York City. Association for Computational Linguistics, pp 74–81
Jurafsky D, Martin JH (2008) Speech and language processing. Pearson Prentice Hall, Upper Saddle River
Karimi S, Scholer F, Turpin A (2011) Machine transliteration survey. ACM Comput Surv 43:17:1–17:46
Khalid MA, Jijkoun V, De Rijke M (2008) The impact of named entity normalization on information retrieval for question answering. In: Proceedings of the IR research, 30th European conference on advances in information retrieval, Glasgow, Springer, pp 705–710
Kirschenbaum A, Wintner S (2009) Lightly supervised transliteration for machine translation. In: Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009), Athens. Association for Computational Linguistics, pp 433–441
Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning, ICML ’01, Williamstown. Morgan Kaufmann, pp 282–289
LDC (2005) ACE (automatic content extraction) Arabic annotation guidelines for entities, version 5.3.3. Linguistic Data Consortium, Philadelphia
Leaman R, Gonzalez G (2008) Banner: an executable survey of advances in biomedical named entity recognition. In: Proceedings of pacific symposium on biocomputing, Kohala Coast, pp 652–663
Lemberski G (2003) Named entity recognition in Hebrew. Master’s thesis, Department of Computer Science, Ben Gurion University
Leser U, Hakenberg J (2005) What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 6:357–369
Maamouri M, Graff D, Bouziri B, Krouna S, Bies A, Kulick S (2010) LDC standard Arabic morphological analyzer (SAMA) version 3.1, LDC2004L02. Linguistic Data Consortium, Philadelphia
Maloney J, Niv M (1998) TAGARAB: a fast, accurate Arabic name recognizer using high precision morphological analysis. In: Proceedings of the workshop on computational approaches to Semitic languages, Montreal
Malouf R (2002) Markov models for language-independent named entity recognition. In: Proceedings of the 6th conference on natural language learning – vol 20, COLING-02, Stroudsburg. Association for Computational Linguistics, pp 1–4
McCallum A, Li W (2003) Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Daelemans W, Osborne M (eds) Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 188–191
Mesfar S (2007) Named entity recognition for Arabic using syntactic grammars. In: Kedad Z, Lammari N, Métais E, Meziane F, Rezgui Y (eds) Natural language processing and information systems. Lecture notes in computer science, vol 4592. Springer, Berlin, pp 305–316
Mikheev A, Moens M, Grover C (1999) Named entity recognition without gazetteers. In: Proceedings of the ninth conference of the European chapter of the Association for Computational Linguistics (EACL-99), Bergen. Association for Computational Linguistics
Miller S, Crystal M, Fox H, Ramshaw L, Schwartz R, Stone R, Weischedel R, The Annotation Group (1998) Algorithms that learn to extract information BBN: description of the sift system as used for MUC-7. In: Proceedings of the seventh message understanding conference (MUC-7), Fairfax
Mohit B, Schneider N, Bhowmick R, Oflazer K, Smith NA (2012) Recall-oriented learning of named entities in Arabic Wikipedia. In: Proceedings of the 13th conference of the European chapter of the ACL (EACL 2012), Avignon. Association for Computational Linguistics
Nadeau D, Sekine S (2007) A survey of named entity recognition and classification. Lingvisticae Investigationes 30:3–26
Nezda L, Hickl A, Lehmann J, Fayyaz S (2006) What in the world is a Shahab? Wide coverage named entity recognition for Arabic. In: Proceedings of LREC, Genoa, pp 41–46
Nothman J, Murphy T, Curran JR (2009) Analysing Wikipedia and gold-standard corpora for NER training. In: Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009), Athens. Association for Computational Linguistics, pp 612–620
Oudah M, Shaalan K (2012) A pipeline Arabic named entity recognition using a hybrid approach. In: Proceedings of COLING 2012, Mumbai. The COLING 2012 Organizing Committee, pp 2159–2176
Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the thirteenth conference on computational natural language learning (CoNLL-2009), Boulder, Colorado. Association for Computational Linguistics, pp 147–155
Riloff E, Jones R (1999) Learning dictionaries for information extraction by multi-level bootstrapping. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference, Orlando. American Association for Artificial Intelligence, pp 474–479
Riloff EM, Phillips W (2004) Introduction to the sundance and autoslog systems. Technical report, University of Utah
Samy D, Moreno A, Guirao JM (2005) A proposal for an Arabic named entity tagger leveraging a parallel corpus. In: Proceedings of the conference of the recent advances in natural language processing (RANLP-05), Borovets
Sekine S, Sudo K, Nobata C (2002) Extended named entity hierarchy. In: Proceedings of LREC, Las Palmas
Shaalan K, Raza H (2009) NERA: named entity recognition for Arabic. J Am Soc Inf Sci Technol 60(8):1652–1663
Sintayehu Z (2001) Automatic classification of Amharic news items: the case of Ethiopian news agency. Master’s thesis, School of Information Studies for Africa, Addis Ababa University
Sundheim BM (1995) Named entity task definition, version 2.1. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia
Tjong Kim Sang EF (2002) Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In: Proceedings of the sixth conference on natural language learning (CoNLL-2002), Taipei
Tjong Kim Sang EF, De Meulder F (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Daelemans W, Osborne M (eds) Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, Edmonton, pp 142–147
Toral A, Noguera E, Llopis F, Muñoz R (2005) Improving question answering using named entity recognition. In: Natural language processing and information systems, vol 3513/2005. Springer, Berlin/New York, pp 181–191
Walker C, Strassel S, Medero J, Maeda K (2006) ACE 2005 multilingual training corpus. LDC2006T06, Linguistic Data Consortium, Philadelphia
Wu D, Lee WS, Ye N, Chieu HL (2009) Domain adaptive bootstrapping for named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing, Singapore. Association for Computational Linguistics, pp 1523–1532
Yarowsky D, Ngai G, Wicentowski R (2001) Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the first international conference on human language technology research, HLT ’01, Stroudsburg. Association for Computational Linguistics, pp 1–8
Zaghouani W (2012) RENAR: A rule-based Arabic named entity recognition system. ACM Trans Asian Lang Inf Process (TALIP) 11:1–13
Zitouni I, Florian R (2008) Mention detection crossing the language barrier. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Honolulu. Association for Computational Linguistics, pp 600–609

Acknowledgements

I am grateful to Kemal Oflazer, Houda Bouamor, Emily Alp and two anonymous reviewers for their comments and feedback. This publication was made possible by grant YSREP-1-018-1-004 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Author information

Authors and Affiliations

Carnegie Mellon University in Qatar, Doha, Qatar
Behrang Mohit

Corresponding author

Correspondence to Behrang Mohit.

Editor information

Editors and Affiliations

Microsoft, Redmond, Washington, USA
Imed Zitouni

About this chapter

Cite this chapter

Mohit, B. (2014). Named Entity Recognition. In: Zitouni, I. (ed) Natural Language Processing of Semitic Languages. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45358-8_7

DOI: https://doi.org/10.1007/978-3-642-45358-8_7
Published: 25 March 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45357-1
Online ISBN: 978-3-642-45358-8
eBook Packages: Computer Science, Computer Science (R0)