Cross Script Hindi English NER Corpus from Wikipedia

Ansari, Mohd Zeeshan; Ahmad, Tanvir; Arshad Ali, Md

doi:10.1007/978-3-030-03146-6_116

Mohd Zeeshan Ansari, Tanvir Ahmad & Md Arshad Ali

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 26)

Included in the following conference series:

International Conference on Intelligent Data Communication Technologies and Internet of Things

Abstract

Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the lack of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is annotated with the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.
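
As a rough illustration of what annotation with the four CoNLL-2003 categories could look like, the sketch below reads token/tag pairs and groups them into entity spans. The two-column layout, the IOB-style `B-`/`I-` prefixes, the sample sentence, and the helper names `read_tokens` and `extract_entities` are all assumptions made for illustration; the abstract does not specify the corpus file format or give any example sentences.

```python
from typing import List, Tuple

# The four entity categories named in the abstract.
CATEGORIES = {"PER", "LOC", "ORG", "MISC"}

# Hypothetical Romanized Hindi-English sentence, one "token tag" pair per line.
# This sample is invented for illustration and is not taken from the corpus.
SAMPLE = """\
Sachin B-PER
Tendulkar I-PER
ka O
janm O
Mumbai B-LOC
mein O
hua O
tha O
"""


def read_tokens(text: str) -> List[Tuple[str, str]]:
    """Parse 'token tag' lines into (token, tag) pairs, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, tag = line.rsplit(maxsplit=1)
        if tag != "O" and tag[2:] not in CATEGORIES:
            raise ValueError(f"unknown tag: {tag}")
        rows.append((token, tag))
    return rows


def extract_entities(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged tokens into (entity text, category) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    category = None
    for token, tag in rows:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != category
        )
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), category))
            tokens, category = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # an "O" tag closes any open entity
            if tokens:
                entities.append((" ".join(tokens), category))
            tokens, category = [], None
    if tokens:
        entities.append((" ".join(tokens), category))
    return entities


if __name__ == "__main__":
    for text, label in extract_entities(read_tokens(SAMPLE)):
        print(label, text)
```

Grouping tags into spans like this is the usual first step before computing entity-level precision and recall, which is one common way an NER corpus of this kind is evaluated.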

Author information

Authors and Affiliations

Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
Mohd Zeeshan Ansari, Tanvir Ahmad & Md Arshad Ali

Corresponding author

Correspondence to Mohd Zeeshan Ansari.

Editor information

Editors and Affiliations

Department of ECE, Karunya University, Coimbatore, India
Jude Hemanth
Department of Electrical and Computer Engineering, Ryerson Communications Lab, Ryerson University, Toronto, ON, Canada
Xavier Fernando
Faculty of Engineering, Department of Telecommunication Engineering, Czech Technical University, Prague, Czech Republic
Pavel Lafata
School of Science, Joondalup Campus, Edith Cowan University, Joondalup, WA, Australia
Zubair Baig

About this paper

Cite this paper

Ansari, M.Z., Ahmad, T., Arshad Ali, M. (2019). Cross Script Hindi English NER Corpus from Wikipedia. In: Hemanth, J., Fernando, X., Lafata, P., Baig, Z. (eds) International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. ICICI 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-03146-6_116

DOI: https://doi.org/10.1007/978-3-030-03146-6_116
Published: 21 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03145-9
Online ISBN: 978-3-030-03146-6
eBook Packages: Intelligent Technologies and Robotics (R0)
