Abstract
Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.
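As a minimal illustration of what a CoNLL-2003-style annotation of cross-script Hindi-English text might look like, the Python sketch below writes and reads token-per-line records and collects the tagged entities. The example sentence, the use of B-/I- prefixes, and the helper names `to_conll`, `from_conll`, and `entities` are assumptions made for illustration; they are not taken from the corpus or from the authors' tooling.

```python
from typing import List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs

# Hypothetical Romanized Hindi-English sentence (not from the corpus),
# tagged with the four CoNLL-2003 categories plus "O" for non-entities.
EXAMPLE: Sentence = [
    ("Sachin", "B-PER"),
    ("Tendulkar", "I-PER"),
    ("ka", "O"),
    ("janm", "O"),
    ("Mumbai", "B-LOC"),
    ("mein", "O"),
    ("hua", "O"),
]


def to_conll(sentences: List[Sentence]) -> str:
    """Serialize sentences as 'token tag' lines, blank line between sentences."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in s) for s in sentences]
    return "\n\n".join(blocks) + "\n"


def from_conll(text: str) -> List[Sentence]:
    """Parse 'token tag' lines back into sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        tok, tag = line.rsplit(" ", 1)  # the tag is the last column
        current.append((tok, tag))
    if current:
        sentences.append(current)
    return sentences


def entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, category) spans from B-/I- tags."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for tok, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and label != tag[2:]
        )
        if starts_new:
            if tokens and label is not None:
                spans.append((" ".join(tokens), label))
            tokens, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(tok)  # continue the open entity
        else:
            if tokens and label is not None:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens and label is not None:
        spans.append((" ".join(tokens), label))
    return spans


if __name__ == "__main__":
    print(to_conll([EXAMPLE]))
    print(entities(EXAMPLE))
```

Keeping one token per line with the tag in the last column makes it straightforward to feed such a file to standard sequence-labeling tools, whatever script the tokens are written in.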
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ansari, M.Z., Ahmad, T., Arshad Ali, M. (2019). Cross Script Hindi English NER Corpus from Wikipedia. In: Hemanth, J., Fernando, X., Lafata, P., Baig, Z. (eds) International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. ICICI 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-03146-6_116
DOI: https://doi.org/10.1007/978-3-030-03146-6_116
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03145-9
Online ISBN: 978-3-030-03146-6
eBook Packages: Intelligent Technologies and Robotics (R0)