Abstract

Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form poses considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems faces obstacles because standard evaluation corpora are unavailable. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.
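The abstract does not describe the annotation procedure in detail, so the following is only a minimal sketch of how page titles gathered from Wikipedia category pages might be projected onto romanized Hindi-English text as CoNLL-2003 entity types. Everything in it is an assumption made for illustration: the keyword lists, the function names `category_to_tag` and `tag_sentence`, the title/category pairs, the greedy longest-match lookup, and the IOB2-style `B-`/`I-` prefixes (the abstract names only the four categories).

```python
# Hypothetical sketch: the keyword lists and helper names below are
# illustrative assumptions, not the authors' actual pipeline.
CATEGORY_KEYWORDS = {
    "PER": ("people", "births", "actors", "politicians"),
    "LOC": ("cities", "states", "rivers", "villages"),
    "ORG": ("companies", "universities", "organisations"),
}


def category_to_tag(category):
    """Map a Wikipedia category name to a CoNLL-2003 type (assumed heuristic)."""
    lowered = category.lower()
    for tag, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tag
    return "MISC"  # fallback for categories that match no keyword


def tag_sentence(tokens, gazetteer):
    """Label tokens with IOB2-style tags using greedy longest-match lookup."""
    labels = ["O"] * len(tokens)
    max_len = max((len(key) for key in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(token.lower() for token in tokens[i:i + n])
            if span in gazetteer:
                tag = gazetteer[span]
                labels[i] = "B-" + tag
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + tag
                step = n
                break
        i += step
    return list(zip(tokens, labels))


if __name__ == "__main__":
    # Illustrative title/category pairs, not taken from the paper's corpus.
    pages = {
        "Sachin Tendulkar": "1973 births",
        "Mumbai": "Cities in Maharashtra",
    }
    gazetteer = {
        tuple(title.lower().split()): category_to_tag(category)
        for title, category in pages.items()
    }
    sentence = "Sachin Tendulkar ka janm Mumbai mein hua tha".split()
    # Emit one token and its tag per line, as in CoNLL-style files.
    for token, label in tag_sentence(sentence, gazetteer):
        print(token, label)
```

A keyword heuristic like this one would mislabel many categories in practice; it is shown only to make the idea of category-driven annotation concrete.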

Author information

Corresponding author

Correspondence to Mohd Zeeshan Ansari.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ansari, M.Z., Ahmad, T., Arshad Ali, M. (2019). Cross Script Hindi English NER Corpus from Wikipedia. In: Hemanth, J., Fernando, X., Lafata, P., Baig, Z. (eds) International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. ICICI 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-03146-6_116
