Cross Script Hindi English NER Corpus from Wikipedia

  • Mohd Zeeshan Ansari
  • Tanvir Ahmad
  • Md Arshad Ali
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 26)


The text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend upon the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems faces obstacles due to the unavailability of standard evaluation corpora. Such corpora may be of a mixed-lingual nature, in which text is written in multiple languages predominantly using a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is successfully annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.
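As a rough illustration of the kind of annotation the abstract describes, the sketch below assigns a CoNLL-2003 class to an article from its Wikipedia category names and writes tagged tokens one per line. It is not the authors' method: the keyword rules, the function names, the IOB2-style B-/I- prefixes, and the sample sentence are all assumptions made for this example.

```python
# Hypothetical keyword rules; the paper's actual category mapping is not given here.
CATEGORY_RULES = {
    "PER": ("births", "actors", "politicians"),
    "LOC": ("cities", "rivers", "districts"),
    "ORG": ("companies", "universities", "organisations"),
}


def classify_title(categories):
    """Return a CoNLL-2003 class for an article, based on its category names."""
    for label, keywords in CATEGORY_RULES.items():
        for category in categories:
            lowered = category.lower()
            if any(keyword in lowered for keyword in keywords):
                return label
    return "MISC"  # fallback when no rule matches


def tag_sentence(tokens, entities):
    """Tag tokens with B-/I- labels from a {tuple of entity tokens: class} lookup."""
    tags = ["O"] * len(tokens)
    # Try longer entity names first so the longest match wins.
    candidates = sorted(entities.items(), key=lambda item: -len(item[0]))
    i = 0
    while i < len(tokens):
        for name, label in candidates:
            n = len(name)
            if n and tuple(tokens[i:i + n]) == name:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return tags


def to_conll(tokens, tags):
    """Render one token and its tag per line, CoNLL column style."""
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))


# Illustrative Roman-script sentence; not taken from the corpus.
entities = {("New", "Delhi"): classify_title(["Capital cities in Asia"])}
tokens = ["woh", "New", "Delhi", "mein", "rehta", "hai"]
print(to_conll(tokens, tag_sentence(tokens, entities)))
```

A keyword lookup like this would be brittle on real category names; it is meant only to make the idea of category-driven annotation concrete.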


Named entity recognition; Information extraction; Wikipedia; Annotated corpora; Indian language



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohd Zeeshan Ansari (1), email author
  • Tanvir Ahmad (1)
  • Md Arshad Ali (1)
  1. Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India
