A Hidden Markov Model Based Named Entity Recognition System: Bengali and Hindi as Case Studies

  • Asif Ekbal
  • Sivaji Bandyopadhyay
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4815)

Abstract

Named Entity Recognition (NER) plays an important role in almost all Natural Language Processing (NLP) application areas, including information retrieval, machine translation, question answering, and automatic summarization. This paper reports the development of a statistical Hidden Markov Model (HMM) based NER system. The system was initially developed for Bengali using a tagged Bengali news corpus built from the web archive of a leading Bengali newspaper. It is trained on a corpus of 150,000 wordforms, initially tagged with an HMM-based part-of-speech (POS) tagger. A 10-fold cross-validation test yields average Recall, Precision and F-Score values of 90.2%, 79.48% and 84.5%, respectively. The same HMM-based NER system is then trained and tested on Hindi data to demonstrate its language-independent capability. With 27,151 Hindi wordforms, the 10-fold cross-validation test yields average Recall, Precision and F-Score values of 82.5%, 74.6% and 78.35%, respectively.
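
As a quick consistency check on the reported figures, the sketch below recomputes the F-Score from Recall and Precision. It assumes the F-Score is the balanced harmonic mean of the two (the usual F1 definition); the abstract does not state the formula, and the function name f_score is illustrative rather than taken from the paper.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean) of recall and precision.

    Assumption: the paper's F-Score is the standard F1 measure.
    Inputs and output are percentages.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Averages reported in the abstract (10-fold cross validation).
bengali = f_score(recall=90.2, precision=79.48)
hindi = f_score(recall=82.5, precision=74.6)

print(f"Bengali F-Score: {bengali:.2f}%")
print(f"Hindi F-Score:   {hindi:.2f}%")
```

Under this assumption, both recomputed values agree with the reported 84.5% and 78.35% to two decimal places.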

Keywords

  • Named Entity (NE)
  • Named Entity Recognition (NER)
  • Hidden Markov Model (HMM)
  • Named Entity Recognition in Bengali

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Asif Ekbal (1)
  • Sivaji Bandyopadhyay (1)
  1. Computer Science and Engineering Department, Jadavpur University, Kolkata, India
