A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms

  • György Szarvas
  • Richárd Farkas
  • András Kocsor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4265)

Abstract

In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies named entities (NEs) in Hungarian and English by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible, and used a split-and-recombine technique to fully exploit its potential. This methodology allowed us to train several independent decision tree classifiers on different subsets of the features and to combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation. Both consist entirely of newswire articles. Our system remains portable across languages without requiring any major modification; it slightly outperforms the best system of CoNLL 2003 and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top-performing models for English, which makes our system extremely successful when used in combination with CoNLL models.
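The split-and-recombine idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three feature names, the toy training rows, the round-robin split, and above all `TableLearner`, which is a simple lookup-table stand-in and not the AdaBoostM1-boosted C4.5 trees the paper uses. The sketch only shows the overall shape: one classifier per feature subset, then a majority vote over their predictions.

```python
from collections import Counter, defaultdict


def split_features(names, n_subsets):
    """Deal feature names round-robin into n_subsets disjoint groups."""
    return [names[i::n_subsets] for i in range(n_subsets)]


class TableLearner:
    """Toy stand-in for a boosted C4.5 tree (NOT the paper's learner).

    It memorises the most frequent label for each combination of values
    of the features it is allowed to see.
    """

    def __init__(self, features):
        self.features = features
        self.table = {}
        self.default = None

    def _key(self, token):
        return tuple(token[f] for f in self.features)

    def fit(self, tokens, labels):
        counts = defaultdict(Counter)
        for token, label in zip(tokens, labels):
            counts[self._key(token)][label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        # Fallback for unseen value combinations: overall most frequent label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, token):
        return self.table.get(self._key(token), self.default)


def majority_vote(models, token):
    """Return the label predicted by most models.

    Ties go to the label that was voted for first (Counter ordering).
    """
    votes = Counter(m.predict(token) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical token-level features and toy training data (PER = person, O = other).
FEATURES = ["capitalized", "in_person_list", "prev_is_title"]
ROWS = [
    (True, True, True, "PER"),
    (True, True, False, "PER"),
    (True, False, True, "PER"),
    (False, False, False, "O"),
    (False, False, False, "O"),
    (True, False, False, "O"),
]
train_tokens = [dict(zip(FEATURES, row[:3])) for row in ROWS]
train_labels = [row[3] for row in ROWS]

# One independent model per feature subset, recombined by voting.
models = [TableLearner(subset).fit(train_tokens, train_labels)
          for subset in split_features(FEATURES, 3)]

query = {"capitalized": True, "in_person_list": False, "prev_is_title": True}
print(majority_vote(models, query))
```

On this toy data the capitalization and title-cue models vote PER for the query token while the name-list model votes O, so the majority vote returns PER. With a real feature set, each subset would hold many features and each model would be a much stronger learner.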

Keywords

Named Entity Recognition, NER, Boosting, C4.5, decision tree, voting, machine learning

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • György Szarvas (1)
  • Richárd Farkas (2)
  • András Kocsor (2)
  1. Department of Informatics, University of Szeged, Szeged, Hungary
  2. Research Group on Artificial Intelligence, MTA-SZTE, Szeged, Hungary
