Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts

  • Michał Marcińczuk
  • Maciej Janicki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7181)

Abstract

In this paper we present several optimizations to a Conditional Random Fields-based model for proper name recognition in Polish running text. The proposed optimizations address word-level segmentation problems, gazetteer incompleteness, the problem of unambiguous generalization features, feature construction and selection, and finally the recognition of common proper names on the basis of external sources of knowledge. The task is limited to recognizing person first names and surnames and names of countries, cities and roads. The evaluation is performed in two ways: a single-domain evaluation using 10-fold cross-validation on a Corpus of Stock Exchange Reports, and a cross-domain evaluation on a Corpus of Economic News. An additional corpus of Wikipedia articles, InfiKorp, is used in feature selection. Finally, we evaluate three configurations of the proposed modifications. The top configuration improved the final F-measure from 94.53% to 95.65% for the single-domain evaluation and from 70.86% to 79.63% for the cross-domain evaluation.
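
For readers unfamiliar with the metric, the F-measure quoted above can be sketched as follows. This is a minimal illustration that assumes exact matching of both span boundaries and category; the tuple representation, the label strings, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (first_token, last_token, category).
Span = Tuple[int, int, str]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """Balanced F-measure (F1), assuming exact span-and-category matching."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data invented for illustration only.
gold = {(0, 0, "first_name"), (1, 1, "surname"), (5, 5, "city"), (9, 10, "road")}
predicted = {(0, 0, "first_name"), (1, 1, "surname"), (5, 5, "country"), (9, 10, "road")}

# Three of four predictions match, so precision = recall = 0.75.
score = f_measure(gold, predicted)
print(f"F-measure: {score:.2%}")
```

Under this strict matching assumption, a name found at the right position but assigned the wrong category (here "country" instead of "city") counts as both a false positive and a false negative.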

Keywords

Named Entity Recognition; Proper Name Recognition; Machine Learning; Conditional Random Fields; Gazetteers; Classifier Ensemble; Polish

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Michał Marcińczuk (1)
  • Maciej Janicki (1)
  1. Wrocław University of Technology, Wrocław, Poland
