Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora

  • Victoria Fossum
  • Steven Abney
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3651)


We implement a variant of the algorithm described by Yarowsky and Ngai in [21] to induce an HMM POS tagger for an arbitrary target language using only an existing POS tagger for a source language and an unannotated parallel corpus between the source and target languages. We extend this work by projecting from multiple source languages onto a single target language. We hypothesize that systematic transfer errors from differing source languages will cancel out, improving the quality of bootstrapped resources in the target language. Our experiments confirm the hypothesis. Each experiment compares three cases: (a) source data comes from a single language A, (b) source data comes from a single language B, and (c) source data comes from both A and B, but half as much from each. Apart from the source language, other conditions are held constant in all three cases – including the total amount of source data used. The null hypothesis is that performance in the mixed case would be an average of performance in the single-language cases, but in fact, mixed-case performance always exceeds the maximum of the single-language cases. We observed this effect in all six experiments we ran, involving three different source-language pairs and two different target languages.
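The projection-and-pooling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `project_tags` and `emission_estimates`, the tag labels, the one-to-one alignments, and the toy sentences are all invented here, and the sketch omits the smoothing and noise handling a real system would need. It copies tags across word alignments from two source languages, then pools the projected (word, tag) pairs to estimate the emission probabilities an HMM tagger would use.

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignment, target_tokens):
    """Direct projection: each aligned target token receives its source token's tag.

    alignment is a list of (source_index, target_index) pairs; target tokens
    with no alignment link are simply absent from the result.
    """
    return [(target_tokens[t], source_tags[s]) for s, t in alignment]


def emission_estimates(tagged_pairs):
    """Relative-frequency estimate of P(word | tag) from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_pairs:
        counts[tag][word] += 1
    return {
        tag: {w: c / sum(words.values()) for w, c in words.items()}
        for tag, words in counts.items()
    }


# Toy data, invented for illustration (not from the paper's corpora).
# One target sentence tagged via hypothetical source language A ...
from_a = project_tags(
    ["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], ["le", "chat", "dort"]
)
# ... and a different target sentence tagged via hypothetical source language B.
from_b = project_tags(
    ["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], ["le", "chien", "mange"]
)

# Mixed case: half of the projected data comes from each source language.
pooled = emission_estimates(from_a + from_b)
print(pooled["NOUN"])  # {'chat': 0.5, 'chien': 0.5}
```

Pooling the projected pairs before estimation mirrors case (c) in the abstract, where the total amount of source data is fixed and split evenly between two source languages; whether the paper pools at exactly this stage is an assumption of the sketch.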


Keywords: Target Language, Source Language, Computational Linguistics, Parallel Corpus, Single Language
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J., Purdy, D., Smith, N., Yarowsky, D.: Statistical machine translation. In: Johns Hopkins University 1999 Summer Workshop on Language Engineering (1999)
  2. Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the 6th Applied NLP Conference, ANLP-2000, Seattle, WA, April 29 – May 3 (2000)
  3. Brill, E.: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4), 543–565 (1995)
  4. Brill, E., Wu, J.: Classifier Combination for Improving Lexical Disambiguation. In: Proceedings of the ACL (1998)
  5. Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V.J., Jelinek, F., Lafferty, J.D., Mercer, R.L., Roossin, P.S.: A Statistical Approach to Machine Translation. Computational Linguistics 16(2), 79–85 (1990)
  6. Clark, S., Curran, J., Osborne, M.: Bootstrapping POS taggers using unlabelled data. In: Daelemans, W., Osborne, M. (eds.) Proceedings of CoNLL-2003, Edmonton, Canada, pp. 49–55 (2003)
  7. Collins, M., Hajic, J., Ramshaw, L., Tillmann, C.: A Statistical Parser for Czech. In: Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland (1999)
  8. Cucerzan, S., Yarowsky, D.: Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL) (2002)
  9. Gimenez, J., Marquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal (2004)
  10. Gollins, T., Sanderson, M.: Improving Cross Language Information Retrieval with Triangulated Translation. In: Proceedings of the 24th annual international ACM SIGIR conference, pp. 90–95 (2001)
  11. French-English Hansards Corpus of Canadian Parliamentary Proceedings
  12. Hajic, J., Hladka, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: COLING-ACL, pp. 483–490 (1998)
  13. Hajic, J., Krbec, P., Kveton, P., Oliva, K., Petkevic, V.: Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In: Proceedings of the ACL (2001)
  14. Henderson, J.C., Brill, E.: Exploiting Diversity in Natural Language Processing: Combining Parsers. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 187–194 (1999)
  15. Hwa, R., Resnik, P., Weinberg, A.: Breaking the Resource Bottleneck for Multilingual Parsing. In: Proceedings of the Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data (2002)
  16. Mann, G., Yarowsky, D.: Multipath translation lexicon induction via bridge languages. In: Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 151–158 (2001)
  17. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989)
  18. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994)
  19. van Halteren, H., Zavrel, J., Daelemans, W.: Improving Data Driven Wordclass Tagging by System Combination. In: Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics, pp. 491–497 (1998)
  20. Witten, I., Bell, T.: The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4), 1085–1094 (1991)
  21. Yarowsky, D., Ngai, G.: Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In: Proceedings of NAACL, pp. 200–207 (2001)
  22. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In: Proceedings of HLT (2001)
  23. Zavrel, J., Daelemans, W.: Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers. In: Proceedings of LREC-2000, Athens (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Victoria Fossum (1)
  • Steven Abney (2)

  1. Dept. of EECS, University of Michigan, Ann Arbor
  2. Dept. of Linguistics, University of Michigan, Ann Arbor
