Language Resources and Evaluation, Volume 45, Issue 2, pp 181–208

Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair

  • Mireia Farrús
  • Marta R. Costa-jussà
  • José B. Mariño
  • Marc Poch
  • Adolfo Hernández
  • Carlos Henríquez
  • José A. R. Fonollosa
Original Paper

Abstract

This work aims to improve an N-gram-based statistical machine translation system between Catalan and Spanish, trained on an aligned Spanish–Catalan parallel corpus of 1.7 million sentences taken from the newspaper El Periódico. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are addressed. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, rules based on grammatical categories, and lexical categorization. Both human and automatic evaluations show a clear improvement over the baseline, with a gain of about 1.1 BLEU points in the Spanish-to-Catalan direction and about 0.5 BLEU points in the reverse direction. The final system is freely available online as a linguistic resource.

Keywords

  • Statistical machine translation
  • N-gram-based translation
  • Linguistic knowledge
  • Grammatical categories

Notes

Acknowledgments

The authors would like to thank the TALP Research Center and the Barcelona Media Innovation Center for their support and permission to publish this research. We also thank the anonymous reviewers of this paper for their valuable suggestions. This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-01).

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Mireia Farrús (1, 2)
  • Marta R. Costa-jussà (1, 3)
  • José B. Mariño (1)
  • Marc Poch (1, 4)
  • Adolfo Hernández (1)
  • Carlos Henríquez (1)
  • José A. R. Fonollosa (1)

  1. TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, Spain
  2. Office of Learning Technologies, Universitat Oberta de Catalunya, Barcelona, Spain
  3. Voice and Language Department, Barcelona Media Innovation Center, Barcelona, Spain
  4. Universitat Pompeu Fabra, Barcelona, Spain
