Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair

Farrús, Mireia; Costa-jussà, Marta R.; Mariño, José B.; Poch, Marc; Hernández, Adolfo; Henríquez, Carlos; Fonollosa, José A. R.

doi:10.1007/s10579-011-9137-0

Original Paper
Published: 20 February 2011

Language Resources and Evaluation
Volume 45, pages 181–208 (2011)

Abstract

This work aims to improve an N-gram-based statistical machine translation system between Catalan and Spanish, trained on an aligned Spanish–Catalan parallel corpus of 1.7 million sentences taken from the newspaper El Periódico. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are addressed with a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on grammatical categories, as well as lexical categorization. Both human and automatic evaluations show that the improved system performs better, with a gain of about 1.1 BLEU points in the Spanish-to-Catalan direction and about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource.

Acknowledgments

The authors would like to thank the TALP Research Center and the Barcelona Media Innovation Center for their support and permission to publish this research. We also thank the anonymous reviewers of this paper for their valuable suggestions. This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-01).

Author information

Mireia Farrús
Present address: Office of Learning Technologies, Universitat Oberta de Catalunya, Av. Tibidabo, 47, 08035, Barcelona, Spain
Marc Poch
Present address: Universitat Pompeu Fabra, Roc Boronat, 138, 08018, Barcelona, Spain

Authors and Affiliations

TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Mireia Farrús, Marta R. Costa-jussà, José B. Mariño, Marc Poch, Adolfo Hernández, Carlos Henríquez & José A. R. Fonollosa
Voice and Language Department, Barcelona Media Innovation Center, Av Diagonal 177, 9th Floor, 08018, Barcelona, Spain
Marta R. Costa-jussà

Corresponding author

Correspondence to Mireia Farrús.

About this article

Cite this article

Farrús, M., Costa-jussà, M.R., Mariño, J.B. et al. Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan–Spanish language pair. Lang Resources & Evaluation 45, 181–208 (2011). https://doi.org/10.1007/s10579-011-9137-0

Published: 20 February 2011
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10579-011-9137-0
