, Volume 22, Issue 1-2, pp 29-66
Date: 25 Nov 2008

Using target-language information to train part-of-speech taggers for machine translation

Rent the article at a discount

Rent now

* Final gross prices may vary according to local VAT.

Get Access

Abstract

Although corpus-based approaches to machine translation (MT) are growing in interest, they are not applicable when the translation involves less-resourced language pairs for which there are no parallel corpora available; in those cases, the rule-based approach is the only applicable solution. Most rule-based MT systems make use of part-of-speech (PoS) taggers to solve the PoS ambiguities in the source-language texts to translate; those MT systems require accurate PoS taggers to produce reliable translations in the target language (TL). The standard statistical approach to PoS ambiguity resolution (or tagging) uses hidden Markov models (HMM) trained in a supervised way from hand-tagged corpora, an expensive resource not always available, or in an unsupervised way through the Baum-Welch expectation-maximization algorithm; both methods use information only from the language being tagged. However, when tagging is considered as an intermediate task for the translation procedure, that is, when the PoS tagger is to be embedded as a module within an MT system, information from the TL can be (unsupervisedly) used in the training phase to increase the translation quality of the whole MT system. This paper presents a method to train HMM-based PoS taggers to be used in MT; the new method uses not only information from the source language (SL), as general-purpose methods do, but also information from the TL and from the remaining modules of the MT system in which the PoS tagger is to be embedded. We find that the translation quality of the MT system embedding a PoS tagger trained in an unsupervised manner through this new method is clearly better than that of the same MT system embedding a PoS tagger trained through the Baum-Welch algorithm, and comparable to that obtained by embedding a PoS tagger trained in a supervised way from hand-tagged corpora.