1 Introduction

The current state-of-the-art in machine translation consists of phrase-based statistical machine translation (PB-SMT) [23], an approach which has been used since the late 1990s, evolving from word-based SMT proposed by IBM [5]. These string-based techniques (which use no linguistic knowledge) seem to have reached their ceiling in terms of translation quality, while there are still a number of limitations to the model. It lacks a mechanism to deal with long-distance dependencies, it has no means to generalise over non-overt linguistic information [37] and it has limited word reordering capabilities. Furthermore, in some cases the output quality may lack appropriate fluency and grammaticality to be acceptable for actual MT users. Sometimes essential words are missing from the translation.

To overcome these limitations efforts have been made to introduce syntactic knowledge into the statistical paradigm, usually in the form of syntax trees, either only for the source (tree-to-string) or the target language (string-to-tree), or for both (tree-to-tree).

Galley et al. [12] describe an MT engine in which tree-to-string rules have been derived from a parallel corpus, driven by the problems of SMT systems raised by [11]. Marcu et al. and Wang et al. [30, 52] describe string-to-tree systems to allow for better reordering than phrase-based SMT and to improve grammaticality. Hassan et al. [18] implement another string-to-tree system by adding supertags [2] to the target side of the phrase-based SMT baseline.

Most of the tree-to-tree approaches use one form or another of synchronous context-free grammars (SCFGs), a.k.a. syntax-directed translations [1] or syntax-directed transduction grammars [28]. This is true for the tree-based models of the Moses toolkit, Footnote 1 and the machine translation techniques described in, amongst others, [7, 27, 36, 53–55]. A more complex type of translation grammar is the synchronous tree substitution grammar (STSG) [10, 38], which provides a way, as [8] points out, to perform certain operations which are not possible with SCFGs without flattening the trees, such as raising and lowering nodes. Examples of STSG approaches are the Data-Oriented Translation (DOT) model from [20, 35], which uses data-oriented parsing [3], and the approaches described in [14–16] and [37], using STSG rules consisting of dependency subtrees, and a top-down transduction model using beam search.

The Parse and Corpus based MT (PaCo-MT) engine described in this chapter Footnote 2 is another tree-to-tree system that uses an STSG, differing from related work with STSGs in that the PaCo-MT engine combines dependency information with constituency information and that the translation model abstracts over word and phrase order in the synchronous grammar rules: the daughters of any node are in a canonical order representing all permutations. The final word order is generated by the tree-based target language modeling component.

Figure 17.1 presents the architecture of the PaCo-MT system. A source language (SL) sentence is syntactically analysed by a pre-existing parser, which yields a source language parse tree from which the surface order is abstracted away. This is described in Sect. 17.2. The unordered parse tree is translated into a forest of unordered trees (a.k.a. bag of bags) by applying tree transduction with the transfer grammar, which is an STSG derived from a parallel treebank. Section 17.3 presents how the transduction grammar was built and Sect. 17.4 how this grammar is used in the translation process. The forest is decoded by the target language generator, described in Sect. 17.5, which generates an n-best list of translation alternatives by using a tree-based target language model. The system is evaluated on Dutch to English in Sect. 17.6 and conclusions are drawn in Sect. 17.7. As all modules of our system are language independent, results for Dutch → French, English → Dutch, and French → Dutch can be expected soon.

Fig. 17.1 The architecture of the PaCo-MT system

2 Syntactic Analysis

Dutch input sentences are parsed using Alpino [32], a stochastic rule-based dependency parser, resulting in structures as in Fig. 17.2. Footnote 3

Fig. 17.2 An unordered parse tree for the Dutch sentence Het heeft ook een wettelijke reden “It also has a legal reason”, or according to Europarl “It is also subject to a legal requirement”. Note that edge labels are marked after the ‘|’

In order to induce the translation grammar, as explained in Sect. 17.3, parse trees for the English sentences in the parallel corpora are also required. These sentences are parsed using the Stanford phrase structure parser [21] with dependency information [31]. The bracketed phrase structure and the typed dependency information are integrated into an XML format consistent with the Alpino XML format. All tokens are lemmatised using TreeTagger [39].

Abstraction is made of the surface order of the terminals in every parse tree used in the PaCo-MT system. An unordered tree is defined Footnote 4 by the tuple \(\langle V, V^{i}, E, L\rangle\), where \(V\) is the set of nodes, \(V^{i}\) is the set of internal nodes, and \(V^{f} = V - V^{i}\) is the set of frontier nodes, i.e. nodes without daughters. \(E \subset V^{i} \times V\) is the set of directed edges and \(L\) is the set of labels on nodes or edges. \(V^{l} \subseteq V^{f}\) is the set of lexical frontier nodes, containing actual words as labels, and \(V^{n} = V^{f} - V^{l}\) is the set of non-lexical frontier nodes, which is empty in a full parse tree, but not necessarily in a subtree. There is exactly one root node \(r \in V^{i}\) without incoming edges. Let \(T\) be the set of all unordered trees, including subtrees.

A subtree \(s_{r} \in T\) of a tree \(t \in T\) has as its root a node \(r \in V_{t}^{i}\), where \(V_{t}^{i}\) is the set of internal nodes of \(t\). Subtrees are horizontally complete [4] if, whenever a daughter of a node is included in the subtree, all of its sisters are included as well. Figure 17.3 shows an example. Let \(H \subset T\) be the set of all horizontally complete subtrees.

Fig. 17.3 An example of a horizontally complete subtree which is not a bottom-up subtree

Bottom-up subtrees are a subset of the horizontally complete subtrees: they are lexical subtrees, in which every terminal node is a lexical node. Some examples are shown in Fig. 17.4. Let \(B \subset H\) be the set of all bottom-up subtrees. \(\forall b \in B : V_{b}^{n} = \emptyset\) and \(V_{b}^{l} = V_{b}^{f}\), where \(V_{b}^{n}\) is the set of non-lexical frontier nodes of \(b\), \(V_{b}^{l}\) is the set of lexical frontier nodes of \(b\), and \(V_{b}^{f}\) is the set of all frontier nodes of \(b\).

Fig. 17.4 Two examples of bottom-up subtrees
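The definitions of horizontally complete and bottom-up subtrees can be made concrete with a small sketch. The encoding below is an assumption made for illustration only, not the Alpino XML format used by the system: a node is a pair `(label, body)`, where `body` is a tuple of daughters for an internal node, a string for a lexical frontier node, and `None` for a non-lexical frontier node. The labels (`smain`, `np`, `det`, ...) are likewise illustrative.

```python
def is_bottom_up(tree):
    """True if every frontier node of `tree` is lexical (carries a word)."""
    _label, body = tree
    if isinstance(body, tuple):
        return all(is_bottom_up(child) for child in body)
    return isinstance(body, str)


def is_horizontally_complete(sub, full):
    """True if `sub` can be obtained from `full` by cutting below some nodes
    while keeping, for every node, either all of its daughters or none."""
    sub_label, sub_body = sub
    full_label, full_body = full
    if sub_label != full_label:
        return False
    if not isinstance(sub_body, tuple):
        # Frontier node of the subtree: either identical to the node in the
        # full tree, or a cut made below an internal node of the full tree.
        return sub_body == full_body or (
            sub_body is None and isinstance(full_body, tuple)
        )
    if not isinstance(full_body, tuple):
        return False
    return _match_all(list(sub_body), list(full_body))


def _match_all(subs, fulls):
    """Order-independent one-to-one matching of daughters (trees are unordered)."""
    if not subs:
        return not fulls  # leftover daughters in `fulls` mean missing sisters
    head, rest = subs[0], subs[1:]
    return any(
        is_horizontally_complete(head, cand)
        and _match_all(rest, fulls[:i] + fulls[i + 1:])
        for i, cand in enumerate(fulls)
    )


NP = ("np", (("det", "een"), ("adj", "wettelijke"), ("noun", "reden")))
FULL = ("smain", (("pron", "het"), ("verb", "heeft"), NP))

# All sisters kept, NP cut off: horizontally complete but not bottom-up.
HC_SUB = ("smain", (("np", None), ("verb", "heeft"), ("pron", "het")))
# The verb is missing although its sisters are present: not horizontally complete.
INCOMPLETE_SUB = ("smain", (("pron", "het"), ("np", None)))

print(is_horizontally_complete(HC_SUB, FULL), is_bottom_up(HC_SUB))
print(is_horizontally_complete(INCOMPLETE_SUB, FULL), is_bottom_up(NP))
```

Because daughters are matched independently of their order, `HC_SUB` is recognised even though its daughters are listed in a different order than in `FULL`.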

3 The Transduction Grammar

In order to translate a source sentence, a stochastic synchronous tree substitution grammar \(G\) is applied to the source sentence parse tree. Every grammar rule \(g \in G\) consists of an elementary tree pair, defined by the tuple \(\langle d^{g}, e^{g}, A^{g}\rangle\), where \(d^{g} \in T\) is the source side tree (Dutch), \(e^{g} \in T\) is the target side tree (English), and \(A^{g}\) is the alignment between the non-lexical frontier nodes of \(d^{g}\) and \(e^{g}\). The alignment \(A^{g}\) is defined by a set of tuples \(\langle v_{d}, v_{e}\rangle\), where \(v_{d} \in V_{d}^{n}\) and \(v_{e} \in V_{e}^{n}\). \(V_{d}^{n}\) is the set of non-lexical frontier nodes of \(d^{g}\), and \(V_{e}^{n}\) is the set of non-lexical frontier nodes of \(e^{g}\). Every non-lexical frontier node of the source side is aligned with a non-lexical frontier node of the target side: every \(v_{d} \in V_{d}^{n}\) is aligned with a node \(v_{e} \in V_{e}^{n}\). An example grammar rule is shown in Fig. 17.5.

Fig. 17.5 An example of a grammar rule with horizontally complete subtrees on both the source and target side. Indices mark alignments

In order to induce such a grammar, a node-aligned parallel treebank is required. Section 17.3.1 describes how to build such a treebank. Section 17.3.2 describes the actual induction process.

3.1 Preprocessing and Alignment of the Parallel Data

The system was trained on the Dutch-English subsets of the Europarl corpus [22], the DGT translation memory, Footnote 5 the OPUS corpus Footnote 6 [42] and an additional private translation memory (transmem).

The data was syntactically parsed (as described in Sect. 17.2), sentence aligned using Hunalign [50] and word aligned using GIZA++ [33]. The bidirectional GIZA++ word alignments were refined using the intersect and grow-diag heuristics implemented by Moses [24], resulting in a higher recall for alignments suitable for machine translation.
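To illustrate the refinement step, the following is a simplified sketch of the general idea behind the intersect and grow-diag heuristics; it is not the Moses implementation. It assumes that each directional alignment is given as a set of `(source_index, target_index)` links. The intersection serves as a high-precision starting point and is then extended with neighbouring links from the union whenever at least one of the two words is still unaligned.

```python
# The eight neighbouring cells of an alignment point, including diagonals.
NEIGHBOURS = [(-1, 0), (0, -1), (1, 0), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]


def grow_diag(src2tgt, tgt2src):
    """Symmetrise two directional word alignments (sets of index pairs)."""
    union = src2tgt | tgt2src
    alignment = set(src2tgt & tgt2src)  # start from the intersection
    added = True
    while added:  # iterate until no new link can be added
        added = False
        for i, j in sorted(alignment):
            for di, dj in NEIGHBOURS:
                cand = (i + di, j + dj)
                if cand not in union or cand in alignment:
                    continue
                src_aligned = any(a == cand[0] for a, _ in alignment)
                tgt_aligned = any(b == cand[1] for _, b in alignment)
                if not src_aligned or not tgt_aligned:
                    alignment.add(cand)
                    added = True
    return alignment


src2tgt = {(0, 0), (1, 1), (2, 2)}
tgt2src = {(0, 0), (1, 1), (2, 1), (4, 0)}
print(sorted(grow_diag(src2tgt, tgt2src)))
```

In this toy input the links `(2, 1)` and `(2, 2)` are recovered because they neighbour the intersection point `(1, 1)`, whereas the isolated link `(4, 0)` is left out.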

For training Lingua-Align [43], which is a discriminative tree aligner [44], a set of parallel alignments was manually constructed using the Stockholm TreeAligner [29], for which the already existing word alignments were imported. The recall of the resulting alignments was rather low, even though in constructing the training data a more relaxed version of the well-formedness criteria as proposed by [19] was used.

Various features and parameters have been used in experimentation, training with around 90 % and testing with the rest of the data set. The training data set consists of 140 parallel sentences.

Recent studies in rule-based alignment error correction [25, 26] show that recall can be significantly increased while retaining a relatively high degree of precision. This approach has been extended by applying a bottom-up rule addition component that greedily adds alignments based on already existing word alignments, more relaxed well-formedness criteria, as well as using measures of similarities between the two unlinked subtrees being considered for alignment.

3.2 Grammar Rule Induction

Figure 17.6 is an example Footnote 7 of two sentences aligned at both the sentence and subsentential level. For each alignment point, either one or two rules are extracted. First, each alignment point is a lexical alignment, creating a rule that maps a source language word or phrase to a target language one (Fig. 17.7 a, b).

Fig. 17.6 Two sentences with subsentential alignment

Fig. 17.7 Rules extracted from the alignments in Fig. 17.6

Secondly, each aligned pair of sentences engenders further rules by partitioning each tree at each alignment point, yielding non-lexical grammar rules. For these rules, the alignment information is retained at the leaves so that these trees can be recombined (Fig. 17.7 d).

The rule extraction process was restricted to rules with horizontally complete subtrees at the source and target side. Rule extraction with other types of subtrees was considered out of the scope of the current research.

Figure 17.7 shows the four rules extracted from the alignments in Fig. 17.6. Rules are extracted by passing over the entire aligned treebank, identifying each aligned node pair and recursively iterating over its children to generate a substitutable pair of trees whose roots are aligned, and whose leaves are either terminal leaves in the treebank or correspond to aligned vertices. As shown in Fig. 17.7, when a leaf node corresponds to an alignment point, we retain the information to identify which target tree leaf aligns with each such source leaf.

Many such tree substitution rules recur many times in the treebank, and a count is kept of the number of times each pair appears, resulting in a stochastic synchronous tree substitution grammar.
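The counting step can be sketched as follows. The sketch assumes that every extracted rule occurrence is available as a hashable `(source_tree, target_tree)` pair; the bracketed strings in the example are placeholders, not the system's rule format. It returns the relative frequency of each rule among all rules sharing the same source side, which is the quantity \(F(g)/F(d^{g})\) used for weighting in Sect. 17.4.

```python
from collections import Counter


def relative_frequencies(rule_occurrences):
    """Map each (source, target) rule to F(g) / F(d^g)."""
    rule_counts = Counter(rule_occurrences)      # F(g)
    source_counts = Counter()                    # F(d^g)
    for (source, _target), count in rule_counts.items():
        source_counts[source] += count
    return {
        rule: count / source_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


occurrences = (
    [("np(een reden)", "NP(a reason)")] * 3
    + [("np(een reden)", "NP(a cause)")]
    + [("pron(het)", "PRP(it)")] * 2
)
for rule, rel_freq in relative_frequencies(occurrences).items():
    print(rule, rel_freq)
```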

4 The Transduction Process

The transduction process takes an unordered source language parse tree p ∈ T as input, applies the transduction grammar G and transduces p into an unordered weighted packed forest, which is a compact representation of a set of target trees\(Q \subset T\), which represent the translation alternatives. An example of a packed forest is shown in Fig. 17.8.

Fig. 17.8 An example of a packed forest as output of the transducer for the Dutch sentence Het heeft ook een wettelijke reden. Note that ? marks an alternation

For every node \(v \in V_{p}^{i}\), where \(V_{p}^{i}\) is the set of internal nodes in the input parse tree \(p\), it is checked whether there is a subtree \(s_{v} \in H\) with \(v\) as its root node, which matches the source side tree \(d^{g}\) of a grammar rule \(g \in G\).

To keep computational complexity limited, both the subtrees of \(p\) that are considered and the subtrees that occur on the source and target side of the grammar \(G\) have been restricted to horizontally complete subtrees (including bottom-up subtrees).

When a matching grammar rule is found for which \(s_{v} = d^{g}\), the corresponding \(e^{g}\) is inserted into the output forest \(Q\). When no matching grammar rule is found, a horizontally complete subtree is constructed, as explained in Sect. 17.4.2.

The weight that the target side \(e^{g}\) of grammar rule \(g \in G\) receives is calculated according to Eq. 17.1. This weight calculation is similar to the approaches of [14, 37], as it contains largely the same factors. We multiply the weight of the grammar rule \(w(g)\) with the relative frequency of the grammar rule over all grammar rules with the same source side, \(\frac{F(g)}{F(d^{g})}\). This is divided by an alignment point penalty \((j + 1)^{app}\), favouring the solutions with the fewest alignment points.

$$W(e^{g}) = \frac{w(g)}{(j + 1)^{app}} \times \frac{F(g)}{F(d^{g})} \tag{17.1}$$

where \(w(g) = \sqrt[n]{\prod\nolimits_{i=1}^{n} w(A_{i}^{g})}\) is the weight of \(g \in G\), which is the geometric mean of the weights of the individual occurrences of alignment \(A\), as produced by the discriminative aligner described in Sect. 17.3.1; \(j = \vert V_{d}^{n}\vert = \vert V_{e}^{n}\vert\) is the number of alignment points, i.e. the number of aligned non-lexical frontier nodes in \(g \in G\); \(app\) is the alignment points power parameter (\(app = 0.5\)); \(F(g)\) is the frequency of occurrence of \(g\) in the data; and \(F(d^{g})\) is the frequency of occurrence of the source side \(d^{g}\) of \(g\) in the data.
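As a worked illustration of Eq. 17.1, the sketch below computes the weight of a target side from its ingredients. The function and parameter names are chosen here for illustration, and the numbers in the example are invented; only the formula itself and the value \(app = 0.5\) come from the text.

```python
import math


def rule_weight(alignment_weights, num_alignment_points,
                rule_freq, source_freq, app=0.5):
    """Eq. 17.1: W(e^g) = w(g) / (j + 1)^app * F(g) / F(d^g)."""
    n = len(alignment_weights)
    # w(g): geometric mean of the aligner's weights for each occurrence of g
    w_g = math.prod(alignment_weights) ** (1.0 / n)
    penalty = (num_alignment_points + 1) ** app
    return (w_g / penalty) * (rule_freq / source_freq)


# A rule seen twice with aligner weights 0.9 and 0.4, three alignment
# points, and 3 out of 4 occurrences of its source side.
print(rule_weight([0.9, 0.4], num_alignment_points=3,
                  rule_freq=3, source_freq=4))
```

With these values the geometric mean is 0.6, the penalty is \(\sqrt{4} = 2\) and the relative frequency is 0.75, so the weight is approximately 0.225.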

When no translation of a word is found in the transduction grammar, the label \(l \in L\) is mapped onto its target language equivalent. Adding a simple bilingual word form dictionary is optional. When a word translation is not found in the transduction grammar, the word is looked up in this dictionary. If the word has multiple translations in the dictionary, each of these translations receives the same weight and is combined with the translated label (usually a part-of-speech tag). When the word is not in the dictionary or no dictionary is present, the source word is transferred as is to \(Q\).
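The lexical back-off chain described above can be sketched as follows. All data structures are hypothetical stand-ins: the grammar lexicon, label map and dictionary are plain Python dictionaries, and the concrete weights (a uniform \(1/k\) for \(k\) dictionary translations, and 1.0 for a copied source word) are assumptions, since the text only states that dictionary translations receive equal weight.

```python
def translate_word(word, label, grammar_lexicon, label_map, dictionary=None):
    """Return a list of (target_label, target_word, weight) candidates."""
    # 1. Prefer word translations found in the transduction grammar.
    if (label, word) in grammar_lexicon:
        return grammar_lexicon[(label, word)]
    # 2. Otherwise map the label onto its target language equivalent ...
    target_label = label_map.get(label, label)
    # ... and consult the optional bilingual dictionary.
    candidates = (dictionary or {}).get(word, [])
    if candidates:
        weight = 1.0 / len(candidates)  # assumed uniform weight
        return [(target_label, target, weight) for target in candidates]
    # 3. Last resort: transfer the source word as is.
    return [(target_label, word, 1.0)]


grammar_lexicon = {("det", "een"): [("DT", "a", 0.9)]}
label_map = {"det": "DT", "noun": "NN"}
dictionary = {"reden": ["reason", "cause"]}

print(translate_word("een", "det", grammar_lexicon, label_map, dictionary))
print(translate_word("reden", "noun", grammar_lexicon, label_map, dictionary))
print(translate_word("Europarl", "noun", grammar_lexicon, label_map))
```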

4.1 Subtree Matching

In a first step, the transducer performs bottom-up subtree matching, which is analogous to the use of phrases in phrase-based SMT, but restricted to linguistically meaningful phrases. Bottom-up subtree matching functions like a sub-sentential translation memory: every linguistically meaningful phrase that has been encountered in the data will be considered in the transduction process, obliterating the distinction between a translation memory, a dictionary and a parallel corpus [45].

For every node \(v \in V_{p}\) it is checked whether a subtree \(s_{v}\) with root node \(v\) is found for which \(s_{v} \in B\) and for which there is a grammar rule \(g \in G\) for which \(d^{g} = s_{v}\). These matches include single word translations together with their parts-of-speech.

A second step consists of performing horizontally complete subtree matching for those nodes in the source parse tree for which the number of grammar rules \(g \in G\) that match is smaller than the beam size \(b\).

For every node\(v \in {V }_{p}^{i}\)the set\({H}_{v} \subset H \setminus B\)is generated, which is the set of all horizontally complete subtrees minus the bottom-up subtrees of p with root node v. It is checked whether a matching subtree\({s}_{v} \in {H}_{v}\)is found for which there is a grammar rule\(g \in G\)for which\({d}^{g} = {s}_{v}\).

An example of a grammar rule with horizontally complete subtrees on both source and target sides was shown in Fig. 17.5. This rule has three alignment points, as indicated by the indices.
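The two matching steps can be sketched as follows, under assumptions made purely for illustration: nodes are pairs `(label, body)` with a tuple of daughters, a word string, or `None` (non-lexical frontier node) as `body`; the grammar is a dictionary from canonicalised source trees to lists of target sides; and the target sides are placeholder strings. The sketch enumerates the horizontally complete subtrees rooted at a node, tries the bottom-up ones first, and only falls back to the remaining ones when fewer matches than the beam size were found.

```python
from itertools import product


def canon(node):
    """Sort daughters recursively so that unordered trees compare equal."""
    label, body = node
    if isinstance(body, tuple):
        return (label, tuple(sorted((canon(c) for c in body), key=repr)))
    return node


def hc_subtrees(node):
    """All horizontally complete subtrees rooted at `node` (canonical form)."""
    label, body = node
    if not isinstance(body, tuple):
        return [node]
    options = []
    for child in body:
        if isinstance(child[1], tuple):
            # An internal daughter is either cut off or expanded further.
            options.append([(child[0], None)] + hc_subtrees(child))
        else:
            options.append([child])
    return [(label, tuple(sorted(combo, key=repr)))
            for combo in product(*options)]


def is_bottom_up(tree):
    _label, body = tree
    if isinstance(body, tuple):
        return all(is_bottom_up(c) for c in body)
    return isinstance(body, str)


def match_node(node, grammar, beam_size):
    subtrees = hc_subtrees(node)
    # Step 1: bottom-up subtree matching.
    matches = [t for s in subtrees if is_bottom_up(s)
               for t in grammar.get(s, [])]
    # Step 2: other horizontally complete subtrees, if the beam is not full.
    if len(matches) < beam_size:
        matches += [t for s in subtrees if not is_bottom_up(s)
                    for t in grammar.get(s, [])]
    return matches


NP = ("np", (("det", "een"), ("adj", "wettelijke"), ("noun", "reden")))
TREE = ("smain", (("pron", "het"), ("verb", "heeft"), NP))
GRAMMAR = {
    canon(NP): ["NP(a legal reason)"],
    canon(("smain", (("pron", "het"), ("verb", "heeft"), ("np", None)))):
        ["S(it has NP_1)"],
}

print(match_node(NP, GRAMMAR, beam_size=1))
print(match_node(TREE, GRAMMAR, beam_size=1))
```

Note that the number of horizontally complete subtrees grows quickly with the number of internal daughters, which is one reason why a practical implementation would need to bound this enumeration.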

4.2 Backing Off to Constructed Horizontally Complete Subtrees

In cases where no grammar rule is found whose source side matches a horizontally complete subtree at a certain node in the input parse tree, grammar rules are combined whose source sides together form a horizontally complete subtree. An example of such a constructed grammar rule is shown in Fig. 17.9.

Fig. 17.9 An example of a constructed grammar rule

For every \(v \in V_{p}^{i}\) for which there is no \(s_{v} \in H_{v}\) matching any grammar rule \(g \in G\), let \(C_{s} = \langle c_{1},\ldots ,c_{n}\rangle\) be the set of children of root node \(v\) in subtree \(s_{v}\). For every \(c_{j} \in C_{s}\), the subtree \(s_{v}\) is split into two partial subtrees \(y_{v}\) and \(z_{v}\), where \(C_{y} = C_{s} \setminus \{c_{j}\}\) is the set of children of subtree \(y_{v}\) and \(C_{z} = \{c_{j}\}\) is the set of children of subtree \(z_{v}\).

When a grammar rule \(g \in G\) is found for which \(d^{g} = y_{v}\) and another grammar rule \(h \in G\) is found for which \(d^{h} = z_{v}\), then the respective target sides \(e_{q}^{g}\) with root node \(q\) and \(e_{u}^{h}\) with root node \(u\) are merged into one target language tree \(e^{f}\) if \(q = u\) and \(C_{e^{g,h}} = C_{e^{g}} \cup C_{e^{h}}\), resulting in a constructed grammar rule \(f \notin G\) defined by the tuple \(\langle d^{f}, e^{f}, A^{f}\rangle\), where \(d^{f} = s_{v}\). The alignment of the constructed grammar rule is the union of the alignments of the grammar rules \(g\) and \(h\): \(A^{f} = A^{g} \cup A^{h}\).

As \(f\) is a constructed grammar rule, the absolute frequency of occurrence of the grammar rule is \(F(f) = 0\), which would result in \(W(e^{g,h}) = 0\) in Eq. 17.1. In order to resolve this, the frequency of occurrence \(F(f)\) is estimated according to Eq. 17.2.

$$F(f) = w(y_{v}) \times \frac{F(g)}{F(d^{g})} \times \frac{F(h)}{F(d^{h})} \tag{17.2}$$

where

  • \(w(y_{v}) = \sqrt[m]{\prod\nolimits_{i=1}^{m} w(A_{i}^{g})}\) is the weight of grammar rule \(g\), which is the geometric mean of the weights of the individual occurrences of alignment \(A\), as produced by the discriminative aligner described in Sect. 17.3.1;

  • \(F(g)\) is the frequency of occurrence of grammar rule \(g\)

  • \(F(d^{g})\) is the frequency of occurrence of the source side \(d^{g}\) of grammar rule \(g\)

  • \(F(h)\) is the frequency of occurrence of grammar rule \(h\)

  • \(F(d^{h})\) is the frequency of occurrence of the source side \(d^{h}\) of grammar rule \(h\)
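Eq. 17.2 can be illustrated with a few lines of code. The function and parameter names below are chosen for this sketch and the input values are invented; only the formula is taken from the text.

```python
import math


def estimated_frequency(alignment_weights_g, freq_g, freq_dg, freq_h, freq_dh):
    """Eq. 17.2: F(f) = w(y_v) * F(g)/F(d^g) * F(h)/F(d^h)."""
    m = len(alignment_weights_g)
    # w(y_v): geometric mean of the alignment weights of rule g
    w_y = math.prod(alignment_weights_g) ** (1.0 / m)
    return w_y * (freq_g / freq_dg) * (freq_h / freq_dh)


# Rule g covers 2 of the 4 occurrences of its source side,
# rule h covers all 3 occurrences of its source side.
print(estimated_frequency([0.8, 0.5], freq_g=2, freq_dg=4, freq_h=3, freq_dh=3))
```

Since every factor is at most 1, the estimate never exceeds 1, so a constructed rule receives a frequency estimate that is lower than the count of any rule actually observed in the treebank.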

Constructing grammar rules leads to overgeneration. As a filter, the target language probability of such a rule is taken into account. This is estimated by multiplying the relative frequency of \(v_{j}\) in which \(c_{i}\) occurs as a child over all \(v_{j}\)’s with the relative frequency of \(c_{i}\) occurring \(N\) times over \(c_{i}\) occurring any number of times, as shown in Eq. 17.3, which is applied recursively for every node \(v_{j} \in V_{e}\), where \(V_{e}\) is the set of nodes in \(e^{f}\).

$$P(e^{f}) = \prod\limits_{j=1}^{m}\prod\limits_{i=1}^{n}\frac{F(\#(c_{i}\vert v_{j}) \geq 1)}{F(v_{j})} \times \frac{F(\#(c_{i}\vert v_{j}) = N)}{\sum\nolimits_{r=1}^{n}F(\#(c_{i}\vert v_{j}) = r)} \tag{17.3}$$

where

  • \(\#(c_{i} \vert v_{j})\) is the number of children of \(v_{j}\) with the same label as \(c_{i}\)

  • \(N\) is the number of times the label \(c_{i}\) occurs in the constructed rule
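One possible reading of Eq. 17.3 is sketched below. It rests on several assumptions that are not confirmed by the text: treebank nodes are given as `(parent_label, child_labels)` pairs, the inner product runs over the distinct child labels of each node, and a label combination never seen in the treebank yields probability 0. The toy treebank is invented.

```python
from collections import Counter, defaultdict


def child_count_stats(treebank_nodes):
    """Collect F(v) and, per (parent, child) label pair, a histogram of how
    often the parent occurs with exactly r children carrying that label."""
    parent_freq = Counter()
    histogram = defaultdict(Counter)
    for parent, children in treebank_nodes:
        parent_freq[parent] += 1
        for child, r in Counter(children).items():
            histogram[(parent, child)][r] += 1
    return parent_freq, histogram


def target_probability(rule_nodes, parent_freq, histogram):
    """Estimate P(e^f) for the nodes of a constructed target tree."""
    probability = 1.0
    for parent, children in rule_nodes:
        for child, n_times in Counter(children).items():
            hist = histogram.get((parent, child), Counter())
            at_least_once = sum(hist.values())   # F(#(c|v) >= 1)
            if at_least_once == 0:
                return 0.0                       # unseen combination
            probability *= (at_least_once / parent_freq[parent]) \
                * (hist[n_times] / at_least_once)
    return probability


treebank = [
    ("NP", ["DT", "JJ", "NN"]),
    ("NP", ["DT", "JJ", "NN"]),
    ("NP", ["DT", "NN"]),
    ("NP", ["DT", "JJ", "JJ", "NN"]),
]
parent_freq, histogram = child_count_stats(treebank)
print(target_probability([("NP", ["DT", "JJ", "NN"])], parent_freq, histogram))
```

In the toy treebank, DT and NN occur exactly once under every NP, while JJ occurs under three of the four NPs and exactly once in two of those three, giving \(1 \times (3/4 \times 2/3) \times 1 = 0.5\).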

The new weight \(w(e^{f})\) is calculated according to Eq. 17.4.

$$w(e^{f}) = \sqrt[cp]{F(f) \times P(e^{f})} \tag{17.4}$$

where \(cp\) is the construction penalty: \(0 \leq cp \leq 1\).
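The effect of the construction penalty in Eq. 17.4 can be seen in a short sketch. The function name and the example values are illustrative. Note that the \(cp\)-th root is undefined for \(cp = 0\), so the sketch assumes \(0 < cp \leq 1\).

```python
def constructed_rule_weight(estimated_freq, target_prob, cp):
    """Eq. 17.4: w(e^f) = (F(f) * P(e^f)) ** (1 / cp)."""
    if not 0 < cp <= 1:
        raise ValueError("cp must satisfy 0 < cp <= 1 in this sketch")
    return (estimated_freq * target_prob) ** (1.0 / cp)


for cp in (1.0, 0.5, 0.25):
    print(cp, constructed_rule_weight(0.5, 0.5, cp))
```

Because \(F(f) \times P(e^{f})\) lies between 0 and 1, a smaller \(cp\) raises it to a higher power and therefore penalises constructed rules more strongly; with \(cp = 1\) no penalty is applied.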

When constructing a horizontally complete subtree fails, a grammar rule is constructed by translating each child separately.

5 Generation

The main task of the target language generator is to determine word order, as the packed forest contains unordered trees. A secondary task of the target language model is to provide additional information concerning lexical selection, similar to the language model in phrase-based SMT [23].

The target language generator has been described in detail in [47], but the system has been generalised and improved and was adapted to work with weighted packed forests as input.

For every node in the forest, the surface order of its children needs to be determined. For instance, when translating “een wettelijke reden” into English, the bag\(\mathit{NP}\langle \mathit{JJ}(\mathit{legal}),\mathit{DT}(a),\mathit{NN}(\mathit{reason})\rangle\)represents the surface order of all permutations of these elements.

A large monolingual treebank is searched for an NP in which these three elements occur, and for the order in which they occur most frequently, using the relative frequency of each permutation as a weight. If none of the permutations are found, the system backs off to a more abstract level, only looking for the bag \(\mathit{NP}\langle \mathit{JJ},\mathit{DT},\mathit{NN}\rangle\) without lexical information, for which there is most likely a match in the treebank.

When still not finding a match, all permutations are generated with an equal weight, and a penalty is applied for the distance between the source language word order and the target language word order to avoid generating too many solutions with exactly the same weight. This is related to the notion of distortion in IBM model 3 in [5].
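The ordering procedure with back-off can be sketched as follows. The treebank is represented here by a hypothetical `Counter` that maps `(parent_label, ordered_children)` to a frequency, where the children are `(pos, word)` pairs at the lexicalised level and bare part-of-speech tags at the abstract level. Only these two levels are modelled, and the final fallback simply assigns equal weights; the distance penalty described above is left out of this sketch.

```python
from collections import Counter
from itertools import permutations


def order_bag(parent, bag, treebank_counts):
    """Return (permutation, weight) pairs for an unordered bag of children,
    sorted from most to least frequent in the treebank."""
    perms = list(dict.fromkeys(permutations(bag)))  # drop duplicate orders
    # Back off from lexicalised children to part-of-speech tags only.
    for abstraction in (lambda child: child, lambda child: child[0]):
        counts = {
            perm: treebank_counts[(parent, tuple(abstraction(c) for c in perm))]
            for perm in perms
        }
        total = sum(counts.values())
        if total:
            ranked = [(perm, n / total) for perm, n in counts.items() if n]
            return sorted(ranked, key=lambda item: -item[1])
    # Nothing found at any level: all permutations get the same weight.
    return [(perm, 1.0 / len(perms)) for perm in perms]


treebank_counts = Counter({
    ("NP", ("DT", "JJ", "NN")): 9,
    ("NP", ("DT", "NN", "JJ")): 1,
})
bag = [("JJ", "legal"), ("DT", "a"), ("NN", "reason")]
for perm, weight in order_bag("NP", bag, treebank_counts):
    print(" ".join(word for _pos, word in perm), weight)
```

In this toy treebank no lexicalised NP is found, so the sketch backs off to the part-of-speech level, where the order DT JJ NN ("a legal reason") accounts for nine of the ten observed NPs.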

In the example bag, there are two types of information for each child: the part-of-speech and the word token, but as already pointed out in Sect. 17.2 dependency information and lemmas are also at our disposal.

All different information sources (token, lemma, part-of-speech, and dependency relation) have been investigated with a back-off from most concrete (token + lemma + part-of-speech + dependency relation) to most abstract (part-of-speech).

The functionality of the generator is similar to the one described in [17], but relative frequency of occurrence is used instead of n-grams of dependencies. As shown in [47], this approach outperforms SRILM 3-gram models [41] for word ordering. [51] uses feature templates for translation candidate reranking, but these can have a higher depth and complexity than the context-free rules used here.

Large monolingual target language treebanks have been built by using the target sides of the parallel corpora and adding the British National Corpus (BNC). Footnote 8

6 Evaluation

We evaluated translation quality from Dutch to English on a test set of 500 sentences with three reference translations, using BLEU [34], NIST [9] and translation edit rate (TER) [40], as shown in Table 17.1.

Table 17.1 Evaluation of the Dutch-English engine

We show the effect of adding data by presenting the results when using the Europarl (EP) corpus, and when adding the OPUS corpus, the DGT corpus, and the private translation memory (transmem), and we show the effect of adding a dictionary of more than 100,000 words, taken from the METIS Dutch-English translation engine [6, 46]. This dictionary is only used for words for which the grammar does not provide a translation.

These results show that the best scoring condition is trained on all the data apart from DGT, which seems to deteriorate performance. Adding the dictionary is beneficial under all conditions. Error analysis shows that the system often fails when using the back-off models, whereas it seems to function properly when horizontally complete subtrees are found.

Comparing the results with Moses Footnote 9 [24] shows that there is a long way to go for our syntax-based approach until it is on a par with phrase-based SMT. The difference in score is partly due to remaining bugs in the PaCo-MT system which cause no output in 2.6 % of the cases. Another reason could be the fact that automated metrics like BLEU are known to favour phrase-based SMT systems. Nevertheless, the PaCo-MT system has not yet reached its full maturity and there are several ways to improve the approach, as discussed in Sect. 17.7.

7 Conclusions and Future Work

With the research presented in this paper we wanted to investigate an alternative approach towards MT, not using n-grams or any other techniques from phrase-based SMT systems. Footnote 10

A detailed error analysis and comparison between the different conditions will reveal what can be done to improve the system. Different parameters in alignment can result in more useful information from the same set of data. Different approaches to grammar induction could also improve the system, as grammar induction is now limited to horizontally complete subtrees. STSGs allow more complex grammar rules including horizontally incomplete subtrees. Another improvement can be expected from working on the back-off strategy in the transducer, such as the real time construction of new grammar rules on the basis of partial grammar rules.

The system could be converted into a syntactic translation aid, by only taking the decisions of which it is confident, backing off to human decisions in cases of data sparsity. It remains to be tested whether this approach would be useful.

Further investigation of the induced grammar could lead to a reduction in grammar rules, by implementing a default inheritance hierarchy, similar to [13], speeding up the system, without having any negative effects on the output.

The current results of our system are, in our opinion, sufficient neither to reject nor to accept a syntax-based approach towards MT as an alternative for phrase-based SMT, as, quoting Kevin Knight, “the devil is in the details”. Footnote 11