Language Resources and Evaluation, Volume 48, Issue 4, pp. 601–637

HamleDT: Harmonized multi-language dependency treebank

  • Daniel Zeman
  • Ondřej Dušek
  • David Mareček
  • Martin Popel
  • Loganathan Ramasamy
  • Jan Štěpánek
  • Zdeněk Žabokrtský
  • Jan Hajič
Original Paper

DOI: 10.1007/s10579-014-9275-2

Cite this article as:
Zeman, D., Dušek, O., Mareček, D. et al. Lang Resources & Evaluation (2014) 48: 601. doi:10.1007/s10579-014-9275-2

Abstract

We present HamleDT—a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are comparable across languages, though their annotation in treebanks often differs. We claim that transformation procedures can be designed to automatically identify most such phenomena and convert them to a unified annotation style. This unification is beneficial both to comparative corpus linguistics and to machine learning of syntactic parsing.

Keywords

Dependency treebank · Annotation scheme · Harmonization

1 Introduction

Growing interest in dependency parsing is accompanied (and inspired) by the availability of new treebanks for various languages. Shared tasks such as CoNLL 2006–2009 (Buchholz and Marsi 2006; Nivre et al. 2007; Surdeanu et al. 2008; Hajič et al. 2009) have promoted parser evaluation in multilingual settings. However, differences in parsing accuracy across languages cannot always be attributed to language differences; they are often caused by variation in the domains, sizes and annotation styles of the treebanks. The impact of data size can be estimated by learning-curve experiments, but normalizing the annotation style is difficult. We present a method to transform the treebanks into a common style, including software that implements the method. We have studied treebanks of 29 languages and collected a long list of variations.1 We propose one common style (called the HamleDT v1.5 style) and provide a transformation from the original annotations to this style for almost all2 the phenomena we identified. In addition to dependency tree structure normalization, we also unify the tagsets of both the part-of-speech/morphological tags and the dependency relation tags.

The motivation for harmonizing the annotation conventions used for different treebanks has already been described in the literature, e.g., by McDonald et al. (2013). Clearly, a unified representation of language data should facilitate the development of multilingual technologies. The harmonized set of treebanks should improve the interpretability and comparability of parsing accuracy results, and thus help drive the development of dependency parsers towards multilingual robustness. For instance, the range of unlabeled attachment scores reached by a typical state-of-the-art supervised dependency parser across languages spans an interval of around 10 percentage points (given training data of comparable size), and the interval is even wider for unsupervised parsers, as documented, e.g., by Mareček and Žabokrtský (2012). It is not entirely clear whether and to what extent this variance can be attributed to the peculiarities of the individual languages, or merely to the choice of annotation conventions used for each language. Using HamleDT should make it possible to separate these two sources of variance. Besides supervised and unsupervised multilingual parsing, homogeneity of the data is also essential for experiments on cross-lingual transfer of syntactic structures, be it based on projecting trees (Hwa et al. 2005) or on transferring delexicalized models (McDonald et al. 2011a).

The common style defined in HamleDT v1.5 serves as a reference point: the ability to say “our results are based on HamleDT v1.5 transformations of treebank XY” will facilitate the comparability of future results published in all these subfields.

The purpose of HamleDT is not to find a single set of annotation conventions that ideally suits all possible tasks concerning syntactic structures, as this can hardly be expected to be feasible. However, if a different annotation convention fits a particular task better, it is much simpler to transform all the treebanks into the desired shape once they have been collected and unified in HamleDT.

Last but not least, we believe that the unified representation of linguistic content may be advantageous for linguists, enabling them to compare languages based on treebank material without the need to study multiple annotation guidelines.

2 Related work

There have been a few attempts recently to address the same problem, namely:
  • Schwartz et al. (2012) define two measures of syntactic learnability and evaluate them using five different parsers on varying annotation styles of six phenomena (coordination, infinitives, noun phrases, noun sequences, prepositional phrases and verb groups). They work only with English; they generate varying annotations during the conversion of the Penn TreeBank WSJ corpus (Marcus et al. 1993) constituency annotation to dependencies.

  • Tsarfaty et al. (2011) compare the performance of two parsers on different constituency-to-dependency conversions of the (English) Penn Treebank. They do not see the solution in data transformations; instead, they develop an evaluation technique that is robust with respect to some3 annotation styles.

  • McDonald et al. (2011b) experiment with cross-language parser training, relying on a rather small universal set of part-of-speech tags. They do not transform syntactic structures, however. They note that different annotation schemes across treebanks are responsible for the fact that some language pairs work better together than others. They use English as the source language and Danish, Dutch, German, Greek, Italian, Portuguese, Spanish, and Swedish as target languages.

  • Seginer (2007) discusses possible annotation schemes for coordination structures and relative clauses in relation to his common cover link representation.

  • Bosco et al. (2010) compare three different dependency parsers developed and tested with respect to two Italian treebanks.

  • Bengoetxea and Gojenola (2009) evaluate three types of transformations on Basque: transformation of subordinate sentences, coordinations and projectivization. An important difference between their approach and ours is that their transformations can change tokenization.

  • Nilsson et al. (2006) show that transformations of coordination and verb groups improve parsing of Czech.

3 Data

We identified over 30 languages for which treebanks exist and are available for research purposes. Most of them can either be acquired free of charge or are included in the Linguistic Data Consortium4 membership fee.

Most of the treebanks are natively based on dependencies, but some were originally based on constituents and transformed via a head-selection procedure. For instance, Spanish phrase-structure trees were converted to dependencies using the method of Civit et al. (2006).

HamleDT v1.5 currently covers 29 treebanks, with several others to be added soon. Table 1 lists the treebanks along with their data sizes. In the following, we use ISO 639 language codes in square brackets to refer to the treebanks of these languages, so e.g. [en] refers to the English treebank. A list of all 29 treebanks with references is included in Appendix 1.
Table 1

Overview of data resources included in HamleDT v1.5

| Language | Prim. tree type | Used data source | Sents. | Tokens | Train/test (% sents) | Avg. sent. length | Nonprj. deps. (%) |
|---|---|---|---|---|---|---|---|
| Arabic (ar) | dep | C2007 | 3,043 | 116,793 | 96/4 | 38.38 | 0.37 |
| Basque (eu) | dep | prim | 11,226 | 151,604 | 90/10 | 13.50 | 1.27 |
| Bengali (bn) | dep | I2010 | 1,129 | 7,252 | 87/13 | 6.42 | 1.08 |
| Bulgarian (bg) | phr | C2006 | 13,221 | 196,151 | 97/3 | 14.84 | 0.38 |
| Catalan (ca) | phr | C2009 | 14,924 | 443,317 | 88/12 | 29.70 | 0.00 |
| Czech (cs) | dep | C2007 | 25,650 | 437,020 | 99/1 | 17.04 | 1.91 |
| Danish (da) | dep | C2006 | 5,512 | 100,238 | 94/6 | 18.19 | 0.99 |
| Dutch (nl) | phr | C2006 | 13,735 | 200,654 | 97/3 | 14.61 | 5.41 |
| English (en) | phr | C2007 | 18,577 | 446,573 | 99/1 | 24.03 | 0.33 |
| Estonian (et) | phr | prim | 1,315 | 9,491 | 90/10 | 7.22 | 0.07 |
| Finnish (fi) | dep | prim | 4,307 | 58,576 | 90/10 | 13.60 | 0.51 |
| German (de) | phr | C2009 | 38,020 | 680,710 | 95/5 | 17.90 | 2.33 |
| Greek (el) | dep | C2007 | 2,902 | 70,223 | 93/7 | 24.20 | 1.17 |
| Ancient Greek (grc) | dep | prim | 21,160 | 308,882 | 98/2 | 14.60 | 19.58 |
| Hindi (hi) | dep | I2010 | 3,515 | 77,068 | 85/15 | 21.93 | 1.12 |
| Hungarian (hu) | phr | C2007 | 6,424 | 139,143 | 94/6 | 21.66 | 2.90 |
| Italian (it) | dep | C2007 | 3,359 | 76,295 | 93/7 | 22.71 | 0.46 |
| Japanese (ja) | dep | C2006 | 17,753 | 157,172 | 96/4 | 8.85 | 1.10 |
| Latin (la) | dep | prim | 3,473 | 53,143 | 91/9 | 15.30 | 7.61 |
| Persian (fa) | dep | prim | 12,455 | 189,572 | 97/3 | 15.22 | 1.77 |
| Portuguese (pt) | phr | C2006 | 9,359 | 212,545 | 97/3 | 22.71 | 1.31 |
| Romanian (ro) | dep | prim | 4,042 | 36,150 | 93/7 | 8.94 | 0.00 |
| Russian (ru) | dep | prim | 34,895 | 497,465 | 99/1 | 14.26 | 0.83 |
| Slovene (sl) | dep | C2006 | 1,936 | 35,140 | 79/21 | 18.15 | 1.92 |
| Spanish (es) | phr | C2009 | 15,984 | 477,810 | 90/10 | 29.89 | 0.00 |
| Swedish (sv) | phr | C2006 | 11,431 | 197,123 | 97/3 | 17.24 | 0.98 |
| Tamil (ta) | dep | prim | 600 | 9,581 | 80/20 | 15.97 | 0.16 |
| Telugu (te) | dep | I2010 | 1,450 | 5,722 | 90/10 | 3.95 | 0.23 |
| Turkish (tr) | dep | C2007 | 5,935 | 69,695 | 95/5 | 11.74 | 5.33 |

The average sentence length is the number of tokens divided by the number of sentences; varying tokenization schemes obviously influence these numbers (see Sect. 5.7 for details on the individual languages). In the data source column, C means "CoNLL shared task", I means "ICON" and prim means a primary (non-shared-task) source. The last column gives the percentage of nodes attached non-projectively.

Many treebanks (especially those used in CoNLL shared tasks) define a train/test data split. This is important for the comparability of experiments with automated parsing and part-of-speech tagging. We preserve the original data division and define test subsets for the remaining treebanks as well. In doing so, we try to keep the test size similar to the majority of CoNLL 2006/2007 test sets, i.e., roughly 5,000 tokens.
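For treebanks without a predefined split, the procedure sketched above can be illustrated as follows. This is a hypothetical sketch of one way to hold out roughly 5,000 tokens from the end of a corpus; HamleDT keeps the original CoNLL divisions where they exist and defines its own split points per treebank.

```python
def split_train_test(sentences, test_tokens=5000):
    """Reserve final sentences for testing until roughly test_tokens
    tokens are held out. Hypothetical illustration, not the actual
    HamleDT splitting code."""
    count, cut = 0, len(sentences)
    while cut > 0 and count < test_tokens:
        cut -= 1
        count += len(sentences[cut])  # tokens in the newly held-out sentence
    return sentences[:cut], sentences[cut:]

# e.g., 1,000 sentences of 10 tokens each: the last 500 are held out
sents = [["tok"] * 10 for _ in range(1000)]
train, test = split_train_test(sents)
```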

Throughout this article, a dependency tree is an abstract structure of nodes and dependencies that capture syntactic relations in a sentence. Nodes correspond to the tokens of the sentence, i.e. to words, numbers, punctuation and other symbols (see Sect. 5.7 for more on tokenization). Besides the actual word form, the node typically holds additional attributes of the token, such as its lemma and part of speech. Dependencies are directed arcs between nodes. Every node is attached to (depends on) exactly one other node, called its parent. We draw the dependency as an arrow going from the parent to the child. Thus every node has one incoming dependency and any number of outgoing dependencies. There is one exception: an artificial root node that does not correspond to any real token and has only outgoing dependencies. Dependencies have labels that mark the type of the relation.
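The tree model just described can be sketched as a minimal data structure. This is a hypothetical Python illustration (the actual HamleDT/Treex implementation is written in Perl and far richer): every real node has exactly one parent and one incoming labeled dependency, while the artificial root has only outgoing dependencies.

```python
class Node:
    """One token in a dependency tree; the artificial root has form=None."""
    def __init__(self, form=None, lemma=None, pos=None):
        self.form, self.lemma, self.pos = form, lemma, pos
        self.deprel = None      # label of the single incoming dependency
        self.parent = None      # every real node depends on exactly one node
        self.children = []      # any number of outgoing dependencies

    def attach_to(self, parent, deprel):
        """Create (or move) this node's single incoming dependency."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = parent
        self.deprel = deprel    # the label marks the type of the relation
        parent.children.append(self)

# A tiny tree for "Mary cried": root -> cried (Pred) -> Mary (Sb)
root = Node()
cried = Node("cried", "cry", "verb")
cried.attach_to(root, "Pred")
mary = Node("Mary", "Mary", "noun")
mary.attach_to(cried, "Sb")
```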

Most diagrams in this article (Fig. 1 and onwards) depict just a snippet of the sentence, i.e. a subtree. Selected tokens (word forms) are shown in a sequence respecting the word order, with dependencies drawn as labeled arrows between two tokens (nodes). The artificial root of the whole sentence is never shown; the root token of the subtree has one incoming dependency going straight down (from an invisible parent). The relation between the subtree and its invisible parent is labeled X (it does not make sense to show the real relation type without the parent).
Fig. 1

Nested coordination in the Prague style. X represents the relation of the whole structure to its parent. _M denotes members of coordination, i.e., conjuncts

4 Harmonization

Our effort aims at identifying all syntactic constructions that are annotated differently in different treebanks. Once a particular construction is identified, we can typically find all its instances in the treebank using existing syntactic and morphological tags, i.e., with little or no lexical knowledge. Thanks to this fact, we were able to design algorithms to normalize the annotations of many linguistic phenomena to a single style, which we refer to as the HamleDT v1.5 style.

The HamleDT v1.5 style is mostly derived from the annotation style of the Prague Dependency Treebank (PDT, Hajič et al. 2006).5 This is, to a large extent, a matter of convenience: it is the scheme with which the authors feel most at home, and many of the included treebanks already use a style similar to PDT. We do not claim that the HamleDT v1.5 style is objectively better than other styles. (Note, however, that in the case of coordination, the HamleDT v1.5 style provides more expressive power than the other options, as described in Sect. 5.1.)

The normalization procedure involves both structural transformations and changes to dependency relation labels. While we strive to design the structural transformations to be as reversible as possible, we do not attempt to preserve all information stored in the dependency labels. The original6 labels vary widely across treebanks, ranging from very simple labels, e.g., NMOD "generic noun modifier" in [en], through standard subject, object, etc. relations, to deep-level functions of Pāṇinian grammar such as karta and karma (k1 and k2) in [hi, bn, te].7 It does not seem possible to unify these tagsets without manually relabeling whole treebanks.

We use a lossy scheme that maps the dependency labels onto the moderately sized tagset of PDT analytical functions8 (see Table 2).
Table 2

Selected types of dependency relations and their relative frequency in the harmonized treebanks

| Language | Atr | Adv | Obj | AuxP | Sb | Pred | Coord | AuxV | AuxC | Rest |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic (ar) | 36.5 | 6.4 | 9.1 | 14.2 | 6.3 | 3.1 | 4.0 | 0.0 | 2.3 | 18.2 |
| Basque (eu) | 19.6 | 24.0 | 8.7 | 0.0 | 7.2 | 5.7 | 3.4 | 8.3 | 1.0 | 22.2 |
| Bengali (bn) | 18.2 | 22.7 | 17.9 | 0.0 | 16.6 | 16.7 | 4.9 | 0.0 | 0.0 | 3.0 |
| Bulgarian (bg) | 23.3 | 8.8 | 12.8 | 14.6 | 7.7 | 7.3 | 3.1 | 0.8 | 3.3 | 18.4 |
| Catalan (ca) | 22.4 | 16.7 | 5.2 | 9.9 | 7.4 | 8.1 | 2.9 | 9.3 | 1.8 | 16.4 |
| Czech (cs) | 28.5 | 10.4 | 8.1 | 9.9 | 7.1 | 6.0 | 4.1 | 1.2 | 1.7 | 23.1 |
| Danish (da) | 23.8 | 12.2 | 12.1 | 10.7 | 9.8 | 5.3 | 3.4 | 0.0 | 3.4 | 19.3 |
| Dutch (nl) | 14.1 | 24.7 | 6.8 | 10.3 | 8.5 | 7.4 | 2.1 | 5.2 | 3.7 | 17.2 |
| English (en) | 30.0 | 12.0 | 5.7 | 9.8 | 7.9 | 4.3 | 2.2 | 4.0 | 1.8 | 22.2 |
| Estonian (et) | 12.8 | 25.7 | 6.6 | 5.9 | 13.0 | 14.1 | 1.3 | 2.6 | 0.6 | 17.4 |
| Finnish (fi) | 29.7 | 18.2 | 7.8 | 1.5 | 9.4 | 8.3 | 4.1 | 1.6 | 1.2 | 18.2 |
| German (de) | 31.2 | 11.8 | 10.4 | 10.1 | 7.9 | 5.3 | 2.8 | 0.5 | 1.2 | 18.7 |
| Ancient Greek (grc) | 15.4 | 13.0 | 14.2 | 3.8 | 7.7 | 8.6 | 6.5 | 0.0 | 1.4 | 29.4 |
| Greek (el) | 39.8 | 9.9 | 7.5 | 8.3 | 7.1 | 4.5 | 3.2 | 4.0 | 1.6 | 14.0 |
| Hindi (hi) | 26.8 | 13.4 | 9.6 | 21.1 | 6.8 | 5.3 | 2.4 | 6.3 | 1.6 | 6.8 |
| Hungarian (hu) | 30.4 | 13.9 | 5.2 | 1.6 | 5.9 | 8.3 | 2.4 | 1.3 | 1.6 | 29.2 |
| Italian (it) | 22.2 | 12.4 | 4.9 | 14.7 | 5.2 | 4.8 | 3.3 | 2.8 | 1.1 | 28.5 |
| Japanese (ja) | 11.5 | 16.6 | 0.6 | 5.8 | 3.4 | 7.3 | 0.3 | 0.0 | 0.0 | 54.6 |
| Latin (la) | 17.9 | 13.7 | 15.9 | 5.3 | 10.6 | 8.8 | 6.6 | 1.1 | 3.1 | 17.2 |
| Persian (fa) | 25.3 | 8.8 | 10.0 | 14.0 | 6.4 | 7.7 | 4.1 | 0.1 | 2.7 | 20.8 |
| Portuguese (pt) | 24.6 | 24.0 | 7.1 | 11.4 | 6.0 | 4.3 | 2.4 | 0.0 | 1.0 | 19.0 |
| Romanian (ro) | 27.7 | 13.3 | 7.2 | 17.6 | 8.5 | 11.2 | 1.8 | 7.7 | 0.0 | 5.0 |
| Russian (ru) | 30.4 | 16.9 | 16.3 | 12.3 | 10.4 | 6.2 | 4.0 | 0.0 | 1.6 | 1.9 |
| Slovene (sl) | 15.0 | 10.9 | 8.1 | 7.3 | 5.9 | 7.2 | 4.3 | 9.4 | 3.7 | 28.1 |
| Spanish (es) | 22.8 | 16.9 | 5.1 | 9.0 | 7.8 | 8.7 | 2.8 | 8.0 | 2.0 | 17.0 |
| Swedish (sv) | 19.3 | 19.5 | 6.9 | 9.3 | 10.8 | 6.4 | 3.9 | 2.5 | 2.7 | 18.8 |
| Tamil (ta) | 27.7 | 0.0 | 9.7 | 3.0 | 7.3 | 6.0 | 1.6 | 6.3 | 2.8 | 35.6 |
| Telugu (te) | 7.3 | 21.3 | 19.5 | 0.0 | 19.2 | 25.6 | 3.5 | 0.1 | 0.0 | 3.6 |
| Turkish (tr) | 38.5 | 8.0 | 10.8 | 1.9 | 6.9 | 9.5 | 3.8 | 0.0 | 1.4 | 19.2 |
| Average | 26.2 | 13.9 | 8.9 | 10.3 | 7.6 | 6.3 | 3.3 | 2.8 | 1.8 | 18.8 |

One can see repeated patterns in the table, such as the dominance of adverbials and attributes, or the relatively stable proportion of subjects. However, the numbers are still biased by imperfections in the conversion procedures (e.g., unrecognized AuxV in certain languages).

Abbreviations: Atr = attribute, Adv = adverbial, Obj = object, AuxP = preposition, Sb = subject, Pred = predicate, Coord = coordinating conjunction, AuxV = auxiliary verb, AuxC = subordinating conjunction

Occasionally, the original structure and dependency labels are not enough to determine the normalized output. For instance, the German label RC is assigned to all dependencies that attach a subordinate clause to its parent. The set of HamleDT v1.5 labels distinguishes clauses that act as nominal attributes (Atr) from those that substitute adverbial modifiers (Adv). We look at the part of speech of the parent: if it is a noun, we label the dependency Atr; if it is a verb, we label it Adv.9 In general, we may thus also consider the part of speech, the word form, or even further morphological properties. Since the morphological (part-of-speech) tagsets also vary greatly across treebanks, we use the Interset approach described by Zeman (2008) to access all morphological information. Interset is a kind of interlingua for parts of speech and morphosyntactic features. Its aim is to provide a unified representation for as many feature values in existing tagsets as possible. We created converters ("drivers") to Interset from all treebank tagsets for which one was not already available. The normalized treebanks thus provide Interset-unified morphology as well.
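The RC disambiguation rule just described can be sketched as follows. This is a simplified illustration: the real converters consult full Interset morphology rather than a bare POS string, and handle many more labels than this one.

```python
def harmonized_label(orig_label, parent_pos):
    """Disambiguate the German RC label using the parent's part of speech.
    Simplified sketch of the rule described in the text."""
    if orig_label == "RC":
        if parent_pos == "noun":
            return "Atr"   # clause modifying a noun acts as an attribute
        if parent_pos == "verb":
            return "Adv"   # clause modifying a verb acts as an adverbial
    return orig_label      # other labels handled by other rules
```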

In a typical scenario, the harmonization steps are ordered as follows:
  1. file format conversion (from various proprietary formats to a common-schema XML) and character encoding conversion (to UTF-8),
  2. conversion of morphological tags to the Interset tagset,
  3. conversion of dependency relation labels to the set of HamleDT labels,
  4. conversion of coordination structures into the HamleDT style (i.e., distinguishing members of coordination and shared modifiers, and attaching them to the main coordinating conjunction),
  5. other changes in the tree structure (i.e., rehanging nodes to make the dependent-governor relations comply with the HamleDT conventions, including relation orientation) and possibly further refinements of the dependency labels.
The last two points (tree transformations) represent the main focus of the present study; many detailed examples are provided in Sect. 5.
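The ordered steps above can be sketched as a pipeline skeleton. This is a hypothetical Python illustration with placeholder stage functions; the actual implementation is a sequence of Treex processing blocks written in Perl, and the names below are not its API.

```python
def harmonize(document, stages):
    """Apply the harmonization stages to a treebank document in their
    fixed order; coordination is normalized before the other structural
    transformations (step 4 before step 5)."""
    for stage in stages:
        document = stage(document)
    return document

# Placeholder stages standing in for the five steps; each one merely
# records the work it would do on the (here, list-valued) document.
def convert_file_format(doc):     return doc + ["format+encoding"]
def convert_morphology(doc):      return doc + ["interset"]
def convert_deprel_labels(doc):   return doc + ["labels"]
def normalize_coordination(doc):  return doc + ["coordination"]
def restructure_trees(doc):       return doc + ["structure"]

STAGES = [convert_file_format, convert_morphology, convert_deprel_labels,
          normalize_coordination, restructure_trees]
```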

The implementation of file format converters is relatively straightforward, even though reverse engineering is sometimes needed due to missing technical documentation.

Implementing an Interset converter typically requires around 200–500 lines of Perl code; the code is usually not very challenging from the algorithmic point of view, but it requires very good insight into the annotation guidelines of the respective resource.

Mapping the dependency labels is usually relatively simple to implement, too: sometimes it is enough just to recode the original label (e.g., Subj to Sb); sometimes the decision must be conditioned on the POS value of the node or of its parent; sometimes the rules are conditioned lexically or by certain structural properties of the tree. In all cases, however, it can be done relatively reliably.

More or less the same holds for rehanging the nodes in the fifth step. Typically, just a few dozen transformation rules are needed for the third and fifth steps (i.e., around 200 lines of Perl code).

The algorithmically most complex harmonization step is typically the proper treatment of coordination structures: resolving a coordination structure affects at least three nodes in most cases, coordinations can be nested, and they can combine with almost any dependency relation type. In addition, treebanks use multiple different encodings of coordination structures (17 in HamleDT v1.5), as analyzed in depth by Popel et al. (2013).

Performing the normalization of coordination structures before the normalization of other relations brings an important advantage: in step 5, it is possible to work with dependent-governor pairs of nodes in the sense of dependency (not just with child-parent node pairs as stored in the trees), regardless of whether the former or the latter (or both) are coordinated. Without this abstraction, even simple operations, such as swapping the relation orientation between nouns and prepositions, would become quite cumbersome, as one would have to keep all possible combinations in mind, e.g., "with A and B", "with A and with B", "with A and B or with C and D", "with or without A", etc. For more details, please refer to the concrete examples in Sect. 5.

5 Annotation styles for various phenomena

In this section, we present a selection of phenomena that we observed and, to various degrees for various languages, included in our normalization scenario. Language codes in brackets give examples of treebanks where the particular approach is employed. Some of the figures show artificial examples constructed for illustration and are marked as such in their captions; the remaining figures contain genuine examples found in real data, though some of them have been shortened.

Dependency relation labels from the original treebanks that appear in figures are briefly explained in Appendix 3.

5.1 Coordination

Capturing coordination in a dependency framework has been repeatedly described as difficult for both treebank designers and parsers (and it is generally regarded as an inherent difficulty of dependency syntax as such). Our analysis revealed four families of approaches, which may further vary in the attachment of punctuation, shared modifiers, etc.:
  • Prague (Figs. 1, 2, 8). All conjuncts are headed by the conjunction. Used in [ar, bn, cs, el, en, eu, grc, hi, la, nl, sl, ta, te] (Hajič et al. 2006).

  • Mel’čukian (Fig. 3). The first/last conjunct is the head, others are organized in a chain. Used in [de, ja, ru, sv, tr] (Mel’čuk 1988).

  • Stanford (Fig. 4). The first/last conjunct is the head, others are attached directly to it. Used in [bg, ca, es, fi, it, pt] (de Marneffe and Manning 2008).

  • Tesnièrian (Fig. 5). There is no common head, all conjuncts are attached directly to the node modified by the coordination structure. Used in [hu] (Tesnière 1959).

Furthermore, the Prague style provides for nested coordinations, as in apples and pears or oranges and lemons (see Fig. 1). The asymmetric treatment of conjuncts in the other styles makes nested coordination difficult to read, or even impossible to capture in some situations. The Prague style also distinguishes shared modifiers, such as the subject in Mary came and cried, from private modifiers of the conjuncts, as in John came and Mary cried (see Fig. 2). Because this distinction is missing in non-Prague-style treebanks, we cannot recover it reliably. We apply several heuristics, but in most cases, the modifiers of the head conjunct are classified as private modifiers.
Fig. 2

Shared and private modifiers in the Prague style [cs]: Car (auto) is an object shared by all three verbs while the adverbials (on Monday, yesterday, today) are private. The whole structure is in the predicate relation to its parent (which is probably the sentence root), so using the notation of Fig. 1: X = Pred

Fig. 3

Coordination in the Mel’čukian style as seen in [de]

Fig. 4

Coordination in the Stanford style as seen in [ca]

Fig. 5

Coordination in the Tesnièrian style as seen in [hu]. All participating nodes are attached directly to the parent of the coordination

Danish (Fig. 6) employs a mixture of the Stanford and Mel’čukian styles where the last conjunct is attached indirectly via the conjunction. The Romanian and Russian treebanks omit punctuation tokens (they do not have corresponding nodes in the trees); in the case of Romanian, this means that coordinations of more than two conjuncts are disconnected (Fig. 7).
Fig. 6

Danish mixture of Stanford and Mel’čukian coordination styles

Fig. 7

[ro] uses Prague coordination style mixed with Tesnièrian because punctuation is missing from the data

Given the advantages described above, we decided to use the Prague style (in its [cs] flavor) in our harmonized data. There is just one drawback that we are aware of: occasionally, there may be no node suitable to serve as the coordination head. Most asyndetic constructions do not pose a problem because they contain commas or other punctuation. Without punctuation, the Prague style would need an extra node; that solution has been adopted by the authors of the [ta] treebank (see Fig. 8). Note that one half of our treebanks already use the Prague style as their native approach, so they always have a coordination head. In the other half, a fraction of coordinate structures cannot be fully converted (unless we add a new node, which we do not do in the current version of HamleDT). For example, 14 of the 5,988 coordinate structures in [bg] (0.23 %) lack any conjunction or punctuation that could be made the head. In these cases we currently use the first conjunct instead, effectively backing off to the Stanford style.
Fig. 8

Coordination in [ta]: The coordinating function is performed by the two morphological suffixes -um. They had to be made separate nodes during tokenization because [ta] uses the Prague style and no other coordination head was available except these morphological indicators
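One of these conversions, rehanging a flat Stanford-style coordination into the Prague style, can be sketched on dict-based toy trees. This is a heavily simplified illustration: the field names (cc, conj, is_member) are hypothetical, the is_member flag corresponds to the _M mark of Fig. 1, and the real HamleDT conversion additionally handles nesting, shared modifiers, punctuation, and all 17 input encodings.

```python
def stanford_to_prague(head_conjunct):
    """Rehang a flat Stanford-style coordination: the conjunction becomes
    the head (labeled Coord) and all conjuncts become its children.
    Simplified sketch: assumes no nesting; when no conjunction exists,
    it backs off to keeping the first conjunct as head, as described
    above for [bg]."""
    kids = head_conjunct["children"]
    conjuncts = [k for k in kids if k["deprel"] == "conj"]
    conjunctions = [k for k in kids if k["deprel"] == "cc"]
    if not conjunctions:
        for c in [head_conjunct] + conjuncts:
            c["is_member"] = True        # still mark the conjuncts
        return head_conjunct             # back off: first conjunct as head
    new_head = conjunctions[0]
    new_head["deprel"] = "Coord"
    new_head["children"] = [head_conjunct] + conjuncts
    head_conjunct["children"] = [k for k in kids
                                 if k not in conjuncts + conjunctions]
    for c in [head_conjunct] + conjuncts:
        c["is_member"] = True            # _M mark of the Prague style
    return new_head

# "apples and pears" in Stanford style: apples heads both "and" and "pears"
apples = {"form": "apples", "deprel": "X", "children": [
    {"form": "and", "deprel": "cc", "children": []},
    {"form": "pears", "deprel": "conj", "children": []},
]}
tree = stanford_to_prague(apples)
```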

5.2 Prepositions

Prepositions (or postpositions; Figs. 9, 10, 11) can either govern their noun phrase (NP) [cs, en, sl, ...] or modify the head of their NP [hi]. When they govern the NP, other modifiers of the main noun are attached either to the noun (in most cases) or to the preposition [de]. The label of the relation of the prepositional phrase to its parent is sometimes found at the preposition [de, en, nl]. Elsewhere, the preposition gets an auxiliary label (such as AuxP in PDT) despite serving as the head, and the real label is found at the NP head [cs, sl, ar, el, la, grc].

In the HamleDT v1.5 style, prepositions govern their noun phrase because (1) they may govern the form of the noun phrase (e.g., [cs, ru, sl, de]) and (2) this is the approach taken in most of the treebanks we studied. Other modifiers inside the prepositional phrase, such as determiners and adjectives, depend on the head of the embedded noun phrase. The preposition is labeled with the auxiliary tag AuxP, and the real relation between the prepositional phrase and its parent is labeled at the NP head.
Fig. 9

A prepositional phrase in [cs]

Fig. 10

A prepositional phrase in [de]

Fig. 11

A postpositional phrase in [hi]
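The orientation swap described in this section can be sketched on a dict-based toy tree. This is a hypothetical illustration (field names and labels are ours, and the real transformation operates on Treex node objects): the [hi]-style input attaches the postposition below the noun, and the conversion makes the adposition the head, labels it AuxP, and leaves the real relation label on the noun.

```python
def lift_adposition(noun):
    """Swap the noun-adposition orientation: the adposition becomes the
    head (labeled AuxP) while the noun keeps the real relation label of
    the whole phrase. Simplified sketch: assumes at most one adposition
    child and leaves all other modifiers attached to the noun."""
    adps = [c for c in noun["children"] if c["pos"] == "adp"]
    if not adps:
        return noun                      # nothing to do
    adp = adps[0]
    noun["children"] = [c for c in noun["children"] if c is not adp]
    adp["deprel"] = "AuxP"               # auxiliary label on the adposition
    adp["children"] = [noun]             # adposition now governs the NP
    return adp

# [hi]-style toy input for "ghar me" ("in the house"): the noun 'ghar'
# heads the postposition 'me'; the labels here are illustrative only
ghar = {"form": "ghar", "pos": "noun", "deprel": "Adv", "children": [
    {"form": "me", "pos": "adp", "deprel": "psp", "children": []},
]}
phrase = lift_adposition(ghar)
```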

5.3 Subordinate clauses

There are three main types of subordinate clauses:
  • Relative clauses: they modify noun phrases. They are typically marked by relative pronouns that represent the modified noun and its function within the relative clause. Example: The man who came yesterday.

  • Complement clauses: they serve as arguments of predicates, typically verbs. They are marked by subordinating conjunctions. Example: The man said that he came yesterday.

  • Adverbial clauses: they modify predicates in the same way as adverbs, but they are not selected as arguments. Example: If the man comes today, he will say more.

Roots (predicates) of relative clauses are usually attached to the noun they modify, e.g., in the man who came yesterday, came would be attached to man and who would be attached to came as its subject.

The predicate-modifying clauses use a subordinating conjunction (complementizer, adverbializer) to express their relation to the governing predicate. In treebanks, the conjunction is either attached to the predicate of the subordinate clause [es, ca, pt, de, ro] (Fig. 12), or it is placed between the embedded clause and the main predicate it modifies, i.e., it governs the clause and is itself attached to the main predicate [cs, en, hi, it, la, ru, sl] (Fig. 13). In the latter case, the label of the relation of the subordinate clause to its parent can be assigned to the conjunction [en, hi, it] or to the clausal predicate [cs, la, sl] (Fig. 14). The comma before the conjunction is attached either to the conjunction or to the subordinate predicate.
Fig. 12

Subordinate clause in [es]

Fig. 13

Subordinate clause in [it]

Fig. 14

Subordinate clause in [la]

The subordinating conjunction may also be attached as a sibling of the subordinate clause [hu], an analogy to the Tesnièrian coordination style (Fig. 15). In Fig. 16, a direct question in [tr] is rooted by the question mark, which is attached to a subordinating postposition.
Fig. 15

Subordinate clause in [hu]

Fig. 16

Subordinate clause (direct speech) in [tr]: Here, the question mark serves as the head of the direct question

The Romanian treebank is segmented into clauses instead of sentences, so every clause has its own tree, and inter-clausal relations are not annotated.

HamleDT v1.5 style follows the [cs, sl, la] approach to subordinate clauses (see Figs. 14, 19).

5.4 Verb groups

Various sorts of verb groups include analytical verb forms (such as auxiliary + participle), modal verbs with infinitives, and similar constructions. Dependency relations, both internal (between group elements) and external (leading to the parent on one side and to verb modifiers on the other), may be defined according to various criteria: content verb versus auxiliary, finite form versus infinitive, or subject-verb agreement, which typically holds for finite verbs, sometimes for participles, but not for infinitives (Figs. 17, 18).

Participles often govern auxiliaries [es, ca, it, ro, sl] (Figs. 19, 20); elsewhere the finite verb is the head [pt, de, nl, en, sv, ru] (Figs. 17, 22, 23); finally, [cs] mixes both approaches based on semantic criteria. In [hi, ta], the content verb, which can be a participle or a bare verb stem, is the head, and auxiliaries (finite forms or participles) are attached to it (Figs. 21, 24, 25, 26, 27).
Fig. 17

Passive construction in [ru]: the finite auxiliary verb is the head and the passive participle depends on it. As a result, the agent is attached non-projectively to the participle

Fig. 18

Modal passive construction in [bg]: the finite modal verb is the head and the infinitive particle is the second-level head. The infinitive auxiliary is attached to the particle, as are the passive participle of the content verb and the two arguments of the content verb, one of them non-projectively

Fig. 19

Past tense in [cs]: The participle of the content verb (očekával) governs the finite form of the auxiliary (jsem). Making the auxiliary the head would cause problems because it is not always present, e.g., omitting it in this sentence would just shift the sentence to the 3rd person meaning (He expected a girl would come)

Fig. 20

Negated conditional construction in [sl]. The past participle of the content verb (vstopil) is the head, the negative particle (ne) and the auxiliary (bi) depend on it

Fig. 21

Past modal construction in [nl]. The finite auxiliary verb (had) is the head. The subject (ze) is attached to the finite verb (had) while the modifier (met haar moeder) is attached non-projectively to the content verb (winkelen)

Fig. 22

Another example from [nl]. Unlike in other treebanks, even the subject (ze) is attached to the non-head participle (uitgevonden)

Fig. 23

A combination of perfect tense, modal verb, and infinitive in [de]. Infinitives are attached to modals as their objects in many treebanks, including [de]. The finite auxiliary verb (hat) is the head of the perfect tense, the participle (gesagt) depends on it. The subject (er) is attached to the finite verb (hat) while the object clause (was er eigentlich machen will) is attached to the content verb (gesagt)

Fig. 24

Infinitive with preposition in [pt]: The preposition (de) is not attached as an intermediate node between the phase verb (deixa) and the infinitive (ser). The negative particle (não) is attached non-projectively to the non-head verb (ser). Moreover, the commas around the parenthetical (em os tempos que correm) are also non-projective

Fig. 25

[ja] Desu is the polite copula. Aite is the conjunctive form of aku = “to open”. The auxiliary iru together with the conjunctive form of the content verb forms the progressive tense. Japanese is an SOV language and left-branching structures are strongly preferred

Fig. 26

[fa] Note that the dependency tree of the sentence (În mehmânî tartîb šod dâde.) is ordered right-to-left, the way Persian is written. The analytical passive šod dâde is represented by a single node (token)

Fig. 27

[hi] Kara is a light verb stem, svīkāra karanā means “to accept”. Liyā, the perfect participle of lenā “to take”, is another light verb, specifying the direction of the result of the action. Hai is the auxiliary verb “to be” in finite form. Content verbs govern verbal groups in the [hi] treebank; as the main verb in this case is a compound verb (svīkāra kara), the head node of the two (kara) governs the whole group, even though the real content lies in the nominal element (svīkāra)

The head typically holds the label describing the relation of the whole verbal group to its parent. As for child nodes, subjects and negative particles are often attached to the head, especially if it is the finite element [de, en], while the arguments (objects) are attached to the content element whose valency slot they fill (often participle or infinitive). Sometimes even the subject (in [nl]) or the negative particle (in [pt]) can be attached to the non-head content element (Fig. 22). Various infinitive-marking particles (English “to”, Swedish “att”, Bulgarian “Open image in new window”) are usually treated similarly to subordinating conjunctions, i.e., they either govern the infinitive [en, da, bg] or are attached to it [de, sv]. In [pt], prepositions used between the main verb and the infinitive (“estão a usufruir” = “are enjoying”) are attached to the finite verb (Fig. 24). In [bg], all modifiers of the verb including the subject are attached to the infinitive particle Open image in new window instead of the verb below it (Fig. 18).

We intend to unify verbal groups under a common approach, but the current version 1.5 of HamleDT does not do so yet. This part is more language-dependent than the others and further analysis is needed.

5.5 Determiner heads

The Danish treebank is probably the most extraordinary one. Nouns often depend on determiners, numerals, etc. (see Fig. 28). This approach is very rare in dependency treebanks, although it has its advocates among linguists (Hudson 2004, 2010).10

In HamleDT v1.5, we attach articles as well as other determiners to their nouns and numerals to the counted nouns.11
Fig. 28

Two fragments from [da] show determiners and numerals governing noun phrases
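The reattachment described above can be pictured as a simple head swap. The following is a minimal sketch in Python, not the actual Treex (Perl) implementation; the `Node` class and its field names are illustrative assumptions:

```python
# Hedged sketch of the head swap used to normalize [da] determiner heads:
# if a noun depends on a determiner (or a counted noun on a numeral), make
# the noun the head and re-attach the determiner and its other dependents
# to the noun. Node/field names are illustrative, not the Treex API.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    form: str
    pos: str                                 # harmonized POS, e.g. "NOUN", "DET", "NUM"
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def attach_to(self, new_parent: "Node") -> None:
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        new_parent.children.append(self)


def lower_determiners(nodes: List[Node]) -> None:
    """Re-hang phrases headed by determiners/numerals (HamleDT v1.5 style)."""
    for det in nodes:
        if det.pos not in ("DET", "NUM"):
            continue
        noun = next((c for c in det.children if c.pos == "NOUN"), None)
        if noun is None or det.parent is None:
            continue
        noun.attach_to(det.parent)           # the noun takes over the determiner's position
        for sibling in list(det.children):
            sibling.attach_to(noun)          # other dependents follow the new head
        det.attach_to(noun)                  # the determiner now depends on the noun
```

For a [da] fragment like “to huse” (“two houses”) annotated as NUM → NOUN, the transform produces the harmonized NOUN → NUM attachment.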

5.6 Punctuation

Table 3 presents an overview of punctuation treatment in the treebanks. Details and exceptions are discussed below. The type codes at paragraph beginnings refer to the columns of the table.

Pair/Pcom: Paired punctuation marks (quotation marks, brackets, parenthesizing commas (Pcom) or dashes) are typically attached to the head of the segment between them. Occasionally, they are attached one level higher, to the parent of the enclosed segment, or even higher, if the parent is member of a verbal group. Attaching punctuation to higher levels may break projectivity, as in Fig. 24. The [pt] approach attaches paired punctuation to the parent of the interior segment (i.e. to the parent of the head of the segment, not to the head), unless the parent is the root or there are tokens outside the punctuation that depend on the head inside. In this latter case, the punctuation is attached to the inner head. In [tr], the Pcom column does not necessarily refer to paired punctuation; some commas are just attached to the root, which may result in non-projectivity.

Rcom: Similarly, commas before and after a relative clause are typically attached either to the root of the relative clause (be it verb or conjunction) or to its parent. In [la], the clause is sometimes headed by a subordinating conjunction, but the comma is attached to the verb below. Note, however, that a comma terminating a clause may have multiple functions: it may at the same time delimit several nested clauses, a parenthetical phrase, and/or a conjunct.

In several languages, commas (in [fa]) or all punctuation symbols (in [eu, it, nl]) are systematically attached to neighboring tokens.

Coord: Commas, semicolons, or dashes can also substitute coordinating conjunctions, which is important especially if the Prague style of coordination is used (see Sect. 5.1). In [te], this is the sole function of commas (see Fig. 29). In [da], which does not follow the Prague approach to coordination, we observed two adjectives modifying the same noun, separated by a comma; the comma was attached to the first “conjunct”. We list the case in the Coord column although the structure was not formally tagged as coordination. In [hu], coordinating commas are normally attached to the parent of the coordination. Parents that are roots of the tree are an exception: in such cases, the comma is used as the head of the coordination.
Fig. 29

Coordinating comma in [te]

Fig. 30

NULL-like usage of period in [bn]. The node with the period represents a dropped copula. Elsewhere in the treebank, such nodes are labeled by the pseudo-word-form “NULL”

Table 3

Punctuation styles overview

Language          Fin     Pair    Pcom    Rcom    Coord    Coor1    Apos
Arabic (ar)       RN      SH      SH              HD       (PT)
Basque (eu)       PT      PT      PT      PT               PT       PT
Bengali (bn)      (MP*)                           HD
Bulgarian (bg)    MP      SH      SH      SH      SH       SH       SH
Catalan (ca)      MP      SH      SH      SH      SH       SH       SH
Czech (cs)        RN      SH      SH      SH      HD       SH       HD
Danish (da)       MP      SH              SP/SH   PT*      SH
Dutch (nl)        PW      PW      PW      PW               PW       PW
English (en)      MP      SH?     SP      SP      HD       SH       SP
Estonian (et)     MP      SH|SP   SP      SH|SP   HD       SH       SP
Finnish (fi)      MP      SH      SH      SH      SH       SH
German (de)       MP      SH?     SP      SP      SH       PC
Greek (el)        RN      SH                      HD       SH       HD
Greek (grc)       RN      (SH)    SH      SH      HD       SH       HD
Hindi (hi)        MP      SH              (SP)             PC|PT
Hungarian (hu)    MP      SH|SP   SH|SP   SP      HD|SP*   SP
Italian (it)      PT      NT/PT   PT      PT               PT       PT
Japanese (ja)     MP*
Latin (la)                        SH      SH*     HD       SH       HD
Persian (fa)      MP      SH      PT      PT               PT       PT
Portuguese (pt)   MP      SP*     SP      SP      SH       SH       SP
Romanian (ro)     No punctuation
Russian (ru)      No punctuation
Slovene (sl)      RN      SH      SH      SH      HD       SH       HD
Spanish (es)      MP      SH      SH      SH      SH       SH       SH
Swedish (sv)      MP      NT/PT   SP      SP      PC       PC       SP
Tamil (ta)        RN      SH      SP      SP      HD       SH
Telugu (te)       (MP)                            HD
Turkish (tr)      RR              RN*     CH      CH       CH
HamleDT v1.5      RN      SH      SH      SH      HD       SH       SH

RN = attached to the artificial root node
RR = attached to the artificial root and serving as root for the rest of the sentence, i.e., heading the main predicate
MP = attached to the main predicate
NT = attached to the next token
PT = attached to the previous token
PW = attached to the previous word (i.e., non-punctuation token)
PC = attached to the previous conjunct
SH = attached to the head of the rel. clause / subtree inside paired punct. / coordination / second appos. member
SP = attached to the (grand)parent node of the rel. clause / subtree inside paired punct., or to the first appos. member
CH = chain: attached to parent, and the head of the clause attached to the comma; for Coord, previous conjunct attached to comma, comma attached to next conjunct
HD = serving as head of coordination
(X) = rare in this treebank, based on very few observations
X/Y = initial X, final Y
X|Y = both observed
X? = unexplained exceptions observed
X* = see text for more details
Empty cell = not observed

Coor1: Multi-conjunct coordination often involves one conjunction and one or more commas. Even within the same coordination, multiple attachment schemes are possible for the commas (the previous conjunct, the head of the coordination, etc.). Additional commas are rare in [ar], where repeated conjunctions are more common.

Apos: Constructions in which two phrases describe the same object are called appositions. These are mostly, but not solely, noun phrases separated by a comma, dash, bracket, etc., as in “Nicoletta Calzolari, the chief editor”. Most treebanks treat appositions in the same way as parentheticals: the second phrase is attached to the first. Other treebanks regard appositions as coordinations: the punctuation serves as the head, with both phrases attached symmetrically.

Fin: Sentence-final punctuation (period, question mark, exclamation mark, three dots, semicolon, or colon) is attached to the artificial root node [cs, ar, sl, grc, ta], to the main predicate [bg, ca, da, de, en, es, et, fi, hu, pt, sv], or to the previous token [eu, it, ja, nl].12 In [la, ro, ru], there is no final punctuation. It is also extremely rare in [bn, te]; however, there are a few punctuation nodes in [bn] that govern other nodes in the sentence. These nodes should actually have been labeled NULL to represent a copula or other constituents missing from the surface (Fig. 30); such NULL nodes appear elsewhere in [bn]. Punctuation is attached to the artificial root node in [tr], but instead of being a sibling of the main predicate, it governs the predicate. Note that some languages (e.g. Czech) may require final quotation marks (if present) to appear after the final period; in [cs], such a quotation mark is not treated as final punctuation (unlike the period) and may end up attached non-projectively to the main verb.

A few treebanks [bg, cs, la, sl] use separate nodes for periods that mark abbreviations and ordinal numbers. These nodes are attached to the previous node (i.e., the abbreviation). In [cs], this rule has a higher priority even in cases where a period serves as an abbreviation marker and a sentence terminator at the same time. Most other treebanks are tokenized so that the period shares a node with the abbreviation (see also Sect. 5.7).

In HamleDT v1.5, we treat appositions as parentheticals; we attach paired punctuation to the root of the subtree inside and sentence-final punctuation to the artificial root node, mostly for consistency reasons. For the other punctuation types, further analysis is needed.
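One of the normalizations just described, moving sentence-final punctuation to the artificial root (the “RN” style in Table 3), can be sketched as follows. This is an illustrative Python sketch under a simplified token representation, not the actual Treex (Perl) code:

```python
# Hedged sketch of one punctuation normalization from Sect. 5.6: re-attach
# sentence-final punctuation to the artificial root node (head id 0),
# regardless of where the source treebank attached it. The dict-based
# token structure is an assumption for illustration, not the Treex API.

FINAL_PUNCT = {".", "!", "?", "...", ";", ":"}


def reattach_final_punctuation(tokens):
    """tokens: list of dicts with 'id', 'form', 'head' (0 = artificial root)."""
    if not tokens:
        return tokens
    last = tokens[-1]
    # Only re-attach a true leaf: if anything depends on the mark (as in [tr],
    # where the final punctuation governs the predicate), it needs the more
    # involved treatment discussed in the text.
    has_children = any(t["head"] == last["id"] for t in tokens)
    if last["form"] in FINAL_PUNCT and not has_children:
        last["head"] = 0
    return tokens
```

For a sentence whose final period was attached to the main predicate (the “MP” style), the sketch re-hangs it on the artificial root while leaving the rest of the tree untouched.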

5.7 Tokenization and sentence segmentation

Tokenization and sentence segmentation are the only aspects that remain unchanged in HamleDT: our harmonized trees always have the same number of nodes and sentences as the original annotation, despite some variability in the approaches we observe in the original treebanks.

5.7.1 Multi-word expressions and missing tokens

Some treebanks collapse multi-word expressions into single nodes [ca, da, es, eu, fa, hu, it, nl, pt, ro, ru]. Collapsing is restricted to personal names in [hu] and to named entities in [ro]. In [fa], it is used for analytical verb forms. The word form of the node is composed of all the participating words, joined by underscore characters or even by spaces [fa].

In [bn, te], dependencies are annotated between chunks instead of words (Fig. 31). Therefore, one node may represent a whole noun phrase with modifiers and postpositions. The treebank only shows chunk headwords, which means we cannot reconstruct the original sentence. On a similar note, punctuation tokens have been deleted from two treebanks ([ro, ru]; see also Sect. 5.6).
Fig. 31

The first cup of tea comes before the turnover. [bn] captures dependencies between chunks, not between tokens. Every sentence has been chunked and chunk headwords serve as nodes of the tree (their word forms are replaced by lemmas). The dotted dependencies below the sentence indicate which tokens belong to which chunk. Neither these dependencies nor the chunk-dependent words are visible in the treebank. The original sentences cannot be reconstructed from the trees

5.7.2 Split tokens

On the other hand, orthographic words may be split into syntactically autonomous parts in some treebanks [ar, fa]. For instance, the Arabic word Open image in new window (wabiālfālūjah = “and in al-Falujah”) is separated into wa/CONJ + bi/PREP + AlfAlwjp/NOUN_PROP. In [ta], the suffix -um indicating a coordination is treated as a separate token (see Sect. 5.1; Fig. 8).

5.7.3 Artificial nodes

Occasionally [bn, hi, te, ru], we see an inserted NULL node, which mostly stands for participants deleted on the surface, e.g., copulas [bn, ru] or conjuncts as in the Hindi example in Fig. 32.
Fig. 32

A NULL node for a deleted verb (serving as head of conjunct) in [hi]

Along the same lines, some treebanks of pro-drop languages [ca, es] use empty nodes (with artificial word “_”) representing missing subjects, as in the following Spanish sentence: “_ Afirmó que _ sigue el criterio europeo y que _ trata de incentivar el mercado donde no lo hay.” = “He said he follows the European standard and encourages the market where there is none.” All the underscores mark subjects of the following verbs and could be translated as “he”.

Underscore/NULL nodes also appear in [tr], where they encode additional information related to morphological derivation.

5.7.4 Sentence segmentation

Similarly to tokenization, we also treat sentence segmentation as fixed, despite some less usual solutions: in [ar], sentence-level units are paragraphs rather than sentences, which explains the high average sentence length in Table 1. In contrast, [ro] annotates every clause as a separate tree.

6 Obtaining HamleDT

Twelve harmonized treebanks from HamleDT v1.5 [ar, cs, da, fa, fi, grc, la, nl, pt, ro, sv, ta] are directly available for download from our web site:

http://ufal.mff.cuni.cz/hamledt.

The license terms of the rest of the treebanks prevent us from redistributing them directly (in their original or normalized form), but most of them are easily acquirable for research purposes under the links given in Appendix 1. We provide software that can be used to normalize and display the data after obtaining them from the original provider.

All the normalizations are implemented in Treex (formerly TectoMT) (Popel and Žabokrtský 2010), a modular open-source framework for structured language processing, written in Perl.13 In addition to normalization scripts for each treebank, Treex also contains other transformations, so, for example, coordinations in any treebank can be converted from the Prague style to the Stanford style.
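The core of such a Prague-to-Stanford coordination transform is a small re-hanging of heads. The following Python sketch illustrates the idea on a bare head map; it is an assumption-laden simplification (single conjunction, no shared modifiers, no dependency labels), not the Treex implementation:

```python
# Hedged sketch of Prague-to-Stanford coordination conversion (cf. Sect. 5.1).
# Prague style: the coordinating conjunction heads the conjuncts.
# Stanford style: the first conjunct is the head; the conjunction and the
# remaining conjuncts attach to it. The head-map representation is
# illustrative only.

def prague_to_stanford(heads, conjunct_ids, conj_id):
    """heads: {token_id: head_id} map (0 = artificial root);
    conjunct_ids: ids of the conjuncts hanging on conjunction conj_id."""
    first, *rest = sorted(conjunct_ids)
    heads[first] = heads[conj_id]      # first conjunct takes the conjunction's place
    heads[conj_id] = first             # conjunction depends on the first conjunct
    for cid in rest:                   # remaining conjuncts depend on the first one
        heads[cid] = first
    return heads
```

For “apples and pears” in Prague style (the conjunction at id 2 heading conjuncts 1 and 3), the sketch yields the Stanford attachment: the first conjunct heads both the conjunction and the second conjunct.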

The tree editor TrEd14 can open Treex files and display original and normalized trees side-by-side on multiple platforms.

7 Conclusion

We provide a thorough analysis and discussion of varying annotation approaches to a number of syntactic phenomena, as they appear in publicly available treebanks, for many languages.

We propose a method for automatic normalization of the discussed annotation styles. The method applies transformation rules conditioned on the original structural annotation, dependency labels and morphosyntactic tags. We also propose unification of the tag sets for parts of speech, morphosyntactic features, and dependency relation labels. We take care to make the structural transformations and the morphosyntactic tagset unification as reversible as possible.15

We provide an implementation of the transformations in the Treex NLP framework. Treex can also be used for transforming the data to other annotation styles besides the one we propose (cf. Popel et al. 2013). The resulting collection of harmonized treebanks, called HamleDT v1.5, is available to the research community according to the original licenses. A subset of the treebanks whose license terms permit redistribution is available directly from us. For the rest, users need to acquire the original data and apply our transformation tool.

Several future directions of our work are possible. Besides deepening the current level of harmonization (especially for verbal groups), we plan to add new treebanks and languages for which resources exist (e.g., French, Hebrew, Chinese, Icelandic, Ukrainian or Georgian). We also want to run parsing experiments and evaluate the various annotation styles from the point of view of learnability by parsers.

Footnotes
1

The initial version has been described in Zeman et al. (2012).

 
2

HamleDT v1.5 does not include the harmonization of verbal groups (see Sect. 5.4).

 
3

The transformations are not robust to coordination styles.

 
5

So far, there are only two differences between the PDT style (used in [cs]) and the HamleDT v1.5 style: handling of appositions (see Table 3) and marking of conjuncts (in HamleDT, the root of a conjunct subtree is marked as conjunct even if it is a preposition or subordinating conjunction; in PDT, only content words are marked as conjuncts). By conjunct, we mean a member of coordination (unlike Quirk et al. 1985). By content word, we mean autosemantic word, i.e. a word with a full lexical meaning, as contrasted with auxiliary. Note that PDT also has a more abstract layer of annotation (called tectogrammatical), but in this work, we only use the shallow dependencies (called analytical layer in PDT).

 
6

Unless we explicitly say otherwise, we mean by “original” the data source indicated in Table 1. It may actually differ from the really original treebank. For instance, some of the CoNLL data underwent a conversion procedure to the CoNLL format from other formats, and some information may have been lost in the process.

 
7

In the Pāṇinian tradition, karta is the agent, doer of the action, and karma is the “deed” or patient. See Bharati et al. (1994).

 
8

They are approximately the same as the dependency relation labels in the Czech CoNLL data set. To illustrate the mapping, more details on [bn] and [en] conversion are presented in Tables 4 and 5 in Appendix 2.

 
9

Ideally we would also want to distinguish objects (Obj) from adverbials. Unfortunately, this particular source annotation does not provide enough information to make such a distinction.

 
10

In Chomskian (constituency-based) approaches, it is the standard analysis that determiners function as the head of a noun phrase.

 
11

Note, however, that numerals governing nouns are not restricted to [da]. Czech has a complex set of rules for numerals (motivated by morphological agreement), which may, under some circumstances, result in the numeral serving as the head.

 
12

In [ja], the previous token essentially means the main predicate, but if it is followed by a question particle then the punctuation node is attached to the particle.

 
14

http://ufal.mff.cuni.cz/tred/ with EasyTreex extension.

 
15

We do not attempt reversibility when unifying dependency relations.

 

Acknowledgments

The authors wish to express their gratitude to all the creators and providers of the respective corpora. The work on this project was supported by the Czech Science Foundation Grant Nos. P406/11/1499 and P406/14/06548P, by the European Union Seventh Framework Programme under Grant Agreement FP7-ICT-2013-10-610516 (QTLeap), and by research resources of the Charles University in Prague (PRVOUK). This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (Project LM2010013). Finally, we are very grateful for the numerous valuable comments provided by the anonymous reviewers.

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  • Daniel Zeman (1)
  • Ondřej Dušek (1)
  • David Mareček (1)
  • Martin Popel (1)
  • Loganathan Ramasamy (1)
  • Jan Štěpánek (1)
  • Zdeněk Žabokrtský (1)
  • Jan Hajič (1)

  1. Faculty of Mathematics and Physics, ÚFAL, Charles University in Prague, Prague, Czech Republic