A qualitative comparison method for rhetorical structures: identifying different discourse structures in multilingual corpora

Iruskieta, Mikel; da Cunha, Iria; Taboada, Maite

doi:10.1007/s10579-014-9271-6

Original Paper
Published: 28 May 2014

Volume 49, pages 263–309, (2015)

Language Resources and Evaluation

Mikel Iruskieta¹,
Iria da Cunha² &
Maite Taboada³

Abstract

Explaining why the same passage may have different rhetorical structures when conveyed in different languages remains an open question. Starting from a trilingual translation corpus, this paper aims to provide a new qualitative method for the comparison of rhetorical structures in different languages and to specify why translated texts may differ in their rhetorical structures. To achieve these aims we have carried out a contrastive analysis, comparing a corpus of parallel English, Spanish and Basque texts, using Rhetorical Structure Theory. We propose a method to describe the main linguistic differences among the rhetorical structures of the three languages in the two annotation stages (segmentation and rhetorical analysis). We show a new type of comparison that has important advantages with regard to the quantitative method usually employed: it provides an accurate measurement of inter-annotator agreement, and it pinpoints sources of disagreement among annotators. With the use of this new method, we show how translation strategies affect discourse structure.

Notes

Soricut and Marcu (2003, pg. 152) use the term “attachment point” or “dominance set”.
Although great efforts have been made to stimulate Machine Translation studies for different language pairs, typologically different non-official languages that could be of interest are not considered. For example, Koehn (2005) presents a 30-million-word corpus translated into the 11 official languages of the European Union (Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, and Swedish) to study translation between different language pairs, but less common languages spoken in the EU are not included.
The source of the text (TERM#_original language) is shown in square brackets at the end of the figures, tables or examples.
A problem with work in the framework of RST is that there is no annotated bilingual or trilingual corpus to study the effects of translation strategies on rhetorical structure. As a consequence, a researcher in such a situation first needs to learn RST and perform annotations, as Maxwell (2010) suggests.
It was used also to evaluate the RST Basque TreeBank (Iruskieta et al. 2013a), available at: http://ixa2.si.ehu.es/diskurtsoa/en/.
When a corpus is annotated only with one annotator per language, the results may yield subjective idiosyncrasies. This is not a problem for the aim of this paper, because we do not want to provide a reliable annotated corpus in three languages, but we do provide a qualitative way to compare annotation in different languages. Comparisons have been done manually and by pairs of languages following two different evaluations: (a) Marcu’s quantitative method and (b) a new qualitative-quantitative method. So even if the corpus is small, the comparison work is extensive. The aim to provide reliable corpora has been achieved in other papers by the authors [English SFU corpus (Taboada and Renkema 2008), Spanish RST TreeBank (da Cunha et al. 2011a) and Basque RST TreeBank (Iruskieta et al. 2013a)].
See the paragraph on Truncated EDUs in this section.
This evaluation method has been automated by Maziero and Pardo (2009) and nowadays it can be used in four languages: English, Spanish, Portuguese and Basque. Available at http://www.nilc.icmc.usp.br/rsteval/.
Note that, after harmonizing discourse segmentation, accuracy, precision, recall and F-measure obtain the same value. Although this results in a somewhat artificial level of agreement, and we are aware of this fact, we use the standard measure employed in the RST literature (Marcu 2000a; Maziero and Pardo 2009).
If there is more than one CS (because there is a multinuclear relation) at least one of them has to be the same for N/S-N/N mix-up.
Basque segments (A3) were also harmonized, but space constraints prevent us from aligning them with Spanish and English. The harmonization of TERM38_SPA segmentation in the three languages can be consulted at: http://ixa2.si.ehu.es/rst/segmentuak_multiling.php?bilatzekoa=TERM38%. The English RS-tree can be consulted at: http://ixa2.si.ehu.es/rst/diskurtsoa_jpg/TERM38_A1.jpg. The Spanish RS-tree can be consulted at: http://ixa2.si.ehu.es/rst/diskurtsoa_jpg/TERM38_A2.jpg.
If we followed this decision, we could not compare structures that contain a N/N–N/S mix-up inside the relation.
As the evaluation has been done manually, some problematic cases have not been counted as agreement. For cases in which structures cannot be compared, a no-match label has been used; it represents not more than 6 % of all relations (53 no-match/900 relations), about 1.18 relations per text on average (53 no-match/45 texts).
This harmonization work can be found at http://ixa2.si.ehu.es/rst/segmentuak_multiling.php.
For Kappa, segment candidates were calculated automatically by counting verbs.
In the example, the original segmentation is marked with square brackets and the segmentation after harmonization with curly brackets.
One-way ANOVA demonstrated significant differences across the three languages in the corpus (\(p = 0.07\)). We thought this was quite significant, therefore we performed a post-hoc Tukey’s test and we observed that harmonization in Basque is the furthest from the other two.
EDUs are excluded because they are identical after harmonization.
“Values of agreement between \(-A_e/(1-A_e)\) (no observed agreement) and 1 (observed agreement = 1), with the value 0 signifying chance agreement (observed agreement = expected agreement).” (Artstein and Poesio 2008, p. 559).
Catford (1965, pg. 73) defines translation shifts as “departures from formal correspondence in the process of going from the SL to the TL” (from the Source Language to the Target Language). Chesterman (1997) states that changes from original to translated text are due to a translation strategy.
Note that here there is another translation strategy (CSC hierarchical upgrading in Basque with a coordination of two finite verbs lortu dute ‘[they] achieve [it]’ and eman diote ‘[they] give [him]’), which is not under consideration due to the harmonization process.
Again, this goes against the principles of our segmentation.
Note here the human annotation error which does not follow the modular and incremental annotation that Pardo (2005) proposes.
This phenomenon (marker change is the first reason for mismatched relations) is repeated when we compare translated texts (TL) with one another (MC 20.29 %, CSC 4.35 % and US 7.25 %).
http://ixa2.si.ehu.es/rst.
SFU corpus is available at http://www.sfu.ca/~mtaboada/download/downloadRST.html.
RST Spanish TreeBank is available at http://corpus.iingen.unam.mx/rst/corpus_en.html.
Basque RST TreeBank is available at http://ixa2.si.ehu.es/diskurtsoa/en/.
Truncated EDU. English translation: ‘if there can be said to be any’ (see Sect. 5).
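
The bounds quoted above from Artstein and Poesio (2008) follow from the general form of chance-corrected agreement coefficients, (A_o − A_e)/(1 − A_e). The following is a minimal Python sketch of that arithmetic; the function name `chance_corrected_agreement` and the sample value for expected agreement are illustrative assumptions, not taken from the scripts used for the paper.

```python
def chance_corrected_agreement(observed: float, expected: float) -> float:
    """Return (A_o - A_e) / (1 - A_e) for observed and expected agreement."""
    if not 0.0 <= observed <= 1.0:
        raise ValueError("observed agreement must lie in [0, 1]")
    if not 0.0 <= expected < 1.0:
        raise ValueError("expected agreement must lie in [0, 1)")
    return (observed - expected) / (1.0 - expected)


# Hypothetical expected agreement, chosen only for illustration.
expected = 0.25
lower_bound = chance_corrected_agreement(0.0, expected)        # -A_e / (1 - A_e)
chance_level = chance_corrected_agreement(expected, expected)  # chance agreement
upper_bound = chance_corrected_agreement(1.0, expected)        # perfect agreement
```

With no observed agreement the coefficient reaches its lower bound, with observed agreement equal to expected agreement it is 0, and with perfect observed agreement it is 1.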

References

Abelen, E., Redeker, G., & Thompson, S. A. (1993). The rhetorical structure of US-American and Dutch fund-raising letters. Text, 13(3), 323–350.
Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.
Baker, M. (2004). A corpus-based view of similarity and difference in translation. International Journal of Corpus Linguistics, 9(2), 167–193.
Bateman, J. A., & Rondhuis, K. J. (1997). Coherence relations: Towards a general specification. Discourse Processes, 24(1), 3–49.
Carlson, L., Okurowski, M. E., & Marcu, D. (2002). RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
Carlson, L., Marcu, D., & Okurowski, M. E. (2003). Building a discourse-tagged corpus in the framework of rhetorical structure theory. In J. van Kuppevelt & R. W. Smith (Eds.), Current and new directions in discourse and dialogue (pp. 85–112). Berlin: Springer.
Catford, J. C. (1965). A linguistic theory of translation: An essay in applied linguistics (Vol. 8). New York: Oxford University Press.
Cenoz, J. (2003). The role of typology in the organization of the multilingual lexicon. In J. Cenoz, B. Hufeisen & U. Jessner (Eds.), The multilingual lexicon (pp. 103–116), New York: Springer.
Chesterman, A. (1993). From ‘is’ to ‘ought’: Laws, norms and strategies in translation studies. Target, 5(1), 1–20.
Chesterman, A. (1997). Memes of translation: The spread of ideas in translation theory (Vol. 22). Amsterdam and Philadelphia: Benjamins.
Chiswick, B. R., & Miller, P. W. (2005). Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, 26(1), 1–11.
Cristea, D., Ide, N., & Romary, L. (1998). Veins theory: A model of global discourse cohesion and coherence. In C. Boitet & P. Whitelock (Eds.), 17th international conference on Computational linguistics (Vol. 1 pp. 281–285). Montreal, Canada: Association for Computational Linguistics.
Cui, S. (1986). A comparison of English and Chinese expository rhetorical structures. Ph.D. thesis, UCLA.
da Cunha, I., & Iruskieta, M. (2010). Comparing rhetorical structures in different languages: The influence of translation strategies. Discourse Studies, 12(5), 563–598.
da Cunha, I., Torres-Moreno, J. M., & Sierra, G. (2011a). On the Development of the RST Spanish Treebank. In 5th Linguistic annotation workshop. 49th annual meeting of the association for computational linguistics, ACL (pp. 1–10). Portland, Oregon, USA.
da Cunha, I., Torres-Moreno, J. M., Sierra, G., Cabrera-Diego, L. A., Castro-Rolón, B. G., & Rolland-Bartilotti, J. M. (2011b). The RST Spanish Treebank On-line Interface. In International conference recent advances in NLP (pp. 698–703), Bulgaria.
Delin, J., Hartley, A. F., Paris, C., Scott, D. R., & Linden, K. V. (1994). Expressing procedural relationships in multilingual instructions. In Seventh International Workshop on Natural Language Generation (pp. 61–70), Association for Computational Linguistics.
Delin, J., Hartley, A. F., & Scott, D. R. (1996). Towards a contrastive pragmatics: Syntactic choice in English and French instructions. Language Sciences, 18(3–4), 897–931.
Egg, M., & Redeker, G. (2010). How complex is discourse structure? In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010) (pp. 1619–1623), Valletta, Malta.
Fetzer, A., & Johansson, M. (2010). Cognitive verbs in context. A contrastive analysis of English and French argumentative discourse. International Journal of Corpus Linguistics, 15(2), 240–266.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Flowerdew, J. (2010). Use of signalling nouns across L1 and L2 writer corpora. International Journal of Corpus Linguistics, 15(1), 36–55.
Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English–Chinese corpus. In 3rd workshop on very large Corpora, (Vol. 78, pp. 173–183). Boston, MA.
Ghorbel, H., Ballim, A., & Coray, G. (2001). ROSETTA: Rhetorical and semantic environment for text alignment. In: Corpus Linguistics, Lancaster University (UK) (pp. 224–233).
Gomez, X., & Simoes, A. (2009). Parallel corpus-based bilingual terminology extraction. In 8th international conference on terminology and artificial intelligence Toulouse.
Granger, S. (2003). The corpus approach: A common way forward for Contrastive Linguistics and Translation Studies (pp. 17–29). Rodopi, Corpus-based approaches to contrastive linguistics and translation studies. Amsterdam/New York.
House, J. (2004). Explicitness in discourse across languages. Neue Perspektiven in der Übersetzungs-und Dolmetschwissenschaft (pp. 185–208), Bochum: AKS.
Iruskieta, M., Aranzabe, M. J., Díaz de Ilarraza, A., Gonzalez, I., Lersundi, M., & Lopez de la Calle, O. (2013a). The RST Basque TreeBank: An online search interface to check rhetorical relations. In 4th workshop RST and discourse studies, Brasil.
Iruskieta, M., Díaz de Ilarraza, A., & Lersundi, M. (2013b). Establishing criteria for RST-based discourse segmentation and annotation for texts in Basque. Corpus Linguistics and Linguistic Theory, 1–32.
Kanté, I. (2010). Mood and modality in finite noun complement clauses: A French-English contrastive study. International Journal of Corpus Linguistics, 15(2), 267–290.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In: MT summit, Phuket, Thailand.
Kong, K. C. C. (1998). Are simple business request letters really simple? A comparison of Chinese and English business request letters. Text & Talk, 18(1), 103–141.
Mann, W. C., & Taboada, M. (2010). RST web-site. http://www.sfu.ca/rst/. Accessed 30 September 2012.
Mann, W. C., & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3), 243–281.
Marcu, D. (2000a). The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3), 395–448.
Marcu, D. (2000b). The theory and practice of discourse parsing and summarization. Cambridge: MIT press.
Marcu, D., Carlson, L., & Watanabe, M. (2000). The automatic translation of discourse structures. In 1st North American chapter of the Association for Computational Linguistics conference (pp. 9–17), Seattle (USA): Morgan Kaufmann Publishers.
Maxwell, M. (2010). Limitations of corpora. International Journal of Corpus Linguistics, 15(3), 379–383.
Maziero, E. G., & Pardo, T. A. S. (2009). Automatização de um método de avaliação de estruturas retóricas. In: RST Brazilian meeting, São Paulo, Brazil.
Mitocariu, E., Anechitei, D. A., & Cristea, D. (2013). Comparing discourse tree structures (pp. 513–522). Berlin: Springer. Computational Linguistics and Intelligent Text Processing.
Mohamed, A. H., & Omer, M. R. (1999). Syntax as a marker of rhetorical organization in written texts: Arabic and English. International Review of Applied Linguistics in Language Teaching (IRAL), 37(4), 291–305.
Morin, E., Daille, B., Takeuchi, K., & Kageura, K. (2007). Bilingual terminology mining-using brain, not brawn comparable corpora. In Annual meetings ACL (Vol. 45, pp. 664–671). Prague.
Mortier, L., & Degand, L. (2009). Adversative discourse markers in contrast: The need for a combined corpus approach. International Journal of Corpus Linguistics, 14(3), 338–366.
O’Donnell, M. (2000). RSTTool 2.4: A markup tool for rhetorical structure Theory. In First international conference on natural language generation INLG’00 (Vol. 14, pp. 253–256). Mitzpe Ramon: ACL.
Pardo, T. A. S. (2005). Métodos para análise discursiva automática. Ph.D. thesis, Instituto de Ciências Matemáticas e de Computação, São Carlos-SP: Universidade de São Paulo.
Ramsay, G. (2000). Linearity in rhetorical organisation: A comparative cross-cultural analysis of newstext from the People’s Republic of China and Australia. International Journal of Applied Linguistics, 10(2), 241–258.
Ramsay, G. (2001). Rhetorical styles and newstexts: A contrastive analysis of rhetorical relations in Chinese and Australian news-journal text. ASAA E-Journal of Asian Linguistics and Language-teaching, 1(1), 1–22.
Salkie, R., & Oates, S. L. (1999). Contrast and concession in French and English. Languages in Contrast, 2(1), 27–56.
Sarjala, M. (1994). Signalling of reason and cause relations in academic discourse. Anglicana Turkuensia, 13, 89–98.
Scott, D. R., Delin, J., & Hartley, A. F. (1998). Identifying congruent pragmatic relations in procedural texts. Languages in Contrast, 1(1), 45–82.
Soricut, R., & Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. In 2003 conference of the North American Chapter of the Association for Computational Linguistics on human language technology (Vol. 1, pp. 149–156). Association for Computational Linguistics.
Stede, M. (2008a). Disambiguating rhetorical structure. Research on Language and Computation, 6(3), 311–332.
Stede, M. (2008b). RST revisited: Disentangling nuclearity (pp. 33–57). Amsterdam and Philadelphia: John Benjamins. ‘Subordination’ versus ‘coordination’ in sentence and text.
Taboada, M. (2004a). Building coherence and cohesion: Task-oriented dialogue in English and Spanish. Amsterdam and Philadelphia: John Benjamins.
Taboada, M. (2004b). Rhetorical relations in dialogue: A contrastive study (pp. 75–97), Amsterdam and Philadelphia: John Benjamins. Discourse across Languages and Cultures.
Taboada, M., & Mann, W. C. (2006a). Applications of rhetorical structure theory. Discourse Studies, 8(4), 567–588.
Taboada, M., & Mann, W. C. (2006b). Rhetorical structure theory: Looking back and moving ahead. Discourse Studies, 8(3), 423–459.
Taboada, M., & Renkema, J. (2008). Discourse relations reference corpus. Simon Fraser University and Tilburg University. http://www.sfu.ca/rst/06tools/discourse_relations_corpus.html. Accessed 30 September 2012
Trask, R. L. (1997). The history of Basque. London: Routledge.
Usoniene, A., & Soliene, A. (2010). Choice of strategies in realizations of epistemic possibility in English and Lithuanian: A corpus-based study. International Journal of Corpus Linguistics, 15(2), 291–316.
UZEI and HAEE-IVAP. (1997). International congress on terminology. Donostia and Gasteiz: UZEI; HAEE-IVAP.
van der Vliet, N. (2010). Inter annotator agreement in discourse analysis. http://www.let.rug.nl/~nerbonne/teach/rema-stats-meth-seminar/.
Wu, D., & Xia, X. (1994). Learning an English–Chinese lexicon from a parallel corpus. In First conference of the AMTA (pp. 206–213). Citeseer, Columbia.
Xiao, R. (2010). How different is translated Chinese from native Chinese? A corpus-based study of translation universals. International Journal of Corpus Linguistics, 15(1), 5–35.

Acknowledgments

This work has been partially financed by the Spanish projects RICOTERM 4 (FFI2010-21365-C03-01) and APLE 2 (FFI2012-37260), and a Juan de la Cierva Grant (JCI-2011-09665) to Iria da Cunha. Maite Taboada was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (261104-2008). Mikel Iruskieta was supported by the following projects: OPENMT-2 (TIN2009-14675-C03-01) [Spanish Ministry], Ber2Tek (IE12-333) [Basque Government] and IXA group (GIU09/19) [University of the Basque Country]. We would like to thank the anonymous reviewers for their comments and suggestions, Nynke van der Vliet for her feedback on the evaluation method, Esther Miranda for designing the website, and Oier Lopez de Lacalle for helping with the scripts to calculate the statistics.

Author information

Authors and Affiliations

Department of Didactics of Language and Literature, University of the Basque Country, Sarriena auzoa z/g, 48940 Leioa, Spain
Mikel Iruskieta
University Institute for Applied Linguistics, Universitat Pompeu Fabra, C/ Roc Boronat 138, 08018, Barcelona, Spain
Iria da Cunha
Department of Linguistics, Simon Fraser University, 8888 University Dr, Burnaby, BC, V5A 1S6, Canada
Maite Taboada

Corresponding author

Correspondence to Mikel Iruskieta.

Appendix: Discourse segmentation details

The first step in analyzing texts under RST consists of segmenting the text into spans. Exactly what a span is, under RST and more generally in discourse, is a much-debated topic. RST (Mann and Thompson 1988) proposes that spans, the minimal units of discourse (later called elementary discourse units, or EDUs; Marcu 2000a), are clauses, but that other definitions of units are possible:

The first step in analyzing a text is dividing it into units. Unit size is arbitrary, but the division of the text into units should be based on some theory-neutral classification. That is, for interesting results, the units should have independent functional integrity. In our analyses, units are essentially clauses, except that clausal subjects and complements and restrictive relative clauses are considered as parts of their host clause units rather than as separate units.

(Mann and Thompson 1988, p. 248)

This definition is the basis of our work. From our point of view, adjunct clauses stand in clear rhetorical relations (cause, condition, concession, etc.). Complement clauses, however, have a syntactic, but not discourse, relation to their host clause. Complement clauses include, as Mann and Thompson (1988) point out, subject and object clauses, and restrictive relative clauses, but also embedded report complements, which are, strictly speaking, also object clauses.

Other possibilities for segmentation exist; one of the better-known ones is the proposal by Carlson et al. (2003) for segmentation of the RST Discourse Treebank (Carlson et al. 2002). Carlson et al. (2003) propose a much more fine-grained segmentation, where report complements, relative clauses and appositive elements constitute their own EDUs.

In our work, three annotators segmented the EDUs of each corpus (A1 segmented the English texts, A2 the Spanish texts, and A3 the Basque texts). These annotators are experts on RST: they have been working in this field for years and have participated in several projects related to the design and construction of RST corpora in the three languages of this work. The annotators performed this segmentation task separately and without contact with one another. In our segmentation, we follow the general guidelines proposed by Mann and Thompson (1988), which we have operationalized for this paper. We detail the principles below.

Every EDU Should Have a Verb

In general, EDUs should contain a (finite) verb. The main exception to this rule is the case of titles, which are always EDUs, whether they contain a verb or not.

Non-finite verbs form their own EDUs only when introducing an adjunct clause (but not a modifier clause, as we will see below). In (7), the non-finite clause Focussing on less widely... is an independent EDU, because it is an adjunct clause. Note that in both Spanish and Basque the same proposition was translated as an independent sentence.

(7)
(a)
  [Focussing on less widely used and taught languages (LWUTLs) including Irish,] [the VOCALL partners are compiling multilingual glossaries of technical terms in the areas of computers, office skills and electronics] [and this involves the creation of a large number of new Irish terms in the above areas.]
(b)
  [El proyecto está enfocado hacia lenguas minoritarias en cuanto al uso y enseñanza, incluido el irlandés.] [El proyecto VOCALL está en proceso de recopilación de un glosario plurilingüe de términos técnicos de las áreas de informática, secretariado y construcción,] [y esto supone la creación de una larga serie de nuevos términos en irlandés, en las áreas mencionadas.]
(c)
  [Gutxi erabiltzen eta irakasten diren hizkuntzetan kontzentratzen da proiektua (LWUTL), irlandera barne.] [Informatika, bulego-lana eta eraikuntzako arloetako termino teknikoen glosario eleanizduna biltzen ari da VOCALL,] [eta horrek esan nahi du arlo horietako irlanderazko termino berri ugari sortzen ari dela.] TERM23_ENG

In some cases, a prepositional phrase (especially one containing a nominalized verb) in one language was realized as an independent clause in another. The final decision in such cases is typically to segment minimally, that is, to unify the segmentation across the three languages, so that the language with the fewest segments determines how the texts in the other languages have to be segmented. See also Sect. 3.1.1, on harmonization of the segmentation, for more examples of our final decisions across the three languages.
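
The "segment minimally" decision can be sketched in Python as follows. This is an illustration under strong assumptions: it presumes that segment boundaries in the three texts have already been aligned to shared position identifiers (the integers below are hypothetical), and it approximates the decision by keeping only the boundaries that every language has. The harmonization in the paper was done manually, so the function names and data here are not the authors' procedure.

```python
def harmonize(boundaries: dict[str, set[int]]) -> set[int]:
    """Keep only the aligned boundaries that every language marks."""
    if not boundaries:
        return set()
    return set.intersection(*boundaries.values())


def changes_per_language(
    boundaries: dict[str, set[int]], harmonized: set[int]
) -> dict[str, int]:
    """Count how many original boundaries each language had to give up."""
    return {lang: len(marks - harmonized) for lang, marks in boundaries.items()}


# Hypothetical aligned boundary identifiers for one text in three languages.
original = {
    "ENG": {1, 2, 3},
    "SPA": {1, 2, 3, 4},
    "BSQ": {1, 3, 4, 5},
}
harmonized = harmonize(original)                      # shared boundaries only
changes = changes_per_language(original, harmonized)  # removals per language
```

Counting the removed boundaries per language gives one simple way of seeing which language required the most changes during harmonization.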

Coordination and Ellipsis. Coordinated clauses are separated into two segments, including cases where the subject is elliptical in the second clause. In Spanish and Basque, both pro-drop languages, this is in fact the default for both first and second clause, and therefore we see no reason why a clause with a pro-drop subject cannot be an independent unit. We follow the same principle for English. In (8), the first two EDUs in Spanish are coordinated with an elliptical subject in both cases, referring to the authors (venimos traduciendo, ‘[we] have been translating’ and queremos expresar, ‘[we] wish to indicate’). They constitute separate EDUs. In the English and Basque versions, the two clauses are expressed as separate sentences.

(8)
(a)
  [To attain this goal we have been translating doctrinal texts in law at the University of Deusto since 1994.] [We wish to indicate the difficulties we have had over the years and also our achievements,] [if there can be said to be any.]
(b)
  [Para poder alcanzar ese objetivo en la Universidad de Deusto venimos traduciendo textos doctrinales del campo del Derecho desde 1994] [y queremos expresar las dificultades que hemos tenido a lo largo de estos años y, asímismo, también los logros conseguidos,] [si es que realmente los ha habido.]
(c)
  [Xede hori iristeko, 1994. urteaz geroztik, Deustuko Unibertsitatean Zuzenbidearen inguruko testu doktrinalak itzultzen dihardugu.] [Esperientzia horretan izandako zailtasunak eta,] [halakorik izanez gero,] (Footnote 29) [lorpenak ere azaldu nahi ditugu.] TERM25_BSQ

Coordinated verb phrases (VPs) or verbs do not constitute their own EDUs. We differentiate coordinated clauses from coordinated VPs because the former can be independent clauses with the repetition of a subject; the latter, in the second part of the coordination, typically contain elliptical verbal forms, most frequently a finite verb or modal auxiliary.

Relative, Modifying and Appositive Clauses. We do not consider that relative clauses (restrictive or non-restrictive), clauses modifying a noun or adjective, or appositive clauses constitute their own EDUs. We include them as part of the same segment together with the element that they are modifying. This departs from RST practice, where (non-restrictive) relative clauses are often independent spans, as seen in many of the examples in the original literature and the analyses on the RST web site (Mann and Thompson 1988; Mann and Taboada 2010). We found that relative clauses and other modifiers often lead to truncated EDUs, resulting in repeated use of the Same-unit relation (see Truncated EDUs in Sect. 5), and thus decided that it was best not to elevate them to the status of independent segments.

An example is presented in (9), where the relative clause is in parentheses in the Spanish original. Note, however, that the coordinated clauses (with an elliptical subject in all cases) are independent segments, as explained above. In Basque, on the other hand, the relative clause is translated as an independent clause with a finite verb (mugatzen da, ‘[it] is limited to’). We have not segmented it in Basque, to agree with the other two languages.

(9)
(a)
  [...] [Internet terminology extends beyond the bounds of its specialist field (which by definition is part of the lexicon of science and technology)] [and breaks into general language.]
(b)
  [...] [la terminología de Internet traspasa los límites del área de especialidad (a la que se circunscribe por definición el léxico científico y técnico)] [e irrumpe en la lengua de uso general,] [...]
(c)
  [...] [espezialitateko eremuaren mugak gainditzen dituela Interneteko terminologiak (espezialitatera mugatzen da, definizioz, lexiko zientifiko eta teknikoa),] [eta erabilera orokorreko hizkeran sartzen dela indartsu;] [...] TERM38_SPA

Parentheticals. The same principle applies to parentheticals and other units typographically marked as separate from the main text (with parentheses or dashes). They do not form an individual span if they modify a noun or adjective as in Example 10, but they do if they are independent units, with a finite verb. Such is the case in (11), with a full sentence in the parenthetical unit (in English, composed of three finite clauses: can... be represented, is and are).

(10)
(a)
  The analysis of the data at hand—international terms most of which have not yet been standardized in Serbian—indicate that a hierarchy of criteria for evaluating the terms, (...). TERM18_ENG

(11)
(a)
  [The design and management of terminological databases pose theoretical and methodological problems] [(how can a term be represented?] [Is there a minimum representation?] [How are terms to be classified?),] (...)
(b)
  [Efectivamente, el diseño y la gestión de las bases de datos terminológicos plantean problemas diversos tanto de índole teórica y metodológica] [(¿cómo se representa un término?,] [¿existe una representación mínima?,] [¿cómo se clasifican los términos?)] (...)
(c)
  [Hala da, terminologiako datu-baseak diseinatzeak eta kudeatzeak hainbat arazo dakar bai teoria eta metodologiaren aldetik] [(nola adierazi terminoa?] [Ba al da gutxieneko adierazpenik?] [Nola sailkatu terminoak?),] (...) TERM29_SPA

Reported Speech. We believe that reported and quoted speech do not stand in rhetorical relations to the reporting units that introduce them, and thus should not constitute separate EDUs, following clear arguments presented elsewhere (da Cunha and Iruskieta 2010; Stede 2008a). This is in contrast to the approach in the RST Discourse Treebank (Carlson et al. 2003), where reported speech (there named attribution) is a separate EDU. There are, in any case, no examples of reported speech in our corpus.
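
The segmentation principles described so far can be summarized as a small decision function. The Python sketch below is only a reading aid: the role labels, the `Candidate` class and the `is_edu` function are hypothetical names introduced here, the syntactic role of each candidate is assumed to have been decided by a human annotator beforehand, and nothing of this kind was used to segment the corpus.

```python
from dataclasses import dataclass

# Roles that, under the principles above, stay inside their host segment.
NEVER_EDU = {
    "complement",
    "relative",
    "modifier",
    "appositive",
    "reported_speech",
    "coordinated_vp",
}


@dataclass(frozen=True)
class Candidate:
    role: str  # hypothetical label assigned by the annotator
    has_finite_verb: bool = False
    has_nonfinite_verb: bool = False


def is_edu(candidate: Candidate) -> bool:
    """Decide whether a candidate span counts as its own EDU."""
    if candidate.role == "title":
        return True  # titles are always EDUs, with or without a verb
    if candidate.role in NEVER_EDU:
        return False
    if candidate.role == "adjunct":
        # Adjunct clauses qualify even when the verb is non-finite.
        return candidate.has_finite_verb or candidate.has_nonfinite_verb
    # Main clauses, coordinated clauses and independent parentheticals
    # need a finite verb.
    return candidate.has_finite_verb
```

A parenthetical that merely modifies a noun, as in (10), would be labeled `modifier` in this scheme and therefore stay inside its host segment, whereas the parenthetical questions in (11) would be labeled `parenthetical` and pass the finite-verb check.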

Truncated EDUs. In some cases, a unit contains a parenthetical or inserted unit, breaking it into two separate parts, which do not stand in any particular rhetorical relation to each other. In those cases, we make use of a non-relation label, Same-unit, proposed for the RST Discourse Treebank (Carlson et al. 2003).

We see one such example in (11) above. The element that corresponds to the third unit in English is, in fact, inserted in the middle of the second unit in Basque. In order to align or harmonize segmentation and to preserve the integrity of that unit, we use the Same-unit (non) relation, as shown in Fig. 8, which follows the Basque word order.
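The Same-unit convention can be illustrated with a small data structure. This is only a sketch under our own assumptions: the `Fragment` class, the `unit_id` field and the `same_unit_links` function are hypothetical names introduced here, and the sentence is invented for illustration rather than taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    unit_id: int  # fragments sharing a unit_id belong to one discourse unit


def same_unit_links(fragments):
    """Return index pairs of fragments that continue the same interrupted unit."""
    last_seen = {}  # unit_id -> index of the latest fragment of that unit
    links = []
    for index, fragment in enumerate(fragments):
        if fragment.unit_id in last_seen:
            # An earlier part of this unit exists: join both parts with Same-unit.
            links.append((last_seen[fragment.unit_id], index))
        last_seen[fragment.unit_id] = index
    return links


# Invented example: unit 1 is interrupted by an independent parenthetical (unit 2).
fragments = [
    Fragment("The committee", 1),
    Fragment("(it had met twice that week)", 2),
    Fragment("approved the proposal.", 1),
]
links = same_unit_links(fragments)
```

Here `links` pairs the first and third fragments, which is where a Same-unit label would be attached so that the interrupted unit is treated as a whole in the rest of the analysis.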

Once our segmentation criteria were established and the three annotators had carried out the segmentation, the three segmentations were compared in terms of precision and recall. In this way, we quantified agreement and disagreement across segmentations. Moreover, we analyzed the main causes of the disagreements. Results are shown in Sect. 3. After the segmentation agreement evaluation, we harmonized the segmentation, ensuring that units were comparable across the languages.

At this point, we also calculated linguistic distance between the pairs of languages. We understand linguistic distance as “the extent to which languages differ from each other” (Chiswick and Miller 2005, p. 1). Although this concept is well known among linguists, there is no single measure to evaluate this distance (Chiswick and Miller 2005). In our work, we measured this distance by calculating which language required the most changes in the harmonization process. This harmonization process was necessary to start the analysis with similar units, and to avoid confusing analysis disagreement with segmentation disagreement. Marcu et al. (2000) and Ghorbel et al. (2001) also align (what we term harmonize) their texts, decreasing the granularity of their segmentation to avoid complexity. With this decision, we lose some rhetorical information at the most detailed level of the tree. This does not, however, affect higher levels of tree structure. The results of this harmonization are shown in Sect. 3.1.
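The two measurements described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for the corpus: it assumes that each segmentation is given as a set of boundary positions in a shared alignment, and it assumes that the number of harmonization changes can be counted as the boundaries added or removed. The function names and all numbers are hypothetical.

```python
def boundary_scores(reference, candidate):
    """Precision, recall and F1 of candidate boundaries against reference ones."""
    reference, candidate = set(reference), set(candidate)
    matched = len(reference & candidate)
    precision = matched / len(candidate) if candidate else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


def harmonization_changes(original, harmonized):
    """Count boundaries added or removed to reach the harmonized segmentation."""
    return len(set(original) ^ set(harmonized))


# Invented boundary positions in a shared alignment (not corpus data).
english = [12, 18, 24, 31]
spanish = [12, 24, 31, 40, 47]
precision, recall, f1 = boundary_scores(english, spanish)

# Invented harmonized segmentation shared by the language pair.
harmonized = [12, 24, 31]
changes = {
    "english": harmonization_changes(english, harmonized),
    "spanish": harmonization_changes(spanish, harmonized),
}
# Language whose segmentation needed the most changes under this assumption.
most_changed = max(changes, key=changes.get)
```

Treating boundaries as sets keeps the comparison symmetric in form: swapping the reference and the candidate swaps precision and recall, which is convenient when no single annotator counts as the gold standard.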


About this article

Cite this article

Iruskieta, M., da Cunha, I. & Taboada, M. A qualitative comparison method for rhetorical structures: identifying different discourse structures in multilingual corpora. Lang Resources & Evaluation 49, 263–309 (2015). https://doi.org/10.1007/s10579-014-9271-6

Received: 26 June 2013
Accepted: 08 May 2014
Published: 28 May 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10579-014-9271-6
