1 Introduction

Coreference resolution (henceforth CR) has a long history in natural language processing (NLP); knowing who is being talked about in a text has always been a fascinating challenge for scholars. Although it is not a new task, CR is still actively debated [1], demonstrating its relevance to both practical and theoretical issues. Indeed, coreference information has been used in various NLP tasks, such as text summarization [2], including applications to low-resource languages [3]. Moreover, it has been an object of study in theoretical linguistics [4], particularly regarding the interpretation of syntactic phenomena like null subjects and pronouns. Over the last decades, many approaches to CR have succeeded one another, ranging from simple rule-based systems through machine- and deep-learning approaches [5, 6] to reinforcement learning-based solutions [7]. These approaches have also been transferred to specific application domains [8].

The history and developments of this field have led to the creation of numerous corpora specifically annotated for coreference-related tasks, from the earliest modestly sized, manually created corpora [9] to progressively larger resources built to satisfy the ever-increasing data needs of machine learning approaches [10] and capable of covering multiple languages or specific domains. Evaluation campaigns such as SemEval [11] and CoNLL 2012 [10] have contributed to the proliferation of available datasets. However, despite its long tradition, CR is among the sub-fields of NLP that have seen the slowest progress [1] during the last decade, a period dominated by the exponential growth of machine learning. In addition, the vast amount of resources available for English is not matched by a similar number for other languages. Datasets in languages other than English are mainly limited to preexisting treebanks to which a specific coreference annotation level has been added.

Regarding the language under investigation in this work, Italian, there are only a few outdated annotated corpora [12,13,14], which suffer from limited size, excessive domain dependence, and the lack of a shared annotation scheme. Hence, only a handful of approaches to CR have been developed.

Starting from this issue, this paper describes an innovative cross-lingual methodology for creating a CR dataset in a low-resource language starting from a resource-rich one. The languages considered here are Italian and English, respectively. In particular, an Italian dataset for CR has been generated starting from OntoNotes [15], which has been considered the de facto standard for the evaluation of coreference tasks in English since the CoNLL shared tasks in 2011 and 2012.

The methodology is divided into two distinct steps. First, a multi-level translation process is applied to the English sentences extracted from the OntoNotes dataset for CR. This step aims to translate sentences while preserving the mentions they contain, i.e., without losing the tokens composing the mentions, their positions, or the verbal agreements involving them. Second, a language refinement step has been introduced. This step handles language-dependent phenomena to produce output sentences compliant with Italian grammar by applying language-specific rules derived from theoretical linguistics. These rules perform deletions and substitutions without losing information about mentions: the original coreference annotation is preserved while avoiding sentences that sound unnatural or ungrammatical in Italian. This step is necessary where there is a significant discrepancy between the two languages, in this case Italian and English, concerning syntactic constructions involving personal pronouns, which are often used in different ways.

Concerning evaluation, the results have been assessed both quantitatively and qualitatively. From the quantitative point of view, the readability of the produced sentences has been calculated using the Flesch–Kincaid index adapted for the Italian language [16]. This metric has been supplemented with a qualitative analysis carried out by native speakers using indicators from theoretical linguistics, namely grammaticality and acceptability. Grammaticality refers to a sentence’s well-formedness from a syntactic point of view, e.g., whether the structure and order of the constituents are maintained. Acceptability, instead, concerns how semantically meaningful the sentence is according to the annotator’s judgment. Together, these two indicators allow assessing the quality of translated sentences from the perspectives of both grammatical correctness and meaningfulness for a native speaker. The quality of the dataset has also been assessed by training a CR baseline model based on BERT [17]. The results have then been compared with those obtained by the same model on the English version of the OntoNotes dataset.

The paper is organized as follows. Section 2 reviews the state of the art of datasets created for CR, covering both English and other languages. Section 3 outlines the research motivations and contributions of the proposal. In Sect. 4, the methodology adopted for building the dataset starting from the original English resource is reported; this section describes the two macro-steps of translation and linguistic refinement needed to obtain a translated text that preserves mentions and coreferences. Section 5 discusses the results obtained, describing the quantitative and qualitative evaluation and outlining the performance achieved by a BERT-based CR model trained on the generated dataset. Finally, Sect. 6 concludes the work.

2 Related work

The datasets developed over the years for CR are of various kinds. Generalist, domain-specific and multilingual datasets characterized by different criteria and annotation schemes have been created. The vast majority of the resources—as in all NLP fields—have been made for the English language, but there have also been developments in other languages in recent years.

It is worth noting that almost all resources cover both coreference and anaphora resolution, since both are part of the entity resolution family. The exact terminological distinction between the two concepts is still debated in the literature: according to some studies, anaphora is a subset of coreference, while others claim that coreference is part of anaphora. In this paper, the resources available for coreference are listed, although they are almost always valid for anaphora resolution as well. As regards terminology, this work adopts the same definition of coreference as the OntoNotes schema. Therefore, coreference is not limited to noun phrases [18] but includes pronouns, heads of verb phrases, and named entities as potential mentions.

Starting from these premises, this section first surveys and highlights the main characteristics of existing datasets for CR in English. Subsequently, CR resources for languages other than English are described, specifically outlining the ones for Italian.

2.1 CR resources for English

The MUC corpora are the first datasets manually created by human annotators that also target evaluation purposes. MUC-6 [19] and MUC-7 [20] are based on North American news (extracted from the Wall Street Journal), and they are small in size (318 annotated articles). Although now rarely used due to their limited domain and size, they are still considered valid baselines for comparison. MUC has its own evaluation metrics and an SGML-based annotation format.

The GNOME Corpus [21], instead, was created with a specific cross-domain scope. It includes texts from three domains (museum labels, pharmaceutical leaflets, and tutorial dialogues) and carries an annotation level for discourse and semantic information. GNOME has also been used in conjunction with other datasets to create the ARRAU corpus [22], which includes corpora from different domains such as news-wire, dialogues, and fiction. The annotation scheme is the MMAX2 format, which uses hierarchical XML files at the document and sentence level.

There are also corpora developed for specific coreference-related sub-tasks. The character identification corpus [23] focuses on the task of speaker linking in multi-party conversations extracted from transcriptions of TV shows. ECB+ [24] is another task-specific corpus, devoted to topic-based event CR, a task that has gained much attention in the literature in recent years.

Other corpora developed for cross-domain purposes exploit freely available online resources. The GUM corpus [25] is a multilayer, CoNLL-labeled corpus containing conversational, instructional, and news texts extracted from the web. WikiCoref [26] is composed of annotated Wikipedia articles, whose mentions are linked to entities in an external knowledge repository. Both corpora use the OntoNotes schema for the annotation. It is worth noting that the English Penn Treebank [27] has also been used for coreference-related purposes; indeed, it was annotated with coreference links as part of the OntoNotes project [15].

There are also coreference corpora developed for a single domain. For instance, NP4E [28] is a small corpus based only on the security and terrorism genres, annotated using the MMAX2 format for the event coreference task. In addition, the healthcare domain has received special attention, and numerous biomedical corpora have been created. Starting from the GENIA corpus [29], which contains 2000 MEDLINE abstracts, numerous other resources have been developed, such as the Genia Treebank [30], Genia event annotation [31], and MedCo coreference annotation [32]. These resources have been the focus of the BioNLP-2011 shared task on Protein CR [33]. A different approach is taken by CRAFT [34] and its successor, the HANNAPIN corpus [35]; these resources contain biochemical articles fully annotated for CR. In the pharmacological field, the DrugNerAR [36] corpus has been developed with the aim of resolving anaphora for extracting drug–drug interactions from the pharmacological literature.

2.2 CR resources for other languages

The first corpus that also deals with languages other than English is ACE [37]. Initially based only on the journalistic domain, it aims to be heterogeneous and domain-independent and is annotated for different languages (like English, Chinese, and Arabic). The covered domains range from news-wire articles to conversational telephonic speech and broadcast conversations.

OntoNotes 5.0 [38] was the dataset involved in SemEval 2010 [39] and CoNLL 2012 [10], with the aim of modeling CR for multiple languages. It was created to group mentions into equivalence classes according to the entity to which they refer. OntoNotes is mostly based on news articles; it includes three different languages and is annotated using a CoNLL-like format. It is still the most widely used corpus for evaluation in the literature.

Another parallel corpus available in two languages (English and German) is ParCor [40]. It includes data extracted from specific genres (TEDx talks and Bookshop publications) and focuses on a particular purpose: parallel pronoun CR across languages in a machine translation context.

There are very few datasets currently usable for the coreference task in the Italian language. VENEX [12] is a corpus which combines two different corpus-annotation initiatives: SI-TAL [41], focused on the creation of a corpus of written Italian from financial newspapers, and IPAR [42], a collection of spoken task-oriented dialogues. VENEX uses MATE as its annotation scheme and MMAX for the markup.

Another coreference resource is I-CAB [13], a small dataset built on news documents taken from the regional newspaper L’Adige. Texts are annotated using a scheme derived from the ACE corpus.

The most recent corpus developed for Italian is LiveMemories [14]. It collects two genres of text: blog sites and Wikipedia pages related to the history, geography, and culture of the region of Trentino-Alto Adige/Südtirol. The annotation follows the ARRAU guidelines adapted for the Italian language. A size comparison of the corpora is reported in Table 1.

Table 1 Size comparison of coreference corpora

These resources present several limitations. First, they are tied to specific domains: both the I-CAB and LiveMemories corpora contain only texts related to the region of Trentino/Südtirol (newspaper articles and Wikipedia pages and blog sites, respectively). The VENEX corpus is more heterogeneous, since it includes articles from financial newspapers as well as dialogues. Second, they adopt different annotation methods. The VENEX annotation scheme implements the scheme proposed in MATE, and its markup scheme is the simplified form of standoff adopted in the MMAX annotation tool. I-CAB is annotated with a scheme inspired by the ACE corpus, while LiveMemories combines annotation methods from the ARRAU corpus for English [22] and the VENEX project.

3 Research objectives and contribution

The main objective of this work is to propose a cross-lingual methodology for the creation of a dataset for CR, integrating automatic translation with a rule-based refinement in order to transfer existing resources from a source language to a target language.

As highlighted in Sect. 2.1, the most recent datasets for coreference tasks are based on previously developed resources or treebanks to which an additional level of specific annotation has been added. This approach is practical for languages with a wealth of materials, but it cannot be adapted to languages like Italian, which are often overlooked in many NLP tasks due to limited resources.

Translating resources already developed in resource-rich languages can address this shortcoming, provided that the methodological accuracy used in creating the original dataset is maintained.

Translating existing datasets into other languages offers many advantages, considerably reducing creation time compared to building a resource from scratch. However, this approach is not entirely straightforward: a fully automatic machine translation cannot be sufficiently accurate in adapting the original text to the linguistic features of the target language.

Therefore, as an element of novelty, the proposed methodology includes a language refinement step derived from theoretical linguistics, particularly concerning aspects of syntax. This step handles language-dependent phenomena to produce sentences that are compliant with the target language grammar and perceived as correct according to native speakers’ judgments.

Despite this language-dependent refinement step, the proposed methodology is reproducible. It can be extended to other languages by developing a set of language-dependent refinement rules generalizing most of the linguistic phenomena of the language under examination. In addition, starting from existing resources makes it possible to obtain parallel corpora, useful for subsequent cross-lingual analysis.

From an application perspective, the proposed methodology has been used to create, to the best of our knowledge, the first medium-scale Italian dataset for CR that also respects properties of interoperability, domain independence, and compliance with annotation standards.

Indeed, the Italian language does not benefit from many resources and, as highlighted in Sect. 2.2, existing material is outdated and restricted to the VENEX [12], I-CAB [13], and LiveMemories [14] corpora.

It is worth noting that both the excessive specificity of their application domains and their lack of a shared annotation standard make interoperability between existing Italian resources extremely complicated. In contrast, the corpus generated with the proposed methodology is comparable in size and annotation criteria with OntoNotes, which is currently considered the essential resource for the field [15]. The opportunity to compare with OntoNotes, the de facto standard for evaluating coreference tasks since the CoNLL shared tasks in 2011 and 2012, could open exciting perspectives for multilingual analysis.

The quality of the generated dataset is also assessed with respect to its usability for training a deep learning model for CR in Italian. To this aim, a baseline model is trained on the dataset by adopting a state-of-the-art deep learning architecture proposed for the same task in English.

4 Methodology for the creation of the dataset

The proposed cross-lingual methodology has been developed starting from the multilingual coreference annotation of the OntoNotes dataset first proposed by [10]. It is structured in two macro-steps, as highlighted in Fig. 1. First, a coreference dataset is automatically translated from a source language into a target one, preserving mentions and their positions in the texts. In detail, OntoNotes is used as the input coreference dataset in English, and Italian is selected as the target language.

Fig. 1 The main steps of the proposed methodology

A pipeline has been realized to perform this translation process.

In detail, the pipeline proceeds as follows. First, a CR dataset in the source language, denoted with α, is obtained from the source corpus by preserving documents, partitions, utterances, and mentions, but discarding irrelevant information and mentions whose tokens are contained in other mentions. Then, the dataset β1 is obtained from the dataset α by discarding unwanted utterances, i.e., utterances lacking verbs or composed of too few or too many tokens. Successively, the dataset β2 is obtained from the dataset β1 by removing unwanted mentions, i.e., mentions that can easily lead to ambiguities and inaccuracies in their translation. After that, the dataset β3 is obtained from the dataset β2 by removing, within each partition, all mentions clusters that have become inconsistent. Finally, the CR dataset γ in the target language is obtained from the dataset β3 by translating its utterances and mentions through an intelligent token replacement/resolution procedure guided by the set class(idm), which contains an estimation of the typology, gender, and number of the real-world entities referred to by each mention within the dataset β3.

Second, a novel theoretical linguistics-based refinement is applied to improve the naturalness of the output text in the target language.

In particular, a series of rewriting rules based on principles of theoretical linguistics is applied to obtain a more readable and fluent Italian text from the original English text. The rules are structured so as to ensure the widest coverage of the most frequent phenomena in the sentences. Subsequently, they have been automatically applied to the whole dataset.

Such rules are the most innovative aspect of the methodology. Through the use of solid theoretical principles, they enhance the accuracy of a machine translation process on a specific task, producing output sentences as close as possible to those a native speaker of the target language would produce.

In detail, first the dataset δ is obtained from the dataset γ by refining its utterances and mentions through a set of language-dependent refinement rules based on principles of theoretical linguistics, improving the naturalness and readability of the output text in the target language. Then, the final output corpus is obtained from the dataset δ by rewriting, where necessary, pronouns and adjectives within utterances and mentions to improve their compliance with the grammatical constraints of the target language concerning agreement, inflection, and subject–object roles.
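The kind of language-dependent rule described above can be illustrated with a minimal sketch: Italian is a pro-drop language, so a redundant subject pronoun introduced by machine translation can often be deleted, as long as the pronoun is not itself part of a mention. The pronoun list, token/POS representation, and the rule itself are illustrative assumptions, not the paper’s actual rule set.

```python
# Illustrative pro-drop refinement rule (assumed, simplified): delete a
# third-person subject pronoun preceding a verb, unless it belongs to a
# mention span (mentions must be preserved for coreference annotation).

PRO_DROP_PRONOUNS = {"lui", "lei", "esso", "essa", "loro"}

def apply_pro_drop(tokens, pos_tags, mention_spans):
    """mention_spans: list of (start, end) token index pairs to protect."""
    protected = {i for (s, e) in mention_spans for i in range(s, e + 1)}
    out_tokens, out_tags = [], []
    for i, (tok, tag) in enumerate(zip(tokens, pos_tags)):
        is_droppable_subject = (
            tok.lower() in PRO_DROP_PRONOUNS
            and i + 1 < len(tokens)
            and pos_tags[i + 1].startswith("V")  # next token is a verb
            and i not in protected
        )
        if not is_droppable_subject:
            out_tokens.append(tok)
            out_tags.append(tag)
    return out_tokens, out_tags

tokens = ["Lui", "mangia", "la", "mela"]   # "He eats the apple"
tags = ["PRP", "VB", "DT", "NN"]
print(apply_pro_drop(tokens, tags, mention_spans=[]))
# (['mangia', 'la', 'mela'], ['VB', 'DT', 'NN'])
```

Note that a real implementation would also have to update the mention start/end indexes after each deletion; that bookkeeping is omitted here.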

In the following, the characteristics of the input coreference dataset and the two macro-steps of the methodology are explained in detail.

4.1 Source corpus

The starting corpus in the source language is OntoNotes [15], a dataset primarily containing texts from the news domain, initially developed for the shared tasks on modeling unrestricted coreference at CoNLL 2011 [43] and CoNLL 2012 [10].

OntoNotes turns out to be an obligatory choice for many reasons. First of all, despite its lack of heterogeneity, it has a remarkable diffusion in the field, having become the standard benchmark dataset for CR. Even the most recent systems perform their evaluation entirely on OntoNotes [44], although numerous other resources have been created for different domains.

OntoNotes also offers a considerable advantage in terms of size. As shown in Table 1, the corpora currently available for the Italian language are considerably smaller. Size is a significant issue, primarily because it affects the possibility of using a corpus as the training set for a machine learning model.

Another reason lies in the annotation schema. As pointed out by several studies, one of the critical issues in corpus creation and annotation for the coreference task is the definition of the unit of text to be chosen as a mention of an entity.

This definition can depend on syntactic and semantic factors and involves several controversial problems discussed in theoretical linguistics. The coreference annotations of OntoNotes do not use the raw text (tokens) as a base layer; rather, they rely on a morpho-syntactically annotated layer, since the dataset is built on a hand-tagged treebank that predates the coreference annotation. The coreference portion of OntoNotes is not limited to noun phrases or a limited set of entity types. The aim of the project was to annotate linguistic coreference using the most literal interpretation of the text at a very high degree of consistency, even if it meant departing from a particular linguistic theory [43].

The OntoNotes dataset is divided into three distinct subsets (Train, Dev, and Test), which can be used for training, developing, and testing a neural coreference model. The subsets Train, Dev, and Test are arranged into sets of documents composed of an ordered list of non-overlapping partitions of ordered utterances. Statistics on the dataset are reported in Table 2.

Table 2 OntoNotes statistics

Moreover, the distributions of the number of tokens and mentions per utterance in the OntoNotes dataset are reported in Fig. 2.

Fig. 2 Distributions of number of tokens and mentions per utterance in OntoNotes

4.2 Translation

The translation step aims to extract, process, and correctly translate a dataset for CR, operating on both utterances and mentions contained in them.

As mentioned above, the input dataset is OntoNotes, chosen as the most suitable resource for this work; however, any dataset for CR could be utilized. The source and target languages are English and Italian, even though almost all the considerations and procedures described in the following are valid for, or could be adapted to, other languages.

In more detail, this step first extracts from the dataset the linguistic information necessary for the translation. Second, the dataset is simplified by removing utterances, mentions, and mentions clusters not meeting specific selection criteria. Third, unique replacement tokens are identified and positioned in place of the mentions in the original utterances so as to preserve, after the translation, the tokens composing the mentions, their positions, and the verbal agreements involving them. Lastly, the translation into the target language is performed. The mentions initially substituted by replacement tokens are also translated and reinserted in place of their corresponding translated replacement tokens, avoiding ambiguities due to multiple mentions made of the same token(s) in the same utterance.

In the following, more details are given about the whole translation process, breaking it down into six sub-steps, namely (1) data preparation, (2) utterances filtering, (3) mentions simplification, (4) mentions clusters simplification, (5) referred entities estimation, and (6) utterances translation and tokens replacement.
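The replacement-token idea at the core of the translation step can be sketched as follows: each mention is swapped for a unique placeholder before machine translation, so that its position survives translation unambiguously, and the separately translated mention is re-inserted afterwards. The placeholder format (`ENT000`, `ENT001`, …) and the data structures are assumptions for illustration; the paper’s actual replacement tokens are chosen more carefully (e.g., to preserve verbal agreement).

```python
# Sketch of mention masking/unmasking around a machine translation call
# (the translation itself is omitted). Mentions are (id, start, end) triples
# over the token list; spans are assumed non-overlapping, as in dataset α.

def mask_mentions(tokens, mentions):
    """Replace each mention span with a unique placeholder token."""
    masked, i, mapping = [], 0, {}
    for k, (mid, s, e) in enumerate(sorted(mentions, key=lambda m: m[1])):
        masked.extend(tokens[i:s])
        placeholder = f"ENT{k:03d}"          # unlikely to be altered by MT
        mapping[placeholder] = tokens[s:e + 1]
        masked.append(placeholder)
        i = e + 1
    masked.extend(tokens[i:])
    return masked, mapping

def unmask(translated_tokens, translated_mentions):
    """Re-insert (translated) mentions in place of their placeholders."""
    out = []
    for tok in translated_tokens:
        out.extend(translated_mentions.get(tok, [tok]))
    return out

tokens = ["I", "visited", "an", "important", "city", "in", "China"]
mentions = [(7, 6, 6)]                        # "China", entity id 7
masked, mapping = mask_mentions(tokens, mentions)
print(masked)  # ['I', 'visited', 'an', 'important', 'city', 'in', 'ENT000']
```

After translating the masked utterance and each mention separately, `unmask` restores the mention at the placeholder’s translated position, so start/end indexes in the target language can be recovered directly.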

4.2.1 Data preparation

This step consists of a preliminary process to extract from the source dataset the information necessary to perform the subsequent translation.

In detail, let D be the set of documents in the source dataset, let P(d) = [P1, P2, ..., Pn] denote the ordered list of non-overlapping partitions of utterances composing a document d ∈ D, and let S(P) = [u1, u2, ..., ul] denote the ordered list of utterances contained in a partition P ∈ P(d). This step creates, for each utterance u ∈ S(P), a quadruple u′ = (t(u), p(u), m(u), s(u)), where t(u) and p(u) are, respectively, the list of tokens composing u and their Penn Treebank POS (part-of-speech) tags, m(u) is the set of mentions built by selecting only those, if any exist in u, containing no tokens of other mentions, and s(u) is the label associated with the speaker of u.

An example of how the quadruple u′ is built is reported in Fig. 3.

Fig. 3 Example of utterance contained in the dataset α

Only the mentions “it” and “China” are selected, whereas the mention “an important city in China called Yichang” is discarded since it contains the tokens of a shorter mention, i.e., “China.”

Each mention m = (idm, sm, em) is a triple where idm indicates the identifier of the referred real-world entity, and sm and em are the start and end indexes indicating the position of the tokens composing the mention in t(u) and of their POS tags in p(u). Distinct mentions mi and mj are clustered when they refer to the same real-world entity, i.e., idmi = idmj, and only if they belong to the same document partition. More formally, given I, the set of unique identifiers assigned to the real-world entities referred to in a partition P of a document d, a cluster is defined as follows:

$$ C(P \in {\varvec{P}}(d),id \in {\varvec{I}}) = \left\{ {\bigcup\limits_{{u \in {\varvec{S}}(P)}} {[(id_{m} ,s_{m} ,e_{m} ) \in {\varvec{m}}(u):id_{m} = id]} } \right\} $$

Summarizing, starting from a source dataset containing n distinct documents d1, d2, ..., dn, this step produces the following dataset α:

$$ \alpha = \bigcup\limits_{i = 1}^{i = n} {\left\{ {\bigcup\limits_{{P \in {\varvec{P}}(d_{i} )}} {\left[ {\bigcup\limits_{{u \in {\varvec{S}}(P)}} {\left[ {({\varvec{t}}(u),{\varvec{p}}(u),{\varvec{m}}(u),s(u))} \right]} } \right]} } \right\}} $$
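The cluster definition C(P, id) above maps directly onto a small set comprehension; a minimal sketch follows, with a partition represented as a list of utterances each carrying its mention set m(u) of (idm, sm, em) triples (the dictionary layout is an assumption for illustration).

```python
# Sketch of the cluster definition C(P, id): collect, across all utterances
# of a partition, the mention triples whose entity identifier equals id.

def cluster(partition, entity_id):
    """partition: list of utterances; each carries m(u) as a set of
    (id_m, s_m, e_m) triples under the key "mentions"."""
    return {m for utterance in partition
              for m in utterance["mentions"]
              if m[0] == entity_id}

partition = [
    {"mentions": {(7, 6, 6), (3, 0, 0)}},   # u0: "China" (id 7), "I" (id 3)
    {"mentions": {(7, 2, 2)}},              # u1: "it" (id 7)
]
print(sorted(cluster(partition, 7)))   # [(7, 2, 2), (7, 6, 6)]
```

Since identifiers expire from one partition to another, `cluster` is always called on a single partition, which guarantees that clusters from different partitions are disjoint.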

As an example, Fig. 4 reports a document partition within the dataset α and their associated mentions clusters.

Fig. 4 Example of document partition and its mentions clusters

It is worth noting that mentions belonging to different document partitions are assumed to refer to different real-world entities, i.e., the identifiers of real-world entities expire from one partition to another; thus, the mentions clusters belonging to one partition are disjoint from those belonging to another partition.

4.2.2 Utterances filtering

This step elaborates the dataset α to discard undesired utterances. In particular, given an utterance u ∈ α, u is discarded or not in accordance with the criteria reported in Table 3.

Table 3 The criteria followed for evaluating whether or not to preserve an utterance

These criteria derive from the consideration that, on the one hand, utterances containing no verbs or composed of too few tokens should be discarded since they usually show missing or wrong grammatical dependencies; from a strictly linguistic point of view, verbless sentences are more likely to be noun phrases than well-formed sentences. On the other hand, overly long utterances often present a complex syntax that is difficult to understand even for a native-speaking human. The minimum and maximum thresholds used for selecting the utterances to be preserved have been chosen based on the syntactic capacity limitation of human working memory and of computational language models for the correct understanding of the complex syntactic relations of a well-formed sentence [45].

A clarification on the terminology used is needed. In this work, the terms utterance and sentence can be considered equivalent, although this is not precisely true in theoretical linguistics. OntoNotes only refers to utterances, which is why short sentences have been discarded in the proposed methodology. As mentioned above, short sentences tend not to be well-formed precisely because they are not technically sentences conveying a complete meaning: they are utterances, smaller units of speech which do not necessarily have a unit of meaning or a semantic structure.

Thus, the dataset β1 is generated as follows:

$$ \beta_{1} = \alpha - \{ u:u \in \alpha \wedge u\;is\;discarded\} $$
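The filter producing β1 can be sketched as a predicate over the (t(u), p(u)) pair of each utterance. The threshold values below are illustrative assumptions, not the paper’s exact criteria (which Table 3 specifies).

```python
# Sketch of the utterance filter producing β1: utterances with no verb,
# or with too few / too many tokens, are discarded. Thresholds are assumed.

MIN_TOKENS, MAX_TOKENS = 5, 27   # illustrative values only

def keep_utterance(tokens, pos_tags):
    # Penn Treebank verb tags all start with "V" (VB, VBD, VBZ, ...)
    has_verb = any(tag.startswith("V") for tag in pos_tags)
    return has_verb and MIN_TOKENS <= len(tokens) <= MAX_TOKENS

utterances = [
    (["Hello", "."], ["UH", "."]),                    # discarded: too short
    (["She", "visited", "China", "last", "year"],
     ["PRP", "VBD", "NNP", "JJ", "NN"]),              # kept
]
beta1 = [u for u in utterances if keep_utterance(*u)]
print(len(beta1))   # 1
```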

As an example, Fig. 5 considers the same document partition shown in Fig. 4: the utterance u0 is discarded since it contains no verbs, while the utterances u2, u3, and u5 are removed since they are composed of twenty-eight tokens, resulting in intricate, not completely clear syntactic dependencies that are hard to understand.

Fig. 5 Example of utterances from the dataset α not included in the dataset β1

4.2.3 Mentions simplification

The dataset β2 is generated by removing from the dataset β1 the undesired mentions, i.e., mentions that can easily lead to ambiguities and inaccuracies in their translation. To this end, mentions composed of single or multiple tokens are evaluated by computing their dependency trees and using the roots to select the ones to be preserved, on the basis only of the POS tags that allow estimating the gender and number of the referred real-world entities (the estimation is performed in a subsequent step).

It is worth noting that dependency tree roots coincide with mentions themselves if they are made of single tokens.

More formally, given a mention m ∈ β1, and denoting with rt(m) the root of the dependency tree of the tokens t(m) composing m, m is discarded or not in accordance with the criteria reported in Table 4.

Table 4 The criteria followed for evaluating whether or not to preserve a mention m

In particular, on the one hand, single-token mentions, as well as multi-token mentions containing no verbs, whose dependency parse root is a Personal pronoun in third person, a Possessive pronoun in third person, a Determiner, a Noun, or a Proper noun, are preserved; in the other cases, they are discarded (note that, according to various studies [46], from 70 to 90% of mentions are pronouns).

This choice is motivated by the fact that these kinds of mentions enable the identification of the gender and number of the real-world entities referred to by the mentions themselves, which, as a core idea of the proposed methodology, supports the preservation of the verbal agreements between the translated mentions and the other tokens within the translated utterance.

On the other hand, multi-token mentions containing one or more verbs are discarded, since their dependency tree can easily be wrong, giving rise to further ambiguities and inaccuracies in the process. The dataset β2 is then generated as follows:

$$ \beta_{2} = \beta_{1} - \{ m = \left( {id_{m} ,s_{m} ,e_{m} } \right):m \in \beta_{1} \wedge m\;is\;discarded\} $$
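The mention filter producing β2 reduces to a check on the POS tags of the mention and of its dependency-tree root rt(m). In the sketch below the root tag is assumed to be precomputed by a parser, and the third-person restriction on pronouns is omitted for brevity; the tag sets follow the Penn Treebank conventions used in the paper.

```python
# Sketch of the mention filter producing β2: multi-token mentions containing
# verbs are discarded (unreliable dependency trees); the rest are kept only
# if the dependency root's POS tag allows estimating gender/number.

KEEP_ROOT_TAGS = {"PRP", "PRP$", "DT", "NN", "NNS", "NNP", "NNPS"}

def keep_mention(mention_tags, root_tag):
    """mention_tags: POS tags of the mention's tokens; root_tag: POS tag
    of the dependency-tree root rt(m) (the token itself if single-token)."""
    if any(t.startswith("V") for t in mention_tags):
        return False                     # contains a verb: discard
    return root_tag in KEEP_ROOT_TAGS

print(keep_mention(["NNP"], "NNP"))                       # True  ("China")
print(keep_mention(["CD"], "CD"))                         # False ("1940")
print(keep_mention(["DT", "NNP", "NNPS", "NNP"], "NNP"))  # True
```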

An example related to the document partition shown above is reported in Fig. 6.

Fig. 6 Example of mentions from the dataset β1 not included in the dataset β2

In particular, the mention “China” within the utterance u0 is preserved since it is a Proper noun, whereas “1940” is discarded since it is a Numeral. Moreover, the mentions “Taihang Mountain” and “the Hundred Regiments Offensive” within the utterance u1 are preserved since their dependency trees exhibit as root, highlighted in bold, a Proper noun.

4.2.4 Mentions clusters simplification

The dataset β3 is generated by removing from the dataset β2, within each partition, all mentions clusters that have become inconsistent after the previous utterance or mention removals. More formally, a mentions cluster C is discarded or not according to the criteria shown in Table 5.

Table 5 The criteria followed for evaluating whether or not to preserve a cluster C

In detail, a mention cluster C is preserved when: (1) it is composed of at least two mentions; (2) it contains at least one mention whose dependency tree exhibits as root a Noun or a Proper noun. This second condition forces the cluster to contain at least one mention capable of introducing the referred real-world entity. Clusters left with zero elements after the previous removals are automatically discarded since they are meaningless. Then, the dataset β3 is generated as follows:

$$ \beta_{3} = \beta_{2} - \{ m \in C : C \in \beta_{2} \wedge C\;\text{is discarded}\} $$
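The two preservation criteria can be sketched as follows (a minimal sketch over a hypothetical list-of-dictionaries cluster representation, not the authors' code):

```python
# Nominal Universal POS tags able to introduce a referred entity.
NOMINAL_POS = {"NOUN", "PROPN"}

def is_cluster_preserved(cluster):
    """A cluster survives the beta_3 filter when it still contains at
    least two mentions and at least one of them has a nominal dependency
    root, i.e., a mention able to introduce the referred real-world
    entity."""
    if len(cluster) < 2:
        return False
    return any(m["root_pos"] in NOMINAL_POS for m in cluster)
```

Clusters reduced to zero or one element by the previous removals fail the first check and are discarded automatically.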

The same example document partition shown above is reported in Fig. 7, where some clusters are discarded.

Fig. 7
figure 7

Example of clusters from the dataset β2 not included in the dataset β3

In particular, the mentions cluster C5 is preserved since it contains two elements, and one of them is the mention “the Japanese army,” whose dependency tree root is a Noun. On the contrary, the clusters C0, C1, C3, and C6 are discarded since their cardinality is less than two. For instance, the cluster C0 became inconsistent after the previous removal of the utterance u0 from the considered partition. The distribution of tokens and mentions per utterance in the dataset β3 is reported in Fig. 8.

Fig. 8
figure 8

Distribution of tokens and mentions per utterance in the dataset β3

4.2.5 Referred entities estimation

This step aims to estimate the typology, gender, and number of the real-world entity referred to by a mention. This information will be used, in the next step, to determine unique replacement tokens to be positioned in place of the mentions, improving the overall translation by also preserving verbal agreement.

More formally, given a mention m = (idm, sm, em) within an utterance us ∈ P, where P is a document partition, this step is in charge of estimating the class class(idm) for each m ∈ β3, where class(idm) is defined as the triple (type(idm), gender(idm), number(idm)).

In detail, denoted with tt(m) the ordered list of tokens obtained by translating t(m) into the target language, class(idm) is estimated by means of the following sequence of steps: (1) rt(m) is used to determine all the values of the triple (type(idm), gender(idm), number(idm)); (2) in case some values of the triple cannot be determined from rt(m), rtt(m) in the target language is used; (3) finally, in case some values of the triple cannot be determined from either rt(m) or rtt(m), they are approximated by referring to the other mentions m′ ∈ {C(P,idm) − m} belonging to the same cluster.
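The three-stage fallback can be sketched as follows (the feature dictionaries and the helper name are illustrative assumptions, not the authors' implementation):

```python
def estimate_class(source_feats, target_feats, cluster_feats):
    """Three-stage fallback for the triple (type, gender, number): use
    the features extracted from the source root token rt(m) first, then
    those of its target-language translation rtt(m), then the
    approximation derived from the other mentions of the cluster."""
    return tuple(
        source_feats.get(attr) or target_feats.get(attr) or cluster_feats.get(attr)
        for attr in ("type", "gender", "number")
    )
```

For instance, when the source POS tags reveal type and number but not gender, the gender is taken from the target-language analysis; whatever remains undetermined falls through to the cluster approximation.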

More precisely, in the case when rt(m) is a Personal pronoun or a Possessive pronoun, class(idm) is estimated as reported in Table 6:

Table 6 The estimation of class(idm) for Personal and Possessive pronouns

In the last three rows, the gender of class(idm) cannot be determined immediately, so the other mentions belonging to the same cluster are used to approximate it.

In the case when the token rt(m) is a Noun or a Proper noun, the gender and number of class(idm) cannot be directly deduced if the source language is English, since this information is not typically reported in the POS tags. Then, gender and number are derived from the POS tag generated for the token rtt(m) in the target language, if reported; otherwise, the other mentions belonging to the same cluster are used to approximate them.

Furthermore, in the case when the token rt(m) is a Determiner, the only option left is to approximate both gender and number by referring to the other mentions belonging to the same cluster.

The estimation of the gender (number) of class(idm) from the other mentions m′ ∈ {C(P,idm) − m} belonging to the same cluster is performed by calculating the most frequent gender (number), giving more weight to the genders (numbers) suggested by pronouns than to the ones suggested by nouns. More formally, the gender and number of class(idm) are determined as reported in Table 7 and in Table 8, respectively.

Table 7 The estimation of the gender of class(idm) for a mentions cluster
Table 8 The estimation of the number of class(idm) for a mentions cluster

As an example, consider two utterances u1 = “Lora Owens is the stepmother of Albert Owens .” and u2 = “She joins us now by phone .” belonging to the same document partition, and the mentions cluster C6 = {m1 ∈ u1, m2 ∈ u2}, where m1 = “Lora Owens” and m2 = “She.” The class(idm2) can be easily determined as equal to (human, female, singular), whereas, on the contrary, no information can be inferred for class(idm1) by evaluating the mention m1. Thus, it can be estimated on the basis of the values of the other mention m2 belonging to C6. Roughly speaking, since the cluster C6 contains one pronoun suggesting that the referred real-world entity is a female human, this information can be extended to the other mention to estimate its class.
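The cluster-based approximation described above amounts to a weighted majority vote, which can be sketched as follows (the pronoun/noun weights are illustrative; the paper's exact weighting is specified in Tables 7 and 8):

```python
from collections import Counter

def approximate_attribute(other_mentions, attr, pron_weight=2, noun_weight=1):
    """Weighted majority vote over the other cluster members: values
    suggested by pronouns count more than values suggested by nouns.
    Returns None when no member carries the attribute."""
    votes = Counter()
    for m in other_mentions:
        value = m.get(attr)
        if value is not None:
            votes[value] += pron_weight if m["pos"] == "PRON" else noun_weight
    return votes.most_common(1)[0][0] if votes else None
```

In the C6 example, the single pronoun “She” suffices to assign the gender female to “Lora Owens.”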

4.2.6 Utterances translation and tokens replacement

This step is devised to perform a sequence of three actions on each utterance us ∈ β3 expressed in the source language, namely tokens replacement, utterances translation, and tokens resolution.

First, tokens replacement consists in evaluating, for each mention m ∈ us, the triple class(idm), in order to select a unique token m′ ∉ us to be positioned in place of m and, as a result, generate the utterance \(u_{s}^{\prime }\). It is worth noting that \(u_{s}^{\prime } = u_{s}\) in the case when no mention is contained in us.

Replacement tokens are randomly extracted from predefined lists of unique tokens built such that, on the one hand, they exhibit the same type, gender, and number as class(idm) and, on the other hand, their representations in the source and target language are the same, i.e., rt(m) = rtt(m). This choice increases the chance that replacement tokens appear unchanged within a translated utterance.

As an example, the utterance us = “Lora Owens is the stepmother of Mary White, she joins us now by phone.” contains three mentions m1 = “Lora Owens,” m2 = “Mary White,” and m3 = “she.” In the hypothesis that class(m1) = class(m2) = class(m3) = (human, female, singular), three replacement tokens \(m_{1}^{\prime }\) = “Gabriella,” \(m_{2}^{\prime }\) = “Serena,” and \(m_{3}^{\prime }\) = “Sabrina” are selected from a list of women’s names whose representations in the source and target language are the same. These tokens are positioned in place of m1, m2, and m3 and, as a result, the utterance \(u_{s}^{\prime }\) = “Gabriella is the stepmother of Serena, Sabrina joins us now by phone.” is generated.
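A sketch of the replacement action (the token-index mention spans and the pool structure are our assumptions, not the authors' data model):

```python
import random

def replace_mentions(tokens, mentions, pools, seed=0):
    """Substitute each mention span (inclusive start/end token indices)
    with a unique replacement token drawn from the pool matching its
    class; pools are assumed to contain only tokens spelled identically
    in the source and target languages."""
    rng = random.Random(seed)
    used, out = set(), list(tokens)
    # Process spans right-to-left so earlier spans keep valid indices.
    for m in sorted(mentions, key=lambda m: m["start"], reverse=True):
        choice = rng.choice([t for t in pools[m["class"]] if t not in used])
        used.add(choice)
        out[m["start"]:m["end"] + 1] = [choice]
    return out
```

Ruling out already-used tokens guarantees the uniqueness needed to map each replacement back to its mention after translation.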

Second, utterances translation consists, on the one hand, in generating the utterance \(u_{t}^{\prime }\) by translating \(u_{s}^{\prime }\) into the target language and, on the other hand, in verifying for each \(m^{\prime } \in u_{s}^{\prime }\) its existence in \(u_{t}^{\prime }\). In the case when \(\exists m^{\prime } \in u_{s}^{\prime } :m^{\prime } \notin u_{t}^{\prime }\), the tokens replacement is performed again for the utterance us and a distinct token for m is selected.

As an example, for the utterance \(u_{s}^{\prime }\) = “Gabriella is the stepmother of Serena, Sabrina joins us now by phone.”, the utterance \(u_{t}^{\prime }\) = “Gabriella è la matrigna di Serena, Sabrina si unisce a noi ora per telefono.” is generated.

Third, tokens resolution consists in generating, for each mention m ∈ us, the tokens rtt(m) by translating rt(m) into the target language. Moreover, the utterance ut is generated from \(u_{t}^{\prime }\) by resolving each m′ within \(u_{t}^{\prime }\), i.e., by positioning the tokens rtt(m) in place of m′. It is worth noting that \(u_{t} = u_{t}^{\prime }\) in the case when no replacement token is contained in \(u_{t}^{\prime }\).

As an example, given the utterance \(u_{t}^{\prime }\) = “Gabriella è la matrigna di Serena, Sabrina si unisce a noi ora per telefono.”, the replacement tokens \(m_{1}^{\prime }\) = “Gabriella,” \(m_{2}^{\prime }\) = “Serena,” and \(m_{3}^{\prime }\) = “Sabrina” are resolved on the basis of the translated tokens rtt(m1) = “Lora Owens,” rtt(m2) = “Maria Bianca,” and rtt(m3) = “lei” and, as a result, the utterance ut = “Lora Owens è la matrigna di Maria Bianca, lei si unisce a noi ora per telefono.” is generated.
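The resolution action can be sketched as a simple substitution pass (assuming whitespace-tokenized utterances; not the authors' implementation):

```python
def resolve_tokens(translated_tokens, mapping):
    """Swap each surviving unique replacement token back for the
    translation of the original mention text, which may span several
    tokens; unmapped tokens pass through unchanged."""
    out = []
    for tok in translated_tokens:
        out.extend(mapping.get(tok, tok).split())
    return out
```

Because replacement tokens are unique within the utterance, the mapping is unambiguous even when several mentions share the same class.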

As a result of this step, the dataset γ is generated.

4.3 Linguistic refinement

This step is in charge of applying to the dataset γ a set of language-dependent refinement rules based on principles of theoretical linguistics to improve the naturalness and readability of the output text in the target language.

It is necessary to make some minor clarifications about the differences between the two languages under analysis from a linguistic point of view. Italian and English differ in multiple respects, beginning with their origin, the variability of word order among constituents, and their greater or lesser morphological richness. First of all, English is a Germanic language with rigid word order and extremely little inflectional variation [47]; its fixed subject-verb order implies a mandatory explicit subject. By contrast, Italian belongs to the Romance subgroup of the Italic languages, characterized by high verbal inflection [48] and great freedom in the order of constituents [49]. Such morphological richness leads to a different configuration of the syntactic structures involving pronouns. In particular, it results in the omission of the subject pronoun. As pointed out by recent studies, this misalignment produces difficulties in the translation process, since the missing pronoun is challenging to reproduce and affects the order of dependencies in the sentence [50, 51]. For that reason, from a practical point of view, the refinement rules for the target language have been focused on improving the use of personal and possessive pronouns and, in addition, of possessive and demonstrative adjectives.

Indeed, generally speaking, personal and possessive pronouns often represent the main parts of speech used to co-refer to an entity, as reported in [46, 52]. Moreover, in the dataset γ as well, a greater distribution of pronouns as single-token coreferences is observed, as reported in Table 9.

Table 9 Distribution of most frequent single-token coreference POS in the dataset γ

In more detail, two specific phenomena typically occur in Italian that alter the use of pronouns and adjectives with respect to English, namely the null subject and the agreement and inflexion of morphemes.

The null-subject phenomenon permits an independent utterance to lack an explicit subject. Such truncated utterances have an implied or suppressed subject that can be determined from the context. In particular, null-subject languages, like Italian, express person, number, and/or gender agreement through the verb inflexion, making a subject noun phrase redundant. It is worth noting that an explicit subject does not make an utterance ungrammatical, but it is often perceived as less natural by native speakers. As an example, in the utterance “Giovanni andò a far visita a degli amici. Per la strada, [egli] comprò del vino” (“John went to visit some friends. On the way, [he] bought some wine”), the subject pronoun “egli” (“he”) is suppressed in Italian. This phenomenon is not present in English, and the strategy of coreference annotation used in OntoNotes for pronouns is difficult to match completely with a language belonging to a different linguistic family, such as Italian, where pronouns can be omitted when used as the subject of an utterance. Notice that translation involving null-subject languages is still a heavily debated issue in the literature because of the difficulty of representing dropped pronouns [51]. In recent years, many studies have addressed the problem, proposing different solutions for different languages [53,54,55], including Italian [56].

On the other hand, agreement is a morpho-syntactic phenomenon in which the gender and number of the subject and/or objects of a verb must also be indicated by the verbal inflexion. As an example, consider the utterance “Quello è andato” (“That one is gone”), where the singular masculine subject pronoun “quello” (“that one”) agrees with the past participle “andato” (“gone”) of the verb “andare” (“go”) to which it refers. The past participle, indeed, presents a singular masculine inflexion, as highlighted by the suffix -o, the same as that of the pronoun “quello.”

In English, pronouns and adjectives do not exhibit such inflection, and thus their agreement with verbs is not expressed. On the contrary, in Italian they must be in concordance with the verbal forms. As a result, after translating both pronouns and adjectives from English to Italian, their agreement with the verbs must be verified and enforced when it is not respected.

In summary, this step of linguistic refinement is meant to further refine the dataset γ by removing non-mandatory subject pronouns and by rewriting pronouns and adjectives to ensure correct agreement and inflexion. In the following, more details are given, breaking this step down into two sub-steps, namely (1) Subject Pronouns Deletion and (2) Pronouns and Adjectives Rewrite.

As a result of this step, the dataset δ is generated.

4.3.1 Subject pronouns deletion

This step aims to properly handle the null-subject phenomenon for the pronouns occurring in the dataset γ after the translation into the target language, i.e., Italian. It is in charge of evaluating the utterances within γ to (1) delete personal pronouns assuming the subject role in them; (2) move any mention associated with a deleted subject pronoun onto the verb in dependency relation with it.

More formally, given an utterance u ∈ γ, denoted with DT(u) its dependency tree, with ti and tj the ith and jth elements of the list of tokens t(u), with d(tj,ti) ∈ DT(u) a dependency relation from tj to ti, and with label(d) the label associated with the typed dependency relation d, the criteria followed for performing the pronoun deletions are reported in Table 10.

Table 10 The criteria followed for performing pronouns deletion

In detail, first, a personal pronoun is identified as the subject of a clause contained in an utterance u ∈ γ by verifying whether it is connected with a verb through a direct grammatical dependency, typed as subject, in the corresponding dependency tree. Each personal pronoun labeled as subject can be removed.

If no mention is placed on the subject pronoun to be deleted, it is simply removed from the utterance. On the contrary, in case a mention is positioned on it, the mention is moved toward the verbal constituent it is dependent on, as calculated in the corresponding dependency tree, following the approach proposed in MATE Guidelines [57] and LiveMemories Corpus [14].

As an example, in the utterance “[Egli] ha detto alla gente che [lei] era una brava cuoca” (“[He] has told people that [she] was a good cook”), the personal pronouns “Egli” and “lei” act as subjects of their clauses and can be omitted. The deletion of the subject pronouns “Egli” and “lei” generates the shift of the mentions placed on them toward the verbal constituents “ha” (“has”) and “era” (“was”) on which they are dependent.
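A sketch of the deletion-and-shift logic, under a simplified token and dependency representation (Universal Dependencies-style labels are assumed; the re-indexing of the surviving tokens is omitted for brevity):

```python
def delete_subject_pronouns(tokens, deps, mentions):
    """Remove personal pronouns heading a subject dependency and move
    any mention annotated on them onto the governing token. `deps` maps
    a token index to (head_index, label)."""
    doomed = {i for i, t in enumerate(tokens)
              if t["pos"] == "PRON" and deps[i][1] == "nsubj"}
    shifted = [dict(m, token=deps[m["token"]][0]) if m["token"] in doomed else m
               for m in mentions]
    kept = [t for i, t in enumerate(tokens) if i not in doomed]
    return kept, shifted
```

In the “[Egli] ha detto …” example, the mention placed on “Egli” ends up on the verbal constituent “ha” that governs it.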

4.3.2 Pronouns and adjectives rewrite

This step aims to evaluate each utterance to identify pronouns and adjectives that can be rewritten to improve their compliance with Italian grammatical constraints concerning agreement, inflexion, and subject/object roles.

The first set of rules operates on personal pronouns in clauses, verifying and correcting (1) their agreement in number with verbs, in case they assume the role of subjects, and (2) the correspondence between the syntactic role (subject or object) and the inflected form (first or second person singular). More formally, given an utterance u ∈ γ, denoted with number(t) the number of a token t, indicating whether t is expressed, or is assigned to be, in its singular or plural form, the criteria adopted to rewrite personal pronouns are reported in Table 11.

Table 11 The criteria followed for rewriting personal pronouns

In detail, in the first rule, a personal pronoun ti is identified as the subject of a clause contained in an utterance u ∈ γ by verifying whether it is connected with a verb tj through a direct grammatical dependency d(ti,tj), typed as subject, in the corresponding dependency tree. Then, the agreement in number between the subject pronoun and the corresponding verb is verified and possibly corrected. As an example, the utterance “Tu siete nella stanza” (“You are in the room”) contains the second person singular pronoun “Tu” (“You”) in disagreement with the plural form of the verb “siete” (“are”). Thus, the personal pronoun is rewritten in its plural form as “Voi.”
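A deliberately tiny sketch of this number-agreement rewrite, restricted to the second person of the example (a full implementation would cover the whole pronoun paradigm):

```python
# Illustrative second-person paradigm: singular and plural subject forms.
SECOND_PERSON = {"singular": "Tu", "plural": "Voi"}

def agree_subject_pronoun(pronoun, verb_number):
    """Rewrite a second-person subject pronoun in the number of its
    verb; other pronouns pass through unchanged."""
    if pronoun in SECOND_PERSON.values():
        return SECOND_PERSON[verb_number]
    return pronoun
```

Applied to the example, “Tu” paired with the plural verb “siete” is rewritten as “Voi.”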

In the next two rules, personal pronouns in the first or second person singular that are preceded by a preposition and wrongly assume the form of the subject pronoun, i.e., “io” (“I”) and “tu” (“you”), are corrected with the corresponding form for the object role, i.e., “me” (“me”) and “te” (“you”).

In the last two rules, personal pronouns in the first or second person singular that are preceded by the conjunction “che” (“that”) and wrongly present their object pronoun form, i.e., “me” (“me”) and “te” (“you”), are corrected with the corresponding subject form, i.e., “io” (“I”) and “tu” (“you”). As an example, the utterance “Non credono che me sia pronto” (“They do not think me am ready”) wrongly uses the pronoun “me” in its object form. Thus, it is rewritten as “io” (“I”) since it has a subject role in the clause introduced by the conjunction “che” (“that”).

The second set of rewrite rules evaluates the agreement in gender and number between possessive and demonstrative adjectives and the noun they refer to (typically the noun immediately before or after them), following the criteria reported in Table 12.

Table 12 The criteria followed for rewriting possessive and demonstrative adjectives

In detail, in the first rule, each possessive adjective ti within an utterance u ∈ γ is identified as connected with a noun tj by means of a direct grammatical dependency d(ti,tj), typed as possessive determiner, in the corresponding dependency tree. Then, the agreement in gender and number between the possessive adjective and the corresponding noun is verified and possibly corrected. As an example, the utterance “Mia padre lavora in banca” (“My father works in a bank”) contains the possessive adjective “Mia” (“My”) with the feminine suffix “-a” in disagreement with the masculine singular noun “padre” (“father”), while the corresponding English “my” has no inflection. Thus, it is rewritten as “mio,” with the masculine suffix “-o.”

In the second rule, each demonstrative adjective ti within an utterance u ∈ γ is recognized as related to a noun tj if the latter occurs at most four tokens forward and is connected through a direct grammatical dependency d(ti,tj), typed as a generic determiner, in the corresponding dependency tree.

Then, the agreement in gender and number between the demonstrative adjective and the corresponding noun is verified and possibly corrected. Moreover, the suffix of the demonstrative adjective ti is also checked and modified on the basis of the initial letters of the token ti + 1 immediately following ti in u, as reported in Table 13.

Table 13 The criteria followed for modifying the suffixes of demonstrative adjectives

The thresholds concerning the minimum and maximum number of tokens and the distance of demonstratives are inspired by recent studies [58] that have quantitatively estimated the limits that human working memory and computational language models impose on the correct understanding of the complex syntactic relations of a well-formed sentence.

As an example, the utterance “Quella avviso è stato redatto nelle ultime 24 ore .” (“That notice has been drafted in the last 24 hours .”) contains the demonstrative adjective “Quella” (“That”) with the feminine suffix “-a” in disagreement with the masculine singular noun “avviso” (“notice”). Thus, the demonstrative adjective is rewritten as “Quello,” with the masculine suffix “-o.” Moreover, since the token following the demonstrative starts with a vowel, “Quello” is further replaced with its elided form (“Quell’”).
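The agreement-plus-elision rewrite can be sketched as follows (a simplified fragment covering only the singular forms of the example; real Italian demonstratives have further allomorphs, such as “Quel” before most consonants, which are omitted here):

```python
# Illustrative singular demonstrative paradigm.
DEMONSTRATIVES = {("masc", "sing"): "Quello", ("fem", "sing"): "Quella"}
VOWELS = set("aeiouAEIOU")

def rewrite_demonstrative(noun_gender, noun_number, next_token):
    """Pick the demonstrative agreeing with the noun, then elide it when
    the following token starts with a vowel (e.g., Quello -> Quell')."""
    form = DEMONSTRATIVES[(noun_gender, noun_number)]
    if next_token[0] in VOWELS:
        return form[:-1] + "'"
    return form
```

For the masculine singular noun “avviso,” the function yields the elided form “Quell'” of the example.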

Finally, the last typology of rewriting rule evaluates if a demonstrative is used as a pronoun and replaces it with a neuter term following the criteria reported in Table 14.

Table 14 The criteria followed for rewriting demonstrative pronouns

In detail, this rule evaluates whether a demonstrative ti within an utterance u ∈ γ is related to a noun tj within a span of at most four tokens. In the negative case, it is assumed to act as a pronoun and, thus, can be replaced by a neuter term, preventing possible agreement errors in long-distance syntactic dependencies. As an example, the utterance “Quella è stato fatto nelle ultime 24 ore .” (“That has been done in the last 24 hours .”) contains the demonstrative “Quella” (“That”), which is not connected with a noun within a span of four tokens. Thus, it is replaced by the neuter demonstrative “Ciò” (“That”).

Notice that all the rules aimed at rewriting, deleting, and modifying mentions do not affect the complexity of the language under consideration. The rules do not simplify syntactic phenomena or grammar; rather, they try to respect the syntax of the target language (Italian) without losing the information on mentions and coreferences present in the source language (English).

5 Results and evaluation

The dataset δ obtained after applying the proposed methodology is described in detail in the following in terms of statistics and output format.

Moreover, it is also analyzed both quantitatively and qualitatively to assess the naturalness of its utterances by investigating, first, the change of their readability from α to δ, and second, their well-formedness concerning syntactic (grammaticality) and semantic (acceptability) aspects.

Finally, its goodness is also assessed concerning the possibility of being used to train a deep learning model for CR in Italian.

5.1 Dataset description

Table 15 reports an overview of the obtained dataset δ, showing the total number of utterances (utts) and the impact of the linguistic refinements, which affect a high percentage of utterances (about 64%, as indicated by refined utts). Table 15 shows that most of the changes are related to pronouns. In particular, the row “subject pronouns deleted being mentions” indicates that the deletions of subject pronouns (9848 in total) outnumber the applications of the rewriting rules, covering both pronouns and adjectives (9045 in total). Adjectives are involved to a lesser extent in both kinds of rules, as seen from the last three rows of Table 15.

Table 15 The impact of the linguistic refinements over the dataset δ

Concerning the linguistic rules applied to generate δ, the ones that have found most application instances are deletions, with the consequent shifts in coreference on the verb. This result is reasonably expected since there is a transition from a language with a mandatory expressed subject to a pro-drop language in which the subject pronoun is systematically missing.

As already mentioned, this has been one of the most challenging tasks from both a theoretical and a practical point of view. The transition from a language with an explicit subject (English) to a pro-drop language (Italian) is not limited to a deletion process. In fact, it is very common for the subject pronoun to be labeled as a mention in the original dataset, so it has almost always been necessary to shift the mention without compromising the dependencies and the syntactic structure of the sentence.

The dataset δ has been structured for release in both CoNLL and JSON formats. Both formats preserve morpho-grammatical information on the parts of speech of each element of the utterance. The CoNLL annotation enables easy interfacing with the tools and models typically used in CR (see Fig. 9).

Fig. 9
figure 9

Example of CoNLL format

The JSON version of the dataset is enriched with additional information. As shown in Fig. 10, it keeps track of the changes involving utterances and mentions in the generation of the dataset δ, highlighting the data affected by subject deletion, pronoun and adjective rewriting, and mention shifts.

Fig. 10
figure 10

Example of JSON format

For instance, the original utterance shown in Fig. 10 is “Esso era facile da gestire una volta che tutti capivano” (“It was easy to manage once everybody understood”), and it is modified by a deletion rule, as shown in “modified text.” The rewritten utterance has a readability score of 79.26; it drops the subject pronoun “Esso” (“It”), with a shift of the mention from “Esso,” as can be seen in “corefs old,” to the verb “era” (“was”). Since there is a deletion, the indices indicating the position of the mention (“start” and “end”) remain unchanged, because the verb takes the position of the deleted subject pronoun. Notice that this shifting process, which moves the coreference to the verb, is consistent with linguistic theory: the centering role of the verbal phrase within the sentence reflects theoretical aspects inherent in the hierarchical dependencies of the sentence constituents.

5.2 Readability assessment

The first evaluation of the resulting dataset δ is performed quantitatively concerning the criterion of readability.

In natural language, readability is defined as the ease with which a reader can understand a written text. It depends on lexical factors (e.g., the complexity of the vocabulary used) and syntactic factors (e.g., the presence of nested subordinate clauses). Several readability scores exist in the literature, which provide a way to automatically assess the quality of a written text.

To this aim, for this work, the readability scores based on the Flesch reading ease test and on the Flesch-Vacca index [16], its adaptation to the Italian language, have been calculated for the utterances within the datasets α and δ. Table 16 shows the readability scores for the dataset α (in English) and the dataset δ (in Italian). The percentage of utterances falling into each readability range is presented in each row.
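For reference, the Flesch-Vacca index combines words, sentences, and syllables as sketched below (coefficients as commonly reported for the Italian adaptation; syllable counting, which is nontrivial, is left to the caller):

```python
def flesch_vacca(n_words, n_sentences, n_syllables):
    """Flesch-Vacca readability index, the Italian adaptation of the
    Flesch reading-ease formula; higher scores mean more readable
    text."""
    syllables_per_100_words = 100.0 * n_syllables / n_words
    words_per_sentence = n_words / n_sentences
    return 206.0 - 0.65 * syllables_per_100_words - words_per_sentence
```

As expected, longer sentences and more polysyllabic words both push the score down, which is also why the metric weighs the lexicon heavily.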

Table 16 Comparison of readability scores before and after linguistic refinement

The table shows that the readability of the dataset δ, expressed in the target language, is comparable to that of the dataset α, expressed in the source language. This result suggests that the proposed methodology has not significantly altered the overall readability of the utterances. Instead, there is a notable increase in the class grouping utterances with scores above 80 (an improvement of 4.6 percentage points). As can be noted, there is also a significant drop in the inconsistency of judgements for sentences with readability between 40 and 60. This result is an expected outcome since the greater the readability, the greater the agreement between the annotators [59].

However, even if this readability assessment gives a rough idea of the validity of the proposed methodology, it is not without limitations, since the readability scores used are still debated in the literature [60]. For instance, polysyllabic words significantly affect the score, and the metrics weigh the lexicon more heavily than the syntax. Furthermore, a readable utterance is characterized by a linear syntax and a simple vocabulary, but it can still contain infelicities that make it ill-formed.

5.3 Grammaticality and acceptability assessment

The second evaluation of the resulting dataset δ is performed qualitatively, to overcome the limitations of readability scores, by considering the criteria of grammaticality and acceptability.

These criteria have a long history in theoretical linguistics [61]. In detail, grammaticality refers to utterances that are correct from a syntactic and structural point of view according to the annotators’ judgements; acceptability, on the contrary, assesses whether an utterance is semantically valid according to the annotators’ intuitions. In other words, grammaticality is not necessarily associated with semantic correctness or acceptability; rather, it refers to a well-formed utterance, i.e., one that conforms to Italian grammar rules. By contrast, acceptability may consider aspects that can only be inferred by a native speaker, such as the cohesion or naturalness of the utterance.

Therefore, an utterance may be perfectly valid from a structural point of view but not semantically comprehensible. As an example, the utterance “Major League Baseball ha preso 76 dei suoi pipistrelli e li ha radiografati per il sughero .” (“Major League Baseball has taken 76 of its bats and X-rayed them for corkage .”) is grammatical because all the constituents are in the right place and no structural constraint is violated. Still, no native speaker would perceive it as meaningful. In this case, the error lies in the translation of specialized terms related to the sports domain. In particular, the term “bat” is ambiguous because it can refer either to the object (the wooden club used in baseball to hit the ball) or to the animal (as in the incorrect translation “pipistrello”).

This second assessment concerning the two criteria has been carried out by considering a sample of 1000 instances extracted from the dataset δ, with 200 utterances for each readability class reported in Table 16. The extraction has not been performed entirely at random but has been guided by nonprobability sampling, which is more suitable for qualitative data. The assessment has involved three human native speakers who were asked to manually and independently label the sample by specifying, for each utterance, both its grammaticality and its acceptability.

The overall agreement between the three raters concerning their annotations of grammaticality and acceptability has been measured using the Observed Agreement index [62]. This index gives a good approximation of the annotators’ agreement in contexts with many annotators, also offering robustness against imperfect (textual) data [63]. The index calculates the number of utterances with majority agreement and reports that number as a percentage of the total number of utterances evaluated by the annotators. Grammaticality and acceptability have been assessed using a forced-choice binary task [64], following most of the linguistic methodology in this area [65].
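As described, the index reduces to the share of items on which a strict majority of raters gave the same label; a minimal sketch:

```python
from collections import Counter

def observed_agreement(annotations):
    """Observed Agreement: the fraction of items on which a strict
    majority of raters agree. `annotations` holds one list of labels
    (one per rater) per item."""
    majority = sum(
        1 for labels in annotations
        if Counter(labels).most_common(1)[0][1] > len(labels) / 2
    )
    return majority / len(annotations)
```

With three raters and binary labels, an item counts toward agreement whenever at least two raters coincide.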

Table 17 shows the percentage of agreement between annotators for each readability class.

Table 17 Annotator agreement for different readability classes

The total agreement value has been measured as equal to 0.78 for grammaticality and 0.73 for acceptability. According to the grid for the interpretation of the coefficients proposed by [66], the values obtained indicate “substantial agreement” for both criteria.

Human judgements seem to be consistent with the readability scores: a higher readability corresponds to a better agreement and thus a lower presence of ill-formed utterances. The agreement among the raters regarding grammaticality increases progressively (from 0.77 to 0.80) across the readability classes. This phenomenon can be explained by the fact that readability is essentially based on the utterance structure, i.e., syntax, which is also the object of the grammaticality judgement.

The situation is not different as regards acceptability. First, an unsurprising slight worsening of the scores in the lower classes has been observed; lower agreement between annotators is quite common, especially in semantic tasks [67]. However, moving on to the classes containing the most readable utterances, the values are comparable to those for grammaticality. The utterances considered most readable are also those that create the least disagreement among the annotators, with the highest percentage of acceptable utterances.

In summary, the grammaticality and acceptability assessment has shown that the proposed methodology can generate utterances that respect syntactic well-formedness and are perceived as natural by native speakers, with a good level of agreement. The linguistic refinement rules help reduce phenomena that could violate grammatical constraints (as in the case of rewrite rules) or undermine the perceived naturalness of the sentence (as in the case of null subjects).

5.4 Linguistic and qualitative assessment

As a further evaluation, two additional aspects have been considered. First, factors other than readability and the annotators’ judgements have been analyzed.

Utterances in the sample have been analyzed at different levels of linguistic analysis, including lexical, morphological and syntactic features. The considered factors range from lexical richness to sentence complexity, the presence of subordinate clauses, and the vocabulary used. They are summarized in Table 18. The values in Table 18 show that syntactic complexity undergoes a progressive simplification from the class comprising the least readable sentences (< 20) to the most readable ones (> 80). Sentences become shorter (the average length drops from 12.9 tokens to 7.9), and subordinating conjunctions are halved in favor of an increase in coordinating ones.
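Features of this kind (average sentence length, subordinating vs. coordinating conjunctions) can be profiled with a sketch like the one below. The mini word lists and the pre-tokenized input are illustrative assumptions only; a real analysis would rely on a POS tagger and a full conjunction lexicon.

```python
# Hypothetical mini word lists; a real study would use a POS tagger
# and an exhaustive lexicon of Italian conjunctions.
SUBORDINATING = {"che", "perché", "quando", "se", "mentre"}
COORDINATING = {"e", "ma", "o", "però"}

def profile(sentences):
    """sentences: list of pre-tokenized utterances (lists of lowercase tokens)."""
    n = len(sentences)
    tokens = sum(len(s) for s in sentences)
    sub = sum(t in SUBORDINATING for s in sentences for t in s)
    coord = sum(t in COORDINATING for s in sentences for t in s)
    return {
        "avg_len": tokens / n,          # mean utterance length in tokens
        "sub_per_sent": sub / n,        # subordinating conjunctions per utterance
        "coord_per_sent": coord / n,    # coordinating conjunctions per utterance
    }

# toy tokenized sample
sample = [
    ["il", "governo", "pensa", "che", "i", "radicali", "sono", "ospiti"],
    ["poi", "i", "seguaci", "tornarono", "a", "casa"],
]
print(profile(sample))  # avg_len 7.0, one subordinator ("che"), no coordinators
```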

Table 18 Different features affecting the readability on the sample considered

Second, this trend of syntactic and lexical simplification has been qualitatively inspected by examining some examples extracted from the dataset.

Table 19 collects a set of utterances for each readability class. The table is structured to visualize all possible combinations of raters’ judgements. The first column is dedicated to the different readability classes; the second shows the id of each utterance. Two further columns indicate whether the utterance has been evaluated as grammatical (G) or acceptable (A) by the human raters. The examples in Table 19 show that utterances in the less readable classes tend toward hypotaxis, with various types of subordinate clauses, whereas the highly readable classes prefer elementary one-verb sentences. This outcome occurs in both well-formed and ill-formed utterances.

Table 19 Visual examples of sentences for each readability class and raters’ judgements

For instance, the utterance with Id = 1d, “Il governo degli Stati Uniti pensa che i radicali, commentatori antiamericani e religiosi sono diventati ospiti frequenti all’emittente televisiva al Jazeera.” (“The US government believes that radical, anti-American and religious commentators have become frequent guests at the Al Jazeera television station.”) has a readability score lower than 20, so it is challenging to read, but it is perceived as grammatical and acceptable by the raters, even though it contains a subordinate clause introduced by “che” (“that”) and a long-distance dependency between the plural masculine noun “commentatori” (“commentators”) and the noun acting as subject predicate, “ospiti” (“guests”).

A similar syntactic structure is provided by the utterance in the class 20–40 with Id = 2a, “Quindi Michelle le autorità davvero credere che questo testimone per quanto riguarda la discarica credibile, non essi?” (“So Michelle, do the authorities really to think this witness regarding the landfill [is] credible, not they?”), which is full of errors that make it ungrammatical and difficult for a native speaker to understand. In detail, the verb appears in its infinitive form “credere” (“to think”) and is not inflected in agreement with the subject noun “autorità” (“authorities”). Moreover, there is no verb connected to the subject complement “credibile” (“credible”), and the noun phrase “non essi” (“not they”) at the end of the utterance is completely disconnected from the syntactic structure.

In the higher readability classes, subordinate clauses are reduced and coordination prevails in the syntactic structure. However, this syntactic simplification does not necessarily correspond to greater comprehensibility. As mentioned above, readability tests evaluate the complexity of the lexicon and the structure of the utterance. Shorter utterances with no subordinates are not always what human raters consider semantically meaningful or grammatically correct.

For instance, the utterances with id = 5c, “Poi i seguaci tornò a casa” (“Then the followers has gone home”), and id = 5d, “La grazia di Dio sia con te” (“God’s grace be with you”) have a similar one-verb structure without any type of syntactic or lexical complexity. However, utterance 5c contains a grammatical infelicity, with incorrect agreement between the 3rd person singular verb “tornò” (“has gone”) and the plural subject noun “seguaci” (“followers”).

In summary, the automatically obtained readability scores have proven consistent with the raters’ judgments, allowing sentences to be grouped into classes that are in line with grammaticality and acceptability. It should be noted, however, that numerous other linguistic variables affect readability independently of the metrics used; investigating them is outside the scope of this work.

5.5 Effectiveness assessment as training dataset

The last evaluation assesses whether the generated dataset can be used to train a deep learning model for CR in Italian. To this aim, a baseline model has been trained on the dataset by adopting a state-of-the-art deep learning architecture proposed for the same task in English. In detail, the coreference model proposed by [44] has been used,Footnote 3 exploiting BERT in its base (cased) version.Footnote 4

This choice is justified by the fact that this model has proven to be effective in the CR task in English, as shown in [68,69,70].

To the best of our knowledge, no other available implementation exists for the particular CR task in Italian.

In detail, the BERT architecture is characterized by 12 encoder layers, known as Transformer Blocks, 12 attention heads (Self-Attention, as introduced in [71]), and feedforward networks with a hidden size of 768. Each training session has been fixed at 24 epochs, with a learning rate varying from 0.1 to 0.00001. More architectural details and training hyperparameters are reported in Table 20. All experiments have been performed on a deep learning workstation with 40 Intel(R) Xeon(R) E5-2630 v4 CPUs @ 2.20 GHz, 256 GB of RAM and 4 GeForce GTX 1080 Ti GPUs, running Ubuntu Linux 16.04.7 LTS. Using the training split of the created dataset, the results have been obtained by averaging the performance of the coreference model over five repetitions and reporting the arithmetic mean, rounded to the second decimal place. Table 21 reports the results obtained with three different metrics: MUC [72], B3 [73] and CEAFϕ4 [74].
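The averaging over repeated runs can be sketched as below; the scores are made up for illustration (they are not those of Table 21), and the helper simply takes the per-metric arithmetic mean rounded to two decimal places.

```python
def average_runs(runs, ndigits=2):
    """Arithmetic mean of each metric over repeated training runs."""
    return {
        metric: round(sum(run[metric] for run in runs) / len(runs), ndigits)
        for metric in runs[0]
    }

# hypothetical F1 scores from three repetitions of the same training
runs = [
    {"MUC": 74.0, "B3": 68.0, "CEAF": 66.0},
    {"MUC": 73.0, "B3": 67.0, "CEAF": 67.0},
    {"MUC": 75.0, "B3": 69.0, "CEAF": 65.0},
]
print(average_runs(runs))  # {'MUC': 74.0, 'B3': 68.0, 'CEAF': 66.0}
```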

Table 20 Hyper-parameters
Table 21 Results achieved with a BERT-based CR model

MUC provides a good measure of the interpretability achieved by the model, indicating the quality of the predicted mentions and of the coreference links among them. However, MUC lacks discriminability, i.e., the capability to distinguish between good and bad decisions. On the contrary, B3 and CEAFϕ4 lack interpretability but measure discriminability. Since none of the metrics is reliable when taken individually, it is common practice to use the average of the three as the overall metric.
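As an illustration of how the link-based MUC metric works, the sketch below computes MUC precision, recall and F1 from gold and predicted mention clusters; the mention identifiers and clusters are toy assumptions. B3 and CEAFϕ4 would require their own mention- and entity-based scorers.

```python
def muc(key, response):
    """MUC score between gold (`key`) and predicted (`response`) clusterings.

    Each clustering is a list of sets of mention identifiers. The recall
    numerator for a gold chain is |chain| minus the number of partitions
    the response splits it into (unaligned mentions count as singletons);
    precision is the same computation with the roles swapped.
    """
    def score(gold, pred):
        num = den = 0
        for chain in gold:
            parts = set()
            singletons = 0
            for mention in chain:
                idx = next((i for i, c in enumerate(pred) if mention in c), None)
                if idx is None:
                    singletons += 1  # missing mention forms its own partition
                else:
                    parts.add(idx)
            num += len(chain) - (len(parts) + singletons)
            den += len(chain) - 1
        return num / den if den else 0.0

    recall = score(key, response)
    precision = score(response, key)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy example: one gold chain split in two by the prediction
key = [{"m1", "m2", "m3"}, {"m4", "m5"}]
response = [{"m1", "m2"}, {"m3", "m4", "m5"}]
p, r, f1 = muc(key, response)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.67 0.67 0.67
```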

As shown in Table 21, MUC has achieved the best precision and recall. CEAFϕ4, instead, has the lowest scores, especially on recall (about 59.25). B3 provides scores quite similar to those obtained with CEAFϕ4. On average, the model has achieved an F1 of about 69.60, which is comparable with the averaged F1 obtained by the same model on the English version of the OntoNotes dataset (about 73.9).

As an example, sentences extracted from the dataset and shown in Table 22 present cases of correct mention predictions and wrong ones. Predictions are indicated in bold, while mentions to which the predictions refer are shown in small caps.

Table 22 Examples of correct and wrong predictions (bold) with respect to mentions (small caps)

Concerning the analysis of the typology of errors, in the first sentence, “[Essi] hanno scritto oggi” (“They wrote today”), the correctly predicted mention occurs as a pronoun in the English text and has been shifted onto the verb in the Italian one due to the drop of the subject pronoun “Essi” (“They”). The second example presents a linear subject–verb–object sentence with an explicit subject. In this case, the proper noun acting as subject is embedded in a prepositional phrase, “L’ex avvocato di Clinton” (“Clinton’s former lawyer”), and it is correctly predicted. Moving to the incorrectly recognized predictions, it can be noted that more complex syntax affects the predictions. For instance, in the first example (first sentence of the wrong-prediction row), the utterance contains a dative construction with the clitic pronoun “Ci” (literally “us”) preceding the mention “riferivamo” (“were referring”), and an enclitic form merged with the verb as the suffix -lo for the coreference “farlo” (“to do that”). Finally, in the last example, BERT fails the assignment when the mention occurs as an indirect object introduced by a preposition, “a questo” (“about this”).

Despite special cases such as those described above (clitics, convoluted syntax), these results show the effectiveness of the proposed methodology, which provides a new dataset for CR in Italian and sets a baseline for future developments of this line of research.

6 Conclusions and future work

This work has presented a methodology for creating a dataset for CR in Italian starting from a resource originally designed for English. The approach can guarantee a quality comparable to manual annotation while reducing the time and effort required. Starting from OntoNotes, the methodology is articulated in two macro-steps.

The first macro-step focuses on the generation of a corpus in the target language. This step first extracts from OntoNotes the information of interest, such as documents, partitions, utterances, and mentions, while discarding irrelevant information and mentions whose tokens are contained in other mentions. Then, utterances and mentions are translated through an intelligent token replacement/resolution procedure guided by the estimation of the typology, gender and number of the real-world entities referred to by each mention. The second macro-step focuses on linguistic refinement. It first tries to correct the infelicities introduced by the translation on aspects of the Italian language not present in English (i.e., gender and number agreement). Then, it attempts to make the translated utterances sound more natural to a native speaker (null subject).

The well-formedness and naturalness of the generated dataset have been confirmed by means of a quantitative and qualitative assessment, which evaluated readability on all the utterances of the final dataset, and grammaticality and acceptability on a sample of 1000 utterances extracted from five different readability classes and judged by three human native speakers. A correlation between the readability score and the raters’ judgements has also been highlighted, with poorly readable utterances showing the highest disagreement among human raters for both grammaticality and acceptability. The goodness of the dataset has also been assessed by training a CR model based on BERT, achieving promising results and thus fixing a reference point in terms of performance for future comparisons.

It is worth noting that, in this work, English has been considered as the source language and Italian as the target one, due to the high number of existing resources for the former and the limited number for the latter. However, the methodology is not strictly dependent on these two languages and can easily be applied to other languages by adapting only a small set of linguistic rules.

From a methodological perspective, even if the quality of the final dataset is appreciable, it leaves room for future improvements. First, a more extensive list of refinement rules covering other linguistic phenomena of the Italian language will be considered to enhance the naturalness of the translated utterances. Second, utterances with more complex syntactic structures will be handled to improve readability, grammaticality and acceptability. From an applicative perspective, the dataset will be used to train novel and better-performing models for the task of CR in Italian.