
1 Introduction

Coreference resolution is essential for the automatic interpretation of text. It has been studied mainly from a linguistic perspective, with an emphasis on the recognition of potential antecedents for pronouns. Many practical nlp applications, such as information extraction (ie) and question answering (qa), require accurate identification of coreference relations between noun phrases in general. In this chapter we report on the development and evaluation of an automatic system for the robust resolution of referential relations in text. Computational systems that assign such relations automatically require a sufficient amount of annotated data for training and testing. Therefore, we annotated a Dutch corpus of 100K words with coreferential relations and developed guidelines for the manual annotation of coreference relations in Dutch.

We evaluated the automatic coreference resolution module in two ways. On the one hand we used the standard internal approach to evaluate a coreference resolution system by comparing the predictions of the system to a hand-annotated gold standard test set. On the other hand we performed an application-oriented evaluation of our system by testing the usefulness of coreference relation information in an nlp application. We ran experiments with a relation extraction module for the medical domain, and measured the performance of this module with and without the coreference relation information. In a separate experiment we also evaluated the effect of coreference information produced by another simple rule-based coreference module in a question answering application.

The chapter is structured as follows. We first summarise related work in Sect. 7.2. We present the corpus that is manually annotated with coreference relations in Sect. 7.3.1. Section 7.3.2 details the automatic coreference resolution system and in Sect. 7.4 we show the results of both the internal and the application-oriented evaluation. We conclude in Sect. 7.5.

2 Related Work

In the last decade considerable effort has been put into annotating corpora with coreferential relations. For English, many different data sets with annotated coreferential relations are available, such as the muc-6 [8] and muc-7 [23] data sets, ace-2 [7], the gnome corpus [26], arrau [27], and more recently, OntoNotes 3.0 [39]. Data sets also exist for other languages: for German, the TBa-D/Z coreference corpus [12] and the Potsdam corpus [19]; for Czech, the Prague Dependency Treebank (pdt 2.0) [20]; for Catalan, AnCora-CO [29]; for Italian, i-cab [22] and the Live Memories Corpus [31]; and for Danish, English, German, Italian, and Spanish, the Copenhagen Dependency Treebank [18]. Most of these corpora follow their own annotation scheme. In SemEval-2010, Task 1 Coreference Resolution in Multiple Languages was devoted to multi-lingual coreference resolution for the languages Catalan, Dutch, English, German, Italian and Spanish [30]. The CoNLL 2011 and 2012 shared tasks are also dedicated to automatic coreference resolution.

For Dutch, besides the corea corpus described in Sect. 7.3.1, there is currently also a data set of written new media texts such as blogs [9], developed in the DuOMan project described in Chap. 20, page 359, and a substantial part (one million words) of the SoNaR corpus [32] (Chap. 13, page 219) is also annotated with coreference. All these data sets have been annotated according to the corea annotation guidelines. For the Dutch language we can now count on a large and rich data set that is suitable both for more theoretical linguistic studies of referring expressions and for practical development and evaluation of coreference resolution systems. By covering a variety of text genres, the assembled data set can even be considered a unique resource for cross-genre research.

Currently there are not many coreference resolution systems available for Dutch. The first full-fledged system was presented by Hoste [14, 15] in 2005 and this is the predecessor of the system described in Sect. 7.3.2. More recently, two of the participating systems in the SemEval-2010 Task 1 on multi-lingual coreference resolution were evaluated for all six languages including Dutch. The UBIU system [40] was a robust language-independent system that used a memory-based learning approach with syntactic and string matching features. The SUCRE system [17] obtained the overall best results for this SemEval task and used a more flexible and rich feature construction method and a relational database in combination with machine learning.

3 Material and Methods

3.1 Corpus and Annotation

The corea corpus is composed of texts from the following sources:

  • Dutch newspaper articles gathered in the dcoi project Footnote 1

  • Transcribed spoken language material from the Spoken Dutch Corpus (cgn) Footnote 2

  • Lemmas from the Spectrum (Winkler Prins) medical encyclopedia as gathered in the imix rolaquad project Footnote 3

  • Articles from knack [16], a Flemish weekly news magazine.

All material from the first three sources was annotated in the corea project. The material from knack was already annotated with coreference relations in a previous project (cf. [14]). Note that the corpus covers a number of different genres (speech transcripts, news, medical text) and contains both Dutch and Flemish sources. The latter is particularly relevant as the use of pronouns differs between Dutch and Flemish [36].

The size of the various subcorpora and the number of annotated coreference relations are given in Table 7.1.

Table 7.1 Corpus statistics for the coreference corpora developed and used in the corea project. Ident, bridge, pred and bound refer to the number of annotated identity, bridging, predicative, and bound variable type coreference relations respectively

For the annotation of coreference relations we developed a set of annotation guidelines [3] largely based on the muc-6 [8] and muc-7 [23] annotation scheme for English. Annotation focuses primarily on coreference or identity relations between noun phrases, where both noun phrases refer to the same extra-linguistic entity. These multiple references to the same entity can be regarded as a coreferential chain of references. While these form the majority of coreference relations in our corpus, there are also a number of special cases. A bound relation exists between an anaphor and a quantified antecedent, as in Everybody_i did what they_i could. A bridge relation is used to annotate part-whole or set-subset relations, as in the tournament_i … the quarter finals_i. We also marked predicative (pred) relations, as in Michiel Beute_i is a writer_i. Strictly speaking, these are not coreference relations, but we annotated them for a practical reason. Such relations express extra information about the referent that can be useful for example for a question answering application. We used several attributes to indicate situations where a coreference relation is in the scope of negation, is modified or time dependent, or refers to a meta-linguistic aspect of the antecedent.

Annotation was done using the mmax2 tool. Footnote 4 For the dcoi and cgn material, manually corrected syntactic dependency structures were available. Following the approach of [12], we used these to simplify the annotation task by creating an initial set of markables beforehand. Labeling was done by several linguists.

To estimate the inter-annotator agreement for this task, 29 documents from cgn and dcoi were annotated independently by two annotators, who marked 517 and 470 coreference relations, respectively. For the ident relation, we computed inter-annotator agreement as the F-measure of the MUC-scores [38] obtained by taking one annotation as ‘gold standard’ and the other as ‘system output’. For the other relations, we computed inter-annotator agreement as the average of the percentage of anaphor-antecedent relations in the gold standard for which an anaphor-antecedent′ pair exists in the system output, and where antecedent and antecedent′ belong to the same cluster (w.r.t. the ident relation) in the gold standard. Inter-annotator agreement is 76 % F-score for ident, 33 % for bridging and 56 % for pred. There was no agreement on the three bound relations marked by each annotator. The agreement score for ident is comparable to, though slightly lower than, those reported for comparable tasks for English and German [13, 37]. Poesio and Vieira [28] report 59 % agreement on annotating ‘associative coreferent’ definite noun phrases, a relation comparable to our bridge relation.
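
The agreement computation for ident can be illustrated with a minimal sketch of the link-based MUC measure, treating one annotation as the key and the other as the response. The function names and the toy chains are illustrative assumptions, not the project's evaluation code, and the sketch assumes that the chains within one annotation do not share mentions.

```python
from typing import List, Set


def muc_recall(key: List[Set[str]], response: List[Set[str]]) -> float:
    """Link-based recall: sum(|K| - |parts of K|) / sum(|K| - 1) over key chains K."""
    numerator = 0
    denominator = 0
    for chain in key:
        if len(chain) < 2:
            continue  # a singleton contains no links
        # Partition the key chain by the response chains. A mention that is
        # absent from the response forms a part of its own.
        touched = set()
        unmatched = 0
        for mention in chain:
            index = next((i for i, r in enumerate(response) if mention in r), None)
            if index is None:
                unmatched += 1
            else:
                touched.add(index)
        numerator += len(chain) - (len(touched) + unmatched)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_f_score(key: List[Set[str]], response: List[Set[str]]) -> float:
    """F-measure; precision is recall with the roles of key and response swapped."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: annotator A marks one chain, annotator B splits it in two.
annotator_a = [{"m1", "m2", "m3", "m4"}]
annotator_b = [{"m1", "m2"}, {"m3", "m4"}]
agreement = muc_f_score(annotator_a, annotator_b)
```

Because precision and recall simply swap when the two annotations swap roles, the resulting F-measure does not depend on which annotator is treated as the gold standard.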

The main sources of disagreement were cases where one of the annotators failed to annotate a relation, cases where there was confusion between pred or bridge and ident, and various omissions in the guidelines (e.g. whether to consider headlines and other leading material in newspaper articles as part of the text to be annotated).

3.2 Automatic Resolution System

We developed an automatic coreference resolution tool for Dutch [14] that follows the pairwise classification method of potential anaphor-antecedent pairs, similar to the approach of Soon et al. [33]. As the supervised machine learning method we decided to use memory-based learning. We used the Timbl software package (version 5.1) [4], which implements several memory-based learning algorithms.

As we used a supervised machine learning approach to coreference resolution, the first step was to train the classifier on examples of the task at hand: texts with manually annotated coreference relations. These manually annotated texts needed to be transformed into training instances for the machine learning classifier. First the raw texts were preprocessed to determine the noun phrases in the text and to gather grammatical, positional, and semantic information about these noun phrases. This preprocessing step involved a cascade of nlp steps such as tokenisation, part-of-speech tagging, text chunking, named entity recognition and grammatical relation finding, as detailed in Sect. 7.3.3.

On the basis of the preprocessed texts, training instances were created. We considered each noun phrase (and pronoun) in the text as a potential anaphor for which we needed to find its antecedent. We processed each text backward, starting with the last noun phrase and pairing it with each preceding noun phrase, with a restriction of 20 sentences backwards. Each pair of two noun phrases was regarded as a training instance for the classifier. If a pair of two noun phrases belonged to the same manually annotated coreferential chain, it got a positive label; all other pairs got a negative label. For each pair a feature vector was created to describe the noun phrases and their relation (detailed in Sect. 7.3.4). Test instances were generated in the same manner. In total, 242 documents from the knack material were used as training material for the coreference resolution system.
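
The instance generation step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Mention` record, the function name and the toy document are hypothetical, and the 20-sentence restriction is read here as a sentence distance of at most 20.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mention:
    text: str
    sentence: int          # index of the sentence containing the noun phrase
    chain: Optional[int]   # gold chain id, or None if not annotated as coreferent


def make_instances(
    mentions: List[Mention], max_sentence_distance: int = 20
) -> List[Tuple[Mention, Mention, bool]]:
    """Pair every noun phrase with each preceding one inside the sentence window.

    Returns (candidate antecedent, anaphor, label) triples. The label is True
    when both noun phrases belong to the same annotated chain.
    """
    instances = []
    for j in range(len(mentions) - 1, 0, -1):       # start with the last noun phrase
        anaphor = mentions[j]
        for i in range(j - 1, -1, -1):              # walk backwards through the text
            candidate = mentions[i]
            if anaphor.sentence - candidate.sentence > max_sentence_distance:
                break  # mentions are in document order, so the rest is further away
            same_chain = anaphor.chain is not None and anaphor.chain == candidate.chain
            instances.append((candidate, anaphor, same_chain))
    return instances


# Toy document (hypothetical): the last mention is too far away to be paired.
document = [
    Mention("Beatrix", 0, 1),
    Mention("het paleis", 0, None),
    Mention("zij", 1, 1),
    Mention("Beatrix", 30, 1),
]
pairs = make_instances(document)
```

In this toy document only three pairs are generated, one of them positive, which already hints at the skew towards negative instances that pairwise approaches have to deal with.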

The output from the machine learning classifier was a set of positively classified instances. Instead of selecting one single antecedent per anaphor (as is done for example in [25, 33]), we tried to build complete coreference chains for the texts and reconstruct these on the basis of the positive instances. As we paired each noun phrase with every previous noun phrase, multiple pairs could be classified as positive. For example, suppose we have a text about Queen Beatrix in which her name is mentioned five times, and the last sentence contains the pronoun “she” referring to Beatrix. The text then contains a coreferential chain of six elements that all refer to the same entity Beatrix. If we create pairs with this pronoun and all previous noun phrases in the text, we will have five positive instances each encoding the same information: “she” refers to Beatrix. For the last mention of the name Beatrix, there are four previous mentions that also refer to Beatrix, leading to four positive instances. In total there are 5 + 4 + 3 + 2 + 1 = 15 positive instances for this chain, while we need a minimum of five pairs to reconstruct the coreferential chain. Therefore we needed a second step to construct the coreferential chains by grouping and merging the positively classified instances that cover the same noun phrases. We grouped pairs together and computed their union. When the overlap was larger than 0.1 we merged the chains together (we refer to [10] for more details on the merging procedure).
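
The grouping and merging step can be illustrated with a small sketch. The text does not define how overlap is measured, so the sketch assumes the ratio of shared to combined mentions (intersection over union); the actual procedure is the one described in [10]. All names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def overlap(a: Set[str], b: Set[str]) -> float:
    """Assumed overlap measure: shared mentions divided by combined mentions."""
    return len(a & b) / len(a | b)


def merge_chains(
    positive_pairs: Iterable[Tuple[str, str]], threshold: float = 0.1
) -> List[Set[str]]:
    """Start from one group per positive pair and merge until nothing overlaps."""
    chains = [set(pair) for pair in positive_pairs]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(chains)), 2):
            if overlap(chains[i], chains[j]) > threshold:
                chains[i] |= chains[j]   # take the union of the two groups
                del chains[j]
                merged = True
                break                    # indices changed, so restart the scan
    return chains


# The Beatrix example: five name mentions plus the pronoun give 15 positive
# pairs when every pair is classified correctly. One unrelated pair is added.
beatrix = ["Beatrix_1", "Beatrix_2", "Beatrix_3", "Beatrix_4", "Beatrix_5", "she"]
positive = list(combinations(beatrix, 2)) + [("the palace", "it")]
chains = merge_chains(positive)
```

With this input the 15 redundant pairs collapse into a single chain of six mentions, and the unrelated pair stays separate.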

3.3 Preprocessing

The following preprocessing steps were performed on the raw texts: First, tokenisation was performed by a rule-based system using regular expressions. Dutch named entity recognition was performed by looking up the entities in lists of location names, person names, organisation names and other miscellaneous named entities. We applied a part-of-speech tagger and text chunker for Dutch that used the memory-based tagger mbt [5], trained on the Spoken Dutch Corpus. Footnote 5 Finally, grammatical relation finding was performed, using a shallow parser to determine the grammatical relation between noun chunks and verbal chunks, e.g. subject, object, etc. The relation finder [34] was trained on the previously mentioned Spoken Dutch Corpus. It offered a fine-grained set of grammatical relations, such as modifiers, verbal complements, heads, direct objects, subjects, predicative complements, indirect objects, reflexive objects, etc. We used the predicted chunk tags to determine the noun phrases in each text, and the information created in the preprocessing phase was coded as feature vectors for the classification step.

3.4 Features

For each pair of noun phrases we constructed a feature vector representing their properties and their relation [14]. For each potential anaphor and antecedent we listed their individual lexical and syntactic properties. In particular, for each potential anaphor/antecedent, we encoded the following information, mostly in binary features:

  • Yes/no pronoun, yes/no reflexive pronoun, type of pronoun (first/second/third person or neutral),

  • Yes/no demonstrative,

  • Type of noun phrase (definite or indefinite),

  • Yes/no proper name,

  • Yes/no part of a named entity,

  • Yes/no subject, object, etc., of the sentence as predicted by the shallow parser.

For the anaphor we also encoded its local context in the sentence as a window in words and PoS-tags of three words left and right of the anaphor. We represented the relation between the two noun phrases with the following features:

  • The distance between the antecedent and anaphor, measured in noun phrases and sentences;

  • Agreement in number and in gender between both;

  • Are both of them proper names, or is one a pronoun and the other a proper name;

  • Is there a complete string overlap, a partial overlap, an overlap of the head words or is one an abbreviation of the other.
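
The string-based relation features in the last item can be sketched as below. The sketch makes simplifying assumptions that are not taken from the system itself: noun phrases are non-empty whitespace-tokenised strings, the head word is approximated by the last token, and an abbreviation is approximated by the initials of the other noun phrase.

```python
from typing import Dict, List


def initials(tokens: List[str]) -> str:
    """First letters of all tokens, used as a crude abbreviation test."""
    return "".join(token[0] for token in tokens)


def string_overlap_features(anaphor: str, antecedent: str) -> Dict[str, bool]:
    """Binary string-matching features for a pair of noun phrases."""
    a = anaphor.lower().split()
    b = antecedent.lower().split()
    return {
        "complete_match": a == b,
        "partial_match": bool(set(a) & set(b)),
        # Assumption: the last token stands in for the syntactic head.
        "head_match": a[-1] == b[-1],
        "abbreviation": (
            (len(a) == 1 and len(b) > 1 and a[0] == initials(b))
            or (len(b) == 1 and len(a) > 1 and b[0] == initials(a))
        ),
    }


features_abbrev = string_overlap_features("CGN", "Corpus Gesproken Nederlands")
features_head = string_overlap_features("de koningin", "koningin")
```

A real implementation would take the head word from the chunker or parser output rather than from token position.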

One particularly interesting feature that we explored was the use of semantic clusters [35]. These clusters were extracted with unsupervised k-means clustering on the Twente Nieuws Corpus. Footnote 6 The corpus was first preprocessed by the Alpino parser [1] to extract syntactic relations. The top-10,000 lemmatised nouns (including names) were clustered into 1,000 groups based on the similarity of their syntactic relations. Here are four examples of the generated clusters:

[Figure 1]

For each pair of referents we constructed three features as follows. For each referent, the lemma of the head word was looked up in the list of clusters, and the number of the matching cluster, or 0 in case of no match, was used as the feature value. This gave two features representing the cluster number of each referent. The third was a binary feature marking whether the head words of the two referents occur in the same cluster or not.
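
A minimal sketch of these three cluster features is given below. The cluster table and its numbers are invented for illustration, and the sketch assumes that two head words without a cluster (both 0) do not count as being in the same cluster, a detail the text leaves open.

```python
from typing import Dict, Tuple


def cluster_features(
    anaphor_head: str, antecedent_head: str, cluster_of: Dict[str, int]
) -> Tuple[int, int, bool]:
    """Return (anaphor cluster, antecedent cluster, same-cluster flag).

    Cluster number 0 means that the head lemma was not found in any cluster.
    """
    anaphor_cluster = cluster_of.get(anaphor_head, 0)
    antecedent_cluster = cluster_of.get(antecedent_head, 0)
    # Assumption: two unknown head words are not treated as a cluster match.
    same_cluster = anaphor_cluster != 0 and anaphor_cluster == antecedent_cluster
    return anaphor_cluster, antecedent_cluster, same_cluster


# Hypothetical cluster table mapping head lemmas to cluster numbers.
toy_clusters = {"arts": 17, "dokter": 17, "voetbal": 42}

match = cluster_features("arts", "dokter", toy_clusters)
no_match = cluster_features("arts", "fiets", toy_clusters)
both_unknown = cluster_features("fiets", "auto", toy_clusters)
```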

In the first version of the coreference resolution system we coded syntactic information as predicted by the memory-based shallow parser in our feature set of 47 features [14]. In the corea project we also investigated whether the richer syntactic information of a full parser would be a helpful information source for our task [11]. We used the Alpino parser [1], a broad-coverage dependency parser for Dutch, to generate 11 additional features encoding the following information:

  • Named Entity label as produced by the Alpino parser, one for the anaphor and one for the antecedent.

  • Number agreement between the anaphor and antecedent, presented as a four valued feature (values: sg, pl, both, measurable_nouns).

  • Dependency labels as predicted for (the head word of) the anaphor and for the antecedent and whether they share the same dependency label.

  • Dependency path between the governing verb and the anaphor, and between the verb and antecedent.

  • Clause information stating whether the anaphor or antecedent is part of the main clause or not.

  • Root overlap encodes the overlap between ‘roots’ or lemmas of the anaphor and antecedent. In the Alpino parser, the root of a noun phrase is the form without inflections. Special cases were compounds and names. Compounds are split Footnote 7 and we used the last element in the comparison. For names we took the complete strings.

In total, each feature vector consisted of 59 features. In the next section we describe how we selected an optimal feature set for the classification and the results of the automatic coreference resolution experiments with and without these deeper syntactic features.

4 Evaluation

We performed both a direct evaluation and an external, application-oriented evaluation. In the direct evaluation we measured the performance of the coreference resolution system on a gold-standard test set annotated manually with coreference information. In the application-oriented evaluation we tried to estimate the usefulness of the automatically predicted coreference relations for nlp applications.

4.1 Direct Evaluation

Genetic algorithms (ga) have been proposed [6] as a useful method to find an optimal setting in the enormous search space of possible parameter and feature set combinations. We ran experiments with a generational genetic algorithm for feature set and algorithm parameter selection of Timbl with 30 generations and a population size of 10.

In this experiment we used ten-fold cross validation on 242 texts from Knack. Because running the ga is rather time-consuming, it was run only on the first of the ten folds. The optimal setting found there was then used for the other folds as well. We computed a baseline score for the evaluation of the complete coreference chains. The baseline assigned each noun phrase in the test set its nearest noun phrase as antecedent.
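
The baseline can be sketched as follows, under the assumption that the nearest noun phrase is the immediately preceding one and that chains are formed by following the links transitively. The function names and mention identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def nearest_np_baseline(mentions: List[str]) -> List[Tuple[str, str]]:
    """Link every noun phrase to the one directly before it."""
    return [(mentions[i - 1], mentions[i]) for i in range(1, len(mentions))]


def chains_from_links(links: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group linked mentions into chains by merging every chain a link touches."""
    chains: List[Set[str]] = []
    for antecedent, anaphor in links:
        touching = [c for c in chains if antecedent in c or anaphor in c]
        merged = {antecedent, anaphor}.union(*touching)
        chains = [c for c in chains if c not in touching] + [merged]
    return chains


# Mention identifiers must be unique, hence the position suffix.
mentions = ["Beatrix#0", "het paleis#1", "zij#2", "de koningin#3"]
baseline_links = nearest_np_baseline(mentions)
baseline_chains = chains_from_links(baseline_links)
```

Under these assumptions all noun phrases of a document end up in one chain, so every gold link is covered while many spurious links are introduced.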

The results are shown in Table 7.2. Timbl scores well above the baseline in terms of F-score, but the baseline has a much higher recall. The differences in F-score at the instance level between the models without and with syntactic features are small, but when we look at the score computed at the chain level, we see an improvement of 3 % in F-score. Adding the additional features from the Alpino parser improves overall F-score by increasing the recall at the cost of precision.

Table 7.2 Micro-averaged F-scores at the instance level and muc F-scores at the chain level computed in ten-fold cross validation experiments. Timbl is run with the settings as selected by the genetic algorithm (GA) without and with the additional Alpino features

4.2 Application-Oriented Evaluation

Below, we present the results of two studies that illustrate that automatic coreference resolution can have a positive effect on the performance of systems for information extraction and question answering.

4.2.1 Coreference Resolution for Information Extraction

To validate the effect of the coreference resolution system in a practical information extraction application, our industrial partner in this project, Language and Computing NV, constructed an information extraction module named Relation Finder which can predict medical semantic relations. This application was based on a version of the Spectrum medical encyclopedia (MedEnc) developed in the imix rolaquad project, in which sentences and noun phrases were annotated with domain specific semantic tags [21]. These semantic tags denote medical concepts or, at the sentence level, express relations between concepts. Example 7.1 shows two sentences from MedEnc annotated with semantic XML tags. Examples of the concept tags are con_disease, con_person_feature or con_treatment. Examples of the relation tags assigned to sentences are rel_is_symptom_of and rel_treats.

[Figure 2]

The core of the Relation Finder was a maximum entropy modeling algorithm trained on approximately 2,000 annotated entries of MedEnc. Each entry was a description of a particular item such as a disease or body part in the encyclopedia and contained on average ten sentences. The Relation Finder was tested on two separate test sets of 50 and 500 entries respectively. Our coreference module predicted coreference relations for the noun phrases in the data. We ran two experiments with the Relation Finder. In the first experiment we used the predicted coreference relations as features and in the second one we did not use these features. On the small data set we obtained an F-score of 53.03 % without coreference and 53.51 % with coreference information. On the test set with 500 entries we got a slightly better score of 59.15 % F-score without and 59.60 % with coreference information. So for both test sets we observe a modest positive effect for the experiments using the coreference information.

4.2.2 Coreference Resolution for Question Answering

The question answering system for Dutch described in [2] used information extraction to extract answers to frequent questions off-line (i.e. the system tried to find all instances of the capital relation in the complete text collection off-line, to answer questions of the form What is the capital of LOCATION?). Tables with relation tuples were computed automatically for relations such as age of a person, location and date of birth, founder of an organisation, function of a person, number of inhabitants, winner of a prize, etc.

Using manually developed patterns, the precision of extracted relation instances is generally quite high, but coverage tends to be limited. One reason for this is the fact that relation instances are only extracted between entities (i.e. names, dates, and numbers). Sentences of the form The village has 10,000 inhabitants do not contain a ⟨location, number_of_inhabitants⟩ pair. If we can resolve the antecedent of the village, however, we can extract a relation instance.
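
The effect of coreference resolution on such a pattern can be sketched as follows. The pattern, the function name and the place name are hypothetical and stand in for the manually developed patterns of the actual system, which operated on Dutch text.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical English pattern for the number-of-inhabitants relation.
PATTERN = re.compile(r"^(?P<subject>.+?) has (?P<number>[\d,]+) inhabitants")


def extract_inhabitants(
    sentence: str, antecedents: Optional[Dict[str, str]] = None
) -> Optional[Tuple[str, int]]:
    """Return a (location, number_of_inhabitants) pair, or None if none is found.

    `antecedents` maps an anaphoric subject to the name it was resolved to.
    """
    match = PATTERN.match(sentence)
    if match is None:
        return None
    subject = match.group("subject")
    number = int(match.group("number").replace(",", ""))
    is_anaphoric = subject.lower() == "it" or subject.lower().startswith("the ")
    if is_anaphoric:
        # Not a name: only usable when coreference resolution supplies one.
        resolved = (antecedents or {}).get(subject)
        if resolved is None:
            return None
        subject = resolved
    return subject, number


without_coref = extract_inhabitants("The village has 10,000 inhabitants")
with_coref = extract_inhabitants(
    "The village has 10,000 inhabitants", {"The village": "Dorpstad"}
)
```

Without an antecedent the sentence yields no relation instance; with the (fictional) antecedent Dorpstad it yields one, which is the gain in coverage that the experiment below measures.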

To evaluate the effect of coreference resolution for this task, [24] extended the information extraction component of the QA system with a simple rule-based coreference resolution system for pronouns. To resolve definite noun phrases, it used an automatically constructed knowledge base containing 1.3M class labels for named entities.

Table 7.3 shows that, after adding coreference resolution, the total number of extracted facts went up by over 50 % (from 93K to 145K). However, the accuracy of the newly added facts was only 40 % for cases involving pronoun resolution and 33 % for cases involving definite nps.

Table 7.3 Number of relation instances, precision, and number of unique instances (facts) extracted using the baseline system, and using coreference resolution

In spite of the limited accuracy of the newly extracted facts, we noticed that incorporation of the additional facts led to an increase in performance on the questions from the qa@clef 2005 test set of 5 percentage points (from 65 to 70 %). We expect that even further improvements are possible by integrating the coreference resolution system described in Sect. 7.3.2.

5 Conclusion

Coreference resolution is useful in text mining tasks such as information extraction and question answering. Using coreference resolution, more useful information can be extracted from text, and that has a positive effect on the recall of such systems. However, it is not easy to show the same convincingly in application-oriented evaluations. The reason for this is that the current state-of-the-art in coreference resolution, based on supervised machine learning, is still weak, especially in languages like Dutch for which not a lot of training data is available. More corpora are needed, annotated with coreference relations.

We presented the main outcomes of the stevin corea project, which was aimed at addressing this corpus annotation bottleneck. In this project, we annotated a balanced corpus with coreferential relations, trained a system on it, and carried out both a direct and an application-oriented evaluation.

We discussed the corpus, the annotation and the inter-annotator agreement, and described the construction and evaluation of a coreference resolution module trained on this corpus in terms of the preprocessing and the features used.

We evaluated this coreference resolution module in two ways: with standard cross-validation experiments to compare the predictions of the system to a hand-annotated gold standard test set, and with a more practically oriented evaluation to test the usefulness of coreference relation information in information extraction and question answering. In both cases we observed a small but real positive effect of integrating coreference information, despite the relatively low accuracy of current systems. More accurate coreference resolution systems should increase the magnitude of the positive effect. These systems will need additional semantic and world knowledge features. We showed the positive effect of richer syntactic features as generated by the Alpino parser, and of semantic features by means of the semantic cluster features we tested.

The annotated data, the annotation guidelines, and a web demo version of the coreference resolution system are available to all and are distributed by the Dutch tsthlt Agency. Footnote 8