1 Introduction to Coreference Resolution

Coreference resolution, the task of determining the mentions in a text, dialogue or utterance that refer to the same discourse entity, has been at the core of Natural Language Understanding since the 1960s. Owing in large part to publicly available annotated corpora, such as those from the Message Understanding Conferences (MUC) (Grishman and Sundheim 1996), Automatic Content Extraction (ACE) (Doddington et al. 2004), and OntoNotes, significant progress has been made in the development of corpus-based approaches to coreference resolution. Coreference information has been shown to be useful in tasks such as question answering (Hartrumpf et al. 2008), summarisation (Bergler et al. 2003), machine translation (Miculicich Werlen and Popescu-Belis 2017), and information extraction (Zelenko et al. 2004).

Figure 1 shows a text consisting of three sentences and illustrates two entities and the mentions referring to them: Prof. Hayes, Hayes, he (shaded in yellow) and I, me, Eric (shaded in blue). The purpose of a coreference resolution system is to identify such chains of words and phrases referring to the same entity, often starting with (proper) noun phrases and their referring pronouns.

The curation of digital information has, in recent years, emerged as a fundamental area of activity for the group of professionals often referred to as knowledge workers. These knowledge workers are tasked with conducting research in a particular domain within a very limited time frame. The output of their work is used by newspaper agencies to create articles, by museums to construct new exhibitions on a specific topic, and by TV stations to generate news items. Owing to the diversity of tasks and domains they have to work in, knowledge workers face the challenge of exploring potentially large multimedia document collections and quickly grasping key concepts and important events in the domain they are working in. To help them, we can automate some processes in digital curation, such as the identification of named entities and events. This is the primary use case for our paper, as coreference resolution plays a significant role both in disambiguation and in capturing a larger number of entity and event mentions. For example, as seen in Fig. 1, after linking He and Hayes with Prof. Hayes, the knowledge worker has more information to work with.

Fig. 1. Example of coreference occurrence in English text. Source: Mendelsohn letters dataset (Bienert and de Wit 2014).

While many coreference systems exist for English (Raghunathan et al. 2010; Kummerfeld and Klein 2013; Clark and Manning 2015, 2016), a freely available, competitive tool for German is still missing. In this paper, we describe our forays into developing a German coreference resolution system. We attempt to adapt the Stanford CoreNLP (Manning et al. 2014) Deterministic (rule-based) Coreference Resolution approach (Raghunathan et al. 2010; Lee et al. 2013) as well as the Stanford CoreNLP Mention Ranking (statistical) model (Clark and Manning 2015) to German. We also experiment with a projection-based implementation, i.e., using Machine Translation and English coreference models to achieve German coreference resolution.

The main goals of this paper are:

  • To evaluate pre-existing English and German coreference resolution systems

  • To investigate the effectiveness of performing coreference resolution on a variety of out-of-domain texts in both English and German (outlined in Sect. 4) from digital curation scenarios.

After a brief overview of previous approaches to coreference resolution in English and German (Sect. 2), we describe implementations of three approaches to German coreference resolution (Sect. 3): the deterministic sieve-based approach, a machine learning-based system, and an English-German crosslingual projection-based system. This is followed by a discussion of applications of coreference (Sect. 4) and concluding notes on the current state of our coreference resolution systems for digital curation scenarios (Sect. 5).

2 Summary of Approaches to Coreference Resolution

A number of paradigms (rule-based, knowledge-rich, supervised and unsupervised learning) have been applied in the design of coreference resolution systems for several languages, operating on whole documents, i.e., linking all mentions or references of an entity within an entire document. While several works give a comprehensive overview of such approaches (Zheng et al. 2011; Stede 2011), we focus on coreference resolution for German and English and summarise some of the relevant systems.

There have been several attempts at performing coreference resolution for German documents and building associated systems. CorZu (Tuggener 2016) is an incremental entity-mention system for German, which addresses issues such as the underspecification of mentions prevalent in certain German pronouns. While it is freely available under the GNU General Public License, it depends on external software and their respective data formats, such as a dependency parser, tagger, and morphological analyser, making it difficult to reimplement.

BART, the Beautiful/Baltimore Anaphora Resolution Toolkit (Versley et al. 2008), is a modular toolkit for coreference resolution which combines several preprocessing and syntactic features and casts coreference as a machine learning problem. While it is available for download as well as a Web Service, it has external dependencies such as the Charniak Reranking Parser. It has been successfully tested on the German TüBa-D/Z treebank (Telljohann et al. 2004).

Definite noun matching cannot be fully solved via string matching in the domain of newspaper articles: only approximately 50% of definite coreferent Noun Phrases (NPs) can be resolved using head string matching (Versley 2010). Versley (2010) therefore also used hypernym look-up and various other features to reach an F-score of 73% for definite anaphoric NPs. Broscheit et al. (2010) report an F1 score of 80.2 on version 4 of the TüBa-D/Z coreference corpus using BART.

The goal of the SemEval 2010 Shared Task 1 (Recasens et al. 2010) was to evaluate and compare automatic coreference resolution systems for six different languages, among them German, in four evaluation settings and using four different metrics. The training set contained 331,614 tokens taken from the TüBa-D/Z data set (Telljohann et al. 2004). Only two of the four competing systems achieved F-scores over 40%, one of them being the BART system mentioned above. We use the same dataset and evaluation data to train our statistical system in Sect. 3.2.

Departing from the norm of building mention pairs, one system implemented a mention-entity approach and produced an F-score of 61.49% (Klenner et al. 2010).

The HotCoref system for German (Roesiger and Riester 2015) focused on the role of prosody for coreference resolution and used the DIRNDL corpus (Björkelund et al. 2014) for evaluation, achieving F-scores of 53.63% on TüBa-D/Z (version 9) and 60.35% on the SemEval shared task data.

Another system (Krug et al. 2015) adapted the Stanford sieve approach (Lee et al. 2013) for coreference resolution in the domain of German historic novels and evaluated it against a hand-annotated corpus of 48 novel fragments with approximately 19,000 character references in total, achieving an F1 score of 85.5. We also adapt the Stanford sieve approach in Sect. 3, with the aim of developing an open-domain German coreference resolution system.

For English coreference resolution, we employ the Stanford CoreNLP implementations. There is a large body of work on coreference resolution in English. While the sieve-based approach (Raghunathan et al. 2010) is a prime example of rule-based coreference resolution, other approaches such as the Mention Ranking model (Clark and Manning 2015) and the Neural model (Clark and Manning 2016) have been shown to outperform it.

3 Three Implementations

In this section, we describe the three models of coreference resolution that we implemented:

  • Rule-based (Multi-Sieve Approach): English, German

  • Statistical (Mention Ranking Model): English, German

  • Projection-based (Crosslingual): coreference for German using English models.

3.1 Rule-Based Approach

For the English version, we employ the deterministic, multi-pass, sieve-based (open-source) Stanford CoreNLP system (Manning et al. 2014). For the German version, we develop an in-house system which we name CoRefGer-rule.

The Stanford Sieve approach is based on the idea of an annotation pipeline with coreference resolution being one of the last steps. The processing steps include sentence splitting, tokenisation, constituency and dependency parsing, and extraction of morphological data. In our system CoRefGer-rule, we also perform Named Entity Recognition.

The Stanford sieve approach starts with all noun phrases and pronominal phrases in the whole document and then decides how to cluster them together, so that all noun phrases referring to the same extratextual entity end up in the same coreference chain. The sieves can be described as a succession of independent coreference models: each sieve selects candidate mentions and merges them. The number of sieves can differ depending on the task. Seven sieves are proposed for an English coreference system (Raghunathan et al. 2010), while eleven sieves are implemented for the task of finding coreference in German historic novels (Krug et al. 2015). We have currently implemented six of the seven sieves from the English system and will include additional ones in future versions of the system.
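To make the sieve cascade concrete, the following minimal Python sketch illustrates the general idea. The function names, the two toy sieves and the greedy merging loop are our own illustration and simplify both the Stanford CoreNLP implementation and CoRefGer-rule considerably:

```python
# Illustrative sketch of a multi-pass sieve cascade (not the exact Stanford
# CoreNLP or CoRefGer-rule implementation). Each sieve inspects pairs of
# clusters and merges those it considers coreferent; later, lower-precision
# sieves operate on the clusters produced by earlier, higher-precision ones.

def exact_match_sieve(a, b):
    """Merge clusters that share an identical (lower-cased) mention string."""
    return bool({m.lower() for m in a} & {m.lower() for m in b})

def head_match_sieve(a, b):
    """Merge clusters whose mentions share a head word (very naive)."""
    return bool({m.split()[-1].lower() for m in a}
                & {m.split()[-1].lower() for m in b})

SIEVES = [exact_match_sieve, head_match_sieve]   # ordered high -> low precision

def resolve(mentions):
    clusters = [[m] for m in mentions]           # start with singleton clusters
    for sieve in SIEVES:
        merged = True
        while merged:                            # keep merging until stable
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if sieve(clusters[i], clusters[j]):
                        clusters[i].extend(clusters.pop(j))
                        merged = True
                        break
                if merged:
                    break
    return clusters

print(resolve(["Prof. Hayes", "the letter", "Hayes", "Eric"]))
# [['Prof. Hayes', 'Hayes'], ['the letter'], ['Eric']]
```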

Sieve 1: Exact Match. For the exact match, noun phrases are extracted from the parse tree. Then, within a sliding window of five sentences, all noun phrases are compared to each other. If two noun phrases match exactly, this leads to the creation of a new coreference chain. We use stemming so that minimally different word endings and differences in the article are taken into account (Table 1).

Table 1. Example of an exact match

We also account for variations in endings such as “des Landesverbandes” and “des Landesverbands der AWO” or “der Hund” and “des Hundes”, and between definite and indefinite articles such as “einen Labrador” and “der Labrador”.
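The sketch below illustrates how such a relaxed exact match can be implemented under simplifying assumptions; the article list and the crude case-ending heuristic are our own illustration and are far weaker than proper morphological analysis:

```python
# Sketch of the relaxed exact-match test described above: ignore German
# articles and strip a few naive case endings so that "der Hund"/"des Hundes"
# or "einen Labrador"/"der Labrador" still match. The lists below are
# illustrative simplifications, not the actual CoRefGer-rule resources.

GERMAN_ARTICLES = {"der", "die", "das", "des", "dem", "den",
                   "ein", "eine", "einen", "einem", "einer", "eines"}
CASE_ENDINGS = ("es", "en", "s", "n", "e")       # checked in this order

def naive_stem(token):
    token = token.lower()
    for ending in CASE_ENDINGS:
        if token.endswith(ending) and len(token) - len(ending) >= 4:
            return token[: -len(ending)]
    return token

def normalise(phrase):
    tokens = [t for t in phrase.lower().split() if t not in GERMAN_ARTICLES]
    return tuple(naive_stem(t) for t in tokens)

def relaxed_exact_match(np1, np2):
    return normalise(np1) == normalise(np2)

assert relaxed_exact_match("der Hund", "des Hundes")
assert relaxed_exact_match("einen Labrador", "der Labrador")
```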

Sieve 2: Precise Constructs. This sieve implements precise constructs such as appositives, predicate nominatives, role appositives and relative pronouns. Due to the different tree tags in German parses, a direct application of the Stanford CoreNLP algorithms was not possible. Missing acronym and demonym lists for German also posed a challenge in completing this sieve. We therefore translated the corresponding English lists into German and used them in our approach.

Sieves 3, 4, 5: Noun Phrase Head Matching. The noun phrase head matching we use differs from the one proposed in Raghunathan et al. (2010). They observe that naive matching of the heads of noun phrases creates too many spurious links, because it ignores incompatible modifiers, as in “Yale University” and “Harvard University”. These two noun phrases would be marked as coreferent, because the head of both is “University”, although the modifiers make it clear that they refer to different entities. For this reason, a number of additional constraints are proposed. In order to utilise them, we implement a coreference chain building mechanism. For example, there is a notion of succession of words when chaining them together, so we cannot naively match head nouns or noun phrases against the antecedent cluster.

We also employ stemming in noun phrases, so that we match entities such as “AWO-Landesverbands” with “Landesverband”, or “Geschäftsführer” with “Geschäftsführers”.

We also implement the sieves called “Variants of head matching” and “relaxed head matching”, which require more sophisticated coreference chaining.
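A minimal sketch of head matching with a word-inclusion constraint in the spirit of Raghunathan et al. (2010) is shown below; the stopword list and the choice of the last token as head word are simplifying assumptions of our own:

```python
# Sketch of cluster-level head matching with a word-inclusion constraint:
# the head of the candidate mention must match a head in the antecedent
# cluster, and all non-stopword words of the candidate must already occur in
# that cluster. This blocks spurious links such as "Yale University" vs.
# "Harvard University" while still linking "the university" to "Yale University".

STOPWORDS = {"the", "a", "an", "of"}

def head(np):
    return np.split()[-1].lower()                # crude head = last token

def content_words(np):
    return {t.lower() for t in np.split()} - STOPWORDS

def cluster_head_match(candidate, antecedent_cluster):
    heads = {head(np) for np in antecedent_cluster}
    cluster_words = set().union(*(content_words(np) for np in antecedent_cluster))
    return head(candidate) in heads and content_words(candidate) <= cluster_words

cluster = ["Yale University", "the university"]
print(cluster_head_match("the university", cluster))      # True
print(cluster_head_match("Harvard University", cluster))  # False ("harvard" unseen)
```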

Sieve 6: Integration of Named Entity Recognition. We use an in-house Named Entity Recognition engine based on DBpedia Spotlight, which is also applied in the current version of the coreference resolution system (for example, to deal with the above-mentioned “Yale University” vs. “Harvard University” issue).

German-Specific Processing. Our implemented sieves include naive stemming, which means that words that vary in a few letters at the end are still considered as matching, accounting for the different case markers in German. The same holds for definite and indefinite articles, which are specific to German: noun phrases are considered as matching even though they have different articles.

An important component that is not yet implemented but plays a big role in coreference resolution is morphological processing for acquiring gender and number information. This component would make it possible to perform pronoun matching beyond our current method of merely matching identical pronouns.
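As a purely hypothetical illustration of what such an agreement check could look like once a morphological analyser is integrated (this component is not part of the current CoRefGer-rule), consider the following sketch:

```python
# Hypothetical sketch of a gender/number agreement filter for pronoun
# matching. The feature dictionaries are assumed to come from a morphological
# analyser, which CoRefGer-rule does not yet interface with.

def agrees(pronoun_feats, antecedent_feats):
    """Return True unless gender or number are both known and incompatible."""
    for attr in ("gender", "number"):
        p, a = pronoun_feats.get(attr), antecedent_feats.get(attr)
        if p is not None and a is not None and p != a:
            return False
    return True

# "er" (masc/sg) may corefer with "der Hund" (masc/sg) but not "die Katze" (fem/sg)
print(agrees({"gender": "masc", "number": "sg"},
             {"gender": "masc", "number": "sg"}))   # True
print(agrees({"gender": "masc", "number": "sg"},
             {"gender": "fem", "number": "sg"}))    # False
```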

3.2 Statistical Approach

In parallel to the rule-based approach (CoRefGer-rule), we also adapted the Stanford CoreNLP statistical system based on the Mention Ranking model (Clark and Manning 2015). We trained our coreference system on TüBa-D/Z (Telljohann et al. 2004) and evaluated it on the same dataset as SemEval 2010 Task 1 (Recasens et al. 2010). We name this system CoRefGer-stat. It uses a number of features, described below (a simplified extraction sketch follows the list):

  • Distance features: the distance between the two mentions in a sentence, number of mentions

  • Syntactic features: number of embedded NPs under a mention, Part-Of-Speech tags of the first, last, and head word (based on the German parsing models included in the Stanford CoreNLP, Rafferty and Manning 2008)

  • Semantic features: named entity type, speaker identification

  • Lexical Features: the first, last, and head word of the current mention.
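The sketch below illustrates, in simplified form, how such pairwise features could be extracted; the Mention class and the feature names are our own illustration, not the exact Stanford CoreNLP feature set:

```python
# Simplified sketch of pairwise feature extraction for a mention-ranking
# model. A mention is assumed to carry its tokens, POS tags, head word,
# sentence index and named-entity type; feature names are illustrative.

from dataclasses import dataclass

@dataclass
class Mention:
    tokens: list
    pos: list
    head: str
    sentence: int
    ner_type: str = "O"

def pairwise_features(anaphor, antecedent, mentions_between):
    return {
        # distance features
        "sentence_distance": anaphor.sentence - antecedent.sentence,
        "mentions_between": mentions_between,
        # syntactic and lexical features
        "anaphor_first_word": anaphor.tokens[0].lower(),
        "anaphor_last_word": anaphor.tokens[-1].lower(),
        "anaphor_head": anaphor.head.lower(),
        "antecedent_head": antecedent.head.lower(),
        "anaphor_head_pos": anaphor.pos[-1],
        # semantic features
        "ner_type_match": anaphor.ner_type == antecedent.ner_type,
    }

m1 = Mention(["Prof.", "Hayes"], ["NE", "NE"], "Hayes", 0, "PER")
m2 = Mention(["he"], ["PPER"], "he", 1)
print(pairwise_features(m2, m1, mentions_between=1))
```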

While the machine learning approach provides robustness and saves the time needed to construct sieves, its applicability is limited to the news domain, i.e., the domain of the training data.

3.3 Projection-Based Approach

In this section we outline the projection-based approach to coreference resolution. This approach is usually implemented in a low-resource language scenario, i.e., when sufficient language resources and training data are not available. Developing a coreference resolution system for a new language is cumbersome due to the variability of coreference phenomena across languages as well as the limited availability of high-quality language technologies (mention extraction, syntactic parsing, named entity recognition).

Crosslingual projection is a mechanism that allows transferring existing methods and resources from one language (e.g., English) to another (e.g., German). While crosslingual projection has demonstrated considerable success in NLP applications such as POS tagging and syntactic parsing, it has been less successful in coreference resolution, performing with 30% less precision than monolingual variants (Grishina and Stede 2017).

The projection-based approach can be implemented in one of the following two ways:

  • Transferring models: computing coreference on English text and projecting these annotations onto parallel German text via word alignments in order to obtain a German coreference model

  • Transferring data: translating German text into English, computing coreference on the translated English text using an English coreference model, and then projecting the annotations back onto the original text via word alignment.

The “Transferring data” approach involves less overhead, because no coreference models have to be trained for the new language, and it proved to be more effective. We have therefore used this approach in our experiments and name the resulting system CoRefGer-proj; its projection step is sketched below.
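The core projection step of the “Transferring data” variant can be sketched as follows, assuming word alignments between the English translation and the original German sentence are available (e.g., from an aligner such as fast_align); the span representation is a simplification of our own:

```python
# Sketch of projecting English mention spans back onto German tokens via a
# word alignment. Spans are (start, end) token indices with an exclusive end;
# the alignment is a list of (english_index, german_index) pairs.

def project_mention(en_span, alignment):
    """Map an English mention span to the covering German span (or None)."""
    de_indices = [de for en, de in alignment if en_span[0] <= en < en_span[1]]
    if not de_indices:
        return None
    return (min(de_indices), max(de_indices) + 1)

# English: "Prof. Hayes wrote the letter ."
# German:  "Prof. Hayes schrieb den Brief ."
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
english_chain = [(0, 2), (3, 5)]                 # "Prof. Hayes", "the letter"
german_chain = [project_mention(span, alignment) for span in english_chain]
print(german_chain)                              # [(0, 2), (3, 5)]
```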

4 Evaluation and Case Studies

We are interested in applying reliable and robust coreference resolution for both English and German to a variety of domains from digital curation scenarios, such as digital archives, newspaper reports, and museum exhibits (Bourgonje et al. 2016; Rehm and Sasaki 2016):

  • Mendelsohn Letters Dataset (German and English): The collection (Bienert and de Wit 2014) contains 2,796 letters, written between 1910 and 1953, with a total of 1,002,742 words on more than 11,000 sheets of paper; 1,410 of the letters were written by Erich and 1,328 by Luise Mendelsohn. Most are in German (2,481); the rest are written in English (312) and French (3).

  • Research excerpts for a museum exhibition (English): This is a document collection retrieved from online archives: Wikipedia, archive.org, and Project Gutenberg. It contains documents related to Vikings; the content of this collection has been used to plan and conceptualise a museum in Denmark.

  • Regional news stories (German): This consists of a general domain regional news collection in German. It contains 1,037 news articles, written between 2013 and 2015.

The statistics for these corpora and the standard benchmark sets are summarised in Table 2. Robustness can only be achieved if we limit the scope and coverage of the approach, i.e., if we keep the coreference resolution systems simple and actually implementable. In our use cases, a few correctly identified mentions are better than hundreds of false positives.

Table 2. Summary of curation datasets.

Evaluation is done on several standard benchmark datasets: CoNLL 2012 for English and SemEval 2010 for German. Our goal is to determine the best coreference system for English and German texts from digital curation scenarios (out-of-domain).

Table 3 shows the results of the evaluation on the CoNLL 2012 (Pradhan et al. 2012) English dataset. Two evaluation measures are employed: MUC (Vilain et al. 1995) and B-cubed (Bagga and Baldwin 1998). The MUC metric compares links between mentions in the key chains to links in the response chains. The B-cubed metric evaluates each mention in the key by mapping it to one of the mentions in the response and then measuring the overlap between the corresponding key and response chains.
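For convenience, the two measures can be stated as follows (standard formulations; $K_i$ denotes a key chain, $p(K_i)$ the number of partitions of $K_i$ induced by the response chains, $K_m$ and $R_m$ the key and response chains containing mention $m$, and $N$ the total number of mentions):

```latex
% MUC (link-based); precision is obtained by swapping key and response.
\mathrm{Recall}_{\mathrm{MUC}} = \frac{\sum_i \bigl(|K_i| - p(K_i)\bigr)}{\sum_i \bigl(|K_i| - 1\bigr)}

% B-cubed (mention-based); F1 is the harmonic mean of precision and recall.
\mathrm{Precision}_{B^3} = \frac{1}{N}\sum_m \frac{|K_m \cap R_m|}{|R_m|},
\qquad
\mathrm{Recall}_{B^3} = \frac{1}{N}\sum_m \frac{|K_m \cap R_m|}{|K_m|}
```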

Table 3. Summary of the evaluation of two coreference resolution systems on the English CoNLL 2012 task across two F1 evaluation measures.

Note that these scores are lower than those reported in the CoNLL shared task, because the systems employed here depend on the taggers, parsers and named entity recognizers of Stanford CoreNLP rather than on the gold-standard annotations used in the shared task. While we do not have a gold standard for our digital curation use cases, a manual evaluation of a small subset of documents shows the sieve-based approach to perform slightly better than the state-of-the-art statistical and neural models, most likely owing to the out-of-domain setting.

For German, we experimented with different settings of the in-house CoRefGer-rule system. In Table 4, we report the performance of our Multi-Sieve (rule-based) approach on a 5,000-word subset of the TüBa-D/Z corpus (Telljohann et al. 2004) using the following settings of modules:

  • Setting 1: Whole System with all 6 sieves in place

  • Setting 2: Contains all mentions but no coreference links

  • Setting 3: Setting 1 minus the module that deletes any cluster that does not contain a single mention recognized as an entity

  • Setting 4: Setting 1, but with the cluster-deletion module executed only after all sieves have been applied.

Setting 2 assumes that our system obtained all the correct mentions and therefore tests the effectiveness of the coreference linking module only. However, this setting will not work in real-life scenarios such as the digital curation use cases unless hand-annotated corpora are available.

Table 4. Module-based evaluation of the German sieve-based coreference system (CoRefGer-rule) in different configurations across two F1 evaluation measures.

In Table 5, we present German coreference resolution results on the test set of SemEval 2010 Task 1. We compare the performance of the three systems developed in this paper (CoRefGer-rule, CoRefGer-stat, CoRefGer-proj) to one other system, CorZu (Tuggener 2016). Since the sieve-based approach currently lacks a morphological component, it underperforms. An error analysis of the statistical and projection-based systems reveals that several features were not sufficiently discriminative for the German models. We believe that completing the remaining sieve will also help us train better syntactic and semantic features for the statistical system.

Table 5. Summary of the evaluation of coreference resolution systems on the German SemEval 2010 task across two F1 evaluation measures.

While there is no gold standard for any of our datasets from the digital curation use cases, we nevertheless applied our English and German coreference resolution systems to them, as shown in Table 6. The sieve-based systems tend to give the best results (shown in the table), while the statistical, neural and projection-based systems yield nearly 10% fewer entity mentions. We leave a deeper investigation of this for future work, though we believe that interfacing with lexical resources such as WordNet may help ameliorate the out-of-domain issues.

Table 6. Summary of the percentage of mentions (based on the total number of words) on the curation datasets for which we do not have a gold standard.

4.1 Add-On Value of Coreference Resolution to Digital Curation Scenarios

Consider the following passage:

“Then came Ray Brock for dinner. On him I will elaborate after my return or as soon as a solution pops up on my “Klappenschrank”. Naturally, he sends his love to Esther and his respects to you.”

A model or dictionary can only spot “Ray Brock”, but “him”, “he” and “his” also refer to this entity. With the aid of coreference resolution, we can increase the recall of named entity recognition as well as potentially expand the range of event detection.

The algorithm for coreference-enabled NLP technologies is as follows (a minimal sketch follows the list):

  • Input a text document, and run coreference resolution on it

  • With the aid of the above, replace all occurrences of pronouns with the actual noun in full form, such that “he” and “his” are replaced with “Ray Brock” and “Ray Brock’s” respectively

  • Run an NLP process such as Named Entity Recognition on the new document and compare it with a run without the coreference-based substitutions.
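A minimal sketch of the substitution step is shown below; it assumes the resolver returns chains of token spans whose first mention is the most informative (representative) form, and the possessive handling is a naive heuristic of our own:

```python
# Sketch of pronoun substitution prior to re-running Named Entity Recognition.
# tokens: list of word tokens; chains: list of coreference chains, each a list
# of (start, end) token spans whose first span is the representative mention.

POSSESSIVE_PRONOUNS = {"his", "her", "its", "their"}
PRONOUNS = {"he", "she", "it", "they", "him", "them"} | POSSESSIVE_PRONOUNS

def substitute_pronouns(tokens, chains):
    out = list(tokens)
    for chain in chains:
        representative = " ".join(tokens[chain[0][0]:chain[0][1]])
        for start, end in chain[1:]:
            mention = " ".join(tokens[start:end]).lower()
            if mention in POSSESSIVE_PRONOUNS:
                out[start:end] = [representative + "'s"] + [""] * (end - start - 1)
            elif mention in PRONOUNS:
                out[start:end] = [representative] + [""] * (end - start - 1)
    return " ".join(t for t in out if t)

tokens = "Then came Ray Brock . Naturally , he sends his love .".split()
chains = [[(2, 4), (7, 8), (9, 10)]]             # Ray Brock <- he, his
print(substitute_pronouns(tokens, chains))
# Then came Ray Brock . Naturally , Ray Brock sends Ray Brock's love .
```

Running an off-the-shelf recogniser on the substituted text and comparing it against a run on the original text yields the coverage difference reported below.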

A preliminary run of the above algorithm shows a marked improvement in the number of entities identified (27% more coverage) by an off-the-shelf Named Entity Recogniser.

5 Conclusions and Future Work

We have performed coreference resolution in both English and German on a variety of text types and described several competing approaches (rule-based, statistical, projection-based).

Number and gender information is one of the core features that any coreference system uses. A major deficiency of our German rule-based system described in Sect. 3.1 is the lack of interfacing with a morphological analyser, which we leave for future work.

An interesting sieve that could also be adapted from the work on German coreference resolution in historic novels (Krug et al. 2015) is the semantic pass, in which synonyms taken from GermaNet are also taken into account for matching. Krug et al. (2015) also handle speaker resolution and pronoun resolution in direct speech, which make up their tenth and eleventh sieves, using handcrafted lexico-syntactic patterns. Whether these patterns are specific to their domain or can also be successfully applied to other domains is a point for further research.

Overall, we were able to annotate our multi-domain datasets with coreference resolution. We will be investigating how much these annotations help knowledge workers in their curation use cases.

In conclusion, we have determined that deterministic rule-based systems, although not state-of-the-art, are the better choice for our out-of-domain use cases.