Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers

Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. Although NE processing tools are increasingly being used in the context of historical documents, performance values are lower than those obtained on contemporary data and are hardly comparable. In this context, this paper introduces the CLEF 2020 Evaluation Lab HIPE (Identifying Historical People, Places and other Entities) on named entity recognition and linking on diachronic historical newspaper material in French, German and English. Our objective is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents in order to support scholarship on digital cultural heritage collections.


Introduction
Recognition and identification of real-world entities is at the core of virtually any text mining application. As a matter of fact, referential units such as names of persons, locations and organizations underlie the semantics of texts and guide their interpretation. Since the seminal Message Understanding Conference (MUC) evaluation cycle in the 1990s [11], named entity-related tasks have evolved considerably, from entity recognition and classification to entity disambiguation and linking [21,25]. Besides the general domain of well-written newswire data, named entity (NE) processing is also applied to specific domains, particularly bio-medical [10,14], and to noisier inputs such as speech transcriptions [9] and tweets [26].
Recently, two main trends characterise developments in NE processing. First, at the technical level, the adoption of deep learning architectures and the usage of embedded language representations have greatly reshaped the field and opened up new research directions [1,16,17]. Second, with respect to application domain and language spectrum, NE processing has been called upon to contribute to the field of Digital Humanities (DH), where massive digitization of historical documents is producing huge amounts of text [30]. Thanks to large-scale digitization projects driven by cultural institutions, millions of images are being acquired and, when it comes to text, their content is transcribed, either manually via dedicated interfaces, or automatically via Optical Character Recognition (OCR). Beyond this great achievement in terms of document preservation and accessibility, the next crucial step is to adapt and develop appropriate language technologies to search and retrieve the contents of this 'Big Data from the Past' [13]. In this regard, information extraction techniques, and particularly NE recognition and linking, can certainly be regarded as among the first steps.
This paper introduces the CLEF 2020 Evaluation Lab HIPE (Identifying Historical People, Places and other Entities). With the aim of supporting the development and progress of NE systems on historical documents (Sect. 2), this lab proposes two tasks, namely named entity recognition and linking, on historical newspapers in French, German and English (Sect. 3). We additionally report first results on French historical newspapers (Sect. 4), which support the idea that such a lab can benefit both the NLP and DH communities.

Motivation and Objectives
NE processing tools are increasingly being used in the context of historical documents. Research activities in this domain target texts of different nature (e.g. museum records, state-related documents, genealogical data, historical newspapers) and different tasks (NE recognition and classification, entity linking, or both). Experiments involve different time periods, focus on different domains, and use different typologies. This great diversity demonstrates how many and varied the needs, and the challenges, are, but also makes performance comparison difficult, if not impossible.
Furthermore, as for language technologies in general [29], it appears that the application of NE processing to historical texts poses new challenges [7,23]. First, inputs can be extremely noisy, with errors which do not resemble tweet misspellings or speech transcription hesitations, for which adapted approaches have already been devised [5,27]. Second, the language under study mostly belongs to earlier stage(s), which renders the usual external and internal evidence less effective (e.g., the usage of different naming conventions and the presence of historical spelling variations) [2,3]. Further, besides historical VIPs, texts from the past contain rare entities which have undergone significant changes (esp. locations) or no longer exist, and for which adequate linguistic resources and knowledge bases are missing [12]. Finally, archives and texts from the past are not as anglophone as today's information society, making multilingual resources and processing capacities even more essential [22].
Overall, and as demonstrated by Vilain et al. [31], the transfer of NE tools from one domain to another is not straightforward, and the performance of NE tools initially developed for homogeneous texts of the immediate past is affected when they are applied to historical material. This echoes the proposition of Plank [24], according to whom what is considered as standard data (i.e. contemporary news genre) is more a historical coincidence than a reality: in NLP, non-canonical, heterogeneous, biased and noisy data is the norm rather than the exception.
Even though many evaluation campaigns on NE were organized over the last decades, only one considered French historical texts [8]. To the best of our knowledge, no NE evaluation campaign has ever addressed multilingual, diachronic historical material. In the context of new needs and materials emerging from the humanities, we believe that an evaluation campaign on historical documents is timely and will be beneficial. In addition to the release of a multilingual, historical NE-annotated corpus, the objective of this shared task is threefold: strengthening the robustness of existing approaches on non-standard inputs; enabling performance comparison of NE processing on historical texts; and fostering efficient semantic indexing of historical documents.

Task Description
The HIPE shared task puts forward 2 NE processing tasks with sub-tasks of increasing level of difficulty. Participants can submit up to 3 runs per sub-task.

Task 1: Named Entity Recognition and Classification (NERC)
Subtask 1.1 - NERC Coarse-Grained: this task includes the recognition and classification of entity mentions according to high-level entity types (Person, Location, Organisation, Product and Date).
Subtask 1.2 - NERC Fine-Grained: this task includes the classification of mentions according to finer-grained entity types, nested entities (up to one level of depth) and the detection of entity mention components (e.g. function, title, name).
Task 2: Named Entity Linking (EL). This task requires the linking of named entity mentions to a unique referent in a knowledge base (a frozen dump of Wikidata) or to a NIL node if the mention does not have a referent.

Data Sets
Corpus. The HIPE corpus is composed of items from the digitized archives of several Swiss, Luxembourgish and American newspapers on a diachronic basis. For each language, articles from 4 different newspapers were sampled on a decade time-bucket basis, according to the time span of the newspaper (the longest duration spans ca. 200 years). More precisely, articles were first randomly sampled from each year of the considered decades, with the constraints of having a title and more than 100 characters. After this sampling, a manual triage was applied in order to keep journalistic content only and to remove undesirable items such as feuilletons, crosswords, weather tables, time schedules, obituaries, and items that even a human could not read because of OCR noise.
Alongside each article, metadata (journal, date, title, page number, image region coordinates), the corresponding scan(s) and an OCR quality assessment score are provided. Different OCR versions of the same texts are not provided, and the OCR quality of the corpus therefore corresponds to a real-life setting, with variations according to digitization time and preservation state of the original documents.
For each task and language, with the exception of English, the corpus is divided into training, dev and test data sets, released in IOB format with hierarchical information. For English, only dev and test sets will be released.
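The automatic part of this sampling procedure can be sketched as follows. This is only an illustration under stated assumptions, not the organizers' actual script: the function name `sample_by_decade`, the article dictionaries with `title`, `text` and `year` keys, and the `per_year` sample size are all hypothetical. The subsequent manual triage is not modelled.

```python
import random
from collections import defaultdict


def sample_by_decade(articles, per_year, min_chars=100, seed=0):
    """Randomly sample up to `per_year` eligible articles from each year
    and group the samples into decade buckets (e.g. 1890, 1900).

    `articles` is a list of dicts with hypothetical keys
    "title", "text" and "year".
    """
    rng = random.Random(seed)

    # Eligibility constraints stated in the text:
    # the article has a title and more than 100 characters.
    by_year = defaultdict(list)
    for art in articles:
        if art["title"] and len(art["text"]) > min_chars:
            by_year[art["year"]].append(art)

    # Sample within each year, then file the picks under their decade.
    by_decade = defaultdict(list)
    for year in sorted(by_year):
        pool = by_year[year]
        picked = rng.sample(pool, min(per_year, len(pool)))
        by_decade[year // 10 * 10].extend(picked)
    return dict(by_decade)
```

Sampling per year before bucketing by decade keeps every year of a decade represented whenever eligible articles exist for it.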
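To make the IOB representation concrete, the following sketch reads IOB-tagged lines and converts the tags of a sentence into entity spans. The file layout is an assumption made for illustration only (one token per line, tab-separated token and tag, blank line between sentences); the actual HIPE release format, including how hierarchical information is encoded, may differ. The names `read_iob` and `iob_to_spans` are hypothetical.

```python
def read_iob(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Assumed layout: "token<TAB>tag" per line, blank line between sentences.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")[:2]
        sentence.append((token, tag))
    if sentence:
        yield sentence


def iob_to_spans(tags):
    """Convert IOB tags into (start, end, type) spans, end exclusive."""
    spans = []
    start, etype = None, None
    # A final "O" sentinel closes a span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        # "B-" opens a span; a stray "I-" is leniently treated as an opening.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans
```

For instance, the tag sequence `B-pers I-pers O B-loc` yields one two-token `pers` span followed by a one-token `loc` span.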
Annotation. HIPE annotation guidelines [6] are derived from the Quaero annotation guide. Originally designed for the annotation of "extended" named entities (i.e. more than the 3 or 4 traditional entity classes) in French speech transcriptions, the Quaero guidelines have also been used on historical press corpora [28]. HIPE slightly recasts and simplifies them, considering only a subset of entity types and components, as well as of linguistic units eligible as named entities. HIPE guidelines were iteratively consolidated via the annotation of a "mini-reference" corpus, where annotation decisions were tested and difficult cases discussed. Despite these adaptations, HIPE annotated corpora will mostly remain compatible with the Quaero guidelines.
The annotation campaign is carried out by the task organizers with the support of trilingual collaborators. We use INCEpTION as an annotation tool [15], with the visualisation of image segments alongside OCR transcriptions. Before starting to annotate, each annotator is first trained on a mini-reference corpus, where the inter-annotator agreement (IAA) with the gold reference is computed. For each language, a sub-sample of the corpus is annotated by 2 annotators and IAA is computed, before and after adjudication. Randomly selected articles will also be checked by the adjudicator. Finally, HIPE will provide complementary resources in the form of in-domain word-level and character-level embeddings acquired from historical newspaper corpora. In the same vein, participants will be encouraged to share any external resource they might use. The HIPE corpus and resources will be released under a CC BY-NC-SA 4.0 license.

Evaluation
Named Entity Recognition and Classification (Task 1) will be evaluated in terms of macro and micro Precision, Recall, F-measure, and Slot Error Rate [20]. Two evaluation scenarios will be considered: strict (exact boundary matching) and relaxed (fuzzy boundary matching). Entity linking (Task 2) will be evaluated in terms of Precision, Recall and F-measure, taking into account literal and metonymic senses.
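The difference between the strict and relaxed scenarios can be illustrated with a minimal sketch of micro-averaged scoring over entity spans. It is not the official HIPE scorer. In particular, "fuzzy boundary matching" is interpreted here, as an assumption, as any token overlap between a predicted and a gold span of the same type; spans are `(start, end, type)` tuples with exclusive end, and all function names are hypothetical. Macro-averaging and Slot Error Rate are not covered.

```python
def overlaps(a, b):
    """True if spans a and b share at least one token (end exclusive)."""
    return a[0] < b[1] and b[0] < a[1]


def count_matches(gold, pred, relaxed=False):
    """Count predicted spans that match a gold span of the same type.

    Strict: identical boundaries. Relaxed (assumption): any overlap.
    Each gold span can be matched at most once.
    """
    unused = list(gold)
    tp = 0
    for p in pred:
        for g in unused:
            boundary_ok = overlaps(p, g) if relaxed else p[:2] == g[:2]
            if p[2] == g[2] and boundary_ok:
                unused.remove(g)
                tp += 1
                break
    return tp


def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_prf(docs, relaxed=False):
    """Micro-average: pool counts over all (gold_spans, pred_spans) pairs."""
    tp = n_pred = n_gold = 0
    for gold, pred in docs:
        tp += count_matches(gold, pred, relaxed)
        n_pred += len(pred)
        n_gold += len(gold)
    return prf(tp, n_pred, n_gold)
```

With gold spans `[(0, 2, "pers"), (3, 4, "loc")]` and predictions `[(0, 1, "pers"), (3, 4, "loc")]`, the truncated person mention counts as an error in the strict scenario but as a match in the relaxed one.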

Exploratory Experiments on NER for Historical French
We conducted an exploratory study in order to assess whether the massive improvements in neural NER [1,17] on modern texts carry over to historical material with OCR noise. The data for our experiments is the Quaero Old Press (QOP) corpus, 295 OCRed newspaper documents dating from December 1890 and annotated according to the Quaero guidelines [28], which we split into train (1.45M tokens), dev and test (0.2M tokens each) sets. We only consider the outermost entity level (no nested entities or components) and train on the fine-grained subcategories (e.g., loc.adm.town) of the 7 main classes.
Modeling NER as a sequence labeling problem and applying Bi-LSTM networks is state of the art [1,4,17,19]. Our experiments follow [1] in using character-based contextual string embeddings as input word representations, which makes it possible to "better handle rare and misspelled words as well as model subword structures such as prefixes and endings". These contextualized word embeddings rely on neural forward and backward character-level language models that we trained on a large collection (500M tokens) of late 19th and early 20th century Swiss-French newspapers. In accordance with the literature, a Bi-LSTM NER model with an on-top CRF layer (Bi-LSTM-CRF) works best for our data. As a baseline system, which will also be provided for the shared task, we train a traditional CRF sequence classifier [18] using basic spelling features such as a token's character prefix and suffix, the casing of the initial character and the presence of punctuation marks and digits. The baseline classifier shows fairly modest overall performance scores of 69.4% recall, 56.2% precision and 62.1 F1 (see Table 1).
Trained and evaluated on the QOP data, the neural model relying on contextual string embeddings clearly outperforms the baseline classifier. As shown in Table 1, the Bi-LSTM-CRF model achieves better F1 for all of the 7 entity types and surpasses the feature-based classifier by nearly 12 points F1. Examples in Table 1 show that the CRF model frequently struggles with entities containing misrecognized special characters and/or punctuation marks. In many such cases, the Bi-LSTM-CRF classifier is capable of assigning the correct label. These results indicate that the new neural methods are ready to enable substantial progress in NER on noisy historical texts.
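The basic spelling features listed for the baseline can be sketched as a per-token feature function. This is an illustrative approximation and not the baseline's actual feature set: the affix length of 3, the feature names, and the use of ASCII punctuation from `string.punctuation` are assumptions. Per-token feature dictionaries of this kind are a common input representation for CRF toolkits.

```python
import string


def spelling_features(token, affix_len=3):
    """Basic spelling features for one token (affix length is an assumption)."""
    return {
        "prefix": token[:affix_len],           # character prefix
        "suffix": token[-affix_len:],          # character suffix
        "init_upper": token[:1].isupper(),     # casing of the initial character
        "has_punct": any(ch in string.punctuation for ch in token),
        "has_digit": any(ch.isdigit() for ch in token),
    }


def sentence_features(tokens):
    """One feature dictionary per token, in sentence order."""
    return [spelling_features(tok) for tok in tokens]
```

Because these features look only at surface characters, a single OCR error inside a name (a misread accent or an inserted punctuation mark) changes several feature values at once.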

Conclusion
From the perspective of natural language processing (NLP), the HIPE evaluation lab provides the opportunity to test the robustness of existing NERC and EL approaches against challenging historical material and to gain new insights with respect to domain and language adaptation. From the perspective of digital humanities, the lab's outcomes will help DH practitioners in mapping state-of-the-art solutions for NE processing on historical texts, and in getting a better understanding of what is already possible as opposed to what is still challenging. Most importantly, digital scholars are in need of support to explore the large quantities of digitized text they currently have at hand, and NE processing is high on the agenda. Such processing can support research questions in various domains (e.g. history, political science, literature, historical linguistics), and knowing about its performance is crucial in order to make an informed use of the processed data. Overall, HIPE will contribute to advancing the state of the art in semantic indexing of historical material, within the specific domain of historical newspaper processing, as in e.g. the "impresso - Media Monitoring of the Past" project and, more generally, within the domain of text understanding of historical material, as in the Time Machine Europe project, which aims to apply AI technologies to cultural heritage data.

grant number CR-SII5 173719. We would also like to thank C. Watter, G. Schneider and A. Flückiger for their invaluable help with the construction of the data sets, as well as R. Eckart de Castillo, C. Neudecker, S. Rosset and D. Smith for their support and guidance as part of the lab's advisory board.