1 Introduction

Scholars working on cultural questions in Antiquity or the Middle Ages have to deal with a huge body of documentation. Literary material is a significant part of it, but the technology commonly used to support this research is still far from satisfactory. In spite of pioneering digitization efforts since the 1970s, historians and philologists still have access to few tools for operating on texts, and these are mostly limited to lexical searches. They therefore need more intelligent tools to overcome this word dependency, to access the semantics of texts and to carry out more elaborate investigations.

The Semantic Web has an increasing role to play in providing new methodological tools for cultural studies. During the last decade, several works have addressed semantic annotation and search in Cultural Heritage collections and Digital Library systems. They focus on producing Cultural Heritage RDF datasets [2, 5], aligning these data and their vocabularies with the Linked Data cloud [3, 9], and exploring and searching heterogeneous semantic data stores [4, 7, 8, 12]. In [2, 3] the authors describe how they created a five-star dataset for the entire collection of the Amsterdam Museum. They present an approach and a tool for creating linked cultural datasets for Cultural Heritage institutions, based on the Europeana Data Model (EDM), and they stress the prime importance of the vocabulary supporting the modeling and of the reuse of LOD vocabularies. The E-culture project investigates the use of Semantic Web technologies to integrate heterogeneous cultural datasets from multiple museum collections and to support the semantic annotation and exploration of these collections [7, 12]. In addition, the authors of [4] address the problem of diversifying search results when searching Cultural Heritage collections.

The Pleiades project (footnote 1) aims to collaboratively create and share historical geographic information about the ancient world in digital form. It publishes geospatial data in RDF, aligned with the Geonames dataset [5]. The SPAN project (footnote 2) addresses the problem of aligning existing datasets from the classical world on persons, names and person-like entities in Greco-Roman antiquity and publishing them as Linked Data [9]. The PELAGIOS project (footnote 3) aims to help introduce Linked Open Data into online resources that refer to places in the historic past, with the goal of providing new modes of discovery and visualization [8]. The Linked Ancient World Data Institute (LAWDI) proceedings (footnote 4) survey current initiatives and expressions of interest concerning the use of Linked Open Data (LOD) in the study of the ancient world. Most of them concern the first step, namely making the data of digital libraries or of projects working on ancient texts publicly available online.

The international research network Zoomathia (footnote 5) has been set up to address the challenge of adopting a LOD-based methodological approach in the area of the History of Science. Zoomathia primarily focuses on the transmission of zoological knowledge from Antiquity to the Middle Ages through textual resources, and considers compilation literature such as encyclopaedias. It aims to develop interconnected research on the History of Zoology in pre-modern times and to foster collaborative work involving philologists, historians, naturalists and researchers in Knowledge Engineering and the Semantic Web.

In this context, we conducted a preliminary study, first presented in [13] and updated and extended in this paper. It focuses on the fourth book of the late mediaeval encyclopaedia Hortus Sanitatis (15th century), written in Latin, which compiles ancient texts on fishes. Each chapter of this book is dedicated to one fish, with possible references to other fishes. In this work we aim at (i) automating the extraction of information from these texts, such as zoonyms and zoological sub-disciplines (ethology, anatomy, medicinal properties, etc.); (ii) building an RDF dataset and its vocabulary representing the extracted knowledge, and linking them to the Linked Data cloud; and finally (iii) reasoning on this linked data to produce new expert knowledge. We build upon the results of two previous French research projects on structuring mediaeval encyclopaedias in XML according to the TEI model and manually annotating author sources (SourceEncyMe project, footnote 6) and zoonyms (Ichtya project, footnote 7).

This paper is organized as follows: Sect. 2 presents the challenges addressed within Zoomathia and its general goals. Section 3 presents our work on knowledge extraction from the mediaeval encyclopaedia Hortus Sanitatis, while Sect. 4 describes the publication of a linked RDF dataset and its vocabularies. Section 5 presents preliminary work on the exploitation of these data to support the study of the history of pre-modern zoology, and Sect. 6 concludes the paper.

2 The Zoomathia Research Network

Zoomathia primarily focuses on the transmission of zoological knowledge from Antiquity to the Middle Ages. The intellectual challenge is to go beyond classical Quellenforschung, which focuses only on the analysis of the discontinuous transmission of specific information. We aim at operating methodically on a set of five representative works, roughly five centuries apart and spanning two millennia of zoological discourse: Aristotle's Historia Animalium (4th century BCE), Pliny's Historia naturalis (1st century CE), Isidorus' Etymologies (7th century CE), Vincent of Beauvais' Speculum Naturale (13th century CE), and Hortus sanitatis (15th century CE). Historians of zoology traditionally regard biological ambition as suffering a decline after Aristotle, if not a disappearance [10]. We should rather consider that a lasting shift in intellectual interest and cultural involvement with animals occurs in the Alexandrian and Roman period. Which animals are worth taking into account and commenting on? What is worth examining in them? What is worth saying about them?

The automatic annotation of the selected texts and the systematic identification of the topics and subjects of all their units will enable an accurate evaluation and interpretation of the development of zoological knowledge. Manual search and computing on ancient and mediaeval texts make it possible, to some extent, to address the quantitative dimension of the data, but they fail to meet the epistemological demands, which concern the scientific relevance and the diachronic features of the documentation. A large range of investigations on specific topics is inaccessible through simple lexical queries and requires a rich, scientific and semantic annotation. When investigating, for example, ethological issues (such as animal breeding, intraspecific communication or technical skills) or the pharmaceutical properties of animal products, we face scattered documentation and a changing terminology, which hamper direct access to, and a synthetic grasp of, the topics studied. An automatic, semantics-based process will help to link and cluster related data, to compare evidence in a diachronic approach and to identify the major trends in the cultural representations of animal life and behaviour.

In this network, we aim at (i) identifying a corpus of zoology-related historical data, in order to progressively encompass the whole known documentation, and (ii) producing a common thesaurus operating on heterogeneous resources (iconographic, archaeological and literary). This thesaurus should make it possible to represent different kinds of knowledge: zoonyms; historical period; geographical area; literary genre; economic context; zoological sub-discipline (ethology, anatomy, physiology, psychology, animal breeding, etc.). The difficulty of the task lies in the extreme variety of the expressions used to refer to each of these subjects. The ultimate goal is to synthesize the available cultural data on zoological matters and to cross-check them from a synchronic perspective. This would make it possible to address the crucial concern, i.e. to precisely assess the transmission of zoological knowledge over the period and the evolution of human-animal relations. Finally, this thesaurus should be published as Linked Data and linked to modern reference sources (biological and ecological) to appraise the relevance of the historical documentation.

3 Knowledge Extraction from Historians and Texts

3.1 Interviews with Historians

We conducted several interviews with three historians participating in Zoomathia to elicit a list of major knowledge elements that would be useful in the study of the transmission of ancient zoological knowledge in mediaeval texts. These include the presence (or absence) of zoonyms in the corpus texts; variant or alternative names given to an animal (polyonymy); the relative volume of text devoted to a given zoonym; references to a zoonym, and the frequency of its occurrences, outside its dedicated chapter; the geographical location of the described animals; numerical data in the text (size, longevity, fertility, etc.); and other animal properties related to zoological sub-disciplines (ethology, anatomy, physiology, psychology, animal breeding, etc.).

3.2 Extraction of Zoonyms and Animal Properties from Texts

We processed two versions of book 4 of Hortus Sanitatis: the original Latin text and its translation into French. We used the XML-structured version of these texts, which identifies the 106 chapters of the book, divided into paragraphs that together include 753 citations. We used TreeTagger to parse the Latin and French texts and determine the lemma and part of speech (PoS) of each word.

Extraction of Zoonyms. We first looked for resources available to support the knowledge extraction process. A lexicon of fish names in French and in Latin was provided by the Ichtya project, and we (knowledge engineers and historians) collaboratively built a thesaurus of zoological sub-disciplines and of the concepts involved in the descriptions relating to these sub-disciplines. We then defined two sets of patterns (i.e. syntactic rules), one for French and one for Latin, to recognize zoonyms from the lexicon of fish names among the lemmas identified in the texts. For instance, the rule SN+SADJ (noun + adjective) applies to the zoonyms Vitulus marinus and Testudo lutaria. As a result, we extracted 736 zoonyms from the 106 chapters of book 4 of Hortus Sanitatis.
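The pattern-based recognition described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag names `N` and `ADJ`, the tiny lexicon and the sample sentence are hypothetical simplifications, not the project's actual rule sets or TreeTagger's real tagset.

```python
from typing import List, Tuple

# Hypothetical excerpt of a lexicon of fish names, stored as lemma sequences.
LEXICON = {
    ("vitulus", "marinus"),
    ("testudo", "lutaria"),
    ("delphinus",),
}

# Hypothetical PoS patterns: noun + adjective (cf. SN+SADJ), or a single noun.
PATTERNS = [("N", "ADJ"), ("N",)]

Token = Tuple[str, str, str]  # (word form, simplified PoS tag, lemma)


def find_zoonyms(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Return (start index, zoonym) for each lexicon entry matched on lemmas."""
    matches = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try longer patterns first so "vitulus marinus" wins over "vitulus".
        for pattern in sorted(PATTERNS, key=len, reverse=True):
            window = tokens[i:i + len(pattern)]
            if len(window) < len(pattern):
                continue
            tags = tuple(tok[1] for tok in window)
            lemmas = tuple(tok[2].lower() for tok in window)
            if tags == pattern and lemmas in LEXICON:
                matches.append((i, " ".join(lemmas)))
                i += len(pattern)
                matched = True
                break
        if not matched:
            i += 1
    return matches


# Made-up tagged sentence, in the shape a lemmatizer could produce.
sentence = [
    ("Vituli", "N", "vitulus"),
    ("marini", "ADJ", "marinus"),
    ("et", "CC", "et"),
    ("delphini", "N", "delphinus"),
    ("natant", "V", "nato"),
]
zoonyms = find_zoonyms(sentence)
```

Matching on lemmas rather than surface forms is what lets inflected forms such as "Vituli marini" be recognized; it also explains why a wrong or missing lemma leads directly to a missed annotation.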

Extraction of Animal Properties. We processed the same two texts a second time to extract zoological sub-disciplines and animal properties. We focused on seven topics: reproduction, fishing, therapeutics, cooking, anatomy, diet and longevity. For each sub-discipline, we constructed a list of semantically related terms based on eXtended WordNet Domains (XWND, footnote 8) and BabelNet (footnote 9).

WordNet (footnote 10) is a large lexical database of English, where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept; synsets are interlinked by conceptual and lexical relations. WordNet Domains (WND) is a lexical resource where synsets have been semi-automatically annotated with one or more domain labels from a set of 170 hierarchically organized domains, thus reducing the problem of word polysemy. eXtended WordNet Domains (XWND) [6] is an ongoing work aiming to automatically improve WordNet Domains. BabelNet is a very large multilingual ontology and semantic network created by linking the largest multilingual Web encyclopedia, Wikipedia, to the most popular computational lexicon, WordNet. The integration is performed via an automatic mapping and by filling in lexical gaps in resource-poor languages based on machine translation. The result is an “encyclopedic dictionary” that provides babel synsets, i.e., concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations [11].

Each list of semantically related terms is constructed as follows:
(i) If the sub-discipline exists as an XWND domain, we extract the offsets of all the WordNet synsets in this domain and translate the terms into French via BabelNet, based on the WordNet offsets.
(ii) If the sub-discipline is not an XWND domain, we use BabelNet to extract all the terms in the network representing the specific domain.
(iii) Finally, we apply a manual processing step to reduce the number of terms in each set associated with a zoological discipline, keeping the most relevant ones.

The list of semantically related terms in Latin is the translation of the French terms, done manually by a philologist. As a result, the networks representing the cooking, therapeutics and fishing topics, in French or in Latin, each comprise about 100 terms. For instance, the Latin verbs medeor (heal) or cura (cure) are used to identify the therapeutics topic; the verbs epulor or edo (eat) are used to identify the diet topic. There are between 20 and 50 terms for the other topics.
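Once the term lists exist, topic annotation amounts to checking which lists share lemmas with a citation. The sketch below illustrates that idea only: the four Latin terms come from the examples above, while the threshold parameter `min_hits` and the sample citation are assumptions introduced for illustration.

```python
from typing import Dict, List, Set

# Heavily shortened, illustrative term lists; the lists described in the text
# hold between 20 and about 100 terms per topic.
TOPIC_TERMS: Dict[str, Set[str]] = {
    "therapeutics": {"medeor", "cura"},
    "diet": {"epulor", "edo"},
}


def annotate_topics(lemmas: List[str], min_hits: int = 1) -> List[str]:
    """Return topics whose term list shares at least min_hits lemmas with the citation."""
    found = {lemma.lower() for lemma in lemmas}
    return sorted(
        topic
        for topic, terms in TOPIC_TERMS.items()
        if len(found & terms) >= min_hits
    )


# Made-up lemmatized citation, used only to exercise the function.
citation = ["piscis", "parvus", "edo", "et", "vulnus", "medeor"]
topics = annotate_topics(citation)
```

Because a citation may match several lists at once, overlapping term networks translate directly into multiple or competing topic labels for the same segment.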

Evaluation. The analysis of the results of the automatic annotation process was conducted by knowledge engineers and validated by two philologists involved in the manual annotation. For the evaluation of the extraction of zoonyms we considered Chaps. 1 to 53 of book 4 of Hortus Sanitatis. We compared the results of the automatic annotation with those of the manual annotation of zoonyms conducted within the earlier Ichtya project. The F-measure is 0.93 for both the annotation of the Latin text and that of the French text. Most missing annotations are due to the fact that the parsing tool is unable to deduce the exact lemma of some words, especially Latin words. Among 65 missing annotations, 51 (rare) fish names were not annotated because TreeTagger does not recognize them (e.g., loligo). Other missing annotations concern compound names and are due to a mismatch between the complete fish name in the reference lexicon and the short name used in the text to be annotated (e.g. locusta instead of locusta marina). Conversely, most annotation errors are due to ambiguities between the names of marine animals and those of terrestrial animals. For instance, the lemma lupus (wolf) is present in the provided lexicon of fish names (wolffish), and the text contains some comparisons with the (terrestrial) wolf (footnote 11).
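For readers unfamiliar with the metric, the following sketch shows how an F-measure can be computed by comparing automatic annotations against a manual reference. It assumes the balanced F1 variant (the text does not say which variant was used), and the annotation sets are invented toy data, not the project's results.

```python
def f_measure(predicted: set, reference: set) -> float:
    """Balanced F-measure (F1) of predicted annotations against a reference set."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(reference)     # share of reference items that were found
    return 2 * precision * recall / (precision + recall)


# Toy annotations as (chapter, lemma) pairs; purely hypothetical values.
reference = {(1, "delphinus"), (1, "balaena"), (2, "anguilla"), (2, "lupus")}
predicted = {(1, "delphinus"), (2, "anguilla"), (2, "lupus"), (3, "lupus")}

score = f_measure(predicted, reference)
```

In this toy case one reference annotation is missed and one spurious annotation is produced, which lowers recall and precision respectively; the two kinds of error correspond to the missing annotations and the ambiguity errors discussed above.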

For the evaluation of the automatic extraction of animal properties, we used as a test set the first 25 chapters of Hortus Sanitatis, which comprise 142 citations. These citations were manually annotated by two philologists to build a reference version. They considered the same seven topics as in the automatic annotation process. Most citations were judged to relate to anatomy (62) and therapeutics (44). We then compared the result of our automatic annotation process with this reference version. The F-measure is above 0.5 for the French text and above 0.4 for the Latin text. Considering that the segments we annotate are relatively short texts, we did not expect a high value for this metric. However, it should be improved: these results are clearly those of ongoing work. Most wrong annotations are related to the therapeutics topic, whose semantic network of terms intersects the networks of other topics, especially anatomy and diet. There are texts dealing with the therapeutic power of some animal on a human organ that are therefore annotated with the anatomy topic instead of therapeutics. The question of considering the diet topic as a sub-topic of therapeutics is also currently being discussed by the philologists involved in the project. More generally, the choice of topics should be studied further. We will iteratively conduct further experiments, revising both the semantic networks representing the topics and the targeted topics themselves.

4 From Structured Data to Linked Data

The extracted knowledge has first been used to enrich the available XML annotation of Hortus Sanitatis. Then we translated the whole XML annotation (text structure, source authors, zoonyms and animal properties) into an RDF dataset and vocabularies and exploited it with SPARQL queries.

4.1 Zoomathia RDF Dataset

An RDF dataset describing Hortus Sanitatis has been automatically generated by applying an XSL stylesheet to its XML annotation. Listing 1.1 presents an extract of it describing quotation 4 of paragraph 3 of Chap. 20. It is a citation of Aristotle, referring to the crocodile zoonym and addressing the therapeutics and anatomy topics.

Listing 1.1. RDF description of quotation 4 of paragraph 3 of Chap. 20
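The authors perform this transformation with an XSL stylesheet. As a rough illustration of the same idea in Python instead of XSLT, the sketch below flattens an annotated citation into triples. Every element name, attribute name, property name and the `example.org` namespace is a hypothetical placeholder; only the content of the example (quotation 4 of paragraph 3 of Chap. 20, citing Aristotle, about the crocodile, with the therapeutics and anatomy topics) follows the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the real annotation follows the TEI model
# and uses different element and attribute names.
XML = """
<chapter n="20">
  <paragraph n="3">
    <citation n="4" author="Aristoteles">
      <zoonym ref="crocodilus"/>
      <topic ref="therapeutics"/>
      <topic ref="anatomy"/>
    </citation>
  </paragraph>
</chapter>
"""

BASE = "http://example.org/zoomathia/"  # placeholder namespace, not the project's


def citation_triples(xml_text: str) -> list:
    """Flatten each citation into (subject, predicate, object) URI triples."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for para in root.iter("paragraph"):
        for cit in para.iter("citation"):
            subject = f"{BASE}chapter{root.get('n')}/p{para.get('n')}/c{cit.get('n')}"
            triples.append((subject, BASE + "hasAuthor", BASE + cit.get("author")))
            for zoonym in cit.iter("zoonym"):
                triples.append((subject, BASE + "hasZoonym", BASE + zoonym.get("ref")))
            for topic in cit.iter("topic"):
                triples.append((subject, BASE + "hasTopic", BASE + topic.get("ref")))
    return triples


def to_ntriples(triples: list) -> str:
    """Serialize URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = citation_triples(XML)
print(to_ntriples(triples))
```

Giving each citation a URI derived from its chapter, paragraph and citation numbers keeps the location of every annotation recoverable from the RDF data alone.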

4.2 Zoomathia Vocabulary

Thesaurus of Zoonyms. Based on the lexicon initially provided by historians involved in the Ichtya project, we built a SKOS thesaurus of 137 concepts representing zoonyms, and we aligned it with both the cross-domain DBpedia ontology and the Agrovoc thesaurus specialized in Food and Agriculture (footnote 12). Listing 1.2 presents an extract of the thesaurus describing the taxon Garfish.

Listing 1.2. Extract of the thesaurus of zoonyms describing the taxon Garfish

Thesaurus of Animal Properties. We built an RDFS vocabulary of zoology-related sub-disciplines and animal properties, based on the results of the interviews with historians and on the properties extracted from the texts. It comprises 49 classes, seven of which we chose for topic detection. This is a preliminary model, which has to be developed further.

5 Reasoning on Historical Zoological RDF Data

In order to exploit the extracted RDF knowledge base, we built a set of SPARQL queries that answer questions such as “What are the zoonyms studied in this text?”, “What are the topics covered in this text?”, “Where can we find these topics?”, “What are the zoonym properties (in which chapter or paragraph or citation)?”. Note that it is the semantics captured in the constructed vocabularies that makes it possible to answer these queries: the multiple labels associated with a taxon in the thesaurus of zoonyms, and the hierarchy of zoology-related sub-disciplines, which are denoted by various terms.

We went a step further in the exploitation of the RDF dataset by writing SPARQL queries of the CONSTRUCT form to build new RDF graphs capturing synthetic knowledge. When visualized graphically, these graphs support the analytical reasoning of historians on texts. For instance, Fig. 1 presents an RDF graph capturing the relative importance of zoonyms in the Hortus Sanitatis and their location in it. At a glance, it shows that dolphins, whales and eels occupy a predominant place in this text, far ahead of other animals. Figure 2 presents the RDF graph capturing the relative importance of zoology-related sub-disciplines in the Hortus Sanitatis and their location in it. At a glance, it shows that anatomy occupies a predominant place in this text, far ahead of therapeutics and fishing.

Fig. 1. Relative importance of zoonyms in Hortus Sanitatis

Fig. 2. Relative importance of zoological topics in Hortus Sanitatis
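The kind of aggregation that underlies such "relative importance" graphs can be sketched in plain Python, as a stand-in for the SPARQL queries actually used. The citation identifiers, property names and counts below are invented for illustration and do not reproduce the dataset.

```python
from collections import Counter

# Hypothetical (citation, property, value) triples; identifiers are made up.
TRIPLES = [
    ("c1", "hasZoonym", "crocodilus"),
    ("c1", "hasTopic", "anatomy"),
    ("c1", "hasTopic", "therapeutics"),
    ("c2", "hasZoonym", "delphinus"),
    ("c2", "hasTopic", "anatomy"),
    ("c3", "hasZoonym", "delphinus"),
    ("c3", "hasTopic", "fishing"),
]


def relative_importance(triples, prop):
    """Share of annotations per value of `prop`, most frequent first."""
    counts = Counter(obj for _, pred, obj in triples if pred == prop)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.most_common()}


topic_share = relative_importance(TRIPLES, "hasTopic")
zoonym_share = relative_importance(TRIPLES, "hasZoonym")
```

Keeping the citation identifier in each triple means the same data can also answer the "where" questions, by grouping on the subject instead of counting values.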

These are preliminary results, but they already show the potential of Semantic Web technologies to support the analysis of ancient and mediaeval zoological knowledge from texts. They are the starting point of an iterative process that will be conducted collaboratively by researchers in the Humanities and in Knowledge Engineering.

6 Conclusion and Future Work

We presented preliminary work, conducted in the context of the Zoomathia network, on the mediaeval zoological encyclopaedia Hortus Sanitatis. This work combines NLP techniques to extract knowledge from texts with knowledge engineering and Semantic Web methods to build a linked RDF dataset of zoological annotations of this scientific text. It exploits this dataset to support the analysis of the ancient zoological knowledge compiled in the encyclopaedia.

We are currently working on applying the knowledge extraction process presented in this paper to a classical Latin book on fishes (Pliny, Historia Naturalis, book 9, 1st century AD), which is a major, though indirect, source of the Hortus Sanitatis. We want to address the historical perspective of zoology by comparing the knowledge extracted from these two texts and appraising, from an epistemological point of view, the density of the transmission and the evolution of zoological knowledge. We intend to systematically compare the two texts, with the aim of evaluating the loss, distortion or enrichment of information, and of comparing the relative importance in the books of the different zoological perspectives (anatomical, ethological, geographical, etc.) and of the different animal species. Reasoning on the extracted knowledge will make it possible to identify, for both texts, a “typological profile” shaped by the relative proportions of text devoted to each of the zoological specialities (or scientific perspectives). As the corpus is extended, this method will enable a progressive assessment of the epistemological evolution of zoological discourse.

In the near future we will align the Zoomathia thesaurus of zoonyms with the TAXREF taxonomy, which is specialized in Conservation Biology and integrates archaeozoological data (footnote 13). This will support the integration of heterogeneous datasets in order to cross-check the zoological evidence extracted from texts against archaeozoological and iconographical datasets. We are currently working on the formalisation and publication of a SKOS thesaurus from the JSON output format of TAXREF [1].

Given the chronological extent of the corpus and its multilingualism (Greek, Latin and modern languages), a related issue in this investigation concerns historical semantics. This affects not only the theoretical concepts but also the distinction and clustering of the animals themselves. It implies building a vocabulary that both preserves the linguistic and historical meaning of concepts and links them to the modern state of knowledge and common terminology.