As pointed out in Sect. 1, for the purposes of biofid, a general annotation scheme is required that covers taxonomic names as well as “mundane” descriptions of (at least) organisms and their temporal and spatial relationships, and that does so for German-language texts. This section describes the design of an annotation scheme that aims to fulfill this purpose. The annotated data gained in this way can be used as (a part of) a training corpus for large-scale ML methods for automatic text processing (Ahmed et al., 2019), methods which can by now be regarded as standard in bioinformatic contexts (Blaschke et al., 2002). The development of an annotation scheme has to be put to the test and should be regimented by annotation guidelines, since annotation is a data-generating rather than a data-documenting process (Consten & Loll, 2012). Accordingly, in biofid, annotation guidelines are collected as part of an annotation manual (Lücking et al., 2020). The main annotation classes and the rationale of their application are covered in the following subsections.
Ontological classification
Taxon names
One of the most conspicuous features of biological texts is the use of a certain class of proper names, namely taxonomic names. These are names that refer to kinds.Footnote 7 Accordingly, the first task for automatic processing of biological texts is to identify such kind-denoting proper names. Thus, there is a straightforward starting point of biological and text-technological collaboration within biofid, namely Named Entity Recognition (NER), a sub-task of information extraction (for a survey see, e.g., Nadeau & Sekine, 2007).
With regard to the term “entity”, we have to distinguish two usage traditions (Prechtl & Burkard, 2008, p. 138), a “classical” and a “logical” one. Classically, “entity” is a basic ontological notion, referring to something of independent existence—an individual. In logical semantics, however, an entity is ontologically unspecific and refers to any kind of extralinguistic object (things, concepts, propositions, events, sets, etc.). This is also the view of semantic data models (e.g. UML, the ER model, etc.) in computer science, where an entity is an instance of a concept. Despite this, in our annotation framework, entity and concept are two disjoint ranks; for each annotation unit (that is, each word) it has to be specified whether it refers to something of the rank entity or concept.
Why are these digressions relevant to biofid and the annotation scheme developed therein? The reason is that taxonomic names are proper names, but they do not refer to an entity in the sense of an individual; rather, they can be conceived as referring to collections of individuals. Thus, from the classical perspective, taxa cannot be the referents of names, because they simply are not individuals, while logical semantics is much more permissive in this respect. Of course, this has not gone unnoticed, and kind reference is a well-known phenomenon of natural languages (see, e.g., Chierchia, 1998). Taxon names are therefore annotated to be of rank concept.
Common nouns and WordNet categories
As mentioned above, taxonomic names are not the only part of speech that is central to the questions addressed by biofid (cf. Sect. 1). In addition, we focus on common nouns. Let us make things more concrete with an example from the biofid corpusFootnote 8:
There has been a sleeping place of Corvidae in the outskirts of Bad Salzungen for more than two decades. The birds use a small forest with old deciduous trees near the city park for their night roost. From 1985 to 1988, once a week, the author had checked the sleeping place and noted the quantity of birds. The sleeping place is used the whole year, in summer by Jackdaws (Corvus monedula) and Carrion Crows (Corvus c. corone), in winter by Rooks (Corvus frugilegus), too. The maximum crowds of sleeping birds varied from 2500 to 9000 individuals. Circadium rhythm of approaching is nearly the same in every winter evening. Arrival and departure are determined by light intensity and weather. The Corvidae are very sensitive regarding disturbance at their sleeping place, but in spite of many injuries they don’t change their sleeping trees.
This example highlights that biological texts are not confined to appellatives (e.g., birds) and taxonomic kind reference (e.g., Corvus monedula), but also contain mundane common nouns of biological import (e.g., outskirts). In order to account for genre-specific, scientific or vernacular names as well as for everyday descriptions alike, we employ a mixed classification system. “All-purpose categories” are derived from the lexical database WordNet (Fellbaum, 1998; Miller, 1995) and are used for general ontological annotation. However, some caution is appropriate in this respect (cf. Sanfilippo et al., 2006), since WordNet includes proper name entries (e.g., “Ludwig van Beethoven”) but has no instance_of relation at its disposal. Instead, WordNet uses lexical or sense relations throughout. This leads to a confusion between common nouns and proper names (what Oltramari et al., 2002, p. 18 call a “[c]onfusion between concepts and individuals”). In other words: WordNet is a lexical database or “terminological ontology” (Sowa, 2000) rather than an ontology simpliciter. However, since we distinguish “entities” from “concepts”, we meet the prerequisite for using WordNet’s 26 top-level noun categories (i.e., the unique beginner synsets for nouns) for ontological classification. WordNet distinguishes the following top-level categories:
- WordNet Categories: {person, human being}, {animal, fauna}, {plant, flora}, {group, collection}, {society}, {location, place}, {time}, {communication}, {quantity, amounts}, {event, happening}, {natural object}, {possession, property}, {attribute, property}, {body, corpus}, {food}, {artifact}, {act, action, activity}, {process}, {natural phenomenon}, {cognition, ideation}, {feeling, emotion}, {motive}, {relation}, {shape}, {state, condition}, {substance}
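To illustrate how such top-level categories can be obtained in practice, the following sketch (not part of the biofid pipeline) queries WordNet’s lexicographer file names via the NLTK library, which roughly correspond to the unique beginner categories listed above; it assumes that the NLTK WordNet data has been installed.

from nltk.corpus import wordnet as wn  # assumes nltk.download("wordnet") has been run

for word in ["dog", "oak", "garden", "report"]:
    # lexname() yields labels such as "noun.animal" or "noun.artifact"
    categories = {synset.lexname() for synset in wn.synsets(word, pos=wn.NOUN)}
    print(word, "->", sorted(categories))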
Biology-specific categories
The WordNet categories are complemented by additional biology-specific categories, though. Since WordNet distinguishes only two realms of living beings, namely plants and animals, we have to extend its categories to cover the whole variety of biological taxonomic entities. Specifically, we added the composite organism group of lichens as well as all missing taxonomic kingdoms, including Archaea, Bacteria, Chromista, Fungi, Protozoa, and Viruses. These are the accepted kingdoms according to one of the leading repositories on biodiversity data, the Global Biodiversity Information Facility (GBIF).Footnote 9 To distinguish organism names that correspond to a taxonomic entity from those used in a more general sense, we introduced the annotation category taxon. Based on our initial explorations of the biofid text corpus, we also considered it necessary to implement biology-specific annotation categories that have a more refined meaning than the related WordNet categories. In particular, for each WordNet category on the left of an arrow, we introduced the biology-specific category on its right: attribute → morphology, body → morphology, location → habitat, process → reproduction. This enables the differentiation of more general annotations from biology-specific terms and is intended to promote the adaptation and enrichment of the ontologies underlying the semantic search in the biofid portal. For instance, while every habitat is a location, not every location needs to be a habitat. All categories are considered first-class citizens of the ontology, however.
- Biology-specific Categories: {taxon}, {archaea}, {bacteria}, {chromista}, {fungi}, {protozoa}, {viruses}, {lichens}, {habitat}, {morphology}, {reproduction}
These 37 annotation categories are all on the same level and constitute the basic ontological annotation grid of biofid. Each category comes with a description which guides its application—in case of the WordNet categories, the description is obtained from the entries in the WordNet database; descriptions of the biology-specific categories are given in the Appendix. However, it turned out that WordNet’s beginner synsets for nouns (i.e., the 26 top-level categories given above) are highly anthropocentric. For instance, artifact is described as “a man-made object taken as a whole”.Footnote 10 This definition leaves open how to deal with objects like a bird’s nest, which seems to be an animal artifact. Following philosophical theories of action (Gould, 2007; Steward, 2009), we also conceive of animals as agents. Thus, contrary to (or relaxing) the WordNet descriptions, we assume that any category that involves an agent applies to non-human agents as well.
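For illustration, the complete category inventory and the biology-specific refinements introduced above can be sketched as follows (a simplified Python rendering for expository purposes, not the actual biofid resource files):

# 26 WordNet top-level categories (shortened to one representative term each)
WORDNET_CATEGORIES = {
    "person", "animal", "plant", "group", "society", "location", "time",
    "communication", "quantity", "event", "natural object", "possession",
    "attribute", "body", "food", "artifact", "act", "process",
    "natural phenomenon", "cognition", "feeling", "motive", "relation",
    "shape", "state", "substance",
}

# 11 biology-specific categories added by biofid
BIOLOGY_CATEGORIES = {
    "taxon", "archaea", "bacteria", "chromista", "fungi", "protozoa",
    "viruses", "lichens", "habitat", "morphology", "reproduction",
}

# Biology-specific refinements of more general WordNet categories
REFINEMENTS = {
    "attribute": "morphology",
    "body": "morphology",
    "location": "habitat",
    "process": "reproduction",
}

assert len(WORDNET_CATEGORIES | BIOLOGY_CATEGORIES) == 37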
Now, basically, any instance of any of the above-listed categories can either be referred to by means of a proper name, or described by means of predication. In the sentence Lassie is a dog, for example, Lassie is a proper name according to the classical notion: it picks out a specific individual (a dog, in this case). The common noun dog, as well as the corresponding technical taxon term Canis lupus familiaris, lacks such a discerning power. Hence, we distinguish proper names referring to single individuals (“Lassie”) from proper names and common nouns referring to other ontological classes such as sets (“dog”). For this purpose, we assign any application of an annotation label to either the rank entity (an individual referred to by a proper name) or \(\overline{{{\textsc{concept}}}}\) (a class, or set of entities). Typographically, annotation categories that refer to a classical entity are typeset in small caps, while concepts are additionally indicated by an overbar—for instance, pers is the label for a proper name whose bearer is a human being (Alfred Russel Wallace), while \(\overline{{{\textsc{animal}}}}\) labels a noun that denotes a set of entities of the kingdom of animals, such as dogs. We employ this typographic convention in the examples given throughout the paper.
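A minimal sketch of how a single label application could be represented, assuming hypothetical class and field names (this is not biofid’s actual data model), is the following:

from dataclasses import dataclass
from enum import Enum

class Rank(Enum):
    ENTITY = "entity"    # a single individual, typically the referent of a proper name
    CONCEPT = "concept"  # a class or set of entities, marked by an overbar in the text

@dataclass
class LabelApplication:
    token: str
    category: str  # e.g. "person", "animal", "taxon"
    rank: Rank

wallace = LabelApplication("Alfred Russel Wallace", "person", Rank.ENTITY)  # pers
dogs = LabelApplication("dogs", "animal", Rank.CONCEPT)                     # animal, concept rank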
Multiple classification
The ontological annotation categories outlined in the preceding section comprise both very general and more specific labels. For instance, presumably any object from the physical world can be said to be a \(\overline{{{\textsc{natural object}}}}\), including animals and plants. So, for a given animal, what is the correct annotation label: \(\overline{{{\textsc{natural object}}}}\) or \(\overline{{{\textsc{animal}}}}\)? Since this does not seem to be an either-or question, we decided to employ a multi-label annotation.Footnote 11 In fact, annotation units receive multiple annotation labels as a rule, not as an exception.
The following examples illustrate the multiple annotation approach of biofid by means of some “real world” data:
- The most common multiple annotation within biofid probably is the annotation of taxonomic names. Each taxonomic name is marked as such (i.e., \(\overline{{{\textsc{taxon}}}}\)) and coupled with a label indicating the biological kingdom of the taxon, such as \(\overline{{{\textsc{plant}}}}\), \(\overline{{{\textsc{animal}}}}\), or \(\overline{{{\textsc{fungi}}}}\).
- The category morph(ology) explicitly mentions parthood.Footnote 12 Accordingly, when \(\overline{{{\textsc{morph}}}}\) is used in addition to some other label, it is interpreted as “morphological part of [that other label]”. For instance, a combination of \(\overline{{{\textsc{morph}}}}\) and \(\overline{{{\textsc{plant}}}}\) characterizes a part of a plant (say, its stem). That is, morph implements a minimal mereology.
- A garden is an artificially created location that also provides a living environment for plants and animals. Its heterogeneity is captured by the following multiple categorization: \(\overline{{{\textsc{location}}}}\), \(\overline{{{\textsc{habitat}}}}\), \(\overline{{{\textsc{artifact}}}}\). However, since garden is also a GeoNames entity (see Sect. 2.3), namely s/gdn (read: sub-category gdn in main category s), it is sufficient to use the GeoNames classification, which can be mapped onto the more elaborate multiple annotation.
- A report can be categorized as artifact, communication, and cognition, since it is man-made (artifact), conveys information (communication), and is the result and possibly the trigger of mental processes (cognition).Footnote 13
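The conventions exemplified above can be summarized as label sets per expression, as in the following sketch (all labels here are of rank concept; the lemma keys are merely illustrative):

MULTI_LABEL_CONVENTIONS = {
    "Corvus frugilegus": {"taxon", "animal"},             # taxonomic name plus kingdom
    "stem": {"morphology", "plant"},                      # morphological part of a plant
    "garden": {"location", "habitat", "artifact"},        # heterogeneous, man-made habitat
    "report": {"artifact", "communication", "cognition"},
}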
A multiple annotation approach avoids the decision problem of choosing just one ontological label. However, it poses problems of its own, most notably the difficulty of keeping annotations consistent. On the one hand, overly permissive annotations have to be avoided. Although there is not just one “correct” ontological label in most cases (what is the true, single category of, say, peduncle or inquiry?), classification is by no means arbitrary. Re-using a previous example: a garden involves plants,Footnote 14 but is not a plant itself. Hence, it would go too far to label garden with the category \(\overline{{{\textsc{plant}}}}\).
On the other hand, annotation should be as informative as possible. For instance, classifying a report merely as \(\overline{{{\textsc{artifact}}}}\) would be correct, but not very informative. This approach would simply group together reports with other kinds of artifacts (that is, man-made objects), such as shoes, cooking spoons, or space ships. Rather, a report is also an instance of communication, and multiple annotation should reflect this. One challenge for multiple annotation projects therefore is to find the right level of granularity. Within biofid, this challenge is met by means of three measures:
1. Annotators discuss extracts of their annotations and highlight difficult examples at regular annotation meetings.
2. Such a meeting can result in finding annotation conventions (like the previous examples), which are compiled in the annotation manual.
3. Tool-wise, a consistent annotation is supported by a recommendation function, which is described in Sect. 4.1 (roughly speaking, a recommendation assigns the annotation categories chosen by the annotator for a given token to all tokens of the same lemma within a certain text span).
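As a rough sketch of the lemma-based recommendation mentioned in item 3 (with simplified token and label types; the actual implementation belongs to the annotation tool described in Sect. 4.1):

from typing import Dict, List, Set, Tuple

def recommend(tokens: List[Tuple[str, str]],   # (surface form, lemma) pairs
              annotated_index: int,
              labels: Set[str],
              window: int = 100) -> Dict[int, Set[str]]:
    """Propose the chosen labels for all tokens sharing the annotated token's lemma."""
    _, lemma = tokens[annotated_index]
    lo = max(0, annotated_index - window)
    hi = min(len(tokens), annotated_index + window + 1)
    return {i: set(labels)
            for i in range(lo, hi)
            if i != annotated_index and tokens[i][1] == lemma}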
Time and space
Following the basic question posed in Sect. 1—Which species occurred when and where?—two categories receive special attention, namely loc(ation) and time.Footnote 15 To this end, we apply GeoNamesFootnote 16 locations, which subdivide WordNet’s category loc into nine major categories, while time is coded according to the ISO standard ISOTimeML (ISO, 2012).
GeoNames categories include geographical entities like cities, lakes, countries, or landmarks. Thus, any location is assigned to one of the following main categories, each of which is denoted by an alphabetic character:
- A: {country, state, region, …}
- H: {stream, lake, …}
- L: {parks, area, …}
- P: {city, village, …}
- R: {road, railroad}
- S: {spot, building, farm}
- T: {mountain, hill, rock, …}
- U: {undersea}
- V: {forest, heath, …}
In addition to the nine GeoNames main classes, there are 680 sub-categories (excluding unavailable), which allow for a very fine-grained categorization.Footnote 17
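For illustration, a location annotation combining a GeoNames main class with one of its sub-categories (as in the garden example above, s/gdn) might be rendered as follows; the function name is a hypothetical one, not part of the GeoNames API:

GEONAMES_MAIN_CLASSES = {
    "A": "country, state, region", "H": "stream, lake", "L": "parks, area",
    "P": "city, village", "R": "road, railroad", "S": "spot, building, farm",
    "T": "mountain, hill, rock", "U": "undersea", "V": "forest, heath",
}

def geonames_label(main: str, sub: str) -> str:
    """Render a label such as 's/gdn' (sub-category gdn within main category s)."""
    assert main.upper() in GEONAMES_MAIN_CLASSES
    return f"{main.lower()}/{sub.lower()}"

print(geonames_label("S", "GDN"))  # -> s/gdn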
The time-annotation unit is about temporal entities, “the fourth coordinate that is required (along with three spatial dimensions) to specify a physical event” (WordNetFootnote 18), including clock times (“a reading of a point in time as given by a clock”, WordNetFootnote 19). Following ISO-TimeML (ISO, 2012), we distinguish date (referring to calendric time units), \(\overline{{{\textsc{time}}}}\) (referring to daytimes, even in an unspecific way), duration/\(\overline{{{\textsc{duration}}}}\) (referring to temporal intervals), and set/\(\overline{{{\textsc{set}}}}\) (quantifying over time points or intervals, say as a result of repetition).
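As a toy illustration (not a TimeML implementation), the four-way distinction assigns example expressions as follows:

TIMEX_EXAMPLES = {
    "17. Juli 1898": "DATE",        # calendric time unit
    "abends": "TIME",               # (unspecific) daytime ('in the evening')
    "zwei Jahrzehnte": "DURATION",  # temporal interval ('two decades')
    "jeden Morgen": "SET",          # quantification over time points ('every morning')
}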
In addition to the above-mentioned categories, a document exhibits both a distinguished location and a distinguished date, namely the document creation location (DCL) and the document creation time (DCT), respectively (Pustejovsky, 2017a, b). DCL and DCT are used to label those locational or temporal expressions that refer to the place and time of the author writing the text. Note that DCL and DCT may be given as part of the metadata of a given text, or they may be unknown. An example of a DCT mentioned at the beginning of the main text is given in (1) (from document 3673151).
(1) Herr cand. iur. Hepp hat Isoetes lacustris L. am 17. Juli dieses Jahres (1898) im Steinsee bei Grafing angetroffen.
(Mr. Hepp found Isoetes lacustris L. on July 17 of this year (1898) in the Steinsee near Grafing.)
By using the demonstrative noun phrase dieses Jahr ‘this year’ the author refers to the year in which he or she was actually writing the sentence. This indexical reference is resolved by the date given in parentheses (viz. 1898). That is, 1898 can be tagged “DCT”. Once this label has been applied, it is at our disposal for resolving further indexically given temporal expressions. Later in the text we find jetzt ‘now’:
(2) Die von Schmidt bezeichnete Stelle nimmt jetzt eine kultivierte Wiese ein.
(The place designated by Schmidt is now occupied by a cultivated meadow.)
By identifying jetzt ‘now’ with the DCT within the annotation tool (see Sect. 4.1), we obtain the information that there was a cultivated meadow at that place as of 1898.
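A minimal sketch of this DCT-based resolution, assuming a simple year-valued DCT and a hypothetical resolution function (the actual resolution is carried out within the annotation tool):

DCT = {"doc_3673151": 1898}  # derived from "dieses Jahres (1898)" in example (1)

def resolve_indexical(expression: str, doc_id: str) -> int:
    """Map indexicals such as 'jetzt' or 'dieses Jahr' to the document's DCT year."""
    if expression in {"jetzt", "dieses Jahr", "dieses Jahres"}:
        return DCT[doc_id]
    raise ValueError(f"no resolution rule for {expression!r}")

print(resolve_indexical("jetzt", "doc_3673151"))  # -> 1898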
Beyond words
In biofid, we basically pursue a word-based annotation.Footnote 20 However, there are a couple of phenomena that go beyond single words but nonetheless affect word annotations. In the following, we discuss compounds, possessives, speaker’s reference, and anaphora as examples.
Compounds
The main language of the texts investigated in biofid is German. Ever since Mark Twain’s “The Awful German Language”, German has famously been known to be a compounding language. A nominal compound is a noun which consists of a head and one or more modifying components (Matthews, 1991, Sect. 5). A modifying component can be an adjective (green tea), a verb (swimming pool), or another noun (football). Most nominal compounds are determinative, meaning that the modifying expression determines the head noun. From a taxonomic perspective, the head noun determines the compound’s category. Hence, a compound is labeled only according to its head. For instance, football is labeled as an \(\overline{{{\textsc{artifact}}}}\), and not (additionally) as \(\overline{{{\textsc{body}}}}\).
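A minimal sketch of head-based compound labeling, assuming a toy category lexicon and pre-segmented constituents (in German, the head is the rightmost constituent):

from typing import Dict, List, Set

CATEGORY_LEXICON: Dict[str, Set[str]] = {
    "Ball": {"artifact"},  # ball
    "Fuß": {"body"},       # foot
    "Nest": {"artifact"},  # nest
    "Vogel": {"animal"},   # bird
}

def label_compound(constituents: List[str]) -> Set[str]:
    """Label a determinative compound according to its head (the last constituent)."""
    return CATEGORY_LEXICON[constituents[-1]]

print(label_compound(["Fuß", "Ball"]))    # Fußball 'football' -> {'artifact'}
print(label_compound(["Vogel", "Nest"]))  # Vogelnest 'bird's nest' -> {'artifact'}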
Possessives
Genitive noun phrases raise the question of how to deal with possessives and relational nouns in general. Take, for instance, the following example: the foodplant of the monophagous moorland clouded yellow [which is Colias palaeno]. Here we have two nouns, the head noun foodplant and the modifying compound noun moorland clouded yellow (which itself is modified by the adjective monophagous). Genitives can be thought of as functions in the mathematical sense: the head noun applies to the modifying noun and returns a value. However, although the referent type is uniquely determined, the returned value does not need to be a specific individual. Accordingly, the example calls for a nested annotation, which, in this case, is on the level of concepts: [foodplant of the monophagous [moorland clouded yellow]\(\overline{{{\textsc{taxon}}}}\),\(\overline{{{\textsc{animal}}}}\)]\(\overline{{{\textsc{plant}}}}\).
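A minimal sketch of such a nested annotation, using a hypothetical span structure (the inner span carries its own labels, the outer span the label of the value type returned by the relational head noun):

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Span:
    text: str
    labels: Set[str]
    children: List["Span"] = field(default_factory=list)

inner = Span("moorland clouded yellow", {"taxon", "animal"})
outer = Span("foodplant of the monophagous moorland clouded yellow",
             {"plant"}, children=[inner])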
Speaker’s reference
Definite noun phrases show an interesting feature: their usage can pick out an individual—just like a proper name does—even if this individual is unknown to the speaker. Suppose the speaker listens to a radio broadcast which announces that the jackpot has been hit. Then the speaker can assert The lottery winner must be happy, referring to the jackpot winner, whoever he or she is. This usage contrasts with the noun phrase in, e.g., Yesterday, my sister hit the jackpot, where the genitive noun phrase my sister refers to a particular individual known to the speaker.Footnote 21 Within the biofid text corpus, there are descriptions such as every morning, I saw the swallow leaving its bird-nest, which are about a specific bird the author observed. However, there are also general statements such as the swallow builds its bird-nest in March, where the noun phrase receives a kind reading. We want to capture these two different usages of nouns within biofid. To this end, the distinction between specific and unspecific is introduced. As a rule of thumb, the following question guides the specific/unspecific distinction: Is the author speaking as an eyewitness? If yes, the annotation unit is a specific one; if no (e.g., if the author refers to general knowledge), the annotation unit is unspecific.
Anaphora
So far we have only considered nouns and noun phrases which are used by text authors as part of their real-world observations. This leads to the question of how to deal with nominal expressions that are used in other ways, most importantly anaphoric ones (nominals whose interpretation rests on their linguistic context). We have to consider two main classes in this respect, namely pronouns and anaphorically used definite noun phrases. Both kinds of expressions refer back to some preceding noun phrase in the text.Footnote 22 We can find examples of both types of nominal expression in the example extract in Sect. 2.1. In the final sentence, the plural pronoun their occurs, referring back to The Corvidae. The second sentence starts with The birds, referring back to Corvidae from the initial sentence. Hence, there are two mentions of Corvidae without using that name! However, a pronoun receives its interpretation only from its antecedent, and assigning antecedents is a computational linguistics task of its own, known as anaphora resolution (Mitkov, 2013). For that reason, pronouns are ignored; they do not constitute a markable in biofid.
In contrast to pronouns, anaphoric noun phrases exhibit a descriptive content that can be annotated. Although the noun phrase The birds picks up Corvidae, it nonetheless is about birds and can be labeled accordingly (i.e., \(\overline{{{\textsc{animal}}}}\)). Hence, anaphoric noun phrases are labeled according to their descriptive information.
As of the time of writing, there are 79,813 “net” annotations (5877 of rank entity and 73,936 of rank \(\overline{{{\textsc{concept}}}}\), cf. Table 1 in Sect. 4.2). These annotations have been carried out according to an annotation hub procedure.