Modeling principles
Existing approaches for quantitative reconstruction of paleoclimate based on observational data fall into three main categories: co-occurrence, taxon assemblage and ecometric approaches.
Reasoning about paleoclimate based on fossil occurrences has been around for nearly as long as paleontology itself. Early qualitative approaches relied on parallels with living species. For example, if a fossil of an elephant has been found, one could argue, to an approximation, that the environmental conditions should have been similar to those where elephants live today. Quantitative approaches to climate estimation based on fossils go back to the nineteenth century, when Oswald Heer estimated yearly temperatures from the climatic requirements of the living relatives of pre-Quaternary Cenozoic trees (Footnote 2). Recent approaches (Mosbrugger and Utescher 1997; Utescher et al. 2014) derive estimates from assemblages of species using similar principles. For each fossil species, its climatic range is inferred by considering the habitats of its living relatives. The climate of a fossil locality is then estimated as the intersection of the climatic ranges of all fossil species encountered at that locality. Since inference is based on the intersection of ranges, it is not required that all living relatives share a locality in the present day. These approaches fall under lazy machine learning (Aha et al. 1991), since they do not produce a model in a functional form; instead, generalization is delayed until a query about a particular locality is made.
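The range-intersection idea can be illustrated with a minimal sketch. The taxon names, the temperature ranges and the function name below are hypothetical and serve only to show the mechanics; they are not taken from the cited methods.

```python
from typing import Dict, List, Optional, Tuple

# (lowest, highest) value of one climate variable tolerated by a taxon
Range = Tuple[float, float]

# Hypothetical ranges of mean annual temperature (deg C), as they might be
# inferred from the habitats of living relatives.
CLIMATE_RANGES: Dict[str, Range] = {
    "taxon_a": (10.0, 22.0),
    "taxon_b": (14.0, 26.0),
    "taxon_c": (12.0, 18.0),
}


def coexistence_interval(taxa: List[str], ranges: Dict[str, Range]) -> Optional[Range]:
    """Intersect the climatic ranges of all taxa found at one locality.

    Taxa without a known range are skipped. Returns None when no range is
    known or when the known ranges do not overlap.
    """
    known = [ranges[t] for t in taxa if t in ranges]
    if not known:
        return None
    low = max(r[0] for r in known)    # the warmest lower bound
    high = min(r[1] for r in known)   # the coolest upper bound
    return (low, high) if low <= high else None


interval = coexistence_interval(["taxon_a", "taxon_b", "taxon_c"], CLIMATE_RANGES)
print(interval)
```

No model is stored between queries: the intersection is recomputed for each locality, which is what makes the approach lazy.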
An alternative is to consider as units of analysis real localities in the present day, where the occurrence of taxa and the climate variables are known, and to fit a model in a functional form relating the two. Such approaches, known as taxon assemblage models, rely on species or higher taxa indicating particular environmental conditions. The relationships between occurrence of taxa and climate are modeled computationally by fitting, for example, a regression model (Fernandez and Vrba 2006). The main challenge is to find a mapping between the past species (features) and the present species; such a mapping is most commonly constructed manually from domain knowledge. To the best of our knowledge, computational approaches for discovering such mappings have not yet been reported, but there is potential for discovering them from data, for example, using information-theoretic approaches (Tatti and Vreeken 2012).
The third type of approach does not explicitly link past and present species, but instead transforms the feature space into a representation that is common to past and present species. In paleobiology such approaches are known as ecometrics (Footnote 3), and primarily refer to capturing functional relationships between faunal communities and their environments (Fortelius et al. 2002; Liu et al. 2012; Eronen et al. 2010a; Polly et al. 2011; Vermillion et al. 2017), as opposed to relying on taxonomic links as in the earlier two types of approaches. This computational methodology can be used for analyzing evolutionary contexts (Eronen et al. 2009; Schnitzler et al. 2017; Sukselainen et al. 2015), analyzing global scale relationships between animals and their environments (Liu et al. 2012; Zliobaite et al. 2016; Eronen et al. 2010a; Polly and Head 2015; Barr 2017; Lawing et al. 2012), and reconstructing past climates and environmental change (Fortelius et al. 2016; Saarinen 2015; Eronen et al. 2010b; Meloro and Kovarovic 2013). Different traits have been explored for ecometric analysis. For plants, leaf shapes have been considered (Wolfe 1995; Traiser et al. 2005). For animals, the traits considered include teeth (Eronen et al. 2010a; Fortelius et al. 2016; Liu et al. 2012; Zliobaite et al. 2016; Meloro and Kovarovic 2013; Polly and Head 2015), limbs and locomotion (Barr 2017; Polly and Head 2015), skeletal traits (Lawing et al. 2012), as well as body mass (Meloro and Kovarovic 2013). Traditionally, ecometrics refers to the analysis of animals. Conceptually similar computational approaches for the analysis of plant traits are referred to as transfer functions (Wolfe 1995; Traiser et al. 2005) or as species distribution models in ecology (Elith and Leathwick 2009; Ovaskainen et al. 2017). Generic statistical approaches, such as ordinary least squares regression, have served as core computational techniques in ecometrics until quite recently (Zliobaite et al. 2016).
Ecometric approaches link the present and the past via functional traits of species, disregarding their shared origin. Even though most present-day species are likely to differ from those in the fossil record, the functional traits of taxa, governed by the laws of physics, chemistry, and physiology, are likely to be similar in the present and in the past. For example, animals that run tend to leave the same pattern of skeletal architecture, e.g. long limbs (Reed 2013). Similar patterns of adaptation within communities would indicate similar habitats; therefore, reconstructing them can be approached as a predictive modeling task.
Relation to transfer learning and concept drift
In machine learning, a situation where labeled training data is lacking is quite common, particularly in text or image processing domains. The research area concerned with reusing models built for one problem on another, related problem is known as transfer learning (Pan and Yang 2010). There are three major categories of transfer learning: inductive transfer, where the source and target domains are the same, but the prediction tasks are different; transductive transfer, where the source and target domains are different, but the prediction task is the same; and unsupervised transfer, where both are different. Environment reconstruction from the fossil record falls under transductive transfer. The prediction task is the same in the present and the past, but there is very little direct overlap in the feature space between the present, as the source domain, and the past, as the target domain. Within transductive transfer, our task falls under feature representation transfer approaches, where the goal is to find a feature representation that reduces both the difference between the source and the target domains and the error of the classification or regression model (Pan and Yang 2010).
Feature representation transfer approaches became increasingly popular about a decade ago in connection with natural language processing. Solutions can roughly be divided into two general types, supervised and unsupervised. The first type of solution transforms (Raina et al. 2007; Argyriou et al. 2007; Davis et al. 2007) or augments (Daume III 2007; Li et al. 2014) the feature space, or infers correspondence between features (Blitzer et al. 2006), in a supervised way. The second type of solution assumes some overlap between feature spaces and tries to find a joint representation or links between the two domains in an unsupervised way, for instance, by means of co-clustering (Dai et al. 2007; Gopalan et al. 2014; Hoffman et al. 2012). The main difference from the fossil setting is that in these approaches at least some labeled data is assumed to be available for the target domain, while in the fossil setting there is none. What is available instead is auxiliary data that describes the features, but not the instances, in the source and target domains. This auxiliary data allows establishing links between the source and the target domain even in the complete absence of labels in the target domain.
Concept drift is another closely connected research area in machine learning. Concept drift refers to a data distribution changing over time, as a result of which predictive models have to detect changes and update themselves automatically as new data keeps arriving (Gama et al. 2014). From the concept drift perspective, transfer learning deals only with abrupt changes that are known in advance. That is, transfer learning does not need to detect change: the fact that the source and the target domains are different is known, and the main algorithmic challenge is how to adapt when the target labels are scarce. Fossil data has a time dimension, as in concept drift settings, but fossil data does not arrive in a stream over time. Therefore, there is no need for real-time processing and no particular pressure for incremental algorithms. Unless a very coarse spatial resolution is necessary or climate models (Stute et al. 2001) need to be run in parallel to provide additional feedback for the analysis, it is reasonable and practical to have algorithms access and operate on all the historical data. In fact, data of the past is rather scarce, and the most recent data, for which labels are available, is more abundant.
Concept drift in fossil data can be of varying intensity, depending on how severe the climate change has been, but most of the time an incremental and continuous drift is taking place, as can be seen from Fig. 2. The plot shows aggregated statistics of the fossil record in the Turkana Basin area in Africa over the last 8 million years, from which a decline in feature overlap between the present and the past can be seen. Standard adaptive preprocessing methods that have been developed to deal with drifts in feature space (Zliobaite and Gabrys 2014) do not directly apply here, since labeled data is only available at one point in time, at the end point of the time series, while those feature space adaptation methods require continuous feedback. This setting is thus a novel combination of transfer learning and concept drift settings, and therefore presents interesting algorithmic challenges that have not been considered before in either area. We do not claim that the approaches formulated here solve the problem from the algorithmic point of view, but rather present simple baselines and a conceptual task setting for further research. These approaches can be relevant not only for geological data, but for any historical data where we cannot go back to the past and obtain labels, for example, the analysis of old texts or old demographic or social data.
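One simple way to quantify drift in the feature space is the share of species in a time bin that still occur in the present day. The sketch below is an illustration only: the species labels, the time bins and the choice of statistic are assumptions, and the statistic plotted in Fig. 2 may be defined differently.

```python
from typing import Dict, Iterable, Set


def overlap_fraction(fossil_species: Iterable[str], present_species: Set[str]) -> float:
    """Share of the species in one time bin that also occur in the present day."""
    fossil = set(fossil_species)
    if not fossil:
        return 0.0
    return len(fossil & present_species) / len(fossil)


# Hypothetical species lists; keys are ages in millions of years.
present_day = {"a", "b", "c", "d"}
species_by_age: Dict[int, Set[str]] = {
    1: {"a", "b", "c", "e"},
    4: {"a", "b", "e", "f"},
    8: {"a", "e", "f", "g"},
}

profile = {
    age: overlap_fraction(species, present_day)
    for age, species in sorted(species_by_age.items())
}
print(profile)
```

In this toy data the overlap shrinks with age, which is the kind of gradual drift described above.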
Algorithmic approaches
Based on the existing work in paleontology, we formulate three baseline approaches for paleoclimate reconstruction. We analyze the performance experimentally, with a focus on the transfer learning aspects, to address the challenge that the data is subject to a persistent concept drift over time. By concept drift here we primarily refer to drift in the feature space, as illustrated in Fig. 2. The traditional incremental concept drift methods (see e.g. Gama et al. 2014) do not apply to this setting, because there are no possibilities for online model updates. Standard concept drift handling methods require continuous feedback via arriving true labels, and in the fossil setting there are no labels for the past. There is, however, auxiliary data which can be used to link the present and the past, and that is where the transfer learning perspective comes into the solution. Transfer learning here refers to mechanisms for training predictive models that can be applied to, and are expected to perform well on, data with different characteristics or a different distribution from that of the training data. There are no model updates, as there would be in the concept drift setting, but there are internal model training mechanisms that produce models tailored for the target data right away.
The algorithmic approaches presented here conceptually follow the three types of methods discussed in the related work section, but they are not direct replicas of those methods. We have simplified the algorithmic approaches to make their transfer learning mechanisms comparable to each other, and therefore we refer to them as baselines. Our goal is to introduce the principle to the data mining and machine learning community, and leave computational choices open for future improvement. We therefore make the datasets and the code for our experimental analysis publicly available (Footnote 4).
The mean habitat approach works as follows. For each reference species from the present day, the mean of the target variable representing environmental conditions is computed over the localities where the species occurs. Then, for each fossil species, the nearest living relative is found based on domain expertise; this is the transfer learning aspect. The mean environmental condition of the nearest living relative is assigned to the fossil species. The prediction for a given locality is a simple average over the environments of all species occurring there.
For example, if a fossil site has a giraffe and an elephant, we look at where these animals or their nearest living relatives occur in the present day. Suppose we find three sites where elephants occur in the present day, with mean annual temperatures of 20, 19 and 21 °C, respectively. The average environmental condition for the occurrence of an elephant is thus 20 °C. Similarly, suppose the average condition for the occurrence of a giraffe is 24 °C. Since the fossil locality has an elephant and a giraffe, we average over the mean environmental conditions of each animal and get a temperature estimate for the fossil locality equal to 22 °C. The catch is that we find species in the fossil record that do not exist today, for example, a mammoth. Then we have to link the mammoth with an elephant of the present day. This can be done computationally, but for now we use existing taxonomic trees to map an approximate relation. The nearest relative mapping has been assembled specifically for this paper and is given in the “Appendix” section.
The mean habitat approach is a simplified version of the co-occurrence approach (Mosbrugger and Utescher 1997). The main difference is that instead of picking a range of overlapping habitats we take a simple average over all the habitats. The mean habitat approach does not explicitly build a predictive model, but works as an instance-based learning approach. The approach is summarized in Algorithm 1.
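The elephant and giraffe example can be reproduced with a short sketch. This is an illustration under stated assumptions rather than the published Algorithm 1: the function names are hypothetical, the two giraffe temperatures are made up so that their mean is 24 °C, and the relative mapping contains a single illustrative entry.

```python
from statistics import mean
from typing import Dict, Iterable, List, Set, Tuple

# Present-day localities as (species present, mean annual temperature in deg C).
# The elephant values follow the worked example; the giraffe values are assumed.
PRESENT_DAY: List[Tuple[Set[str], float]] = [
    ({"elephant"}, 20.0),
    ({"elephant"}, 19.0),
    ({"elephant"}, 21.0),
    ({"giraffe"}, 23.0),
    ({"giraffe"}, 25.0),
]

# Hypothetical nearest-living-relative mapping (fossil species -> living species).
NEAREST_LIVING_RELATIVE: Dict[str, str] = {"mammoth": "elephant"}


def species_mean_habitat(localities: List[Tuple[Set[str], float]]) -> Dict[str, float]:
    """Mean of the target variable over all localities where each species occurs."""
    values: Dict[str, List[float]] = {}
    for species_set, temperature in localities:
        for species in species_set:
            values.setdefault(species, []).append(temperature)
    return {species: mean(temps) for species, temps in values.items()}


def predict_locality(fossil_species: Iterable[str],
                     habitat: Dict[str, float],
                     relatives: Dict[str, str]) -> float:
    """Average the mean habitats of the living relatives of the fossil species."""
    linked = []
    for species in fossil_species:
        living = relatives.get(species, species)  # extant species map to themselves
        if living in habitat:
            linked.append(habitat[living])
    if not linked:
        raise ValueError("no species at this locality is linked to the present day")
    return mean(linked)


habitat = species_mean_habitat(PRESENT_DAY)
estimate = predict_locality(["mammoth", "giraffe"], habitat, NEAREST_LIVING_RELATIVE)
print(habitat["elephant"], habitat["giraffe"], estimate)
```

Fossil species that cannot be linked to any living species are skipped in this sketch; how such cases should be handled is a design choice that the text leaves open.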
The taxon assemblage approach works in a regular machine learning task setting, where the goal is to learn a predictive model over a set of binary features, each of which describes the presence or absence of a particular species. The learning instances are geographic areas (known as localities in the fossil record). The transfer learning element is that each species in the fossil record needs to be mapped to a species of the present day in order to match the input feature space of the training data (the present day) with that of the application data (fossils). The mapping is based on the identification of the nearest living relative, which requires domain expertise. We use the same mapping as for the mean habitat approach; it is given in the “Appendix” section.
The taxon assemblage approach tested here is a simplified version of Fernandez and Vrba (2006). The main difference is that our simplified approach uses presence and absence data, while taxon-based approaches applied in paleobiology typically work on relative abundances of taxa (the proportion of each taxon in each locality). Here we use simple occurrence data in order for this approach to be experimentally comparable to the other two approaches, using the same occurrence information. The approach is summarized in Algorithm 2.
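The feature construction step of this approach can be sketched as follows. The species names, the mapping entry and the function names are hypothetical, and only the encoding is shown; any regression model could then be fit on the present-day rows and applied to the fossil row.

```python
from typing import Dict, Iterable, List, Optional, Set


def build_vocabulary(present_localities: List[Set[str]]) -> List[str]:
    """Sorted list of all species seen in the present-day training localities."""
    return sorted(set().union(*present_localities))


def encode(species: Iterable[str],
           vocabulary: List[str],
           relatives: Optional[Dict[str, str]] = None) -> List[int]:
    """Binary presence/absence vector over the present-day vocabulary.

    Fossil species are first replaced by their nearest living relative.
    Species that remain outside the vocabulary contribute nothing.
    """
    relatives = relatives or {}
    mapped = {relatives.get(s, s) for s in species}
    return [1 if taxon in mapped else 0 for taxon in vocabulary]


# Hypothetical present-day localities and one hypothetical mapping entry.
present = [{"elephant", "giraffe"}, {"zebra", "giraffe"}, {"elephant"}]
relatives = {"mammoth": "elephant"}

vocabulary = build_vocabulary(present)
X_train = [encode(locality, vocabulary) for locality in present]
x_fossil = encode({"mammoth", "zebra"}, vocabulary, relatives)
print(vocabulary, X_train, x_fossil)
```

After the mapping, the fossil locality lives in the same binary feature space as the training data, which is the transfer step of this approach.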
The functional approach (ecometrics) (Fortelius et al. 2002) works as follows. For each locality average traits can be computed over occurring species. This way the input space of the present day data becomes comparable to the input space of the fossil data. Any traditional machine learning model can be fit on the new input space. The approach is summarized in Algorithm 3.
This computational task setting closely relates to multiple instance learning (Zhou et al. 2012), where a bag of instances is a unit of analysis and predictive modeling. In our setting the units of analysis are geographic areas, which in the fossil record are called localities. Each locality contains a different number of animal remains with their species identifications, and each species can be described by a set of quantitative features, as illustrated by a toy example in Fig. 3. The left panel illustrates a situation of extreme concept drift, where there is no overlap between the present day species and the fossil species. In such a case, instead of using species occurrence as input features directly, one can use traits as proxies, which can be measured for any species. A trait can be, for instance, body mass, number of legs, or height of teeth, as specified in Algorithm 3.
For example, if at one locality we have a giraffe and a zebra, we can compute the average length of a neck over these two occurring animals. If at another locality we have a hippo and an elephant, we can compute the average length of a neck over those two as well. Then, instead of using animal occurrence as inputs to the models, we can use average traits. This makes localities computationally comparable even if there is no overlap in species.
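The trait-averaging step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a single numeric trait per species, made-up trait values and climate values, and a one-variable least squares line standing in for whatever model is fit in practice.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical trait table: one numeric trait per species, living or fossil.
TRAITS: Dict[str, float] = {
    "living_a": 1.0, "living_b": 3.0, "living_c": 3.0,
    "fossil_x": 2.0, "fossil_y": 3.0,
}


def locality_trait_mean(species: List[str], traits: Dict[str, float]) -> float:
    """Average trait value over the species occurring at one locality."""
    return mean([traits[s] for s in species])


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Present-day localities: (species list, observed climate value), all made up.
present = [
    (["living_a"], 10.0),
    (["living_a", "living_b"], 20.0),
    (["living_b", "living_c"], 30.0),
]
x_train = [locality_trait_mean(species, TRAITS) for species, _ in present]
y_train = [climate for _, climate in present]
slope, intercept = fit_line(x_train, y_train)

# The fossil locality shares no species with the training data, only the trait.
x_fossil = locality_trait_mean(["fossil_x", "fossil_y"], TRAITS)
prediction = slope * x_fossil + intercept
print(x_train, slope, intercept, prediction)
```

The model never sees species identities, so it can be applied to a locality whose species are all extinct, as long as the trait can be measured on them.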
The mean habitat and taxon assemblage approaches use the nearest living relative as the transfer link between the present and the past. The transfer function is not learned computationally, but inferred from domain knowledge. The ecometric approach, in contrast, learns the transfer function computationally, using extra features that describe species in the fossil record and in the present day in a common feature space. The features for the taxon assemblage approach are binary, indicating the presence or absence of a particular species at a particular site. The features for the ecometric approach are numeric, indicating average functional characteristics of the species found at each site, for example, their body mass.
In this study we use eight features of mammalian teeth, described by Zliobaite et al. (2016): relative height of molar teeth, relative length of molar teeth, the number of longitudinal cutting edges, presence of sharp edges, presence of blunt edges, flatness, presence of rounded structural elements and presence of cement. Dental characteristics of plant-eating mammals are reflective of their environments, because teeth act as an interface for obtaining energy from the environment. Different types of plant food require different properties of teeth, and different types of plants grow in different climatic conditions. Therefore, plants provide a functional link between plant-eating mammals and climatic conditions. Plant types do not need to be known; they work as hidden variables in this predictive modeling. In paleontology this approach is called dental ecometrics (Vermillion et al. 2017).