Introduction

Background

Knowing and understanding the organisms around us has always been important for mankind and thus describing and comparing phenotypes has a long tradition that goes beyond the emergence of academic disciplines (e.g., Pruvost et al. 2011). The phenotype of an organism refers to its observable constituents, properties, and relations. In mammalogy, morphological* and anatomical* data describing the body plan based on skeletal and visceral traits usually make up the largest part of phenotype descriptions. But features associated with physiology, behaviour, ecology, or lifestyle traits are also important to characterize intra- and interspecific differences and hence to describe biodiversity. Depending on preservation, the same traits can be studied in extinct species also via fossil remains. The phenotype of organisms and species can be considered to result from the interaction of the organism’s genome with itself and its environment. Consequently, the era of genomics provides the basis to identify genomic loci that are associated with the variety of phenotypic traits. To understand genomic bases of phenotypic diversity is not only a challenge to the field of genomics, but also to the scientific disciplines of organismic biology. To support this, a short summary of concepts underlying the discovery of genomic loci associated with phenotypic traits is given below.

Pioneering work that enabled first insights into links between genome and phenotype relied on model organisms. This required studying the molecular and phenotypic features of single species such as the fruit fly (Drosophila melanogaster), the zebrafish (Danio rerio) or the mouse (Mus musculus). These models provided decisive insights into the genes behind basic developmental processes, including organ function and morphogenesis (Meunier 2012). Translating developmental processes from model to a limited number of non-model organisms opened the field of evolutionary developmental biology (Evo-Devo) and explained the molecular basis of processes such as body plan evolution. Criteria and limitations in the choice of model organisms for Evo-Devo studies were discussed by Milinkovitch and Tzika (2007). However, there are limits to what model organisms can tell us (Bolker 2012). Insights from experiments on a limited number of model organisms are restricted to the phenotypes present in those particular species. For example, rodents such as mice do not have canine teeth, making the mouse an inappropriate model to study the molecular mechanisms associated with these teeth. Furthermore, even if model organism research revealed all genes necessary to develop a given phenotype (e.g., the digestive system), it would still remain unknown which of these genes played a significant role in evolution and caused specific phenotypic differences between species (e.g., adaptation to particular diets).

Given these limitations, novel approaches to explore genome–phenotype relationships were developed that exploit the increasing number of fully sequenced genomes. With the improvement of sequencing technologies, sequencing and assembly of whole genomes became possible; the first was published in 1995 (that of the bacterium Haemophilus influenzae, Fleischmann et al. 1995), whereas the mouse genome was published “only” in 2002 (Waterston et al. 2002). Due to advancements in high-throughput DNA sequencing, there is an increasing number of species for which sequenced nuclear genomes are available (e.g., Genome 10K Community of Scientists 2009; Teeling et al. 2018; Feng et al. 2020; Zoonomia Consortium 2020). This wealth of genomes provides a basis for comparative genomics, “defined as the comparison of biological information derived from whole-genome sequences”, which as a discipline and methodology thus only started in 1995 (de Crécy-Lagard and Hanson 2018). While comparative genomics often aims at identifying genomic elements that are conserved across species and thus likely have an evolutionarily important function (Nobrega and Pennacchio 2004), it can also be used to detect differences in functional genomic elements and associate them with phenotypic differences of species of interest. For example, targeted analyses of genes associated with the formation of dentin (DSPP) and enamel (AMTN, AMBN, ENAM, AMELX, MMP20) across Mammalia and Sauropsida (including Aves, Crocodylia, Testudines, Squamata) showed an association between the loss of these genes and the loss of teeth (Meredith et al. 2009, 2013). Another example is the loss of chitinase genes (CHIAs), which encode enzymes that digest chitin; these losses preferentially occurred in mammalian species that have non-insectivorous diets (Emerling 2018).

The studies cited above exemplify that applying comparative genomics to identify links between genome and phenotype requires the systematic and comparative assessment of phenotypes for many non-model organisms. The same is true for recent advances in comparative genomics, which follow the idea that convergent phenotypic evolution can be associated with convergent genomic changes, e.g., gene loss (Lamichhaney et al. 2019). This assumption is one conceptual foundation of the general Forward Genomics approach, which performs an unbiased screen for genomic changes that are associated with convergent phenotypic traits (Hiller et al. 2012; Prudent et al. 2016). This approach employs phenotype matrices and genome alignments to search for associations between convergent phenotypic traits and genomic signatures. Forward Genomics primarily delivers candidate genes or candidate genomic signatures, and their causal relationship to the phenotype of interest needs to be inferred from independent studies. This may require experimental work on gene function, e.g., using model organisms or model systems such as cell culture. But the function of candidate genes may also be known from other studies in the scientific literature. In this way, Forward Genomics identified new links between genomic changes in genes as well as regulatory elements and various phenotypic changes such as adaptations to fully aquatic lifestyles in cetaceans and manatees (Sharma et al. 2018a), echolocation in bats and toothed whales (Lee et al. 2018), reductions and losses of the mammalian vomeronasal system (Hecker et al. 2019a), the evolution of body armour in pangolins and armadillos (Sharma et al. 2018a), the absence of testicular descent (Sharma et al. 2018b), and the reduction of eye sight in subterranean mammals (Roscito et al. 2018; Langer et al. 2018).
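To make the idea of screening for associations between a convergent trait and gene loss more concrete, the following Python sketch looks for genes whose intact/lost pattern exactly mirrors a presence/absence trait. It is a deliberately simplified illustration and not the published Forward Genomics method, which works on genome alignments; the function name `perfect_match_genes`, the species labels and the gene names are all hypothetical placeholders.

```python
def perfect_match_genes(trait, gene_intact):
    """Return genes whose intact/lost pattern mirrors a binary trait.

    trait:       species -> 1 (trait present) or 0 (trait absent)
    gene_intact: gene -> {species -> True (intact) or False (lost)}
    """
    hits = []
    for gene, status in gene_intact.items():
        # Only compare species for which both trait and gene status are known.
        shared = [sp for sp in trait if sp in status]
        # A hit requires the gene to be intact exactly where the trait is present.
        if shared and all(status[sp] == (trait[sp] == 1) for sp in shared):
            hits.append(gene)
    return hits


# Hypothetical toy data: sp1..sp4, geneA and geneB are placeholders.
trait = {"sp1": 1, "sp2": 0, "sp3": 1, "sp4": 0}
gene_intact = {
    "geneA": {"sp1": True, "sp2": False, "sp3": True, "sp4": False},
    "geneB": {"sp1": True, "sp2": False, "sp3": True, "sp4": True},
}
candidates = perfect_match_genes(trait, gene_intact)
print(candidates)
```

In this toy input only `geneA` is lost in every species lacking the trait and intact in every species showing it. As stated in the text, such a hit would be a candidate only; a causal relationship would have to be inferred from independent studies.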

Development of MaTrics

As demonstrated above, novel approaches in comparative genomics (including Forward Genomics) have proven their potential to link phenotypic differences between mammals to differences in their genomes. But these novel methods also depend on phenotype information for the species of interest. Consequently, it would be advantageous for this emerging field to draw on phenotype knowledge made digitally available in fully referenced data repositories. Such a repository should not only be a compendium of phenotype information on model and non-model species but should also present this information in a discretized form, e.g., using a numeric code to label distinct phenotype categories. This is because currently available methods and approaches in comparative genomics (including Forward Genomics) cannot handle continuous data. Instead, they are best suited to explore genomic signatures underlying discrete traits such as the presence or absence of a structure or a trait (see examples and citations above). However, in contrast to genomic data, phenotypic data are not readily available in a digitized form that can be used by computer programs, not even for well-characterized species such as mammals with sequenced genomes. Research in zoology and related fields has assembled a rich body of phenotypic knowledge. But the information assembled over centuries is usually documented in natural language, i.e., in texts that are unstructured from the perspective of computer programs, and so the information is not machine actionable* (Vogt et al. 2010). Although this form of documentation is thorough, has proven its worth and will continue to be used effectively in zoology and related fields, it is of limited use for other disciplines. This is because a substantial time investment would be required to search for and extract relevant phenotypic data from published descriptions. As a result, this important cultural and scientific heritage is underutilized in scientific fields such as genomics.

Here we address the need for digitally available trait information by creating a phenotypic character matrix that summarizes the knowledge of organismic biology but meets the specific requirements of genomics. A central feature of such a data repository should be a data matrix that presents comprehensive information on many traits, with rows representing species and columns representing traits.
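The row/column layout described above can be sketched in a few lines of Python. This is a minimal illustration only: the species, traits and codes form a hypothetical toy matrix and are not entries copied from MaTrics, and the helper `trait_column` is an assumed name.

```python
# Hypothetical toy matrix: rows are species (OTUs), columns are traits.
# Codes follow the absent (0) / present (1) convention; values are illustrative.
TRAITS = ["fully aquatic", "body armor in the form of scales"]

matrix = {
    "Mus musculus":       [0, 0],
    "Tursiops truncatus": [1, 0],
    "Manis javanica":     [0, 1],
}


def trait_column(matrix, traits, trait):
    """Return one column of the matrix as a species -> code mapping."""
    idx = traits.index(trait)
    return {species: row[idx] for species, row in matrix.items()}


aquatic = trait_column(matrix, TRAITS, "fully aquatic")
print(aquatic)
```

Keeping species as rows means that a single trait column can be handed directly to a screen that compares the trait against a genomic signature across the same species.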

Constructing a comprehensive phenotype matrix poses several challenges. While “simple” phenotypes can be compiled relatively easily across several mammals, more complex phenotypes require researchers experienced in morphology, anatomy, physiology, veterinary science or related fields. This is because interpreting the collected information on phenotypes requires specialized knowledge of the terminology and the taxon of interest. For example, the exact meaning of specialized terms might depend on the described taxon, the author, and the time of publication. Additionally, some terms might refer to spatio-structural properties, others to common function or presumed common evolutionary origin, or to a mixture of these. All this is well understandable to experts, but difficult for non-experts. This also holds for information on phenotypes provided by matrices associated and published with phylogenetic (cladistic) studies (e.g. Horovitz and Sánchez‐Villagra 2003). They represent a valuable source of information in organismic biology. But their use also requires expert knowledge, particularly if information from different independent matrices has to be combined.

To make phenotypic information better understandable and retrievable, a Mammalian Phenotype Ontology (MPO) was developed (Smith et al. 2003; http://www.informatics.jax.org/searches/MP_form.shtml?). However, MPO is focused on “annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease” (Smith et al. 2005). Washington et al. (2009) and Haendel et al. (2015) extended the ontology-based notation of phenotypes of human diseases to link them to animal models. So, working with information on phenotypes of mammals across all orders is still difficult for non-experts and even more so for computer algorithms. Thus, integrating the information on phenotypes in a machine actionable form with other sources of data becomes exceedingly challenging and time-consuming (Lamichhaney et al. 2019; Vogt 2019). For integrative research, a way is sought to allow exploiting this knowledge without involving experts in each project.

To improve the accessibility and usability of expert knowledge (and to take full advantage of it), more and more information, such as current journals or even older and classic books, is being digitized, stored, and made accessible online (e.g., Biodiversity Heritage Library). There are currently several online databases that allow for storing, editing or publishing information on phenotypes (mainly morphological ones) covering various taxa. Some examples are given in Table 1, which is not an exhaustive overview of all current efforts in that direction. Each of these databases has its own research purpose and relevance. Many gather state-coded morphological traits for a selection of taxa in order to perform phylogenetic analyses (e.g., Morphobank; see Table 1). Even though encoded characters are available in these cases, they are matrices from individual projects and are not combined in one extendable matrix with cross-linked references, specimens, pictures or other information. Some databases provide illustrations of anatomical structures, but neither give detailed descriptions nor encode characters (e.g., Digimorph; see Table 1). Thus, most existing matrices with information on phenotypic traits do not fulfil all requirements of Forward Genomics (or other comparative genomics methods). Washington et al. (2009) and Haendel et al. (2015), however, proposed ways to create matrices that function as an interface between phenotype and genotype, but they focused on human diseases only. With MaTrics, we created a machine actionable dataset about phenotypic traits of mammals specifically tailored for comparative genomics research. The focus of phenotypic traits recorded in MaTrics is on convergent traits, coded without a distinction between apomorphies and plesiomorphies.

Table 1 Examples of data repositories in which phenotypic data of different vertebrate taxa are collected.

Our goal is to create a knowledge pool on all “phenomes*” that represents the actual mammalian diversity. As a first step, the current MaTrics version intends to gather phenotypic information for taxa with well-aligned sequenced genomes in a machine actionable way to simplify and enhance the use of Forward Genomics.

While inference of putative homologies in genomic data, resulting in nucleotide sequence alignments, is fully automated, analyses of homology or more precisely 'comparative homology' (Vogt 2017) of phenotypic data cannot be executed by computer algorithms so far. This is irrespective of the type of basic information available (e.g., digitized literature, 2D/3D scans of museum specimens). Establishing comparative homology requires the identification of units of comparison across different OTUs, resulting in a phylogenetic character matrix. When creating matrices usable to link phenotypic differences between species to genomic loci one must first identify the phenotypic units that can be compared across the OTUs (identification of comparative homologies) prior to coding the MaTrics.

Design and coding* principles of MaTrics

MaTrics (version 1.0, released in January 2021; https://www.morphdbase.de/?MaTrics-Mx-v1) is implemented in the online data repository* Morph·D·Base (MDB, www.morphdbase.de, Grobe and Vogt 2009) and publicly available to anybody who registers. MDB is a state-of-the-art data repository to document phenotypic data and its matrix module is best suited to host MaTrics.

Principles and data entry

MaTrics meets all requirements of Forward Genomics. We primarily focused on mammalian species for which genome sequences are available. Some basic principles of MaTrics are described herein; a detailed user’s guide is available online (Wagner et al. 2021).

According to Sereno (2007), a (phenotypic) trait of an operational taxonomic unit (OTU; here the respective mammalian species) can be represented in a character statement that is composed of two parts, character and statement, and can be divided into four types of logical components (Sereno 2007: Table 4): one or more locators, a variable, and a variable qualifier as parts of the character, and a character state as the statement. Not all of these components are needed in every case, but a locator and a character state are the minimum (representing character and statement). Thus, each character consists of at least one locator (L – the morphological structure, the structure bearing the trait) and the statement of the character state (v – mutually exclusive condition of a character) (Fig. 1). Specifying a locator and a character state is sufficient in the case of absent-present character statements. Such character statements can easily be encoded in a discretized form, i.e., coded as 0 or 1 (Fig. 1A, Table 2). Following Sereno’s (2007) coding scheme, each character in MaTrics is named with a label starting with a single locator or a sequence of locators running from Ln to L1 (the trait-bearing structure), which provide all information necessary for unambiguously identifying and locating the trait within the OTU. The sequence of locators (Ln to L1 as illustrated in Fig. 1) in the character label is hierarchically organized. For clearer organization and orientation of characters, a character category was added at the beginning of the ‘character label’ in MDB. While Sereno (2007) developed his coding scheme primarily for structural traits, we extended it here and also applied it to ecological and behavioural traits.
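The logical components of a character statement can be modelled as a small data structure. The Python sketch below is one possible representation under stated assumptions: the class name, the comma-separated label layout, and the example locators and category are hypothetical and do not reproduce the actual label format used in MDB.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CharacterStatement:
    locators: Tuple[str, ...]          # ordered from Ln down to L1
    state: str                         # character state (the statement)
    variable: Optional[str] = None     # V, optional
    qualifier: Optional[str] = None    # q, optional
    category: Optional[str] = None     # character category placed first in the label

    def __post_init__(self):
        # A locator and a character state are the minimum components.
        if not self.locators:
            raise ValueError("at least one locator is required")
        if not self.state:
            raise ValueError("a character state is required")

    def label(self) -> str:
        """Build a label: category, then locators Ln..L1, then V and q.

        The separator and ordering of V and q are assumptions for illustration.
        """
        parts = [self.category] if self.category else []
        parts.extend(self.locators)
        parts.extend(p for p in (self.variable, self.qualifier) if p)
        return ", ".join(parts)


# Hypothetical absent/present example loosely based on the jugal bone.
jugal = CharacterStatement(
    locators=("cranium", "os jugale"),
    state="present",
    category="skeleton",
)
print(jugal.label())
```

An absent/present character needs only locators and a state, so `variable` and `qualifier` default to `None`; a multistate character would fill them in.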

Fig. 1

Schematic illustration showing how phenotypic traits are reflected in character statements and in the character labels (shaded in grey) in MaTrics. The basic nomenclature is based on Sereno (2007: table 4, scheme 3). Top: structure for characters which can be described with only two character states (absent and present) exemplified by the jugal bone. Bottom: structure for characters which require more than two character states (multistate characters) exemplified by the length of the canine tooth in relation to the tooth row. Sereno’s (2007) terminology recognizes character statements (CS) consisting of characters (C) and statements (S). The character is represented by locators (Ln, … L1; hierarchically organized) and optionally the variable (V) and the variable qualifier (q). The different expressions of the variable are given as character states (v0, … vn) representing the statement.

Table 2 Numeric code options for (A) absent/present and (B) multistate characters in MaTrics.

In case a phenotypic trait may have several different expressions or patterns, it must be coded as a ‘multistate’ character. Such a character needs a variable qualifier (q) (Fig. 1, Table 2). The character states of a multistate character in MaTrics are coded as discrete states using integers 2 to n. For example, the height of the mandibular canine teeth in relation to the (averaged) occlusal height of the cheek teeth is coded as low (2), occlusal height (3) or high (4) (Fig. 1). Absence is recorded alongside other states, which is a shortcoming discussed by Sereno (2007). However, the state ‘absent’ is needed for the application of Forward Genomics. Multistate characters in MaTrics can be described by different states or specifications. These states may be ordered based on their nature (for example, the character clavicula pattern, which might be absent, reduced or fully developed), metric (e.g., the number of inferior or superior incisors), or nominal (e.g., the shape of the anterior nasals). The definition of each state is given by the author who included the character in MaTrics for a certain purpose. Another person might define and use different states for another purpose. Up to ten states per character can be entered in MaTrics, so that the state definitions can be adjusted to fit other purposes.
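The numeric coding of a multistate character can be sketched as follows. The Python snippet assigns integers from 2 upwards to the named states and, as an assumption not confirmed by the text shown here, reserves 0 for ‘absent’; whether the limit of ten states counts ‘absent’ is likewise assumed. The function name `multistate_codes` is hypothetical.

```python
def multistate_codes(states, include_absent=True, max_states=10):
    """Map state names to integer codes 2..n, optionally adding absent = 0.

    Assumptions: 'absent' is coded 0, and the ten-state limit applies to the
    named states passed in.
    """
    if len(states) > max_states:
        raise ValueError(f"at most {max_states} states per character")
    codes = {name: code for code, name in enumerate(states, start=2)}
    if include_absent:
        codes = {"absent": 0, **codes}
    return codes


# Example from the text: height of the mandibular canine teeth.
canine_height = multistate_codes(["low", "occlusal height", "high"])
print(canine_height)
```

Because the mapping is generated from an ordered list of state names, a user who defines different states for another purpose only has to pass a different list.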

A key consideration when generating MaTrics was to clearly document the source(s) of evidence for each phenotypic entry. The character part of each character statement possesses a short textual definition that is extracted from published sources (e.g. journals, text books, online references); it includes references to relevant ontology terms from various biomedical ontologies. The following online resources were used for the identification of adequate terms: Ontology Lookup Service, OLS, https://www.ebi.ac.uk/ols/index, Jupp et al. (2015); Ontobee, https://www.ontobee.org, Xiang et al. (2011); Bioportal, https://bioportal.bioontology.org, Musen et al. (2012). If no adequate definition was available, we provided a definition and clearly marked it as such.

Phenotypic traits coded in MaTrics represent by default adult states.

The dimensions of MaTrics are defined by the number of rows (OTUs) and columns (characters) that result in a specific number of cells (rows × columns). These cells primarily contain the character states. Morph·D·Base enables the addition of further information such as references, photos, illustrations, or museum specimen IDs to each matrix cell. All recorded character states, and thus each coded cell of MaTrics, are linked to at least one supporting reference. This refers either to citations from the literature (e.g., published journal articles, books, reliable scientific online resources) or to primary data sources. These data sources can cover IDs of museum specimens or media (e.g., photographs, images taken by microscopy, electron microscopy (TEM and SEM), magnetic resonance tomography (MRT), micro computed tomography (µCT), or synchrotron), which can be uploaded to MDB; larger datasets can be linked to MDB. As a result, researchers using MaTrics can trace the information to at least one original source. This not only makes data entries revisable but also offers the opportunity to re-analyse phenotypes post hoc, for instance based on user-defined categories or even on raw data sets, e.g., continuous data sets.
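The rule that every recorded character state is backed by at least one source can be expressed as a simple consistency check. The Python sketch below is an illustration only and does not reflect how Morph·D·Base stores its data; the class `MatrixCell`, the field names, and the species, character and reference labels are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatrixCell:
    state: Optional[int]                                   # None: cell still "missing"
    literature: List[str] = field(default_factory=list)    # citations
    primary_sources: List[str] = field(default_factory=list)  # specimen IDs, media


def unreferenced(cells):
    """Return keys of coded cells that lack any supporting source."""
    return [
        key
        for key, cell in cells.items()
        if cell.state is not None
        and not (cell.literature or cell.primary_sources)
    ]


# Hypothetical cells keyed by (OTU, character).
cells = {
    ("species A", "character 1"): MatrixCell(1, literature=["reference X"]),
    ("species A", "character 2"): MatrixCell(None),
    ("species B", "character 1"): MatrixCell(0, primary_sources=["specimen 001"]),
    ("species B", "character 2"): MatrixCell(2),
}
problems = unreferenced(cells)
print(problems)
```

Cells without a coded state are skipped, since only recorded character states need to be traceable to a source.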

The MaTrics or individual characters can be exported as a NEXUS file that provides data in a structured way and can be used as input in various software analysis tools.
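For readers unfamiliar with NEXUS, the following Python sketch writes a minimal DATA block from a small matrix. It is a generic illustration of the file format, not the exact layout produced by the MDB export; the function name `to_nexus`, the use of `?` for missing cells, and the restriction to single-digit codes are assumptions made here.

```python
def to_nexus(matrix, n_char, symbols="0 1 2 3 4"):
    """Serialize {taxon: [state codes]} as a minimal NEXUS DATA block.

    None is written as '?' (missing). Only single-digit codes are supported
    in this sketch.
    """
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={n_char};",
        f'  FORMAT DATATYPE=STANDARD MISSING=? SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    for taxon, states in matrix.items():
        if len(states) != n_char:
            raise ValueError(f"{taxon}: expected {n_char} states")
        if any(s is not None and not 0 <= s <= 9 for s in states):
            raise ValueError(f"{taxon}: only single-digit codes are supported")
        name = taxon.replace(" ", "_")   # taxon names must not contain spaces
        row = "".join("?" if s is None else str(s) for s in states)
        lines.append(f"    {name}  {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical toy matrix with one missing cell.
nexus_text = to_nexus({"Mus musculus": [0, 1, None], "Manis javanica": [1, 0, 2]}, 3)
print(nexus_text)
```

The resulting text can be saved with a `.nex` extension and opened by software that reads standard-datatype NEXUS matrices.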

Specificities of MaTrics

The primary motivation for generating MaTrics was to create a research tool to link phenotypic differences between species to differences in their genomes. With this aim we follow the “from genome to phenome” approach initiated with the Human Phenome Project (Freimer and Sabatti 2003) and also discussed by others (see Edwards and Batley 2004; Scriver 2004). This is the reason why intraspecific variation of traits such as sexual dimorphism was not considered. Character states (presence/absence; multistate) do not take character polarity into account and character dependencies were not specifically considered. Specific characters of interest were added to MaTrics for each research question, under certain considerations. Similarly, characters can be selected individually and retrieved from MaTrics for use in different projects. Character dependencies can be avoided or reduced in this way, if needed.

Current status: MaTrics (version 1.0, release January 2021)

To date, MaTrics contains 231 characters for 147 mammalian species, resulting in a total of 33,957 matrix cells. The mammalian species considered in MaTrics include two representatives of Monotremata, five of Marsupialia and 140 of placental mammals (Supplementary Material Table S1). The number of species from each major clade of mammals represents neither the respective diversity nor the morphological disparity of the respective trait. This is because the primary criterion for inclusion in MaTrics was the availability and suitable quality of whole genomes when taxa were selected in 2016. A majority of the characters, 186 out of 231 (80.52%), are coded as absent-present characters and the remaining 45 (19.48%) are multistate characters. The characters in MaTrics cover structural, ecological, ethological, and physiological phenotypic traits (Table 3). All refer to the adult stage. For three characters (os jugale; fully aquatic; body armor in the form of scales), the recording is 100%, so all cells for these characters contain coded and referenced character states. Some traits were specifically included for study in subsets of the listed mammals; for these characters the recording is therefore purposely less complete and more cells are still filled with “missing”. For the overall coding status see Supplementary Material Table S2.
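The figures reported above are mutually consistent, which a short Python check makes explicit. The variable names are chosen here for illustration; the numbers are those given in the text.

```python
n_characters = 231
n_species = 2 + 5 + 140          # Monotremata + Marsupialia + placental mammals

n_cells = n_characters * n_species

n_binary = 186                   # absent-present characters
n_multistate = n_characters - n_binary

pct_binary = round(100 * n_binary / n_characters, 2)
pct_multistate = round(100 * n_multistate / n_characters, 2)

print(n_species, n_cells, n_multistate, pct_binary, pct_multistate)
```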

Table 3 Gross categories of 231 characters included in MaTrics and number of characters in these categories

Notes on application

The primary motivation for creating MaTrics was to provide fully referenced phenotypic information for applications in comparative genomics, especially the Forward Genomics approach. The creation and filling of MaTrics and studies applying Forward Genomics were developed in parallel within the mentioned project. So, some phenotypes recorded in MaTrics were successfully used in earlier studies and simpler shorter tables, e.g., by Sharma et al. (2018a) who identified various convergent gene losses associated with some specific convergent mammalian phenotypes. They showed convincingly that tooth and enamel loss are associated with the loss of ACP4 (a gene that is associated with the enamel disorder amelogenesis imperfecta) and that the presence of scales is associated with the loss of the gene DDB2 (which detects substances resulting from UV light and helps to induce DNA repair). The fully aquatic lifestyle is associated with the loss of MMP12, a gene associated with breathing adaptation. The documented loss of these genes in some mammalian species is functionally explainable either as a consequence of trait loss (the genes ACP4 and DDB2 have no function after trait loss) or as putative adaptive genomic alteration, causing novel phenotypes (MMP12-loss is associated with novel lung functions in aquatic mammals) (Sharma et al. 2018a). Such results might help to better understand some related human diseases, as for example in the case of DDB2 whose mutations cause xeroderma pigmentosum which manifests in hypersensitivity to sunlight (Rapić-Otrin et al. 2003).

Another study investigated the gene losses associated with the reduction of the vomeronasal system (VNS) in several mammals. A genomic comparison of 115 mammalian genomes confirmed that Trpc2 is an indicator for the functionality of the VNS (Hecker et al. 2019a). Moreover, it indicated a loss of functionality of the VNS in seals (Phocidae) and otters (Lutrinae). Morphological data are scarce for seals and there are no data for otters (Hecker et al. 2019a; Zhang and Nikaido 2020) and, therefore, we will proceed to test the accuracy of the suggested predictability. This study on the VNS is an example for testing genotype–phenotype associations in non-model organisms and shows the potential of the combination of comparative morphological and genomic approaches.

Nevertheless, the relevance of MaTrics is by no means restricted to the Forward Genomics approach. Characters were also included in MaTrics for use in a concurrent study exploring evolutionary conditions associated with the loss of genes related to the convergent evolution of herbivorous and carnivorous diets in mammals (Hecker et al. 2019b). This study included 52 placental species and suggests that the lipase inhibitor gene PNLIPRP1 is preferentially lost in herbivores, whereas the xenobiotic receptor NR1I3 is preferentially lost in carnivores. Even though the authors put forward hypotheses, the lack of accessible data on mammalian diet preferences made it difficult to test whether gene losses are associated with dietary fat content and diet-related toxins. Investigating whether convergent gene loss is associated with similar dietary preferences may additionally hold information on whether gene losses might be adaptive (Albalat and Cañestro 2016). Consequently, an ongoing study records dietary categories in MaTrics that allow a semi-quantitative encoding of dietary fat content (associated with PNLIPRP1) and diet-related toxins (associated with NR1I3) (Wagner et al. accepted). This study will test whether the convergent loss of both genes is associated with the convergent evolutionary change of dietary preferences, i.e., the consumption of a diet with reduced fat and toxin contents.

Future analyses using MaTrics have the potential to test how gene losses and dietary composition are related to the presence/absence of structures or organs associated with digestive processes. Even further, it allows investigating whether evolutionary changes in diet composition are not only associated with the loss / presence of single molecules (e.g., lipase inhibitor, xenobiotic receptor), but also with changes in complex structures and their associated genes.

The two studies by Hecker et al. (2019b) and Wagner et al. (accepted) mentioned above show how genomic and morphological studies are entangled: current knowledge of morphology serves as the basis for creating phenotypic trait matrices like MaTrics, which in turn form the basis of genomic research, especially the Forward Genomics approach. Hypotheses associated with findings of candidate loci may then inspire further morphological research.

The most obvious applications are morphological studies. Although mammal dentitions are well studied and a lot is known about tooth number, form, and shape, in particular in relation to dietary specialization (see Thenius 1989; Hillson 2005; Ungar 2010), we still have many knowledge gaps, e.g., concerning functional adaptations and evolutionary transformations. Thus, Solé and Ladevèze (2017) aimed to put forward new ideas on how the basic mammalian tribosphenic molar was transformed to sectorial teeth in hypercarnivorous mammals. The study only included carnivores as defined by flesh-eating and the presence of carnassial teeth: representatives of the living Carnivoramorpha (including the extinct Nimravidae) and Dasyuromorphia, as well as of the extinct Sparassodonta, Oxyaenodonta, and Hyaenodonta. Comparing the cusp pattern/morphology of the upper and lower molars of these species, Solé and Ladevèze (2017: fig. 4) derived a scheme for the morphological evolution of the sectorial teeth in hypercarnivorous mammals. They also aimed at providing new arguments to discuss the developmental aspects of the evolution of hypercarnivory by associating their morphological observations with ontogenetic studies. The latter highlighted the importance of the expression of ectodysplasin A (Eda): increased levels are able to modify the number, shape, and position of cusps in mice during tooth development (Kangas et al. 2004). Further, Häärä et al. (2012:3189) showed, again in mice, that “Fgf20 is a major downstream effector of Eda and affects Eda-regulated characteristics of tooth morphogenesis, including the number, size, and shape of teeth. Fgf20 function is compensated for by other Fgfs”.
A study of hairless dog phenotypes in primary teeth (deciduous premolars and permanent molars) indicated that “the haploinsufficiency of FOXI3 leads to an incomplete development of the lingually positioned cusps in the trigon(id) and talon(id) parts of both upper and lower molars and deciduous fourth premolars, respectively” (Kupczik et al. 2017:5). The ectodermal development regulator gene Foxi3 is known to be a target of Eda and to be involved in tooth cusp development (Drögemüller et al. 2008); it suppresses epithelial differentiation (Jussila et al. 2015). Inspired by the observations and the model of Solé and Ladevèze (2017), we started a study of teeth and cusps in a subsample of Carnivora recorded in MaTrics with two aims: first, to test the suitability of MaTrics for comparative morphological studies, and second, to set the basis for genome-wide searches for genomic causes correlated with the loss of cusps. This seems promising given the development of new methods that include searches for regulatory elements (see below).

For the selected Carnivora (Supplementary Material Table S3) the absence and presence of individual tooth cusps for the fourth upper premolar (P4) and all molar teeth were recorded in MaTrics. The nomenclature of the cusps followed Thenius (1989). The detailed descriptions of cusp patterns for the species are given in the Supplementary Material document S4 and examples are illustrated in Fig. 2 and detailed in Supplementary Material Table S5. Some of our results confirmed the observations of Solé and Ladevèze (2017). So, we confirm that parastyle and protocone of the P4 are generally reduced in hypercarnivorous carnivorans. Interestingly, both structures are more reduced in the Canidae and the polar bear (Ursus maritimus) than in the members of the Felidae and Hyaenidae. Solé and Ladevèze (2017) reported that in the upper molars protocone, paraconule and metaconule are reduced in hypercarnivorous mammals which is also in line with our findings (Fig. 3D–F; Supplementary Material Table S5).

Fig. 2

Examples of the presence of cusps in the studied Carnivora in P4 as well as upper and lower molars. A The spotted hyaena Crocuta crocuta MTD B4936, B the red panda Ailurus fulgens MTD B17478, C the giant panda Ailuropoda melanoleuca ZMB_Mam_17246 and D the Weddell seal Leptonychotes weddellii MTD B5029. For each species, P4 and upper molars (insets 1, 2) as well as the lower molars (insets 3, 4) are illustrated. The teeth are photographed in lateral (insets 1, 3) and occlusal (insets 2, 4) view. Abbreviations in alphabetical order: End – entoconid, Enld – entoconulid, Hy – hypocone, Hyd – hypoconid, Hyld – hypoconulid, Me – metacone, Mec – metaconule, Med – metaconid, Mes – mesostyle, Ms – metastyle, Pa – paracone, Pac – paraconule, Pad – paraconid, Pr – protocone, Prd – protoconid and Ps – parastyle.

Fig. 3

Comparison of the presence or absence of individual teeth (upper P4-M1-M2 and lower M1-M3), of trigon/id and talon/id (upper and lower M1), and of their cusps (P4, upper and lower M1) in some species of Carnivora plotted on a phylogenetic tree based on Nyakatura and Bininda-Emonds (2012) and Agnarsson et al. (2010). A Presence/absence of individual teeth, B presence/absence of cusps on the P4, C presence/absence of the trigon and talon on the upper M1, D presence/absence of cusps on the upper M1, E presence/absence of the trigonid and talonid on the lower M1, F presence/absence of cusps on the lower M1. Abbreviations: End – entoconid, Enld – entoconulid, Hy – hypocone, Hyd – hypoconid, Hyld – hypoconulid, Me – metacone, Med – metaconid, Mec – metaconule, Mes – mesostyle, Ms – metastyle, Pa – paracone, Pac – paraconule, Pad – paraconid, Pr – protocone, Prd – protoconid and Ps – parastyle.

Solé and Ladevèze (2017) also observed that the metaconid and talonid are generally lost in hypercarnivorous mammals, especially felid-like and hyaenid-like hypercarnivores. In our study, the metaconid and talonid are completely reduced only in the Felidae (except the cheetah, Acinonyx jubatus) and the spotted hyaena (Crocuta crocuta). As in the Canidae and the striped hyaena (Hyaena hyaena), both structures are also present in Ursus maritimus. The specialized hypercarnivorous diet of several Feliformia led to an extreme reduction of the tribosphenic molar, whereas the Canidae and Ursus maritimus also eat fruits and vegetables and, therefore, need crushing structures. The presence of the protocone and talonid seems to be necessary for an omnivorous diet (Solé and Ladevèze 2017), and, based on our study, we can confirm that this is also true for herbivorous species (e.g., red panda, Ailurus fulgens; giant panda, Ailuropoda melanoleuca).

Except for the Pacific walrus (Odobenus rosmarus), at least ten specimens per species were analysed, and for several species individual deviations from the common cusp pattern were observed (Table 4). MaTrics was not designed to take intraspecific variability into account. Therefore, only the most common cusp pattern for each species was recorded. Variations of the cusp pattern can affect several cusps in the domestic dog and the brown bear (Ursus arctos), but only one cusp in the red fox (Vulpes vulpes). Such exceptions are important as they might indicate evolutionary trends. However, variations within a species cannot be reflected in MaTrics, as only one character state is attributed to a given species for each character. Only in this way can the (common) absence or presence of a trait be compared with the genome of, again, one representative of a species. Studies on the intraspecific variability of certain characters would need matrices redesigned for this purpose. There are two options for this. First, characters could be recorded in Morph·D·Base for specimens instead of species, which would result in several rows of specimens for one taxon. These could be pooled in a phylogenetic analysis by restricting the tree space searched for the best tree to trees that include clades comprising all pooled specimens of a species. Second, one could enable the recording of several character states for the same character in the matrix, thus representing the variability found across the various specimens of a given species.
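The difference between the current single-state coding and the two options can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the placeholder names (`species_a`, `cusp_x`, `spec-01`) and all values are invented for the example and do not reflect the actual Morph·D·Base schema or any observed data.

```python
from collections import Counter, defaultdict

# Hypothetical specimen-level records (option one): one row per specimen.
# Layout: (species, specimen ID, character, state); 0 = absent, 1 = present.
# All names and values are invented for illustration.
records = [
    ("species_a", "spec-01", "cusp_x", 1),
    ("species_a", "spec-02", "cusp_x", 1),
    ("species_a", "spec-03", "cusp_x", 0),
    ("species_b", "spec-01", "cusp_x", 0),
    ("species_b", "spec-02", "cusp_x", 0),
]

def pool_by_species(rows):
    """Group the specimen states by (species, character)."""
    pooled = defaultdict(list)
    for species, _specimen, character, state in rows:
        pooled[(species, character)].append(state)
    return pooled

def most_common_state(states):
    """One state per species: the most frequent one (ties resolved by first occurrence)."""
    return Counter(states).most_common(1)[0][0]

def polymorphic_states(states):
    """Option two: keep every state observed in the species."""
    return sorted(set(states))

pooled = pool_by_species(records)
single = {key: most_common_state(states) for key, states in pooled.items()}
multi = {key: polymorphic_states(states) for key, states in pooled.items()}
```

Here `single` corresponds to recording only the most common pattern, whereas `multi` keeps the deviating specimen of `species_a` visible as a second state.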

Table 4 Deviations in cusp patterns in the studied Carnivora.

Conclusion and future perspectives

Recent advances in molecular techniques have led to a rapid increase in the assembly and publication of genomes from various organisms. However, knowledge of the genome sequences is only a first step towards understanding the relationships between genomic changes, the phenotype of an organism and phenotypic differences between organisms (Hardison 2003). The systematic description of phenotypic information in matrix form, as in MaTrics, is necessary to understand the genome information and to address questions related to evolutionary biology and biomedicine. This is not restricted to mammals, as the coding principles of MaTrics, which comply with the requirements of molecular research, can serve as a template for matrices comprising trait knowledge of other vertebrate and non-vertebrate groups. The establishment of trait matrices for various taxa could lead to a broad documentation of phenotypes for applications in comparative genomics and hence enable a systematic exploration of genotype-phenotype associations.

However, trait collections such as MaTrics also reveal a tremendous research gap in phenotypic data. In fact, filling MaTrics with information on different phenotypic traits across mammals showed that detailed information on structural, physiological, or life history traits was often not available for many species, even after intensive literature research. For example, reductions of the vomeronasal system (VNS) are documented in several mammals, and our previous genomic comparison of 115 mammalian genomes uncovered several genes whose loss is associated with a reduced or non-functional VNS (Hecker et al. 2019a). This genomic screen also revealed that seals (Phocidae) and otters (Lutrinae) have lost some of these genes, indicating a reduced VNS. However, to the best of our knowledge, information concerning the vomeronasal organ of Phocidae and Lutrinae is not available. Indeed, the recording status in MaTrics for the character “vomeronasal organ” with the states absent/present is only 37%. Another example of a character that would be assumed to be well known is the absence/presence of the gall bladder (“Vesica bilaris”), with a recording status of 70%. In other words, the recording status of the characters in MaTrics demonstrates the lack of information on phenotypic traits in several species. These research gaps can only be filled by specimen-based research (e.g., Thier and Stefen 2020). Although individual studies are valuable scientific contributions, they may not suffice to close the substantial research gaps in a short time. The authors see the need for more basic zoological research complementing the systematic exploration of the genomic basis of biodiversity, i.e., research activities on biodiversity genomics could be assisted by research initiatives on biodiversity phenomics (= systematically phenotyping animals in matrices like MaTrics).
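As a rough illustration of such a figure, the recording status of a character can be computed as the share of species for which a state has been entered. The sketch below assumes exactly this definition, which is not spelled out here, and uses an invented four-species column; `None` marks a species without a recorded state.

```python
def recording_status(column):
    """Percentage of species with a recorded state (assumed definition)."""
    if not column:
        raise ValueError("the matrix column contains no species")
    recorded = sum(1 for state in column.values() if state is not None)
    return 100.0 * recorded / len(column)

# Hypothetical column for one absent/present character (0 = absent, 1 = present).
column = {"species_a": 1, "species_b": None, "species_c": 0, "species_d": None}
status = recording_status(column)
```

With two of the four placeholder species recorded, `status` is 50.0.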

Most of the genomic studies mentioned above identified protein-coding genes associated with complex body plan changes (e.g., the aquatic and aerial lifestyles of cetaceans and bats, respectively). However, evolutionary theory predicts that changes in cis-regulatory genetic elements are probably more important for morphological changes than changes in protein-coding genes. For instance, Roscito et al. (2018) stated that the loss of morphological traits is (often) associated with the decay of cis-regulatory elements. Consequently, the Forward Genomics approach has been further developed to include methodologies that can successfully associate phenotypes with the loss or presence of regulatory elements (e.g., Langer et al. 2018; Langer and Hiller 2019). In awareness of these developments, the phenotype matrix presented here already provides a number of morphological characters that will be subject to further exploration in the near future. Thus, the phenotypic information compiled in MaTrics will be of increasing importance. This applies, for instance, to the characters referring to tooth morphology and tooth cusps discussed above. In fact, tooth characters are known to be the result of a complex signalling network involving temporally graded activation and deactivation of genes controlled by regulatory elements (e.g., Jernvall and Thesleff 2000; Thesleff et al. 2001).

A last aspect to be mentioned is the way phenotypic information is documented. So far, MaTrics is still mostly filled by hand; experienced scientists have to control the content and check for homology. However, some recent developments may open the door to the partial automation of this work. One is the implementation of ontologies and semantic phenotypes in the platform Morph·D·Base. The development of a corresponding semantic description module has already been initiated (Vogt and Baum 2019; Vogt 2019). This is expected to allow the development of computer algorithms that mine data on homologous structures and thus establish matrices more automatically (Vogt 2018).

MaTrics is a new and unique data collection of phenotypic traits of mammal species. By including homologous phenotypic traits across (an increasing number of) species, MaTrics and similar matrices can serve as a basis for a variety of research fields, as illustrated herein. The recorded phenotypic traits are well defined and fully referenced (the characters as well as the character state for each species). Not only literature data are accepted for the latter, but also references to specimens in collections, which contributes in a specific way to the digitization of collection material. MaTrics data are directly useful in genomic studies since the character states are numerically coded and hence can be extracted as a NEXUS file to be machine-actionable. The scientific potential of digitized phenotype matrices is apparent and motivates thinking about future development.
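To make the notion of a numerically coded, machine-actionable matrix concrete, the following sketch renders a small species-by-character matrix as a minimal NEXUS DATA block. It is not the export routine of Morph·D·Base; the function name `to_nexus`, the placeholder taxa and the coding (single-digit states, `None` for unrecorded entries) are assumptions made for the example.

```python
def to_nexus(matrix, missing="?"):
    """Render {taxon: [state or None, ...]} as a minimal NEXUS DATA block.

    Assumes single-digit integer states; None is written as the missing symbol.
    """
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all taxa need the same number of characters")
    nchar = lengths.pop()
    # Collect the state symbols that actually occur in the matrix.
    symbols = "".join(sorted({str(s) for states in matrix.values()
                              for s in states if s is not None}))
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD MISSING={missing} SYMBOLS="{symbols}";',
        "  MATRIX",
    ]
    for taxon, states in matrix.items():
        row = "".join(missing if s is None else str(s) for s in states)
        # Unquoted NEXUS taxon names must not contain spaces.
        lines.append(f"    {taxon.replace(' ', '_')} {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"

# Hypothetical two-taxon, three-character matrix (0 = absent, 1 = present).
example = {"Taxon a": [0, 1, None], "Taxon b": [1, 1, 0]}
nexus_text = to_nexus(example)
```

The resulting text can be written to a file and read by software that accepts standard-datatype NEXUS matrices.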