An ontology approach to comparative phenomics in plants
Plant phenotype datasets include many different types of data, formats, and terms from specialized vocabularies. Because these datasets were designed for different audiences, they frequently contain language and details tailored to investigators with different research objectives and backgrounds. Although phenotype comparisons across datasets have long been possible on a small scale, comprehensive queries and analyses that span a broad set of reference species, research disciplines, and knowledge domains continue to be severely limited by the absence of a common semantic framework.
We developed a workflow to curate and standardize existing phenotype datasets for six plant species, encompassing both model species and crop plants with established genetic resources. Our effort focused on mutant phenotypes associated with genes of known sequence in Arabidopsis thaliana (L.) Heynh. (Arabidopsis), Zea mays L. subsp. mays (maize), Medicago truncatula Gaertn. (barrel medic or Medicago), Oryza sativa L. (rice), Glycine max (L.) Merr. (soybean), and Solanum lycopersicum L. (tomato). We applied the same ontologies, annotation standards, formats, and best practices across all six species, thereby ensuring that the shared dataset could be used for cross-species querying and semantic similarity analyses. Curated phenotypes were first converted into a common format using taxonomically broad ontologies such as the Plant Ontology, Gene Ontology, and Phenotype and Trait Ontology. We then compared ontology-based phenotypic descriptions with an existing classification system for plant phenotypes and evaluated our semantic similarity dataset for its ability to enhance predictions of gene families, protein functions, and shared metabolic pathways that underlie informative plant phenotypes.
The use of ontologies, annotation standards, shared formats, and best practices for cross-taxon phenotype data analyses represents a novel approach to plant phenomics that enhances the utility of model genetic organisms and can be readily applied to species with fewer genetic resources and less well-characterized genomes. In addition, these tools should enhance future efforts to explore the relationships among phenotypic similarity, gene function, and sequence similarity in plants, and to make genotype-to-phenotype predictions relevant to plant biology, crop improvement, and potentially even human health.
Maize Genetics and Genomics Database
Web Ontology Language
Phenotype and Trait Ontology
Quantitative trait locus (or Loci)
Sol Genomics Network
The arabidopsis information resource
Plant phenotypic variation constitutes the raw material for much of plant biology, including research on gene function in model species, breeding of desirable crop varieties, functional investigations from the cellular to ecosystem scale, and inference about the evolution and ecology of both plants and the species that interact with them. Disentangling the relationships among genotypes, phenotypes, and the environment is one of the grand challenges of contemporary biology , yet this endeavor is severely limited by our ability to collect, integrate, and systematically analyze phenotypic data . Researchers generally use free text to describe phenotypes, which allows for rich descriptions, but makes it hard to compare phenotypes across species, integrate data into the existing knowledge landscape, or derive information from combined datasets . In recent years, ontologies have become powerful tools for working with phenotypic data, particularly in biomedicine, because standardizing terminology across species and sub-disciplines enables inference based on logical relationships [4-6]. Here we present a new approach to studying plant phenotypes modeled on recent advances in the use of ontologies in biomedical research on animal model systems.
Throughout this paper, we use the words phenotype, phene, and phenome with precise meanings. A ‘phenotype’ is the composite set of one or more observable characteristics associated with a given organism or cell, that results from the interaction of the genotype and the environment [7,8]. The separate characteristics that make up a phenotype are termed ‘phenes’ [9,10]. For example, in maize, a dwarf phenotype can be defined as a composite of the phenes ‘reduced internode length’ and ‘compact, broad leaves’. Phenes relate to ‘phenomes’ in the way that genes relate to genomes: an organism’s or species’ phenome is composed of the complete set of its phenes. Phenomics, therefore, is the study of all phenotypes associated with an organism or species (i.e. its phenotype space). In correspondence with Genome Wide Association Studies (GWAS), Phenome Wide Association Studies (PheWAS) associate a gene with a variety of phenes or phenotypes, which is particularly relevant for genes that have a pleiotropic effect .
Biomedical scientists have developed and utilized phenotype ontologies and ontological reasoning to support comparative and predictive phenomics [12,13]. Phenotype ontologies are controlled, hierarchically-related phenotypic descriptions that enable large-scale computation among individuals, populations, and even multiple species . A number of vocabularies and pre-composed phenotype ontologies (in which terms are pre-defined) have been developed for specific taxa or applications [15-18], but comparison across datasets or among different species requires an extensive alignment process whenever different vocabularies/ontologies are used to represent the data. An alternative to phenotype ontology alignment is the use of post-composed phenotypes, in which all the elements of a phenotype are explicitly logically defined or “composed” from existing terms from species-independent ontologies . One method of post-composing a phenotype description is to first break it down into its component phenes, and then define an affected Entity (E) and a describing Quality (Q) for each phene [19,20]. In this method of post-composing phenotypes, Entity-Quality (EQ) statements are composed for all phenes under considerations, and the entire set of phenes is reasoned over simultaneously. Finally, to derive novel insights from curated genotype and phenotype data, semantic similarity measures are applied, based on a consistent ontological representation [21-23].
This approach has been applied successfully to mammalian phenotypes to predict gene function across species, as well as disease, drug, or pathway involvement of genes [5,12,13,24,25]. Two major limitations to adopting a similar approach in plants are the lack of phenotype data curated with species-neutral ontology terms, and the need for standards for creating EQ statements to describe plant phenotypes. Nonetheless, two important existing resources are available to support post-composed ontology analysis of plant phenotype data: 1) well-developed ontologies for plant science , particularly the Plant Ontology (PO)  and Gene Ontology (GO) [28,29]; 2) curated sets of mutant phenotype descriptions for multiple plant species in model-organism and crop databases such as MaizeGDB [30,31], Oryzabase , Gramene [33,34], and the Sol Genomics Network (SGN) [15,35] as well as in the literature (e.g., ). In addition, an intellectual framework for logically defining plant traits has been developed in the Plant Trait Ontology (TO) .
To push the field of plant phenomics forward, it is clear that there is a need for additional high-quality phenotype descriptions generated by research, as well as for high-confidence predictions of phenotypic associations among equivalent phenotypes, both across species and between phenotypes and their causative genotypic variants and environments. Here we describe how our work to translate existing high-quality phenotypic descriptions across six plant species enabled the prediction of phenotypic associations. Furthermore, we demonstrate that additional curation of such data into ontological representations can expand the phenotypic predictive capacity of plant sciences.
This paper includes methodology, as well as an initial dataset that was used to test and refine the methodology. In brief, we compiled EQ statements for 1,742 phenes from 2,747 genes and gene models in six plant species [Arabidopsis thaliana (Arabidopsis), Zea mays ssp. mays (maize), Medicago truncatula (barrel medic or Medicago), Oryza sativa (rice), Glycine max (soybean), and Solanum lycopersicum (tomato)] and applied consistency checks to ensure a high-quality phenotype annotation set. The annotated phenotype data set was subjected to an automated semantic similarity analysis, based on PhenomeNET [37,38], and the results are available in a separate plant instance . The semantic similarity dataset was evaluated for its ability to enhance predictions of gene families, gene functions, and shared metabolic pathways across the six species and compared to an existing classification of plant phenotypes .
Results and discussion
A method for describing phenotypes with a common semantic representation across six plant species
We include in the Results a brief description of our method, because this is the first report outlining this type of analysis of phenotypes across multiple reference species in plants. For this analysis we limited our species set to the model/crop species Arabidopsis, maize, Medicago, rice, soybean, and tomato, to take advantage of the existing data for these species. However, the method could be applied to any plant for which there are characterized mutant phenotypes associated with sequenced genes. To maximize the ability to compare both phenotypes and genotypes across species, we used only genotypes for which the sequence was known and made efforts to limit our datasets to phenotypes resulting from mutations to a single gene. For genes where phenotype information was available for different alleles, we counted each allele as a separate genotype (Additional file 1). For each species, the authors with the most relevant expertise selected free text phenotype descriptions for inclusion using methods specific to that species (see Methods). The number of genotypes analyzed varied widely among species (maximum 2,393 in Arabidopsis, minimum 30 in soybean), reflecting the availability of phenotypic descriptions for each species.
Description of applied ontologies
Plant Ontology (PO) 
Plant anatomy and morphology and development stages
Gene Ontology (GO) 
Biological processes, cellular components and molecular functions
Chemical Entities of Biological Interest ontology (ChEBI) 
Molecular entities focused on ‘small’ chemical compounds.
Plant Experimental Conditions Ontology (EO)
Treatments, growing conditions, and/or study types
NCBI taxonomy (NCBITAXON)
A curated classification and nomenclature for all of the organisms in the public sequence databases.
Relation Ontology (RO) 
Core upper-level relations and biology-specific relations
Because a phenotype consists of one or more phenes, one or more EQ statements were used to describe it. For example, the phenotype “corngrass” in maize is described as “narrow leaves, extreme tillering, highly reduced ears and tassel; grasslike growth habit, often vegetative leaves in the ear and tassel”. This phenotype was broken down into 5 phenes, which were annotated with EQ statements (Figure 1). Likewise, an individual EQ statement can be used to describe more than one phenotype. In the corngrass example, the EQ statement that describes the “narrow leaf” phene is also used in several other phenotypes, such as “narrowleaf” (Additional file 1). Our approach considers each “phenotype” as the sum of its individual EQ statements.
This method allows for highly detailed and species-independent descriptions of phenotypes, but still has several limitations. Creating accurate EQ statements requires knowledge of both the species-specific phenotypes as well as the ontologies used to describe them. Brief phenotype descriptions may be available in databases (e.g., MaizeGDB, SGN), but much more complete and accurate descriptions of mutant phenotypes are spread over many publications spanning several years. Collecting and extracting phenotype information is very labor intensive. In addition, EQ statements are created by curators and thus still reflect a certain amount of subjectivity. It is often possible to build more than one EQ statement for the same textual description, and it is not always clear when to use a process entity versus a structural entity. In this project, we established a strict set of rules and protocols and held regular meetings to help ensure consistent construction of comparable EQ statements across species (see Methods). In the future, we would like to evaluate the importance of consistently structured EQ statements for analyzing semantic similarity, and determine whether some variation can be tolerated.
An ontology-based dataset of mutant phenotypes for six reference plant species
The number of EQ statements, genes, genotypes, and phenotypes they were associated with, for six plant species
#unique EQs - all genotypes
Zea mays ssp mays
Oryza sativa L.
Quantitative analysis of pairwise semantic phenotype similarity of genotypes across the entire dataset
To determine pairwise semantic phenotype similarity scores, we used the method described for mammalian genotypes  where phenotypes are represented by EQ statements that are then integrated using species-independent ontologies and a semantic similarity measure. That is, every phenotype (which is composed of one of more EQ statements) is compared to every other phenotype, and their similarity within the ontological graph is evaluated. For a pair of phenotypes to receive a score of 1, each phenotype would have to contain the same number of identical (or nearly identical) EQ statements. A score of 0 would mean that none of the EQ statements for either phenotype were similar. For the 8,213,956 possible pairs from the 2,866 genotypes, 548,888 (7%) of the genotype pairs yielded phenotype semantic similarity scores greater than zero. Score distributions, overall and on a per-species basis, are provided in the following sub-sections.
Distribution of similarity scores
Almost half (241,042 = 44%) of the non-zero semantic similarity scores are below 0.1, indicating that many of the phenotypes show only a small overlap in their description. For example, the rice mutant DWARF4 (Os03g0227700 , allele osdwarf4-1) shows a similarity of 0.08 with the rice mutant MADS18 (Os07g0605200 ). This results from both genes being annotated with “dwarf”-related phenotypes but also possessing diverging annotations, such as increased panicle number for DWARF4 and early flowering for MADS18. It is not surprising that most of the phenotypes show at least some marginal overlap, as this is intrinsic to the aim of the study: making phenotypes comparable. This highlights the potential of the method, but, at the same time, raises the need for consistent, coherent, and complete phenotype annotations in order to computationally replicate the underlying biology and derive accurate predictions.
Although there are more complex scoring mechanisms that take frequency of EQ statements into consideration , we applied a Jaccard index that determines the overlap of phenes used in the phenotype descriptions. In an earlier study, it was shown that different types of semantic similarity measures do not differ much as long as the results are interpreted carefully . In future work, we intend to investigate the applicability of alternative scoring methods, in combination with the development of benchmark sets for evaluation purposes.
Species-specific distribution of scores
To obtain further insights into the distribution of similarity scores, we split similarity scores according to species. If both genotypes that were used to calculate the pairwise similarity score belong to the same species, we recorded the resulting similarity score only for this species. If both genotypes leading to a particular similarity score belong to different species, we recorded it as a cross-species score. The resulting seven similarity score distributions are illustrated in Figure 2B-H. Species-specific score distributions are mostly consistent with the overall score distribution (Figure 2A). There are some differences for soybean and Medicago, but this is likely due to the small sizes of the phenotype annotation sets in these species.
Differences between the semantically-generated phenotype network and a manually derived phenotype grouping
A previous analysis of Arabidopsis used the same set of phenotypes and laid much of the groundwork for this present study . Mutant phenotypes were categorized in a simplified, three-level hierarchy consisting of 4 groups divided into 11 classes and 42 subsets. Each gene was assigned to one of 11 phenotypic classes based on the developmental stage when the phenotype was first observed and what methods and conditions were used to detect it (see more details in Methods). Genes were also assigned to one or more of the 42 subsets, based on the nature of the phenotype (e.g., gametophyte defective, flowering time).
There could be several reasons for low semantic similarity scores within classes or subsets, such as annotations that are not ideally defined, a poor choice of semantic scoring mechanism, or classes/subsets that are too broadly defined and therefore contain a large variety of phenotypes. In general, we expected pairs of genes within the same subset to have lower semantic similarity scores than pairs of genes within the same class, because genes can belong to multiple subsets, but only a single class. If the phenotype of a gene has multiple phenes, that gene should to belong to multiple subsets, and unless two genes share all of the same phenes (and therefore belong to all of the same subsets), they would have a relatively low similarity score within each subset. Genes in the same class may have somewhat higher similarity scores, because classes aggregate several phenotypic subsets (Figure 3). However, they do not aggregate subsets from other classes (as semantic similarity does) and thus are unlikely to completely mirror semantic similarity scores.
Examination of two contrasting subsets, Flowering time (FLT) and Pathogens/Herbivores (PTH), can help to explain some of the agreement or disagreement between membership in a class or subset and degree of semantic similarity (Figure 3). The PTH subset seems to be more coherent with respect to phenotype annotations than the other groups, which suggests that PTH genes are not documented as having pleiotropic effects. In contrast, pairs of genes in the FLT subset have low average semantic similarity, suggesting that these genes are highly pleiotropic. Consistent with this, the PTH subset genes have on average 1.68 phenes whereas genes in the FLT subset have on average 3.99 phenes.
The categorical system devised by  has the distinct advantages of being more intuitive and not requiring an understanding of ontologies to make annotations or carry out an analysis of the data. However, the disadvantages are that category boundaries are sometimes somewhat arbitrary, very disparate phenotypes may be included in a single category (e.g., miscellaneous categories), and each phenotype may be forced into a single class. Although the class/subset classification can capture pleiotropic phenotypes, it does not provide a way to compare pleiotropic phenotypes of multiple genes the way semantic similarity scores based on collections of EQ statements does. In contrast, the ontology approach allows the grouping of phenotypes at any level of the ontology that may be appropriate for a particular analysis, while still allowing each observation (phene) to be separately annotated.
Semantic similarity predicts participation in shared metabolic and regulatory pathways
It is a premise of this work that through computational analysis of EQ statements representing phenotypes, biological processes can be recapitulated, modeled, and even discovered. Were this to be true, one would expect, for example, that gene products in the same metabolic pathways would be annotated with EQ statements that are highly similar. To test this hypothesis, we used the PlantCyc project databases AraCyc (v 11.5) , Oryzacyc (v 1.0), SoyCyc (v 4.0) and CornCyc (v 4.0) as well as LycoCyc from SGN (v 3.3 Solanum lycopersicum) , and MedicCyc from the Noble Foundation . One metabolic pathway that is well populated among those databases and for which our phenotype datasets have representation is the phenylpropanoid biosynthesis initial reactions of flavonoid biosynthesis. 3-hydroxy flavonoids, also called anthocyanins, are pigments. They serve to, e.g., attract pollinators and protect plants from UV-B damage [49,50].
c2 GRMZM2G422750 similarity score 1 (identity: this is the query)
c1 GRMZM2G005066 similarity score 1
r1 GRMZM5G822829 similarity score 0.6666666667
b1 GRM similarity score 0.5
All three of the identified gene models are involved in the anthocyanin pathway of maize, which controls flavonoid synthesis (reviewed in ). More specifically, the gene products of the c1, r1, and b1 loci activate genes in the anthocyanin pathway. This result: (1) indicates that reasoning across curated phenotypes in plants is capable of creating result sets that recapitulate well-characterized biological phenomena, (2) hints that for plant species that are not genetically well-characterized, the ontological reasoning approach to predicting phenotypic associations could assist in forward genetics approaches, and (3) highlights the potential use of reasoning across phenotypic ontological associations to prioritize high-quality data curation where data are missing from or complementary to repositories like the PlantCyc database.
Focusing on (2) – that the suggested approach can help with characterizing understudied species – the reasoning is as follows. Consider a poorly studied species with a number of mutant phenotypes that include an altered seed color phene. The phenotypes of this species would be described and codified using ontological representations. These phenotypic descriptions then could be used as queries to return genes from a well-characterized species (e.g., maize) with phenotypes that have high similarity to the phenotype in the poorly studied species. This result set could indicate to a researcher who is not an expert in pigment biology that the flavonoid and anthocyanin biosynthetic pathways and their regulators were of interest for determining which genes were responsible for the phenotype.
Evaluation of phenotypic similarity across orthologs and gene families
Manual assessment of gene families
We were able to place 2,741 EQ-annotated genes (2,393 Arabidopsis, 30 soybean, 40 Medicago, 92 rice, 72 tomato, 114 maize) into 1,895 gene families, of which 460 families contain two or more genes annotated with EQ statements. The gene families, based on the Phytozome 10 Angiosperm-level families , generally contain both dicot and monocot representatives from the species in this study. Forty-two of the families contain between five and 12 genes with EQ statements, allowing us to assess how often homologous genes have similar functions. Further, there are 147 families that contain EQ statements from two or more species. These are of interest because it allows us to assess how often functions are conserved between orthologs.
For most families with multiple EQ-annotated genes, gene function is conserved or similar. For example, in the terpene synthase family (family 54585183, Additional files 4, 5 and 6), with 12 EQ-annotated genes from Arabidopsis, rice, and maize, all genes included aspects of “dwarf” phenotypes (quality “decreased height”, PATO:0000569). However, salient phenotypes in maize also include floral hermaphrodism, in contrast to the typical male and female floral separation in wild type domesticated maize. In the Flowering Locus T family (family 54614050, Additional files 4, 5 and 6), there are 12 EQ-annotated genes from five of our study species. All of the characterized mutant phenotypes involve floral development or photoperiod control.
We also observed gene families in which annotated phenotypes are quite different across orthologs. For example, in the family (54614050, Additional files 4, 5 and 6), a leucine-rich repeat, serine-threonine kinase family, the SUNN mutant in Medicago display extra root nodules, while the CLV1 mutant in Arabidopsis displays abnormal leaf phyllotaxy and altered floral morphology [53,54].
Plant phenomeNET: a web interface for searching the plant dataset
We adapted PhenomeNET  to provide the results of the computational analysis of the plant data sets to the broader research community in an online form. Plant PhenomeNET is available via  and provides access to the genotypes of all six species that possess at least one EQ statement. For each genotype, a detailed genotype page provides information about similarity scores to any of the other genotypes as well as a link to an additional page providing the phenotype assigned by the curator and those inferred via the ontologies. We note here that similarity scores of 0 for genotype pairs are not reported in Plant PhenomeNET.
Using plant phenomeNET -- Searching for tasselseed1
By following the first link “show phenotypes”, the user obtains the assigned (top list, panel C, Figure 4) as well as the inferred EQ statements (bottom list, panel C, Figure 4) for the ts1 gene. For example, the curator assigned an EQ statement for the “abnormality of the tassel” as part of the phenotype. One of the EQ statements that was inferred based on the ontology structure is “extra floral organs in spikelet”.
Following the second link to “explore” similar mutants, the user obtains an ordered list of phenotypically similar mutants with the most similar at the top and the least similar at the bottom (see panel D, Figure 4). Each of the mutants provided in the list can then be explored further with the links provided for navigation. One interesting case is presented in our list with the rice mutant FOR1, with the atomized statement “extra floral organs in the spikelet”. This example illustrates how important correctly assigned EQ statements are, and how, using inferred EQ statements, connections can made between mutants from different species.
After defining a common method for semantic representation of plant phenotypes, we annotated 2,741 genotypes with 2,023 unique EQ statements. This represents the first cross-species plant data set that can readily be integrated with other data via shared ontologies. This use of ontologies to support phenotypic reasoning enables integration beyond plants and would enable generalized analyses to discover phenomena conserved across all domains of life as described in . An example of such cross-domain inference is their finding that the pathways that underlie gravitropism sensing in Arabidopsis root tips are concordant with an inner ear defect in human developmental biology known as Waardberg syndrome .
In our initial computational analyses using semantic phenotype similarity scoring, we were able to identify gene sets that are functionally related, i.e. belonging to the same gene family or involved in the same pathway. The method described here can be readily applied to other plant species to suggest genes for analysis in under-studied species or crop wild relatives, or expanded to describe and compare phenotypes across diverse plant species for evolutionary analysis, as has been done for fish [17,56]. Because the ontologies used for comparisons are taxon-neutral, there are no restrictions to expanding this method to non-flowering plant models such as Physcomitrella patens, Selaginella moellendorffi, or Ceratopteris [57-59], and comparisons across widely divergent species (e.g., maize versus moss) may even reveal surprising conservation or co-option of gene function. Even for the species used in this initial study, there is much to learn about the relationships between genotype and phenotype, and we urge the continued curation and EQ-based annotation of plant phenotypes, to expand this data set and increase its utility. Because species- and clade-specific databases will need to curate and hold these statements, a standardized method for storing this data, preferably using a common database schema such as , should be implemented.
Generation of the lists of phenotypes for each plant species
Each of the model plant species represented in this analysis is supported by a database of genomic and other data. These databases are highly individualized, based on the differing needs of their community members. As such, the generation of a list of phenotypes associated with a mutant allele of a known gene was slightly different for each species.
Selection of Arabidopsis phenotypes
The Arabidopsis mutant phenotype dataset, first compiled by , includes ~2,400 genes with recessive mutant phenotypes for which the disrupted gene is known. Information for this dataset was previously gathered from: 1) a sequence-based map of genes with mutant phenotypes ; 2) the SeedGenes database of essential genes , as updated by ; 3) a list of genes associated with mutant phenotypes obtained from TAIR ; and 4) several thousand publications describing Arabidopsis mutant phenotypes retrieved from the Pubmed Database  using appropriate keywords (Arabidopsis, mutant(s), mutation(s), knockout, and null). Short, free-text phenotype descriptions found in column I of Supplemental Table S2 of  were used as the input for the Arabidopsis EQ statements. Genes with only a dominant, gain-of-function mutant phenotype  were generally excluded. Based on past work, the Arabidopsis phenotypes analyzed here are associated with sequenced genes but not with specific mutant alleles.
Selection of maize phenotypes
In the MaizeGDB database , maize phenotypes are associated with mutant alleles (variations) of genes based on a maize-specific controlled phenotype vocabulary, consisting of 1,088 phenotypes. Of the 1,088 phenotypes associated with mutant-defined loci, we removed continuous trait phenotypes (e.g., phenotypes that are attributable to quantitative trait loci or QTL), and several other types of phenotypes not likely to be relevant for this analysis, such as gel mobility of a protein on a starch gel. Of the remaining phenotypes, we selected only those associated with gene models (DNA sequences).
Selection of rice phenotypes
In order to create a list of rice mutants that were associated with known genes, data was combined from Gramene  and Oryzabase [32,67]. The Oryzabase file was quite large (about 4,800 traits/phenotypes listed, with about 1,600 of those associated with a known locus), while the Gramene list was smaller, with about 160 loci. The information from the two sets was combined and all the mutants with identifiers from both databases were cross-referenced to ensure there was no overlap or duplications. Many of the described mutants had to be eliminated from the master list as they were only described morphologically (i.e. not associated with a known locus or gene). For the remaining mutants, we combined all available phenotypic descriptions from the two sources.
Selection of soybean and Medicago phenotypes
Curated lists of phenotypes for these species are not available in public databases. Thus, in order to create lists of mutant genes in soybean and Medicago for this study, the primary literature was searched for phenotypes and their descriptions.
Selection of tomato phenotypes
Tomato loci with a known phenotype were selected from the Sol Genomics Network database (SGN) [15,35]. Phenotypes are associated with alleles, with some loci having multiple alleles with different phenotypes. The loci were curated manually based on previously described mutants  and literature curation of published tomato cloned genes with an associated phenotype. We included only loci with morphological or metabolic phenotypes, excluding isozyme alleles and loci that have a described phenotype but no associated gene sequence.
Quality assurance across the entire data set
In order to provide consistency across species and allow for computational analysis of the entire phenotype data set, we developed a set of rules to define how the EQ statements should be constructed, and employed manual and automated quality checks to verify compliance with the rules. Manual checks determined if the EQ statements were made in a consistent manner across species. We did find consistency in most cases; however, minor inconsistencies have a relatively small effect, as the power of using hierarchical ontologies to describe phenotypes allows similar but not identical EQ statements to have high similarity scores.
Automated quality checks computationally verified the validity of the assigned EQ statements based on our pre-defined set of rules. An example of such a rule is a requirement that entities be represented with either PO (for structural) or GO IDs (for process phenotypes) and that the type of Quality chosen from PATO must match the Entity (i.e. a structure quality for a structure entity and a process quality for a process entity). Furthermore, the automated checks ensured that valid identifiers were used for each ontology term and that each term label matched its ID, which was useful for correcting typographical errors. More details on the rules we employed are provided in Additional file 7. The automated quality assurance was an iterative process in which the errors were removed continuously as the data set expanded. The data set comprising EQ statements from all six species (Additional file 1) successfully passed the automated checking procedure.
Building a phenotype network using semantic similarities of gene pairs based on assigned EQ statements
The computational analysis relied on the representation of phenotypes as EQ statements. Each phenotype was represented as an affected entity that is further described with a quality. The application of EQ statements has been proven useful for cross-species gene function prediction, as well as pathway involvement and the identification of disease gene candidates [53,69]. As described by , species-specific phenes were decomposed into an affected Entity and Quality, and represented using species-independent ontologies. All the ontologies used here for the description of the phenotypes in any of the six species were downloaded on 15 March 2014 and converted to OWL EL. In addition to the ontologies, a set of logical definitions to connect plant structures with biological processes has been downloaded on 29 April 2013 and was also integrated with the ontologies (see Additional file 8). For further details on the applied ontologies see Table 1.
where Entities and Qualities were used as defined by the curators. Following this approach generates one integrated ontology that then can be used to infer additional phenes using reasoning over the ontology. An inferred phene is an EQ statement that is an ancestor term of the assigned EQ statement. For example, the maize mac1 (multiple archesporial cells1) gene was curated with an EQ statement named “Male and female infertility” and from the complete list of curator-assigned statements, and one additional EQ statement named “Complete sterility” was inferred.
where P_geno_A represents the phenes of genotype A and P_geno_B represents the phenes of genotype B. Applying this scoring method, phenotype semantic similarity scores fall into the range [0, 1], with 0 indicating no overlap between phenotypes and 1 indicating identical phenotypes. Calculating the semantic similarity score for each possible combination of genotypes results in a 2,866 × 2,866 data matrix. Similarity scores > 0 are provided as Additional file 9. We note here that 10 EQ statements of Arabidopsis genotypes (<0.2% of total EQ statements) were excluded from the computational analysis, because they either needed further discussion among the curators due to the relations used to build the entity or include a term that was removed from the ontology in the period between curation and the computational analysis.
This matrix constituting a genotype network based on phenotype similarities was 1) compared to an existing, manually created phenotype-specific grouping of genes , and 2) used to assess gene function (see following sections and Results and Discussion). We note here that this scoring is highly dependent on the assigned EQ statements and that the annotations assigned to date are as complete as can be derived from existing findings. This means that for phenes that have not been tested yet, we assume that this phene is absent. With the growth of the data set, more detail will be added to the genotypes, which in consequence will improve the accuracy of semantic phenotype similarity scores and the representation of biological processes.
Employed data and software
We downloaded all the ontologies from the OBO Foundry [70,71] or their respective download site (see Table 1), and used El Vira (version 0.2)  to transform ontologies from an OWL DL profile into an OWL EL profile. The application of OWL EL files facilitates faster reasoning over the combined ontologies and is consistent with the description of the method described for mammal data . To integrate the individual annotation files along with the respective ontologies used in annotation into a single ontology, the Brain library version 1.5.2 was used to easily modify OWL EL ontologies . All scripts required for the data analysis were implemented in Groovy (version 2.0.4) . A copy of PhenomeNET was set up to hold the results of the computational analysis, which were uploaded using the PhenomeNET database scheme. Plant PhenomeNET is accessible from .
Comparison of semantic similarity and an existing classification of plant phenotypes
For intraspecific comparison of Arabidopsis phenotypes, we used Table S2 from . In this previous work, genes were sorted into a three-tiered hierarchy of phenotypes of groups, class, and subsets. Their classification system was designed for the specific purpose of defining the set of essential genes for an organism, and for this purpose it was not necessary to differentiate among phenotypes of different mutant alleles of the same gene. Genes were placed into a single group and class, prioritized by developmental stage when phenotypes are first observed and what methods and conditions are used to detect them. The lowest rank included phenotypes where detection required a biochemical assay or microscopic examination. When the phenotype of a weak allele was more informative or better characterized than the phenotype of a null allele, the assignment was made on the basis of the better-known phenotype (e.g., fy - null is emb lethal but known as flowering time gene). Genes were also assigned to one or more of 42 phenotypic subsets, such as shoot architecture, flowering time, miscellaneous seed defects, and temperature.
To carry out a comparison of the previous results to the present work, we rearranged the dataset from  so that each unique gene/subset combination was on a single row. Because genes could belong to multiple subsets, there were multiple rows per gene. We removed data for 82 genes that were in  but not included in the present study. We calculated average semantic similarity of the classes and subsets as the average of all pairs of genes where both genes were in the same class or subset.
Pathway assessment based on phenotype network
The BioCyc databases for Arabidopsis (AraCyc version 11.5), maize (CornCyc version 4.0), rice (OryzaCyc version 1.0), and soybean (SoyCyc version 4.0) were downloaded from Plant Metabolic Network [45,75]. The database for tomato (LycoCyc version 3.3) was downloaded from the Sol Genomics Network [15,35], and the database for Medicago (MedicCyc version 2.0) was requested from and provided by The Samuel Roberts Noble Foundation [47,76].
To identify well-populated pathways across all six species, we divided the number of pathway steps catalyzed by a gene product for which a phenotype was included in our dataset by the average number of reactions in the pathway across the species examined (e.g., number of steps with a curated phenotype divided by number of total steps in the pathway). For instances where more than one gene encoded the enzyme responsible for a single step, that step was counted only once (i.e. the presence or absence of a gene encoding the enzyme was counted, not the number of genes encoding that step in a particular plant genome).
Assessment of gene families using the phenotype network
Gene families are based on the Angiosperm-level families from the Phytozome10 release , accessed on August 13, 2014, as multiple-sequence alignments for each family. These gene family alignments included peptide sequences from 43 species, and comprised 29,803 gene families. From these alignments, we calculated HMM-based alignment models using hmmbuild (HMMer package version 3.1 r4562, Eddy, 2011 ). We then searched the peptide sequences from each of the seven species discussed in this paper, along with peptide sequences from Amborella trichopoda (to serve as an outgroup in phylogenies), against the gene family HMMs, using hmmscan (maximum E-value 1e-4), and then placed each sequence into the family of the top HMM match, giving a multi-fasta file for each gene family. The resulting family files were realigned to the respective HMM using hmmalign. Prior to generating phylogenetic trees, the resulting alignments were trimmed of non-aligning residues (as lower case characters in the output of hmmalign, indicating non-match-state residues in the HMM alignments). Phylogenetic trees were calculated using RAxML (raxmlHPC-PTHREADS-AVX, v. 8.0.26 ), using model PROTGAMMAAUTO. Analyses of EQ statements relative to gene families were conducted by generating combined EQ statements for each gene (concatenating multiple EQ statements into a single string separated by “;;”) and then joining these combined statements with genes. The resulting analyses are in Additional file 4. Alignments and phylogenetic trees are in Additional files 5 and 6, respectively.
Genome assembly and annotation versions used in these gene families were: Glycine max assembly and annotation version Wm82.a2.v1; Medicago truncatula assembly v 4.0v1; Arabidopsis thaliana v TAIR10; Oryza sativa Japonica (Nipponbare) assembly IRGSP-1.0, with the IRGSP-1.0 gene model names; Zea mays spp mays B75 RefGen v3, assembly annotation v 6a; Lycopersicon esculentum v iTAG2.3; Amborella trichopoda v 1.0.
aOntology term identifiers of the form PO:0000925 are shorthand for identifiers of the form http://purl.obolibrary.org/obo/PO_0009025.
bSome relations in the Relation Ontology fall within the BFO namespace, because they are imported from the Basic Formal Ontology.
The NSF Phenotype RCN supported this work by allowing plant and computational biologists to come together and develop a common vocabulary and supported the preliminary analysis of generated data. The Samuel Roberts Noble Foundation provided the Cyc database for Medicago (MedicCyc version 2.0). RLW was supported in part by the National Science Foundation under Grant Numbers #DBI-0735191 and #DBI-1265383. PJ and LC contributions are supported by the National Science Foundation Awards #IOS-0822201, #IOS-1127112 and #IOS-1340112.
- 1.National Research Council (US) Committee on Research at the Intersection of the Physical and Life Sciences. Research at the Intersection of the Physical and Life Sciences. Washington, D.C: National Academies Press; 2010.Google Scholar
- 5.Doelken SC, Köhler S, Mungall CJ, Gkoutos GV, Ruef BJ, Smith C, et al. Phenotypic overlap in the contribution of individual genes to CNV pathogenicity revealed by cross-species computational analysis of single-gene mutations in humans, mice and zebrafish. Dis Model Mech. 2013;6:358–72.CrossRefPubMedCentralPubMedGoogle Scholar
- 7.Hartl DL, Jones EW. Essential genetics: s genomics perspective. 4th ed. Sudbury, Massachusettes, USA: Jones and Bartlett Publishers; 2006.Google Scholar
- 8.Wikipedia phenotype article [http://en.wikipedia.org/wiki/Phenotype]
- 9.York LM, Nord E, Lynch J. Integration of root phenes for soil resource acquisition. Funct Plant Ecol. 2013;4:355.Google Scholar
- 30.Schaeffer ML, Harper LC, Gardiner JM, Andorf CM, Campbell DA, Cannon EKS, et al. MaizeGDB: curation and outreach go hand-in-hand. Database J Biol Databases Curation. 2011;2011:bar022.Google Scholar
- 31.Maize Genetics and Genomics Database [http://maizegdb.org/]
- 34.Gramene [http://www.gramene.org/]
- 35.Sol genomics network [http://solgenomics.net/]
- 38.PhenomeNET [http://phenomebrowser.net]
- 39.Plant PhenomeNET [http://phenomebrowser.net/plant/]
- 42.The OBO relations ontology [https://code.google.com/p/obo-relations/]
- 44.Link to complete dataset on iPlant Data Store. [http://ezid.cdlib.org/id/ark:/87946/p3159c]
- 45.The Rice Annotation Project Database [http://rapdb.dna.affrc.go.jp/index.html]
- 47.SolCyc biochemical pathways [http://solgenomics.net/tools/solcyc/index.pl]
- 53.Schnabel E, Karve A, Kassaw T, Mukherjee A, Zhou X, Hall T, et al. The M. truncatula SUNN gene is expressed in vascular tissue, similarly to RDN1, consistent with the role of these nodulation regulation genes in long distance signaling. Plant Signal Behav. 2012;7:4–6.CrossRefPubMedCentralPubMedGoogle Scholar
- 62.SeedGenes Project [http://www.seedgenes.org/]
- 64.The Arabidopsis Information Resource [http://www.arabidopsis.org/]
- 65.PubMed database [www.ncbi.nlm.nih.gov/pubmed]
- 67.Oryzabase [http://www.shigen.nig.ac.jp/rice/oryzabase/]
- 68.Tomato Genetics Resource Center [http://tgrc.ucdavis.edu/]
- 69.Smedley D, Oellrich A, Köhler S, Ruef B, Sanger Mouse Genetics P, Westerfield M, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database J Biol Databases Curation. 2013;2013:bat025.Google Scholar
- 70.The Open Biological and Biomedical Ontologies foundry [http://www.obofoundry.org/]
- 74.Groovy [http://groovy-lang.org]
- 75.PlantCyc [http://www.plantcyc.org]
- 76.The Samuel Roberts Noble Foundation- Medicago truncatula Pathway Database [http://mediccyc.noble.org/]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.