Systems developmental biology: the use of ontologies in annotating models and in identifying gene function within and across species
Systems developmental biology is an approach to the study of embryogenesis that attempts to analyze complex developmental processes through integrating the roles of their molecular, cellular, and tissue participants within a computational framework. This article discusses ways of annotating these participants using standard terms and IDs now available in public ontologies (these are areas of hierarchical knowledge formalized to be computationally accessible) for tissues, cells, and processes. Such annotations bring two types of benefit. The first comes from using standard terms: This allows linkage to other resources that use them (e.g., GXD, the gene-expression [G-E] database for mouse development). The second comes from the annotation procedure itself: This can lead to the identification of common processes that are used in very different and apparently unrelated events, even in other organisms. One implication of this is the potential for identifying the genes underpinning common developmental processes in different tissues through Boolean analysis of their G-E profiles. While it is easiest to do this for single organisms, the approach is extendable to analyzing similar processes in different organisms. Although the full computational infrastructure for such an analysis has yet to be put in place, two examples are briefly considered as illustration. First, the early development of the mouse urogenital system shows how a line of development can be graphically formalized using ontologies. Second, Boolean analysis of the G-E profiles of the mesenchyme-to-epithelium transitions that take place during mouse development suggests Lhx1, Foxc1, and Meox1 as candidate transcription factors for mediating this process.
Keywords: Neural Crest Cell; Mouse Development; Metanephric Mesenchyme; Anatomy Ontology; Developmental Anatomy
Up until the 1980s, most research in developmental biology involved analyzing the interactions among and within the tissues that participated in some embryologic event (e.g., limb development) and, on the basis of careful experimentation, inferring something about these interactions. A second and complementary approach was to use kinetics and other theoretical approaches to model a problem in development such as patterning. In either case, where there was more than one possible explanation of a phenomenon, it seemed obvious and sensible to give preference to the explanation that seemed the most parsimonious on grounds of natural selection. The gradual and continuing discovery of the intricacy of the signalling conversations between participating tissues, the richness of the activated molecular networks that regulate developmental change, and the complexity of the resulting processes have shown just how naïve was that original paradigm.
Over the past two decades, our ability to use a wide range of molecular technologies to investigate these regulatory networks and to collate the patterns of gene expression characterizing a particular state of differentiation has produced enormous amounts of information, often accessible from online databases (e.g., http://www.informatics.jax.org), on how development proceeds. This ability to explore complex developmental events at the molecular level has enabled the field to progress from a small-scale subject interesting relatively few scientists to an area of major interest and excitement across the world. One stimulus here has been the realization that mutation-derived errors in these networks underpin many human congenital abnormalities. The consequent study of these abnormalities, often using mouse models, has the dual benefit of advancing medical research and giving us a tool to pry open these networks. A second has been the realization that homologous networks do similar things in very different organisms and that we therefore have a means to explore the mechanisms of evolutionary change, which usually operate, as Waddington was probably the first to emphasize, by mediating changes in development (see below and Waddington 1975).
All this work has led to a wonderful increase in our understanding of developmental events, particularly those that involve signalling and those in which the activation of a transcription factor initiates a new process (for review, see Gilbert 2006). That said, it has to be admitted that, for most developmental events, there are now large amounts of molecular expression data that are hard to interpret unambiguously. Often we do not really know in a particular event which proteins are important, which are secondary, and which are background, and knockout and other experimental data can be either ambiguous or unhelpful. In one sense, the situation is worse than it was in the 1980s: Then we could appeal to parsimony via natural selection to make choices; now things are so complicated that we have no means of recognizing parsimony, and would not trust the concept anyway.
One approach to this complexity is to say that if only we had enough data, everything would become clear, but it is unlikely that anyone in the field really believes this. A second is to say that we need better and stronger intellectual frameworks than just relational databases for organizing and analyzing the new data that are pouring out of laboratories. A third is to take the view that we need not just a better framework for handling data, but better intellectual ideas. The third view is certainly right, but those ideas have yet to emerge and, in the absence of some deeply original and intuitive thinking, may well emerge from the second approach, which is hard enough at the moment and which is articulated under the general name of systems developmental biology. This approach is new and does not yet have any formal structure but, in general, seeks to embed a particular developmental event within a computational and hierarchical framework that links tissues, cells, processes, and molecular/genomic data, and often aims to capture the results of high-throughput technology (e.g., Kimelman 2006). Perhaps the best-known example of a systems approach is the work on sea-urchin development (e.g., Ben-Tabou de Leon and Davidson 2006), which integrates tissues, genes, and networks (Longabaugh et al. 2005). Other important systems approaches include analyses of developmental networks (Xia et al. 2006) and the molecular basis of very early mouse development (Evsikov et al. 2004).
This article does not seek to provide a systems approach to any particular phenomenon but to consider how best to take advantage of the computational tools currently available so as to ensure that systems descriptions based on tissues, cells, and processes can be interoperable in the sense that they use a common language. This would enable them to query one another and use each other’s formal knowledge (much as we can do for genes and proteins that are already linked through their IDs to their appropriate database). The key tools here are ontologies and the purpose of this article is to discuss what ontologies are, how they can be used in formalizing systems approaches to development within and across species, and what are the resulting benefits.
Ontologies of anatomical tissues and of cell types
At the core of development is the predictable production of functional and differentiated tissues from early, less well-defined tissues. It would therefore be sensible if, when one person uses, for example, the term “E14.5 mouse left atrium” in his systems model of heart development, another person using the same term in her model can link to that of the first. The way that such linkage is done for proteins is to use an ID from a standard database (e.g., the protein ID from Uniprot, http://www.ebi.uniprot.org), and because proteins are all amino-acid strings and hence of the same rank, they can readily be stored in the tables of relational databases.
Anatomical tissue organization, in contrast, is hierarchical in nature: The vertebrate hindlimb, for instance, is obviously partitioned into regions (thigh, knee, calf, foot), each of which has its own parts, and the concept of “hindlimb” would naturally be expected to include these subordinate parts, together with information about their relationship to the hindlimb and to one another. While it is obviously straightforward to assign a unique ID to a given tissue at a given developmental age, it is clear that the hierarchical organization of tissues poses some organizational problems beyond those needed for handling sequence data.
The way that such hierarchical information is most appropriately handled is through ontologies. These are domains of knowledge formalized in a way that allows them to be computationally accessible. In practice, ontologies are built up by linking facts in a hierarchical way. Here, a fact is a triad of the general form <term><relationship><term> and terms can have parents and children (e.g., the E14.5 left atrium is part of the E14.5 heart; the E14.5 heart is part of the E14.5 cardiovascular system, etc.). Although they are tedious to produce (even the simplest organ system has a great many tissues and a lot of organization), there are now part-of ontologies for the adult tissues of all the main model organisms and for the developmental anatomy of the mouse, zebrafish, and Drosophila (accessible from the Open Bio-Ontologies site, http://www.obo.sourceforge.net). Every term in these ontologies carries a standard ID of the form <abcd>:<ijkl>, where abcd gives a short letter code for the ontology (e.g., EMAP for mouse development) and ijkl gives the number for a specific tissue at a specific developmental age (e.g., EMAP:7917 is the ID for the E14.5 mouse left atrium, with EMAP standing for the Edinburgh Mouse Atlas Project, http://www.genex.hgu.mrc.ac.uk). It is these IDs that allow for interoperability because they represent defined concepts (or terms) that can be used anywhere, even as synonyms.
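As a minimal sketch of the triad idea, the facts quoted above can be held as simple records and walked upward to recover a term's parents. The `Fact` type, the `part_of` spelling of the relationship, and the helper names are assumptions made for illustration; they are not the format used by any particular ontology, and a real ontology would key its terms by ID and not by name.

```python
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    """One <term><relationship><term> triad."""
    subject: str
    relationship: str
    target: str


# The two facts given as examples in the text, written with term names.
FACTS = [
    Fact("E14.5 left atrium", "part_of", "E14.5 heart"),
    Fact("E14.5 heart", "part_of", "E14.5 cardiovascular system"),
]


def parse_term_id(term_id: str) -> Tuple[str, int]:
    """Split an ID such as 'EMAP:7917' into (ontology code, number)."""
    prefix, _, number = term_id.partition(":")
    if not prefix.isalpha() or not number.isdigit():
        raise ValueError(f"not an ID of the form code:number: {term_id!r}")
    return prefix, int(number)


def ancestors(term: str, facts: List[Fact], relationship: str = "part_of") -> List[str]:
    """Follow one relationship upward from term, nearest parent first."""
    found: List[str] = []
    frontier = [term]
    while frontier:
        current = frontier.pop(0)
        for fact in facts:
            if (fact.subject == current
                    and fact.relationship == relationship
                    and fact.target not in found):
                found.append(fact.target)
                frontier.append(fact.target)
    return found
```

Calling `ancestors("E14.5 left atrium", FACTS)` walks from the atrium to the heart and then to the cardiovascular system, which is the parent-child chain described in the paragraph above.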
In the context of systems developmental biology and in addition to the appropriate anatomy ontology, there are two general ontologies that are also useful. The first is the Cell-Type Ontology (Bard et al. 2005) and the second is the Gene Ontology (Ashburner et al. 2000; Harris et al. 2004). The former, unlike the anatomy ontologies, not only includes all the common and many of the uncommon cell types that are found across the phyla but is also essentially species-independent and so facilitates cross-species analyses and comparisons. This ontology is structured to include our knowledge of the many properties of these cell types, and each is separately coded under function, morphology, ploidy, development, etc., using two relationships, is-a and descends-from (see Fig. 2). This ontology is thus a terse summary of a great deal of knowledge about cell types and their properties.
One important feature of ontology terms is that they can be associated with data (usually held in a standard relational database and linked to the ontology via the appropriate IDs); examples include the proteins that satisfy the definition of a GO term (http://www.godatabase.org), the genes expressed in a particular mouse tissue at a particular time (http://www.informatics.jax), and the micrographs associated with a pathologic state (http://www.pathbase.net). Here, the hierarchical knowledge within the ontology comes into play: If, for example, a user requires the genes associated with the developing mouse forelimb at E12.5, the response comes from searching the ontology to identify the constituent tissues in the limb and using their IDs to collect all the associated data. This can be done because the part-of relationship has the property known as upwards propagation. This means that if a term has data associated with it, then these data can also be associated with its parent (e.g., a gene expressed in the tarsus is also expressed in the hindlimb). Propagation is associated with some bio-ontology relationships (e.g., part of, is a) but not with others (e.g., develops from; one would not expect pigment cells to have the same properties as their neural-crest-cell precursors).
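Upwards propagation can be sketched as follows: to answer a query about a tissue, collect the tissue together with everything that is directly or indirectly part of it, then pool the data attached to each. The hierarchy below uses tissue names from the text's hindlimb example; the gene names are placeholders, and the dictionary layout is an assumption for illustration only, not the schema of GXD or any other database.

```python
from collections import defaultdict

# child -> parent links for a toy part-of hierarchy.
PART_OF = {
    "tarsus": "foot",
    "foot": "hindlimb",
    "thigh": "hindlimb",
}

# Direct annotations: tissue -> genes recorded as expressed there.
# Gene names are hypothetical placeholders.
DIRECT = {
    "tarsus": {"geneA"},
    "thigh": {"geneB"},
    "hindlimb": {"geneC"},
}


def parts_of(term, part_of):
    """Return term plus every tissue that is directly or indirectly part of it."""
    children = defaultdict(set)
    for child, parent in part_of.items():
        children[parent].add(child)
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def genes_for(term, part_of, direct):
    """Pool the data of a term and all its parts (upwards propagation)."""
    genes = set()
    for tissue in parts_of(term, part_of):
        genes |= direct.get(tissue, set())
    return genes
```

A query on the hindlimb therefore returns the genes annotated to the tarsus and thigh as well as those annotated to the hindlimb itself, whereas a query on the thigh returns only its own. A develops-from link would deliberately be left out of `PART_OF`, since that relationship does not propagate.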
Using ontology terms for annotating systems models for mouse development
Ontologies, together with their linked data, provide an important online resource and have several key roles in systems developmental biology. The first is the use of well-defined terms (with their associated IDs) to standardize annotation, the second is for linkage to databases that store data associated with the terms, and the third is to facilitate the identification of similar terms in very different contexts. These ideas are explored here and in the next section and, while the approach is applicable to the development of any organism and also across organisms (see Discussion), the examples focus on the mouse. This is because our knowledge about its development is now so deep that it is often possible not only to describe how any tissue develops morphologically over time (Kaufman and Bard 1999) but to identify the processes and changes in cell type that underpin each time slice of a tissue’s development (see http://www.xspan.org). In addition, the mouse community is fortunate to have access to substantial online informatics resources that are available from The Jackson Laboratory. In the context of this article, the most important of these is GXD, a database of gene-expression (G-E) data for the developing mouse in which expression data are annotated with (and hence searchable by) tissue name, developmental stage, and GO IDs, as well as other genetic identities.
Patterning – this sets up future events in groups of cells
Proliferation and apoptosis – the basis of growth and shaping
Cell differentiation – changing a cell’s phenotype
Morphogenesis – the generation of spatial organization (e.g., via movement)
If the formation of a system is to be modeled, then the first step is to lay out its normal pattern of development graphically. Much of the stage-by-stage lineage data for mouse embryogenesis is available in text format (Kaufman and Bard 1999) and can be linked to the tissues (with their IDs) in the ontology of mouse developmental anatomy (Bard et al. 1998) and hence with GXD. Staging of mouse embryos is based on the appearance of standard external identifying features as embryogenesis proceeds; Theiler staging for the mouse gives, in essence, two stages a day when things are going rapidly (E6–E12.5) and one stage a day when the appearance of new features is slower (E1–E5 and E12.5 onward). Annotating the tissue names is straightforward because each tissue at each stage has a unique ID accessible from the ontology of mouse developmental anatomy (e.g., Fig. 1).
Where the state of a tissue changes between two Theiler periods, one can annotate the developmental change that drives this transition (this is not the usual way in which development is considered!) with the appropriate GO process terms. In this way, the final graph has, superimposed on the lineage flow of developmental anatomy, the appropriate differentiation and process terms that drive the development of each tissue. Underpinning each of these transitions is the appropriate ontology ID, so that the final graph is set up to be complete, formal, and interoperable.
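One way to picture such an annotated graph is as a list of transitions, each joining a tissue at one Theiler stage to its successor and carrying the process terms that drive the change. The sketch below is only an illustration under stated assumptions: the class names are invented, and the tissue names, stage numbers, and IDs are placeholders, not entries from the mouse anatomy ontology or from GO.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TissueState:
    """A tissue at one Theiler stage, identified by an anatomy-ontology ID."""
    name: str
    theiler_stage: int
    anatomy_id: str


@dataclass(frozen=True)
class Transition:
    """A change of state between two stages, annotated with process terms."""
    source: TissueState
    target: TissueState
    process_terms: tuple


@dataclass
class LineageGraph:
    transitions: list = field(default_factory=list)

    def add(self, source, target, *process_terms):
        if target.theiler_stage <= source.theiler_stage:
            raise ValueError("a transition must move to a later Theiler stage")
        self.transitions.append(Transition(source, target, tuple(process_terms)))

    def sources_for(self, process_term):
        """Tissues that are about to undergo the named process."""
        return [t.source for t in self.transitions if process_term in t.process_terms]


# Placeholder data: names, stages, and IDs are invented for illustration.
mesenchyme = TissueState("example mesenchyme", 20, "ANAT:0001")
epithelium = TissueState("example epithelium", 21, "ANAT:0002")
graph = LineageGraph()
graph.add(mesenchyme, epithelium, "mesenchyme-to-epithelium transition")
```

Because every node carries an anatomy ID and every edge a process term, a call such as `graph.sources_for("mesenchyme-to-epithelium transition")` lists the tissues whose expression profiles would be worth comparing.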
There is one immediate use of this model that derives from annotating terms with standard ontology IDs. The graph as it stands has no molecular data, but all the current gene-expression information associated with a particular developing mouse tissue at a given Theiler stage is computationally accessible from GXD through ID interoperability. GXD genes also carry GO IDs, which enables searches to be quite sophisticated. It is straightforward, in principle at least, to use these GO IDs to identify, for example, signals and receptors for tissues that signal to one another (Bard 2002) or transcription factors that are synthesized at a particular stage and ready for a future event.
In short, ontology annotations of developmental systems are not only the key to interoperability and standardization of systems models; they also give rich searching possibilities.
The genes underpinning common processes
There is a further bonus from such annotations: As development proceeds, the same developmental processes are used in very different contexts within one organism. This similarity goes beyond the differentiation of the same cell type from different tissues (e.g., neurons can differentiate directly from neuroepithelium and indirectly after the migration of cells originally from the neural crest or from epithelial placodes). Obvious examples are the branching of epithelial tubules (in glands and in the vascular system), epithelial folding in its many forms, the forming of mesenchymal condensations (the first step in the development of muscles, bones, and cartilage), the initiation of movement (in tissues as different as neural crest cells, primary germ cells, neurons, and gastrulating epiblast cells), and pigmentation (retinal epithelium, neural crest cells). Indeed, such processes are common to development across the phyla.
Consider the hypothesis that each of these processes can be viewed as a “motor” driven by the activation of a particular set of transcription factors (TFs). If this hypothesis is correct, then each set of tissues that are about to participate in a particular event should express those TFs and they should be present in the appropriate G-E profile in the associated database (they may also be missing, but they should not have been shown to be absent). If so, then the overlap of the G-E profiles of tissues about to initiate a particular process should include (1) those proteins involved in initiating that process and (2) housekeeping proteins common to all (or at least most) cells. If these housekeeping genes can be excluded, such Boolean analysis should yield key proteins involved in that process. A similar analysis of the G-E profiles for those tissues immediately after they have initiated a particular process should yield those proteins involved in that process.
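The distinction between a gene that is missing from a profile and one that has been shown to be absent can be made explicit in a small sketch. Here each tissue's profile maps a gene to `True` (detected) or `False` (assayed and not detected), and a gene that was never assayed simply has no entry. The function names, this three-state encoding, and the toy gene and tissue names are assumptions for illustration, not a description of how any expression database stores its results.

```python
def strict_overlap(profiles):
    """Genes detected in every tissue's profile."""
    detected = [
        {gene for gene, present in calls.items() if present}
        for calls in profiles.values()
    ]
    return set.intersection(*detected) if detected else set()


def tolerant_overlap(profiles):
    """Genes detected somewhere and never shown to be absent."""
    detected, shown_absent = set(), set()
    for calls in profiles.values():
        for gene, present in calls.items():
            (detected if present else shown_absent).add(gene)
    return detected - shown_absent


# Toy profiles with placeholder names; 'hk1' plays the housekeeping gene.
PROFILES = {
    "tissue_1": {"tfA": True, "tfB": True, "hk1": True},
    "tissue_2": {"tfA": True, "tfB": False, "hk1": True},
    "tissue_3": {"hk1": True, "tfC": True},  # tfA and tfB not assayed here
}
HOUSEKEEPING = {"hk1"}

strict_candidates = strict_overlap(PROFILES) - HOUSEKEEPING
tolerant_candidates = tolerant_overlap(PROFILES) - HOUSEKEEPING
```

With these toy profiles the strict overlap contains only the housekeeping gene and so yields no candidates, while the tolerant version keeps `tfA` (missing from one tissue but never shown absent) and drops `tfB` (shown absent in one). The tolerant version also keeps `tfC`, which was seen in a single tissue, so in practice it would need a further requirement that a gene be detected in several tissues.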
Any analysis along the lines suggested makes several assumptions beyond that of common TFs underpinning common processes. First, the time resolution of the G-E database has to be fine enough to discriminate between the period of a tissue’s competence to undergo a process and the process itself. In the case of mouse development for which the database archives expression by Theiler staging, this means time slices of 24 hours for early and late mouse development and 12 hours over the period E6–E12; this is probably adequate. Second, the database needs to contain enough data on the expression of all relevant genes. This latter criterion is unlikely to be met, even for GXD. While this rich resource currently includes some 250,000 expression results for about 7500 genes (information courtesy of Dr. Martin Ringwald), the data are not uniformly distributed across tissues or time slots. There is thus an element of chance as to whether the database holds information about the expression of a gene in a particular tissue at a given time.
1. Collate all the G-E data for each tissue in the 24 hours leading up to the process (a reasonable estimate of the period of competence); this list should include the infrastructure proteins for establishing that process.
2. Collate all the G-E data for each tissue during the stage at which the process is initiated, and probably the following one; this list should include all the process genes.
3. Identify the overlaps of the G-E patterns for steps 1 and 2. Given that GXD is incomplete, this probably means, in the first instance, including any protein expressed before or after the process in more than one tissue (Fig. 5), and particularly any of the latter whose expression is initiated just before the process is initiated (these are candidate genes for being activated by the TFs).
4. Remove any of these proteins that can be identified as the product of a housekeeping gene (this can be done from a standard list or perhaps from the G-E overlap of very different tissues); this will give the candidate infrastructure and process genes.
5. Analyze these candidate populations to see which genes (a) are heavily represented and (b) seem important (e.g., TFs); this may involve Bayesian statistical analysis.
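The steps above can be sketched in a few lines, with the caveat that this is a simplified illustration and not the full infrastructure the text says has yet to be built. Expression results are assumed to arrive as `(gene, tissue, stage, detected)` tuples, and the stage window for each tissue is supplied by hand; the record layout, function names, stage numbers, and gene names are all hypothetical. The ranking at the end is a plain count of tissues, standing in for the statistical analysis mentioned in the last step.

```python
from collections import Counter


def collate(records, windows):
    """Collation step: tissue -> genes detected at the stages in windows[tissue]."""
    profiles = {tissue: set() for tissue in windows}
    for gene, tissue, stage, detected in records:
        if detected and tissue in windows and stage in windows[tissue]:
            profiles[tissue].add(gene)
    return profiles


def shared_candidates(profiles, housekeeping, min_tissues=2):
    """Overlap, housekeeping removal, and ranking by number of tissues."""
    counts = Counter(gene for genes in profiles.values() for gene in genes)
    kept = [
        (gene, n) for gene, n in counts.items()
        if n >= min_tissues and gene not in housekeeping
    ]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy records; every value here is an invented placeholder.
RECORDS = [
    ("tfA", "t1", 19, True),
    ("tfA", "t2", 14, True),
    ("tfA", "t2", 20, True),   # outside t2's window, so ignored
    ("tfB", "t1", 19, True),
    ("tfB", "t2", 14, False),  # assayed but not detected
    ("hk1", "t1", 19, True),
    ("hk1", "t2", 14, True),
]
COMPETENCE_WINDOWS = {"t1": {18, 19}, "t2": {13, 14}}

competence_profiles = collate(RECORDS, COMPETENCE_WINDOWS)
candidates = shared_candidates(competence_profiles, housekeeping={"hk1"})
```

Running `collate` once with the competence windows and once with the stages at which the process is under way gives the two lists of the first two steps; `shared_candidates` then applies the "more than one tissue" rule that the incompleteness of the data makes necessary.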
Fortunately, there is a relatively simple shortcut that can be used for a quick exploration of the approach, and which takes advantage of the GO annotations in GXD. If one merely restricts one’s searches to (1) the periods of competence of tissues about to initiate a process and (2) genes with a GO transcription factor ID, the output should be restricted to those TFs associated with the initiation of that process. As an example, consider the mesenchyme-to-epithelium transition that takes place many times during development. A preliminary examination (full details will be published elsewhere) of the gene-expression profiles in GXD shows that there are substantial entries for the formation of blood vessels in the early mouse heart, the differentiation of heart endocardium, the metanephric ducts, the mesenchyme that forms the mesonephric ducts, and the early stages of somite development. If the search is restricted, using GO IDs, to TFs in the participating tissues in the two Theiler stages before these transitions take place, the data show that three TFs, Lhx1, Foxc1, and Meox1, are present in all these tissues (apart from a couple where the data are incomplete). A further inspection of the complete distributions of these genes shows that their expression (insofar as it is fully represented in GXD) is highly restricted over space and time, and because they do not in general overlap one another, they cannot be considered as housekeeping genes; their coexpression is hence unlikely to be a coincidence.
The TFs Lhx1, Foxc1, and Meox1 are thus, as a set, good candidates for collectively initiating a mesenchyme-to-epithelium transition, although they seem not to have been previously identified as fulfilling this role. It is therefore a prediction that this set be expressed in other tissues undergoing a mesenchyme-to-epithelium transition. Examples that might be worth investigating here include the stromal fibroblasts in the cornea that become the corneal endothelium and the splanchnopleure mesoderm that forms mesothelium (GXD currently includes no relevant expression data for these tissues). If the prediction were confirmed, it would be worth investigating which proteins were synthesized following their activation.
We are now beginning to catch up with Waddington’s thinking. Systems models are starting to be produced that aim to integrate the complexity of the molecular, cellular, and tissue details that underpin development using the computational resources that are now available. There will be many more such models, and, given the richness and complexity of development, they are bound to overlap. It is important that such overlaps allow interoperability, and a key point made in this article is that the community should not only use, as it already does, the terms and IDs for the gene and protein databases, but also incorporate the terms and IDs for cells, tissues, and processes that are to be found in standard bio-ontologies. This is partly for interoperability across models, but also to allow direct linkage to such databases as those handling G-E data.
This article also points out that there is an additional bonus from using these IDs, i.e., where the same process is used in the development of different tissues, the linking of tissue IDs to their associated gene-expression profiles can, in principle, lead through Boolean analysis to the identification of candidate genes associated with the initiation and execution of this process. The databases are currently populated with genes whose roles are still unclear, so this computational approach complements experimental approaches because it enables small groups of genes to be linked to the initiation and execution of processes. This contrasts with the analysis of individual genes whose roles can be analyzed using, for example, transgenic technology, and with high-throughput technology that picks up large numbers of genes but yields little about their function.
A further point to be made is that the type of computational approach to the identification of gene function given here allows us to test the hypothesis that common processes are underpinned by common TFs. If such a search yielded several candidate TFs that are found to be associated with the initiation of a process in some tissues but have yet to be found in all tissues, it suggests that the expression of these TFs should be further examined in these other tissues. A lack of expression there would cast doubt on or at least narrow the extent of the hypothesis. This approach to systems biology thus provides assays for testing our ideas.
In this article, the focus has been on formalizing mouse development and analyzing the molecular underpinnings of its underlying processes because it is for this organism that the associated expression database has the finest spatial and temporal granularity. It should of course be pointed out that the approach is equally applicable to other organisms and even across organisms that share equivalent developmental processes. In the first instance, the mouse can be used as a model for identifying process-associated genes. Where a similar process occurs in other organisms, the homologs will be candidate genes for that process (and the XSPAN facility, http://www.xspan.org, will be helpful in identifying equivalent tissues in model organisms). A further tool under development that may be useful in this context is CARO, the Common Anatomy Reference Ontology (http://www.obo.sourceforge.net/cgi-bin/detail.cgi?caro, Haendel et al. 2007) which aims to provide interoperability across species-specific anatomy ontologies. In the longer term, one can envision the construction of complex systems models that span organisms and that employ the full richness of the computational resources that are available.
At a slightly deeper level, what distinguishes systems developmental biology from other approaches to unpicking the complexities of development is the formalization of the events of embryogenesis. This in turn enables tissues, cells, and expressed genes to be linked in a way that lends itself to computational as well as other forms of analysis. The exercise of formalizing embryogenesis encourages the biologist to think in ways that complement other more traditional approaches.
The author thanks Stuart Aitken for the many discussions about ontologies and for his computational help.
- Ben-Tabou de Leon S, Davidson EH (2006) Deciphering the underlying mechanism of specification and differentiation: the sea urchin gene regulatory network. Sci STKE 2006(361):pe47
- Gilbert SF (2006) Developmental Biology, 8th ed. (Sunderland, MA: Sinauer Associates)
- Haendel MA, Neuhaus F, Osumi-Sutherland DS, Mabee PM, Mejino JLV, Mungall CJ, Smith B (2007) CARO – the common anatomy reference ontology. In: Burger A, Davidson D, Baldock R (eds) Anatomy ontologies for bioinformatics, principles and practice. Springer Verlag, Heidelberg. In press
- Kaufman MH, Bard JBL (1999) The Anatomical Basis of Mouse Development. (London: Academic Press)
- Waddington CH (1940) Organisers and genes. (Cambridge: Cambridge University Press)
- Waddington CH (1957) The strategy of the genes. (London: Allen & Unwin)
- Waddington CH (1975) Evolution of an evolutionist. (Edinburgh: Edinburgh University Press)