Metabolomics standards initiative: ontology working group work in progress
In this article we present the activities of the Ontology Working Group (OWG) under the Metabolomics Standards Initiative (MSI) umbrella. Our endeavour aims to synergise the work of several communities in which independent activities are underway to develop terminologies and databases for metabolomics investigations. We have joined forces to rise to the challenges associated with interpreting and integrating experimental processes and data across disparate sources (software and databases, private and public). Our focus is to support the activities of the other MSI working groups by developing a common semantic framework that enables metabolomics user communities to annotate the experimental process consistently and to exchange datasets meaningfully. Our work is accessible via a public webpage and a draft ontology has been posted under the Open Biomedical Ontology (OBO) umbrella. From the very outset, we agreed to minimize duplication across omics domains through extensive liaison with other communities under the OBO Foundry. This is work in progress and we welcome new participants willing to volunteer their time and expertise to this open effort.
Keywords: Controlled vocabulary, Annotation, Terminology, Semantic, Metadata, Ontology, Functional genomics, Metabolomics, Metabonomics, Standard, Metabolomics society, Metabolomics standards initiative, OBO
The storage, management, exchange and description of ‘omics-based investigations, such as metabolomics, present challenges to biologists and bioinformaticians (Brooksbank and Quackenbush 2006; Fiehn et al. 2006; Sansone et al. 2006; Shulaev 2006). The Metabolomics Standards Initiative (MSI, http://msi-workgroups.sf.net) Working Groups have recognized that the establishment of reporting standards, such as minimal information requirements and exchange formats with defined semantics, is necessary to enable efficient data sharing and meaningful data mining (Castle et al. 2006). Often the required information is captured as free text, which is subject to ambiguities, redundancy and typographical errors, and this reduces the power of computational approaches to retrieve the information and interpret the experimental procedures unambiguously.
Adding an interpretive annotation layer to the textual information is commonly done with representational artefacts (RA) such as structured controlled vocabularies (CVs) and/or ontologies (Cimino and Zhu 2006; Schulze-Kremer 1998), consisting of related representational units (RU). A CV is a set of terms (or RUs), defined by an authority or through community agreement, in most cases formalized as an is-a hierarchy of terms (a taxonomy, although within the bio community this term is often used in the more restricted sense of a biological species taxonomy). Each RU is described by means of attributes such as identifier, name, definition and definition source (Smith et al. 2006). A CV is a simple and intuitive way of inserting an interpretive layer of semantics among the terms used by different experimentalists to describe (annotate) an experimental parameter, for example a type of sample treatment or instrument, in an unambiguous way. Compared to ontologies, CVs are rather informal and lightweight representational artefacts. An ontology is a more explicit and formal representation of the knowledge in a domain, lying at the top end of the semantic complexity scale. Ontologies are semantically rich representations, containing CV terms as classes as well as their properties, and logical statements for characterising those classes and the ways in which they can or cannot be related to each other.
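The description of a CV given above can be illustrated with a minimal sketch, assuming each RU carries exactly the four attributes named in the text plus an optional is-a parent. The class name, the helper function, and all identifiers and definitions below are hypothetical and serve only as an illustration; they are not taken from the MSI OWG drafts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    """One representational unit (RU) of a controlled vocabulary."""
    identifier: str
    name: str
    definition: str
    definition_source: str
    is_a: Optional[str] = None  # identifier of the parent term, if any


def ancestors(cv: Dict[str, Term], identifier: str) -> List[str]:
    """Return the names of all terms above the given term in the is-a hierarchy."""
    path: List[str] = []
    seen = {identifier}  # guards against accidental cycles in the hierarchy
    parent = cv[identifier].is_a
    while parent is not None and parent not in seen:
        seen.add(parent)
        path.append(cv[parent].name)
        parent = cv[parent].is_a
    return path


# Hypothetical identifiers and definitions, for illustration only.
EXAMPLE_TERMS = [
    Term("EX:0001", "instrument",
         "A device used in an investigation.", "illustrative"),
    Term("EX:0002", "spectrometer",
         "An instrument that measures a spectrum.", "illustrative",
         is_a="EX:0001"),
    Term("EX:0003", "NMR spectrometer",
         "A spectrometer that records NMR spectra.", "illustrative",
         is_a="EX:0002"),
]
EXAMPLE_CV = {term.identifier: term for term in EXAMPLE_TERMS}
```

Calling ancestors(EXAMPLE_CV, "EX:0003") walks up the taxonomy from the most specific term. A full ontology would add further relations and logical constraints on top of such a hierarchy, which this sketch does not attempt.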
By way of defined semantics, ontologies provide regimentations of a given terminology, while the defined syntax increases the interoperability between systems exchanging information. Ontologies facilitate the development of systems for data annotation and natural language processing and thereby ontology-based knowledge representations can extend the power of computational approaches to information retrieval, interpretation of experimental procedures, data exploration and knowledge discovery (Blake and Bult 2006). This potential has encouraged several scientific communities, including those operating in the metabolomics domain, to develop ontologies to be used for data annotation (Bodenreider and Stevens 2006; Field and Sansone 2006; Lan et al. 2003; Rubin et al. 2006; Schulze-Kremer 2002; Shulaev 2006; Stevens et al. 2006).
This article describes the working strategy, the developmental phases, the current activities and the challenges of the MSI Ontology Working Group (MSI OWG, http://msi-ontology.sf.net) in its effort to reach a broad consensus in the community on the formal semantic representation that is required to describe metabolomics investigations unambiguously.
The MSI OWG working strategy
The MSI OWG brings together members from diverse backgrounds and perspectives, including metabolomics practitioners, chemometricians, computer scientists, bioinformaticians (data managers, systems developers and data analysts) and ontology engineers.
The scope of the MSI OWG is to support the activities of the (1) Biological Context Metadata sub-WGs as well as the (2) Chemical Analysis, (3) Data Processing and (4) Exchange Format WGs (Sumner et al. 2007; Morrison et al. 2007; Griffin et al. 2007; van der Werf et al. 2007; Fiehn et al. 2007). The minimal reporting requirements identified by the first three WGs will inform the development of data exchange standards (Hardy and Taylor 2007) in order to provide a common mode of transport for the information between systems. Our work will ultimately provide a formal semantic interpretation for the format, by developing a common semantic framework to enable the metabolomics user communities to consistently annotate the experimental process and ensure meaningful exchange of their datasets. The MSI OWG has been conceived as a ‘single point of focus’ for communities where independent activities—to develop terminologies and databases for metabolomics investigations—are underway. Interoperability among these systems is the key driving force behind this endeavour.
The MSI OWG aims to:
- Assist with the representation of study designs, protocols and instrumentation used, data generated and the types of analyses performed on them;
- Provide a consensual set of terms for the consistent semantic description of data across disparate metabolomics resources (software and databases, both private and public).
The work relies on three groups of contributors:
- MSI OWG members as developers of the CVs and ontology;
- Ontology experts/knowledge engineers to provide advice about the engineering of the ontology and practical use cases for an ontology-driven application;
- Last but not least, metabolomics practitioners/domain experts to provide use cases for the ontology, validate the CVs and ontology produced and advise on additional terms to be included.
The MSI OWG operates under the assumption that no one group or community alone can bridge the ‘semantic gap’, and that a synergistic effort is the only way forward. We work cooperatively and maintain a public website with the names of participating members, so as to remain approachable, inclusive and transparent as the size of the group and the complexity of the tasks increase. We communicate via two mailing lists. The first list is open to the public, including those interested only in being kept informed of progress, while the other list is ‘closed’ and available only to those willing to (1) share the terminology they currently use and (2) invest time and expertise in such a collaborative endeavour. Our documents are publicly available via the MSI OWG webpage and drafts of the ontology are posted under the Open Biomedical Ontology (OBO, http://obo.sf.net; Rubin et al. 2006) umbrella. Readers, potential users and developers wishing to send feedback to this and other MSI WGs can also use the following email address: msi-workgroups-feedback[at]lists.sourceforge.net.
Fortunately, there is a generally accepted view that concerted efforts are required across the functional genomics and systems biology communities to work towards harmonised and interoperable reporting standards. From the very outset, we have striven to reduce the duplication of effort across ‘omics domains, where commonality exists, through extensive liaison with other standards initiatives (described in the next sections) and other ontology communities under the OBO Foundry (Smith et al. under review; http://obofoundry.org). Common standards will benefit the entire scientific community by simplifying the task of data integration and by facilitating the work of software developers, vendors and equipment manufacturers, who will need less time and money to implement standards-compliant products (Quackenbush 2004).
Phase 1—Use cases and CV
As described in the section above, ultimately our work will provide a semantic framework for the exchange format, to be agreed upon by the Data Exchange WG, that describes the minimal reporting requirements relevant for the interpretation of metabolomics investigations. Since both the definition of minimal reporting requirements and the development of a data exchange format represent work in progress, our work should be considered explorative and at a very early stage.
Domain coverage and resources
Naming conventions and metadata recommendations
At present, neither unified naming conventions nor common metadata elements have been agreed upon by the ontology-oriented communities for naming and annotating RUs within RAs, or the RA as a whole (Rickard et al. 2004; Supekar and Musen 2005). Naming conventions prescribe how CV terms and ontological classes should be named and formulated in a consistent manner to unify term appearance, reducing redundancy and increasing precision. These conventions would also provide guidance to the ontology engineer on how to handle content-related issues, for example Definition and Synonym (semantic naming conventions), and how to tackle lexical issues, such as term/class name length, allowed character set and format, word separators and word tense (syntactic naming conventions). Metadata elements belong to different categories. For example, descriptive metadata adds useful information on RUs, such as definitions or examples, while administrative metadata provides information such as when and how an RU or RA was created (release date, version, authority etc.). In the absence of naming conventions and metadata elements applicable to our case, we have started working on such common recommendations in collaboration with the PSI Ontology WGs (Schober et al. 2007). The use of such common conventions would be pivotal in the development and maintenance of the ontology resource by the large participating communities. First drafts of the naming conventions and metadata ontologies are available from our webpage (http://msi-ontology.sourceforge.net/recommendations).
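To make the idea of syntactic naming conventions concrete, the following is a minimal sketch of an automated check on term names. The specific rules (a 60-character limit, letters, digits, spaces and hyphens only, single spaces as word separators) are assumptions chosen for illustration and are not the recommendations in the draft documents; the function name is likewise hypothetical.

```python
import re
from typing import List

# Assumed limits, for illustration only; the real conventions may differ.
MAX_NAME_LENGTH = 60
ALLOWED_CHARACTERS = re.compile(r"[A-Za-z0-9 \-]+")


def check_term_name(name: str) -> List[str]:
    """Return a list of syntactic problems found in a term name."""
    if not name:
        return ["name is empty"]
    problems: List[str] = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name is too long")
    if ALLOWED_CHARACTERS.fullmatch(name) is None:
        problems.append("name contains characters outside the allowed set")
    if name != name.strip() or "  " in name:
        problems.append("name has leading, trailing or repeated spaces")
    return problems
```

A check of this kind could be run over a whole term list before it is circulated, so that reviewers can concentrate on definitions and relationships rather than on lexical details. Semantic conventions, such as how a definition should be phrased, are not covered by this sketch.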
CVs master list
For a given sub-component, the CV master list is compiled in the following steps:
- Start from an initial list of terms for a sub-component from a certain resource (database models, glossaries etc.); add definitions for each term and make these compliant with the naming conventions. Keep track of the relationships between the terms (is_a, part_of etc.), if provided, for the ontology development phase;
- Structure the terms in an is_a hierarchy (taxonomy) for sorting and redundancy removal;
- Discuss the CVs within the OWG, and then circulate them to the practitioners in the relevant metabolomics area. This will ensure that the lists are as complete as possible and that we obtain valid definitions, and it will aid ontology construction later on;
- Explore the use of text mining over a relevant collection of metabolomics papers to identify frequently used terms and enrich the term list;
- Once general agreement has been reached on the initial CVs, process further resources in turn by deciding which of their terms should be incorporated into the initial CV. For each of these terms, synonyms, definitions and relationships will be identified as before;
- When all resources for a given sub-component have been exhausted, determine which domains remain to be covered. At this stage, we will need to collaborate actively with both the metabolomics practitioners and the other MSI WGs, particularly with the Data Exchange WG, to ensure the quality and completeness of the proposed CV.
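The text-mining step in the procedure above can be sketched very simply as counting frequent two-word phrases in a collection of texts and reporting those not yet in the CV. This is an illustrative stand-in under stated assumptions, not the method used by the OWG: the stopword list, the restriction to two-word phrases, the threshold and the function name are all hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A deliberately small stopword list, assumed for illustration.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "was", "were",
             "by", "with", "for", "on"}


def candidate_terms(texts: Iterable[str], known: Set[str],
                    min_count: int = 2) -> List[Tuple[str, int]]:
    """Return frequent two-word phrases that are not yet in the term list."""
    counts: Counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        # Count adjacent word pairs in which neither word is a stopword.
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[f"{first} {second}"] += 1
    known_lower = {name.lower() for name in known}
    frequent = [(phrase, n) for phrase, n in counts.items()
                if n >= min_count and phrase not in known_lower]
    # Most frequent first; ties are broken alphabetically.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))
```

The output is only a list of candidates. Each phrase would still need a definition, a check against the naming conventions and review by domain experts before entering the master list.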
The OWG’s ultimate goal is to combine the CVs master lists and add further formal structure to create a formal ontology. To achieve this goal, the OWG engages with leading experts in the field of ontology and other ontology communities under the OBO and the OBO Foundry umbrellas. The OBO Foundry is a recent initiative that aims to establish a framework for semantic interoperability in the life sciences. To ensure consistent evolution of the ontologies, the OBO Foundry leaders have issued a set of development recommendations, which will be enhanced in the course of time as new aspects of ontology best practice become established. These recommendations will include the use of (1) an upper formal ontology, the OBO Upper Biomedical Ontology (UBO), currently being developed and based on the Basic Formal Ontology (BFO, http://www.ifomis.uni-saarland.de/bfo), to define the top-level class framework under which knowledge representation will be carried out and (2) the Relation Ontology (Smith et al. 2005), providing well-characterized relations to describe how entities relate to each other (e.g., the foundational relations is_a and part_of, but also temporal and spatial relations such as develops_from and located_in). The OBO Foundry also addresses housekeeping needs for ontology maintenance and editing, recommending, among other things, the Web Ontology Language (OWL, http://www.w3.org/2003/08/owlfaq) and OBO as the formats for distribution.
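As an illustration of how a term and its relations might look in the OBO distribution format, here is a minimal sketch that renders one term stanza. It assumes the tag layout of the OBO flat-file format (id, name, def, is_a, relationship); the function name and every identifier and definition used with it are hypothetical and do not come from the MSI or OBI drafts.

```python
from typing import Optional, Sequence, Tuple


def obo_stanza(term_id: str, name: str, definition: str, source: str,
               is_a: Optional[str] = None,
               relationships: Sequence[Tuple[str, str]] = ()) -> str:
    """Render a single term as an OBO-style [Term] stanza."""
    # Escape backslashes and double quotes inside the quoted definition.
    escaped = definition.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        f'def: "{escaped}" [{source}]',
    ]
    if is_a is not None:
        lines.append(f"is_a: {is_a}")
    # Relations other than is_a, e.g. part_of or located_in.
    for relation, target in relationships:
        lines.append(f"relationship: {relation} {target}")
    return "\n".join(lines)
```

For example, obo_stanza("EX:0000002", "probe head", "The part of an NMR spectrometer that holds the sample.", "illustrative", is_a="EX:0000001", relationships=[("part_of", "EX:0000003")]) yields a stanza with one is_a line and one part_of relationship line. An OWL serialisation of the same term would express these relations as class axioms instead.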
The OWG directly participates in OBI (previously titled FuGO, http://obi.sourceforge.net, Whetzel et al. 2006), an international collaborative project, initiated in 2005, which aims to build a cross-domain ontology as a resource for the annotation of biological, medical and environmental investigations. OBI is an OBO Foundry project set to provide terms that can be used to annotate investigations and the protocols, instrumentation and materials used in those investigations, along with the data generated and analyses performed. OBI brings together HUPO-PSI, the MGED Society and other communities, and the MSI OWG represents the metabolomics domain in this collaborative effort. According to the OBI working strategy, the general experimental components are built collaboratively, while each participating community proposes an informal ontology model relevant to its specific domain. The MSI OWG will provide technology-dependent components by using the relevant OBI “leaf nodes” (e.g., Instrument) as top-level classes. These are then harmonized and positioned within the common BFO top-level ontology to ensure reuse and integration with other existing bio-ontologies (Rosse et al. 2005). The OBI project is driven by a coordinating committee, bringing together representatives of the participating communities, while guidance on design and engineering is provided by an Advisory Board, including recognised ontology experts. OBI is being developed in OWL using the Protégé ontology editor (Noy et al. 2003). Use cases and terms from each community, minutes of the teleconferences, reports from face-to-face workshops and presentations are available from the project website. An initial version of the top-level structure of OBI, using the BFO, is available at the OBO website (http://obo.sourceforge.net/cgi-bin/detail.cgi?obi).
A first draft of OBI will be considered ‘completed’ when the general (common) experimental component and all the technology-dependent components have been developed and harmonized (redundancy removed).
The initial source of terms for the NMR CV is Rubtsov et al. (2007). As stated before, the minimal reporting requirements and the format are both work in progress conducted by other MSI WGs; the ontology for the NMR sub-component should therefore not be considered complete at this stage. The NMR.owl has also served as a test bed to evaluate the BFO top-level ontology as well as technical issues such as the OWL import, the cross-ontology referencing mechanism, modularisation, constraint inheritance and the usage of RA and RU metadata annotation properties in Protégé. Overall, this experience has been an excellent use case for practising our working strategy and our collaboration with the larger OBI group.
Every effort will be made to meet the group goals in a timely fashion, although the MSI OWG members are geographically distributed and central funds do not exist for the MSI WGs activities. One of the major bottlenecks in building bio-ontologies is the lack of a unified methodology and tools for collaborative development, which makes large collaborative endeavours more challenging (Castro et al. 2006). The MSI OWG and the OBI WG present scenarios in which domain experts are geographically widespread and the structure of the ontology is constantly evolving. Consequently, face-to-face workshops have proved to be the most efficient way to advance the project significantly. In addition, the sociological barriers involved in these kinds of large-scale collaborations can be far more challenging than the technical ones, and extensive liaison is necessary between communities. Managing this process of consensus building from start to finish requires ample time, resources and expertise. The time invested in identifying commonalities and synergies with other projects, such as OBI, is often limited by a lack of resources. The massively collaborative nature of the ontology undertaking requires frequent face-to-face workshops to create the optimal conditions for building consensus. Teleconferences and web meetings are also used, but these are generally short and are not an ideal mechanism for efficient collaborative development; rarely are they as effective as the direct interactions established at face-to-face workshops. Unfortunately, it is very difficult to hold such workshops without central funds, and such funds have previously been difficult to obtain in competition with more traditional scientific projects. In the special issue of the journal OMICS (Field and Sansone 2006), twenty invited manuscripts describe different standardisation initiatives, focusing on both the successes and pitfalls encountered, and the lessons learned.
This issue also includes a special call for action for further recognition of the importance of global omics standardisation activities (Brooksbank and Quackenbush 2006), where the authors eloquently describe the Herculean efforts that are often accomplished ‘on the side’ and without formal funding, simply because the lack of standardisation is an unacceptable state of affairs for omics researchers and is repeatedly proving to be a significant bottleneck in the collection, querying, processing, and sharing of data.
The authors are members of the MSI Ontology WG; Susanna-Assunta Sansone is the current acting chair and Daniel Schober is the post-doctoral ontologist assisting the WG with the developmental phases. We kindly acknowledge the MSI Oversight Committee, the other MSI WGs chairs and members, the OBI working group, the OBO Foundry leaders and the Ontogenesis Networks members for their contributions in fruitful discussions. We also gratefully thank the BBSRC e-Science Development Fund (BB/D524283/1 and BB/E025080/1, to Susanna-Assunta Sansone), the BBSRC MeT-RO project (MET20483, to Helen Jenkins), the BBSRC/EPSRC “The Manchester Centre for Integrative Systems Biology” (to Irena Spasic), the EU Network of Excellence NuGO (NoE 503630, to Philippe Rocca-Serra) and the EU Network of Excellence Semantic Interoperability and Data Mining in Biomedicine (NoE 507505, supporting Daniel Schober and Irena Spasic exchange visits).
- de Matos, P., Ennis, M., Darsow, M., Guedj, M., Degtyarenko, K., & Apweiler, R. (2006). ChEBI - Chemical Entities of Biological Interest. Nucleic Acids Research.
- Fiehn, O., Sumner, L. W., Rhee, S. Y., Ward, J., Dickerson, J., Lange, B. M., Lane, G., Roessner, U., Last, R., & Nikolau, B. (2007). Minimum reporting standards for plant biology context information in metabolomics studies. Metabolomics, 3, this issue.
- Griffin, J. L., Nicholls, A. W., Daykin, C. A., Heald, S., Keun, H. C., Schuppe-Koistinen, I., Griffiths, J. R., Cheng, L. L., Rocca-Serra, P., Rubtsov, D. V., & Robertson, D. (2007). Standard reporting requirements for biological samples in metabolomics experiments: mammalian / in vivo experiments. Metabolomics, 3, this issue.
- Hardy, N. W., & Taylor, C. F. (2007). A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics, 3, this issue.
- Jenkins, H., Hardy, N., Beckmann, M., Draper, J., Smith, A. R., Taylor, J., Fiehn, O., Goodacre, R., Bino, R. J., Hall, R., Kopka, J., Lane, G. A., Lange, B. M., Liu, J. R., Mendes, P., Nikolau, B. J., Oliver, S. G., Paton, N. W., Rhee, S., Roessner-Tunali, U., Saito, K., Smedsgaard, J., Sumner, L. W., Wang, T., Walsh, S., Wurtele, E. S., & Kell, D. B. (2004). A proposed framework for the description of plant metabolomics experiments and their results. Nature Biotechnology, 22, 1601–1606.
- Lindon, J. C., Nicholson, J. K., Holmes, E., Keun, H. C., Craig, A., Pearce, J. T., Bruce, S. J., Hardy, N., Sansone, S. A., Antti, H., Jonsson, P., Daykin, C., Navarange, M., Beger, R. D., Verheij, E. R., Amberg, A., Baunsgaard, D., Cantor, G. H., Lehman-McKeeman, L., Earll, M., Wold, S., Johansson, E., Haselden, J. N., Kramer, K., Thomas, C., Lindberg, J., Schuppe-Koistinen, I., Wilson, I. D., Reily, M. D., Robertson, D. G., Senn, H., Krotzky, A., Kochhar, S., Powell, J., van der Ouderaa, F., Plumb, R., Schaefer, H., & Spraul, M. (2005). Summary recommendations for standardization and reporting of metabolic analyses. Nature Biotechnology, 23, 833–838.
- Morrison, N., Bearden, D., Bundy, J. G., Collette, T., Currie, F., Davey, M. P., Haigh, N. S., Hancock, D., Jones, O. A. H., Rochfort, S., Sansone, S-A., Stys, D., Teng, Q., Field, D., & Viant, M. R. (2007). Standard reporting requirements for biological samples in metabolomics experiments: environmental context. Metabolomics, 3, this issue.
- Noy, N. F., Crubezy, M., Fergerson, R. W., Knublauch, H., Tu, S. W., Vendetti, J., & Musen, M. A. (2003). Protege-2000: An Open-source Ontology-development and Knowledge-acquisition Environment. Proc AMIA Symp, 953.
- Rosse, C., Kumar, A., Mejino, J. L. Jr., Cook, D. L., Detwiler, L. T., & Smith, B. (2005). A strategy for improving and integrating biomedical ontologies. AMIA Annu Symp Proc, 639–643.
- Rubin, D. L., Lewis, S. E., Mungall, C. J., Misra, S., Westerfield, M., Ashburner, M., Sim, I., Chute, C. G., Solbrig, H., Storey, M. A., Smith, B., Day-Richter, J., Noy, N. F., & Musen, M. A. (2006). National center for biomedical ontology: Advancing biomedicine through structured organization of scientific knowledge. Omics, 10, 185–198.
- Rubtsov, D. V., Jenkins, H., Ludwig, C., Easton, J., Viant, M. R., Gunther, U., Griffin, J. L., & Hardy, N. (2007). Proposed reporting requirements for the description of NMR-based metabolomics experiments. Metabolomics, 3, this issue.
- Schober, D., Kusnierczyk, W., Lewis, S., Lomax, J., Members of the MSI, PSI Ontology Working Groups, Mungall, S., Rocca-Serra, P., Smith B., & Sansone, S.-A. (2007). Towards naming conventions for use in controlled vocabulary and ontology engineering. In Proceedings of the Bio-Ontologies Workshop, ISMB/ECCB, Vienna http://bio-ontologies.org.uk/download/Bio-Ontologies2007.pdf, pp. 29–32.
- Schulze-Kremer, S. (1998). Ontologies for molecular biology. Pac Symp Biocomput, 695–706.
- Smith, B., Kusnierczyk, W., Schober, D., & Ceusters, W. (2006). Towards a Reference Terminology for Ontology Research and Development in the Biomedical Domain. KR-MED 2006.
- Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L., Eilbeck, K., Ireland, A., Mungall, C., the OBI Consortium, Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S-A., Shah, N., Whetzel, P. L., Lewis, S. The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology (under review).
- Soldatova, L. N., & King, R. D. (2006). An ontology of scientific experiments. Journal of the Royal Society Interface.
- Spasic, I., Schober, D., Sansone, S-A., Rebholz-Schuhmann, D., Kell, D. B., Paton, N., & MSI Ontology Working Group Members (2007). Facilitating the development of controlled vocabularies for metabolomics with text mining. In Proceedings of the Bio-Ontologies Workshop, ISMB/ECCB, Vienna, http://bio-ontologies.org.uk/download/Bio-Ontologies2007.pdf, pp. 45–48.
- Stevens, R., Bodenreider, O., & Lussier, Y. A. (2006). Semantic webs for life sciences. Pacific Symposium on Biocomputing, 112–115.
- Sumner, L. W., Amberg, A., Barrett, D., Beale, M. H., Beger, R., Daykin, C. A., Fan, T. W-M., Fiehn, O., Goodacre, R., Griffin, J. L., Hankemeier, T., Hardy, N., Harnly, J., Higashi, R., Kopka, J., Lane, A. N., Lindon, J. C., Marriott, P., Nicholls, A. W., Reily, M. D., Thaden, J. J., & Viant, M. R. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3, this issue.
- Supekar, K., & Musen, M. (2005). Ontology metadata to support the building of a library of biomedical ontologies. AMIA Annual Symposium Proceedings, 1127.
- van der Werf, M. J., Takors, R., Smedsgaard, J., Nielsen, J., Ferenci, T., Portais, J. C., Wittmann, C., Hooks, M., Tomassini, A., Oldiges, M., Fostel, J., & Sauer, U. (2007). Standard reporting requirements for biological samples in metabolomics experiments: microbial and in vitro biology experiments. Metabolomics, 3, this issue.
- Whetzel, P. L., Brinkman, R. R., Causton, H. C., Fan, L., Field, D., Fostel, J., Fragoso, G., Gray, T., Heiskanen, M., Hernandez-Boussard, T., Morrison, N., Parkinson, H., Rocca-Serra, P., Sansone, S. A., Schober, D., Smith, B., Stevens, R., Stoeckert, C. J. Jr., Taylor, C., White, J., & Wood, A. (2006). Development of FuGO: an ontology for functional genomics investigations. Omics, 10, 199–204.
- Wishart, D. S., Tzur, D., Knox, C., Eisner, R., Guo, A. C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S., Fung, C., Nikolai, L., Lewis, M., Coutouly, M. A., Forsythe, I., Tang, P., Shrivastava, S., Jeroncic, K., Stothard, P., Amegbey, G., Block, D., Hau, D. D., Wagner, J., Miniaci, J., Clements, M., Gebremedhin, M., Guo, N., Zhang, Y., Duggan, G. E., Macinnis, G. D., Weljie, A. M., Dowlatabadi, R., Bamforth, F., Clive, D., Greiner, R., Li, L., Marrie, T., Sykes, B. D., Vogel, H. J., & Querengesser, L. (2007). HMDB: the Human Metabolome Database. Nucleic Acids Research, 35, D521–D526.