XML schemas for common bioinformatic data types and their application in workflow systems

Seibel, Philipp N; Krüger, Jan; Hartmeier, Sven; Schwarzer, Knut; Löwenthal, Kai; Mersch, Henning; Dandekar, Thomas; Giegerich, Robert

doi:10.1186/1471-2105-7-490

XML schemas for common bioinformatic data types and their application in workflow systems

Software
Open access
Published: 06 November 2006

Volume 7, article number 490, (2006)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

XML schemas for common bioinformatic data types and their application in workflow systems

Download PDF

Philipp N Seibel¹,
Jan Krüger²,
Sven Hartmeier²,
Knut Schwarzer³,
Kai Löwenthal²,
Henning Mersch⁴,
Thomas Dandekar¹ &
…
Robert Giegerich²

15k Accesses
29 Citations
Explore all metrics

Abstract

Background

Today, there is a growing need in bioinformatics to combine available software tools into chains, thus building complex applications from existing single-task tools. To create such workflows, the tools involved have to be able to work with each other's data – therefore, a common set of well-defined data formats is needed. Unfortunately, current bioinformatic tools use a great variety of heterogeneous formats.

Results

Acknowledging the need for common formats, the Helmholtz Open BioInformatics Technology network (HOBIT) identified several basic data types used in bioinformatics and developed appropriate format descriptions, formally defined by XML schemas, and incorporated them in a Java library (BioDOM). These schemas currently cover sequence, sequence alignment, RNA secondary structure and RNA secondary structure alignment formats in a form that is independent of any specific program, thus enabling seamless interoperation of different tools. All XML formats are available at http://bioschemas.sourceforge.net, the BioDOM library can be obtained at http://biodom.sourceforge.net.

Conclusion

The HOBIT XML schemas and the BioDOM library simplify adding XML support to newly created and existing bioinformatic tools, enabling these tools to interoperate seamlessly in workflow scenarios.

The Biological Object Notation (BON): a structured file format for biological data

Article Open access 25 June 2018

Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Encompassing new use cases - level 3.0 of the HUPO-PSI format for molecular interactions

Article Open access 11 April 2018

Background

Today, bioinformatic analyses are becoming more and more complex tasks. Large scale analyses usually integrate several independently calculated results. A good example for such a complex application is automated genome annotation, a process starting from detecting promotor regions and automatically predicting genes to complete functional annotation of novel genomes [1–3]. To accomplish such complex tasks, many different tools are needed. This can lead to problems when trying to find a simple way to combine the necessary tools.

Workflow tools, such as Taverna [4], Wildfire [5] or Pegasys [6], address these problems and provide a user interface to orchestrate bioinformatic tools into complex workflows. These workflow systems mostly require that the programs to be integrated are provided as webservices [7] or need extra programming effort to make the tools interoperate with each other [8].

Webservices provide a well defined programming interface to integrate tools into applications over the internet or other network connections. Currently a growing number of bioinformatic applications are provided as webservices, such as Eclair [9], SOAP-based services provided by the EBI [10], Biosphere [11], AliasServer [12], Soap-HT-BLAST [13], biological SOAP servers and webservices provided by the public sequence data bank [14], INCLUSive [15], and BioMoby [16]. Therefore, using webservices to build complex networked toolchains is a widely accepted solution.

To be able to create workflows consisting of webservices, the involved tools need to work correctly with each other's data – therefore, a common set of well-defined data formats is necessary. Furthermore, it should be possible to validate the correctness of data given in these formats, enabling the system to inform the user of problems caused by the data as early as possible. Unfortunately, current bioinformatic tools use a great variety of heterogeneous formats for reading data and storing their results. Some of these formats, like FASTA [17] for sequence data, CLUSTAL [18] for alignments or Vienna style DotBracket [19] for RNA secondary structure information, were originally designed to be read in a standard text editor. These formats often lack consistency: There are, for example, many different interpretations of the 'correct' FASTA file format. A formal, machine-readable definition of the formats is often missing. For instance, some applications use lower- and uppercase characters in the raw sequence to encode additional annotations in FASTA, while others append extensive information to the header line. Various tools often use the same encoding to store different information. Because of these circumstances automated processing of FASTA can be difficult or even impossible. Of course FASTA isn't designed for this kind of additional annotations, but this fact doesn't prevent people from using it for other purposes, e.g. to store binding site information.

To overcome the problem of identifying, specifying and extracting the desired information, the use of XML [20] for all kinds of bioinformatic data provides a good solution. XML is platform and programming language independent and can be formally defined using the XML Schema language [21]. Modern webservice technology is based on XML [22] and the data exchanged by these webservices also relies on XML (a well known example is the eUtils webservice at NCBI [23, 24]). Therefore, well specified XML data seems to be a useful, modern and accepted technology for exchange of bioinformatic data.

Today, a simple search on the WWW reveals literally dozens of existing XML formats for bioinformatical data, and several sites collecting information about these formats. A well-known example is the XML web site of Paul Gordon [25], which contains links to various XML information sites relevant for molecular biology. But despite the multitude of possible formats, compact and simple data formats for frequently used simple information such as only sequence or alignment data are still missing, as the existing formats are usually quite complex or suffer from other limitations as discussed below.

Furthermore, most of the existing XML formats are defined by XML Document Type Definitions (DTDs) only, but DTDs do not allow specification of conditions and constraints on the content of XML tags. This makes syntactic validation of the actual data described by such formats, e.g sequence data, impossible. Fortunately, a more modern approach is available by usage of XML Schema Definitions (XSD) [21] instead of DTDs, which allow a much more rigid definition of XML syntax. Unfortunately, only very few existing XML formats use this approach, and most of those which do, do not use XSDs potential properly.

An overview of the most prevalent existing XML formats for sequence based bioinformatical data (which was the initial focus of our work) and a short evaluation for determining their general suitability as data formats in workflow scenarios is given in table 1. As can be seen, most current formats in this area do have certain shortcomings. These cause problems when trying to use them within a workflow application scenario.

Table 1 Commonly used XML formats and their features

Full size table

For other application areas like protein interaction, microarray experiments, phylogenetics and systems biology, other formats like e.g. PSI-ML [26], ProML [27], Mage-ML [28], PhyloXML [29] and SBML [30] exist, which have been adopted in varying degrees by their respective communities. These applications are not yet covered by our project.

Implementation

Basic Concepts

Due to the need for common data formats, we have identified several basic data types used throughout the bioinformatic community and created elementary format descriptions (see table 2), formally specified by XML schemas, and a library (BioDOM) to create XML files according to these schemas and additionally convert from prevalent formats to the XML formats and vice versa. These XML formats are extensively used within the HOBIT project to facilitate interoperation between the bioinformatic webservices provided by the project members at several different universities and research institutes throughout Germany. Nevertheless, we would especially like to emphasise the fact that although the formats have initially been defined within the HOBIT project, their use is by no means restricted to this context. On the contrary, they have been explicitly designed to be useful building blocks for any user in the bioinformatic community, and their usage for data exchange between bioinformatic tools is highly encouraged. In the following, we describe some of these XML schemas and show examples of their application.

XML schema structure

Some implementation guidelines were defined for the different HOBIT XML schemas to guarantee consistency in development and results. These guidelines are as follows:

XML schemas grant the ability to validate the payload data, which is not the case in DTDs. Since this ability is important in workflow environments, XML Schema based format definitions are a requirement. Another requirement originating from the distributed workflow scenario is stability. Therefore, only stable specifications can be used. In accordance with the HOBIT guidelines it is mandatory that the format is not bound by a closed licence restriction, but may be used and extended freely.

Active maintenance of formats is also essential, since this is especially important in an area of rapid development like bioinformatics. Likewise it should be possible to extend the format to accomodate special use cases.

Two additional features of formats that we recommend, but not require, are simplicity and usage of building blocks. Both features improve the usability of the format.

We do not necessarily want to replace existing schemas. A new schema was developed only if no available schema was suitable for the given requirements. All the schemas make extensive use of inheritance. HOBIT XML format descriptions are based on two XML schemas, containing elementary types: Basic biological types like amino acid sequences are defined in a collection named BioTypes [31]. The usage of this types in the newly developed XML schemas is illustrated in figure 1. More technical elements, e.g. parameters of a commandline application, often needed in the XML schemas, are collected in the HobitTypes [32]. To allow widespread and diversified use of the schemas, some extension points were incorporated into the schemas. Information unaccounted for in a given version of a schema can easily be added as attribute. This is accomplished by adding XSD anyAttribute declarations to central tags.

Since this opens a possibility for improper extension of a given schema, but reasonable extensions should be considered during validation, a mechanism to support ongoing development was necessary. To fulfill this requirement, a public Wiki page was installed [33]. Every interested person is invited to make suggestions for the improvement of the schemas directly in the Wiki, working cooperatively with other persons to improve the schema definitions. The XML schemas can be obtained from the subversion repository located at [34]. For quality control purposes, changes can be commited only by registered SourceForge [35] BioSchema project [36] members.

Sequence formats

SequenceML

SequenceML deals with all kinds of simple sequence information often used as input for several common bioinformatic tools. It is designed to be used as an XML replacement of the FASTA [17] format, containing all of FASTA's information while avoiding that format's aforementioned consistency problems. SequenceML differentiates between nucleic and amino acid sequences following the IUPAC standard and also allows the user to add free sequence information based on basic types defined by BioTypes [31] (figure 1). SequenceML also supports a mandatory sequence id and an optional detailed sequence description. SequenceML does not contain any annotation information.

SequenceAnnotationML

SequenceAnnotationML is based on SequenceML. While SequenceML contains raw sequence information, SequenceAnnotationML allows additional annotations. Thus, while SequenceML is often used as input for bioinformatic tools, SequenceAnnotationML can be used to store the result. SequenceAnnotationML allows modelling sites of interest of small sequences (DNA, RNA or protein). Furthermore it is possible to encapsulate whole genome annotations due to its recursive structure.

AlignmentML

AlignmentML is a format describing (multiple) alignment information any alignment program like CLUSTALW [18], DCA [37] and Dialign [38] can produce. Similar to SequenceML, different sequence types are supported.

RNA secondary structure formats

RNAStructML

RNAStructML is a format for storing RNA secondary structure information. The most widely used application for RNA tools, such as RNAshapes [39], RNAfold [19] and Mfold [40] is the proprosal of RNA secondary structures, based on thermodynamic principles. RNAStructML is inspired by SequenceML and uses Vienna style DotBracket strings for storing information about RNA secondary structures.

RNAStructAlignmentML

RNAStructAlignmentML is a format for storing RNA secondary structure alignments as computed by e.g. RNAforester [41] or RNAalifold [42]. RNAStructAlignmentML uses an RNAStructML-like architecture, but is based on AlignmentML instead of SequenceML.

BioDOM

To simplify the usage of the HOBIT XML formats, an easy-to-use Java library (BioDOM) has been developed. BioDOM provides an easy way to build XML files following the HOBIT format descriptions from inside the user's own programs. It is designed to be a modular system which can easily be extended as necessary to accomodate new formats. Additionally, BioDOM provides functions to convert native non-XML output of various bioinformatic tools to the HOBIT XML formats.

The BioDOM library contains one Java class for each natively supported XML format, which implements methods to create the corresponding data structure by adding the necessary parts to the new document or importing data from ordinary data formats to XML elements.

Each of these classes is based on the abstract class AbstractBioDOM, which provides commonly required methods for all converters, e.g. for setting and getting the documents object model (DOM) content, validating the document against an XML schema or creating a string representation of the XML data contained in the object. AbstractBioDOM also provides a general mechanism for XML-to-XML format conversion via XSLT [43] scripts. Finally, some methods for accessing the logging and error/exception handling facilities of the BioDOM library are integrated. This allows for graceful degradation of the system and user notification in case of erroneous input data or unforeseen circumstances during data creation or conversion.

The current version 1.2 of BioDOM supports the HOBIT XML formats SequenceML, AlignmentML, RNAStructML and RNAStructAlignmentML, allowing creation of documents in these formats and, additionally, conversion from and to (multiple) FASTA, CLUSTALW and the Vienna style DotBracket format. XSLT converters for TinySeq [44], INSDseq [45] and EMBLxml [45] are also provided. Owing to its modular design, BioDOM can very easily be extended by third party XSL scripts or own Java classes. Furthermore it is under constant development and testing to support additional data formats.

Results & Discussion

HOBIT XML formats

Recent approaches like RNAML [46] or PDBML [47] try to model all possible aspects of a complete field of application in one single XML schema. In the case of RNA this means storing sequences, secondary structures, tertiary structures and even all kinds of experimental data in only one file.

Most programs focus on a specific application, and in this case storing all data leads to huge and unmanageable XML documents, compare e.g. the size of NCBI ASN.1 formatted GenBank data versus the same data's XML representation, which can be about one order of magnitude (or even more) bigger. While keeping the whole complex data in one file might be appropriate for archiving, data exchange in distributed workflow scenarios requires the information to be available in a compact format to allow short response times. This compact format can be achieved in two ways. One is to declare many parts in a complex XML schema 'optional' to reduce the overhead for the different usage scenarios, the other is to use more simple schemas representing small and simple building blocks. The HOBIT XML schemas are designed following the building blocks concept, which is, from our point of view, the preferred solution for the application in workflow systems. Our rationale for this is as follows:

While it is of course possible to extract the data from larger XML data structures using XSLT or XPath expressions, it is much more difficult to generate workflows from webservice modules exchanging general XML data. The extraction patterns have to be defined either by the webservice developer or the person who builds the workflow. In the first case, the webservice is called with an XML file conforming to the complex XML schema as input, but the input doesn't explicitly represent the data the webservice really requires. The second case would require the person building the workflow to have the knowledge to define the extraction patterns for every connection between webservices to handle the in- and output. By using small XML schemas, only the data really used by the webservice is modelled. This makes the handling much more intuitive for the typical non-XML-savvy user of a workflow system.

Additionally, declaring many parts as 'optional' in larger XML schemas leads to a much more complex validation process, e.g. if the XML content elements a webservice requires as input are declared as 'optional' in the XML schema of the previous output data, the service can detect the existence of the required data only at runtime.

To illustrate the problems of using general XML schemas in workflow systems, consider a simple workflow for predicting and drawing the secondary structure of a given RNA sequence. The workflow consists of two webservices, the first one a service for predicting the secondary structure and the second one a service that creates a picture from the prediction afterwards. In the case of general XML schemas, like RNAML [46], the first webservice would require XML input data conforming to the general schema and produce XML output data conforming to the same general schema, but adding the RNA secondary structure information. Both XMLs can be validated with the same complex schema, but it is not known whether the output really contains the required data until the second webservice is called. In the case of using building blocks the first service requires SequenceML as input and produces RNAStructML as output, the second service requires RNAStructML as input and generates an image as output. Due to these input/output requirements, the webservices can only be connected if the output of the first service contains the data the second service requires, and this enables the system to validate the correctness of the connections between single services during the construction process of the workflow. Therefore, possible type errors can be detected and prevented before a (possibly very expensive or time-consuming) actual run of the workflow. (Of course, errors due to problematic data produced during workflow execution are still possible and can in principle not be prevented.)

Another problem not directly adressed by the HOBIT XML formats themselves is the task of finding and selecting the relevant information from existing XML archive formats to acquire only the data needed for a specific application. After this task has been accomplished, the lightweight HOBIT formats can then be used for further transfer of the data inside a workflow. There are several approaches to solving the data selection problem. Within the BioDOM framework, XSL scripts can be used to select and transform the desired information. If another solution already exists for a specific use case (like e.g. [8]), it can easily be combined with our work.

The HOBIT XML schemas are open for extensions and especially new formats. An open community environment based on a Wiki system is provided to allow discussion and rapid development [48]. For collaborative XML schema enhancements, all XML schemas are available at a subversion repository [34]. By providing open access to our XML schemas, we try to address the common problem of data format extension, e.g. adding features to a existing data format vs. having to define a new one.

Easy integration

We have implemented a Java package (BioDOM), which can be used to build the supported XML formats directly from an application's internal data structures. BioDOM can also convert between different XML formats (HOBIT defined and others) using XSLT. In addition to these functions, there are conversion functions for many commonly used non-XML formats, which allow traditional tools and services a smooth transition from their data formats towards the XML formats presented in this study.

Figure 3 shows a typical use case for integrating BioDOM as a library. In addition to this, BioDOM can also be used online to convert bioinformatic data automatically by using the BioDOM webservice [49]. This is especially useful for current workflow systems to connect webservices with incompatible, but convertable data, without any programming effort. For testing purposes and small amounts of data, the conversion functions of BioDOM can additionally be used manually by using the online submission form [49].

For now, BioDOM is focused primarily on providing support for creation of data in the HOBIT XML formats plus simple conversion routines from existing well-known non-XML formats. It is intended to add interfaces to BioJava [50] objects and methods in later versions. Implementations of the BioDOM API in other programming languages are also planned.

SOAP based webservices

Table 3 shows a list of available webservices supporting HOBIT XML formats as input and ouput. Currently (June 2006), ten webservices from different bioinformatic fields are available. A constantly updated list of offered webservices can be obtained from the HOBIT website [51].

Table 3 Available webservices supporting HOBIT XML formats

Full size table

Workflow systems

More and more bioinformatic tools offered today provide a webservice interface [9–16] in addition to a browser based user interface. In the HOBIT network bioinformatic applications and resources are connected in a uniform way. Webservices are the method of choice for building workflows using bioinformatic tools from different locations.

Until now connecting webservices is mostly a problem of incompatible data formats which require more or less complex data conversion. The BioDOM library remedies this problem by allowing easy conversion of ordinary formats to HOBIT XML formats. Using standardized input and output formats simplifies combining webservices to a pipeline, thus one webservice choreography (composed of different tasks) could easily be used multiple times without any user interaction. An example is a simple pipeline of SOAPDB offered by DKFZ-Heidelberg and e2g [52] offered by BiBiServ (see figure 2). A sequence is retrieved from the EMBL database via SOAPDB. This sequence is transformed to SequenceML via BioDOM and pushed to the e2g webservice, which maps it to the genomic sequence of the specified species. The result is presented as alignments corresponding to the EBIApplicationResult schema, which is also used by the EBI Blast webservice [10]. Figure 2 shows e2g as an (internal) workflow using RepeatMasker [53], vmatch [54, 55] and genscan [56].

Current workflow systems normally require additional programming effort to integrate existing applications without webservice interfaces into complex tool chains. There are two possibilities to minimize the neccessary work using BioDOM:

One can either use BioDOM as a local converter of traditional data, or apply it as a wrapper for a webservice based approach to produce verifiable input and output with the additional value gained by the HOBIT formats (for an example of this approach, see [49]).

Furthermore, new bioinformatic applications can integrate BioDOM directly to support the provided XML formats natively. This eliminates most of the additional effort for workflow integration of the tool.

Conclusion

Using the HOBIT XML schemas and the BioDOM library, it is easy to add XML support to newly created and existing bioinformatic tools, enabling these tools to seamlessly interoperate in workflow scenarios.

Availability and requirements

The XML schemas of the HOBIT XML formats can be found on the project's website at http://bioschemas.sourceforge.net/.

The BioDOM library is freely available for download in the stable and development versions at http://biodom.sourceforge.net/ and may be included in external programs under the conditions of the Apache Licence 2.0 (BioDOM requires at least Java Version 1.5, which can be downloaded from http://java.sun.com/).

Table 4 shows the WSDL locations of SOAP based webservices supporting the HOBIT XML schemas described before.

Table 4 WSDL location of SOAP based webservices

Full size table

Table 2 Comparison of native formats and their HOBIT XML counterparts

Full size table

Abbreviations

HOBIT:: Helmholtz Open Bioinformatics Technology – a cooperative project of several german bioinformatic centres at universities and public research institutes for coordination and interconnection of bioinformatic activities
XML:: eXtensible Markup Language
WSDL:: WebService Description Language
XSL:: eXtensible Stylesheet Language
XSLT:: XSL Transformations
DOM:: Document Object Model

References

Mewes HW, Frishman D, Mayer KFX, Münsterkötter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stümpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 2006, (34 Database):D169-D172. 10.1093/nar/gkj148
Google Scholar
Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Pühler A: GenDB-an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31(8):2187–2195. 10.1093/nar/gkg312
Article PubMed Central CAS PubMed Google Scholar
Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Res 2005, (33 Database):D308-D310.
Google Scholar
Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004, 20(17):3045–3054. 10.1093/bioinformatics/bth361
Article CAS PubMed Google Scholar
Tang F, Chua CL, Ho LY, Lim YP, Issac P, Krishnan A: Wildfire: distributed, Grid-enabled workflow construction and execution. BMC Bioinformatics 2005, 6: 69. 10.1186/1471-2105-6-69
Article PubMed Central PubMed Google Scholar
Shah SP, He DYM, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GXY, Xu T, Ouellette BFF: Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinformatics 2004, 5: 40. 10.1186/1471-2105-5-40
Article PubMed Central PubMed Google Scholar
Stein L: Creating a bioinformatics nation. Nature 2002, 417(6885):119–120. 10.1038/417119a
Article CAS PubMed Google Scholar
Chagoyen M, Kurul ME, De-Alarcón PA, Carazo JM, Gupta A: Designing and executing scientific workflows with a programmable integrator. Bioinformatics 2004, 20(13):2092–2100. 10.1093/bioinformatics/bth209
Article CAS PubMed Google Scholar
Rudd S, Tetko IV: Eclair-a web service for unravelling species origin of sequences sampled from mixed host interfaces. Nucleic Acids Res 2005, (33 Web Server):W724-W727. 10.1093/nar/gki434
Google Scholar
Pillai S, Silventoinen V, Kallio K, Senger M, Sobhany S, Tate J, Velankar S, Golovin A, Henrick K, Rice P, Stoehr P, Lopez R: SOAP-based services provided by the European Bioinformatics Institute. Nucleic Acids Res 2005, (33 Web Server):W25-W28. 10.1093/nar/gki491
Google Scholar
Cheung KH, de Knikker R, Guo Y, Zhong G, Hager J, Yip KY, Kwan AKH, Li P, Cheung DW: Biosphere: the interoperation of web services in microarray cluster analysis. Appl Bioinformatics 2004, 3(4):253–256. 10.2165/00822942-200403040-00007
Article CAS PubMed Google Scholar
Iragne F, Barré A, Goffard N, Daruvar AD: AliasServer: a web server to handle multiple aliases used to refer to proteins. Bioinformatics 2004, 20(14):2331–2332. 10.1093/bioinformatics/bth241
Article CAS PubMed Google Scholar
Wang J, Mu Q: Soap-HT-BLAST: high throughput BLAST based on Web services. Bioinformatics 2003, 19: 1863–1864. 10.1093/bioinformatics/btg244
Article CAS PubMed Google Scholar
Sugawara H, Miyazaki S: Biological SOAP servers and web services provided by the public sequence data bank. Nucleic Acids Res 2003, 31(13):3836–3839. 10.1093/nar/gkg558
Article PubMed Central CAS PubMed Google Scholar
Coessens B, Thijs G, Aerts S, Marchal K, Smet FD, Engelen K, Glenisson P, Moreau Y, Mathys J, Moor BD: INCLUSive: A web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res 2003, 31(13):3468–3470. 10.1093/nar/gkg615
Article PubMed Central CAS PubMed Google Scholar
Wilkinson MD, Links M: BioMOBY: an open source biological web services proposal. Brief Bioinform 2002, 3(4):331–341. 10.1093/bib/3.4.331
Article PubMed Google Scholar
Pearson WR: Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol 1994, 24: 307–331.
CAS PubMed Google Scholar
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680.
Article PubMed Central CAS PubMed Google Scholar
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly 1994, 125(2):167–188. [http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/BF00818163] 10.1007/BF00818163
Article CAS Google Scholar
XML[http://www.w3.org/XML/]
XML Schema language[http://www.w3.org/XML/Schema]
WebServices technology based on XML[http://www.w3.org./2002/ws]
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kapustin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2006, (34 Database):D173-D180. 10.1093/nar/gkj158
Google Scholar
NCBI Entrez Programming Utilities[http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutilshelp.html]
XML for Molecular Biology as compiled by Paul Gordon[http://www.visualgenomics.ca/gordonp/xml/]
Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, Roechert B, Poux S, Jung E, Mersch H, Kersey P, Lappe M, Li Y, Zeng R, Rana D, Nikolski M, Husi H, Brun C, Shanker K, Grant SGN, Sander C, Bork P, Zhu W, Pandey A, Brazma A, Jacq B, Vidal M, Sherman D, Legrain P, Cesareni G, Xenarios I, Eisenberg D, Steipe B, Hogue C, Apweiler R: The HUPO PSI's molecular interaction format-a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22(2):177–183. 10.1038/nbt926
Article CAS PubMed Google Scholar
Hanisch D, Zimmer R, Lengauer T: ProML-the protein markup language for specification of protein sequences, structures and families. In Silico Biol 2002, 2(3):313–324.
CAS PubMed Google Scholar
Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ, Brazma A: Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 2002, 3(9):RESEARCH0046. 10.1186/gb-2002-3-9-research0046
Article PubMed Central PubMed Google Scholar
PhyloXML[http://www.phyloxml.org]
Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Novère NL, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J, Forum SBML: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19(4):524–531. 10.1093/bioinformatics/btg015
Article CAS PubMed Google Scholar
BioTypes[http://bioschemas.sourceforge.net/index.php/BioTypes]
HobitTypes[http://bioschemas.sourceforge.net/index.php/HobitTypes]
HOBIT Wiki[http://hobit.gsf.de/wiki]
BioSchemas Subversion Repository[http://bioschemas.sourceforge.net/].
Sourceforge website[http://www.sourceforge.net.]
BioSchemas SourceForge Project Site[http://sourceforge.net/projects/bioschemas/.]
Stoye J: Multiple sequence alignment with the Divide-and-Conquer method. Gene 1998, 211(2):GC45-GC56. 10.1016/S0378-1119(98)00097-3
Article CAS PubMed Google Scholar
Morgenstern B: DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res 2004, (32 Web Server):W33-W36.
Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500–503. 10.1093/bioinformatics/btk010
Article CAS PubMed Google Scholar
Zuker M: Computer prediction of RNA structure. Methods Enzymol 1989, 180: 262–288.
Article CAS PubMed Google Scholar
Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures. Proc IEEE Comput Soc Bioinform Conf 2003, 2: 159–168.
PubMed Google Scholar
Hofacker IL, Fekete M, Stadler PF: Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319(5):1059–1066. 10.1016/S0022-2836(02)00308-X
Article CAS PubMed Google Scholar
XSL Transformations (XSLT)[http://www.w3.org/TR/xslt]
TinySeq[http://www.ncbi.nlm.nih.gov/IEB/ToolBox/XML/ncbixml.txt]
INSDseq/EMBLxml[http://www.ebi.ac.uk/xembl/]
Waugh A, Gendron P, Altman R, Brown JW, Case D, Gautheret D, Harvey SC, Leontis N, Westbrook J, Westhof E, Zuker M, Major F: RNAML: a standard syntax for exchanging RNA information. RNA 2002, 8: 707–717. 10.1017/S1355838202028017
Article PubMed Central CAS PubMed Google Scholar
Westbrook J, Ito N, Nakamura H, Henrick K, Berman HM: PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics 2005, 21(7):988–992. 10.1093/bioinformatics/bti082
Article CAS PubMed Google Scholar
BioSchemas[http://bioschemas.sourceforge.net/]
BioDOM WebService[http://bibiserv.techfak.uni-bielefeld.de/biodomws/]
Mangalam H: The Bio* toolkits-a brief overview. Brief Bioinform 2002, 3(3):296–302. 10.1093/bib/3.3.296
Article PubMed Google Scholar
HOBIT website[http://hobit.sourceforge.net]
Krüger J, Sczyrba A, Kurtz S, Giegerich R: e2g: an interactive web-based server for efficiently mapping large EST and cDNA sets to genomic sequences. Nucleic Acids Res 2004, (32 Web Server):W301-W304.
Google Scholar
RepeatMasker[http://www.repeatmasker.org]
Abouelhoda M, Kurtz S, Ohlebusch E: Replacing Suffix Trees with Enhanced Suffix Arrays. J Discrete Algo 2004, 2: 53–86. 10.1016/S1570-8667(03)00065-0
Article Google Scholar
Vmatch[http://www.vmatch.de]
Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268: 78–94. 10.1006/jmbi.1997.0951
Article CAS PubMed Google Scholar
Comparison of bioinformatic XML schemas[http://hobit.gsf.de/wiki/display/wiki/XML+Schemas]
Schultz J, Müller T, Achtziger M, Seibel PN, Dandekar T, Wolf M: The internal described spacer 2 database – a web server for (not only) low level phylogentic analyses. Nucleic Acids Res 2006, in press.
Google Scholar
Schultz J, Maisel S, Gerlach D, Müller T, Wolf M: A common core of secondary structure of the internal transcribed spacer 2 (ITS2) throughout the Eukaryota. RNA 2005, 11(4):361–364. 10.1261/rna.7204505
Article PubMed Central CAS PubMed Google Scholar
Wolf M, Achtziger M, Schultz J, Dandekar T, Müller T: Homology modeling revealed more than 20,000 rRNA internal transcribed spacer 2 (ITS2) secondary structures. RNA 2005, 11(11):1616–1623. 10.1261/rna.2144205
Article PubMed Central CAS PubMed Google Scholar
Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics 2004, 5: 104. 10.1186/1471-2105-5-104
Article PubMed Central PubMed Google Scholar
Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res 2001, 29(22):4633–4642. 10.1093/nar/29.22.4633
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

We thank Martin Szugat (Teaching and Research Unit Bioinformatics of the LMU Munich), initiator and maintainer of BioTypes, for his work and all other HOBIT members for helpful discussions about the XML schemas. Further we gratefully acknowledge the funding from the "Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.".

Author information

Authors and Affiliations

Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Philipp N Seibel & Thomas Dandekar
Bioinformatics Group, Practical Computer Science Department, Faculty of Technology, Bielefeld University, Bielefeld, Germany
Jan Krüger, Sven Hartmeier, Kai Löwenthal & Robert Giegerich
Department of Bioinformatics, UKG, University of Göttingen, Göttingen, Germany
Knut Schwarzer
Distributed Systems and Grid Computing, Central Institute for Applied Mathematics, Research Centre Jülich, Jülich, Germany
Henning Mersch

Authors

Philipp N Seibel
View author publications
You can also search for this author in PubMed Google Scholar
Jan Krüger
View author publications
You can also search for this author in PubMed Google Scholar
Sven Hartmeier
View author publications
You can also search for this author in PubMed Google Scholar
Knut Schwarzer
View author publications
You can also search for this author in PubMed Google Scholar
Kai Löwenthal
View author publications
You can also search for this author in PubMed Google Scholar
Henning Mersch
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Dandekar
View author publications
You can also search for this author in PubMed Google Scholar
Robert Giegerich
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jan Krüger.

Additional information

Authors' contributions

PNS, KS, JK conceived the study. HM provided the initial work. XML schema design and BioDOM implementation by HM, SH, KL, JK, PNS and KS. PNS, KS, JK, KL and SH drafted the manuscript. RG and TD participated in study design and coordination. All authors read and approved the final version of the manuscript.

Philipp N Seibel, Jan Krüger, Sven Hartmeier contributed equally to this work.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Seibel, P.N., Krüger, J., Hartmeier, S. et al. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics 7, 490 (2006). https://doi.org/10.1186/1471-2105-7-490

Download citation

Received: 21 June 2006
Accepted: 06 November 2006
Published: 06 November 2006
DOI: https://doi.org/10.1186/1471-2105-7-490

XML schemas for common bioinformatic data types and their application in workflow systems

Abstract

Background

Results

Conclusion

Similar content being viewed by others

The Biological Object Notation (BON): a structured file format for biological data

Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Encompassing new use cases - level 3.0 of the HUPO-PSI format for molecular interactions

Background

Implementation

Basic Concepts

XML schema structure

Sequence formats

SequenceML

SequenceAnnotationML

AlignmentML

RNA secondary structure formats

RNAStructML

RNAStructAlignmentML

BioDOM

Results & Discussion

HOBIT XML formats

Easy integration

SOAP based webservices

Workflow systems

Conclusion

Availability and requirements

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation