A graph-based approach for designing extensible pipelines
In bioinformatics, it is important to build extensible, low-maintenance systems that can handle the new tools and data formats that are constantly being developed. The traditional and simplest way to implement a pipeline is to hardcode its execution steps into programs or scripts. This approach causes problems as a pipeline expands, because incorporating new tools is often error-prone and time-consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are repeated systematically without significant changes in their course of execution, such as genome annotation. However, pipeline composition needs to be more dynamic when each execution requires a different combination of steps.
We propose a graph-based approach to implementing extensible and low-maintenance pipelines. It is suitable for pipeline applications with multiple functionalities that require a different combination of steps in each execution. Here, pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but the approach is also helpful in other fields where comprehensive analyses require multiple software packages, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.
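The graph representation described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Java implementation. It assumes that file formats are the nodes and conversion tools are the directed edges, and it uses a breadth-first search as one plausible way to find a path; the source does not state which search the authors use. The tool names and the `build_graph` and `compose_pipeline` helpers are hypothetical.

```python
from collections import deque

# Hypothetical converters: (tool name, input format, output format).
# Formats are the graph nodes and tools are the directed edges.
TOOLS = [
    ("arlequin2fasta", "ARLEQUIN", "FASTA"),
    ("fasta2phylip", "FASTA", "PHYLIP"),
    ("phylip2nexus", "PHYLIP", "NEXUS"),
]


def build_graph(tools):
    """Map each input format to its outgoing (tool, output format) edges."""
    graph = {}
    for name, src, dst in tools:
        graph.setdefault(src, []).append((name, dst))
    return graph


def compose_pipeline(graph, source, target):
    """Return the shortest list of tool names converting source to target.

    Returns [] when source == target and None when no path exists.
    """
    if source == target:
        return []
    queue = deque([source])
    # For each reached format, remember the (previous format, tool) that led to it.
    came_from = {source: None}
    while queue:
        fmt = queue.popleft()
        for tool, nxt in graph.get(fmt, []):
            if nxt in came_from:
                continue
            came_from[nxt] = (fmt, tool)
            if nxt == target:
                # Walk back to the source to recover the tool sequence.
                steps = []
                node = nxt
                while came_from[node] is not None:
                    prev, used = came_from[node]
                    steps.append(used)
                    node = prev
                return steps[::-1]
            queue.append(nxt)
    return None
```

In this sketch, `compose_pipeline(build_graph(TOOLS), "ARLEQUIN", "NEXUS")` chains all three converters even though no single tool performs that conversion. Registering one more tuple in `TOOLS` makes every path through the new edge available without editing any existing pipeline, which is the extensibility property the approach aims for.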
Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
- Open Access
- Online Date: July 2012
- Publisher: BioMed Central