A graph-based approach for designing extensible pipelines
In bioinformatics, it is important to build extensible and low-maintenance systems that can deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline expands, because incorporating new tools is often error-prone and time-consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamic pipeline composition is necessary when each execution requires a different combination of steps.
We propose a graph-based approach to implementing extensible and low-maintenance pipelines. It is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where multiple software tools are needed to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.
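The graph representation described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java implementation. The format names (`A` to `E`), the tool names, and the functions `build_graph` and `compose_pipeline` are all hypothetical, and breadth-first search is used here only as one simple way to find a path; the paper's own data structures and algorithm may differ.

```python
from collections import deque

# Hypothetical tools: (tool name, input format, output format).
# Each tool is a graph edge; each format is a graph node.
TOOLS = [
    ("a_to_b", "A", "B"),
    ("b_to_c", "B", "C"),
    ("a_to_d", "A", "D"),
    ("c_to_e", "C", "E"),
]


def build_graph(tools):
    """Map each format to the list of (tool, output format) edges leaving it."""
    graph = {}
    for name, src, dst in tools:
        graph.setdefault(src, []).append((name, dst))
        graph.setdefault(dst, [])
    return graph


def compose_pipeline(graph, source, target):
    """Return the shortest list of tool names from source to target, or None."""
    if source not in graph or target not in graph:
        return None
    queue = deque([source])
    came_from = {source: None}  # node -> (previous node, tool used to reach it)
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for tool, nxt in graph[node]:
            if nxt not in came_from:
                came_from[nxt] = (node, tool)
                queue.append(nxt)
    if target not in came_from:
        return None
    # Walk back from the target to recover the sequence of tools.
    steps = []
    node = target
    while came_from[node] is not None:
        node, tool = came_from[node]
        steps.append(tool)
    steps.reverse()
    return steps


GRAPH = build_graph(TOOLS)
print(compose_pipeline(GRAPH, "A", "E"))  # ['a_to_b', 'b_to_c', 'c_to_e']
```

In this sketch, registering a new tool only means appending one tuple to `TOOLS`; no existing pipeline definition has to be edited, which is the extensibility property the approach aims for.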
Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
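The claim that combining conversion tools increases the number of available conversions can be made concrete with a small, hedged example. The converters below are hypothetical `(input format, output format)` pairs, and `reachable_pairs` is an illustrative helper, not part of the published system. It counts every source and target pair connected by a chain of converters.

```python
# Hypothetical converters as (input format, output format) pairs.
CONVERTERS = [("A", "B"), ("B", "C"), ("C", "D")]


def reachable_pairs(converters):
    """Return all (source, target) pairs connected by a chain of converters."""
    successors = {}
    for src, dst in converters:
        successors.setdefault(src, set()).add(dst)
    pairs = set()
    for start in successors:
        # Depth-first traversal from each format that has an outgoing converter.
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors.get(node, ()))
        pairs.update((start, node) for node in seen if node != start)
    return pairs


direct = set(CONVERTERS)
combined = reachable_pairs(CONVERTERS)
print(len(direct), len(combined))  # 3 6
```

With three chained converters, six conversions become available. Adding a single hypothetical converter `("D", "E")` raises the total to ten, because every format upstream of `D` gains a route to `E`.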
- Open Access
- Available under Open Access This content is freely available online to anyone, anywhere at any time.
- Online Date: July 2012
- BioMed Central