Abstract
Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss techniques for scaling computations through parallelization of calculations, after giving a quick overview of advanced programming techniques. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is to introduce poor man’s parallelization by running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a “box,” or virtual machine (VM). Such a VM can flexibly be deployed on multiple computers, in a local network, e.g., on existing desktop PCs, and even in the Cloud, to create a “virtual” computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel, running processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster, and pipeline, in a few steps. This allows researchers to scale-up computations from their desktop, using available hardware, anytime it is required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200 bioinformatics and statistical software packages, of interest to evolutionary biology, are included, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software. Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images, for multiple targets, through one central project. BioNode can be deployed on Windows, OSX, Linux, and in the Cloud. Next to the downloadable BioNode images, we provide tutorials online, which empower bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives, on creating and building such images.
Availability: The 32-bit and 64-bit BioNode desktop images for VirtualBox and the BioNode Cloud images are based on free and open source software and can be found at http://www.evolutionarygenomics.net/ and http://biobeat.org/bionode.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Ronquist F & Huelsenbeck J P (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574
Eddy S R (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 4:e1000069p
Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556
Doctorow C (2008) Big data: welcome to the petacentre. Nature 455:16–21.
Durbin R M, Abecasis G R, Altshuler D L et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073
Kosiol C & Anisimova M (2012) Selection on the protein coding genome. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
Schadt E E, Linderman M D, Sorenson J, Lee L & Nolan G P (2010) Computational solutions to large-scale data management and analysis. Nat Rev Genet. 11:647–657
Trelles O, Prins P, Snir M & Jansen R C (2012) Big data, but are we ready?. Nat Rev Genet. 12:224p. http://www.ncbi.nlm.nih.gov/pubmed/21301471
Patterson D A & Hennessy J L (1998) Computer organization and design (2nd ed.): the hardware/software interface. Morgan Kaufmann Publishers Inc
Mattson T, Sanders B & Massingill B (2004) Patterns for parallel programming. Addison-Wesley Professional, 384 pages. http://portal.acm.org/citation.cfm?id=1406956
Graham R L, Woodall T S & Squyres J M (2005) Open MPI: a flexible high performance MPI
Stamatakis A & Ott M (2008) Exploiting fine-grained parallelism in the phylogenetic likelihood function with mpi, pthreads, and openmp: a performance study. Pattern Recognition in Bioinformatics, Springer Berlin/Heidelberg, 424–435. http://dx.doi.org/10.1007/978-3-540-88436-1_36
Tierney L, Rossini A & Li N (2009) Snow: a parallel computing framework for the R system. International Journal of Parallel Programming 37:78–90. http://dx.doi.org/10.1007/s10766-008-0077-2
Cesarini F & Thompson S (2009) Erlang programming. 1st. O'Reilly Media, Inc.
Peyton Jones S (2003) The Haskell 98 language and libraries: the revised report. Journal of Functional Programming 13:0--255
Odersky M, Altherr P, Cremet V et al. (2004) An overview of the Scala programming language. LAMP-EPFL
Okasaki C (1998) Purely functional data structures. Cambridge University Press, doi:10.2277/0521663504
Alexandrescu A (2010) The D programming language. 1st. Addison-Wesley Professional, 460p
Griesemer R, Pike R & Thompson K (2009) The Go programming language. http://golang.org
Hoare C A R (1978) Communicating sequential processes. Commun. ACM 21:666--677. doi:http://doi.acm.org/10.1145/359576.359585
Welch P, Aldous J & Foster J (2002) Csp networking for java (jcsp. net). Computational ScienceICCS 2002. 695--708
Sufrin B (2008) Communicating scala objects. Communicating Process Architectures. 35p
Dean J & Ghemawat S (2008) MapReduce: Simplified data processing on large clusters. Communications of the ACM 51:107--113
White T (2009) Hadoop: the definitive guide. first edition. O'Reilly, http://oreilly.com/catalog/9780596521981
May P, Ehrlich H & Steinke T (2006) Zib structure prediction pipeline: composing a complex biological workflow through web services. Euro-Par 2006 Parallel Processing, Springer Berlin/Heidelberg, 1148–1158. http://dx.doi.org/10.1007/11823285_121
Mungall C J, Misra S, Berman B P et al. (2002) An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol. 3:RESEARCH0081p. http://www.ncbi.nlm.nih.gov/pubmed/12537570
Prins P, Smant G, & Jansen R (2012) Genetical genomics for evolutionary studies. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
Möller S, Krabbenhoft H N, Tille A et al. (2010) Community-driven computational biology with debian linux. BMC Bioinformatics 11(Suppl 12):S5p. http://www.ncbi.nlm.nih.gov/pubmed/21210984
Li P (2009) Exploring virtual environments in a decentralized lab. ACM SIGITE Newsletter 6:4--10
Tikotekar A, Ong H, Alam S et al. (2009) Performance comparison of two virtual machine scenarios using an hpc application: a case study using molecular dynamics simulations. Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, ACM, 33--40. doi:http://doi.acm.org/10.1145/1519138.1519143
Prins P, Belhachemi D & Möller S (2011) BioNode tutorial. http://biobeat.org/bionode
Altschul S F, Madden T L, Schaffer A A et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Edgar R C (2004) Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797. doi:10.1093/nar/gkh340
Schneider A, Souvorov A, Sabath N et al. (2009) Estimates of positive darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol. 1:114–118. doi:10.1093/gbe/evp012
Pond S L, Frost S D & Muse S V (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676–679. http://www.ncbi.nlm.nih.gov/pubmed/15509596
Gentzsch W (2002) Sun grid engine: towards creating a compute power grid. Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on, IEEE, 35--36
Staples G (2006) Torque resource manager. Proceedings of the 2006 ACM/IEEE conference on Supercomputing, ACM, doi:http://doi.acm.org/10.1145/1188455.1188464
Openstack open source cloud computing software. http://www.openstack.org
Nurmi D, Wolski R, Grzegorczyk C et al. (2009) The Eucalyptus open-source cloud-computing system. Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, IEEE Computer Society, 124--131
Matthews S J & Williams T L (2010) Mrsrf: an efficient mapreduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics 11 Suppl 1:S15p
Acknowledgments
The European Commission’s Integrated Project BIOEXPLOIT (FOOD-2005-513959 to G.S. and P.P.); the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to P.P.).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Prins, P., Belhachemi, D., Möller, S., Smant, G. (2012). Scalable Computing for Evolutionary Genomics. In: Anisimova, M. (eds) Evolutionary Genomics. Methods in Molecular Biology, vol 856. Humana Press. https://doi.org/10.1007/978-1-61779-585-5_22
Download citation
DOI: https://doi.org/10.1007/978-1-61779-585-5_22
Published:
Publisher Name: Humana Press
Print ISBN: 978-1-61779-584-8
Online ISBN: 978-1-61779-585-5
eBook Packages: Springer Protocols