Scalable Computing for Evolutionary Genomics

Prins, Pjotr; Belhachemi, Dominique; Möller, Steffen; Smant, Geert

doi:10.1007/978-1-61779-585-5_22

Part of the book series: Methods in Molecular Biology (MIMB, volume 856)

Abstract

Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss techniques for scaling computations through parallelization of calculations, after giving a quick overview of advanced programming techniques. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is to introduce poor man’s parallelization by running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a “box,” or virtual machine (VM). Such a VM can be deployed flexibly on multiple computers in a local network, e.g., on existing desktop PCs, and even in the Cloud, to create a “virtual” computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel as processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200 bioinformatics and statistical software packages of interest to evolutionary biology are included, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software.
Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images, for multiple targets, through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. In addition to the downloadable BioNode images, we provide online tutorials that enable bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.
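
As an illustration of the "poor man's parallelization" idea described above, the following is a minimal Python sketch that runs whole programs side by side as separate operating-system processes. It is not taken from BioNode or its configuration scripts, and it uses a simple in-process worker pool rather than a real job scheduler. The function names (run_job, run_in_parallel, example_commands) and the stand-in commands are hypothetical; in practice each command would be the full command line of an existing program applied to one input file.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one whole program as a separate OS process.

    Returns a tuple (exit code, captured standard output).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def run_in_parallel(commands, max_workers=4):
    """Launch each command as its own process, at most max_workers at a time.

    The threads only wait on the child processes; the actual work is done
    in the separate processes. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


def example_commands(n=4):
    # Hypothetical stand-in jobs: each one is a tiny Python program that
    # prints a square. A real pipeline would list one command line per
    # input file for an unmodified (legacy) analysis program.
    return [[sys.executable, "-c", f"print({i} * {i})"] for i in range(n)]


if __name__ == "__main__":
    for exit_code, output in run_in_parallel(example_commands()):
        print(exit_code, output)
```

Because each job is an unmodified program started as its own process, no changes to the program itself are needed. On a cluster, the same list of commands would typically be handed to a job scheduler instead of a local worker pool.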

Availability: The 32-bit and 64-bit BioNode desktop images for VirtualBox and the BioNode Cloud images are based on free and open source software and can be found at http://www.evolutionarygenomics.net/ and http://biobeat.org/bionode.

Acknowledgments

This work was supported by the European Commission’s Integrated Project BIOEXPLOIT (FOOD-2005-513959 to G.S. and P.P.) and the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to P.P.).

Author information

Authors and Affiliations

Laboratory of Nematology, Wageningen University, Wageningen, The Netherlands
Pjotr Prins & Geert Smant
Groningen Bioinformatics Centre, University of Groningen, Groningen, The Netherlands
Pjotr Prins
Section of Biomedical Image Analysis, Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA
Dominique Belhachemi
Department of Dermatology, University Clinics of Schleswig-Holstein, formerly University of Lübeck, Institute for Neuro- and Bioinformatics, Lübeck, Germany
Steffen Möller

Corresponding author

Correspondence to Pjotr Prins.

Editor information

Editors and Affiliations

Department of Computer Science, ETH Zürich, Universitätsstr. 6, Zürich, 8092, Switzerland
Maria Anisimova

Cite this protocol

Prins, P., Belhachemi, D., Möller, S., Smant, G. (2012). Scalable Computing for Evolutionary Genomics. In: Anisimova, M. (ed.) Evolutionary Genomics. Methods in Molecular Biology, vol 856. Humana Press. https://doi.org/10.1007/978-1-61779-585-5_22

DOI: https://doi.org/10.1007/978-1-61779-585-5_22
Published: 31 January 2012
Publisher Name: Humana Press
Print ISBN: 978-1-61779-584-8
Online ISBN: 978-1-61779-585-5
eBook Packages: Springer Protocols
