Scalable Computing for Evolutionary Genomics

  • Pjotr Prins
  • Dominique Belhachemi
  • Steffen Möller
  • Geert Smant
Part of the Methods in Molecular Biology book series (MIMB, volume 856)


Genomic data analysis in evolutionary biology is becoming so computationally intensive that analyzing multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we give a quick overview of advanced programming techniques and then discuss how to scale computations by parallelizing calculations. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is poor man’s parallelization: running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a “box,” or virtual machine (VM). Such a VM can be deployed flexibly on multiple computers in a local network (e.g., on existing desktop PCs), and even in the Cloud, to create a “virtual” computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel as processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, creates a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. It includes over 200 bioinformatics and statistical software packages of interest to evolutionary biology, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software.
Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images for multiple targets through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. Alongside the downloadable BioNode images, we provide online tutorials that enable bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.
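
The idea of running whole programs in parallel as separate processes can be illustrated with a minimal Python sketch. It is not taken from the chapter or from BioNode's configuration scripts, and it uses a local worker pool rather than a cluster job scheduler. The helper names (`run_job`, `run_all`, `stand_in_command`) and the input file names are hypothetical, and the stand-in command only prints a message where a real pipeline would call an unmodified tool on one input file.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one external program as its own process; return (exit code, stdout)."""
    done = subprocess.run(command, capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def run_all(commands, max_workers=4):
    """Run every command as a separate process, at most max_workers at a time.

    Threads are used only to wait on the child processes; the actual work
    happens in those processes. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


def stand_in_command(name):
    # Hypothetical placeholder for a real tool invocation on one input file.
    return [sys.executable, "-c", f"print('processed {name}')"]


if __name__ == "__main__":
    inputs = ["geneA.fasta", "geneB.fasta", "geneC.fasta"]  # hypothetical names
    for code, out in run_all([stand_in_command(n) for n in inputs]):
        print(code, out)
```

Because each job is an independent process, the programs themselves need no changes; on a cluster, the worker pool would be replaced by submissions to a job scheduler.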

Key words

BioNode; Bioinformatics; Evolutionary biology; Big data; Parallelization; MPI; Cloud computing; Cluster computing; Virtual machine; Amazon EC2; OpenStack; PAML; MrBayes; VirtualBox; Debian Linux



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Pjotr Prins (1, 2)
  • Dominique Belhachemi (3)
  • Steffen Möller (4)
  • Geert Smant (1)

  1. Laboratory of Nematology, Wageningen University, Wageningen, The Netherlands
  2. Groningen Bioinformatics Centre, University of Groningen, Groningen, The Netherlands
  3. Section of Biomedical Image Analysis, Department of Radiology, University of Pennsylvania, Philadelphia, USA
  4. Department of Dermatology, University Clinics of Schleswig-Holstein, formerly University of Lübeck, Institute for Neuro- and Bioinformatics, Lübeck, Germany
