Protocol

Evolutionary Genomics

Volume 856 of the series Methods in Molecular Biology, pp 529-545

Scalable Computing for Evolutionary Genomics

  • Pjotr Prins (Laboratory of Nematology, Wageningen University; Groningen Bioinformatics Centre, University of Groningen)
  • Dominique Belhachemi (Section of Biomedical Image Analysis, Department of Radiology, University of Pennsylvania)
  • Steffen Möller (Department of Dermatology, University Clinics of Schleswig-Holstein, formerly University of Lübeck, Institute for Neuro- and Bioinformatics)
  • Geert Smant (Laboratory of Nematology, Wageningen University)

Abstract

Genomic data analysis in evolutionary biology is becoming so computationally intensive that analyzing multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we give a quick overview of advanced programming techniques and then discuss techniques for scaling computations through parallelization. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is to introduce poor man’s parallelization by running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a “box,” or virtual machine (VM). Such a VM can be deployed flexibly on multiple computers in a local network, e.g., on existing desktop PCs, and even in the Cloud, to create a “virtual” computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel as processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. It includes over 200 bioinformatics and statistical software packages of interest to evolutionary biology, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software.
Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images for multiple targets through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. Alongside the downloadable BioNode images, we provide online tutorials that help bioinformaticians install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.
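
The "poor man's parallelization" mentioned in the abstract, running whole programs in parallel as separate processes, can be illustrated with a minimal Python sketch. It is not taken from the chapter or from BioNode: the helper names (run_job, run_in_parallel, make_placeholder_command) are hypothetical, and a trivial Python one-liner stands in for a real analysis program. On a cluster, a job scheduler would typically play the role of the thread pool used here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one whole program as a separate OS process and return its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_in_parallel(commands, max_workers=4):
    """Run independent commands concurrently; results keep the input order."""
    # The threads only wait on child processes, so the actual work is done
    # by the separate processes, up to max_workers at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


def make_placeholder_command(dataset):
    """Build a stand-in command (assumption: replaces a real analysis tool)."""
    return [sys.executable, "-c", f"print('analysed {dataset}')"]


if __name__ == "__main__":
    # Hypothetical dataset names, one independent job per dataset.
    datasets = ["geneA", "geneB", "geneC"]
    commands = [make_placeholder_command(name) for name in datasets]
    for line in run_in_parallel(commands):
        print(line)
```

Because each job is an unmodified program started as its own process, this approach suits legacy software that was never designed for parallel execution; the only requirement is that the jobs do not depend on each other.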

Key words

BioNode; Bioinformatics; Evolutionary biology; Big data; Parallelization; MPI; Cloud computing; Cluster computing; Virtual machine; Amazon EC2; OpenStack; PAML; MrBayes; VirtualBox; Debian Linux