Parallel Software Architecture for Experimental Workflows in Computational Biology on Clouds

  • Luqman Hodgkinson
  • Javier Rosa
  • Eric A. Brewer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7204)

Abstract

Cloud computing opens new possibilities for computational biologists. Given the pay-as-you-go model and the commodity hardware base, new tools for extensive parallelism are needed to make experimentation in the cloud an attractive option. In this paper, we present EasyProt, a parallel message-passing architecture designed for developing experimental workflows in computational biology while harnessing the power of cloud resources. The system exploits parallelism in two ways: by multithreading modular components on virtual machines while respecting data dependencies and by allowing expansion across multiple virtual machines. Components of the system, called elements, are easily configured for efficient modification and testing of workflows during ever-changing experimentation. Though EasyProt, as an abstract cloud programming model, can be extended beyond computational biology, current development brings cloud computing to experimenters in this important discipline who are facing unprecedented data-processing challenges, with a type system designed for proteomics, interactomics and comparative genomics data, and a suite of elements that perform useful analysis tasks on biological data using cloud resources.

Availability: EasyProt is available as a public Amazon Machine Image (AMI) on the Amazon EC2 cloud service, under an open source license, registered with manifest easyprot-ami/easyprot.img.manifest.xml.
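
To illustrate the first form of parallelism described in the abstract (running modular elements on multiple threads while respecting data dependencies), the following is a minimal sketch in Python. It is not EasyProt's actual interface: the function `run_workflow`, the toy elements, and the `WORKFLOW` dictionary are all hypothetical names chosen for this example, and the scheduling relies on the standard-library `graphlib` and `concurrent.futures` modules rather than on any message-passing layer from the paper.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(elements, max_workers=4):
    """Run every element once all of its upstream elements have finished.

    `elements` maps a name to (function, [names of upstream elements]).
    Elements with no unmet dependencies run concurrently on a thread pool.
    Returns a dict mapping each element name to its output.
    """
    graph = {name: deps for name, (_, deps) in elements.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies

    results = {}
    pending = {}  # future -> element name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every element whose inputs are now available.
            for name in sorter.get_ready():
                func, deps = elements[name]
                inputs = [results[d] for d in deps]
                pending[pool.submit(func, *inputs)] = name
            # Block until at least one running element completes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                results[name] = future.result()
                sorter.done(name)  # unlocks downstream elements
    return results


# Hypothetical toy elements operating on protein sequences.
def load_sequences():
    return ["MKTAYIAK", "MKV", "GAVLIM"]


def sequence_lengths(seqs):
    return [len(s) for s in seqs]


def methionine_counts(seqs):
    return [s.count("M") for s in seqs]


def summarize(lengths, counts):
    return {"total_residues": sum(lengths), "total_met": sum(counts)}


# "lengths" and "met" both depend only on "load", so they may run in parallel.
WORKFLOW = {
    "load": (load_sequences, []),
    "lengths": (sequence_lengths, ["load"]),
    "met": (methionine_counts, ["load"]),
    "summary": (summarize, ["lengths", "met"]),
}
```

Calling `run_workflow(WORKFLOW)` returns the output of every element keyed by name. Swapping one entry of the dictionary for a different function is the kind of quick workflow modification the abstract alludes to, though the real system's configuration mechanism may differ.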

Keywords

parallel architectures, scientific workflows, cloud computing

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Luqman Hodgkinson (1, 2, 3)
  • Javier Rosa (1)
  • Eric A. Brewer (1)

  1. Computer Science Division, University of California, Berkeley, USA
  2. Center for Computational Biology, University of California, Berkeley, USA
  3. International Computer Science Institute, Berkeley, USA
