Journal of Grid Computing, Volume 11, Issue 3, pp 407–428

Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds

  • Vincent C. Emeakaroha
  • Michael Maurer
  • Patrick Stern
  • Paweł P. Łabaj
  • Ivona Brandic
  • David P. Kreil

Abstract

Rapid advances in high-throughput technologies in the life sciences in recent years are facilitating the generation and storage of huge amounts of data in different databases. Despite significant developments in computing capacity and performance, analysing these large-scale data in search of biomedically relevant patterns remains a challenging task. Scientific workflow applications are deemed to support data mining in more complex scenarios that include many data sources and computational tools, as commonly found in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data-oriented rather than resource-oriented. Thus, they lack efficient computational resource management capabilities, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. To address this issue, advanced resource monitoring and management strategies are required to determine the resource consumption behaviours of workflow applications and so enable the dynamic allocation and deallocation of resources. In this paper, we present a novel Cloud management infrastructure consisting of resource-level and application-level monitoring techniques and a knowledge management strategy. It manages the computational resources that support workflow application executions in order to guarantee their performance goals and their successful completion. We describe the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present detailed evaluation results as a proof of concept.

Keywords

Workflow execution, Resource level monitoring, Application level monitoring, Workflow management, Knowledge database, Cloud computing

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Vincent C. Emeakaroha (1)
  • Michael Maurer (1)
  • Patrick Stern (1)
  • Paweł P. Łabaj (2)
  • Ivona Brandic (1)
  • David P. Kreil (2)

  1. Information Systems Institute, Vienna University of Technology, Vienna, Austria
  2. Chair of Bioinformatics, Boku University Vienna, Vienna, Austria
