Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds

Abstract

Rapid advances in high-throughput technologies in the life sciences in recent years have facilitated the generation and storage of huge amounts of data in different databases. Despite significant developments in computing capacity and performance, analysing these large-scale data in search of biomedically relevant patterns remains a challenging task. Scientific workflow applications are intended to support data mining in more complex scenarios that involve many data sources and computational tools, as is common in bioinformatics. A scientific workflow application is a holistic unit that defines, executes, and manages scientific applications using different software tools. Existing workflow applications are process- or data-oriented rather than resource-oriented. They therefore lack efficient capabilities for managing computational resources, such as those provided by Cloud computing environments. Insufficient computational resources disrupt the execution of workflow applications, wasting time and money. Addressing this issue requires advanced resource monitoring and management strategies that determine the resource consumption behaviour of workflow applications and so enable the dynamic allocation and deallocation of resources. In this paper, we present a novel Cloud management infrastructure that consists of resource-level and application-level monitoring techniques and a knowledge management strategy. It manages the computational resources that support workflow application executions in order to guarantee their performance goals and their successful completion. We describe the design of these techniques, demonstrate how they can be applied to scientific workflow applications, and present detailed evaluation results as a proof of concept.
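The idea of turning monitored resource consumption into allocation and deallocation decisions can be illustrated with a minimal sketch. It is not the infrastructure described in the paper: the `Sample` type, the `decide_action` function, and the two thresholds are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring sample; the field names are hypothetical."""
    cpu_util: float  # fraction of allocated CPU in use, 0.0-1.0
    mem_util: float  # fraction of allocated memory in use, 0.0-1.0


def decide_action(sample: Sample, low: float = 0.3, high: float = 0.9) -> str:
    """Return 'allocate', 'deallocate', or 'hold' for one sample.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    # Base the decision on the most heavily used resource.
    peak = max(sample.cpu_util, sample.mem_util)
    if peak >= high:
        return "allocate"    # a resource is close to exhaustion
    if peak <= low:
        return "deallocate"  # all resources are mostly idle
    return "hold"


if __name__ == "__main__":
    for s in (Sample(0.95, 0.40), Sample(0.10, 0.20), Sample(0.50, 0.60)):
        print(s, "->", decide_action(s))
```

A real system would presumably smooth samples over time and draw on stored knowledge rather than fixed thresholds; the sketch only shows the basic monitor-then-decide step.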

Author information

Corresponding author

Correspondence to Vincent C. Emeakaroha.

About this article

Cite this article

Emeakaroha, V.C., Maurer, M., Stern, P. et al. Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds. J Grid Computing 11, 407–428 (2013). https://doi.org/10.1007/s10723-013-9260-9

Keywords

  • Workflow execution
  • Resource level monitoring
  • Application level monitoring
  • Workflow management
  • Knowledge database
  • Cloud computing