Abstract
This paper focuses on data-intensive workflows and addresses the problem of scheduling workflow ensembles under cost and deadline constraints in Infrastructure as a Service (IaaS) clouds. Previous research in this area ignores file transfers between workflow tasks, which, as we show, often have a large impact on workflow ensemble execution. We propose and implement a simulation model for handling file transfers between tasks. The model calculates bandwidth dynamically and supports a configurable number of replicas, which allows us to simulate various levels of congestion. It can represent a wide range of storage systems available on clouds: from in-memory caches (such as memcached), to distributed file systems (such as NFS servers) and cloud storage (such as Amazon S3 or Google Cloud Storage). We observe that file transfers may have a significant impact on ensemble execution; for some applications, up to 90% of the execution time is spent on file transfers. Next, we propose and evaluate a novel scheduling algorithm that minimizes the number of transfers by taking advantage of data caching and file locality. We find that for data-intensive applications it performs better than other scheduling algorithms. Additionally, we modify the original scheduling algorithms so that they operate effectively in environments where file transfers take non-zero time.
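To make the idea of congestion-dependent bandwidth concrete, the following is a minimal sketch and not the model implemented in the paper. It assumes a simple fair-share rule: the aggregate bandwidth of all replicas is split evenly among the active transfers, and each transfer is additionally capped by the bandwidth of a single VM link. The class name `StorageModel`, its fields, and the fair-share rule itself are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class StorageModel:
    """Hypothetical fair-share storage model (an assumption, not the paper's)."""

    replicas: int             # configurable number of storage replicas
    replica_bandwidth: float  # bytes/s that one replica can serve in total
    vm_bandwidth: float       # bytes/s cap on a single VM's network link

    def bandwidth_per_transfer(self, active_transfers: int) -> float:
        """Bandwidth one transfer gets while `active_transfers` run concurrently."""
        if active_transfers <= 0:
            raise ValueError("active_transfers must be positive")
        aggregate = self.replicas * self.replica_bandwidth
        # Each transfer gets an equal share of the storage system,
        # but never more than its own VM link allows.
        return min(self.vm_bandwidth, aggregate / active_transfers)

    def transfer_time(self, size_bytes: float, active_transfers: int,
                      latency: float = 0.0) -> float:
        """Transfer time in seconds, assuming congestion stays constant."""
        return latency + size_bytes / self.bandwidth_per_transfer(active_transfers)


model = StorageModel(replicas=2, replica_bandwidth=100.0, vm_bandwidth=50.0)
light = model.bandwidth_per_transfer(2)   # limited by the VM link
heavy = model.bandwidth_per_transfer(8)   # limited by the shared replicas
```

In this sketch, raising `replicas` increases the aggregate bandwidth and so lowers congestion, while a large number of concurrent transfers pushes the per-transfer share below the VM link limit. A simulator would recompute the share whenever a transfer starts or finishes; the sketch keeps congestion fixed for the duration of one transfer.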
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Bryk, P., Malawski, M., Juve, G. et al. Storage-aware Algorithms for Scheduling of Workflow Ensembles in Clouds. J Grid Computing 14, 359–378 (2016). https://doi.org/10.1007/s10723-015-9355-6
DOI: https://doi.org/10.1007/s10723-015-9355-6