The Journal of Supercomputing, Volume 72, Issue 11, pp 4069–4088

Exploiting in-memory storage for improving workflow executions in cloud platforms

  • Francisco Rodrigo Duro
  • Fabrizio Marozzo
  • Javier Garcia Blas
  • Domenico Talia
  • Paolo Trunfio


The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows in cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so its I/O performance is bounded by the performance of that storage. In this work, we propose using the Hercules system within DMCF as an ad hoc storage system for temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution takes advantage of the scalability of Hercules to avoid the bandwidth limits of the default storage. To demonstrate the viability of the proposed solution, we first compared the performance of Hercules with that of Microsoft Azure Storage using synthetic benchmarks. We then evaluated the integration of Hercules and DMCF on a real application consisting of a workflow that accesses temporary data using either Azure storage or Hercules. In this real-life scenario, Hercules reduced the I/O overhead by 36% with respect to Azure storage, leading to a 13% reduction in total execution time. This confirms that our in-memory approach is effective in improving the performance of data-intensive workflow executions in cloud-based platforms.
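The two percentages reported above can be related with a short back-of-the-envelope calculation. The sketch below assumes that only the I/O portion of the run time changes when switching from Azure storage to Hercules and that the compute portion stays constant; this assumption is ours and is not stated in the abstract. The function name `implied_io_fraction` is hypothetical and serves only for illustration.

```python
def implied_io_fraction(io_reduction: float, total_reduction: float) -> float:
    """Estimate the share of baseline run time spent on I/O.

    Assumption (not from the paper): compute time is unchanged, so
    total_reduction = io_fraction * io_reduction.
    """
    if not 0 < io_reduction <= 1:
        raise ValueError("io_reduction must be in (0, 1]")
    if not 0 <= total_reduction <= io_reduction:
        raise ValueError("total_reduction must be in [0, io_reduction]")
    return total_reduction / io_reduction


# Figures reported in the abstract: 36% less I/O overhead, 13% less total time.
io_share = implied_io_fraction(io_reduction=0.36, total_reduction=0.13)
print(f"Implied baseline I/O share: {io_share:.1%}")
```

Under that assumption, roughly a third of the baseline execution time would be I/O overhead, which is consistent with describing the workflow as data-intensive. The actual breakdown measured by the authors may differ.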


DMCF, Hercules, Workflows, In-memory storage, Data cache, Microsoft Azure



This work is partially supported by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS). This work is partially supported by the grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems from the Spanish Ministry of Economy and Competitiveness.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. ARCOS, University Carlos III Madrid, Leganés, Spain
  2. DIMES, University of Calabria, Rende, Italy
