Big Data Storage and Processing on Azure Clouds: Experiments at Scale and Lessons Learned

  • Radu Tudoran
  • Alexandru Costan
  • Gabriel Antoniu
  • Brasche Goetz


Data-intensive computing is increasingly regarded as the basis for a new, fourth paradigm for science. Two factors are encouraging this trend. First, vast amounts of data are becoming available in more and more application areas. Second, the infrastructures needed to store these data persistently for sharing and processing are becoming a reality. Together, these make it possible to unify the knowledge acquired through the previous three paradigms of scientific research (theory, experiments and simulations) with vast amounts of multidisciplinary data. The technical and scientific issues related to this context have been designated as the “Big Data” challenges. In this landscape, building a functional infrastructure that meets the requirements of Big Data applications is critical and remains a challenge. An important step has been made thanks to the emergence of cloud infrastructures, which provide the first building blocks for coping with the scale of the Big Data vision. Clouds bring to life the illusion of a (more-or-less) infinitely scalable infrastructure managed through a fully outsourced ICT service. Instead of having to buy and manage hardware, users “rent” outsourced resources as needed. However, cloud technologies have not yet reached their full potential. In particular, the capabilities currently available for data storage and processing are still far from meeting application requirements. In this work we investigate several pressing challenges related to Big Data management on clouds. We discuss current state-of-the-art solutions, their limitations and some ways to overcome them. We illustrate our study with a concrete application from the area of joint genetic and neuroimaging data analysis. The goal of this chapter is to present the conclusions of this study, which was performed through a large-scale experiment carried out across three data centers of Microsoft’s Azure cloud platform over 2 weeks and consumed approximately 200,000 compute hours.


Keywords: Cloud Storage, Storage Node, Cloud Storage Service, Distribute Storage System, Cloud Storage System



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Radu Tudoran (1)
  • Alexandru Costan (1, corresponding author)
  • Gabriel Antoniu (1)
  • Brasche Goetz (2)

  1. INRIA Rennes, Rennes, France
  2. Huawei Technologies Duesseldorf GmbH, Germany, USA
