Self-awareness of Cloud Applications

  • Alex Iosup
  • Xiaoyun Zhu
  • Arif Merchant
  • Eva Kalyvianaki
  • Martina Maggio
  • Simon Spinner
  • Tarek Abdelzaher
  • Ole Mengshoel
  • Sara Bouchenak


Cloud applications today deliver an increasingly large portion of information and communications technology (ICT) services. To address the scale, growth, and reliability of cloud applications, self-aware management and scheduling are becoming commonplace. How are they used in practice? In this chapter, we propose a conceptual framework for analyzing the state-of-the-art self-awareness approaches used in the context of cloud applications. We map important applications from popular and emerging application domains to this conceptual framework and compare the practical characteristics, benefits, and drawbacks of self-awareness approaches. Last, we propose a road map for addressing the open challenges in self-aware cloud and datacenter applications.


Keywords: Cloud Computing; Virtual Machine; Cloud Service; Intrusion Detection; Cloud Application





This work is partially supported by the Dutch STW/NWO Veni personal grant @large and Vidi personal grant MagnaData, by the Dutch national program COMMIT and COMMissioner subproject, by the Dutch KIEM project KIESA, by a generous ERO gift from Oracle, by the European FP7 research project AMADEOS Grant Agreement 610535 on Systems of Systems, by the Swedish Research Council (VR) for the projects “Cloud Control” and “Power and temperature control for large-scale computing infrastructures,” and through the LCCC Linnaeus and ELLIIT Excellence Centers.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Alex Iosup (1)
  • Xiaoyun Zhu (2)
  • Arif Merchant (3)
  • Eva Kalyvianaki (4)
  • Martina Maggio (5)
  • Simon Spinner (6)
  • Tarek Abdelzaher (7)
  • Ole Mengshoel (8)
  • Sara Bouchenak (9)

  1. Delft University of Technology, Delft, The Netherlands
  2. Futurewei Technologies, Santa Clara, USA
  3. Google, Inc., Menlo Park, USA
  4. Imperial College London, London, UK
  5. Lund University, Lund, Sweden
  6. University of Würzburg, Würzburg, Germany
  7. University of Illinois at Urbana-Champaign, Champaign, USA
  8. CMU Silicon Valley at the NASA Ames Research Center, Moffett Field, USA
  9. INSA Lyon, Lyon, France
