Environment-Sensitive Performance Tuning for Distributed Service Orchestration

  • Yu Lin
  • Franjo Ivančić
  • Pallavi Joshi
  • Gogul Balakrishnan
  • Malay Ganai
  • Aarti Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8969)


Modern distributed systems are designed to tolerate unreliable environments, i.e., to keep providing services even when failures occur in the underlying hardware or network. However, an unreliable environment can significantly degrade the performance of a distributed system, and this impact should be considered when deploying services. In this paper, we present an approach that optimizes the performance of distributed systems deployed in unreliable environments by searching for optimal configuration parameters. To simulate an unreliable environment, we inject failures into the environment of a service application, such as a node crash in the cluster, network failures between nodes, and resource contention within nodes. We then use a search algorithm to automatically find the optimal parameters in a user-selected parameter space under the unreliable environment we created. We have implemented our approach in a testing-based framework and applied it to several well-known distributed service systems.
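The overall loop the abstract describes — inject a disturbance, measure performance, and search the user-selected parameter space for the best configuration — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter space, the disturbance names, and the `measure_latency` cost model are all hypothetical stand-ins for running a real benchmark under fault injection.

```python
import random

# Hypothetical user-selected parameter space (name -> candidate values).
PARAM_SPACE = {
    "replication_factor": [1, 2, 3],
    "heartbeat_interval_ms": [100, 500, 1000],
}

def measure_latency(config, disturbance):
    """Stand-in for benchmarking the service under an injected failure.

    A real framework would deploy the service with `config`, trigger the
    disturbance (node crash, network delay, ...), and measure latency;
    here a deterministic toy cost keeps the sketch runnable.
    """
    penalty = {"node_crash": 50.0, "network_delay": 20.0}[disturbance]
    return (penalty / config["replication_factor"]
            + config["heartbeat_interval_ms"] * 0.01)

def random_search(disturbance, budget=50, seed=0):
    """Randomly sample configurations and keep the best one found."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(budget):
        config = {k: rng.choice(vals) for k, vals in PARAM_SPACE.items()}
        cost = measure_latency(config, disturbance)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost

best, cost = random_search("node_crash")
print(best, cost)
```

Plain random sampling is used here only for brevity; a smarter strategy (e.g., a recursive random search that narrows the sampled region around promising configurations) would explore large parameter spaces far more efficiently.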


Keywords: Distributed application · Disturbance action · Performance optimization



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Yu Lin (1)
  • Franjo Ivančić (2)
  • Pallavi Joshi (2)
  • Gogul Balakrishnan (2)
  • Malay Ganai (2)
  • Aarti Gupta (2)
  1. University of Illinois at Urbana-Champaign, Champaign, USA
  2. NEC Laboratories America, Princeton, USA
