Cluster Computing, Volume 20, Issue 3, pp 2749–2762

Self-organized dynamic provisioning for big data

Abstract

The rapid expansion of datasets in big data problems has produced data sizes that exceed the processing capabilities of available distributed computing power. In other words, we are producing more data than we can process. In addition, further analysis of a dataset's collective state may require duplicating, transferring, and distributing data, which increases the scale of the problem. Orchestrating these steps in large-scale complex systems is non-trivial. One basic technique to help minimize the effects of data redistribution is to use dynamic resource provisioning environments. When node organization and structure are dynamic and eclectic, provisioning environments require up-to-date information about resource availability. Keeping the available resource state fresh in centralized or hierarchical scheduling systems imposes network communication overhead. Centralization also introduces administrative barriers, limiting interoperability. One effective way to improve the extent of self-organization is to take feedback from the system. Based on this feedback, nodes can alter their behavior to respond better to changing characteristics in dynamic resource provisioning environments. In this article, we present a decentralized scheduling framework that takes feedback from the system and adjusts its behavior accordingly. Our framework provides an enabling mechanism for self-organization, in which each cloud node adapts its behavior based on the feedback. Compared to the centralized resource provisioning solutions in current cloud systems, this approach achieves comparable scheduling decisions with half the packet overhead. We show that, by taking advantage of spatial locality with dynamic provisioning and making better scheduling decisions with our framework, the data processing overhead of big data problems can be reduced by at least 30% in general, and by up to 55% in particular resource distributions. This, in turn, results in efficient scheduling decisions that provision better resources for big data tasks.
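
The feedback loop described above can be illustrated with a minimal sketch. It is not the framework presented in the article: the abstract does not specify the feedback signal or the adaptation rule, so the `Node` class, the `fanout` parameter, the match-rate feedback, and all thresholds below are assumptions chosen only to show how a node might change its dissemination behavior from local feedback, without a central scheduler.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical cloud node that advertises its free resources to peers."""

    fanout: int = 4                 # peers contacted per advertisement round (assumed)
    min_fanout: int = 1             # assumed lower bound
    max_fanout: int = 16            # assumed upper bound
    target_match_rate: float = 0.8  # assumed goal for satisfied requests

    def adapt(self, matched: int, requested: int) -> int:
        """Adjust fanout from feedback: the share of recent requests that were matched."""
        if requested == 0:
            return self.fanout  # no feedback this round, keep current behavior
        match_rate = matched / requested
        if match_rate < self.target_match_rate:
            # Too few requests found a resource: advertise state more widely.
            self.fanout = min(self.max_fanout, self.fanout * 2)
        else:
            # Requests are being satisfied: back off to save packets.
            self.fanout = max(self.min_fanout, self.fanout - 1)
        return self.fanout


node = Node()
print(node.adapt(matched=2, requested=10))  # low match rate, fanout grows
print(node.adapt(matched=9, requested=10))  # high match rate, fanout shrinks
```

The multiplicative increase and additive decrease are an arbitrary design choice for this sketch: the node reacts quickly when requests go unmatched and reduces packet overhead gradually once matching recovers.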

Keywords

Dynamic resource provisioning · Big data · Cloud interoperability · Resource scheduling · Resource matchmaking

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. School of Computing, Sacred Heart University, Fairfield, USA
