Performance Evaluation of Clustering and Collaborative Filtering Algorithms for Resource Scheduling in a Public Cloud Environment

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 612)


Big Data has become the cornerstone of modern knowledge-based systems. However, taking advantage of the knowledge found in big data sets requires advanced solutions to store, access, and analyze data in a feasible way, whether online, offline, or both. Such solutions require, on the one hand, a better understanding of the computational needs of big data and, on the other, the design of new computational infrastructures for that purpose. This paper evaluates the performance, in terms of CPU, load, and memory utilization as well as scalability, of several clustering and collaborative filtering algorithms from Apache Spark MLlib, which provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. The aim is to characterize the performance of these algorithms and draw conclusions for their application to real-life problems. To that end, the performance evaluations use a large-scale Google cluster usage trace dataset.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Istituto Superiore Mario Boella, Torino, Italy
  2. Universitat Politècnica de Catalunya, Barcelona, Spain
