Evaluation of MapReduce-Based Distributed Parallel Machine Learning Algorithms

  • Ashish Kumar Gupta
  • Prashant Varshney
  • Abhishek Kumar
  • Bakshi Rohit Prasad
  • Sonali Agarwal
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 645)


We are moving toward the multicore era, yet there is still no widely adopted programming framework for these architectures, and therefore no general, common way for machine learning to take advantage of the potential speedup. In this paper, we present a framework based on a parallel programming method that can be easily applied to machine learning algorithms. Unlike prior work that parallelizes each individual algorithm in its own ad hoc way, we use the MapReduce framework to achieve parallel speedup across machine learning algorithms. Our experiments show speedup as the number of nodes in the cluster increases.


Keywords: MapReduce · Naïve Bayes · K-means · Linear regression · Hadoop



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Ashish Kumar Gupta (1)
  • Prashant Varshney (1)
  • Abhishek Kumar (1)
  • Bakshi Rohit Prasad (1)
  • Sonali Agarwal (1)
  1. Indian Institute of Information Technology, Allahabad, India
