
Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets

  • Conference paper
Advances in Big Data and Cloud Computing

Abstract

Parallel data mining algorithms are widely used to discover hidden knowledge in varied, unrelated data. They offer advantages such as reduced training time, reduced execution time, and lower memory requirements. Executing them in a distributed environment, however, raises several issues. The data must be partitioned among processors so that data dependency is minimal, synchronization is correct, the workload is balanced across nodes, and communication overhead and disk I/O cost are kept low. Some of these issues can be resolved by executing parallel data mining algorithms on Apache Hadoop MapReduce, which provides improved performance along with reduced communication cost, execution time, training time, and I/O access. This paper proposes a novel framework that aims to enhance these advantages in terms of scalability: it increases the number of nodes in the Hadoop cluster and analyzes the performance of classification algorithms such as K-Nearest Neighbor, Naïve Bayes, and Decision Tree. This parallel framework could be extended to other fields of biotechnology where prediction on large datasets is essential.
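As a rough illustration of the partition, map, and reduce pattern the abstract describes, the following minimal sketch runs K-Nearest Neighbor classification over data split across simulated nodes. It is not the authors' framework and does not use Hadoop; the function names (`partition`, `map_local_knn`, `reduce_global_knn`, `knn_predict`), the round-robin split, and the toy dataset are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def partition(records, n_parts):
    """Split records round-robin so each simulated node gets a similar load."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_local_knn(part, query, k):
    """Map step: one partition returns its own k nearest (distance, label) pairs."""
    return heapq.nsmallest(k, ((dist(x, query), y) for x, y in part))


def reduce_global_knn(candidate_lists, k):
    """Reduce step: merge local candidates, keep the global k nearest, then vote."""
    merged = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def knn_predict(records, query, k=3, n_parts=2):
    """Classify one query point; n_parts stands in for the number of cluster nodes."""
    parts = partition(records, n_parts)
    return reduce_global_knn([map_local_knn(p, query, k) for p in parts], k)


# Hypothetical toy data: (feature vector, class label)
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

print(knn_predict(train, (1.1, 1.0), k=3, n_parts=2))
print(knn_predict(train, (5.1, 5.0), k=3, n_parts=3))
```

Each partition only ships its k best candidates to the reducer, so communication grows with k times the number of nodes rather than with the dataset size. Because the global k nearest neighbors are always contained in the union of the local ones, the prediction should not depend on how many partitions are used.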


Author information

Corresponding author

Correspondence to Y. V. Lokeswari.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lokeswari, Y.V., Jacob, S.G., Ramadoss, R. (2019). Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets. In: Peter, J., Alavi, A., Javadi, B. (eds) Advances in Big Data and Cloud Computing. Advances in Intelligent Systems and Computing, vol 750. Springer, Singapore. https://doi.org/10.1007/978-981-13-1882-5_46
