Abstract
Parallel data mining algorithms are widely used to discover hidden knowledge in varied, unrelated data. They offer advantages such as reduced training time, reduced execution time, and lower memory requirements. Executing them in a distributed environment, however, raises several issues: the data must be partitioned among processors so that data dependency, communication overhead, and disk I/O cost are minimal, while synchronization is maintained and the workload is balanced across nodes. A few of these issues can be resolved when parallel data mining algorithms are executed on the Apache Hadoop MapReduce framework, which provides improved performance and reduced communication cost, execution time, training time, and I/O access. This paper proposes a novel framework that aims to enhance these advantages in terms of scalability by increasing the number of nodes in the Hadoop cluster and analyzing the performance of classification algorithms such as K-Nearest Neighbor, Naïve Bayes, and Decision Tree. This parallel framework could be extended to other fields of biotechnology where prediction on large datasets is essential.
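To make the partition-then-combine idea concrete, the following is a minimal single-machine sketch of how K-Nearest Neighbor classification might be phrased as a map step and a reduce step. It is not the framework proposed in the paper and does not use Hadoop. The function names (`partition`, `map_local_knn`, `reduce_vote`, `knn_mapreduce`), the round-robin partitioning, and the toy data are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def partition(records, n_parts):
    """Round-robin split so each simulated node gets a similar workload."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_local_knn(part, query, k):
    """Map step: one node returns its k nearest (distance, label) pairs."""
    scored = ((dist(features, query), label) for features, label in part)
    return heapq.nsmallest(k, scored)


def reduce_vote(candidate_lists, k):
    """Reduce step: merge local candidates, keep the global k nearest, vote."""
    merged = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    return Counter(label for _, label in merged).most_common(1)[0][0]


def knn_mapreduce(train, query, k=3, n_parts=2):
    """Classify one query point; n_parts stands in for the number of nodes."""
    parts = partition(train, n_parts)
    return reduce_vote([map_local_knn(p, query, k) for p in parts], k)


# Hypothetical toy data: two well-separated classes in two dimensions.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

if __name__ == "__main__":
    print(knn_mapreduce(train, (1.1, 1.0)))
```

Because each partition forwards only its k best candidates, the data sent to the reduce step grows with the number of partitions rather than with the size of the training set, which is one way the communication overhead mentioned above can be kept small.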
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lokeswari, Y.V., Jacob, S.G., Ramadoss, R. (2019). Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets. In: Peter, J., Alavi, A., Javadi, B. (eds) Advances in Big Data and Cloud Computing. Advances in Intelligent Systems and Computing, vol 750. Springer, Singapore. https://doi.org/10.1007/978-981-13-1882-5_46
DOI: https://doi.org/10.1007/978-981-13-1882-5_46
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1881-8
Online ISBN: 978-981-13-1882-5
eBook Packages: Intelligent Technologies and Robotics (R0)