Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets

Lokeswari, Y. V.; Jacob, Shomona Gracia; Ramadoss, Rajavel

doi:10.1007/978-981-13-1882-5_46

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 750)

Abstract

Parallel data mining algorithms are widely used to discover hidden knowledge in varied, unrelated data. They offer advantages such as reduced training time, reduced execution time, and lower memory requirements. Executing parallel data mining algorithms in a distributed environment nevertheless raises several issues. It is crucial to partition the data among processors so that data dependency, communication overhead, and disk I/O cost are minimal, while synchronization is maintained and the workload is balanced across the nodes. A few of these issues can be resolved when parallel data mining algorithms are executed on the Apache Hadoop MapReduce framework. Hadoop MapReduce provides improved performance along with reduced communication cost, execution time, training time, and I/O access. This paper proposes a novel framework that aims to enhance these advantages in terms of scalability by increasing the number of nodes in the Hadoop cluster, and analyzes the performance of classification algorithms such as K-Nearest Neighbor, Naïve Bayes, and Decision Tree. This parallel framework could be extended to other fields of biotechnology where prediction on large datasets is essential.
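
To make the partition-then-combine idea in the abstract concrete, here is a minimal sketch of a map/reduce style K-Nearest Neighbor prediction. It is an illustration only and not the authors' framework: it runs in plain Python without Hadoop, the function names (`map_partition`, `reduce_candidates`, `knn_predict`) are hypothetical, and the toy data is made up. Each partition stands in for a data block held by one node.

```python
import heapq
import math
from collections import Counter


def map_partition(partition, query, k):
    """Map step (hypothetical name): k nearest (distance, label) pairs in one partition."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def reduce_candidates(candidate_lists, k):
    """Reduce step (hypothetical name): merge local candidates, vote among the global k nearest."""
    merged = heapq.nsmallest(k, (c for cands in candidate_lists for c in cands))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def knn_predict(partitions, query, k=3):
    """Run the map step on every partition, then a single reduce step."""
    local_candidates = [map_partition(p, query, k) for p in partitions]
    return reduce_candidates(local_candidates, k)


# Made-up toy data split into two partitions, standing in for blocks on two nodes.
partitions = [
    [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b")],
    [((0.9, 1.1), "a"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")],
]
prediction = knn_predict(partitions, (1.1, 1.0), k=3)
```

Because each partition forwards only its k best candidates, the reduce step handles at most k times the number of partitions records, which is one way the communication cost mentioned above can stay small as nodes are added.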

Author information

Authors and Affiliations

Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
Y. V. Lokeswari & Shomona Gracia Jacob
Department of ECE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
Rajavel Ramadoss

Corresponding author

Correspondence to Y. V. Lokeswari.

Editor information

Editors and Affiliations

Department of Computer Sciences Technology, Karunya Institute of Technology & Sciences, Coimbatore, Tamil Nadu, India
J. Dinesh Peter
Department of Civil and Environmental Engineering, University of Missouri, Columbia, MO, USA
Amir H. Alavi
School of Computing, Engineering and Mathematics, University of Western Sydney, Sydney, NSW, Australia
Bahman Javadi

Cite this paper

Lokeswari, Y.V., Jacob, S.G., Ramadoss, R. (2019). Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets. In: Peter, J., Alavi, A., Javadi, B. (eds) Advances in Big Data and Cloud Computing. Advances in Intelligent Systems and Computing, vol 750. Springer, Singapore. https://doi.org/10.1007/978-981-13-1882-5_46

DOI: https://doi.org/10.1007/978-981-13-1882-5_46
Published: 12 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1881-8
Online ISBN: 978-981-13-1882-5
eBook Packages: Intelligent Technologies and Robotics (R0)
