Big data classification using deep learning and apache spark architecture

  • Original Article
Neural Computing and Applications

Abstract

The irregularity in big data is rising steadily, so current software tools struggle to manage huge datasets. Moreover, the rate of irregular data in very large datasets is a key concern for the research industry. This paper therefore proposes a novel method for handling big data using the Spark framework. The proposed method classifies big data in two stages, feature selection and classification, both performed in the initial nodes of the Spark architecture. The proposed optimization algorithm, named the Rider Chaotic Biogeography Optimization (RCBO) algorithm, integrates the Rider Optimization Algorithm (ROA) with the standard chaotic biogeography-based optimization (CBBO). The proposed RCBO-based deep stacked auto-encoder, running on the Spark framework, handles big data effectively and achieves effective big data classification. Here, RCBO selects suitable features from the massive dataset, and the deep stacked auto-encoder is in turn trained with RCBO to classify huge datasets. This research focuses on the problem of managing big data, using the Covertype dataset from the UCI Machine Learning Repository. The dataset describes forest cover data and is used to predict the forest cover type from cartographic variables. It is multivariate, with 263,361 recorded web hits, and contains 581,012 instances with 54 attributes; the associated task is classification. Evaluation of the proposed RCBO-based deep stacked auto-encoder on the Spark framework using the UCI machine learning datasets showed that the proposed method outperformed other methods, attaining a maximal accuracy of 86.71%, a Dice coefficient of 92.7%, a sensitivity of 75.2%, and a specificity of 95.4%.
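The four evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard binary (one-vs-rest) definitions based on true/false positives and negatives; the abstract does not state how the authors aggregated the measures over the cover-type classes, so the function names and the toy labels below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return Confusion(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(c):
    """Fraction of all samples classified correctly."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def sensitivity(c):
    """True positive rate: TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """True negative rate: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


def dice_coefficient(c):
    """Dice coefficient: 2*TP / (2*TP + FP + FN)."""
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


if __name__ == "__main__":
    # Toy labels, not taken from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    c = confusion_counts(y_true, y_pred)
    print(c, accuracy(c), sensitivity(c), specificity(c), dice_coefficient(c))
```

For a multi-class task such as Covertype, one common choice is to compute these measures per class in a one-vs-rest fashion and then average them; whether the paper does so is not stated in the abstract.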

Author information

Corresponding author

Correspondence to Anilkumar V. Brahmane.

Ethics declarations

Conflict of interest

We declare that we have no financial relationship with any organization or funding agency related to this article, and that no research grant was available for this research work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Brahmane, A.V., Krishna, B.C. Big data classification using deep learning and apache spark architecture. Neural Comput & Applic 33, 15253–15266 (2021). https://doi.org/10.1007/s00521-021-06145-w

  • DOI: https://doi.org/10.1007/s00521-021-06145-w
