Efficient DANNLO classifier for multi-class imbalanced data on Hadoop

  • Original Research
International Journal of Information Technology

Abstract

In recent years, classification of multi-class imbalanced data has become a major problem in big data. To address it, we developed a new Deep Artificial Neural Network Learning Optimization (DANNLO) classifier for large collections of imbalanced data. In the proposed work, the dataset is first reduced in dimensionality using principal component analysis, and the initial centroids are computed. Then, a parallel hierarchical pillar k-means clustering algorithm based on MapReduce partitions the imbalanced data set into similar subsets, which can reduce the computational cost. The resulting clusters are given as input to the deep ANN for learning. In the next stage, the deep neural network is trained using the backpropagation algorithm. To optimize the n-dimensional weight space, the firefly optimization algorithm is used, in which the attractiveness and distance of each firefly are computed. Hadoop is used to handle these large volumes of variable-size data. The imbalanced datasets are taken from the ECDC (European Centre for Disease Prevention and Control) repository. The experimental results show that the proposed method can significantly improve the effectiveness of classifying imbalanced data, as measured by TP rate, F-measure, G-mean, confusion matrix, precision, recall, and ROC. The results suggest that the DANNLO classifier outperforms other ordinary classifiers such as SVM and random forest on the tested imbalanced data sets.
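The abstract mentions computing the attractiveness and distance of each firefly but gives no formulas. As a rough illustration only, the sketch below uses the textbook firefly algorithm, where attractiveness decays with distance as beta0 * exp(-gamma * r^2). The function names, the parameters `beta0`, `gamma`, and `alpha`, and the simplified single-sweep update are assumptions made for this example, not details taken from the paper.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Textbook firefly attractiveness: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)

def move_towards(xi, xj, beta0=1.0, gamma=1.0, alpha=0.0, rng=random):
    """Move firefly xi towards firefly xj, plus an optional random step scaled by alpha."""
    beta = attractiveness(distance(xi, xj), beta0, gamma)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(xi, xj)]

def firefly_step(population, loss, **kw):
    """One simplified sweep: each firefly moves towards every firefly
    whose loss was lower at the start of the sweep (hypothetical helper)."""
    losses = [loss(w) for w in population]
    new_pop = [list(w) for w in population]
    for i in range(len(population)):
        for j in range(len(population)):
            if losses[j] < losses[i]:  # firefly j is "brighter" than i
                new_pop[i] = move_towards(new_pop[i], population[j], **kw)
    return new_pop
```

In a setting like the one described, each firefly would stand for one candidate weight vector of the network and `loss` would be its training error. How the paper combines this search with backpropagation is not stated in the abstract.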

Author information

Corresponding author

Correspondence to S. Satyanarayana.

About this article

Cite this article

Satyanarayana, S., Tayar, Y. & Prasad, R.S.R. Efficient DANNLO classifier for multi-class imbalanced data on Hadoop. Int. j. inf. tecnol. 11, 321–329 (2019). https://doi.org/10.1007/s41870-018-0187-z

  • DOI: https://doi.org/10.1007/s41870-018-0187-z
