Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification

  • Sachin Subhash Patil
  • Shefali Pratap Sonavane
Part of the Studies in Big Data book series (SBD, volume 24)


Facing hundreds of gigabytes of data has triggered a need to reconsider data management options. There is a tremendous requirement to study data sets beyond the capability of commonly used software tools to capture, curate and manage within a tolerable elapsed time, and beyond the processing feasibility of single-machine architectures. In addition to traditional structured data, the new avenue of NoSQL Big Data has urged a call for experimental techniques and technologies that require ventures to re-integrate, helping to discover large hidden values in datasets that are complex, diverse and of massive scale. In many real-world applications, classification of imbalanced datasets is a priority concern. Standard classifier learning algorithms assume a balanced class distribution and equal misclassification costs; as a result, datasets with an imbalanced class distribution cause a notable drop in the performance obtained by most standard classifier learning algorithms. Moreover, most classification methods address the two-class imbalance problem rather than the multi-class imbalance problem that arises in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest—LVH) with enhanced over_sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique—MEMMOT, Majority Minority Mix mean—MMMm, Nearest Farthest Neighbor_Mid—NFN-M, Clustering Minority Examples—CME, Majority Minority Cluster Based Under_Over Sampling Technique—MMCBUOS, Updated Class Purity Maximization—UCPM) to improve classification. The study takes broadly two views: comparing the enhanced non-cluster techniques against prior work, and taking a clustering-based approach for advanced O.S. techniques. Finally, the balanced data is used to build a Random Forest (R.F.) classifier. The O.S. techniques are projected to be applied to imbalanced Big Data in a MapReduce environment. Experiments are suggested on Apache Hadoop and Apache Spark, using different datasets from the UCI/KEEL repositories. Geometric mean, F-measure, area under the curve (AUC), average accuracy and Brier scores are used to measure classification performance.
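The enhanced O.S. techniques named above (MEMMOT, MMMm, etc.) are not detailed in this abstract. As a rough illustration of the general family they belong to, the sketch below generates synthetic minority examples SMOTE-style, interpolating a minority point toward the mean of its k nearest minority neighbours; the function name, parameters and mean-based step are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def mean_interpolation_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by stepping each chosen
    minority point a random fraction of the way toward the mean of its
    k nearest minority neighbours (a SMOTE-like, mean-based scheme)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(p, x),
        )[:k]
        # component-wise mean of the neighbourhood
        mean = [sum(c) / len(neighbours) for c in zip(*neighbours)]
        gap = rng.random()  # random interpolation step in [0, 1)
        synthetic.append(tuple(xi + gap * (mi - xi)
                               for xi, mi in zip(x, mean)))
    return synthetic
```

Because each synthetic point lies on the segment between an existing minority point and a neighbourhood mean, it stays inside the convex hull of the minority class, which is the usual safeguard against generating samples in majority-class regions.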


Imbalanced datasets · Big data · Over_sampling techniques · Data level approach · Minority class · Multi-class · MapReduce · Clustering · Streaming inputs · Reduct
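Of the evaluation measures proposed in the abstract, the geometric mean and F-measure follow directly from a binary confusion matrix with the minority class treated as positive. A minimal sketch (function and argument names are illustrative):

```python
def gmean_and_f1(tp, fp, fn, tn):
    """Geometric mean of per-class recalls and minority-class F-measure,
    from binary confusion-matrix counts (minority class = positive)."""
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    gmean = (sensitivity * specificity) ** 0.5
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return gmean, f1
```

Unlike plain accuracy, the geometric mean collapses to zero if either class is entirely misclassified, which is why it is favoured for imbalanced data.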



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Sachin Subhash Patil
    1. Faculty of Computer Science and Engineering, Rajarambapu Institute of Technology Rajaramnagar, Islampur, India
  • Shefali Pratap Sonavane
    2. Faculty of Information Technology, Walchand College of Engineering Vishrambag, Sangli, India
