
Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification

  • Chapter

Data Science and Big Data: An Environment of Computational Intelligence

Part of the book series: Studies in Big Data (SBD, volume 24)

Abstract

Facing hundreds of gigabytes of data has triggered a need to reconsider data management options. There is a pressing requirement to study data sets that exceed the capability of commonly used software tools to capture, curate and manage within a tolerable elapsed time, and that lie beyond the processing feasibility of a single-machine architecture. In addition to traditional structured data, the new avenue of NoSQL Big Data calls for experimental techniques and technologies and requires enterprises to re-integrate; this helps to discover the large hidden value in huge datasets that are complex, diverse and of massive scale. In many real-world applications, classification of imbalanced datasets is a priority concern. Standard classifier learning algorithms assume a balanced class distribution and equal misclassification costs; as a result, classification of datasets with imbalanced class distributions notably degrades the performance obtained by most standard classifier learning algorithms. Most classification methods focus on the two-class imbalance problem, even though multi-class imbalance problems exist in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest, LVH) with enhanced over_sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique, MEMMOT; Majority Minority Mix mean, MMMm; Nearest Farthest Neighbor_Mid, NFN-M; Clustering Minority Examples, CME; Majority Minority Cluster Based Under_Over Sampling Technique, MMCBUOS; Updated Class Purity Maximization, UCPM) to improve classification. The study takes two broad views: comparing the enhanced non-clustering techniques with prior work, and adopting a clustering-based approach for advanced O.S. techniques. The balanced data is then used to build a Random Forest (R.F.) classifier. The O.S. techniques are designed to be applied to imbalanced Big Data in a MapReduce environment. Experiments are suggested on Apache Hadoop and Apache Spark, using different datasets from the UCI and KEEL repositories. Geometric mean, F-measure, area under the curve (AUC), average accuracy and Brier score are used to measure classification performance.
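The abstract does not spell out the MEMMOT/MMMm procedures, so the following Python fragment is only a minimal sketch of the general workflow it describes: a hypothetical mean-based minority over_sampling helper (not the authors' algorithm), followed by Random Forest training and the evaluation metrics named above (geometric mean, F-measure, AUC, Brier score). Function names, parameters and the toy data are illustrative assumptions.

```python
# Sketch only: mean-based minority over_sampling + Random Forest + imbalance metrics.
# The helper below is a simplified, hypothetical stand-in for the chapter's O.S. techniques.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, brier_score_loss, recall_score


def mean_minority_oversample(X, y, minority_label, n_new, rng=None):
    """Create synthetic minority samples as midpoints of randomly chosen
    minority pairs (a mean-based, SMOTE-like interpolation; illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    idx_a = rng.integers(0, len(X_min), size=n_new)
    idx_b = rng.integers(0, len(X_min), size=n_new)
    X_new = (X_min[idx_a] + X_min[idx_b]) / 2.0      # midpoint of two minority samples
    y_new = np.full(n_new, minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Toy imbalanced data: 950 majority (class 0) vs. 50 minority (class 1) samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (950, 4)), rng.normal(1.5, 1.0, (50, 4))])
y = np.array([0] * 950 + [1] * 50)

# Balance the training pool, then fit a Random Forest on the balanced data.
X_bal, y_bal = mean_minority_oversample(X, y, minority_label=1, n_new=900, rng=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

# Metrics from the abstract (evaluated on the original pool to keep the sketch
# short; a held-out split or cross-validation would be used in practice).
pred = clf.predict(X)
proba = clf.predict_proba(X)[:, 1]
sensitivity = recall_score(y, pred, pos_label=1)
specificity = recall_score(y, pred, pos_label=0)
print("G-mean:", np.sqrt(sensitivity * specificity))
print("F-measure:", f1_score(y, pred))
print("AUC:", roc_auc_score(y, proba))
print("Brier score:", brier_score_loss(y, proba))
```

On a real Big Data workload, the same balancing step would run per partition (for example via Spark's mapPartitions) before the forest is trained, which is the MapReduce-style deployment the chapter projects.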



Author information

Correspondence to Sachin Subhash Patil.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Patil, S.S., Sonavane, S.P. (2017). Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification. In: Pedrycz, W., Chen, SM. (eds) Data Science and Big Data: An Environment of Computational Intelligence. Studies in Big Data, vol 24. Springer, Cham. https://doi.org/10.1007/978-3-319-53474-9_3

  • DOI: https://doi.org/10.1007/978-3-319-53474-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53473-2

  • Online ISBN: 978-3-319-53474-9

  • eBook Packages: Engineering, Engineering (R0)
