
Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification

  • Chapter

Data Science and Big Data: An Environment of Computational Intelligence

Part of the book series: Studies in Big Data (SBD, volume 24)

Abstract

Facing hundreds of gigabytes of data has triggered a need to reconsider data management options. There is a pressing requirement to study data sets that exceed the capability of commonly used software tools to capture, curate and manage within a tolerable elapsed time, and that lie beyond the processing feasibility of a single-machine architecture. In addition to traditional structured data, the new avenue of NoSQL Big Data calls for experimental techniques and technologies and requires enterprises to re-integrate; this helps to discover the large hidden value in huge datasets that are complex, diverse and of massive scale. In many real-world applications, classification of imbalanced datasets is a priority concern. Standard classifier learning algorithms assume a balanced class distribution and equal misclassification costs; as a result, classification of datasets with imbalanced class distributions notably degrades the performance obtained by most standard classifier learning algorithms. Most classification methods focus on the two-class imbalance problem, even though multi-class imbalance problems exist in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest, LVH) with enhanced over_sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique, MEMMOT; Majority Minority Mix mean, MMMm; Nearest Farthest Neighbor_Mid, NFN-M; Clustering Minority Examples, CME; Majority Minority Cluster Based Under_Over Sampling Technique, MMCBUOS; Updated Class Purity Maximization, UCPM) to improve classification. The study takes two broad views: comparing the enhanced non-clustering techniques with prior work, and adopting a clustering-based approach for advanced O.S. techniques. The balanced data is then used to build a Random Forest (R.F.) classifier. The O.S. techniques are designed to be applied to imbalanced Big Data in a MapReduce environment. Experiments are suggested on Apache Hadoop and Apache Spark, using different datasets from the UCI and KEEL repositories. Geometric mean, F-measure, area under the curve (AUC), average accuracy and Brier score are used to measure classification performance.
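The abstract does not spell out the MEMMOT/MMMm procedures, so the following Python fragment is only a minimal sketch of the general workflow it describes: a hypothetical mean-based minority over_sampling helper (not the authors' algorithm), followed by Random Forest training and the evaluation metrics named above (geometric mean, F-measure, AUC, Brier score). Function names, parameters and the toy data are illustrative assumptions.

```python
# Sketch only: mean-based minority over_sampling + Random Forest + imbalance metrics.
# The helper below is a simplified, hypothetical stand-in for the chapter's O.S. techniques.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, brier_score_loss, recall_score


def mean_minority_oversample(X, y, minority_label, n_new, rng=None):
    """Create synthetic minority samples as midpoints of randomly chosen
    minority pairs (a mean-based, SMOTE-like interpolation; illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    idx_a = rng.integers(0, len(X_min), size=n_new)
    idx_b = rng.integers(0, len(X_min), size=n_new)
    X_new = (X_min[idx_a] + X_min[idx_b]) / 2.0      # midpoint of two minority samples
    y_new = np.full(n_new, minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Toy imbalanced data: 950 majority (class 0) vs. 50 minority (class 1) samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (950, 4)), rng.normal(1.5, 1.0, (50, 4))])
y = np.array([0] * 950 + [1] * 50)

# Balance the training pool, then fit a Random Forest on the balanced data.
X_bal, y_bal = mean_minority_oversample(X, y, minority_label=1, n_new=900, rng=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

# Metrics from the abstract (evaluated on the original pool to keep the sketch
# short; a held-out split or cross-validation would be used in practice).
pred = clf.predict(X)
proba = clf.predict_proba(X)[:, 1]
sensitivity = recall_score(y, pred, pos_label=1)
specificity = recall_score(y, pred, pos_label=0)
print("G-mean:", np.sqrt(sensitivity * specificity))
print("F-measure:", f1_score(y, pred))
print("AUC:", roc_auc_score(y, proba))
print("Brier score:", brier_score_loss(y, proba))
```

On a real Big Data workload, the same balancing step would run per partition (for example via Spark's mapPartitions) before the forest is trained, which is the MapReduce-style deployment the chapter projects.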



Author information

Correspondence to Sachin Subhash Patil.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Patil, S.S., Sonavane, S.P. (2017). Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification. In: Pedrycz, W., Chen, SM. (eds) Data Science and Big Data: An Environment of Computational Intelligence. Studies in Big Data, vol 24. Springer, Cham. https://doi.org/10.1007/978-3-319-53474-9_3

  • DOI: https://doi.org/10.1007/978-3-319-53474-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53473-2

  • Online ISBN: 978-3-319-53474-9

  • eBook Packages: Engineering, Engineering (R0)
