Abstract
Data sets running to hundreds of gigabytes have forced a reconsideration of data management options. There is a pressing need to study data sets that lie beyond the capability of commonly used software tools to capture, curate and manage within a tolerable elapsed time, and beyond the processing feasibility of a single-machine architecture. In addition to traditional structured data, the new avenue of NoSQL Big Data calls for experimental techniques and technologies that enterprises must re-integrate, helping to discover large hidden values in data sets that are complex, diverse and of massive scale. In many real-world applications, the classification of imbalanced data sets is a priority concern. Standard classifier learning algorithms assume a balanced class distribution and equal misclassification costs; as a result, most of them show a notable drop in performance on data sets with an imbalanced class distribution. Moreover, most classification methods focus on the two-class imbalance problem, even though the multi-class imbalance problem is common in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest, LVH) with enhanced over-sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique, MEMMOT; Majority Minority Mix mean, MMMm; Nearest Farthest Neighbor_Mid, NFN-M; Clustering Minority Examples, CME; Majority Minority Cluster Based Under_Over Sampling Technique, MMCBUOS; Updated Class Purity Maximization, UCPM) to improve classification. The study takes two broad views: comparing the enhanced non-cluster techniques to prior work, and developing a clustering-based approach for advanced O.S. techniques. The balanced data is then used to build a Random Forest (R.F.) classifier. The O.S. techniques are designed to be applied to imbalanced Big Data in a MapReduce environment. Experiments are proposed on Apache Hadoop and Apache Spark, using data sets from the UCI/KEEL repositories. Geometric mean, F-measure, area under the curve (AUC), average accuracy, and Brier score are used to measure classification performance.
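The abstract names several over-sampling variants and evaluation metrics without giving their definitions. As a rough illustration only, the sketch below shows a generic SMOTE-style midpoint over-sampler (the exact MEMMOT/MMMm procedures are not specified here, so `mean_oversample` is a hypothetical stand-in) together with two of the listed metrics, G-mean and Brier score, which have standard definitions.

```python
import math
import random

def mean_oversample(minority, n_new, rng=None):
    """Create n_new synthetic minority points, each the feature-wise mean of
    two randomly chosen minority examples (SMOTE-style midpoint interpolation).
    Hypothetical sketch; not the chapter's exact MEMMOT algorithm."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        synthetic.append([(x + y) / 2.0 for x, y in zip(a, b)])
    return synthetic

def geometric_mean(tp, fn, tn, fp):
    """G-mean = sqrt(sensitivity * specificity); unlike plain accuracy,
    it penalizes a classifier that ignores the minority class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Example: triple a 3-example minority class, then score a toy prediction.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
synthetic = mean_oversample(minority, 4)
g = geometric_mean(tp=40, fn=10, tn=90, fp=10)   # sqrt(0.8 * 0.9)
b = brier_score([0.9, 0.2, 0.6], [1, 0, 1])      # (0.01 + 0.04 + 0.16) / 3
```

Each synthetic point lies on the segment between two real minority examples, so the class region is densified without inventing values outside its convex hull; the cluster-based variants named above would restrict the sampled pairs to within-cluster neighbours.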
© 2017 Springer International Publishing AG
Cite this chapter
Patil, S.S., Sonavane, S.P. (2017). Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification. In: Pedrycz, W., Chen, SM. (eds) Data Science and Big Data: An Environment of Computational Intelligence. Studies in Big Data, vol 24. Springer, Cham. https://doi.org/10.1007/978-3-319-53474-9_3
DOI: https://doi.org/10.1007/978-3-319-53474-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53473-2
Online ISBN: 978-3-319-53474-9
eBook Packages: Engineering (R0)