Investigation of Imbalanced Big Data Set Classification: Clustering Minority Samples Over Sampling Technique

Patil, Sachin; Sonavane, Shefali

doi:10.1007/978-981-15-4851-2_32

Sachin Patil & Shefali Sonavane

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1162)

Abstract

Most real-world data sets have a skewed class distribution, in contrast to well-established data sets: the number of instances of one class far exceeds that of the other classes. This uneven distribution produces imbalanced data sets, which pose extreme difficulty for learning procedures. In addition, the intrinsically complex features of such data have made their analysis an active area of focused research. Imbalanced class distributions are handled effectively by oversampling the minority class, an approach that is usually independent of the classifier. An oversampling technique, the Clustering Minority Samples Over Sampling Technique (CMSOT), is proposed to enhance the classification of imbalanced data sets. The proposed technique is implemented on Apache Hadoop in a MapReduce environment. The data sets are drawn mainly from the UCI repository. The study examines the effect on true positive rates in relation to the imbalance ratio, as well as the improvement in classification obtained from the generated pool of samples. The experimental results, together with the corresponding statistical analysis of the oversampled data sets, clearly show the superiority of the proposed technique over the selected benchmark techniques.
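
The abstract does not give the steps of CMSOT, so the following is only a minimal sketch of the general idea it names: cluster the minority samples, then generate synthetic points inside each cluster. It is not the authors' algorithm and it runs in a single process rather than on Hadoop MapReduce. The function names, the use of plain k-means, the interpolation rule, and the toy data are all assumptions made for illustration. The imbalance ratio is taken here as the majority count divided by the minority count, and the true positive rate as TP / (TP + FN).

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def true_positive_rate(y_true, y_pred, positive):
    """TP / (TP + FN) for the class treated as positive (the minority class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed clustering step); returns non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [points]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster became empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def oversample_minority(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between two minority
    samples drawn from the same cluster (hypothetical generation rule)."""
    rng = random.Random(seed)
    clusters = kmeans(minority, k, seed=seed)
    synthetic = []
    while len(synthetic) < n_new:
        cluster = rng.choice(clusters)
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy data: 100 majority points on a grid, 5 minority points in two groups.
majority = [(float(i % 10), float(i // 10)) for i in range(100)]
minority = [(20.0, 20.0), (21.0, 20.5), (20.5, 21.0), (30.0, 30.0), (31.0, 30.5)]

before = imbalance_ratio([0] * len(majority) + [1] * len(minority))
synthetic = oversample_minority(minority, len(majority) - len(minority))
after = imbalance_ratio([0] * len(majority) + [1] * (len(minority) + len(synthetic)))
print(before, after)
```

Because each synthetic point lies on a segment between two samples of the same cluster, the new points stay within the regions the minority class already occupies. Whether CMSOT uses the same rule is not stated in the abstract.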

Author information

Authors and Affiliations

Rajarambapu Institute of Technology, Rajaramnagar, Urun Islampur, 415409, MH, India
Sachin Patil
Walchand College of Engineering, Sangli, 416415, MH, India
Sachin Patil & Shefali Sonavane

Corresponding author

Correspondence to Sachin Patil.

Editor information

Editors and Affiliations

Department of Computer Engineering, Dr. Babasaheb Ambedkar Technological University, Lonere, Maharashtra, India
Prachi Deshpande
Machine Intelligence Research Labs (MIR Labs), Auburn, WA, USA
Ajith Abraham
Department of Electronics and Telecommunication Engineering, Dr. Babasaheb Ambedkar Technological University, Lonere, Maharashtra, India
Brijesh Iyer
School of Information Science and Engineering, University of Jinan, Jinan, Shandong, China
Kun Ma

About this paper

Cite this paper

Patil, S., Sonavane, S. (2021). Investigation of Imbalanced Big Data Set Classification: Clustering Minority Samples Over Sampling Technique. In: Deshpande, P., Abraham, A., Iyer, B., Ma, K. (eds) Next Generation Information Processing System. Advances in Intelligent Systems and Computing, vol 1162. Springer, Singapore. https://doi.org/10.1007/978-981-15-4851-2_32

DOI: https://doi.org/10.1007/978-981-15-4851-2_32
Published: 14 June 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-4850-5
Online ISBN: 978-981-15-4851-2
eBook Packages: Intelligent Technologies and Robotics (R0)
