SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

Gutiérrez, Pablo D.; Lastra, Miguel; Benítez, José M.; Herrera, Francisco

doi:10.1007/s13748-017-0128-2

Regular Paper
Published: 15 May 2017

Progress in Artificial Intelligence, Volume 6, pages 347–354 (2017)

Pablo D. Gutiérrez ORCID: orcid.org/0000-0002-0233-1554¹,
Miguel Lastra²,
José M. Benítez¹ &
…
Francisco Herrera¹

Abstract

Working with large amounts of data is now common, since our capacity to collect and store information has increased significantly. Extracting knowledge in these scenarios is commonly referred to as “Big Data,” and it is typically performed on large clusters running MapReduce platforms. Imbalanced classification poses a problem in both traditional and Big Data learning scenarios, and data sampling is one way to improve performance on imbalanced problems. A method based on commodity hardware can offload these computations from the expensive and highly demanded hardware that MapReduce platforms require. The characteristics of some sampling methods make them suitable for adaptation to commodity hardware, where they can exploit the parallel computation capabilities of graphics processing units (GPUs). SMOTE, one of the most popular oversampling methods, is based on the nearest neighbor rule. The proposed SMOTE-GPU efficiently handles large datasets (several million instances) on a wide variety of commodity hardware, including a laptop computer.
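To make the nearest-neighbor basis of SMOTE concrete, the following is a minimal CPU-only sketch in plain Python. It is not the authors' SMOTE-GPU implementation: the function names (`squared_distance`, `k_nearest_neighbors`, `smote`), the parameters, and the brute-force neighbor search are assumptions chosen for illustration. It also draws base instances at random, which simplifies the per-instance oversampling rate of the original SMOTE formulation. The neighbor search is the costly step, and it is the kind of computation a GPU version would be expected to parallelize.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest_neighbors(minority, index, k):
    """Indices of the k minority instances closest to minority[index].

    Brute-force search over all minority instances (hypothetical helper,
    quadratic in the number of minority instances overall).
    """
    distances = [
        (squared_distance(minority[index], other), j)
        for j, other in enumerate(minority)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between a randomly
    chosen minority instance and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    # Precompute the neighbor lists once; this is the expensive step.
    neighbors = [k_nearest_neighbors(minority, i, k) for i in range(len(minority))]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # base instance
        j = rng.choice(neighbors[i])       # one of its nearest neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(minority, n_synthetic=3, k=2):
        print(point)
```

Each synthetic point lies on the segment between a minority instance and one of its neighbors, so the new instances stay inside the region already occupied by the minority class.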

Notes

http://sci2s.ugr.es/GPU-SME-kNN.

Acknowledgements

This work was supported by the Spanish National Research Projects TIN2013-47210-P, TIN2014-57251-P and TIN2016-81113-R and by the Andalusian Regional Government Excellence Research Project P12-TIC-2958. P.D. Gutiérrez holds an FPI scholarship from the Spanish Ministry of Economy and Competitiveness (BES-2012-060450).

Author information

Authors and Affiliations

Department of Computer Science and Artificial Intelligence, CITIC-UGR, University of Granada, 18071, Granada, Spain
Pablo D. Gutiérrez, José M. Benítez & Francisco Herrera
Department of Software Engineering, CITIC-UGR, University of Granada, 18071, Granada, Spain
Miguel Lastra

Corresponding author

Correspondence to Pablo D. Gutiérrez.

About this article

Cite this article

Gutiérrez, P.D., Lastra, M., Benítez, J.M. et al. SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification. Prog Artif Intell 6, 347–354 (2017). https://doi.org/10.1007/s13748-017-0128-2

Received: 20 March 2017
Accepted: 27 April 2017
Published: 15 May 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s13748-017-0128-2
