Progress in Artificial Intelligence, Volume 6, Issue 4, pp 347–354

SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

  • Pablo D. Gutiérrez
  • Miguel Lastra
  • José M. Benítez
  • Francisco Herrera
Regular Paper

Abstract

Working with large amounts of data is now commonplace, since our capacity to collect and store information has increased significantly. The extraction of knowledge from such data is commonly referred to as “Big Data,” and it is performed on large clusters with MapReduce platforms. Imbalanced classification poses a problem in both traditional and Big Data learning scenarios. Data sampling is one way to improve performance on imbalanced problems. A method based on commodity hardware can offload these computations from the expensive and highly demanded hardware that MapReduce platforms require. The characteristics of some sampling methods make them suitable for adaptation to commodity hardware, taking advantage of the parallel computation capabilities of graphics processing units (GPUs). SMOTE, one of the most popular oversampling methods, is based on the nearest neighbor rule. The proposed SMOTE-GPU efficiently handles large datasets (several million instances) on a wide variety of commodity hardware, including a laptop computer.
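
As an illustration of the nearest-neighbor idea behind SMOTE, the following is a minimal CPU-only sketch in plain Python. It is not the SMOTE-GPU implementation described in this paper: the function names (`squared_distance`, `k_nearest`, `smote`) are hypothetical, the neighbor search is brute force, and base instances are drawn at random rather than visited in turn, which is a simplification. Each synthetic instance is placed at a random point on the segment between a minority instance and one of its k nearest minority neighbors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(minority, index, k):
    """Indices of the k nearest minority neighbors of minority[index].

    Brute-force search over all other minority instances (illustrative only).
    """
    others = [j for j in range(len(minority)) if j != index]
    others.sort(key=lambda j: squared_distance(minority[index], minority[j]))
    return others[:k]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic instances by interpolating between neighbors.

    Assumption: base instances are picked uniformly at random, a
    simplification of the per-instance loop in the usual formulation.
    """
    if len(minority) < 2:
        raise ValueError("at least two minority instances are required")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than instances
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(minority, i, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        base, neighbor = minority[i], minority[j]
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(minority_class, n_synthetic=3, k=2):
        print(point)
```

The neighbor search dominates the cost in this sketch, since every query compares against all minority instances; this is the kind of independent, data-parallel work that maps naturally onto a GPU.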

Keywords

Imbalanced classification, SMOTE, CUDA, Big Data

Notes

Acknowledgements

This work was supported by the Spanish National Research Projects TIN2013-47210-P, TIN2014-57251-P and TIN2016-81113-R and by the Andalusian Regional Government Excellence Research Project P12-TIC-2958. P.D. Gutiérrez holds an FPI scholarship from the Spanish Ministry of Economy and Competitiveness (BES-2012-060450).

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. Department of Computer Science and Artificial Intelligence, CITIC-UGR, University of Granada, Granada, Spain
  2. Department of Software Engineering, CITIC-UGR, University of Granada, Granada, Spain
