Abstract
New developments in computation have enabled an explosion in both data generation and storage. The high value hidden within this large volume of data has attracted a growing number of researchers to the topic of Big Data analytics. The main difference between addressing Big Data applications and carrying out traditional data mining (DM) tasks is scalability. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way (supported by a distributed file system) that is suited to commodity hardware. Apart from the difficulties in addressing the Big Data problem itself, we must take into account that the events of interest might occur infrequently. Mining rare classes is already challenging in standard classification tasks; combining it with high volumes of data imposes a strong constraint on the development of solutions that are both accurate and scalable. To present this topic, the chapter is organized as follows. First, Sect. 13.1 provides a quick overview of Big Data analytics in the context of imbalanced classification. Then, Sect. 13.2 presents the topic of Big Data in detail, focusing on the MapReduce programming model, the Spark framework, and the software libraries that include Big Data implementations of machine learning (ML) algorithms. Section 13.3 gives an overview of the works that address imbalanced classification for Big Data problems. Then, Sect. 13.4 discusses the challenges and open problems in imbalanced Big Data classification. Finally, Sect. 13.5 summarizes and concludes the chapter.
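As a rough illustration of the “divide-and-conquer” idea mentioned in the abstract, the following minimal Python sketch counts class labels over independent partitions (map) and merges the partial counts (reduce) to obtain the imbalance ratio. It is a single-machine toy, not the chapter's method and not the Hadoop or Spark API: the function names, the round-robin partitioning, and the toy data are all assumptions made for illustration, and fault tolerance and the distributed file system are not modelled.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Divide step: deal records round-robin into num_partitions chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_count_labels(partition):
    """Map step: count class labels in one partition, independently of the others."""
    return Counter(label for _, label in partition)


def reduce_counts(left, right):
    """Reduce step: merge two partial label counts into one."""
    return left + right


def class_distribution(records, num_partitions=4):
    """Run the map step on every partition and fold the results together."""
    partitions = split_into_partitions(records, num_partitions)
    # On a real cluster each map call would run on a different node.
    partial_counts = map(map_count_labels, partitions)
    return reduce(reduce_counts, partial_counts, Counter())


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


# Hypothetical toy data: (features, label) pairs with a rare positive class.
toy_records = [((i,), "negative") for i in range(95)]
toy_records += [((i,), "positive") for i in range(5)]

toy_counts = class_distribution(toy_records)
toy_ratio = imbalance_ratio(toy_counts)
```

Because each partition is counted without looking at the others, a partition may contain very few minority examples, or none at all; this is one reason why rare classes and data partitioning interact badly.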
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F. (2018). Imbalanced Classification for Big Data. In: Learning from Imbalanced Data Sets. Springer, Cham. https://doi.org/10.1007/978-3-319-98074-4_13
DOI: https://doi.org/10.1007/978-3-319-98074-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98073-7
Online ISBN: 978-3-319-98074-4
eBook Packages: Computer Science, Computer Science (R0)