A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data

Almasi, Mehrdad; Saniee Abadeh, Mohammad

doi:10.1007/s10586-018-2812-9

Published: 06 July 2018

Volume 21, pages 1821–1847, (2018)

Cluster Computing

Mehrdad Almasi¹ & Mohammad Saniee Abadeh¹

Abstract

Knowledge discovery from big, high-dimensional datasets has become a popular research topic. Classification is a key task in bioinformatics, business intelligence, decision science, astronomy, physics, and other fields. Associative classifiers have attracted notable research interest in recent years because of their superior accuracy. For imbalanced big datasets, however, applying under-sampling to an associative classifier reduces accuracy, while applying over-sampling increases running time. Hence, there is a significant need for efficient associative classifiers for imbalanced big data problems. These classifiers should be able to handle challenges such as memory usage, running time, and efficient exploration of the search space. To this end, efficient calculation of rule measures is a primary objective for associative classifiers. In this paper, we propose a new efficient associative classifier for big imbalanced datasets. The proposed method is based on Rare-PEARs (a multi-objective evolutionary algorithm that efficiently discovers rare and reliable association rules) and evaluates rules in a distributed manner by using a new data storage format. This format simplifies the calculation of measures and is fully compatible with the MapReduce programming model. We applied the proposed method (RPII) to a well-known big dataset (ECBDL’14) and compared our results with seven other learning methods. The experimental results show that RPII outperforms the other methods in the sensitivity and final score measures (approximately 0.74 and 0.54, respectively). The results demonstrate that the proposed method is a good candidate for large-scale classification problems; furthermore, it achieves reasonable execution time when the target platform is a typical computer cluster.
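To make the idea of evaluating a rule in a distributed manner concrete, the following is a minimal single-machine sketch in Python of the generic map-and-reduce pattern: each data partition is mapped to a small tuple of confusion counts for one rule, and the tuples are summed in the reduce step. It is not the storage format or the RPII implementation described in the paper, neither of which is specified in the abstract. The interval-based rule representation, the function names, and the toy data are all hypothetical. Treating the "final score" as the product of sensitivity and specificity (TPR × TNR, the score used in the ECBDL’14 competition) is likewise an assumption.

```python
from functools import reduce
from typing import Dict, List, Tuple

# Hypothetical types for illustration only (not taken from the paper).
Record = Tuple[Dict[str, float], int]      # (features, label); label 1 = minority class
Rule = Dict[str, Tuple[float, float]]      # attribute name -> closed interval [low, high]
Counts = Tuple[int, int, int, int]         # (tp, fp, fn, tn)


def rule_fires(features: Dict[str, float], rule: Rule) -> bool:
    """Return True when every attribute named in the rule lies inside its interval."""
    return all(low <= features[name] <= high for name, (low, high) in rule.items())


def map_partition(partition: List[Record], rule: Rule) -> Counts:
    """Map step: scan one partition and emit its confusion counts for the rule."""
    tp = fp = fn = tn = 0
    for features, label in partition:
        fired = rule_fires(features, rule)
        if fired and label == 1:
            tp += 1
        elif fired:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return (tp, fp, fn, tn)


def reduce_counts(a: Counts, b: Counts) -> Counts:
    """Reduce step: element-wise sum, which is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3])


def measures(counts: Counts) -> Dict[str, float]:
    """Derive rule and classification measures from the aggregated counts."""
    tp, fp, fn, tn = counts
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    return {
        "support": tp / total if total else 0.0,
        "confidence": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumption: "final score" is taken to be TPR x TNR.
        "tpr_x_tnr": sensitivity * specificity,
    }


# Toy data: two partitions standing in for blocks stored on different nodes.
partitions: List[List[Record]] = [
    [({"x": 0.2}, 1), ({"x": 0.9}, 0), ({"x": 0.3}, 0)],
    [({"x": 0.1}, 1), ({"x": 0.8}, 1), ({"x": 0.7}, 0)],
]
rule: Rule = {"x": (0.0, 0.5)}             # "if 0.0 <= x <= 0.5 then class 1"

counts = reduce(reduce_counts, (map_partition(p, rule) for p in partitions), (0, 0, 0, 0))
m = measures(counts)
print(counts, m)
```

Because the reduce step is a plain element-wise sum, partitions can be processed in any order and on any node, which is what makes this kind of measure calculation a natural fit for MapReduce. Under the same TPR × TNR assumption, the reported sensitivity of about 0.74 and final score of about 0.54 would correspond to a specificity of roughly 0.73.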

Author information

Authors and Affiliations

Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran
Mehrdad Almasi & Mohammad Saniee Abadeh

Corresponding author

Correspondence to Mohammad Saniee Abadeh.

About this article

Cite this article

Almasi, M., Saniee Abadeh, M. A new MapReduce associative classifier based on a new storage format for large-scale imbalanced data. Cluster Comput 21, 1821–1847 (2018). https://doi.org/10.1007/s10586-018-2812-9

Received: 07 October 2017
Revised: 08 May 2018
Accepted: 22 May 2018
Published: 06 July 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s10586-018-2812-9
