Scaling up data mining algorithms: review and taxonomy

García-Pedrajas, Nicolás; de Haro-García, Aida

doi:10.1007/s13748-011-0004-4

Review
Published: 13 January 2012

Progress in Artificial Intelligence, Volume 1, pages 71–87 (2012)

Nicolás García-Pedrajas¹ &
Aida de Haro-García¹

Abstract

The overwhelming amount of data now available in every field of research poses new problems for data mining and knowledge discovery methods. Because of this volume, most current data mining algorithms are inapplicable to many real-world problems: they become ineffective when the problem size becomes very large. Two distinct issues arise. First, the running time demanded by the algorithm becomes so large that the method cannot be applied as the problem grows; this issue is closely related to the time complexity of the method. Second, even when the method remains applicable, the size of the search space prevents efficient execution, and the resulting solutions are unsatisfactory.

Two approaches have been used to deal with these problems: scaling up data mining algorithms and data reduction. However, because data reduction is itself a data mining task, it also suffers from scalability problems. Thus, for many problems, especially those involving very large datasets, the only way forward is to scale up the data mining algorithm, and many efforts have been made to develop methods for scaling up existing algorithms.

In this paper, we review the methods that have been used to address the problem of scalability. We focus on general ideas rather than specific implementations, in order to give a general view of current approaches to scaling up data mining methods. A taxonomy of the algorithms is proposed, and many examples from different tasks are presented. Among the different techniques used for data mining, we pay special attention to evolutionary methods, because they have been used very successfully in many data mining tasks.
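
As a rough illustration of why time complexity limits applicability, the sketch below counts the instance pairs that a method with quadratic cost would examine, and compares that with running the same method independently on disjoint subsets of the data. The quadratic cost model, the partitioning scheme, and the function names `pairwise_comparisons` and `partitioned_comparisons` are assumptions made for this example; the abstract does not prescribe any of them.

```python
def pairwise_comparisons(n: int) -> int:
    """Unordered instance pairs examined by a hypothetical O(n^2) method."""
    return n * (n - 1) // 2


def partitioned_comparisons(n: int, k: int) -> int:
    """Pairs examined when n instances are split into k near-equal disjoint
    subsets and the same quadratic method is run on each subset separately.
    Pairs that span two subsets are never examined (assumed trade-off)."""
    base, extra = divmod(n, k)
    # 'extra' subsets receive one additional instance so sizes sum to n.
    sizes = [base + 1] * extra + [base] * (k - extra)
    return sum(pairwise_comparisons(size) for size in sizes)


if __name__ == "__main__":
    n = 1_000_000
    for k in (1, 10, 100):
        print(f"subsets={k:>3}  pairs={partitioned_comparisons(n, k):,}")
```

Under these assumptions, one million instances require roughly 5 × 10^11 pair evaluations when processed as a whole, and roughly 5 × 10^9 when split into 100 subsets. The saving comes at the price of ignoring cross-subset pairs, so the partial results would still need to be combined in some way.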

Author information

Authors and Affiliations

Computational Intelligence and Bioinformatics Research Group, University of Córdoba, Córdoba, Spain
Nicolás García-Pedrajas & Aida de Haro-García

Corresponding author

Correspondence to Nicolás García-Pedrajas.

About this article

Cite this article

García-Pedrajas, N., de Haro-García, A. Scaling up data mining algorithms: review and taxonomy. Prog Artif Intell 1, 71–87 (2012). https://doi.org/10.1007/s13748-011-0004-4

Received: 04 June 2011
Accepted: 26 September 2011
Published: 13 January 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s13748-011-0004-4