Abstract
The overwhelming amount of data now available in every field of research poses new problems for data mining and knowledge discovery methods. Because of this volume, most current data mining algorithms are inapplicable to many real-world problems. Two difficulties arise as the problem size grows. The first is running time: in many cases the time demands of the algorithm become so large that the method cannot be applied at all, an issue closely related to the time complexity of the method. The second is performance: even when the method is applicable, the size of the search space prevents an efficient search, and the resulting solutions are unsatisfactory. Two approaches have been used to deal with these difficulties: scaling up data mining algorithms and data reduction. However, because data reduction is itself a data mining task, it suffers from the same scalability problems. Thus, for many problems, especially those involving very large datasets, the only option is to scale up the data mining algorithm, and many efforts have been made to develop methods for scaling up existing algorithms. In this paper, we review the methods that have been used to address the problem of scalability. We focus on general ideas rather than specific implementations, to provide a general view of current approaches for scaling up data mining methods. A taxonomy of the algorithms is proposed, and many examples from different tasks are presented. Among the different techniques used for data mining, we pay special attention to evolutionary methods, because they have been used very successfully in many data mining tasks.
Cite this article
García-Pedrajas, N., de Haro-García, A. Scaling up data mining algorithms: review and taxonomy. Prog Artif Intell 1, 71–87 (2012). https://doi.org/10.1007/s13748-011-0004-4