Artificial Intelligence Review

Volume 26, Issue 3, pp 159–190

Machine learning: a review of classification and combining techniques

  • S. B. Kotsiantis
  • I. D. Zaharakis
  • P. E. Pintelas


Abstract

Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. A large number of techniques have therefore been developed, drawing on Artificial Intelligence (logic-based and perceptron-based techniques) and on Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to test instances in which the values of the predictor features are known but the value of the class label is unknown. This paper describes various classification algorithms and a recent approach to improving classification accuracy: ensembles of classifiers.


Keywords: Classifiers · Data mining techniques · Intelligent data analysis · Learning algorithms
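The workflow the abstract describes, fitting a model to labeled training data and then predicting the unknown class labels of test instances, and the ensemble idea the paper surveys can both be sketched in a few lines. The sketch below uses scikit-learn and the Iris dataset purely as illustrative assumptions; the paper predates this library and prescribes no particular toolkit.

```python
# A minimal sketch of supervised classification plus one combining
# technique (bagging). scikit-learn and the Iris data are illustrative
# assumptions, not part of the original paper.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: predictor features X and known class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single classifier: a concise model of class labels in terms of features.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single decision tree accuracy:", tree.score(X_test, y_test))

# An ensemble of classifiers: many trees trained on bootstrap samples and
# combined by voting (bagging), one of the combining techniques reviewed.
bagged = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25, random_state=0
).fit(X_train, y_train)
print("bagged ensemble accuracy:", bagged.score(X_test, y_test))
```

Bagging reduces the variance of an unstable base learner such as a decision tree, which is why the combined vote typically matches or beats a single tree on held-out data.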



Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • S. B. Kotsiantis (1, 2)
  • I. D. Zaharakis (3)
  • P. E. Pintelas (1, 2)

  1. Department of Computer Science and Technology, University of Peloponnese, Peloponnese, Greece
  2. Educational Software Development Laboratory, Department of Mathematics, University of Patras, Patras, Greece
  3. Computer Technology Institute, Patras, Greece
