Machine Learning, Volume 87, Issue 1, pp. 93–125

Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning

  • Fei Zheng
  • Geoffrey I. Webb
  • Pramuditha Suraweera
  • Liguang Zhu

Abstract

Semi-naive Bayesian techniques seek to improve the accuracy of naive Bayes (NB) by relaxing the attribute independence assumption. We present a new type of semi-naive Bayesian operation, Subsumption Resolution (SR), which efficiently identifies occurrences of the specialization-generalization relationship and eliminates generalizations at classification time. We extend SR to Near-Subsumption Resolution (NSR), which deletes near-generalizations in addition to generalizations. We develop two versions of SR: one that performs SR during training, called eager SR (ESR), and another that performs SR during testing, called lazy SR (LSR). We investigate the effect of ESR, LSR, NSR and conventional attribute elimination (BSE) on NB and Averaged One-Dependence Estimators (AODE), a powerful alternative to NB. BSE imposes very high training time overheads on NB and AODE, accompanied by varying decreases in classification time overheads. ESR, LSR and NSR impose high training time and test time overheads on NB. However, LSR imposes no extra training time overheads and only modest test time overheads on AODE, while ESR and NSR impose modest training and test time overheads on AODE. Our extensive experimental comparison on sixty UCI data sets shows that applying BSE, LSR or NSR to NB significantly improves both zero-one loss and RMSE, while applying BSE, ESR or NSR to AODE significantly improves zero-one loss and RMSE, and applying LSR to AODE significantly improves zero-one loss. The Friedman and Nemenyi tests show that AODE with ESR or NSR has a significant zero-one loss and RMSE advantage over Logistic Regression, and a zero-one loss advantage over Weka's LibSVM implementation with a grid parameter search on categorical data. AODE with LSR has a zero-one loss advantage over Logistic Regression and zero-one loss comparable to LibSVM's. Finally, we examine the circumstances under which the elimination of near-generalizations proves beneficial.
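The specialization-generalization test underlying SR can be sketched in a few lines. Value x_i is treated as a generalization of x_j when every training instance containing x_j also contains x_i (i.e. the estimate of P(x_i | x_j) is 1), in which case x_i adds no evidence beyond x_j and can be deleted before classification. The following is a minimal illustrative sketch, not the paper's implementation: the function name, data layout (tuples of categorical values) and the minimum-support threshold `MIN_COUNT` are hypothetical choices made here for demonstration.

```python
from collections import Counter
from itertools import combinations

# Illustrative minimum evidence before declaring a generalization;
# the paper's actual threshold handling may differ.
MIN_COUNT = 30

def find_generalizations(rows, test_instance):
    """Return the indices of attribute values in `test_instance` that are
    generalizations of another value in the same instance.

    x_i generalizes x_j when every training row containing x_j also
    contains x_i, i.e. count(x_i, x_j) == count(x_j), with count(x_j)
    large enough to trust the estimate.
    """
    n_attrs = len(test_instance)
    single = Counter()  # counts of (attribute index, value) pairs
    joint = Counter()   # counts of co-occurring pairs, indices ordered i < j
    for row in rows:
        vals = list(enumerate(row))
        single.update(vals)
        joint.update(combinations(vals, 2))

    drop = set()
    for i, j in combinations(range(n_attrs), 2):
        a, b = (i, test_instance[i]), (j, test_instance[j])
        both = joint[(a, b)]
        if single[b] >= MIN_COUNT and both == single[b]:
            drop.add(i)  # x_i generalizes x_j: delete the generalization
        elif single[a] >= MIN_COUNT and both == single[a]:
            drop.add(j)  # x_j generalizes x_i
    return drop

# Toy example in the spirit of the classic illustration: gender=female is a
# generalization of pregnant=yes, so it is dropped at classification time.
rows = [("yes", "female")] * 30 + [("no", "female")] * 10 + [("no", "male")] * 20
print(find_generalizations(rows, ("yes", "female")))  # {1}: drop gender=female
```

A classifier such as NB or AODE would then be applied to the remaining attribute values only; doing this per test instance corresponds to the lazy variant (LSR), while precomputing the pairwise counts and resolving at training time corresponds to ESR.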

Keywords

Classification · Naive Bayes · Semi-naive Bayes · Feature selection · AODE


Copyright information

© The Author(s) 2011

Authors and Affiliations

  • Fei Zheng
  • Geoffrey I. Webb
  • Pramuditha Suraweera
  • Liguang Zhu

  Faculty of Information Technology, Monash University, Clayton, Australia
