Knowledge and Information Systems, Volume 50, Issue 2, pp 475–503

Selective AnDE for large data learning: a low-bias memory constrained approach

  • Shenglei Chen
  • Ana M. Martínez
  • Geoffrey I. Webb
  • Limin Wang
Regular Paper

Abstract

Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows flexible learning from out-of-core data by varying the value of n (the number of super parents), and is hence especially appropriate for learning from large quantities of data. The memory requirement of AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, the number of attributes is often large, and a high n is also desirable to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with a smaller memory requirement, we propose a memory-constrained selective AnDE algorithm, which makes two passes through the training examples. The first pass performs attribute selection on super parents according to the available memory, whereas the second pass learns an AnDE model whose super parents are restricted to the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A\(n'\)DE, where \(n' = n-1\), while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data; numerical data sets need to be discretized first.
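To make the memory pressure concrete: an AnDE model stores a count for every class, every combination of super-parent attributes and their joint values, and every child attribute value, so the table size grows on the order of \(\binom{a}{n+1} v^{n+1}\) for a attributes with roughly v values each (per the AnDE literature). The sketch below illustrates the two-pass idea for the simplest non-trivial case, n = 1 (a selective AODE). It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mutual-information ranking, and the simple additive smoothing are all choices made here for concreteness.

    # A minimal sketch of the two-pass scheme for n = 1 (a selective AODE).
    # Hypothetical names throughout; the ranking criterion and the smoothing
    # are assumptions for illustration, not the authors' exact method.
    from collections import Counter
    import math

    def mutual_info(col, y):
        """Empirical mutual information I(X_j; Y) between one attribute and the class."""
        n = len(y)
        cx, cy, cxy = Counter(col), Counter(y), Counter(zip(col, y))
        return sum((c / n) * math.log(c * n / (cx[v] * cy[k]))
                   for (v, k), c in cxy.items())

    def pass1_select(X, y, max_parents):
        """Pass 1: rank attributes by relevance to the class and keep only as
        many super parents as the memory budget (max_parents) allows."""
        scores = [(mutual_info([row[j] for row in X], y), j)
                  for j in range(len(X[0]))]
        return sorted(j for _, j in sorted(scores, reverse=True)[:max_parents])

    def pass2_counts(X, y, parents):
        """Pass 2: collect the joint counts AODE needs, but only for the
        selected super parents, which bounds the size of the count tables."""
        cls, cp, cpc = Counter(y), Counter(), Counter()
        for row, c in zip(X, y):
            for p in parents:
                cp[c, p, row[p]] += 1                    # count(c, x_p)
                for j, v in enumerate(row):
                    if j != p:
                        cpc[c, p, row[p], j, v] += 1     # count(c, x_p, x_j)
        return cls, cp, cpc

    def predict(row, cls, cp, cpc, parents, n_train, m=1.0):
        """Average the super-parent models: sum over p of
        P(c, x_p) * prod_j P(x_j | c, x_p), with additive smoothing m."""
        best, best_score = None, -1.0
        for c in cls:
            score = 0.0
            for p in parents:
                estimate = (cp[c, p, row[p]] + m) / (n_train + m)  # P(c, x_p)
                for j, v in enumerate(row):
                    if j != p:
                        estimate *= (cpc[c, p, row[p], j, v] + m) / (cp[c, p, row[p]] + m)
                score += estimate
            if score > best_score:
                best, best_score = c, score
        return best

The point of the split is that pass1_select fixes the model's footprint before any large table is allocated, so the expensive counting pass can stream training examples from disk while maintaining tables only for the selected super parents rather than for all \(\binom{a}{n}\) candidate parent sets.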

Keywords

Attribute selection · Bayesian classification · AnDE · Large data


Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  1. Department of E-Commerce, Nanjing Audit University, Nanjing, China
  2. Faculty of Information Technology, Monash University, Melbourne, Australia
  3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
  4. Aalborg University, Aalborg, Denmark
