
Data Mining and Knowledge Discovery, Volume 31, Issue 2, pp 465–501

Multiple Bayesian discriminant functions for high-dimensional massive data classification

  • Jianfei Zhang
  • Shengrui Wang
  • Lifei Chen
  • Patrick Gallinari

Abstract

Complex sample distributions concealed in high-dimensional, massive sample-size data challenge all current classification methods for data mining. Samples within a class usually do not uniformly fill a certain (sub)space but are individually concentrated in certain regions of diverse feature subspaces, revealing what we call class dispersion. Applied to such complex data, current classifiers inherently suffer from either high complexity or weak classification ability, owing to the imbalance between the flexibility and the generalization ability of their discriminant functions. To address this concern, we propose a novel representation of discriminant functions in Bayesian inference that allows multiple Bayesian decision boundaries per class, each in its individual subspace. For this purpose, we design a learning algorithm that incorporates the naive Bayes and feature-weighting approaches into structural risk minimization to learn multiple Bayesian discriminant functions for each class, thus combining the simplicity and effectiveness of naive Bayes with the benefits of feature weighting in handling high-dimensional data. The proposed learning scheme affords a recursive algorithm for exploring class density distributions for Bayesian estimation, and an automated approach for selecting powerful discriminant functions while keeping the complexity of the classifier low. Experimental results on real-world data comprising millions of samples and features demonstrate the promising performance of our approach.
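To make the abstract's central construction concrete, the sketch below is our illustration, not the authors' algorithm: every function name, the 2-means split, and the inverse-variance weighting heuristic are assumptions standing in for the paper's recursive density exploration and learned weights. It shows how multiple feature-weighted naive Bayes discriminant functions per class could be built, with each component's weights acting as a soft subspace selection, and a sample assigned to the class whose best component scores highest.

```python
# Minimal sketch of "multiple Bayesian discriminant functions per class".
# NOT the paper's algorithm: the component split and the feature weights
# are illustrative assumptions.
import numpy as np

def split_two_means(X, iters=10):
    """Crude 2-means split exposing within-class dispersion."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)  # (n, 2) squared distances
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return [X[labels == k] for k in range(2) if (labels == k).sum() > 1]

def fit_component(Xc, eps=1e-6):
    """Gaussian parameters plus heuristic feature weights: low-variance
    features dominate, approximating a soft subspace selection."""
    mu, var = Xc.mean(0), Xc.var(0) + eps
    w = 1.0 / var
    return mu, var, w * (len(w) / w.sum())  # weights normalized to mean 1

def fit(X, y):
    """One log-prior and several weighted naive Bayes components per class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        parts = split_two_means(Xc) if len(Xc) > 4 else [Xc]
        model[c] = (np.log(len(Xc) / len(X)),
                    [fit_component(p) for p in parts])
    return model

def predict(model, x):
    """Assign x to the class whose best component's weighted score wins."""
    def score(prior, mu, var, w):
        ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return prior + (w * ll).sum()  # feature-weighted naive Bayes score
    return max(model, key=lambda c: max(score(model[c][0], *comp)
                                        for comp in model[c][1]))

# Toy usage: two well-separated Gaussian classes in 10 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)
print(predict(fit(X, y), X[0]))  # expected: 0
```

Taking the maximum over a class's components yields a piecewise decision boundary, one piece per subspace component, which is the extra flexibility the abstract attributes to allowing multiple Bayesian decision boundaries per class.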

Keywords

Decision boundaries · Naive Bayes · Feature weighting · High-dimensional massive data · Class dispersion


Acknowledgments

We would like to thank Carol Harris for significantly improving this paper. This work has been supported by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) to Shengrui Wang under Grant No. 396097-2010, and from the Natural Science Foundation of Fujian Province of China to Lifei Chen under Grant No. 2015J01238. We also thank UPMC (Université Pierre et Marie Curie) for financial support enabling the collaboration between Shengrui Wang and Patrick Gallinari. Shengrui Wang is also partly supported by the Natural Science Foundation of China (NSFC) under Grant No. 61170130.


Copyright information

© The Author(s) 2016

Authors and Affiliations

  • Jianfei Zhang (1)
  • Shengrui Wang (1) (corresponding author)
  • Lifei Chen (2)
  • Patrick Gallinari (3)

  1. ProspectUS Laboratoire, Département d’Informatique, Université de Sherbrooke, Sherbrooke, Canada
  2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, China
  3. Laboratoire d’Informatique de Paris 6 (LIP6), Université Pierre et Marie Curie, Paris, France
