Sparse Feature Learning Using Ensemble Model for Highly-Correlated High-Dimensional Data

  • Ali Braytee
  • Ali Anaissi
  • Paul J. Kennedy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11303)


High-dimensional, highly correlated data arise in several domains, such as genomics. Many feature selection techniques treat correlated features as redundant and therefore remove them. Several studies have investigated how to interpret correlated features in domains such as genomics, but the classification capability of correlated feature groups remains a point of interest in several domains. In this paper, a novel method is proposed that integrates ensemble feature ranking with co-expression networks to identify the optimal features for classification. The main advantage of the proposed method is that it does not treat correlated features as redundant; instead, it shows that the selected correlated features are important for improving classification performance. A series of experiments on five high-dimensional, highly correlated datasets with different imbalance ratios shows that the proposed method outperforms state-of-the-art methods.
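
The general idea of combining an ensemble ranking with correlation-based feature groups can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the base ranker or how the co-expression network is built, so everything here is an assumption. Features are grouped as connected components of a thresholded Pearson correlation graph (a simple stand-in for co-expression modules), each feature is scored by a class-mean separation statistic averaged over bootstrap resamples (a stand-in for an ensemble ranker), and whole groups are kept rather than pruned. All function names, the threshold 0.8, and the synthetic data are hypothetical.

```python
import random
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def correlation_groups(X, threshold=0.8):
    """Group features into connected components of the graph |r| >= threshold."""
    n_features = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_features)]
    parent = list(range(n_features))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(pearson(cols[i], cols[j])) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for j in range(n_features):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


def feature_scores(X, y):
    """Per-feature separation score: |difference of class means| / overall std."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        sd = pstdev(col)
        scores.append(abs(mean(pos) - mean(neg)) / sd if sd > 0 else 0.0)
    return scores


def ensemble_scores(X, y, n_rounds=25, seed=0):
    """Average the per-feature scores over bootstrap resamples of the samples."""
    rng = random.Random(seed)
    n = len(X)
    totals = [0.0] * len(X[0])
    used = 0
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # skip resamples that lost a class
            continue
        s = feature_scores([X[i] for i in idx], yb)
        totals = [t + v for t, v in zip(totals, s)]
        used += 1
    return [t / used for t in totals] if used else totals


def select_groups(X, y, top_k=1, threshold=0.8):
    """Rank correlation groups by mean ensemble score; keep every member of the top groups."""
    scores = ensemble_scores(X, y)
    groups = correlation_groups(X, threshold)
    ranked = sorted(groups, key=lambda g: mean(scores[j] for j in g), reverse=True)
    return ranked[:top_k]


# Synthetic data (hypothetical): features 0 and 1 are correlated and informative,
# features 2 and 3 are independent noise.
rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = []
for label in y:
    signal = label * 3.0 + rng.gauss(0, 0.3)
    X.append([signal, signal + rng.gauss(0, 0.1), rng.gauss(0, 1), rng.gauss(0, 1)])

selected = select_groups(X, y, top_k=1)
print(selected)
```

The design point worth noting is the last step: a redundancy-based filter would keep only one of two strongly correlated features, whereas this sketch scores and returns the correlated group as a unit.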


Keywords: Feature selection, High-dimensional data, Feature correlation



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Biomedical Engineering, University of Technology Sydney, Ultimo, Australia
  2. School of Software, University of Technology Sydney, Ultimo, Australia
  3. The University of Sydney, Sydney, Australia
