Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Journal of Classification

Abstract

This work addresses supervised classification of high-dimensional, highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates into the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks so that correlation within a block is maximized and correlation between variables in different blocks is minimized. The extended PCA makes both the low-dimensional projection and the clustering supervised. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance.

SNPs are a type of genetic variation in which a single deoxyribonucleic acid (DNA) building block, a nucleotide, differs. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and that they can act in isolation or jointly to affect a phenotype. Studying the contribution of genetics to infectious disease phenotypes is therefore crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the extension of PCA that incorporates conditional class moment estimates into the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach predicts malaria episodes with high accuracy.
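The pipeline outlined in the abstract (block partitioning, a class-informed projection, then LDA) can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration, not the authors' implementation: the block step is a simple adjacent-column correlation threshold standing in for the interval-graph method, the projection is one possible reading of "incorporating conditional class moment estimates" (the class-mean difference combined with principal directions of the class-centered data), and all function names, the threshold, and the simulated data are hypothetical.

```python
import numpy as np


def greedy_correlation_blocks(X, threshold=0.5):
    """Split adjacent columns into blocks (simplified stand-in, NOT the
    interval-graph method): a new block starts whenever the squared
    correlation between neighbouring columns falls below `threshold`."""
    r = np.corrcoef(X, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if r[j - 1, j] ** 2 >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks


def supervised_projection(X, y, n_components=2):
    """Orthonormal basis spanning the class-mean difference plus the leading
    principal directions of the class-centered data (two classes assumed)."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    delta = means[1] - means[0]
    centered = X - means[np.searchsorted(classes, y)]
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = np.column_stack([delta, vt[: n_components - 1].T])
    q, _ = np.linalg.qr(basis)
    return q


def fit_lda(Z, y):
    """Two-class Fisher LDA with a pooled covariance and equal priors."""
    classes = np.unique(y)
    z0, z1 = Z[y == classes[0]], Z[y == classes[1]]
    m0, m1 = z0.mean(axis=0), z1.mean(axis=0)
    centered = np.vstack([z0 - m0, z1 - m1])
    pooled = centered.T @ centered / (len(Z) - 2)
    w = np.linalg.solve(pooled, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return classes, w, b


def predict_lda(model, Z):
    """Assign the second class when the linear score is positive."""
    classes, w, b = model
    return classes[(Z @ w + b > 0).astype(int)]


# Synthetic data (hypothetical): three blocks of four correlated columns,
# with a mean shift in the first block for the second class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
latent = rng.normal(size=(120, 3))
X = np.repeat(latent, 4, axis=1) + 0.1 * rng.normal(size=(120, 12))
X[y == 1, :4] += 3.0

blocks = greedy_correlation_blocks(X)
Q = supervised_projection(X, y)
Z = X @ Q
model = fit_lda(Z, y)
accuracy = np.mean(predict_lda(model, Z) == y)
print(len(blocks), "blocks; training accuracy:", accuracy)
```

The accuracy printed here is measured on the training data and is therefore optimistic; a held-out split or cross-validation would be needed for a fair estimate. How the blocks feed into the projection is not specified in the abstract, so the sketch applies the projection to all columns at once.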

Data Availability

The data presented in this study are available on request from the corresponding author.

Author information

Corresponding author

Correspondence to Aboubacry Gaye.

Ethics declarations

Ethics Approval

Informed consent was obtained from all subjects involved in the study.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gaye, A., Ka Diongue, A., Sylla, S.N. et al. Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data. J Classif 41, 158–169 (2024). https://doi.org/10.1007/s00357-024-09463-5

  • DOI: https://doi.org/10.1007/s00357-024-09463-5
