Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Gaye, Aboubacry; Ka Diongue, Abdou; Sylla, Seydou Nourou; Diarra, Maryam; Diallo, Amadou; Talla, Cheikh; Loucoubar, Cheikh

doi:10.1007/s00357-024-09463-5

Published: 28 February 2024

Volume 41, pages 158–169 (2024)

Journal of Classification

Aboubacry Gaye¹,² (ORCID: orcid.org/0000-0002-8550-7737),
Abdou Ka Diongue¹,
Seydou Nourou Sylla³,
Maryam Diarra²,
Amadou Diallo²,
Cheikh Talla² &
Cheikh Loucoubar²

Abstract

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping the variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA then performs the low-dimensional projection and supervised clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance.

SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. Then, we use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs determining malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves significantly high accuracy in predicting malaria episodes.
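The two steps described in the abstract, grouping correlated variables into blocks and projecting with class-conditional moments, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `correlation_blocks` and `supervised_projection`, the threshold of 0.5, and the synthetic data are all assumptions made for illustration. The block step is a simple greedy pass over adjacent columns, used as a stand-in for the interval graph method. The projection step reads "conditional class moment estimates" as class-mean differences combined with the principal directions of class-centered data, which is one possible interpretation.

```python
import numpy as np


def correlation_blocks(X, threshold=0.5):
    """Greedily partition adjacent columns into correlation blocks.

    A new block starts whenever the squared correlation between a column
    and the previous one falls below `threshold`. This is a simplified
    stand-in for LD block detection, NOT the interval graph method.
    Constant columns give NaN correlations and therefore start a new block.
    """
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        r = np.corrcoef(X[:, j - 1], X[:, j])[0, 1]
        if r ** 2 >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks


def supervised_projection(X, y, n_components=2):
    """Projection built from class-mean differences plus PCA directions."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Differences between class means, with the first class as reference.
    deltas = (means[1:] - means[0]).T              # shape (p, K - 1)
    # Center every sample by its own class mean, then take the leading
    # right singular vectors (principal directions of the centered data).
    class_index = np.searchsorted(classes, y)
    centered = X - means[class_index]
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n_pca = max(n_components - deltas.shape[1], 0)
    directions = np.hstack([deltas, vt[:n_pca].T])
    # Orthonormalize so the projection has orthonormal columns.
    q, _ = np.linalg.qr(directions)
    return q[:, :n_components]


# Synthetic example (assumed data): three blocks of four correlated columns,
# with a class-dependent shift in the first block only.
rng = np.random.default_rng(0)
n, p = 60, 12
y = np.repeat([0, 1], n // 2)
latent = rng.normal(size=(n, 3))
X = np.repeat(latent, 4, axis=1) + 0.1 * rng.normal(size=(n, p))
X[:, :4] += 2.0 * y[:, None]

blocks = correlation_blocks(X, threshold=0.5)
W = supervised_projection(X, y, n_components=2)
Z = X @ W  # low-dimensional representation, one row per individual
print("blocks:", blocks)
print("projected shape:", Z.shape)
```

In a block-wise pipeline, the projection would presumably be computed per block or on block summaries, and the resulting low-dimensional scores passed to a classifier such as LDA. That wiring is left out here because the abstract does not specify it.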

Data Availability

The data presented in this study are available on request from the corresponding author.

Author information

Authors and Affiliations

Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis, Saint Louis, Senegal
Aboubacry Gaye & Abdou Ka Diongue
Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar, 220, Dakar, Senegal
Aboubacry Gaye, Maryam Diarra, Amadou Diallo, Cheikh Talla & Cheikh Loucoubar
Information and Communication Technologies for Development, Alioune Diop University of Bambey, Bambey, Senegal
Seydou Nourou Sylla

Corresponding author

Correspondence to Aboubacry Gaye.

Ethics declarations

Ethics Approval

Informed consent was obtained from all subjects involved in the study.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gaye, A., Ka Diongue, A., Sylla, S.N. et al. Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data. J Classif 41, 158–169 (2024). https://doi.org/10.1007/s00357-024-09463-5

Accepted: 02 February 2024
Published: 28 February 2024
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00357-024-09463-5
