Biclustering by sparse canonical correlation analysis

Pimentel, Harold; Hu, Zhiyue; Huang, Haiyan

doi:10.1007/s40484-017-0127-0

Biclustering by sparse canonical correlation analysis

Research Article
Published: 09 February 2018

Volume 6, pages 56–67, (2018)
Cite this article

Download PDF

Quantitative Biology

Biclustering by sparse canonical correlation analysis

Download PDF

Harold Pimentel¹,
Zhiyue Hu² &
Haiyan Huang²

730 Accesses
3 Citations
Explore all metrics

Abstract

Background

Developing appropriate computational tools to distill biological insights from large-scale gene expression data has been an important part of systems biology. Considering that gene relationships may change or only exist in a subset of collected samples, biclustering that involves clustering both genes and samples has become increasingly important, especially when the samples are pooled from a wide range of experimental conditions.

Methods

In this paper, we introduce a new biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong “group interactions” under certain subsets of samples. Group interactions are defined by strong partial correlations, or equivalently, conditional dependencies between EFs after removing the influences of a set of other functionally related EFs. Our new biclustering method, named SCCA-BC, extends an existing method for group interaction inference, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning of the gene expression data set.

Results

SCCA-BC gives sensible results on real data sets and outperforms most existing methods in simulations. Software is available at https://github.com/pimentel/scca-bc.

Conclusions

SCCA-BC seems to work in numerous conditions and the results seem promising for future extensions. SCCA-BC has the ability to find different types of bicluster patterns, and it is especially advantageous in identifying a bicluster whose elements share the same progressive and multivariate normal distribution with a dense covariance matrix.

Article PDF

Identifying gene-specific subgroups: an alternative to biclustering

Article Open access 03 December 2019

A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules

Article Open access 23 June 2017

A systematic comparative evaluation of biclustering techniques

Article Open access 23 January 2017

References

Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868
Article CAS PubMed PubMed Central Google Scholar
Lazzeroni, L. and Owen, A. (2002) Plaid models for gene expression data. Stat. Sin., 12, 61–86
Google Scholar
Kluger, Y., Basri, R., Chang, J. T. and Gerstein, M. (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res., 13, 703–716
Article CAS PubMed PubMed Central Google Scholar
Bhattacharya, A. and De, R. K. (2009) Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics, 25, 2795–2801
Article CAS PubMed Google Scholar
Nepomuceno, J. A., Troncoso, A. and Aguilar-Ruiz, J. S. (2011) Biclustering of gene expression data by correlation-based scatter search. BioData Min., 4, 3
Article PubMed PubMed Central Google Scholar
Gao, Q., Ho, C., Jia, Y., Li, J. J. and Huang, H. (2012) Biclustering of linear patterns in gene expression data. J. Comput. Biol., 19, 619–631
Article CAS PubMed PubMed Central Google Scholar
Ben-Dor, A., Chor, B., Karp, R. and Yakhini, Z. (2003) Discovering local structure in gene expression data: the orderpreserving submatrix problem. J. Comput. Biol., 10, 373–384
Article CAS PubMed Google Scholar
Liu, J. and Wang, W. (2003) Op-cluster: Clustering by tendency in high dimensional space. In Third IEEE International Conference on Data Mining, 2003. ICDM 2003, pp. 187–194 IEEE
Google Scholar
Wang, Y. X. R., Jiang, K., Feldman, L. J., Bickel, P. J., and Huang, H. (2015) Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis. Ann. Appl. Stat. 9, 300–323
Article Google Scholar
Tan, K. M. and Witten, D. M. (2014) Sparse biclustering of transposable data. J. Comput. Graph. Stat., 23, 985–1008
Article PubMed PubMed Central Google Scholar
Turner, H., Bailey, T. and Krzanowski, W. (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Comput. Stat. Data Anal., 48, 235–254
Article Google Scholar
Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010) Biclustering via sparse singular value decomposition. Biometrics, 66, 1087–1095
Article PubMed Google Scholar
Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bhlmann, P., Gruissem, W., Hennig, L., Thiele, L. and Zitzler, E. (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22, 1122–1129
Article PubMed Google Scholar
Higham, N. J. (2002) Computing the nearest correlation matrix–a problem from finance. IMA J. Numer. Anal., 22, 329–343
Article Google Scholar
St Johnston, D. (2002) The art and design of genetic screens: Drosophila melanogaster. Nat. Rev. Genet., 3, 176–188
Article Google Scholar
Jorgensen, E. M. and Mango, S. E. (2002) The art and design of genetic screens: Caenorhabditis elegans. Nat. Rev.Genet., 3, 356–369
Article CAS PubMed Google Scholar
Brown, J. B., Boley, N., Eisman, R., May, G. E., Stoiber, M. H., Duff, M. O., Booth, B. W., Wen, J., Park, S., Suzuki, A. M., et al. (2014) Diversity and dynamics of the Drosophila transcriptome. Nature, 512, 393–399
Article CAS PubMed PubMed Central Google Scholar
Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321–377
Article Google Scholar
Lee, W., Lee, D., Lee, Y. and Pawitan, Y. (2011) Sparse canonical covariance analysis for high-throughput data. Stat. Appl. Genet. Mol. Biol., 10
Tibshirani, R., Walther, G. and Hastie, T. (2001) Estimating the number of clusters in a dataset via the gap statistic. J. R. Stat. Soc. B, 63, 411–423
Article Google Scholar
Anderson, T. W. (1958) An Introduction to Multivariate Statistical Analysis. New York: Wiley
Google Scholar
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 267–288
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of California, Berkeley, CA, 94720, USA
Harold Pimentel
Department of Statistics, University of California, Berkeley, CA, 94720, USA
Zhiyue Hu & Haiyan Huang

Authors

Harold Pimentel
View author publications
You can also search for this author in PubMed Google Scholar
Zhiyue Hu
View author publications
You can also search for this author in PubMed Google Scholar
Haiyan Huang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Haiyan Huang.

Additional information

Author summary: In this paper, we introduce a novel biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong partial correlations (i.e., conditional dependencies between EFs after removing the influences of other EFs in the same set) under certain subsets of samples. We extend an existing method for inferring such conditional dependencies, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning and subsampling of the gene expression data set. We incorporate a binary vector such that it will assist the objective function on deciding exclusion or inclusion of a particular sample to the bicluster.We test our algorithm on both simulation and real datasets, and achieve promising results. In addition, our algorithm is shown to be relatively robust to initialization and small perturbation in hyper-parameters. The algorithm is available at https://github.com/ pimentel/scca-bc.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Pimentel, H., Hu, Z. & Huang, H. Biclustering by sparse canonical correlation analysis. Quant Biol 6, 56–67 (2018). https://doi.org/10.1007/s40484-017-0127-0

Download citation

Received: 13 July 2017
Revised: 20 September 2017
Accepted: 20 September 2017
Published: 09 February 2018
Issue Date: March 2018
DOI: https://doi.org/10.1007/s40484-017-0127-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Biclustering by sparse canonical correlation analysis