Biclustering by sparse canonical correlation analysis
Developing appropriate computational tools to distill biological insights from large-scale gene expression data has been an important part of systems biology. Because gene relationships may change across samples, or exist only in a subset of the collected samples, biclustering, which clusters genes and samples simultaneously, has become increasingly important, especially when the samples are pooled from a wide range of experimental conditions.
In this paper, we introduce a new biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong “group interactions” under certain subsets of samples. Group interactions are defined by strong partial correlations, or equivalently, conditional dependencies between EFs after removing the influences of a set of other functionally related EFs. Our new biclustering method, named SCCA-BC, extends an existing method for group interaction inference, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning of the gene expression data set.
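To make the notion of partial correlation concrete, the following is a minimal sketch, not the SCCA-BC procedure itself. It assumes a single conditioning variable `z` and uses the standard first-order partial correlation formula, whereas the method described above conditions on a set of EFs. All names (`pearson`, `partial_corr`, `x`, `y`, `w`, `z`, `shared`) and the simulated data are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated expression profiles (illustrative data, not from the paper).
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # common driver of x, y and w
shared = [rng.gauss(0, 1) for _ in range(n)]  # link between x and y that z does not explain
x = [zi + si + 0.5 * rng.gauss(0, 1) for zi, si in zip(z, shared)]
y = [zi + si + 0.5 * rng.gauss(0, 1) for zi, si in zip(z, shared)]
w = [zi + 0.5 * rng.gauss(0, 1) for zi in z]  # related to x only through z

marginal_xw = pearson(x, w)         # sizeable: x and w both follow z
partial_xw = partial_corr(x, w, z)  # close to zero once z is removed
partial_xy = partial_corr(x, y, z)  # stays strong: x and y remain dependent given z
```

In this toy setting, `x` and `w` are marginally correlated but conditionally independent given `z`, while `x` and `y` remain dependent after `z` is accounted for. Only the second pair would count as interacting under a partial-correlation definition.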
SCCA-BC gives sensible results on real data sets and outperforms most existing methods in simulations. Software is available at https://github.com/pimentel/scca-bc.
SCCA-BC appears to work well under a range of conditions, and the results are promising for future extensions. SCCA-BC can find different types of bicluster patterns, and it is especially advantageous in identifying a bicluster whose elements share the same progressive and multivariate normal distribution with a dense covariance matrix.
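As an illustration of the kind of bicluster referred to here, the sketch below plants a block of genes whose values, within a subset of samples, follow a multivariate normal distribution with a dense covariance matrix. It assumes an equicorrelated covariance (every pair of bicluster genes has correlation `rho`), which is only one simple dense choice and is not taken from the paper's simulation design. The function `plant_bicluster` and all of its parameters are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def plant_bicluster(n_genes, n_samples, bic_genes, bic_samples, rho, seed=0):
    """Return a genes-by-samples matrix of N(0, 1) noise with one planted bicluster.

    Inside the bicluster, each sample's gene vector is multivariate normal with
    unit variances and pairwise correlation rho (an assumed, equicorrelated
    dense covariance), generated through one shared latent factor per sample.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_genes)]
    for s in bic_samples:
        factor = rng.gauss(0, 1)  # latent factor shared by all bicluster genes in sample s
        for g in bic_genes:
            data[g][s] = math.sqrt(rho) * factor + math.sqrt(1 - rho) * rng.gauss(0, 1)
    return data


data = plant_bicluster(n_genes=50, n_samples=400,
                       bic_genes=range(10), bic_samples=range(200), rho=0.8)

# Two bicluster genes are strongly correlated within the bicluster samples only.
corr_inside = pearson(data[0][:200], data[1][:200])
corr_outside = pearson(data[0][200:], data[1][200:])
```

Because the dependence exists only within the first 200 samples, a correlation computed over all 400 samples is diluted, which is the situation that motivates searching over genes and samples jointly.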
Keywords: biclustering, SCCA, gene clusters