Biclustering by sparse canonical correlation analysis

Abstract

Background

Developing appropriate computational tools to distill biological insights from large-scale gene expression data has been an important part of systems biology. Considering that gene relationships may change or only exist in a subset of collected samples, biclustering that involves clustering both genes and samples has become increasingly important, especially when the samples are pooled from a wide range of experimental conditions.

Methods

In this paper, we introduce a new biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong “group interactions” under certain subsets of samples. Group interactions are defined by strong partial correlations, or equivalently, conditional dependencies between EFs after removing the influences of a set of other functionally related EFs. Our new biclustering method, named SCCA-BC, extends an existing method for group interaction inference, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning of the gene expression data set.

Results

SCCA-BC gives sensible results on real data sets and outperforms most existing methods in simulations. Software is available at https://github.com/pimentel/scca-bc.

Conclusions

SCCA-BC seems to work in numerous conditions and the results seem promising for future extensions. SCCA-BC has the ability to find different types of bicluster patterns, and it is especially advantageous in identifying a bicluster whose elements share the same progressive and multivariate normal distribution with a dense covariance matrix.

References

  1. 1.

    Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. 2.

    Lazzeroni, L. and Owen, A. (2002) Plaid models for gene expression data. Stat. Sin., 12, 61–86

    Google Scholar 

  3. 3.

    Kluger, Y., Basri, R., Chang, J. T. and Gerstein, M. (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res., 13, 703–716

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4.

    Bhattacharya, A. and De, R. K. (2009) Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics, 25, 2795–2801

    CAS  Article  PubMed  Google Scholar 

  5. 5.

    Nepomuceno, J. A., Troncoso, A. and Aguilar-Ruiz, J. S. (2011) Biclustering of gene expression data by correlation-based scatter search. BioData Min., 4, 3

    Article  PubMed  PubMed Central  Google Scholar 

  6. 6.

    Gao, Q., Ho, C., Jia, Y., Li, J. J. and Huang, H. (2012) Biclustering of linear patterns in gene expression data. J. Comput. Biol., 19, 619–631

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Ben-Dor, A., Chor, B., Karp, R. and Yakhini, Z. (2003) Discovering local structure in gene expression data: the orderpreserving submatrix problem. J. Comput. Biol., 10, 373–384

    CAS  Article  PubMed  Google Scholar 

  8. 8.

    Liu, J. and Wang, W. (2003) Op-cluster: Clustering by tendency in high dimensional space. In Third IEEE International Conference on Data Mining, 2003. ICDM 2003, pp. 187–194 IEEE

    Google Scholar 

  9. 9.

    Wang, Y. X. R., Jiang, K., Feldman, L. J., Bickel, P. J., and Huang, H. (2015) Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis. Ann. Appl. Stat. 9, 300–323

    Article  Google Scholar 

  10. 10.

    Tan, K. M. and Witten, D. M. (2014) Sparse biclustering of transposable data. J. Comput. Graph. Stat., 23, 985–1008

    Article  PubMed  PubMed Central  Google Scholar 

  11. 11.

    Turner, H., Bailey, T. and Krzanowski, W. (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Comput. Stat. Data Anal., 48, 235–254

    Article  Google Scholar 

  12. 12.

    Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010) Biclustering via sparse singular value decomposition. Biometrics, 66, 1087–1095

    Article  PubMed  Google Scholar 

  13. 13.

    Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bhlmann, P., Gruissem, W., Hennig, L., Thiele, L. and Zitzler, E. (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22, 1122–1129

    Article  PubMed  Google Scholar 

  14. 14.

    Higham, N. J. (2002) Computing the nearest correlation matrix–a problem from finance. IMA J. Numer. Anal., 22, 329–343

    Article  Google Scholar 

  15. 15.

    St Johnston, D. (2002) The art and design of genetic screens: Drosophila melanogaster. Nat. Rev. Genet., 3, 176–188

    Article  Google Scholar 

  16. 16.

    Jorgensen, E. M. and Mango, S. E. (2002) The art and design of genetic screens: Caenorhabditis elegans. Nat. Rev.Genet., 3, 356–369

    CAS  Article  PubMed  Google Scholar 

  17. 17.

    Brown, J. B., Boley, N., Eisman, R., May, G. E., Stoiber, M. H., Duff, M. O., Booth, B. W., Wen, J., Park, S., Suzuki, A. M., et al. (2014) Diversity and dynamics of the Drosophila transcriptome. Nature, 512, 393–399

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  18. 18.

    Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321–377

    Article  Google Scholar 

  19. 19.

    Lee, W., Lee, D., Lee, Y. and Pawitan, Y. (2011) Sparse canonical covariance analysis for high-throughput data. Stat. Appl. Genet. Mol. Biol., 10

  20. 20.

    Tibshirani, R., Walther, G. and Hastie, T. (2001) Estimating the number of clusters in a dataset via the gap statistic. J. R. Stat. Soc. B, 63, 411–423

    Article  Google Scholar 

  21. 21.

    Anderson, T. W. (1958) An Introduction to Multivariate Statistical Analysis. New York: Wiley

    Google Scholar 

  22. 22.

    Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 267–288

    Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Haiyan Huang.

Additional information

Author summary: In this paper, we introduce a novel biclustering algorithm to find subsets of genomic expression features (EFs) (e.g., genes, isoforms, exon inclusion) that show strong partial correlations (i.e., conditional dependencies between EFs after removing the influences of other EFs in the same set) under certain subsets of samples. We extend an existing method for inferring such conditional dependencies, which is based on sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning and subsampling of the gene expression data set. We incorporate a binary vector such that it will assist the objective function on deciding exclusion or inclusion of a particular sample to the bicluster.We test our algorithm on both simulation and real datasets, and achieve promising results. In addition, our algorithm is shown to be relatively robust to initialization and small perturbation in hyper-parameters. The algorithm is available at https://github.com/ pimentel/scca-bc.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Pimentel, H., Hu, Z. & Huang, H. Biclustering by sparse canonical correlation analysis. Quant Biol 6, 56–67 (2018). https://doi.org/10.1007/s40484-017-0127-0

Download citation

Keywords

  • biclustering
  • SCCA
  • gene clusters