Abstract
High-throughput techniques are producing large-scale, high-dimensional genome-wide gene expression data (e.g., 4D data with genes vs. timepoints vs. conditions vs. tissues). This creates an increasing demand for effective methods that partition the data into biologically relevant groups. Current clustering and co-clustering approaches have limitations: they can be very time consuming, and they work only for low-dimensional expression datasets. In this work, we introduce a new notion of “co-identification”, which allows systematic identification of genes that participate in different functional groups under different conditions or at different developmental stages. The key contribution of our work is a unified computational framework for co-identification that enables clustering to be high-dimensional and adaptive. Our framework is based on a generic optimization model and a general optimization method termed Maximum Block Improvement. Test results on yeast and Arabidopsis expression data are presented to demonstrate the efficiency and effectiveness of our approach.
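The abstract does not spell out the optimization model, so the following is only a minimal sketch of how a Maximum Block Improvement style update could be applied to a high-dimensional expression tensor. It assumes a sum-of-squared-residuals objective with one label vector per mode (genes, timepoints, conditions, tissues) and a core tensor of block means; the function names `objective`, `best_core`, `best_labels`, and `mbi_cocluster` are hypothetical and are not taken from the paper. In each iteration every block (the core tensor or the labels of one mode) is optimized with the others held fixed, and only the block that yields the largest improvement is actually updated.

```python
import numpy as np

def objective(A, labels, C):
    # Squared error between the data and its block-constant approximation.
    return float(((A - C[np.ix_(*labels)]) ** 2).sum())

def best_core(A, labels, k, C_old):
    # Optimal core for fixed labels: the mean of each block.
    # Empty blocks keep their old value (they do not affect the objective).
    sums, counts = np.zeros(k), np.zeros(k)
    idx = np.ix_(*labels)
    np.add.at(sums, idx, A)
    np.add.at(counts, idx, 1.0)
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), C_old)

def best_labels(A, labels, C, mode):
    # Optimal labels along one mode for a fixed core and fixed other labels.
    others = [lab for m, lab in enumerate(labels) if m != mode]
    A_m = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
    C_m = np.moveaxis(C, mode, 0)
    # Expand the core over the other modes: one candidate slice per cluster.
    C_exp = C_m[(slice(None),) + np.ix_(*others)].reshape(C_m.shape[0], -1)
    cost = ((A_m[:, None, :] - C_exp[None, :, :]) ** 2).sum(axis=-1)
    return cost.argmin(axis=1)

def mbi_cocluster(A, k, max_iter=100, tol=1e-9, seed=0):
    # A: d-dimensional array; k: number of clusters per mode (assumed given).
    rng = np.random.default_rng(seed)
    labels = [rng.integers(0, kj, size=n) for kj, n in zip(k, A.shape)]
    C = best_core(A, labels, k, np.zeros(k))
    f = objective(A, labels, C)
    history = [f]
    for _ in range(max_iter):
        # Candidate update for every block, each with the others held fixed.
        C_new = best_core(A, labels, k, C)
        candidates = [(objective(A, labels, C_new), "core", C_new)]
        for m in range(A.ndim):
            trial = list(labels)
            trial[m] = best_labels(A, labels, C, m)
            candidates.append((objective(A, trial, C), m, trial[m]))
        # Maximum block improvement: apply only the best single-block update.
        f_best, which, value = min(candidates, key=lambda t: t[0])
        if f - f_best <= tol:
            break
        if which == "core":
            C = value
        else:
            labels[which] = value
        f = f_best
        history.append(f)
    return labels, C, history
```

For a 4D array `A` with axes (genes, timepoints, conditions, tissues), a call such as `mbi_cocluster(A, (5, 3, 2, 2))` would return one label vector per axis, the core tensor of block means, and the objective value after each accepted update. Like other block-update schemes, this sketch converges to a local optimum that depends on the random initialization, and the pairwise cost array in `best_labels` is kept simple rather than memory-efficient.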
Keywords
- Gene Expression Data
- Gene Expression Dataset
- Classical Cluster
- Generalize Maximum Entropy
- Yeast Gene Expression
These keywords were added by machine and not by the authors.
This research is supported by grants from NIH NCRR (5P20RR016460-11) and NIGMS (8P20GM103429-11).
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, S., Wang, K., Ashby, C., Chen, B., Huang, X. (2012). A Unified Adaptive Co-identification Framework for High-D Expression Data. In: Shibuya, T., Kashima, H., Sese, J., Ahmad, S. (eds) Pattern Recognition in Bioinformatics. PRIB 2012. Lecture Notes in Computer Science, vol 7632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34123-6_6
DOI: https://doi.org/10.1007/978-3-642-34123-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34122-9
Online ISBN: 978-3-642-34123-6
eBook Packages: Computer Science, Computer Science (R0)
Published in cooperation with
http://www.iapr.org/
