Abstract
We propose a framework for biclustering gene expression profiles. The framework applies the dominant set approach to generate sorting vectors, which are used to sort the rows of the data matrix so that coexpressed gene expression vectors are gathered together. By iteratively sorting and transposing the gene expression data matrix, we gather blocks of coexpressed subsets. A weighted correlation coefficient measures similarity at both the gene level and the condition level; its weights are updated at each iteration using the sorting vector from the previous iteration. As a result, the highly correlated bicluster is located at one corner of the rearranged gene expression data matrix. We applied our approach to synthetic data and three real gene expression data sets, with encouraging results. We also propose the average correlation value (ACV) to evaluate the homogeneity of a bicluster or a data matrix. This criterion conforms to the intuitive biological notion of a coexpressed set of genes or samples. Compared with the mean squared residue score, ACV is found to be more appropriate for both additive and multiplicative models.
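To make the comparison between ACV and the mean squared residue concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the ACV formula, so the code assumes ACV is the larger of two averages: the mean absolute pairwise Pearson correlation among rows and the same quantity among columns, with self-correlations excluded. The mean squared residue follows the standard Cheng and Church definition. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def mean_squared_residue(a):
    """Cheng-Church mean squared residue of a matrix (0 for a purely additive pattern)."""
    a = np.asarray(a, dtype=float)
    row_mean = a.mean(axis=1, keepdims=True)
    col_mean = a.mean(axis=0, keepdims=True)
    residue = a - row_mean - col_mean + a.mean()
    return float(np.mean(residue ** 2))


def _avg_abs_corr(a):
    """Mean absolute Pearson correlation over all distinct pairs of rows of `a`."""
    k = a.shape[0]
    r = np.corrcoef(a)  # each row is treated as one variable
    # Subtract the k self-correlations on the diagonal, divide by the k*k - k remaining pairs.
    # Note: a constant row has zero variance and would produce NaN here.
    return float((np.abs(r).sum() - k) / (k * k - k))


def acv(a):
    """Assumed ACV: the larger of the row-wise and column-wise average absolute correlations."""
    a = np.asarray(a, dtype=float)
    return max(_avg_abs_corr(a), _avg_abs_corr(a.T))


# Hypothetical toy biclusters built from one base profile.
base = np.array([1.0, 2.0, 4.0, 7.0])
additive = np.array([base + shift for shift in (0.0, 3.0, 5.0)])        # rows differ by an offset
multiplicative = np.array([base * scale for scale in (1.0, 2.0, 3.0)])  # rows differ by a factor

for name, m in (("additive", additive), ("multiplicative", multiplicative)):
    print(name, "ACV:", acv(m), "MSR:", mean_squared_residue(m))
```

Under these assumptions, both toy matrices receive an ACV of essentially 1, because every row is a perfect linear function of every other row. The mean squared residue is zero only for the additive matrix and is positive for the multiplicative one, which illustrates why a correlation-based score can treat both models as homogeneous.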
Cite this article
Teng, L., Chan, L. Discovering Biclusters by Iteratively Sorting with Weighted Correlation Coefficient in Gene Expression Data. J Sign Process Syst Sign Image 50, 267–280 (2008). https://doi.org/10.1007/s11265-007-0121-2
DOI: https://doi.org/10.1007/s11265-007-0121-2