Abstract
Over the last several years, many clustering algorithms have been applied to gene expression data. However, most of them force the user into a single set of clusters, which restricts the biological interpretation of gene function. Complex biological regulatory mechanisms and genetic interactions are difficult to interpret from such a restricted view of microarray expression data. The software package SignatureClust allows users to select a group of functionally related genes (called ‘Landmark Genes’) and to project the gene expression data onto these genes. Compared with existing algorithms and software in this domain, the package offers two unique benefits. First, by selecting different sets of landmark genes, the user can cluster the microarray data from multiple biological perspectives, which encourages data exploration and the discovery of new gene associations. Second, whereas most clustering packages provide internal validation measures, SignatureClust validates the biological significance of the new clusters by retrieving the significant ontology and pathway terms associated with them. SignatureClust is a free software tool that gives biologists multiple views of the microarray data and highlights new gene associations that were not found using a traditional clustering algorithm. The package and its user manual can be downloaded from http://infos.korea.ac.kr/sigclust.php.
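The idea of projecting expression data onto landmark genes can be illustrated with a minimal sketch. The abstract does not specify how the projection is computed, so the code below assumes that each gene is represented by its Pearson correlations with the chosen landmark genes and that the resulting signatures are clustered with a plain k-means. The function names `correlation_signatures` and `kmeans`, the toy data, and the landmark indices are all hypothetical and are not taken from the SignatureClust package.

```python
import numpy as np


def correlation_signatures(expr, landmark_idx):
    """Assumed projection: Pearson correlation of every gene with each landmark.

    expr: (n_genes, n_samples) expression matrix.
    landmark_idx: row indices of the user-selected landmark genes.
    Returns an (n_genes, n_landmarks) signature matrix.
    """
    # Center each profile and scale it to unit norm, so that a dot
    # product between two rows equals their Pearson correlation.
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant profiles get an all-zero signature
    z = centered / norms
    return z @ z[landmark_idx].T


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means, used here only as a stand-in clustering step."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


# Toy data: 20 genes measured in 8 samples (random, for illustration only).
expr = np.random.default_rng(1).normal(size=(20, 8))

# Two hypothetical landmark sets give two views of the same data.
sig_a = correlation_signatures(expr, [0, 1, 2])
sig_b = correlation_signatures(expr, [5, 9])
labels_a = kmeans(sig_a, k=3)
labels_b = kmeans(sig_b, k=3)
```

Changing the landmark set changes the space in which genes are compared, so `labels_a` and `labels_b` can group the same genes differently. This is the sense in which different landmark sets give multiple biological perspectives on one data set.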
Abbreviations
- GO: Gene Ontology
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- PFAM: Protein Families
Acknowledgments
This work was supported by the Second Brain Korea 21 Project Grant, a Microsoft Research Asia Grant, a Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2008-331-D00481), a National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (2010-0015713, 2010-0027793, 2010-0027592), and a National IT Industry Promotion Agency (NIPA) grant funded by the Korean government (ITAC1810100200160001000100100).
Additional information
P. Chopra and H. Shin contributed equally to this work.
Cite this article
Chopra, P., Shin, H., Kang, J. et al. SignatureClust: a tool for landmark gene-guided clustering. Soft Comput 16, 411–418 (2012). https://doi.org/10.1007/s00500-011-0725-0
DOI: https://doi.org/10.1007/s00500-011-0725-0