Abstract
In functional genomics, an important application of microarray data is the classification of samples according to their gene expression profiles, for example distinguishing cancer from normal samples or separating different types or subtypes of cancer. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. To this end, this chapter presents a supervised gene clustering algorithm that directly incorporates the information of sample categories into the gene clustering process. A new quantitative measure, based on mutual information, is reported that uses the information of sample categories to measure the similarity between attributes, and the algorithm groups genes according to this measure. The performance of the new algorithm is compared with that of existing supervised and unsupervised gene clustering and gene selection algorithms, using the class separability index and the predictive accuracy of the naive Bayes classifier, the k-nearest neighbor rule, and the support vector machine on several cancer and arthritis microarray data sets. The biological significance of the generated clusters is interpreted using the gene ontology.
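The abstract does not define the new similarity measure, so the following is only a minimal sketch of its stated building block: the standard Shannon mutual information between a discretized gene and the sample categories. The function name `mutual_information`, the toy labels, and the two-level discretization ("high"/"low") are all assumptions made for illustration and are not taken from the chapter.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Shannon mutual information I(X; Y), in bits, of two discrete sequences.

    Generic textbook estimate from empirical frequencies; this is NOT the
    supervised similarity measure proposed in the chapter.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: 8 samples, expression already discretized.
labels = ["tumor"] * 4 + ["normal"] * 4
gene_a = ["high"] * 4 + ["low"] * 4                 # tracks the class exactly
gene_b = ["high", "high", "low", "low"] * 2         # independent of the class

relevance_a = mutual_information(gene_a, labels)
relevance_b = mutual_information(gene_b, labels)
print(relevance_a, relevance_b)
```

In this toy setting a gene that tracks the class labels exactly scores one bit, while a gene distributed identically in both classes scores zero. Real expression values are continuous, so some discretization or density estimate would be needed before a measure like this can be applied.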
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Maji, P., Paul, S. (2014). Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification. In: Scalable Pattern Recognition Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-05630-2_9
DOI: https://doi.org/10.1007/978-3-319-05630-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05629-6
Online ISBN: 978-3-319-05630-2
eBook Packages: Computer Science, Computer Science (R0)