Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification

Abstract

In functional genomics, an important application of microarray data is to classify samples according to their gene expression profiles, for example to distinguish cancer from normal samples or to separate different types or subtypes of cancer. One of the major tasks with gene expression data is therefore to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. This chapter presents a supervised gene clustering algorithm that finds such groups by incorporating the information of sample categories directly into the gene clustering process. A new quantitative measure, based on mutual information, is reported that uses the sample categories when measuring the similarity between attributes, and the algorithm clusters genes according to this measure. The performance of the new algorithm is compared with that of existing supervised and unsupervised gene clustering and gene selection algorithms on several cancer and arthritis microarray data sets, using the class separability index and the predictive accuracy of the naive Bayes classifier, the k-nearest neighbor rule, and the support vector machine. The biological significance of the generated clusters is interpreted using the gene ontology.
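As a rough illustration of the quantity the abstract builds on, the following sketch computes the standard mutual information between a discretized gene and the sample categories. It is not the chapter's supervised similarity measure, which the abstract does not define. The function name `mutual_information`, the toy genes, and the labels are all assumptions made up for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Return I(X; Y) in bits for two equal-length sequences of discrete values."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("sequences must be non-empty")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data (invented for illustration): discretized expression levels of two
# hypothetical genes across eight samples, plus the class label of each sample.
labels = ["tumor"] * 4 + ["normal"] * 4
gene_a = ["high"] * 4 + ["low"] * 4          # follows the class exactly
gene_b = ["high", "low"] * 4                 # unrelated to the class

relevance_a = mutual_information(gene_a, labels)
relevance_b = mutual_information(gene_b, labels)
print(relevance_a, relevance_b)
```

On this toy data the gene that follows the class labels exactly scores 1 bit, while the gene that is independent of the labels scores 0. A supervised clustering method can use scores of this kind to prefer genes, or groups of genes, that carry information about the sample categories.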

Author information

Corresponding author

Correspondence to Pradipta Maji.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Maji, P., Paul, S. (2014). Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification. In: Scalable Pattern Recognition Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-05630-2_9

  • DOI: https://doi.org/10.1007/978-3-319-05630-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05629-6

  • Online ISBN: 978-3-319-05630-2

  • eBook Packages: Computer Science, Computer Science (R0)
