Knowledge and Information Systems, Volume 29, Issue 1, pp 81–101

Measuring gene similarity by means of the classification distance

  • Elena Baralis
  • Giulia Bruno
  • Alessandro Fiori
Regular Paper


Microarray technology provides a simple way to collect huge amounts of data on the expression levels of thousands of genes. Detecting similarities among genes is a fundamental task, both to discover previously unknown gene functions and to focus the analysis on a limited set of genes rather than on thousands of them. Similarity between genes is usually evaluated by analyzing their expression values. However, when additional information is available (e.g., clinical information), it may be beneficial to exploit it. In this paper, we present a new similarity measure for genes, based on their classification power, i.e., on their capability to separate samples belonging to different classes. Our method exploits a new gene representation that measures the classification power of each gene and defines the classification distance as the distance between gene classification powers. The classification distance measure has been integrated into a hierarchical clustering algorithm, but it may also be adopted by other clustering algorithms. The results of experiments run on different microarray datasets support the intuition behind the proposed approach.
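The abstract does not say how classification power is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that a gene's classification power can be represented as a binary vector over samples (1 when the sample's expression value falls inside its own class range and outside every other class range), that the distance between two genes is the normalized Hamming distance between these vectors, and that a naive average-linkage procedure stands in for the hierarchical clustering algorithm. The names `gene_mask`, `classification_distance`, and `average_linkage`, and the toy data, are all hypothetical.

```python
from itertools import combinations


def gene_mask(values, labels):
    """Assumed classification-power vector for one gene.

    Bit i is 1 when sample i lies outside the expression range
    (min..max) of every class other than its own, else 0.
    """
    ranges = {}
    for v, c in zip(values, labels):
        lo, hi = ranges.get(c, (v, v))
        ranges[c] = (min(lo, v), max(hi, v))
    mask = []
    for v, c in zip(values, labels):
        ambiguous = any(lo <= v <= hi
                        for k, (lo, hi) in ranges.items() if k != c)
        mask.append(0 if ambiguous else 1)
    return mask


def classification_distance(mask_a, mask_b):
    """Normalized Hamming distance between two masks (assumed metric)."""
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)


def average_linkage(masks, n_clusters):
    """Naive agglomerative clustering of genes with average linkage."""
    clusters = [[i] for i in range(len(masks))]

    def link(c1, c2):
        total = sum(classification_distance(masks[i], masks[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy data: 4 genes measured on 3 tumor (T) and 3 normal (N) samples.
labels = ["T", "T", "T", "N", "N", "N"]
genes = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # separates the classes fully
    [5.5, 6.5, 7.5, 0.5, 1.5, 2.5],  # separates the classes fully
    [1.0, 2.0, 3.0, 2.5, 3.5, 4.5],  # class ranges overlap
    [1.2, 2.2, 3.2, 2.6, 3.6, 4.6],  # class ranges overlap
]
masks = [gene_mask(g, labels) for g in genes]
clusters = average_linkage(masks, n_clusters=2)
```

In this toy example, genes are grouped by which samples they separate rather than by how close their raw expression values are, which is the intuition the abstract describes.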


Similarity measure, Microarray, Clustering, Data mining





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, Italy
