Identification of certain cancer-mediating genes using Gaussian fuzzy cluster validity index

Journal of Biosciences

Abstract

In this article, we use an index called the Gaussian fuzzy index (GFI), recently developed by the authors and based on fuzzy set theory, to validate the clusters obtained by a clustering algorithm applied to cancer gene expression data. GFI is then used to identify genes whose mRNA expression patterns have altered significantly from the normal state to the carcinogenic state. The effectiveness of the methodology is demonstrated on three gene expression cancer datasets dealing with human lung cancer, colon cancer and leukemia. The performance of GFI is compared with that of 19 existing cluster validity indices. The results are validated biologically and statistically using biochemical pathways, p-value statistics of GO attributes, the t-test and the z-score. The results show that GFI is capable of identifying high-quality enriched clusters of genes, and is thereby able to select more cancer-mediating genes.


References

  • Akaike H 1979 A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika 66 237–242

  • Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D and Levine AJ 1999 Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96 6745–6750

  • Bandler W and Kohout LJ 1980 Fuzzy power sets and fuzzy implication operators. Fuzzy Sets Syst. 4 13–30

  • Beer GD et al. 2002 Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 8 816–823

  • Bensaid AM, Hall LO, Bezdek J, Clarke LP, Silbiger ML, Arrington JA and Murtagh RF 1996 Validity-guided (re) clustering with applications to image segmentation. IEEE Trans. Fuzzy Syst. 4 112–123

  • Bezdek JC 1974 On clustering validation techniques. J. Cybernet. 17 58–73

  • Bezdek J 1981 Pattern recognition with fuzzy objective function algorithms (New York: Plenum Press)

  • Dave RN 1996 Validating fuzzy partition obtained through c-shells clustering. Pattern Recogn. Lett. 17 613–623

  • Davies DL and Bouldin DW 1979 A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1 224–227

  • Deborah LJ, Baskaran R and Kannan A 2010 A survey on internal validity measure for cluster validation. IJCSES. 1 85–102

  • Dubes RC and Jain AK 1988 Algorithms for clustering data (Prentice Hall)

  • Dunn JC 1974 Well separated clusters and optimal fuzzy partitions. J. Cybern. 4 95–104

  • Fukuyama Y and Sugeno M 1989 A new method of choosing the number of clusters for the fuzzy c-means method; in Proceeding of Fifth Fuzzy Syst. Symp. pp 247–250

  • Gath I and Geva AB 1989 Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 11 773–781

  • Ghosh A and De RK 2013 Gaussian Fuzzy Index (GFI) for cluster validation: identification of high quality biologically enriched clusters of genes and selection of some possible genes mediating lung cancer; in Proceedings of the 5th International Conference on Pattern Recognition and Machine Intelligence (PReMI 2013), Kolkata, India, LNCS 8251 (eds) P Maji, A Ghosh, MN Murty, K Ghosh and SK Pal, pp 680–687

  • Ghosh A, Dhara BC and De RK 2013 Comparative analysis of cluster validity indices in identifying some possible genes mediating certain cancers. Mol. Inf. 32 347–354

  • Gibbons FD and Roth FP 2002 Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res. 12 1574–1581

  • Goodman L and Kruskal W 1954 Measures of associations for cross-validations. J. Am. Stat. Assoc. 49 732–764

  • Gutierrez NC, Ocio EM, de las Rivas J, Maiso P, Delgado M, Ferminan E, Arcos MJ, Sanchez ML, et al. 2007 Gene expression profiling of B lymphocytes and plasma cells from Waldenström's macroglobulinemia: comparison with expression patterns of the same cell counterparts from chronic lymphocytic leukemia, multiple myeloma and normal individuals. Leukemia 21 541–549

  • Hubert L and Schultz J 1976 Quadratic assignment as a general data-analysis strategy. Br. J. Math. Stat. Psychol. 29 190–241

  • Pakhira M, Bandyopadhyay S and Maulik U 2005 A study of some fuzzy cluster validity indices, genetic clustering and application to pixel classification. Fuzzy Sets Syst. 155 191–214

  • Pauwels EJ and Frederix G 1999 Finding salient regions in images: nonparametric clustering for image segmentation and grouping. Comput. Vis. Image Underst. 75 73–85

  • Rousseeuw PJ 1987 A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 53–65

  • Trauwaert E 1988 On the meaning of Dunn’s partition coefficient for fuzzy clusters. Fuzzy Sets Syst. 25 217–242

  • Tripathy BC, Sen M and Nath S 2012 I-convergence in probabilistic n-normed space. Soft. Comput. 16 1021–1027

  • Wu K and Yang M 2005 A cluster validity index for fuzzy clustering. Pattern Recogn. Lett. 26 1275–1291

  • Xie XL and Beni GA 1991 Validity measure for fuzzy clustering. IEEE Trans. PAMI. 3 841–846

  • Yun XU and Brereton GR 2005 A comparative study of cluster validation indices applied to genotyping data. Chemom. Intell. Lab. Syst. 78 30–40

  • Zadeh LA 1965 Fuzzy sets. Inf. Control. 8 338–353

  • Zadeh LA 1972 A fuzzy-set-theoretic interpretation of linguistic hedges. J. Cybern. 2 4–34

  • Zadeh LA 1997 Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst. 90 111–127

Corresponding author

Correspondence to Anupam Ghosh.

Additional information

[Ghosh A and De RK 2015 Identification of certain cancer-mediating genes using Gaussian fuzzy cluster validity index. J. Biosci.] DOI 10.1007/s12038-015-9557-x

Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/oct2015/supp/Ghosh.pdf

Appendix

A. Methodology

Although GFI has already been developed in Ghosh and De (2013), we describe it again here for the convenience of the reader, along with the methodology for identifying disease-mediating genes. This part therefore repeats the methodology section of Ghosh and De (2013). Let us consider a set of samples U = {x_k | k = 1, 2, …, n} that are distributed in l clusters C_1, C_2, …, C_l. These clusters have been obtained by a clustering algorithm.

A.1 Gaussian Fuzzy Index (GFI) for cluster validation

We now define a cluster validity index, called the Gaussian Fuzzy Index (GFI), that measures the goodness of the results obtained by a clustering algorithm. GFI is defined as

$$ GFI=\frac{E^{\prime }}{1+E} $$
(1)

where E′ is given by

$$ E^{\prime} = \frac{2}{l\left(l-1\right)} \sum_{\substack{k,j=1 \\ k \ne j}}^{l} \mu_k\left(\boldsymbol{c}_j\right) $$
(2)

and E is defined by

$$ E = \frac{1}{l} \sum_{k=1}^{l} \frac{1}{\left|C_k\right|} \sum_{\boldsymbol{x}_p \in C_k} \mu_k\left(\boldsymbol{x}_p\right) $$
(3)

The term μ_k(c_j) represents the membership value indicating the degree to which the center c_j of the j-th cluster C_j belongs to the k-th cluster C_k, and l is the number of resulting clusters. The membership function considered here is of Gaussian type, and is defined as

$$ \mu_k\left(\boldsymbol{c}_j\right) = \exp\left(-\frac{\lVert \boldsymbol{c}_j - \boldsymbol{c}_k \rVert^2}{L^2}\right) $$
(4)

Here c_k and c_j are the k-th and j-th cluster centers, respectively. The term L is the maximum distance between two objects in the set U (i.e., the set of all data objects). Thus L is given by

$$ L = \max_{\substack{\boldsymbol{x}_p,\, \boldsymbol{x}_{p'} \in U \\ p \ne p'}} \lVert \boldsymbol{x}_p - \boldsymbol{x}_{p'} \rVert $$
(5)

Note that the elements are chosen from a normed linear space. Similarly, μ_k(x_p), the membership value of the p-th sample x_p in the k-th cluster C_k, is defined as

$$ \mu_k\left(\boldsymbol{x}_p\right) = \begin{cases} \exp\left(-\dfrac{\lVert \boldsymbol{x}_p - \boldsymbol{c}_k \rVert^2}{\sigma_k^2}\right), & \text{where } \boldsymbol{x}_p \in C_k \\ 0, & \text{otherwise} \end{cases} $$
(6)

The term σ_k is the diameter of the k-th cluster C_k, and is defined as

$$ \sigma_k = \max_{\boldsymbol{x}_p,\, \boldsymbol{x}_{p'} \in C_k} \lVert \boldsymbol{x}_p - \boldsymbol{x}_{p'} \rVert $$
(7)

We say that a set of clusters is good if the inter-cluster distances are large and the intra-cluster distances are small. Here, E (in equation 3) represents the average fuzzy intra-cluster distance over all the clusters. The value of E lies in [0, 1]. E = 0 represents the highest average fuzzy intra-cluster distance over all the clusters, whereas E = 1 corresponds to the lowest. Since E can be zero, we have added 1 to the denominator of equation 1. Likewise, E′ (in equation 2) represents the average fuzzy distance among the cluster centers, i.e., the average fuzzy inter-cluster distance. Like E, E′ lies in [0, 1]. E′ = 0 indicates the highest average fuzzy inter-cluster distance over all pairs of clusters, whereas E′ = 1 corresponds to the lowest. Both quantities are built from Gaussian memberships, so they decrease as the underlying distances grow: well-separated clusters give a small E′, and compact clusters give a large E. Thus, a set of clusters is said to be good if the value of GFI is minimum. In other words, the lower the value of GFI, the better the set of clusters obtained by an algorithm.
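Equations 1 to 7 can be illustrated with a minimal sketch in Python. It rests on several assumptions that are not stated in the text: cluster centers are taken to be centroids, distances are Euclidean, the sum in equation 2 runs over unordered pairs of clusters (so that the factor 2/(l(l − 1)) yields an average in [0, 1]), and a cluster whose diameter is zero contributes a mean membership of 1. The function names `centroid`, `diameter` and `gfi` are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def diameter(points):
    """Largest pairwise Euclidean distance (0.0 for fewer than two points)."""
    return max(
        (math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        default=0.0,
    )


def gfi(clusters):
    """GFI = E' / (1 + E) for a list of clusters, each a list of vectors."""
    l = len(clusters)
    if l < 2:
        raise ValueError("GFI needs at least two clusters")
    centers = [centroid(c) for c in clusters]  # assumption: centers are centroids
    big_l = diameter([p for c in clusters for p in c])  # Eq. (5)
    if big_l == 0.0:
        raise ValueError("all samples coincide, so L is zero")

    # Eq. (2) with memberships from Eq. (4); assumption: unordered pairs k < j.
    e_prime = 0.0
    for k in range(l):
        for j in range(k + 1, l):
            e_prime += math.exp(-math.dist(centers[j], centers[k]) ** 2 / big_l ** 2)
    e_prime *= 2.0 / (l * (l - 1))

    # Eq. (3) with memberships from Eq. (6) and diameters from Eq. (7).
    e = 0.0
    for cluster, center in zip(clusters, centers):
        sigma = diameter(cluster)
        if sigma == 0.0:
            e += 1.0  # assumption: every point sits on the center
        else:
            e += sum(
                math.exp(-math.dist(p, center) ** 2 / sigma ** 2) for p in cluster
            ) / len(cluster)
    e /= l

    return e_prime / (1.0 + e)


# Toy data: the same four points partitioned well and badly.
good = [[[0, 0], [0, 1]], [[10, 10], [10, 11]]]
bad = [[[0, 0], [10, 10]], [[0, 1], [10, 11]]]
print(gfi(good) < gfi(bad))
```

On this toy data the well-separated partition receives the lower GFI, which agrees with the rule that a lower GFI indicates a better set of clusters.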

A.2 Comparative study of cluster validity indices and selection of possible disease mediating genes

The performance of GFI is compared with that of 19 cluster validity indices. The comparison assesses how well each index identifies a set of good clusters and, thereby, selects possible disease-mediating genes. For this comparative study, we consider the following workflow.

Step I: Generation of clusters: A clustering algorithm C is applied to gene expression data with different numbers of clusters (k for k-means and c for fuzzy c-means) as its input. Here we have considered numbers of clusters ranging from 2 to 20. Note that the gene expression profiles for the normal and diseased states are considered separately, and the number of clusters generated for the diseased state is kept equal to that for the normal state.
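Step I can be sketched as follows, assuming plain Lloyd's k-means with Euclidean distance as the clustering algorithm C. The deterministic initialisation from evenly spaced rows and the names `kmeans` and `cluster_both_states` are illustrative assumptions; the text does not specify how the algorithm is initialised.

```python
import math


def kmeans(data, k, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster label per row of data."""
    n = len(data)
    # Assumption: deterministic start from evenly spaced rows of the data.
    centers = [list(data[i * n // k]) for i in range(k)]
    labels = [-1] * n
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centers[c]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_both_states(normal, diseased, k_values=range(2, 21)):
    """Cluster normal and diseased profiles separately with the same k."""
    return {k: (kmeans(normal, k), kmeans(diseased, k)) for k in k_values}
```

Each row of `normal` and `diseased` is taken to be the expression profile of one gene, so that the resulting clusters are sets of genes, as required by the later steps.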

Step II: Selection of the best k-value (or c-value) using a cluster validity index: Among these 19 k-values (or c-values), the best one is selected based on a cluster validity index. We thus obtain 19 best k-values (or c-values), one for each of the 19 cluster validity indices, for a clustering algorithm C. These best k-values (or c-values) are selected from the gene expression data of the normal state, and will be compared with the corresponding best k-values obtained in Steps III and IV.
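The selection in Step II amounts to taking an argmin or argmax over the candidate k-values, depending on the index (GFI, for instance, is minimised). A minimal sketch, in which the function name `best_k` and the index values are hypothetical:

```python
def best_k(index_values, lower_is_better):
    """Return the k whose validity-index value is optimal.

    index_values maps each candidate k to the value of one validity index.
    """
    pick = min if lower_is_better else max
    return pick(index_values, key=index_values.get)


# Hypothetical index values for three candidate k-values.
values = {2: 0.50, 3: 0.20, 4: 0.30}
print(best_k(values, lower_is_better=True))
```

If several k-values tie, this sketch returns the first one in the dictionary's order; the text does not say how ties are resolved.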

Step III: For each k-value (or c-value) and for the clustering algorithm C, the following steps are performed. The k-values considered here are the same as in Step I, i.e., k = 2, 3, …, 20.

Step III.1: Determining corresponding clusters: The clusters obtained in Step I using the clustering algorithm C for a given k-value (or c-value) need to be matched between the normal and diseased states. Let C_Ni and C_Dj be the i-th and j-th clusters obtained by the clustering algorithm C for a k-value (or c-value), for the normal and diseased states respectively. We say that the cluster C_Ni of the normal state corresponds to the cluster C_Dj of the diseased state if |C_Ni ∩ C_Dj| is maximum over j = 1, 2, …, k. Without loss of generality, we renumber the cluster C_Dj as C_Di so that C_Ni corresponds to C_Di.
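The matching rule of Step III.1 can be sketched with Python sets of gene identifiers. The function name `match_clusters` and the gene labels are hypothetical. The sketch applies the maximum-overlap rule to each normal cluster independently, so two normal clusters may select the same diseased cluster; the text does not describe how such conflicts or ties are resolved.

```python
def match_clusters(normal, diseased):
    """For each normal cluster, return the index of the diseased cluster
    that shares the most genes with it."""
    return [
        max(range(len(diseased)), key=lambda j: len(cluster & diseased[j]))
        for cluster in normal
    ]


# Hypothetical gene identifiers.
normal = [{"g1", "g2", "g3"}, {"g4", "g5"}]
diseased = [{"g3", "g4", "g5"}, {"g1", "g2"}]
mapping = match_clusters(normal, diseased)
# Renumber the diseased clusters so that position i matches normal cluster i.
renumbered = [diseased[j] for j in mapping]
```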

Step III.2: Identifying altered gene clusters: For both the normal and diseased states, we get k clusters, i.e., C_N1, C_N2, …, C_Nk for the normal state and the corresponding clusters C_D1, C_D2, …, C_Dk for the diseased state. The clusters of the normal state are compared with those of the diseased state to identify the altered gene sets. We call a gene an altered gene if it is in C_Ni and C_Dj with i ≠ j. Thus, the altered gene set for C_Ni can be written as A_i = ∪_{j = 1, j ≠ i}^{k} (C_Ni ∩ C_Dj). In this way, altered gene sets or altered clusters (i.e., A_1, A_2, …, A_k) are generated from the k normal clusters.
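The altered gene sets of Step III.2 can be computed directly from the definition. The sketch below assumes that the diseased clusters have already been renumbered to correspond to the normal clusters; the function name `altered_gene_sets` and the gene labels are hypothetical.

```python
def altered_gene_sets(normal, diseased):
    """A_i is the union over j != i of (C_Ni & C_Dj), for matched cluster lists."""
    k = len(normal)
    return [
        set().union(*(normal[i] & diseased[j] for j in range(k) if j != i))
        for i in range(k)
    ]


# Hypothetical example: gene g3 moves from the first to the second cluster.
normal = [{"g1", "g2", "g3"}, {"g4", "g5"}]
diseased = [{"g1", "g2"}, {"g3", "g4", "g5"}]
print(altered_gene_sets(normal, diseased))
```

When the diseased clusters partition the same genes as the normal clusters, A_i is simply the set of genes of C_Ni that are not in C_Di.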

Step III.3: Scoring an altered gene set: In this step, we compare the altered gene sets with an existing pathway database. If a gene in an altered gene set A_i is also included in a cancer pathway, we call it a matched gene. Here, we generate a score (S) for the altered gene sets. Let the numbers of matched genes in the altered gene sets A_1, A_2, …, A_k be l_1, l_2, …, l_k respectively. The score S_k is then defined as

$$ S_k = \frac{1}{k} \sum_{i=1}^{k} \frac{l_i}{\left|A_i\right|} \times 100\% $$
(8)

The higher the value of S_k, the better the matching. In other words, if S_k is high for a given clustering algorithm and cluster validity index, the index is highly capable of identifying genes mediating a cancer, provided that clustering algorithm is used.
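Equation 8 can be sketched as below. The pathway database is represented simply as a set of gene identifiers, which is an assumption, as is the choice to let an empty altered set contribute zero (the equation is undefined when |A_i| = 0). The function name `pathway_score` and the gene labels are hypothetical.

```python
def pathway_score(altered_sets, pathway_genes):
    """S_k: mean percentage of matched genes over the k altered gene sets."""
    k = len(altered_sets)
    total = 0.0
    for altered in altered_sets:
        if altered:  # assumption: an empty altered set contributes 0
            total += len(altered & pathway_genes) / len(altered)
    return 100.0 * total / k


# Hypothetical example: 1 of 2 and 1 of 4 genes are matched.
altered = [{"g1", "g2"}, {"g3", "g4", "g5", "g6"}]
pathway = {"g1", "g3"}
print(pathway_score(altered, pathway))  # (0.5 + 0.25) / 2 * 100 = 37.5
```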

Step III.4: Enriched attributes of an altered gene set: In this step, we compute the enriched attributes of the altered gene sets using p-value statistics. Only functional categories with p-value ≤ 5 × 10^−5 have been considered. Here, we compute a count of enriched attributes (E) for the genes in an altered set. Let the numbers of enriched attributes for the matched genes in the altered gene sets A_1, A_2, …, A_k be e_1, e_2, …, e_k respectively. The count E_k is then defined as

$$ E_k = \sum_{i=1}^{k} e_i $$
(9)

The higher the value of E_k, the better the chance that the altered genes share common functions. Thus the genes together may be responsible for mediating a cancer.
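Given p-values that have already been computed for each functional category of each altered gene set, the count in equation 9 reduces to thresholding and summing. The sketch below assumes the p-values come from an external enrichment tool; the attribute labels and the function name `enriched_count` are hypothetical.

```python
P_VALUE_CUTOFF = 5e-5  # threshold stated in Step III.4


def enriched_count(p_values_per_set, cutoff=P_VALUE_CUTOFF):
    """E_k: total number of attributes with p-value <= cutoff over all sets.

    p_values_per_set holds one dict per altered gene set, mapping an
    attribute label to its p-value.
    """
    return sum(
        sum(1 for p in p_values.values() if p <= cutoff)
        for p_values in p_values_per_set
    )


# Hypothetical p-values for two altered gene sets.
p_values = [{"attr_a": 1e-6, "attr_b": 0.01}, {"attr_c": 5e-5}]
print(enriched_count(p_values))
```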

Step III.5: z-score: It is based on the mutual information between a clustering result and gene annotation data. The z-score indicates the relationship between clustering and annotation, relative to a clustering method that randomly assigns genes to clusters. A higher z-score indicates a clustering result that is further from a random one. In order to compare the performance of the clustering algorithms, this z-score is plotted for the clustering results as a function of the number of clusters, k, which also allows an optimal value of k to be determined.
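One way to realise such a z-score is sketched below. It is an illustration under stated assumptions rather than the exact procedure used in the study: the mutual information is summed over annotation attributes, each treated as a present/absent variable, and the random baseline is obtained by shuffling the cluster labels among the genes, which preserves cluster sizes. The function names, the number of random trials and the seed are hypothetical choices.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def total_mi(labels, annotations, attributes):
    """Sum of the mutual information between clustering and each attribute."""
    return sum(
        mutual_information(labels, [a in ann for ann in annotations])
        for a in attributes
    )


def z_score(labels, annotations, n_random=200, seed=0):
    """(observed MI - mean of random MI) / standard deviation of random MI."""
    attributes = sorted(set().union(*annotations))
    observed = total_mi(labels, annotations, attributes)
    rng = random.Random(seed)
    shuffled = list(labels)
    randoms = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # assumption: random clustering = shuffled labels
        randoms.append(total_mi(shuffled, annotations, attributes))
    mean = sum(randoms) / n_random
    std = math.sqrt(sum((r - mean) ** 2 for r in randoms) / n_random)
    if std == 0.0:
        raise ValueError("random clusterings show no variation")
    return (observed - mean) / std
```

Here `labels[g]` is the cluster of gene g and `annotations[g]` is the set of attributes annotated to gene g. A clustering that lines up with the annotations yields a large positive z-score.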

Step IV: Determining the best k-value (or c-value) and selection of some possible genes mediating certain cancers: Let the k-values (or c-values) for which S_k, E_k and the z-score are maximum be K_S, K_E and K_Z respectively. Thus K_S, K_E and K_Z are the best k-values (or c-values) with respect to the pathway database, the p-value statistics of the enriched attributes, and the z-score respectively. Let the best k-value (or c-value) obtained by a cluster validity index I be K_I. For example, the best k-value (or c-value) selected by the Dunn Index (DI) is denoted as K_DI. A cluster validity index performs the best if and only if |K_S − K_I| = 0, |K_E − K_I| = 0 and |K_Z − K_I| = 0. After selecting the best k-value (or c-value), the genes in the corresponding altered gene sets are selected as possible genes mediating certain cancers.

The best k-values (or c-values) obtained by different cluster validity indices (Step II) for a clustering algorithm are compared with those obtained in Step IV. We say that a cluster validity index I 1 is better than I 2 if

$$ \left|{K}_S-{K}_{I_1}\right|+\left|{K}_E-{K}_{I_1}\right|+\left|{K}_Z-{K}_{I_1}\right|<\left|{K}_S-{K}_{I_2}\right|+\left|{K}_E-{K}_{I_2}\right|+\left|{K}_Z-{K}_{I_2}\right| $$
(10)
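The comparison in equation 10 ranks the indices by the sum of the absolute deviations of their best k-value from K_S, K_E and K_Z. A minimal sketch, in which the function names, the index names and all numbers are hypothetical:

```python
def deviation(k_index, k_s, k_e, k_z):
    """One side of inequality (10) for a single cluster validity index."""
    return abs(k_s - k_index) + abs(k_e - k_index) + abs(k_z - k_index)


def rank_indices(best_k_by_index, k_s, k_e, k_z):
    """Index names ordered from best (smallest deviation) to worst."""
    return sorted(
        best_k_by_index,
        key=lambda name: deviation(best_k_by_index[name], k_s, k_e, k_z),
    )


# Hypothetical best k-values chosen by two indices.
best = {"index_a": 8, "index_b": 5}
print(rank_indices(best, k_s=5, k_e=5, k_z=6))
```

In this made-up example, index_b deviates by 0 + 0 + 1 = 1 and index_a by 3 + 3 + 2 = 8, so index_b is ranked first.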

The performance of GFI has been compared extensively with 19 indices (given in table 2).

Table 2 Various cluster validity indices and the underlying notion

Cite this article

Ghosh, A., De, R.K. Identification of certain cancer-mediating genes using Gaussian fuzzy cluster validity index. J Biosci 40, 741–754 (2015). https://doi.org/10.1007/s12038-015-9557-x

  • DOI: https://doi.org/10.1007/s12038-015-9557-x
