Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures

Abstract

Clustering genes on the basis of their expression profiles is generally the first step in understanding how a class of genes behaves in a biological process. Many supervised and unsupervised algorithms are available in the statistics and machine learning literature for clustering microarray data, but they offer limited means of evaluating whether the resulting clusters are biologically meaningful. If two gene sequences are similar, we would expect their expression profiles to be similar and the genes to be similarly annotated in the Gene Ontology (GO) databases. Comparing the expression-level similarity of two genes against the similarity of their GO annotations can therefore test this expectation. Semantic similarity has become a valuable tool for validating the results of biomedical studies such as gene clustering and gene expression data analysis. This paper builds on our previous work on meta-ensembles using cancer datasets, in which the output of several clustering algorithms is fed to a consensus-building process to generate a stable set of cluster results. These cluster results are then refined through a sequence of biological validation steps applied to each gene pair of a given cluster, using semantic similarity and sequence similarity. We have tested our approach on several benchmark cancer datasets in an attempt to provide a more accurate biological analysis of the clusters, and the results have been found to be satisfactory.

References

  • Altschul SF et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410

  • Azuaje F, Bodenreider O (2004) Incorporating ontology-driven similarity knowledge into functional genomics: an exploratory study. In: Proceedings IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004). Taichung, Taiwan, 2004

  • Bhattacherjee V et al (2007) Neural crest and mesoderm lineage-dependent gene expression in orofacial development. Differentiation 75(5):463–477

  • Cheng J et al (2004) A knowledge-based clustering algorithm driven by gene ontology. J Biopharm Stat 14:687–700

  • Chenna R et al (2003) Multiple sequence alignment with clustal series of programs. Nucleic Acids Res 31(13):3497–3500

  • Chu S et al (1998) The transcriptional program of sporulation in budding yeast. Science 282(5389):699–705

  • Couto FM, Silva MJ, Coutinho P (2003) Implementation of a functional semantic similarity measure between gene-products. technical report. Univ. of Lisbon, Lisbon

  • Couto FM, Silva MJ, Coutinho P (2005) Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors. In: Proceedings of the ACM Conference in Information and Knowledge Management, 2005

  • Couto FM, Silva MJ, Coutinho P (2007) Measuring semantic similarity between gene ontology terms. Data Knowl Eng 61:137–152

  • Datta S, Datta S (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4):459–466

  • Datta S, Datta S (2006) Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinform 7:397

  • DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278(5338):680–686

  • Dopazo J, Carazo JM (1997) Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44(2):226–233

  • Dunn JC (1974) Well separated clusters and fuzzy partitions. J Cybern 4:95–104

  • Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. In: Proceedings Natl Acad Sci USA., 1998

  • Fraley C, Raftery AE (2001) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 17:126–136

  • Gentleman RC, Carey VJ, Bates DM (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80

  • Handl J, Knowles J, Kell DB (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15):3201–3212

  • Hartigan JA, Wong MA (1979) A K-means clustering algorithm. Appl Stat 28:100–108

  • Herrero J, Valencia A, Dopazo J (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2):126–136

  • Jiang J, Conrath D (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, 1997

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York

  • Kent W et al (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006

  • Kohonen T (1997) Self-organizing maps, 2nd edn. Springer-Verlag, Berlin

  • Lam TW et al (2008) Compressed indexing and local alignment of DNA. Bioinformatics 24(6):791–797

  • Larkin M et al (2007) Clustal W and clustal X version 2.0. Bioinformatics 23(21):2947–2948

  • Lee HK et al (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res 14:1085–1094

  • Li J, Liu H (2002) Kent ridge bio-medical dataset repository (Online). Available at: http://sdmc.lit.org.sg/GEDatasets/Datasets.html

  • Li J, Gong B, Chen X et al (2011) DOSim: an R package for similarity between diseases based on disease ontology. BMC Bioinform 12:266

  • Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, Morgan Kaufmann, 1998

  • Lord PW, Stevens RD, Brass A, Goble CA (2003a) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19:1275–1283

  • Lord PW, Stevens RD, Brass A, Goble CA (2003) Semantic similarity measures as tools for exploring the gene ontology. In: Proceedings of the 8th Pacific Symposium on Biocomputing. 2003

  • Nagi S, Bhattacharyya DK (2013) Classification of microarray cancer data using ensemble approach. Netw Model Anal Health 2(3):159–173

  • Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453

  • Newberg LA (2008) Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics 24(16):1772–1778

  • Othman R, Deris S, Illias R (2007) A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences. J Biomed Inf 23:529–538

  • Pekar V, Staab S (2002) Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. In: Proceedings of the 19th international conference on Computational linguistics. Morristown, NJ, USA, 2002

  • Rada R, Mili H, Bicknell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 1989

  • Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995

  • Resnik P (1999) Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11:95–130

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:302

  • Sevilla JL et al (2005) Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinf 2(4):330–337

  • Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197

  • Stuart JM, Segal E, Koller D, Kim SK (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255

  • Su AI et al (2002) Large-scale analysis of the human and mouse transcriptomes. In: Proceedings of the National Academy of Sciences, USA, 2002

  • R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org

  • The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res 11(8):1425–1433

  • van’t Veer LJ (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–536

  • Wang JZ et al (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics

  • Wang H, Azuaje F, Bodenreider O, Dopazo J (2004) Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In: Proceedings Computational Intelligence in Bioinformatics and Computational Biology.CA, USA, 2004

  • Wang H, Azuaje F, Bodenreider O (2005) An ontology-driven clustering method for supporting gene expression analysis, computer-based medical systems. In: Proceedings IEEE Symposium on Computer-based Medical Systems. 2005

  • Wu H et al (2005) Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res 33:2822–2837. Available at: http://www.view.ncbi.nlm.nih.gov/pubmed/15901854

  • Wu Z, Palmer MS (1994) Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994). 1994

  • Wu X et al (2006) Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations. Nucleic Acids Res 34:2137–2150

  • Yeung KY, Haynor DR, Ruzzo WL (2001) Validating clustering for gene expression data. Bioinformatics 17(4):309–318

  • Yu H, Gao L, Tu K, Guo Z (2005) Broadly predicting specific gene functions with expression similarity and taxonomy similarity. Gene 352:75–81

  • Yu G et al (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26(7):976–978

  • Zheng H, Azuaje F, Wang H (2010) seGOsa: software environment for Gene Ontology-driven similarity assessment. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM’10). 2010

Author information

Corresponding author

Correspondence to Dhruba K. Bhattacharyya.

Appendix

The results of the experiments involving semantic similarity, sequence similarity, internal measures, stability measures and biological measures on (a) the Breast Cancer Dataset, (b) the Lymphoma Dataset and (c) the Embryonal Tumours of the Central Nervous System (CNS) dataset are given here.

Appendix A: The pair-wise gene expression similarity matrix

The pair-wise gene expression similarity is calculated using Pearson correlation for (a) the Breast Cancer Dataset, (b) the Lymphoma Dataset and (c) the Embryonal Tumours of the Central Nervous System (CNS) dataset. The similarity matrix values for some of the genes are shown in Fig. 8, along with plots of the values. These expression similarity values are used subsequently for comparison with the semantic similarity values of the corresponding gene pairs in Appendix C.
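As an illustration of how such a matrix can be obtained, the following is a minimal sketch in plain Python, not the authors' implementation. The gene names and expression values are hypothetical, and treating a constant profile as having zero correlation is a convention assumed here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant profile; 0.0 is an assumed convention.
        return 0.0
    return cov / sqrt(var_x * var_y)

def similarity_matrix(profiles):
    """Pair-wise similarity matrix as a nested dict: matrix[g][h] = pearson(g, h)."""
    genes = list(profiles)
    return {g: {h: pearson(profiles[g], profiles[h]) for h in genes} for g in genes}

# Hypothetical toy profiles (one value per sample), for illustration only.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
matrix = similarity_matrix(profiles)
```

In this toy example geneA and geneB are perfectly correlated, while geneA and geneC are perfectly anti-correlated.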

Fig. 8 Pair-wise gene expression similarity matrix for (a) the Breast Cancer Dataset, (b) the Lymphoma Dataset and (c) the Embryonal Tumours of the Central Nervous System (CNS) dataset

Appendix B: The pair-wise semantic similarity matrix

The pair-wise semantic similarity matrices for the Lin, Jiang and Conrath, and Wang measures are calculated for the Breast Cancer Dataset. The semantic similarity values and plots for some of the genes are shown in Fig. 9.

Fig. 9 Pair-wise semantic similarity matrix using (a) Lin, (b) Jiang and Conrath and (c) Wang for the Breast Cancer Dataset
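To make the two information-content measures concrete, the sketch below computes the Lin similarity and the Jiang and Conrath distance between two terms of a small ontology. The ontology, term names and annotation counts are hypothetical and chosen only for illustration. The graph-based Wang measure is not covered, and the step from term-level to gene-level similarity is omitted.

```python
from math import log

# Hypothetical toy ontology (child -> parents); not real GO terms.
parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
# Hypothetical annotation counts (each count includes the term's descendants).
counts = {"root": 100, "binding": 40, "catalysis": 60,
          "dna_binding": 10, "rna_binding": 20}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    """Information content: negative log of the term's annotation probability."""
    return -log(counts[term] / counts["root"])

def mica_ic(t1, t2):
    """Information content of the most informative common ancestor."""
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))

def lin_similarity(t1, t2):
    """Lin: 2 * IC(MICA) / (IC(t1) + IC(t2)); defined as 1.0 when both ICs are zero."""
    total = ic(t1) + ic(t2)
    return 2 * mica_ic(t1, t2) / total if total > 0 else 1.0

def jc_distance(t1, t2):
    """Jiang and Conrath distance: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic(t1) + ic(t2) - 2 * mica_ic(t1, t2)

sim = lin_similarity("dna_binding", "rna_binding")
dist_jc = jc_distance("dna_binding", "rna_binding")
```

Jiang and Conrath define a distance, so a conversion such as 1 / (1 + distance) is commonly applied when a similarity score is required; which conversion was used for the figures is not stated here.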

The pair-wise semantic similarity matrix given in Fig. 10 shows the values for Lin, Jiang and Conrath and Wang measures for the Lymphoma Dataset along with the plots.

Fig. 10 Pair-wise semantic similarity matrix using (a) Lin, (b) Jiang and Conrath and (c) Wang for the Lymphoma Dataset

Figure 11 below shows the pair-wise semantic similarity values and plots of some of the genes for the Lin, Jiang and Conrath and Wang measures for the Embryonal Tumours of the Central Nervous System Dataset.

Fig. 11 Pair-wise semantic similarity matrix using (a) Lin, (b) Jiang and Conrath and (c) Wang for the Embryonal Tumours of the Central Nervous System Dataset

Appendix C: Comparison of pair-wise gene expression similarity and semantic similarity

Figure 12 below gives a comparison of expression similarity and semantic similarity of four sample gene pairs of the Breast Cancer Dataset, suggesting that gene products with similar expression patterns might have similarly annotated profiles.

Fig. 12 Comparison of gene expression similarity and semantic similarity for (a) Lin, (b) Jiang and Conrath and (c) Wang of four sample gene pairs of the Breast Cancer Dataset

Figures 13 and 14 show the comparison of expression similarity and semantic similarity for four sample gene pairs of the Lymphoma Dataset and the Embryonal Tumours of the Central Nervous System Dataset, respectively. The assumption that gene products with similar expression patterns might have similarly annotated profiles also seems to hold for these datasets: the plotted values for the gene pairs exhibit a similar trend.

Fig. 13 Comparison of gene expression similarity and semantic similarity for (a) Lin, (b) Jiang and Conrath and (c) Wang of four sample gene pairs of the Lymphoma Dataset

Fig. 14 Comparison of gene expression similarity and semantic similarity for (a) Lin, (b) Jiang and Conrath and (c) Wang of four sample gene pairs of the Embryonal Tumours of the Central Nervous System Dataset

Appendix D: Comparison of pair-wise gene expression similarity, semantic similarity and sequence similarity

The gene expression similarity, the Lin, Jiang and Conrath, and Wang semantic similarities, and the sequence similarity of some of the gene pairs of the Breast Cancer Dataset are given in Fig. 15. The graph shows that the pair-wise scores of the genes under the various measures follow a common trend, indicating a correlation between gene expression similarity, semantic similarity and sequence similarity.
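One way to quantify such a common trend, beyond visual inspection of the graph, is a rank correlation between the scores that two measures assign to the same gene pairs. The sketch below uses Spearman's coefficient. This is an assumed choice rather than a procedure stated in the paper, and the score vectors are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # all values tied; 0.0 is an assumed convention
    return cov / (sd_x * sd_y)

# Hypothetical scores for four gene pairs under three measures.
expression = [0.91, 0.40, 0.75, 0.12]
semantic = [0.80, 0.35, 0.62, 0.20]
sequence = [0.70, 0.30, 0.65, 0.10]

rho_expr_sem = spearman(expression, semantic)
rho_expr_seq = spearman(expression, sequence)
```

With only four gene pairs per dataset, such a coefficient is descriptive at best and should not be read as a significance test.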

Fig. 15 Comparison of gene expression similarity, semantic similarity using (a) Lin, (b) Jiang and Conrath and (c) Wang, and sequence similarity of four sample gene pairs of the Breast Cancer Dataset

Figure 16 depicts the gene expression similarity, the Lin, Jiang and Conrath, and Wang semantic similarities, and the sequence similarity of some of the gene pairs of the Lymphoma Dataset. Here too, the pair-wise scores under the various measures follow a common trend, indicating a correlation between gene expression similarity, semantic similarity and sequence similarity. Figure 17 shows the same pattern for the gene pairs of the Embryonal Tumours of the Central Nervous System Dataset.

Fig. 16 Comparison of gene expression similarity, semantic similarity using (a) Lin, (b) Jiang and Conrath and (c) Wang, and sequence similarity of four sample gene pairs of the Lymphoma Dataset

Fig. 17 Comparison of gene expression similarity, semantic similarity using (a) Lin, (b) Jiang and Conrath and (c) Wang, and sequence similarity of four sample gene pairs of the Embryonal Tumours of the Central Nervous System Dataset

Appendix E: Result of internal validation

The internal validation measures of connectivity, Dunn index and silhouette width for each of the eight algorithms for the Breast Cancer dataset are shown in Table 11. We notice that hierarchical clustering with two clusters performs the best in the case of connectivity and silhouette width and with four clusters in case of Dunn index. The plots of the connectivity, Dunn index, and silhouette width are given in Fig. 18, which indicates that hierarchical clustering outperforms the other clustering algorithms under each validation measure and hence appears to be the method of choice.

Table 11 Scores of internal validation measures for the Breast Cancer Dataset

Fig. 18 The plots of the connectivity, Dunn index, and silhouette width for the Breast Cancer Dataset
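For readers unfamiliar with the internal measures reported above, the following sketch computes the Dunn index and the average silhouette width for a given partition, using Euclidean distance. It assumes at least two clusters, at least one cluster with two or more members, and no duplicate points. The data are hypothetical, and connectivity is not covered.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest within-cluster distance."""
    n = len(points)
    between = [dist(points[i], points[j])
               for i in range(n) for j in range(i + 1, n) if labels[i] != labels[j]]
    within = [dist(points[i], points[j])
              for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]]
    return min(between) / max(within)

def silhouette_width(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all observations."""
    n = len(points)
    scores = []
    for i in range(n):
        mean_dist = {}
        for label in set(labels):
            members = [j for j in range(n) if labels[j] == label and j != i]
            if members:
                mean_dist[label] = sum(dist(points[i], points[j]) for j in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: s(i) is set to 0 by convention
            continue
        a = mean_dist[own]                                              # cohesion
        b = min(d for label, d in mean_dist.items() if label != own)   # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Hypothetical two-dimensional observations in two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]

dunn = dunn_index(points, labels)
silhouette = silhouette_width(points, labels)
```

Larger values of both measures indicate more compact and better-separated clusters.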

The connectivity, Dunn index and silhouette width internal validation measures are calculated for the eight algorithms on the Lymphoma dataset, and the results are shown in Table 12. As in Table 11, hierarchical clustering with two clusters performs best in terms of connectivity and silhouette width, and with three clusters in terms of the Dunn index. Hierarchical clustering thus outperforms the other clustering algorithms under each validation measure, and the plots of connectivity, Dunn index and silhouette width in Fig. 19 likewise suggest that it is the method of choice.

Table 12 Scores of internal validation measures for the Lymphoma Dataset

Fig. 19 The plots of the connectivity, Dunn index, and silhouette width for the Lymphoma Dataset

The internal validation scores for connectivity, Dunn index and silhouette width on the Embryonal Tumours of the Central Nervous System dataset are shown in Table 13. The optimal scores show that hierarchical clustering with two clusters performs best in each case, which is confirmed by the plots of the connectivity, Dunn index and silhouette width in Fig. 20. SOTA does not perform well, as it could not uncover clusters in the range of four to six clusters.

Table 13 Scores of internal validation measures for the Embryonal Tumours of the Central Nervous System Dataset

Fig. 20 The plots of the connectivity, Dunn index, and silhouette width for the Embryonal Tumours of the Central Nervous System Dataset

Appendix F: Result of stability measures

The results of APN, AD, ADM and FOM for the Breast Cancer dataset are given in Table 14.

Table 14 Scores of Stability Measures for the Breast Cancer Dataset

For the APN and ADM measures, values close to zero are preferred. The optimal scores in Table 14 show that hierarchical clustering with four clusters gives the best score, as was also the case for internal validation. However, for the other two measures, model-based clustering with six clusters has the best score. It is illustrative to visualize each of the validation measures graphically.
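To clarify why APN values close to zero are preferred, the sketch below computes the average proportion of non-overlap (APN) under its commonly used definition: for each observation, the share of its original cluster that no longer accompanies it after the data are re-clustered with one column removed, averaged over all observations and all removed columns. The cluster labelings are hypothetical, and the re-clustering step itself is assumed to have been done elsewhere.

```python
def apn(full_labels, reduced_labelings):
    """Average proportion of non-overlap.

    full_labels: cluster label of each observation using all columns.
    reduced_labelings: one labeling per removed column, in the same observation order.
    """
    n = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for i in range(n):
            # Cluster containing observation i in the full and in the reduced clustering.
            full_cluster = {j for j in range(n) if full_labels[j] == full_labels[i]}
            reduced_cluster = {j for j in range(n) if reduced[j] == reduced[i]}
            total += 1 - len(full_cluster & reduced_cluster) / len(full_cluster)
    return total / (n * len(reduced_labelings))

# Hypothetical labelings of four observations.
full = [0, 0, 1, 1]
reduced = [
    [1, 1, 0, 0],  # same partition with swapped label names: no instability
    [0, 1, 1, 1],  # observation 1 moves to the other cluster
]
score = apn(full, reduced)
```

Because only co-membership is compared, renaming the cluster labels between runs does not affect the score.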

The plots of the APN, AD and ADM are given in Fig. 21. The APN measure shows an interesting trend: it initially stabilizes from two to four clusters for all the clustering methods except SOM and SOTA, but increases marginally afterwards. Though hierarchical clustering with four clusters has the best score, Diana with six clusters is a close second. The AD and FOM measures tend to decrease as the number of clusters increases. Here model-based clustering with six clusters has the best overall score, though the other algorithms have similar scores. The plot of the FOM measure is very similar to that of the AD measure, so we have omitted it from the figure. For the ADM measure, hierarchical clustering with four clusters again has the best score.

Fig. 21 The plots of the APN, AD and ADM stability measures for the Breast Cancer Dataset

For the Lymphoma dataset, the plots of the APN, AD and ADM are given in Fig. 22. Though the graph of the APN measure shows Diana as the most favourable algorithm, one must keep in mind that this algorithm is a special case of the hierarchical algorithm. Thus, hierarchical clustering with two clusters gives the best score, which matches the internal validation findings and the optimal scores given in Table 15. For the AD and FOM measures, PAM with six clusters has the best overall score, but over the entire range of clusters evaluated, SOM, K-means and Diana have comparable performance. Similarly, for the ADM measure, hierarchical clustering has a more stable and better performance.

Fig. 22 The plots of the APN, AD and ADM stability measures for the Lymphoma Dataset

Table 15 Scores of Stability Measures for the Lymphoma Dataset

The plots of the APN, AD and ADM for the Embryonal Tumours of the Central Nervous System Dataset are given in Fig. 23. Hierarchical clustering with two clusters achieves the best score, followed by Diana, which confirms the findings of internal validation and agrees with the optimal scores given in Table 16. PAM and K-means also perform well. For the AD and FOM measures, PAM with six clusters has the best overall score, along with K-means. SOTA does not perform well in this case, as it could not uncover clusters in the range of four to six clusters.

Fig. 23 The plots of the APN, AD and ADM stability measures for the Embryonal Tumours of the Central Nervous System Dataset

Table 16 Scores of Stability Measures for the Embryonal Tumours of Central Nervous System Dataset

Appendix G: Results of BHI and BSI

The BHI and the BSI values were computed for each clustering algorithm for cluster numbers ranging from two to six. We first consider the breast cancer data. Table 17 shows the scores for the Breast Cancer Dataset. Diana has the highest BHI score, for six clusters, and the hierarchical algorithm has the highest BSI score, for two clusters. This indicates that the hierarchical algorithm clusters genes with similar biological functionality most consistently.
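As a concrete reading of the biological homogeneity index (BHI), the sketch below follows its commonly used definition: within each cluster, the fraction of ordered pairs of annotated genes that share at least one functional class, averaged over the clusters. The gene names, cluster labels and functional classes are hypothetical. Clusters with fewer than two annotated genes are taken to contribute zero, which is an assumption of this sketch.

```python
def bhi(labels, annotations):
    """Biological homogeneity index.

    labels: dict mapping gene -> cluster label.
    annotations: dict mapping gene -> set of functional classes (may omit genes).
    """
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(gene)

    total = 0.0
    for members in clusters.values():
        annotated = [g for g in members if annotations.get(g)]
        n = len(annotated)
        if n < 2:
            continue  # assumed to contribute zero
        # Count ordered pairs of distinct genes sharing at least one class.
        matches = sum(1 for a in annotated for b in annotated
                      if a != b and annotations[a] & annotations[b])
        total += matches / (n * (n - 1))
    return total / len(clusters)

# Hypothetical clustering and functional annotation of five genes.
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
annotations = {
    "g1": {"apoptosis"},
    "g2": {"apoptosis", "cell cycle"},
    "g3": {"dna repair"},
    "g4": {"cell cycle"},
    "g5": {"cell cycle"},
}
score = bhi(labels, annotations)
```

Here the first cluster is only partly homogeneous (one matching pair out of three), whereas the second is fully homogeneous, giving an index of 2/3.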

Table 17 Scores of BHI and BSI for the Breast Cancer Dataset

Figure 24 shows the plots of BHI for the eight clustering algorithms. They reveal that Diana produces the most homogeneous biological clusters for this data set, and the results are statistically significant when the number of clusters is between four and six.

Fig. 24 BHI plot for the Breast Cancer Dataset

The plots of BSI are shown in Fig. 25. The hierarchical algorithm seems to be the most stable in its capability of producing biologically alike clusters from reduced data sets. Considering both indices, the hierarchical algorithm is the best choice for this data set to maximize biological homogeneity, and Diana is worth considering if six clusters are desired.

Fig. 25 BSI plot for the Breast Cancer Dataset

The scores for the Lymphoma Dataset are shown in Table 18. SOM has the highest BHI score, for six clusters, and the hierarchical algorithm again has the highest BSI score, for two clusters, which indicates its consistency in producing biologically homogeneous clusters for this data set.

Table 18 Scores of BHI and BSI for the Lymphoma Dataset

Figure 26 shows that SOM produces the most homogeneous biological clusters when six clusters are required, and that the hierarchical algorithm is the most consistent of all the algorithms. The plots of BSI are shown in Fig. 27. The hierarchical algorithm appears to be the most stable in its capability of producing clusters that are biologically alike, and model-based clustering appears to be the least stable. Considering both indices, the hierarchical algorithm seems to be the best choice for this data set to maximize biological homogeneity.

Fig. 26 BHI plot for the Lymphoma Dataset

Fig. 27 BSI plot for the Lymphoma Dataset

Finally, for the Embryonal Tumours of the Central Nervous System dataset, the BHI and the BSI scores are shown in Table 19. In both cases, the hierarchical algorithm scores highest for producing biologically significant clusters.

Table 19 Scores of BHI and BSI for the CNS dataset

Although the hierarchical algorithm shows a marked increase for five or six clusters, SOM can be the algorithm of choice when four clusters are desired, as shown in Fig. 28. The plots of BSI in Fig. 29 show that all the clustering algorithms produce consistent results except SOTA, which could not generate clusters in the range of four to six clusters. The hierarchical algorithm seems to be the most stable in its ability to produce biologically relevant clusters. Considering both indices, we conclude that the hierarchical algorithm is the best choice for this data set to maximize biological homogeneity.

Fig. 28 BHI plot for the CNS Dataset

Fig. 29 BSI plot for the CNS Dataset

About this article

Cite this article

Nagi, S., Bhattacharyya, D.K. Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures. Netw Model Anal Health Inform Bioinforma 3, 67 (2014). https://doi.org/10.1007/s13721-014-0067-9

  • DOI: https://doi.org/10.1007/s13721-014-0067-9

Keywords

  • Biological significance
  • Cancer dataset
  • Cluster analysis
  • Gene Ontology
  • Semantic
  • Sequence
  • Validation