Clustering of Expressed Sequence Tag Using Global and Local Features: A Performance Study

  • Keng-Hoong Ng
  • Somnuk Phon-Amnuaisuk
  • Chin-Kuan Ho
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 52)


Clustering of expressed sequence tags (ESTs) plays an important role in gene analysis. Alignment-based sequence comparison is commonly used to measure the similarity between sequences, and several alignment-free comparison methods have been introduced recently. In this paper, we evaluate the role of global and local features extracted by two alignment-free approaches, namely the compression-based method and the generalized relative entropy method. The evaluation is done from the perspective of EST clustering quality. Our evaluation shows that the local features of ESTs yield much better clustering quality than the global features. Furthermore, we compared the best clustering result achieved in our experiments against another EST clustering algorithm, wcd, and found that our performance is comparable to the latter.
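To make the idea of an alignment-free, compression-based comparison concrete, the following is a minimal sketch only. It uses the normalized compression distance with zlib as a stand-in, which is an assumption and not the grammar-based distance evaluated in the chapter. The function names, the single-linkage grouping, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Number of bytes zlib (level 9) needs to encode the sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (stand-in, not the chapter's measure).

    Smaller values mean the two sequences share more compressible content.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def cluster(seqs: list[str], threshold: float) -> list[int]:
    """Single-linkage grouping: merge any pair closer than the threshold.

    Returns one cluster label per sequence (labels are arbitrary integers).
    """
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if ncd(seqs[i], seqs[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def random_dna(length: int, rng: random.Random) -> str:
    """Synthetic sequence for demonstration; not real EST data."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq: str, n_subs: int, rng: random.Random) -> str:
    """Overwrite n_subs randomly chosen positions with random bases."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_subs):
        bases[pos] = rng.choice("ACGT")
    return "".join(bases)
```

Under these assumptions, a sequence and a lightly mutated copy of it should fall in the same cluster, while an unrelated random sequence should not. A whole-sequence measure like this corresponds loosely to a global feature; it says nothing about how the chapter extracts local features.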


Sequence clustering, Expressed sequence tag, Alignment-free, Similarity measure, Grammar-based distance, Generalized relative entropy



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  • Keng-Hoong Ng (1)
  • Somnuk Phon-Amnuaisuk (1)
  • Chin-Kuan Ho (1)
  1. Faculty of Information Technology, Multimedia University, Cyberjaya, Malaysia
