Clustering of Expressed Sequence Tag Using Global and Local Features: A Performance Study
Clustering of expressed sequence tags (ESTs) plays an important role in gene analysis. Alignment-based sequence comparison is commonly used to measure the similarity between sequences, and several alignment-free comparison methods have been introduced more recently. In this paper, we evaluate the role of global and local features extracted by two alignment-free approaches: the compression-based method and the generalized relative entropy method. The evaluation is carried out from the perspective of EST clustering quality. Our evaluation shows that the local features of ESTs yield much better clustering quality than the global features. Furthermore, we compared the best clustering result achieved in our experiments against another EST clustering algorithm, wcd, and found that our performance is comparable to the latter.
Keywords: Sequence clustering, Expressed sequence tag, Alignment-free, Similarity measure, Grammar-based distance, Generalized relative entropy
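To give a concrete sense of what a compression-based, alignment-free similarity measure looks like, the sketch below computes a normalized compression distance between sequences. It is only an illustration under stated assumptions: it uses the general-purpose zlib compressor rather than the grammar-based distance evaluated in the paper, the function names `compressed_size` and `ncd` are hypothetical, and the random nucleotide strings are toy stand-ins for real ESTs.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Return the size in bytes of the zlib-compressed sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small for similar inputs,
    roughly 1 for unrelated ones (assumption: zlib as the compressor)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: random nucleotide strings stand in for real ESTs.
rng = random.Random(0)
est_a = "".join(rng.choice("ACGT") for _ in range(400))

# est_b is est_a with a handful of point substitutions (a related sequence).
mutated = list(est_a)
for pos in rng.sample(range(400), 5):
    mutated[pos] = "A" if mutated[pos] != "A" else "C"
est_b = "".join(mutated)

# est_c is an unrelated random sequence.
est_c = "".join(rng.choice("ACGT") for _ in range(400))

print(f"d(a, b) = {ncd(est_a, est_b):.3f}")
print(f"d(a, c) = {ncd(est_a, est_c):.3f}")
```

No alignment is computed at any point: the distance depends only on how much of one sequence the compressor can reuse when encoding the other. A pairwise distance matrix built this way could then be passed to any standard clustering algorithm.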