Clustering of Expressed Sequence Tag Using Global and Local Features: A Performance Study

Ng, Keng-Hoong; Phon-Amnuaisuk, Somnuk; Ho, Chin-Kuan

doi:10.1007/978-90-481-3517-2_31

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 52)

Abstract

Clustering of expressed sequence tags (ESTs) plays an important role in gene analysis. Alignment-based sequence comparison is commonly used to measure the similarity between sequences, and several alignment-free comparison methods have recently been introduced. In this paper, we evaluate the role of global and local features extracted by two alignment-free approaches: the compression-based method and the generalized relative entropy method. The evaluation is done from the perspective of EST clustering quality. Our evaluation shows that the local features of ESTs yield much better clustering quality than the global features. Furthermore, we compared the best clustering result achieved in our experiments against another EST clustering algorithm, wcd, and found that our performance is comparable to the latter.
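
To make the idea of an alignment-free, compression-based comparison concrete, the following minimal sketch computes the normalized compression distance (NCD) between two nucleotide sequences. It is an illustration only and is not the feature extraction evaluated in this chapter: the choice of zlib as the compressor, the NCD formula, and the function names `compressed_size` and `ncd` are all assumptions made for this example.

```python
import zlib


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (assumed stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Illustrative only: similar sequences compress well together, so the
    concatenation costs little more than the larger of the two alone and
    the distance is small; unrelated sequences give a value closer to 1.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A pairwise matrix of such distances over a set of ESTs could then be passed to any standard clustering routine. No alignment is computed at any point, which is what distinguishes this family of measures from alignment-based comparison.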

Author information

Authors and Affiliations

Faculty of Information Technology, Multimedia University, Cyberjaya, 63100, Selangor Darul Ehsan, Malaysia
Keng-Hoong Ng, Somnuk Phon-Amnuaisuk & Chin-Kuan Ho

Corresponding author

Correspondence to Keng-Hoong Ng.

Editor information

Editors and Affiliations

School of Information Sciences & Engineering, University of Canberra, Canberra, 2601, Australia
Xu Huang
International Association of Engineers, Hung To Road 37-39, Hong Kong, Hong Kong/PR China
Sio-Iong Ao
Dept. Computer Science, Tijuana Institute of Technology, Chula Vista, USA
Oscar Castillo

About this chapter

Cite this chapter

Ng, KH., Phon-Amnuaisuk, S., Ho, CK. (2009). Clustering of Expressed Sequence Tag Using Global and Local Features: A Performance Study. In: Huang, X., Ao, SI., Castillo, O. (eds) Intelligent Automation and Computer Engineering. Lecture Notes in Electrical Engineering, vol 52. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3517-2_31

DOI: https://doi.org/10.1007/978-90-481-3517-2_31
Published: 05 December 2009
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3516-5
Online ISBN: 978-90-481-3517-2
eBook Packages: Engineering, Engineering (R0)