Using Markov model to improve word normalization algorithm for biological sequence comparison

Dai, Qi; Liu, Xiaoqing; Yao, Yuhua; Zhao, Fukun

doi:10.1007/s00726-011-0906-2

Original Article
Published: 20 April 2011

Amino Acids, volume 42, pages 1867–1877 (2012)

Qi Dai¹, Xiaoqing Liu², Yuhua Yao¹ & Fukun Zhao¹

Abstract

Statistical measures for sequence comparison face two crucial problems: the overlapping structure of words and the background information of words in biological sequences. The word normalization used in the improved composition vector method takes both problems into account and achieves better performance in evolutionary analysis. This word normalization is desirable but not sufficient, because it assumes that the four bases A, C, T, and G occur randomly and with equal probability. This paper proposes an improved word normalization that uses a Markov model to estimate the exact k-word distribution from the observed biological sequence, and can therefore adjust for the background information of the k-word frequencies in biological sequences. The improved word normalization was tested in three experiments and compared with the existing word normalization. The experimental results confirm that the improved word normalization, which uses a Markov model to estimate the exact k-word distribution in biological sequences, is more efficient.
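To make the contrast in the abstract concrete, the following is a minimal sketch and not the authors' algorithm, whose exact formulas are in the paywalled full text. It compares the expected k-word counts under the equal-chance assumption with those under a first-order Markov chain fitted to the observed sequence. The function names (`kword_counts`, `uniform_expected`, `markov1_expected`) and the choice of a first-order chain are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kword_counts(seq, k):
    """Count overlapping k-words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def uniform_expected(seq, k):
    """Expected count of each k-word if A, C, G, T are i.i.d. with probability 1/4."""
    n_words = len(seq) - k + 1
    return {"".join(w): n_words * 0.25 ** k for w in product(ALPHABET, repeat=k)}

def markov1_expected(seq, k):
    """Expected count of each k-word under a first-order Markov chain fitted to seq.

    Assumption for illustration: the initial distribution is the observed base
    frequency, and transition probabilities are observed pair frequencies.
    """
    n_words = len(seq) - k + 1
    base = Counter(seq)            # single-base counts
    pairs = kword_counts(seq, 2)   # counts of adjacent base pairs
    starts = Counter(seq[:-1])     # how often each base begins a pair
    expected = {}
    for w in product(ALPHABET, repeat=k):
        p = base[w[0]] / len(seq)
        for a, b in zip(w, w[1:]):
            # Estimated transition probability P(b | a); zero if a never starts a pair.
            p *= pairs[a + b] / starts[a] if starts[a] else 0.0
        expected["".join(w)] = n_words * p
    return expected

if __name__ == "__main__":
    seq = "ACGTACGTACGT"
    observed = kword_counts(seq, 2)
    uniform = uniform_expected(seq, 2)
    markov = markov1_expected(seq, 2)
    for word in ("AC", "AA"):
        print(word, observed[word], uniform[word], markov[word])
```

On a strongly structured sequence such as the toy input above, the uniform background assigns the same expectation to every 2-word, while the fitted chain concentrates the expectation on the words the sequence composition actually supports. That difference in background is what a normalization step would subtract or divide out.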

Acknowledgments

The authors thank the anonymous referees for their valuable suggestions and support. This work was supported by the National Natural Science Foundation of China (61001214, 61003191) and by research grants (Y2100930, Y6100339) from the Zhejiang Provincial Natural Science Foundation of China.

Author information

Authors and Affiliations

College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou, 310018, People’s Republic of China
Qi Dai, Yuhua Yao & Fukun Zhao
School of Science, Hangzhou Dianzi University, Hangzhou, 310018, People’s Republic of China
Xiaoqing Liu

Corresponding author

Correspondence to Qi Dai.

About this article

Cite this article

Dai, Q., Liu, X., Yao, Y. et al. Using Markov model to improve word normalization algorithm for biological sequence comparison. Amino Acids 42, 1867–1877 (2012). https://doi.org/10.1007/s00726-011-0906-2

Received: 13 November 2010
Accepted: 29 March 2011
Published: 20 April 2011
Issue Date: May 2012
DOI: https://doi.org/10.1007/s00726-011-0906-2
