
Using Markov model to improve word normalization algorithm for biological sequence comparison

  • Original Article
  • Published in: Amino Acids

Abstract

Statistical measures for sequence comparison face two crucial problems: the overlapping structure of words and the background information of words in biological sequences. Word normalization in the improved composition vector method takes both problems into account and achieves better performance in evolutionary analysis. Word normalization is desirable but not sufficient, because it assumes that the four bases A, C, T, and G occur randomly with equal probability. This paper proposes an improved word normalization that uses a Markov model to estimate the exact k-word distribution from the observed biological sequence, and can therefore adjust for the background information in the k-word frequencies. The improved word normalization was tested in three experiments and compared with the existing word normalization. The experimental results confirm that the improved word normalization, which uses a Markov model to estimate the exact k-word distribution in biological sequences, is more efficient.
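
The contrast between an equal-chance background and a sequence-specific Markov background can be illustrated with a minimal sketch. It is not the authors' published formula: it assumes a first-order Markov chain fitted to a single DNA sequence, and the function names (kword_counts, uniform_expected, markov1_expected) are hypothetical names chosen for this illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def kword_counts(seq, k):
    """Count overlapping k-words (substrings of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def uniform_expected(seq, k):
    """Expected count of any k-word if the four bases are equally likely."""
    return (len(seq) - k + 1) * 0.25 ** k


def markov1_expected(seq, word):
    """Expected count of `word` under a first-order Markov chain fitted to seq.

    Assumption for this sketch: the initial probability is the observed base
    frequency, and transition probabilities come from dinucleotide counts.
    """
    n = len(seq)
    base = Counter(seq)
    pair = kword_counts(seq, 2)
    prob = base[word[0]] / n
    for a, b in zip(word, word[1:]):
        # Number of observed transitions that leave base `a`.
        out = sum(pair[a + c] for c in ALPHABET)
        if out == 0:
            return 0.0
        prob *= pair[a + b] / out
    return (n - len(word) + 1) * prob


seq = "ACGTACGTACGT"
observed = kword_counts(seq, 3)["ACG"]
print(observed, uniform_expected(seq, 3), markov1_expected(seq, "ACG"))
```

For this toy sequence the Markov estimate for ACG lies much closer to the observed count than the equal-chance estimate does, which is the kind of background adjustment the abstract describes.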



Acknowledgments

The authors thank all the anonymous referees for their valuable suggestions and support. This work is supported by the National Natural Science Foundation of China (61001214, 61003191) and by research grants (Y2100930, Y6100339) from the Zhejiang Provincial Natural Science Foundation of China.

Author information

Corresponding author

Correspondence to Qi Dai.

Electronic supplementary material

Electronic supplementary material accompanies the online version of this article (PDF, 15 KB).

About this article

Cite this article

Dai, Q., Liu, X., Yao, Y. et al. Using Markov model to improve word normalization algorithm for biological sequence comparison. Amino Acids 42, 1867–1877 (2012). https://doi.org/10.1007/s00726-011-0906-2

  • DOI: https://doi.org/10.1007/s00726-011-0906-2
