Modelling-Alignment for Non-random Sequences

  • David R. Powell
  • Lloyd Allison
  • Trevor I. Dix
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3339)

Abstract

Populations of biased, non-random sequences may cause standard alignment algorithms to yield false-positive matches and false-negative misses. A standard significance test based on shuffling the sequences is only a partial solution, applicable to populations that can be described by simple models. Masking out intervals of low information content throws information away. We describe a new and general method, modelling-alignment: population models are incorporated into the alignment process, which can (and should) change the rank order of matches between a query sequence and a collection of sequences, compared with the results of standard algorithms. The method places very few conditions on the nature of the models that can be used with it. We apply modelling-alignment to local alignment, global alignment, optimal alignment, and the relatedness problem.
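To make the shuffling-based significance test concrete, here is a minimal sketch of that baseline only; it is not the modelling-alignment method itself. It assumes a plain Needleman-Wunsch global score with illustrative unit costs, and a uniform shuffle that preserves only single-letter composition. The function names, scoring parameters, and trial count are hypothetical choices made for this example, not taken from the paper.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with linear gap costs.

    The scoring parameters are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled copy of b aligns to a at least as well as b.

    A uniform shuffle keeps only the letter composition of b, so the
    implied null model is a simple zero-order one.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if global_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    score, p = shuffle_p_value("ACGTTGCAAC", "ACGTAGCAAC")
    print(score, p)
```

Because the shuffle preserves nothing beyond letter frequencies, a test of this kind can only account for bias that a very simple model captures, which is the limitation the abstract points to. For two identical single-letter runs, for example, every shuffle scores the same as the original, so the estimated p-value is 1.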

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • David R. Powell (1, 2)
  • Lloyd Allison (1)
  • Trevor I. Dix (1, 2)
  1. School of Computer Science and Software Engineering, Monash University, Australia
  2. Victorian Bioinformatics Consortium
