Modelling-Alignment for Non-random Sequences
Populations of biased, non-random sequences can cause standard alignment algorithms to report false-positive matches and to miss true ones (false negatives). The standard significance test based on shuffling sequences is only a partial solution, applicable to populations that can be described by simple models. Masking out intervals of low information content throws information away. We describe a new and general method, modelling-alignment: population models are incorporated into the alignment process, which can (and should) change the rank order of matches between a query sequence and a collection of sequences, compared with the results of standard algorithms. The method places very few conditions on the nature of the models that can be used with it. We apply modelling-alignment to local alignment, global alignment, optimal alignment, and the relatedness problem.
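For readers unfamiliar with the shuffling test mentioned above, the following is a minimal sketch of the general idea, not the procedure used in the paper. It assumes a plain Needleman-Wunsch global score with illustrative costs (match +1, mismatch -1, gap -1) and a uniform random shuffle, which preserves only single-character composition. The names `global_score` and `shuffle_p_value`, the cost values, and the trial count are all hypothetical choices made for this example.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, linear gap cost).

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_p_value(a, b, trials=200, seed=0):
    """Empirical p-value: how often a shuffled copy of b scores at least as well.

    A uniform shuffle keeps the character counts of b but destroys all
    higher-order structure, so it corresponds to a very simple population model.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    chars = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(chars)
        if global_score(a, "".join(chars)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    varied = "ACGTTGCAAGCTTGACCGTA"
    biased = "A" * 20
    print(shuffle_p_value(varied, varied))  # small: the match is unlikely by chance
    print(shuffle_p_value(biased, biased))  # 1.0: every shuffle scores the same
```

In this sketch a perfect match between two copies of a single-letter sequence is judged insignificant, because every shuffle reproduces it. A uniform shuffle cannot represent populations with richer structure, which is consistent with the abstract's remark that shuffling applies only to populations described by simple models.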
Keywords: Population Model; Receiver Operating Characteristic; Query Sequence; Dynamic Programming Algorithm; Optimal Alignment