Computing Substitution Matrices for Genomic Comparative Analysis

Cao, Minh Duc; Dix, Trevor I.; Allison, Lloyd

doi:10.1007/978-3-642-01307-2_64

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Abstract

Substitution matrices describe the rates at which one character in a biological sequence mutates into another, and are important for many knowledge discovery tasks such as phylogenetic analysis and sequence alignment. Computing substitution matrices for very long genomic sequences of divergent or even unrelated species requires sensitive algorithms that can take into account differences in the composition of the sequences. We present a novel algorithm that addresses this by computing a nucleotide substitution matrix specifically for the two genomes being aligned. The method is founded on information theory and set in the expectation maximisation framework. The algorithm iterates until convergence: it uses compression to align the sequences, estimates the matrix from the alignment, and then applies the matrix to find a better alignment. Our method reconstructs, with high accuracy, the substitution matrix for synthesised data generated from a known matrix with introduced noise. The model is then successfully applied to real data for various malaria parasite genomes, which differ in phylogenetic distance and have compositions that lessen the effectiveness of standard statistical analysis techniques.
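
The iteration described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`estimate_matrix`, `cost_in_bits`, `iterate`, `ungapped_align`), the additive pseudocount, and the convergence tolerance are all hypothetical, and the trivial position-by-position aligner only stands in for the compression-based aligner, whose details the abstract does not give.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def estimate_matrix(pairs, pseudocount=1.0):
    """Estimate P(y | x) from aligned nucleotide pairs (x, y).

    Additive smoothing (an assumption of this sketch) keeps every
    entry non-zero so that its code length stays finite.
    """
    counts = Counter(pairs)
    matrix = {}
    for x in ALPHABET:
        row_total = sum(counts[(x, y)] for y in ALPHABET)
        row_total += pseudocount * len(ALPHABET)
        matrix[x] = {
            y: (counts[(x, y)] + pseudocount) / row_total for y in ALPHABET
        }
    return matrix


def cost_in_bits(matrix, pairs):
    """Code length, in bits, of each y given its aligned x under the matrix."""
    return sum(-math.log2(matrix[x][y]) for x, y in pairs)


def ungapped_align(seq_a, seq_b, matrix):
    """Hypothetical stand-in aligner: pairs characters position by position.

    It ignores the matrix; a real aligner would use it to score matches.
    """
    return list(zip(seq_a, seq_b))


def iterate(seq_a, seq_b, align, max_rounds=20, tol=1e-6):
    """Alternate alignment and re-estimation until the matrix stops changing."""
    # Start from a uniform matrix: every substitution costs 2 bits.
    matrix = {x: {y: 0.25 for y in ALPHABET} for x in ALPHABET}
    for _ in range(max_rounds):
        pairs = align(seq_a, seq_b, matrix)    # align with current matrix
        new_matrix = estimate_matrix(pairs)    # re-estimate from alignment
        change = max(
            abs(new_matrix[x][y] - matrix[x][y])
            for x in ALPHABET
            for y in ALPHABET
        )
        matrix = new_matrix
        if change < tol:
            break
    return matrix


if __name__ == "__main__":
    fitted = iterate("ACGTACGT", "ACGTACGA", ungapped_align)
    for x in ALPHABET:
        print(x, [round(fitted[x][y], 3) for y in ALPHABET])
```

Because the stand-in aligner ignores the matrix, this loop settles after two rounds; with an aligner that scores candidate alignments using the current matrix, each round could change the alignment and hence the next estimate. The `cost_in_bits` helper shows the information-theoretic reading of the matrix: a more probable substitution is cheaper to encode.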

Author information

Authors and Affiliations

Clayton School of Information Technology, Monash University, Clayton, 3800, Australia
Minh Duc Cao, Trevor I. Dix & Lloyd Allison
Faculty of Information & Communication Technologies, Swinburne University of Technology, Hawthorn, 3122, Australia
Trevor I. Dix

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

Cite this paper

Cao, M.D., Dix, T.I., Allison, L. (2009). Computing Substitution Matrices for Genomic Comparative Analysis. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_64

DOI: https://doi.org/10.1007/978-3-642-01307-2_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)