Computing Substitution Matrices for Genomic Comparative Analysis

  • Minh Duc Cao
  • Trevor I. Dix
  • Lloyd Allison
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5476)


Substitution matrices describe the rates of mutating one character in a biological sequence to another character, and are important for many knowledge discovery tasks such as phylogenetic analysis and sequence alignment. Computing substitution matrices for very long genomic sequences of divergent or even unrelated species requires sensitive algorithms that can take into account differences in composition of the sequences. We present a novel algorithm that addresses this by computing a nucleotide substitution matrix specifically for the two genomes being aligned. The method is founded on information theory and in the expectation maximisation framework. The algorithm iteratively uses compression to align the sequences and estimates the matrix from the alignment, and then applies the matrix to find a better alignment until convergence. Our method reconstructs, with high accuracy, the substitution matrix for synthesised data generated from a known matrix with introduced noise. The model is then successfully applied to real data for various malaria parasite genomes, which have differing phylogenetic distances and composition that lessens the effectiveness of standard statistical analysis techniques.


Hash Table Genomic Comparative Analysis Alignment Score Substitution Matrix Expert Model 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Altschul, S.F., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25(17), 3389–3402 (1997)CrossRefGoogle Scholar
  2. 2.
    Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.: Versatile and open software for comparing large genomes. Genome. Biol. 5(2) (2004)Google Scholar
  3. 3.
    Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89(22), 10915–10919 (1992)CrossRefGoogle Scholar
  4. 4.
    Altschul, S.F., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)CrossRefGoogle Scholar
  5. 5.
    Lio, P., Goldman, N.: Models of Molecular Evolution and Phylogeny. Genome. Res. 8(12), 1233–1244 (1998)Google Scholar
  6. 6.
    Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Biol. 76(6), 368–376 (1981)Google Scholar
  7. 7.
    Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model for evolutionary change in proteins. In: National Biochemical Research Foundation, Washington DC (1978)Google Scholar
  8. 8.
    Comeron, J.M., Aguade, M.: An evaluation of measures of synonymous codon usage bias. J. Mol. Biol. 47(3), 268–274 (1998)Google Scholar
  9. 9.
    Klein, R., Eddy, S.: Rsearch: Finding homologs of single structured RNA sequences. BMC Bioinformatics 4(1) (2003)Google Scholar
  10. 10.
    Goldman, N.: Statistical tests of models of DNA substitution. J. Mol. Evol. 36(2), 182–198 (1993)CrossRefGoogle Scholar
  11. 11.
    Yang, Z.: Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39(1), 105–111 (1994)CrossRefGoogle Scholar
  12. 12.
    Yap, V.B., Speed, T.P.: Modeling dna base substitution in large genomic regions from two organisms. J. Mol. Evol. 58(1), 12–18 (2004)CrossRefGoogle Scholar
  13. 13.
    Jukes, T.H., Cantor, C.: Evolution of protein molecules. Mammalian Protein Metabolism, 21–132 (1969)Google Scholar
  14. 14.
    Kimura, M.: A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120 (1980)CrossRefGoogle Scholar
  15. 15.
    Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948)MathSciNetCrossRefMATHGoogle Scholar
  16. 16.
    Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11(2), 185–194 (1968)CrossRefMATHGoogle Scholar
  17. 17.
    Wallace, C.S., Freeman, P.R.: Estimation and inference by compact coding. Journal of the Royal Statistical Society series 49(3), 240–265 (1987)MathSciNetMATHGoogle Scholar
  18. 18.
    Cao, M.D., Dix, T.I., Allison, L., Mears, C.: A simple statistical algorithm for biological sequence compression. In: Data Compression Conference, pp. 43–52 (2007)Google Scholar
  19. 19.
    Cao, M.D., Dix, T.I., Allison, L.: A genome alignment algorithm based on compression. Technical Report 2009/233, FIT, Monash University (2009)Google Scholar
  20. 20.
    Altschul, S.F.: Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219(3), 555–565 (1991)CrossRefGoogle Scholar
  21. 21.
    Karlin, S., Altschul, S.F.: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Nat. Acad. Sci. 87(6), 2264–2268 (1990)CrossRefMATHGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Minh Duc Cao
    • 1
  • Trevor I. Dix
    • 1
    • 2
  • Lloyd Allison
    • 1
  1. 1.Clayton School of Information TechnologyMonash UniversityClaytonAustralia
  2. 2.Faculty of Information & Communication TechnologiesSwinburne University of TechnologyHawthornAustralia

Personalised recommendations