Back-Translation for Discovering Distant Protein Homologies

  • Marta Gîrdea
  • Laurent Noé
  • Gregory Kucherov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5724)

Abstract

Frameshift mutations in protein-coding DNA sequences drastically change the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins’ common origin. Moreover, when the divergence also involves a large number of substitutions, homology detection becomes difficult even at the DNA level. To cope with this situation, we propose a novel method for inferring distant homology relations between two proteins, one that accounts for the frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein. Its goal is to determine the two putative DNA sequences that have the best-scoring alignment under a scoring system designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, as confirmed by biologically significant examples.
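
Two ideas from the abstract can be illustrated with a minimal sketch: a single-nucleotide deletion changes every downstream amino acid, and the number of putative DNA sequences encoding a protein grows multiplicatively with its length, which is why a compact representation is needed. The sketch below is not the authors' algorithm. It assumes the standard genetic code, and the function names (translate, back_translate, count_putative_dna) are hypothetical, chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Reverse map: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate DNA in frame 0, ignoring a trailing incomplete codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def back_translate(protein):
    """Return, for each residue, the list of codons that may encode it."""
    return [CODONS_FOR[aa] for aa in protein]


def count_putative_dna(protein):
    """Count the distinct DNA sequences that translate to this protein."""
    total = 1
    for codons in back_translate(protein):
        total *= len(codons)
    return total


if __name__ == "__main__":
    original = "ATGGCCGATTACAAA"
    shifted = original[:3] + original[4:]  # delete one nucleotide after ATG
    print(translate(original))  # protein encoded by the original sequence
    print(translate(shifted))   # protein after the frameshift
    print(count_putative_dna("MLS"))  # 1 * 6 * 6 putative coding sequences
```

Enumerating every putative sequence explicitly quickly becomes infeasible, so the sketch keeps only the per-residue codon lists, in the same spirit as (but much simpler than) the graph representation mentioned in the abstract.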

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Marta Gîrdea (1)
  • Laurent Noé (1)
  • Gregory Kucherov (1)
  1. INRIA Lille - Nord Europe, LIFL/CNRS, Villeneuve d’Ascq, France
