Journal of Molecular Evolution, Volume 86, Issue 6, pp 365–378

Correlated Selection on Amino Acid Deletion and Replacement in Mammalian Protein Sequences

  • Yichen Zheng
  • Dan Graur
  • Ricardo B. R. Azevedo
Original Article


A low ratio of nonsynonymous to synonymous substitution rates (dN/dS) at a codon is an indicator of functional constraint caused by purifying selection. Intuitively, the same functional constraint would also be expected to prevent such a codon from being deleted. However, to the best of our knowledge, the correlation between the rates of deletion and substitution has never actually been estimated. Here, we use 8595 protein-coding region sequences from nine mammalian species to examine the relationship between deletion rate and dN/dS. We find significant positive correlations at the levels of both sites and genes. We compare our data against controls consisting of simulated coding sequences that evolve along identical phylogenetic trees and in which deletions occur independently of substitutions. The simulated sequences show a much weaker correlation, probably caused by alignment errors. In the real data, the correlations cannot be explained by alignment errors. Separate analyses of nonsynonymous (dN) and synonymous (dS) substitution rates indicate that the correlation is most likely due to a similarity in patterns of selection rather than in mutation rates.
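As an illustration of the gene-level comparison described above, the following minimal sketch computes a rank correlation between dN/dS and deletion rate. It is not the authors' pipeline: the choice of Spearman's rho is an assumption (the abstract does not name the statistic), the function names `average_ranks` and `spearman` are hypothetical, and the per-gene values are invented purely for demonstration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-gene values, invented for illustration only.
dn_ds = [0.05, 0.10, 0.20, 0.40, 0.80]
deletion_rate = [0.001, 0.003, 0.002, 0.006, 0.009]

rho = spearman(dn_ds, deletion_rate)
print(round(rho, 2))  # 0.9
```

A positive rho here would mean that genes with higher dN/dS also tend to have higher deletion rates. Running the same calculation on sequences simulated with deletions independent of substitutions would give the kind of control comparison the abstract describes.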


Mammals · Protein-coding genes · dN/dS · Codon deletion · Purifying selection · Indifferent DNA



We used the Maxwell cluster from the Center of Advanced Computing and Data Systems (CACDS) at the University of Houston. CACDS staff provided technical support. We would like to thank Sarah Parks and her colleagues at EMBL-European Bioinformatics Institute for their help in running the SLR program on part of our data. R.B.R.A. was funded by NIH R01GM101352. We would also like to thank Jaanus Suurväli and Jan Gravemeyer at University of Cologne for their help in manuscript editing.

Supplementary material

239_2018_9853_MOESM1_ESM.docx (793 kb)
Supplementary material 1 (DOCX 792 KB)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of Genetics, University of Cologne, Cologne, Germany
  2. Department of Biology and Biochemistry, University of Houston, Houston, USA
