Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information

  • Joseph L. Herman
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases.

While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package.

In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.

Key words

Protein structure, Structural alignment, RMSD, Statistical alignment, Alignment uncertainty, Bayesian hierarchical models, MCMC, Parallel tempering, Molecular phylogenetics, Globins



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Biomedical Informatics, Harvard Medical School, Boston, USA
