Abstract
During evolution, a protein sequence varies continually, but this variation is constrained by the need to maintain the protein's structural and functional integrity. Deciphering the evolutionary path of a protein is thus of fundamental interest. With the development of new methods for visualizing high-dimensional spaces and the improvement of phylogenetic analysis tools, it is now possible to study the evolutionary trajectories of proteins in sequence space. Using the data-driven high-dimensional scaling (DD-HDS) method, we show that potential evolutionary trajectories can be predicted and represented by mapping phylogenetic trees onto a 3D projection of the sequence space. Taking aminodeoxychorismate synthase, an enzyme involved in folate synthesis, as a case study, we show that this representation raises interesting questions about the complexity of the evolution of a given biological function, in particular its capacity to explore the sequence space.
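To make the notion of a projection of sequence space concrete, the sketch below builds the pairwise distance matrix that any scaling method takes as input, and scores how well a candidate 3D layout preserves it. It is an illustration under stated assumptions, not the method used in the article: the distance is a simple p-distance (fraction of differing aligned positions), the stress function is the plain sum of squared errors used in metric multidimensional scaling rather than the DD-HDS criterion, and the toy sequences and the function names `p_distance`, `distance_matrix`, and `stress` are hypothetical.

```python
from itertools import combinations
import math


def p_distance(a, b):
    """Fraction of aligned, gap-free positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Ignore alignment columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = p_distance(seqs[i], seqs[j])
    return d


def stress(d, coords):
    """Sum of squared gaps between original and embedded pairwise distances."""
    return sum(
        (d[i][j] - math.dist(coords[i], coords[j])) ** 2
        for i, j in combinations(range(len(d)), 2)
    )


# Hypothetical toy alignment (made-up sequences, not data from the article).
toy_alignment = ["MKTAYIAK", "MKTAYLAK", "MRTA-LAK", "MKSAYIGK"]
toy_distances = distance_matrix(toy_alignment)
```

A scaling method then searches for 3D coordinates that keep this kind of mismatch small; tree nodes, including inferred ancestral sequences, can be placed in the same space once their distances to the other sequences are known.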
Acknowledgements
This work was supported by the French National Research Agency (ANR-10-LABEX-04 GRAL Labex, Grenoble Alliance for Integrated Structural Cell Biology; ANR-11-BTBR-0008 Océanomics; ANR-15-IDEX-02 GlycoAlps and “Origin Of Life” Cross Disciplinary Projects of the Univ. Grenoble-Alpes; ANR-17-EURE-0003).
About this article
Cite this article
Lespinats, S., De Clerck, O., Colange, B. et al. Phylogeny and Sequence Space: A Combined Approach to Analyze the Evolutionary Trajectories of Homologous Proteins. The Case Study of Aminodeoxychorismate Synthase. Acta Biotheor 68, 139–156 (2020). https://doi.org/10.1007/s10441-019-09352-0
DOI: https://doi.org/10.1007/s10441-019-09352-0