The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern protein sequences

White, Stephen H.; Jacobs, Russell E.

doi:10.1007/BF02407307

Published: January 1993

Volume 36, pages 79–95, (1993)

Journal of Molecular Evolution

Stephen H. White & Russell E. Jacobs

Summary

We examine in this paper one of the expected consequences of the hypothesis that modern proteins evolved from random heteropeptide sequences. Specifically, we investigate the lengthwise distributions of amino acids in a set of 1,789 protein sequences with little sequence identity using the run test statistic (r_o) of Mood (1940, Ann. Math. Stat. 11, 367–392). The probability density of r_o for a collection of random sequences has mean = 0 and variance = 1 [the N(0,1) distribution] and can be used to measure the tendency of amino acids of a given type to cluster together in a sequence relative to that of a random sequence. We implement the run test using binary representations of protein sequences in which the amino acids of interest are assigned a value of 1 and all others a value of 0. We consider individual amino acids and sets of various combinations of them based upon hydrophobicity (4 sets), charge (3 sets), volume (4 sets), and secondary structure propensity (3 sets). We find that any sequence chosen randomly has a 90% or greater chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. We regard this as strong support for the random-origin hypothesis. However, we do observe significant deviations from the random expectation, as might be expected after billions of years of evolution. Two important global trends are found: (1) Amino acids with a strong α-helix propensity show a strong tendency to cluster, whereas those with β-sheet or reverse-turn propensity do not. (2) Clustered rather than evenly distributed patterns tend to be preferred by the individual amino acids, and this is particularly so for methionine. Finally, we consider the problem of reconciling the random nature of protein sequences with structurally meaningful periodic "patterns" that can be detected by sliding-window, autocorrelation, and Fourier analyses. Two examples, rhodopsin and bacteriorhodopsin, show that such patterns are a natural feature of random sequences.
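
The binary encoding and run test described in the Summary can be illustrated with a minimal Python sketch. It assumes the standard two-category (Wald-Wolfowitz) normalization of the run count, which may differ in detail or in sign convention from the r_o statistic used in the paper. The function names, the residue set `ILLUSTRATIVE_HYDROPHOBIC`, and the toy sequence are hypothetical and are not taken from the article.

```python
import math

# Hypothetical residue set for illustration only; the paper's four
# hydrophobicity sets are not listed in the Summary.
ILLUSTRATIVE_HYDROPHOBIC = set("AVLIMFWC")


def binary_encode(sequence, residues_of_interest):
    """Return a 0/1 list: 1 where the residue is in the set of interest."""
    return [1 if aa in residues_of_interest else 0 for aa in sequence]


def count_runs(bits):
    """Count maximal blocks of identical consecutive symbols."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def run_test_z(bits):
    """Standardized run count, approximately N(0,1) for random orderings.

    Assumed normalization (standard Wald-Wolfowitz form):
        mean = 2*n1*n0/n + 1
        var  = 2*n1*n0*(2*n1*n0 - n) / (n**2 * (n - 1))
    With this sign convention, negative values mean fewer runs than
    expected (clustering) and positive values mean more runs (even spacing).
    """
    n = len(bits)
    n1 = sum(bits)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("both symbol classes must be present")
    mean = 2 * n1 * n0 / n + 1
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for a meaningful run test")
    return (count_runs(bits) - mean) / math.sqrt(var)


if __name__ == "__main__":
    # Toy sequence, invented for the example (not a real protein).
    toy_sequence = "MKTLLLLAAAVVGGSDEEKKRRSTLLIIVNQ"
    bits = binary_encode(toy_sequence, ILLUSTRATIVE_HYDROPHOBIC)
    print("runs:", count_runs(bits))
    print("standardized run statistic:", round(run_test_z(bits), 3))
```

Because the statistic is approximately standard normal for a random ordering, values within roughly ±2 would be treated as indistinguishable from random at about the 5% level; the normal approximation is poor for very short sequences or very unbalanced compositions.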

References

Barlow RJ (1989) Statistics. A guide to the use of statistical methods in the physical sciences. John Wiley and Sons, New York, pp 1–204
Black JA, Harkins RN, Stenzel P (1976) Non-random relationships among amino acids in protein sequences. Int J Peptide Protein Res 8:125–130
Blake C (1983) Exons—present from the beginning? Nature (London) 306:535–537
Chan HS, Dill KA (1990) Origins of structure in globular proteins. Proc Natl Acad Sci USA 87:6388–6392
Darnell JE (1978) Implications of RNA-RNA splicing in evolution of eukaryotic cells. Science 202:1257–1260
David FN, Barton DE (1962) Combinatorial chance. Charles Griffin and Co., London, pp 1–356
Doolittle RF (1979) Protein evolution. In: Neurath H, Hill RL (eds) The proteins, vol IV. Academic Press, New York, pp 1–118
Doolittle RF (1989) Redundancies in protein sequences. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum Press, New York, pp 599–623
Doolittle WF (1978) Genes in pieces: were they ever together? Nature (London) 272:581–582
Dorit RL, Schoenbach L, Gilbert W (1990) How big is the universe of exons? Science 250:1377–1382
Dorit RL, Gilbert W (1991) The limited universe of exons. Curr Opin Struct Biol 1:973–977
Eck RV, Dayhoff MO (1966) Evolution of the structure of ferredoxin based on living relics of primitive amino acid sequences. Science 152:363–366
Eisenberg D, Weiss RM, Terwilliger TC (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature (London) 299:371–374
Eisenberg D, Weiss RM, Terwilliger TC (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA 81:140–144
Engelman DM, Steitz TA, Goldman A (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 15:321–353
Fasman GD (1989) The development of the prediction of protein structure. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum Press, New York, pp 193–316
Finkelstein AV, Ptitsyn OB (1987) Why do globular proteins fit the limited set of folding patterns? Prog Biophys Molec Biol 50:171–190
Fisher HF (1964) A limiting law relating the size and shape of protein molecules to their composition. Proc Natl Acad Sci USA 51:1285–1291
Fitch WM, Margoliash E (1967) A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem Genet 1:65–71
Gamow G, Ycas M (1958) The cryptographic approach to the problem of protein synthesis. In: Yockey HP (ed) Symposium on information theory in biology. Pergamon Press, New York, pp 63–69
Garnier J (1990) Protein structure prediction. Biochimie 72:513–524
Gates RE, Fisher HF (1971) Restrictions of sequence on the thickness of globular protein molecules. Proc Natl Acad Sci USA 68:2928–2931
Gellman SH (1991) On the role of methionine residues in the sequence-independent recognition of nonpolar protein surfaces. Biochemistry 30:6633–6636
George DG, Barker WC, Hunt LT (1986) The protein identification resource (PIR). Nucleic Acids Res 14:11–15
Gilbert W (1978) Why genes in pieces? Nature (London) 271:501
Holland SK, Blake CCF (1990) Proteins, exons, and molecular evolution. In: Stone EM, Schwartz RJ (eds) Intervening sequences in evolution and development. Oxford University Press, New York, pp 10–42
Janin J (1979) Surface and inside volumes in globular proteins. Nature (London) 277:491–492
Jukes TH (1969) Evolutionary pattern of specificity regions in light chains of immunoglobulins. Biochem Genet 3:109–117
Karlin S, Bucher P, Brendel V, Altschul SF (1991) Statistical methods and insights for protein and DNA sequences. Annu Rev Biophys Biophys Chem 20:175–203
Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268
Khorana HG, Gerber GE, Herlihy WC, Gray CP, Anderegg RJ, Nihei K, Biemann K (1979) Amino acid sequence of bacteriorhodopsin. Proc Natl Acad Sci USA 76:5046–5050
Klapper MH (1977a) Amino acid frequency distributions in proteins. Fed Proc 36:837
Klapper MH (1977b) The independent distribution of amino acid near neighbor pairs into polypeptides. Biochem Biophys Res Comm 78:1018–1024
Lau KF, Dill KA (1990) Theory for protein mutability and biogenesis. Proc Natl Acad Sci USA 87:638–642
Lee B, Richards FM (1971) The interpretation of protein structures: estimation of static accessibility. J Mol Biol 55:379–400
Levitt M (1978) Conformational preferences of amino acids in globular proteins. Biochemistry 17:4277–4285
Macchiato V, Tramontano A (1985) Determination of the autocorrelation orders in proteins. Eur J Biochem 149:375–379
McCaldon P, Argos P (1988) Oligopeptide biases in protein sequences and their use in predicting protein coding regions in nucleotide sequences. Proteins Struct Funct Genet 4:99–122
McLachlan AD (1972) Repeating sequences and gene duplication in proteins. J Mol Biol 64:417–437
McLachlan AD, Stewart M (1976) The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. J Mol Biol 103:271–298
Mood AM (1940) The distribution theory of runs. Ann Math Stat 11:367–392
Nathans J, Hogness DS (1983) Isolation, sequence analysis, and intron-exon arrangement of the gene encoding bovine rhodopsin. Cell 34:807–814
Orcutt BC, George DG, Dayhoff MO (1983) Protein and nucleic acid data base systems. Annu Rev Biophys Bioengr 12:419–441
Parzen E (1967) Time series analysis papers. Holden-Day, San Francisco, pp 1–565
Patthy L (1991) Exons—original building blocks of proteins? BioEssays 13:187–192
Peebles PJE, Schramm DN, Turner EL, Kron RG (1991) The case for the relativistic hot Big Bang cosmology. Nature (London) 352:769–776
Ptitsyn OB (1985) Random sequences and protein folding. J Molec Struct (Theochem) 123:45–65
Ptitsyn OB (1987) Protein folding: hypotheses and experiments. J Protein Chem 6:273–294
Rose GD (1978) Prediction of chain turns in globular proteins on a hydrophobic basis. Nature (London) 272:586–590
Rose GD, Roy S (1980) Hydrophobic basis of packing in globular proteins. Proc Natl Acad Sci USA 77:4643–4647
Saroff HA (1984) The uniqueness of protein sequences. Uniqueness diagrams for the Dayhoff file—1984. Bull Math Biol 46:661–672
Shakhnovich EI, Gutin AM (1989) Formation of unique structure in polypeptide chains: theoretical investigation with the aid of a replica approach. Biophys Chem 34:187–199
Shakhnovich EI, Gutin AM (1990a) Implications of thermodynamics of protein folding for evolution of primary sequences. Nature (London) 346:773–775
Shakhnovich EI, Gutin AM (1990b) Enumeration of all compact conformations of copolymers with random sequence of links. J Chem Phys 93:5967–5971
Vonderviszt F, Matrai G, Simon I (1986) Characteristic sequential residue environment of amino acids in proteins. Int J Peptide Protein Res 27:483–492
Wani JK (1971) Probability and statistical inference. Appleton-Century-Crofts, New York, pp 1–315
White SH, Jacobs RE (1990) Statistical distribution of hydrophobic residues along the length of protein chains—implications for protein folding and evolution. Biophys J 57:911–921
Wilson IA, Haft DH, Getzoff ED, Tainer JA, Lerner RA, Brenner S (1985) Identical short peptide sequences in unrelated proteins can have different conformations: A testing ground for theories of immune recognition. Proc Natl Acad Sci USA 82:5255–5259
Ycas M (1958) The protein text. In: Yockey HP (ed) Symposium on information theory in biology. Pergamon Press, New York, pp 70–102
Zielenkiewicz P, Plochocka D, Rabczenko A (1988) The formation of protein secondary structure. Its connection with amino acid sequence. Biophys Chem 31:139–142
Zimmerman JM, Eliezer N, Simha R (1968) The characterization of amino acid sequences in proteins by statistical methods. J Theor Biol 21:170–201

Author information

Russell E. Jacobs
Present address: Beckman Institute, Mail Stop 139-74, California Institute of Technology, Pasadena, CA 91125, USA

Authors and Affiliations

Department of Physiology and Biophysics, University of California, Irvine, CA 92717, USA
Stephen H. White & Russell E. Jacobs

About this article

Cite this article

White, S.H., Jacobs, R.E. The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern protein sequences. J Mol Evol 36, 79–95 (1993). https://doi.org/10.1007/BF02407307

Received: 04 November 1991
Revised: 27 June 1992
Issue Date: January 1993
DOI: https://doi.org/10.1007/BF02407307
