The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern protein sequences

Summary

In this paper we examine one of the expected consequences of the hypothesis that modern proteins evolved from random heteropeptide sequences. Specifically, we investigate the lengthwise distributions of amino acids in a set of 1,789 protein sequences with little sequence identity, using the run test statistic (r_o) of Mood (1940, Ann. Math. Stat. 11, 367–392). The probability density of r_o for a collection of random sequences has mean = 0 and variance = 1 [the N(0,1) distribution] and can be used to measure the tendency of amino acids of a given type to cluster together in a sequence relative to that of a random sequence. We implement the run test using binary representations of protein sequences in which the amino acids of interest are assigned a value of 1 and all others a value of 0. We consider individual amino acids and sets of various combinations of them based upon hydrophobicity (4 sets), charge (3 sets), volume (4 sets), and secondary structure propensity (3 sets). We find that any sequence chosen randomly has a 90% or greater chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation, regardless of amino acid type. We regard this as strong support for the random-origin hypothesis. However, we do observe significant deviations from the random expectation, as might be expected after billions of years of evolution. Two important global trends are found: (1) Amino acids with a strong α-helix propensity show a strong tendency to cluster, whereas those with β-sheet or reverse-turn propensity do not. (2) Clustered rather than evenly distributed patterns tend to be preferred by the individual amino acids, and this is particularly so for methionine. Finally, we consider the problem of reconciling the random nature of protein sequences with structurally meaningful periodic “patterns” that can be detected by sliding-window, autocorrelation, and Fourier analyses. Two examples, rhodopsin and bacteriorhodopsin, show that such patterns are a natural feature of random sequences.
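
The binary encoding and run test described in the summary can be illustrated with a minimal sketch. It uses the standard large-sample runs test for a two-valued sequence, in which the number of runs is standardized by its mean and variance under random ordering. This is an assumption made for illustration: the exact definition of r_o taken from Mood (1940), including its sign convention, may differ in detail. The function names and the toy bit patterns are hypothetical.

```python
import math


def binary_encode(sequence, residues_of_interest):
    """Map a protein sequence to 1 (residue of interest) or 0 (all others)."""
    return [1 if aa in residues_of_interest else 0 for aa in sequence]


def count_runs(bits):
    """Count maximal blocks of identical consecutive values."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def run_test_z(bits):
    """Standardized number of runs (assumed form, not necessarily the paper's r_o).

    For a random ordering this is approximately N(0,1) for long sequences.
    With this sign convention, negative values mean fewer runs than expected
    (clustering) and positive values mean more runs (even spacing).
    """
    n = len(bits)
    n1 = sum(bits)   # number of residues of interest
    n0 = n - n1      # number of all other residues
    if n1 == 0 or n0 == 0:
        raise ValueError("sequence must contain both 1s and 0s")
    mean = 2.0 * n1 * n0 / n + 1.0
    var = 2.0 * n1 * n0 * (2.0 * n1 * n0 - n) / (n * n * (n - 1.0))
    if var <= 0:
        raise ValueError("variance is zero; sequence too short for the test")
    return (count_runs(bits) - mean) / math.sqrt(var)


if __name__ == "__main__":
    clustered = [1, 1, 1, 1, 0, 0, 0, 0]    # toy pattern: 2 runs
    alternating = [1, 0, 1, 0, 1, 0, 1, 0]  # toy pattern: 8 runs
    print(run_test_z(clustered), run_test_z(alternating))
```

Under this convention the clustered toy pattern gives a negative statistic and the alternating pattern a positive one of equal magnitude. Sequences this short are used only to make the arithmetic easy to follow; the normal approximation is meaningful only for much longer sequences.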

References

  1. Barlow RJ (1989) Statistics. A guide to the use of statistical methods in the physical sciences. John Wiley and Sons, New York, pp 1–204

  2. Black JA, Harkins RN, Stenzel P (1976) Non-random relationships among amino acids in protein sequences. Int J Peptide Protein Res 8:125–130

  3. Blake C (1983) Exons—present from the beginning? Nature (London) 306:535–537

  4. Chan HS and Dill KA (1990) Origins of structure in globular proteins. Proc Natl Acad Sci USA 87:6388–6392

  5. Darnell JE (1978) Implications of RNA-RNA splicing in evolution of eukaryotic cells. Science 202:1257–1260

  6. David FN, Barton DE (1962) Combinatorial chance. Charles Griffin and Co., London, pp 1–356

  7. Doolittle RF (1979) Protein evolution. In: Neurath H, Hill RL (eds) The proteins, vol IV. Academic Press, New York, pp 1–118

  8. Doolittle RF (1989) Redundancies in protein sequences. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum Press, New York, pp 599–623

  9. Doolittle WF (1978) Genes in pieces: were they ever together? Nature (London) 272:581–582

  10. Dorit RL, Schoenbach L, Gilbert W (1990) How big is the universe of exons. Science 250:1377–1382

  11. Dorit RL, Gilbert W (1991) The limited universe of exons. Curr Opin Struct Biol 1:973–977

  12. Eck RV, Dayhoff MO (1966) Evolution of the structure of ferredoxin based on living relics of primitive amino acid sequences. Science 152:363–366

  13. Eisenberg D, Weiss RM, Terwilliger TC (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature (London) 299:371–374

  14. Eisenberg D, Weiss RM, Terwilliger TC (1984) The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA 81:140–144

  15. Engelman DM, Steitz TA, Goldman A (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 15:321–353

  16. Fasman GD (1989) The development of the prediction of protein structure. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum Press, New York, pp 193–316

  17. Finkelstein AV, Ptitsyn OB (1987) Why do globular proteins fit the limited set of folding patterns. Prog Biophys Molec Biol 50:171–190

  18. Fisher HF (1964) A limiting law relating the size and shape of protein molecules to their composition. Proc Natl Acad Sci USA 51:1285–1291

  19. Fitch WM, Margoliash E (1967) A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case. Biochem Genet 1:65–71

  20. Gamow G, Ycas M (1958) The cryptographic approach to the problem of protein synthesis. In: Yockey HP (ed) Symposium on information theory in biology. Pergamon Press, New York, pp 63–69

  21. Garnier J (1990) Protein structure prediction. Biochimie 72:513–524

  22. Gates RE, Fisher HF (1971) Restrictions of sequence on the thickness of globular protein molecules. Proc Natl Acad Sci USA 68:2928–2931

  23. Gellman SH (1991) On the role of methionine residues in the sequence-independent recognition of nonpolar protein surfaces. Biochemistry 30:6633–6636

  24. George DG, Barker WC, Hunt LT (1986) The protein identification resource (PIR). Nucleic Acids Res 14:11–15

  25. Gilbert W (1978) Why genes in pieces? Nature (London) 271:501

  26. Holland SK, Blake CCF (1990) Proteins, exons, and molecular evolution. In: Stone EM, Schwartz RJ (eds) Intervening sequences in evolution and development. Oxford University Press, New York, pp 10–42

  27. Janin J (1979) Surface and inside volumes in globular proteins. Nature (London) 277:491–492

  28. Jukes TH (1969) Evolutionary pattern of specificity regions in light chains of immunoglobulins. Biochem Genet 3:109–117

  29. Karlin S, Bucher P, Brendel V, Altschul SF (1991) Statistical methods and insights for protein and DNA sequences. Annu Rev Biophys Biophys Chem 20:175–203

  30. Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268

  31. Khorana HG, Gerber GE, Herlihy WC, Gray CP, Anderegg RJ, Nihei K, Biemann K (1979) Amino acid sequence of bacteriorhodopsin. Proc Natl Acad Sci USA 76:5046–5050

  32. Klapper MH (1977a) Amino acid frequency distributions in proteins. Fed Proc 36:837

  33. Klapper MH (1977b) The independent distribution of amino acid near neighbor pairs and in polypeptides. Biochem Biophys Res Comm 78:1018–1024

  34. Lau KF, Dill KA (1990) Theory for protein mutability and biogenesis. Proc Natl Acad Sci USA 87:638–642

  35. Lee B, Richards FM (1971) The interpretation of protein structures: estimation of static accessibility. J Mol Biol 55:379–400

  36. Levitt M (1978) Conformational preferences of amino acids in globular proteins. Biochemistry 17:4277–4285

  37. Macchiato V, Tramontano A (1985) Determination of the autocorrelation orders in proteins. Eur J Biochem 149:375–379

  38. McCaldon P, Argos P (1988) Oligopeptide biases in protein sequences and their use in predicting protein coding regions in nucleotide sequences. Proteins: Struct Funct Genet 4:99–122

  39. McLachlan AD (1972) Repeating sequences and gene duplication in proteins. J Mol Biol 64:417–437

  40. McLachlan AD, Stewart M (1976) The 14-fold periodicity in alpha-tropomyosin and the interaction with actin. J Mol Biol 103:271–298

  41. Mood AM (1940) The distribution theory of runs. Ann Math Stat 11:367–392

  42. Nathans J, Hogness DS (1983) Isolation, sequence analysis, and intron-exon arrangement of the gene encoding bovine rhodopsin. Cell 34:807–814

  43. Orcutt BC, George DG, Dayhoff MO (1983) Protein and nucleic acid data base systems. Annu Rev Biophys Bioengr 12:419–441

  44. Parzen E (1967) Time series analysis papers. Holden-Day, San Francisco pp 1–565

  45. Patthy L (1991) Exons—original building blocks of proteins? BioEssays 13:187–192

  46. Peebles PJE, Schramm DN, Turner EL, Kron RG (1991) The case for the relativistic hot Big Bang cosmology. Nature (London) 352:769–776

  47. Ptitsyn OB (1985) Random sequences and protein folding. J Molec Struct (Theochem) 123:45–65

  48. Ptitsyn OB (1987) Protein folding: hypotheses and experiments. J Protein Chem 6:273–294

  49. Rose GD (1978) Prediction of chain turns in globular proteins on a hydrophobic basis. Nature (London) 272:586–590

  50. Rose GD, Roy S (1980) Hydrophobic basis of packing in globular proteins. Proc Natl Acad Sci USA 77:4643–4647

  51. Saroff HA (1984) The uniqueness of protein sequences. Uniqueness diagrams for the Dayhoff file—1984. Bull Math Biol 46:661–672

  52. Shakhnovich EI, Gutin AM (1989) Formation of unique structure in polypeptide chains: theoretical investigation with the aid of a replica approach. Biophys Chem 34:187–199

  53. Shakhnovich EI, Gutin AM (1990a) Implications of thermodynamics of protein folding for evolution of primary sequences. Nature (London) 346:773–775

  54. Shakhnovich EI, Gutin AM (1990b) Enumeration of all compact conformations of copolymers with random sequence of links. J Chem Phys 93:5967–5971

  55. Vonderviszt F, Matrai G, Simon I (1986) Characteristic sequential residue environment of amino acids in proteins. Int J Peptide Protein Res 27:483–492

  56. Wani JK (1971) Probability and statistical inference. Appleton-Century-Crofts, New York pp 1–315

  57. White SH, Jacobs RE (1990) Statistical distribution of hydrophobic residues along the length of protein chains—implications for protein folding and evolution. Biophys J 57:911–921

  58. Wilson IA, Haft DH, Getzoff ED, Tainer JA, Lerner RA, Brenner S (1985) Identical short peptide sequences in unrelated proteins can have different conformations: A testing ground for theories of immune recognition. Proc Natl Acad Sci USA 82:5255–5259

  59. Ycas M (1958) The protein text. In: Yockey HP (ed) Symposium on information theory in biology. Pergamon Press, New York, pp 70–102

  60. Zielenkiewicz P, Plochocka D, Rabczenko A (1988) The formation of protein secondary structure. Its connection with amino acid sequence. Biophys Chem 31:139–142

  61. Zimmerman JM, Eliezer N, Simha R (1968) The characterization of amino acid sequences in proteins by statistical methods. J Theor Biol 21:170–201

Author information

Correspondence to Stephen H. White.

About this article

Cite this article

White, S.H., Jacobs, R.E. The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern protein sequences. J Mol Evol 36, 79–95 (1993). https://doi.org/10.1007/BF02407307

Key words

  • Protein evolution
  • Protein sequence analysis
  • Random protein sequences
  • Run test
  • Protein folding
  • Rhodopsin
  • Bacteriorhodopsin