Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Science in China Series C: Life Sciences

Abstract

Sequence alignment is a common method for finding structurally conserved or similar regions in proteins. However, it is often inaccurate when the sequence identity between the sequences to be aligned is below 30%. The reason is that, in such sequences, different residues may play similar structural roles, and these residues are misaligned when the alignment uses a substitution matrix defined over all 20 residue types. Based on the similarity of their physicochemical features, residues can be clustered into a few groups. With such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences is retained. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, using a database of aligned protein structures (DAPS), a new clustering method based on substitution scores is proposed for grouping residues, and substitution matrices are constructed at different levels of simplification. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are then applied to the recognition of structurally conserved or similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet, whose size N is around 9.
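
As a rough illustration of what a reduced alphabet does to a pair of low-identity sequences, the sketch below rewrites two already-aligned toy sequences in a 9-letter alphabet and compares their identity before and after. The grouping `HYPOTHETICAL_GROUPS`, the helper names and the toy sequences are all assumptions made for this example; the grouping is a generic physicochemical one and is not the alphabet derived from DAPS in the article.

```python
# Illustrative 9-group alphabet (an assumption for this sketch, NOT the
# grouping obtained in the article). Together the groups cover all 20 residues.
HYPOTHETICAL_GROUPS = ["ILVM", "FWY", "AST", "G", "P", "C", "DE", "NQ", "KRH"]


def build_mapping(groups):
    """Map every residue to the first letter of the group it belongs to."""
    mapping = {}
    for group in groups:
        for residue in group:
            mapping[residue] = group[0]
    return mapping


def reduce_sequence(seq, mapping):
    """Rewrite a sequence in the reduced alphabet.

    Symbols that are not in the mapping (for example the gap character '-')
    are copied unchanged.
    """
    return "".join(mapping.get(ch, ch) for ch in seq.upper())


def identity(a, b):
    """Fraction of identical positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


MAPPING = build_mapping(HYPOTHETICAL_GROUPS)

# Toy aligned pair: no position is identical in the 20-letter alphabet,
# but most positions hold residues from the same hypothetical group.
seq_a = "MKVLDE"
seq_b = "LRIMEQ"

reduced_a = reduce_sequence(seq_a, MAPPING)
reduced_b = reduce_sequence(seq_b, MAPPING)

print(reduced_a, reduced_b)
print("identity, 20 letters:", identity(seq_a, seq_b))
print("identity, 9 letters: ", identity(reduced_a, reduced_b))
```

The sketch only shows the re-encoding step. Whether a given grouping actually improves alignment depends on how the groups and the matching reduced substitution matrix are chosen, which is the question the article addresses.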

Author information

Corresponding author

Correspondence to Wang Wei.

Additional information

Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear Project (973) of the NSM

Cite this article

Li, J., Wang, W. Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids. SCI CHINA SER C 50, 392–402 (2007). https://doi.org/10.1007/s11427-007-0023-3

  • DOI: https://doi.org/10.1007/s11427-007-0023-3
