Indexing Protein Structures Using Suffix Trees

  • Feng Gao
  • Mohammed J. Zaki
Part of the Methods in Molecular Biology™ book series (MIMB, volume 413)


Methods for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications in protein structure and function prediction, protein classification, and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For each query segment, the suffix tree is used to retrieve the maximal matches efficiently; these matches are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy of up to 97.8% at the superfamily level and 99.4% at the class level of the SCOP classification, and show that, on average, 7.49 out of 10 proteins from the same superfamily are retrieved among the top 10 matches. These results outperform the best previous methods.
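As a rough illustration of the geometry described above, the following sketch computes the two quantities named in the abstract for a single pair of residues: the distance between their Cα atoms and the angle between the normals of their triangle planes. The choice of the backbone N, Cα and C atoms as triangle vertices, the dictionary layout of a residue, and the function names are assumptions made for this example only; the chapter's actual feature definition, the set of residues it combines, and its normalization step are not reproduced here.

```python
import math

# Small 3D vector helpers on plain tuples, so the sketch has no dependencies.
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def norm(a):
    return math.sqrt(dot(a, a))

def triangle_normal(n, ca, c):
    """Unit normal of the plane through three atoms.

    Assumption: the residue triangle is spanned by the backbone
    N, C-alpha and C atoms; the source only says "a triangle".
    """
    v = cross(sub(n, ca), sub(c, ca))
    length = norm(v)
    if length == 0.0:
        raise ValueError("degenerate triangle: atoms are collinear")
    return (v[0] / length, v[1] / length, v[2] / length)

def pair_feature(res_i, res_j):
    """Return (C-alpha distance, angle in radians between plane normals).

    Each residue is a hypothetical dict with keys "N", "CA", "C"
    mapping to (x, y, z) coordinates.
    """
    dist = norm(sub(res_i["CA"], res_j["CA"]))
    cos_t = dot(triangle_normal(res_i["N"], res_i["CA"], res_i["C"]),
                triangle_normal(res_j["N"], res_j["CA"], res_j["C"]))
    # Clamp against floating-point drift before taking the arc cosine.
    cos_t = max(-1.0, min(1.0, cos_t))
    return dist, math.acos(cos_t)

if __name__ == "__main__":
    # Two made-up residues with perpendicular triangle planes.
    a = {"N": (1.0, 0.0, 0.0), "CA": (0.0, 0.0, 0.0), "C": (0.0, 1.0, 0.0)}
    b = {"N": (3.0, 4.0, 1.0), "CA": (3.0, 4.0, 0.0), "C": (3.0, 5.0, 0.0)}
    print(pair_feature(a, b))
```

In an indexing pipeline of the kind the abstract outlines, values like these would next be normalized and discretized into symbols so that each protein becomes a sequence that a suffix tree can index; that step is left out because its details are not given here.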


Protein structure indexing; suffix trees; structural motifs; 3D database search; approximate matches



We thank Tolga Can, Arnab Bhattacharya and Ambuj Singh for providing us with the ProGreSS code and other assistance. We also thank Chris Bystroff and Nilanjana De for helpful suggestions. This work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, NSF grant EIA-0103708 and NSF grant EMT-0432098.



Copyright information

© Humana Press Inc 2008
