Indexing Protein Structures Using Suffix Trees
Methods for indexing protein structures, and for fast, scalable retrieval of structures similar to a query, have important applications in protein structure and function prediction, protein classification, and drug discovery. In this chapter, we describe a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between their Cα atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For each query segment, the suffix tree is used to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy of up to 97.8% at the superfamily level and 99.4% at the class level of the SCOP classification, and on average 7.49 of the top 10 matches belong to the same superfamily as the query. These results outperform the best previous methods.
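As a rough illustration of the geometric features described above, the following sketch computes, for a pair of residues, the Cα–Cα distance and the angle between the normals of their triangles. The abstract does not say which atoms form each triangle; the sketch assumes the backbone N, Cα, and C atoms. The function names and the dictionary layout are hypothetical, and normalization of the resulting vectors is not shown.

```python
import math

# Small 3D vector helpers on (x, y, z) tuples.
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def norm(a):
    return math.sqrt(dot(a, a))

def triangle_normal(res):
    """Unit normal of the plane through a residue's N, CA, and C atoms.

    Assumption: the residue triangle is formed by these three backbone atoms.
    """
    n = cross(sub(res["N"], res["CA"]), sub(res["C"], res["CA"]))
    length = norm(n)
    if length == 0.0:
        raise ValueError("degenerate triangle: atoms are collinear")
    return tuple(x / length for x in n)

def pair_feature(res_i, res_j):
    """Return (CA-CA distance, angle in degrees between triangle normals)."""
    distance = norm(sub(res_i["CA"], res_j["CA"]))
    cos_theta = dot(triangle_normal(res_i), triangle_normal(res_j))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return distance, math.degrees(math.acos(cos_theta))

# Toy coordinates (not real protein data) chosen so the result is easy to check.
res_a = {"N": (1.0, 0.0, 0.0), "CA": (0.0, 0.0, 0.0), "C": (0.0, 1.0, 0.0)}
res_b = {"N": (4.0, 4.0, 0.0), "CA": (3.0, 4.0, 0.0), "C": (3.0, 4.0, 1.0)}
distance, angle = pair_feature(res_a, res_b)
```

Under these assumptions, a feature vector for a window of residues could be assembled by collecting such distance and angle values for the residue pairs in the window before normalization and indexing.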
Keywords: Protein structure indexing, suffix trees, structural motifs, 3D database search, approximate matches
We thank Tolga Can, Arnab Bhattacharya and Ambuj Singh for providing us with the ProGreSS code and other assistance. We also thank Chris Bystroff and Nilanjana De for helpful suggestions. This work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, NSF grant EIA-0103708 and NSF grant EMT-0432098.