Geometric Suffix Tree: A New Index Structure for Protein 3-D Structures
Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate query data structures for such 3-D structures are highly desired for research on proteins. This paper proposes a new data structure for indexing protein 3-D structures. For strings, there are many efficient indexing structures such as suffix trees, but it has been considered very difficult to design such sophisticated data structures for 3-D structures like proteins. Our index structure is based on the suffix tree and is called the geometric suffix tree. By using the geometric suffix tree for a set of protein structures, we can search for all of their substructures whose RMSDs (root mean square deviations) or URMSDs (unit-vector root mean square deviations) to a given query 3-D structure are not larger than a given bound. Though there are O(N^2) substructures, our data structure requires only O(N) space, where N is the sum of the lengths of the proteins in the set. We propose an O(N^2) construction algorithm for it, while a naive algorithm would require O(N^3) time to construct it. Moreover, we propose an efficient search algorithm. We also present computational experiments to demonstrate the practicality of our data structure. The experiments show that, when applied to a protein structure database, the construction time of the geometric suffix tree is in practice almost linear in the size of the database.
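To make the query in the abstract concrete, the following is a minimal brute-force sketch of the search problem that the index is meant to speed up; it is not the geometric suffix tree itself, nor the authors' algorithm. It assumes NumPy is available and that each structure is an (n, 3) array of backbone coordinates. The names `min_rmsd` and `naive_search` are illustrative only, and the RMSD is computed with the standard SVD-based (Kabsch-style) superposition, which is an assumption about the distance computation rather than something the abstract spells out.

```python
import numpy as np

def min_rmsd(P, Q):
    """Minimum RMSD between two (n, 3) coordinate arrays over all
    rotations and translations (no reflections)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Translate both point sets so their centroids sit at the origin.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # The singular values of the 3x3 covariance matrix give the best overlap.
    H = P0.T @ Q0
    s = np.linalg.svd(H, compute_uv=False)
    # If the best orthogonal map would be a reflection, flip the smallest
    # singular value so that only proper rotations are allowed.
    if np.linalg.det(H) < 0:
        s[-1] = -s[-1]
    msd = (np.sum(P0 ** 2) + np.sum(Q0 ** 2) - 2.0 * s.sum()) / len(P)
    # Clamp tiny negative values caused by floating-point cancellation.
    return float(np.sqrt(max(msd, 0.0)))

def naive_search(structures, query, bound):
    """Scan every length-m window of every structure (hypothetical baseline).

    Returns a list of (structure index, start offset, rmsd) for each
    window whose RMSD to the query is at most `bound`.
    """
    query = np.asarray(query, dtype=float)
    m = len(query)
    hits = []
    for sid, coords in enumerate(structures):
        coords = np.asarray(coords, dtype=float)
        for start in range(len(coords) - m + 1):
            d = min_rmsd(coords[start:start + m], query)
            if d <= bound:
                hits.append((sid, start, d))
    return hits
```

This scan recomputes a superposition for every window of the database on every query, which is the cost an index over all substructures is intended to avoid.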
Keywords: Singular Value Decomposition, Index Structure, Outgoing Edge, Construction Algorithm, Suffix Tree