Abstract
The need to retrieve or classify proteins using structure- or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources of new treatments. In folding simulations, similar intermediate structures might be indicative of a common folding pathway. Here we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the profiles returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. For our structural representation, we transform each 3D structure into a normalized 2D distance matrix and apply a 2D wavelet decomposition to generate our descriptor. We also create a hybrid representation by concatenating the two descriptors. We evaluate the generality of our models by using them as indices for database retrieval experiments as well as feature vectors for classification. We find that our methods provide excellent performance when compared with the state of the art for each task. Our results show that the sequence-based representation is generally superior to the structure-based representation and that, in the classification context, the hybrid strategy affords a significant improvement over sequence or structure alone.
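The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the wavelet family, the normalized size, or the number of decomposition levels, so the Haar wavelet, the nearest-neighbour resampling used for normalization, and the `size` and `levels` parameters are all assumptions made for illustration. The profile is assumed to be a list of per-residue score rows, such as a PSI-BLAST position-specific scoring matrix, and all function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def resample_1d(signal, size):
    """Nearest-neighbour resampling to a fixed length (assumed normalization)."""
    n = len(signal)
    return [signal[(i * n) // size] for i in range(size)]


def resample_2d(matrix, size):
    """Nearest-neighbour resampling of a square matrix to size x size."""
    n = len(matrix)
    idx = [(i * n) // size for i in range(size)]
    return [[matrix[r][c] for c in idx] for r in idx]


def haar_approx_1d(signal):
    """One level of orthonormal Haar approximation coefficients."""
    return [(signal[2 * i] + signal[2 * i + 1]) / math.sqrt(2)
            for i in range(len(signal) // 2)]


def haar_approx_2d(matrix):
    """One level of 2D orthonormal Haar approximation: (sum of 2x2 block) / 2."""
    half = len(matrix) // 2
    return [[(matrix[2 * r][2 * c] + matrix[2 * r][2 * c + 1]
              + matrix[2 * r + 1][2 * c] + matrix[2 * r + 1][2 * c + 1]) / 2.0
             for c in range(half)]
            for r in range(half)]


def structure_descriptor(coords, size=8, levels=2):
    """Distance matrix -> fixed size -> repeated 2D Haar approximation."""
    m = resample_2d(distance_matrix(coords), size)
    for _ in range(levels):
        m = haar_approx_2d(m)
    return [v for row in m for v in row]


def sequence_descriptor(profile, size=8, levels=2):
    """Summarize each profile column with a fixed-length 1D Haar approximation."""
    descriptor = []
    for j in range(len(profile[0])):
        column = resample_1d([row[j] for row in profile], size)
        for _ in range(levels):
            column = haar_approx_1d(column)
        descriptor.extend(column)
    return descriptor


def hybrid_descriptor(seq_desc, struct_desc):
    """Hybrid representation: plain concatenation of the two descriptors."""
    return seq_desc + struct_desc


# Toy inputs: four collinear points and a 4-residue, 2-column profile.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
profile = [[1, 0], [1, 0], [3, 2], [3, 2]]
struct_desc = structure_descriptor(coords, size=4, levels=1)
seq_desc = sequence_descriptor(profile, size=4, levels=2)
hybrid = hybrid_descriptor(seq_desc, struct_desc)
```

Because every protein is mapped to a vector of the same length regardless of its original size, the descriptors can be compared directly with a standard distance measure or passed to a classifier as feature vectors, which is the usage the abstract describes.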
Additional information
This work is supported in part by the following research grants: DOE Award No. DE-FG02-04ER25611; NSF CAREER Grant IIS-0347662.
Cite this article
Marsolo, K., Parthasarathy, S. On the use of structure and sequence-based features for protein classification and retrieval. Knowl Inf Syst 14, 59–80 (2008). https://doi.org/10.1007/s10115-007-0088-0
DOI: https://doi.org/10.1007/s10115-007-0088-0