On the use of structure and sequence-based features for protein classification and retrieval

Marsolo, Keith; Parthasarathy, Srinivasan

doi:10.1007/s10115-007-0088-0

On the use of structure and sequence-based features for protein classification and retrieval

Regular Paper
Published: 18 August 2007

Volume 14, pages 59–80, (2008)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Keith Marsolo¹ &
Srinivasan Parthasarathy¹

95 Accesses
6 Citations
Explore all metrics

Abstract

The need to retrieve or classify proteins using structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatment. With folding simulations, similar intermediate structures might be indicative of a common folding pathway. Here we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the profiles returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. For our structural representation, we transform each 3D structure into a normalized 2D distance matrix and apply a 2D wavelet decomposition to generate our descriptor. We also create a hybrid representation by concatenating together the above descriptors. We evaluate the generality of our models by using them as indices for database retrieval experiments as well as feature vectors for classification. We find that our methods provide excellent performance when compared with the state-of-the-art for each task. Our results show that the sequence-based representation is generally superior to the structure-based representation and that in the classification context, the hybrid strategy affords a significant improvement over sequence or structure.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins

Article Open access 16 May 2015

FEPS: A Tool for Feature Extraction from Protein Sequence

From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets

Article 16 April 2024

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Anang Z, Miller W and Lipman DJ (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Article Google Scholar
Aung Z and Tan K-L (2004). Rapid 3d protein structure database searching using information retrieval techniques. Bioinformatics 20: 1045–1052
Article Google Scholar
Bentley JL (1975). Multidimensional binary search trees used for associate searching. Comm ACM 18(9): 509–517
Article MATH MathSciNet Google Scholar
Bhattacharya A, Can T, Kahveci T, Singh A, Wang Y (2004) ProGreSS: simultaneous searching of protein databases by sequence and structure. In: Pacific symposium on biocomputing, vol. 9. World Scientific Press, pp 264–275
Brenner SE, Koehl P and Levitt M (2000). The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res 28: 254–256
Article Google Scholar
Çamoğlu O, Kahveci T, Singh A (2003) Towards index-based similarity search for protein structure databases. In: Proceedings of 2nd IEEE Computer Society Bioinformatics Conference (CSB). IEEE, pp 148–158
Coatney M and Parthasarathy S (2005). Motifminer: efficient discovery of common substructures in biochemical molecules. Knowl Inf Sys (KAIS) 7(2): 202–223
Article Google Scholar
Gao F and Zaki M (2005). PSIST: indexing protein structures using suffix trees. In: (eds) In: Proceedings of IEEE computational systems bioinformatics conference (CSB), pp 212–222. IEEE, Palo Alto
Google Scholar
Han S, Lee B-C, Yu ST, Jeong C-S, Lee S and Kim D (2005). Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics 21(11): 2667–2673
Article Google Scholar
Henikoff S and Henikoff J.G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 89: 10915–10919
Article Google Scholar
Ie E, Weston J, Noble WS, Leslie C (2005) Multi-class protein fold recognition using adaptive codes. In: Proceedings of the 22nd International Conferences on machine learning. ACM, Bonn, Germany, pp 329–336
Karplus K, Barrett C and Hughley R (1998). Hidden markov models for detecting remote protein homologies. Bioinformatics 14: 846–856
Article Google Scholar
Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie CS (2004) Profile-based string kernels for remote homology detection and motif extraction. In: Proceedings of CSB 2004’, IEEE, pp 152–160
Larson SM, Snow CD, Shirts M, Pande VS (2002) Folding@home and genome@home: using distributed computing to tackle previously intractable problems in computational biology. In: Grant, R. (ed.) Computational genomics. Horizon Press, Norwich, UK
Mallat S (1999). A wavelet tour of signal processing, 2nd edn. Academic, New York
Google Scholar
Marsolo K, Parthasarathy S, Ramamohanarao K (2006) Structure-based querying of proteins using wavelets. In: Proceedings of CIKM’06. IEEE, pp 24–33
Mehta S, Barr S, Choy A, Yang H, Parthasarathy S, Machiraju R, Wilkins J (2005) Dynamic classification of anomalous structures in molecular dynamics simulation data. In: Proceedings of the SIAM conference on data mining. SIAM
Murzin AG, Brenner SE, Hubbard T and Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540
Article Google Scholar
Parthasarathy S and Aggarwal CC (2003). On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Trans Knowl Data Eng 15(6): 1512–1521
Article Google Scholar
Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods - support vector learning. MIT Press, Cambridge, MA, pp 185–208
Rangwala H and Karypis G (2005). Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 21(23): 4239–4247
Article Google Scholar
Tan Z, Tung AKH (2004) Substructure clustering on sequential 3D object datasets. In: International conference on data engineering (ICDE). IEEE, Boston, pp 634–645
Weston J, Leslie C, Zhou D, Noble WS (2004) Semi-supervised protein classification using cluster kernels. In: Advances in neural information processing systems (NIPS) 16, NIPS, pp 595–602
Witten IH and Frank E (2005). Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
MATH Google Scholar
Yang H, Parthasarathy S, Ucar, D (2007) A spatio-temporal mining approach towards summarizing and analyzing protein folding trajectories. Algorithms for Molecular Biology 2(3)

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
Keith Marsolo & Srinivasan Parthasarathy

Authors

Keith Marsolo
View author publications
You can also search for this author in PubMed Google Scholar
Srinivasan Parthasarathy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Srinivasan Parthasarathy.

Additional information

This work is supported in part by the following research grants: DOE Award No. DE-FG02-04ER25611; NSF CAREER Grant IIS-0347662.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Marsolo, K., Parthasarathy, S. On the use of structure and sequence-based features for protein classification and retrieval. Knowl Inf Syst 14, 59–80 (2008). https://doi.org/10.1007/s10115-007-0088-0

Download citation

Received: 18 December 2006
Revised: 04 March 2007
Accepted: 28 April 2007
Published: 18 August 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s10115-007-0088-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

On the use of structure and sequence-based features for protein classification and retrieval

Abstract

Access this article

Similar content being viewed by others

ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins

FEPS: A Tool for Feature Extraction from Protein Sequence

From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

On the use of structure and sequence-based features for protein classification and retrieval

Abstract

Access this article

Similar content being viewed by others

ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins

FEPS: A Tool for Feature Extraction from Protein Sequence

From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation