On the use of structure and sequence-based features for protein classification and retrieval

  • Regular Paper
Knowledge and Information Systems

Abstract

The need to retrieve or classify proteins by structure- or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as potential sources of new treatments. In folding simulations, similar intermediate structures may indicate a common folding pathway. Here we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the profiles returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. For our structural representation, we transform each 3D structure into a normalized 2D distance matrix and apply a 2D wavelet decomposition to generate our descriptor. We also create a hybrid representation by concatenating these two descriptors. We evaluate the generality of our models by using them as indices for database retrieval experiments and as feature vectors for classification. We find that our methods provide excellent performance when compared with the state of the art for each task. Our results show that the sequence-based representation is generally superior to the structure-based representation and that, in the classification context, the hybrid strategy affords a significant improvement over either sequence or structure alone.
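The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Haar wavelet, the nearest-neighbour resampling used to normalize sizes, the default `size` and `levels` values, and all function names are assumptions chosen for illustration. The profile is assumed to be a list of per-residue score rows, such as a PSI-BLAST profile would provide.

```python
import math

def haar_step(values):
    """One level of a 1D Haar transform: pairwise averages and differences."""
    half = len(values) // 2
    avg = [(values[2 * i] + values[2 * i + 1]) / 2.0 for i in range(half)]
    diff = [(values[2 * i] - values[2 * i + 1]) / 2.0 for i in range(half)]
    return avg, diff

def haar_approx_1d(values, levels):
    """Keep only the approximation coefficients after `levels` steps."""
    for _level in range(levels):
        values = haar_step(values)[0]
    return values

def haar_approx_2d(matrix, levels):
    """2D approximation: apply the 1D step along rows, then along columns."""
    for _level in range(levels):
        rows = [haar_step(row)[0] for row in matrix]
        cols = [haar_step(list(col))[0] for col in zip(*rows)]
        matrix = [list(row) for row in zip(*cols)]
    return matrix

def resample(values, size):
    """Nearest-neighbour resampling to a fixed length (assumed normalization)."""
    n = len(values)
    return [values[min(n - 1, i * n // size)] for i in range(size)]

def structure_descriptor(coords, size=8, levels=2):
    """3D coordinates -> fixed-size distance matrix -> 2D wavelet summary."""
    dm = [[math.dist(a, b) for b in coords] for a in coords]
    idx = resample(list(range(len(coords))), size)
    fixed = [[dm[i][j] for j in idx] for i in idx]
    approx = haar_approx_2d(fixed, levels)
    return [v for row in approx for v in row]

def sequence_descriptor(profile, size=8, levels=2):
    """Profile (one score row per residue) -> per-column 1D wavelet summary."""
    desc = []
    for column in zip(*profile):
        desc.extend(haar_approx_1d(resample(list(column), size), levels))
    return desc

def hybrid_descriptor(profile, coords, size=8, levels=2):
    """Concatenate the sequence and structure descriptors."""
    return (sequence_descriptor(profile, size, levels)
            + structure_descriptor(coords, size, levels))

# Toy input: 5 residues on a line, and a profile with 3 score columns
# (a real profile would have one column per amino acid type).
coords = [(float(i), 0.0, 0.0) for i in range(5)]
profile = [[1.0, 0.0, 2.0] for _ in range(5)]
print(len(hybrid_descriptor(profile, coords)))
```

Because every protein is resampled to the same `size` before the transform, all descriptors have the same length regardless of chain length, which is what allows them to serve as index keys or as feature vectors for a classifier.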

Author information

Corresponding author

Correspondence to Srinivasan Parthasarathy.

Additional information

This work is supported in part by the following research grants: DOE Award No. DE-FG02-04ER25611; NSF CAREER Grant IIS-0347662.

About this article

Cite this article

Marsolo, K., Parthasarathy, S. On the use of structure and sequence-based features for protein classification and retrieval. Knowl Inf Syst 14, 59–80 (2008). https://doi.org/10.1007/s10115-007-0088-0

  • DOI: https://doi.org/10.1007/s10115-007-0088-0
