Knowledge and Information Systems, Volume 14, Issue 1, pp 59–80

On the use of structure and sequence-based features for protein classification and retrieval

Regular Paper


The need to retrieve or classify proteins using structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatments. In folding simulations, similar intermediate structures may indicate a common folding pathway. Here we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the profiles returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. For our structural representation, we transform each 3D structure into a normalized 2D distance matrix and apply a 2D wavelet decomposition to generate our descriptor. We also create a hybrid representation by concatenating the two descriptors. We evaluate the generality of our models by using them as indices for database retrieval experiments as well as feature vectors for classification. We find that our methods provide excellent performance when compared with the state of the art for each task. Our results show that the sequence-based representation is generally superior to the structure-based representation and that, in the classification context, the hybrid strategy affords a significant improvement over sequence or structure alone.
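The three representations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-neighbour resampling used as the size normalization, the target size, the number of decomposition levels, and the use of pairwise averaging (an unnormalized Haar approximation) are all assumptions made for illustration, since the abstract does not specify the wavelet or the normalization. The profile is assumed to be a list of per-residue score rows, such as those derived from a PSI-BLAST profile, and the structure a list of 3D coordinates, one per residue.

```python
import math

def haar_step(values):
    """One averaging step: halve the length by averaging adjacent pairs
    (the unnormalized Haar approximation coefficients)."""
    return [(values[i] + values[i + 1]) / 2.0 for i in range(0, len(values), 2)]

def resample(values, size):
    """Nearest-neighbour resampling to a fixed length.
    Assumed stand-in for the paper's normalization step."""
    n = len(values)
    return [values[(i * n) // size] for i in range(size)]

def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (one point per residue)."""
    return [[math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))) for b in coords]
            for a in coords]

def structure_descriptor(coords, size=8, levels=2):
    """Fixed-size distance matrix followed by a 2D averaging decomposition.
    size must be divisible by 2 ** levels."""
    rows = resample(distance_matrix(coords), size)   # normalize row count
    grid = [resample(row, size) for row in rows]     # normalize column count
    for _ in range(levels):
        grid = [haar_step(row) for row in grid]               # along rows
        cols = [haar_step(list(col)) for col in zip(*grid)]   # along columns
        grid = [list(row) for row in zip(*cols)]              # transpose back
    return [v for row in grid for v in row]

def sequence_descriptor(profile, size=8, levels=2):
    """1D averaging decomposition applied to each profile column separately."""
    out = []
    for column in zip(*profile):
        signal = resample(list(column), size)
        for _ in range(levels):
            signal = haar_step(signal)
        out.extend(signal)
    return out

def hybrid_descriptor(profile, coords, size=8, levels=2):
    """Hybrid representation: sequence descriptor followed by structure descriptor."""
    return (sequence_descriptor(profile, size, levels)
            + structure_descriptor(coords, size, levels))
```

Because every descriptor has a fixed length regardless of protein size, the resulting vectors can be placed directly in a multidimensional index or passed to a classifier. A production version would more likely use a wavelet library and interpolation rather than nearest-neighbour resampling.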


Keywords: Bioinformatics; Protein indexing; Protein retrieval; Sequence and structure-based protein representations





Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
