Clustering Support Vector Machines and Its Application to Local Protein Tertiary Structure Prediction

  • Jieyue He
  • Wei Zhong
  • Robert Harrison
  • Phang C. Tai
  • Yi Pan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3992)


Support Vector Machines (SVMs) are a new generation of machine learning techniques and have shown strong generalization capability in many data mining tasks. SVMs can handle nonlinear classification by implicitly mapping samples from the input feature space into a higher-dimensional feature space with a nonlinear kernel function. However, SVMs are ill-suited to huge datasets with millions of samples. Granular computing decomposes information into aggregates (granules) and solves the targeted problem within each granule. Therefore, we propose a novel computational model called Clustering Support Vector Machines (CSVMs) to deal with complex classification problems on huge datasets. Drawing on both the theory of granular computing and advanced statistical learning methodology, a CSVM is built specifically for each information granule produced by the clustering algorithm. This makes the learning task for each CSVM more specific and simpler. Moreover, because each CSVM is built for its own granule, training can be easily parallelized, so CSVMs can handle huge datasets efficiently. The CSVM model is applied to predicting local protein tertiary structure. Compared with the conventional clustering method, prediction accuracy for local protein tertiary structure improves noticeably when the new CSVM model is used. The encouraging experimental results indicate that our new computational model opens a new way to solve complex classification problems on huge datasets.
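The idea in the abstract, cluster first and then train one classifier per granule, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for "the clustering algorithm", a linear SVM trained by batch subgradient descent on the hinge loss stands in for the kernel SVMs, and every name (`kmeans`, `train_linear_svm`, `fit_csvm`, `predict_csvm`) and the toy data are hypothetical. The toy data are XOR-like, so no single linear boundary separates them, yet each cluster is linearly separable on its own.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], x))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Assumed stand-in for a kernel SVM: batch subgradient descent on
    lam/2 * |w|^2 + mean hinge loss, with labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        sw, sb = [0.0] * d, 0.0  # sums of y_i * x_i and y_i over margin violators
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:
                sw = [s + yi * v for s, v in zip(sw, xi)]
                sb += yi
        w = [wj - lr * (lam * wj - s / n) for wj, s in zip(w, sw)]
        b += lr * sb / n
    return w, b

def fit_csvm(X, y, k):
    """Cluster the samples, then train one local model per cluster (granule)."""
    centroids = kmeans(X, k)
    models = []
    for j in range(k):
        members = [i for i, x in enumerate(X) if nearest(centroids, x) == j]
        if members:
            models.append(train_linear_svm([X[i] for i in members],
                                           [y[i] for i in members]))
        else:
            models.append(([0.0] * len(X[0]), 0.0))  # placeholder for an empty cluster
    return centroids, models

def predict_csvm(centroids, models, x):
    """Route x to its nearest cluster and apply that cluster's model."""
    w, b = models[nearest(centroids, x)]
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy data: two well-separated clusters with opposite labelings.
X = [(0, 1), (0.5, 2), (-0.5, 1.5), (0, -1), (0.5, -2), (-0.5, -1.5),
     (10, 1), (10.5, 2), (9.5, 1.5), (10, -1), (10.5, -2), (9.5, -1.5)]
y = [1, 1, 1, -1, -1, -1, -1, -1, -1, 1, 1, 1]

centroids, models = fit_csvm(X, y, k=2)
preds = [predict_csvm(centroids, models, x) for x in X]
```

Because each local model sees only the samples of its own cluster, the loop in `fit_csvm` has no dependencies between iterations and could be distributed across workers, which is the parallelization property the abstract points to.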


Support Vector Machine; Cluster Group; Sequence Segment; Information Granule; Sequential Minimal Optimization



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jieyue He (1)
  • Wei Zhong (2)
  • Robert Harrison (2, 3, 4)
  • Phang C. Tai (3)
  • Yi Pan (2)
  1. Department of Computer Science, Southeast University, Nanjing, China
  2. Department of Computer Science, USA
  3. Department of Biology, Georgia State University, Atlanta, USA
  4. GCC Distinguished Cancer Scholar, USA
