A Discriminative Method for Protein Remote Homology Detection Based on N-nary Profiles

  • Bin Liu
  • Lei Lin
  • Xiaolong Wang
  • Qiwen Dong
  • Xuan Wang
Part of the Communications in Computer and Information Science book series (CCIS, volume 13)


Protein homology detection is a key problem in computational biology. This paper presents a novel building block for proteins, the N-nary profile, which captures the evolutionary information contained in protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments produced by PSI-BLAST and then converted into N-nary profiles. The N-nary profiles are filtered by a feature selection method, the chi-square algorithm. Each protein sequence is transformed into a fixed-dimension feature vector whose entries are the numbers of occurrences of each N-nary profile, and these vectors are used as input to a support vector machine (SVM). Latent semantic analysis (LSA), an efficient feature extraction technique, is adopted to further improve the performance of the method. When tested on the SCOP 1.53 data set, the N-nary profile method outperforms all of the compared methods for protein remote homology detection. Its ROC50 score is 0.736, nearly 4 percent higher than that of the best previously reported method.
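The feature-construction and chi-square filtering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each protein has already been converted into a list of N-nary profile tokens (the tokens `p1` to `p4`, the toy labels, and the helper names `count_vector`, `chi_square`, and `select_top` are all hypothetical), and it uses the standard 2x2 contingency form of the chi-square statistic on feature presence versus a binary class label. The PSI-BLAST, LSA, and SVM stages are omitted.

```python
from collections import Counter

def count_vector(tokens, vocab):
    """Fixed-length vector: number of occurrences of each vocabulary token."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def chi_square(X, y):
    """Chi-square score of each feature (presence/absence) against labels in {0, 1}."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    scores = []
    for j in range(len(X[0])):
        a = sum(1 for row, lab in zip(X, y) if row[j] > 0 and lab == 1)  # present, positive
        b = sum(1 for row, lab in zip(X, y) if row[j] > 0 and lab == 0)  # present, negative
        c = n_pos - a                                                    # absent, positive
        d = n_neg - b                                                    # absent, negative
        den = (a + c) * (b + d) * (a + b) * (c + d)
        scores.append(n * (a * d - b * c) ** 2 / den if den else 0.0)
    return scores

def select_top(scores, k):
    """Indices of the k highest-scoring features, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(order[:k])

# Toy data (assumed): four proteins as token lists, two positive and two negative.
vocab = ["p1", "p2", "p3", "p4"]
proteins = [["p1", "p2", "p1"], ["p1", "p3"], ["p4", "p2"], ["p4", "p4", "p3"]]
labels = [1, 1, 0, 0]

X = [count_vector(p, vocab) for p in proteins]
scores = chi_square(X, labels)
keep = select_top(scores, 2)
X_selected = [[row[j] for j in keep] for row in X]
```

In this toy example `p1` occurs only in positive proteins and `p4` only in negative ones, so both receive the highest score and are kept, while `p2` and `p3` are split evenly across the classes and score zero. The reduced vectors in `X_selected` are what would be passed on to a classifier.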


Keywords: remote homology, N-nary profiles, chi-square algorithm, latent semantic analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Bin Liu (1)
  • Lei Lin (2)
  • Xiaolong Wang (1, 2)
  • Qiwen Dong (2)
  • Xuan Wang (1)
  1. Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
  2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
