A Discriminative Method for Protein Remote Homology Detection Based on N-nary Profiles
Protein homology detection is a key problem in computational biology. This paper presents a novel building block for proteins, the N-nary profile, which captures the evolutionary information contained in protein sequence frequency profiles. The frequency profiles, calculated from the multiple sequence alignments output by PSI-BLAST, are converted into N-nary profiles, which are then filtered by a chi-square feature selection algorithm. Each protein sequence is transformed into a fixed-dimension feature vector by counting the occurrences of each N-nary profile, and the resulting vectors are used as input to a support vector machine (SVM). Latent semantic analysis (LSA), an efficient feature extraction algorithm, is adopted to further improve performance. When tested on the SCOP 1.53 data set, the N-nary profile method outperforms all compared methods of protein remote homology detection: its ROC50 score is 0.736, nearly 4 percent higher than that of the best existing method.
Keywords: remote homology, N-nary profiles, chi-square algorithm, latent semantic analysis
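The first stages of the pipeline described in the abstract (profile conversion, occurrence counting, chi-square scoring) can be illustrated with a minimal sketch. The abstract does not define how an N-nary profile is built from a frequency profile column, so the rule used here is an assumption made only for illustration: keep up to `n` of the most frequent amino acids whose frequency reaches `threshold`. The function names, the `threshold` parameter, and the toy data are hypothetical. The chi-square score is the standard 2x2 contingency statistic used for feature selection. The SVM and LSA steps are omitted.

```python
from collections import Counter


def to_nnary_tokens(freq_profile, n=2, threshold=0.1):
    """Map each profile column to one token.

    ASSUMPTION (not from the paper): a token is formed from up to `n` of
    the most frequent amino acids whose frequency is at least `threshold`,
    written in alphabetical order. Columns with no such residue are skipped.
    """
    tokens = []
    for column in freq_profile:
        # Rank residues by descending frequency; break ties alphabetically.
        ranked = sorted(column.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = [aa for aa, freq in ranked[:n] if freq >= threshold]
        if kept:
            tokens.append("".join(sorted(kept)))
    return tokens


def count_vector(tokens, vocabulary):
    """Fixed-dimension vector: occurrence count of each vocabulary token."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 token/class contingency table.

    a: positive sequences containing the token
    b: negative sequences containing the token
    c: positive sequences lacking the token
    d: negative sequences lacking the token
    """
    total = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return total * (a * d - c * b) ** 2 / denom


# Toy frequency profile with three columns (hypothetical data).
profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.9, "V": 0.05, "I": 0.05},
    {"A": 0.5, "G": 0.4, "K": 0.1},
]
tokens = to_nnary_tokens(profile)
vector = count_vector(tokens, vocabulary=["AG", "L", "AV"])
score = chi_square(8, 2, 2, 8)
```

In this sketch, tokens with a high chi-square score would be kept as features and the rest discarded before the count vectors are passed to a classifier. The cut-off used for that selection is not stated in the abstract.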