Fast Classification of Protein Structures by an Alignment-Free Kernel
Alignment is among the most fundamental and widely used algorithms in bioinformatics, but its computational cost has become prohibitive for many modern problems because of recent explosive data growth. Hence the development of alignment-free algorithms, i.e., alternative algorithms that avoid computationally expensive alignment, has become an active topic in algorithmic bioinformatics.
Analysis of protein structures is an important problem in bioinformatics. We focus on predicting the functions of proteins from their structures: protein functions are key to understanding any organism, and they are believed to be determined by protein structure. However, the previous best-known (i.e., most accurate) method for this problem is an alignment-based kernel method, which suffers from the high computational cost of alignments.
For this problem, we propose a new kernel method that does not employ alignments. Instead, we apply the two-dimensional suffix tree and the contact map graph to reduce the kernel-related computation cost dramatically. Experiments show that, compared to the previous best algorithm, our new method runs about 16 times faster in training and about 37 times faster in prediction while preserving comparatively high accuracy.
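To make the idea of an alignment-free comparison of contact maps concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the feature definition, so the sketch assumes a feature vector that counts every k-by-k submatrix of the contact map, and it uses a plain dictionary where the paper uses a two-dimensional suffix tree. The function names, the 8 Å distance threshold, and the default k = 2 are all illustrative assumptions.

```python
import math
from collections import Counter


def contact_map(coords, threshold=8.0):
    """Build a symmetric 0/1 adjacency matrix from residue coordinates.

    Entry (i, j) is 1 when residues i and j lie within `threshold`
    (an assumed cutoff, in the same units as the coordinates).
    The diagonal is left at 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def submatrix_counts(cmap, k=2):
    """Count occurrences of every k-by-k submatrix of the contact map.

    Hypothetical feature definition for illustration; a dictionary stands
    in for the two-dimensional suffix tree mentioned in the abstract.
    """
    n = len(cmap)
    counts = Counter()
    for i in range(n - k + 1):
        for j in range(n - k + 1):
            block = tuple(tuple(cmap[i + a][j:j + k]) for a in range(k))
            counts[block] += 1
    return counts


def kernel(counts_a, counts_b):
    """Inner product of two sparse count vectors (no alignment involved)."""
    return sum(v * counts_b[key] for key, v in counts_a.items())
```

Calling `kernel(submatrix_counts(contact_map(a)), submatrix_counts(contact_map(b)))` on two coordinate lists `a` and `b` yields a similarity score without aligning the two structures. The resulting values could be passed to any kernel-based classifier such as an SVM.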
Keywords: Feature Vector, Kernel Function, Structural Alignment, Adjacency Matrix, Suffix Array
This work was supported by JSPS KAKENHI Grant Numbers 25280002 and 24106007. The supercomputing resource was provided by the Human Genome Center (the Univ. of Tokyo).