A Class of Evolution-Based Kernels for Protein Homology Analysis: A Generalization of the PAM Model
A pairwise similarity measure between amino acid sequences should possess two properties to perform well in protein homology analysis. First, it should be a kernel function, so that popular and well-performing computational tools designed for linear spaces, such as SVM and k-means, can be applied. Second, it should take into account the common evolutionary descent of homologous proteins. However, none of the existing similarity measures possesses both properties at once. In this paper, we propose a simple probabilistic evolution model of amino acid sequences, built as a straightforward generalization of the PAM evolution model of single amino acids. This model produces a class of kernel functions, each of which is computed as the likelihood of the hypothesis that both sequences result from two independent evolutionary transformations of a hidden common ancestor, under specific assumptions about the evolution mechanism. The proposed class of kernels is rather wide: it contains as particular subclasses not only the family of J.-P. Vert’s local alignment kernels, whose algebraic structure was introduced without any evolutionary motivation, but also other families of local and global kernels. We demonstrate, via k-means clustering of a set of amino acid sequences from the VIDA database, that the global kernel can be useful in bringing together otherwise very different protein families.
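As a rough illustration of the common-ancestor idea at the level of single residues, the sketch below computes K(a, b) as the sum over ancestors c of q(c) P_n(a | c) P_n(b | c). It is a minimal sketch under stated assumptions, not the model of the paper: the three-letter alphabet, the transition matrix `STEP`, the choice of the stationary distribution as the ancestor prior, and the gap-free product over positions in `sequence_kernel` are all hypothetical simplifications, and the helper names are invented for this example.

```python
import numpy as np

# Toy 3-letter alphabet standing in for the 20 amino acids (hypothetical).
ALPHABET = "ACD"

# Hypothetical one-step transition matrix: STEP[i, j] = P(next = j | current = i).
# Each row sums to 1. A real PAM matrix would be estimated from data.
STEP = np.array([
    [0.90, 0.06, 0.04],
    [0.06, 0.90, 0.04],
    [0.05, 0.05, 0.90],
])


def stationary(step):
    """Stationary distribution: left eigenvector of `step` for eigenvalue 1."""
    values, vectors = np.linalg.eig(step.T)
    v = np.real(vectors[:, np.argmin(np.abs(values - 1.0))])
    return v / v.sum()  # normalising also fixes the sign of the eigenvector


def residue_kernel(step, n_steps):
    """K[a, b] = sum_c prior[c] * P_n(a | c) * P_n(b | c).

    Assumption: the hidden ancestor c is drawn from the stationary
    distribution, and both descendants evolve independently for n_steps.
    """
    prior = stationary(step)
    p_n = np.linalg.matrix_power(step, n_steps)  # p_n[c, a] = P_n(a | c)
    return p_n.T @ np.diag(prior) @ p_n


def sequence_kernel(x, y, k):
    """Gap-free kernel for equal-length sequences: product over positions.

    Assumption: no insertions or deletions, so position i of x is compared
    only with position i of y.
    """
    if len(x) != len(y):
        raise ValueError("this simplified kernel needs equal-length sequences")
    pairs = zip(x, y)
    return float(np.prod([k[ALPHABET.index(a), ALPHABET.index(b)] for a, b in pairs]))


K = residue_kernel(STEP, n_steps=10)
same = sequence_kernel("ACD", "ACD", K)
swapped = sequence_kernel("ACD", "DCA", K)
```

Because the matrix has the form P^T diag(q) P with positive weights q, it is symmetric and positive semidefinite by construction, which is the kernel property the abstract asks for. Handling insertions and deletions, as local and global alignment kernels do, is outside the scope of this sketch.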
Keywords: Protein homology analysis; evolution modeling; amino acid sequence alignment; evolutionary kernel function; kernel-derived clusters
- 5. Mirkin, B., Camargo, R., Fenner, T., Loizou, G., Kellam, P.: Aggregating homologous protein families in evolutionary reconstructions of herpesviruses. In: Ashlock, D. (ed.) Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 255–262 (2006)
- 6. Rocha, J., Rossello, F., Segura, J.: The Universal Similarity Metric does not detect domain similarity. Technical Report, Quantitative Methods, Q-bio QM (2006), http://arxiv.org/abs/q-bio/0603007
- 8. Vert, J.-P., Saigo, H., Akutsu, T.: Local alignment kernels for biological sequences. In: Schölkopf, B., Tsuda, K., Vert, J.-P. (eds.) Kernel Methods in Computational Biology. MIT Press, Cambridge (2004)
- 9. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002)
- 14. Haussler, D.: Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz (1999)
- 15. Cuturi, M., Vert, J.-P.: A mutual information kernel for sequences. In: Proc. of IEEE Int. Joint Conference on Neural Networks, vol. 3, pp. 1905–1910 (2004)
- 18. Miklos, I., Novak, A., Satija, R., Lyngso, R., Hein, J.: Stochastic models of sequence evolution including insertion-deletion events. Statistical Methods in Medical Research 29 (2008)
- 20. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(suppl. 3), 345–352 (1978)
- 22. Sulimova, V., Mottl, V., Kulikowski, C., Muchnik, I.: Probabilistic evolutionary model for substitution matrices of PAM and BLOSUM families. DIMACS Technical Report 2008-16, Rutgers University, 17 p. (2008), ftp://dimacs.rutgers.edu/pub/dimacs/TechicalReports/TechReports/2008/2008-16.pdf