Skip to main content
Log in

String kernels construction and fusion: a survey with bioinformatics application

  • Review Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

The kernel method, especially the kernel-fusion method, is widely used in social networks, computer vision, bioinformatics, and other applications. It deals effectively with nonlinear classification problems, which can map linearly inseparable biological sequence data from low to high-dimensional space for more accurate differentiation, enabling the use of kernel methods to predict the structure and function of sequences. Therefore, the kernel method is significant in the solution of bioinformatics problems. Various kernels applied in bioinformatics are explained clearly, which can help readers to select proper kernels to distinguish tasks. Mass biological sequence data occur in practical applications. Research of the use of machine learning methods to obtain knowledge, and how to explore the structure and function of biological methods for theoretical prediction, have always been emphasized in bioinformatics. The kernel method has gradually become an important learning algorithm that is widely used in gene expression and biological sequence prediction. This review focuses on the requirements of classification tasks of biological sequence data. It studies kernel methods and optimization algorithms, including methods of constructing kernel matrices based on the characteristics of biological sequences and kernel fusion methods existing in a multiple kernel learning framework.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Vapnik V N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 1999, 10(5): 988–999

    Article  Google Scholar 

  2. Schölkopf B, Smola A, Müller K R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998, 10(5): 1299–1319

    Article  Google Scholar 

  3. Nello C, John S T. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000

  4. Mercer J. XVI. Functions of positive and negative type, and their connection the theory of integral equations. Philosophical Transactions of the Royal Society of London, 1909, 209(441–458): 415–446

    Google Scholar 

  5. Vapnik V. Statistical learning theory. New York: Wiley, 1998

    Google Scholar 

  6. Mika S, Ratsch G, Weston J, Scholkopf B, Mullers K R. Fisher discriminant analysis with kernels. In: Proceedings of the 1999 IEEE Signal Processing Society Workshop. 1999, 41–48

  7. Cristianini N, John S T, Elisseeff A, Kandola J S. On kernel-target alignment. Advances in Neural Information Processing Systems, 2002, 367–373

  8. Song L, Kolar M, Xing E P. KELLER: estimating time-varying interactions between genes. Bioinformatics, 2009, 25(12): i128–i136

    Article  Google Scholar 

  9. Song L, Bedo J, Borgwardt K M, Gretton A, Smola A. Gene selection via the BAHSIC family of algorithms. Bioinformatics, 2007, 23(13): i490–i498

    Article  Google Scholar 

  10. Kato T, Tsuda K, Asai K. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 2005, 21(10): 2488–2495

    Article  Google Scholar 

  11. Donini M, Monteiro J M, Pontil M, Shawe-Taylor J, Mourao-Miranda J. A multimodal multiple kernel learning approach to Alzheimer’s disease detection. In: Proceedings of the 26th IEEE International workshop on Machine Learning for Signal Processing. 2016, 1–6

  12. Gu Y, Liu T, Jia X, Benediktsson J A, Chanussot J. Nonlinear multiple kernel learning with multiple-structure-element extended morphological profiles for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2016, 54(6): 3235–3247

    Article  Google Scholar 

  13. Han L, Yue Z, Guo X. Image segmentation implementation based on FPGA and SVM. In: Proceedings of the 3rd International Conference on Control, Automation and Robotics. 2017, 405–409

  14. Leslie C S, Eskin E, Cohen A, Weston J, Noble W S. Mismatch string kernels for discriminative protein classification. Bioinformatics, 2004, 20(4): 467–476

    Article  Google Scholar 

  15. Tsuda K, Noble W S. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 2004, 20(suppl_1): i326–i333

    Article  Google Scholar 

  16. Chou K C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 2009, 6(4): 262–274

    Article  Google Scholar 

  17. Swamidass S J, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 2005, 21(suppl_1): i359–i368

    Article  Google Scholar 

  18. Asa B H, Noble W S. Kernel methods for predicting protein-protein interactions. Bioinformatics, 2005, 21(suppl_1): i38–i46

    Google Scholar 

  19. Lanckriet G R, Cristianini N, Bartlett P, Ghaoui L E, Jordan M I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004, 5(Jan): 27–72

    MathSciNet  Google Scholar 

  20. Lanckriet G R, Tijl D B, Cristianini N, Jordan M I, Noble W S. A statistical framework for genomic data fusion. Bioinformatics, 2004, 20(16): 2626–2635

    Article  Google Scholar 

  21. Lanckriet G R, Deng M, Cristianini N, Jordan M I, Noble W S. Kernelbased data fusion and its application to protein function prediction in yeast. Biocomputing: World Scientific, 2003

  22. Bach F R, Thibaux R, Jordan M I. Computing regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems, 2005, 73–80

  23. Sonnenburg S, Rätsch G, Schäfer C. A general and efficient multiple kernel learning algorithm. Advances in Neural Information Processing Systems, 2006, 1273–1280

  24. Jebara T. Multi-task feature and kernel selection for SVMs. In: Proceedings of the 21th International Conference on Machine Learning. 2004, 55

  25. Lewis D P, Jebara T, Noble W S. Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics, 2006, 22(22): 2753–2760

    Article  Google Scholar 

  26. Rätsch G, Sonnenburg S, Schäfer C. Learning interpretable SVMs for biological sequence classification. BMC Bioinformatics, 2006, 7(1): S9

    Article  Google Scholar 

  27. Varma M, Babu B R. More generality in efficient multiple kernel learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1065–1072

  28. Jain A, Vishwanathan S V, Varma M. SPF-GMKL: generalized multiple kernel learning with a million kernels. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012, 750–758

  29. Wu P, Hoi S C, Zhao P, Miao C, Liu Z-Y. Online multi-modal distance metric learning with application to image retrieval. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2): 454–467

    Article  Google Scholar 

  30. Borgwardt K M, Ong C S, Schönauer S, Vishwanathan S, Smola A J, Kriegel H-P. Protein function prediction via graph kernels. Bioinformatics, 2005, 21(suppl_1), i47–i56

    Article  Google Scholar 

  31. Zien A, Ong C S. An automated combination of sequence motif kernels for predicting protein subcellular localization, 2006

  32. Damoulas T, Girolami M A. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 2008, 24(10): 1264–1270

    Article  Google Scholar 

  33. Vert J P, Qiu J, Noble W S. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 2007, 8(10): 1–10

    Google Scholar 

  34. Vapnik V. The nature of statistical learning theory. Springer science & business media, 2013

  35. Aronszajn N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950, 68(3): 337–404

    Article  MathSciNet  Google Scholar 

  36. Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual workshop on Computational learning theory. 1992, 144–152

  37. Boyd S, Vandenberghe L. Convex optimization. Cambridge university press, 2004

  38. Leslie C, Eskin E, Noble W S. The spectrum kernel: A string kernel for SVM protein classification. Biocomputing 2002: World Scientific, 2001

  39. Saigo H, Vert J P, Ueda N, Akutsu T. Protein homology detection using string alignment kernels. Bioinformatics, 2004, 20(11): 1682–1689

    Article  Google Scholar 

  40. Rätsch G, Sonnenburg S. Accurate splice site prediction for caenorhabditis elegans. Computational Molecular Biology, 2004, 277–298

  41. Asa B H, Brutlag D. Remote homology detection: a motif based approach. Bioinformatics, 2003, 19(suppl_1): i26–i33

    Article  Google Scholar 

  42. Nevill M, Craig G, Wu T D, Brutlag D L. Highly specific protein sequence motifs for genome analysis. In: Proceedings of the National Academy of Sciences. 1998, 95(11): 5865–5871

  43. Huang J Y, Brutlag D L. The EMOTIF database. Nucleic Acids Research, 2001, 29(1): 202–204

    Article  Google Scholar 

  44. Kuang R, WANG K, Wang K, Siddiqi M, Freund Y, Leslie C. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology, 2005, 3(3): 527–550

    Article  Google Scholar 

  45. Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller K R, Sommer R J, Schölkopf B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Computational Biology, 2007, 3(2): e20

    Article  MathSciNet  Google Scholar 

  46. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong C S, Philips P, De Bona F, Hartmann L, Bohlen A. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009, 19(11): 2133–2143

    Article  Google Scholar 

  47. Jacob L, Vert J P. Efficient peptide—MHC-I binding prediction for alleles with few known binders. Bioinformatics, 2007, 24(3): 358–366

    Article  Google Scholar 

  48. Röttig M, Rausch C, Kohlbacher O. Combining structure and sequence information allows automated prediction of substrate specificities within enzyme families. PLoS Computational Biology, 2010, 6(1): e1000636

    Article  Google Scholar 

  49. Teramoto R, Aoki M, Kimura T, Kanaoka M. Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS letters, 2005, 579(13): 2878–2882

    Article  Google Scholar 

  50. Kuksa P, Qi Y, Bai B, Collobert R, Weston J, Pavlovic V, Ning X. Semi-supervised abstraction-augmented string kernel for multi-level bio-relation extraction. Joint European Conference on Machine Learning and Knowledge Discovery in Databases: Springer, 2010, 128–144

  51. Leslie C, Eskin E, Weston J, Noble W S. Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems, 2003, 1441–1448

  52. Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble W S. Semi-supervised protein classification using cluster kernels. Bioinformatics, 2005, 21(15): 3241–3247

    Article  Google Scholar 

  53. Kuksa P, Huang P H, Pavlovic V. Scalable algorithms for string kernels with inexact matching. Advances in Neural Information Processing Systems, 2008, 21, 881–888

  54. Leslie C, Kuang R, Bennett K. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 2004, 5(9)

  55. Cichonska A, Pahikkala T, Szedmak S, Julkunen H, Airola A, Heinonen M, Aittokallio T, Rousu J. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics, 2018, 34(13): i509–i518

    Article  Google Scholar 

  56. Liao L, Noble W S. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In: Proceedings of the Sixth Annual International Conference on Computational Biology. 2002, 225–232

  57. Filatov G, Bauwens B, Attila K F. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification. Bioinformatics, 2018, 34(19): 3281–3288

    Article  Google Scholar 

  58. Smith T F, Waterman M S. Identification of common molecular subsequences. Journal of Molecular Biology, 1981, 147(1): 195–197

    Article  Google Scholar 

  59. Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403–410

    Article  Google Scholar 

  60. Vert J P, Saigo H, Akutsu T. Local alignment kernels for biological sequences. Kernel Methods in Computational Biology, 2004, 131–154

  61. Jaakkola T, Diekhans M, Haussler D. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000, 7(1–2), 95–114

    Article  Google Scholar 

  62. Gönen M, Alpaydin E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 2011, 12: 2211–2268

    MathSciNet  Google Scholar 

  63. Bucak S S, Jin R, Jain A K. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1354–1369

    Article  Google Scholar 

  64. Bach F R, Lanckriet G R, Jordan M I. Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21th International Conference on Machine Learning. 2004, 6

  65. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. Journal of Machine Learning Research, 2006, 7: 1531–1565

    MathSciNet  Google Scholar 

  66. Papadopoulos A. Metric spaces, convexity and nonpositive curvature. European Mathematical Society, 2005

  67. Rapcsak T. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 1991, 69(1): 169–183

    Article  MathSciNet  Google Scholar 

  68. Zakeri P, Jeuris B, Vandebril R, Moreau Y. Protein fold recognition using geometric kernel data fusion. Bioinformatics, 2014, 30(13): 1850–1857

    Article  Google Scholar 

  69. Jeuris B, Vandebril R, Vandereycken B. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 2012, 39(ARTICLE): 379–402

    MathSciNet  Google Scholar 

  70. Wang Y C, Zhang C H, Deng N Y, Wang Y. Kernel-based data fusion improves the drug-protein interaction prediction. Computational Biology and Chemistry, 2011, 35(6): 353–362

    Article  MathSciNet  Google Scholar 

  71. Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Protein Function Prediction by Integrating Multiple Kernels. In: Proceedings of Twenty-Third International Joint Conference on Artificial Intelligence. 2013

  72. Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Predicting Protein Function using Multiple Kernels. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015, 12(1): 219–233

    Article  Google Scholar 

  73. Yu G, Fu G, Wang J, Zhu H. Predicting Protein Function via Semantic Integration of Multiple Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13(2): 220–232

    Article  Google Scholar 

  74. Cortes Corinna M M, Afshin Rostamizadeh. Learning non-linear combinations of kernels. Advances in neural information processing systems, 2009

  75. Kloft M, Brefeld U, Sonnenburg S, Zien A. Non-sparse regularization and efficient training with multiple kernels. 2010, arXiv preprint arXiv: 1003.0079 2010, 186: 189–190

    Google Scholar 

  76. Eli M, Kisilev P. Nuc-mkl: A convex approach to non linear multiple kernel learning. Artificial Intelligence and Statistics, 2016, 610–619

  77. Wilson C M, Li K, Yu X, Kuan P F, Wang X. Multiple-kernel learning for genomic data mining and prediction. BMC Bioinformatics, 2019, 20(1): 1–7

    Article  Google Scholar 

  78. Sakakibara Y, Popendorf K, Ogawa N, Asai K, Sato K. Stem kernels for RNA sequence analyses. Journal of Bioinformatics and Computational Biology, 2007, 5(05): 1103–1122

    Article  Google Scholar 

  79. Navarin N, Costa F. An efficient graph kernel method for non-coding RNA functional prediction. Bioinformatics, 2017, 33(17): 2642–2650

    Article  Google Scholar 

  80. Brayet J, Zehraoui F, Laurence J L, Israeli D, Tahi F. Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics, 2014, 30(17): i364–i370

    Article  Google Scholar 

  81. Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 27

    Article  Google Scholar 

  82. Costello J C, Heiser L M, Georgii E, Gönen M, Menden M P, Wang N J, Bansal M, Hintsanen P, Khan S A, Mpindi J P. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology, 2014, 32(12): 1202

    Article  Google Scholar 

  83. Cichonska A, Ravikumar B, Parri E, Timonen S, Pahikkala T, Airola A, Wennerberg K, Rousu J, Aittokallio T. Computational-experimental approach to drug-target interaction mapping: A case study on kinase inhibitors. PLoS Computational Biology, 2017, 13(8): e1005678

    Article  Google Scholar 

  84. Gönen M. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 2012, 28(18): 2304–2310

    Article  Google Scholar 

  85. Gönen M, Khan S, Kaski S. Kernelized Bayesian matrix factorization. International Conference on Machine Learning, 2013, 864–872

  86. Nascimento A C, Prudêncio R B, Costa I G. A multiple kernel learning algorithm for drug-target interaction prediction. BMC Bioinformatics, 2016, 17(1): 46

    Article  Google Scholar 

  87. Kloft M, Brefeld U, Laskov P, Müller K-R, Zien A, Sonnenburg S. Efficient and accurate lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, 2009, 997–1005

  88. Sun Z, Ampornpunt N, Varma M, Vishwanathan S. Multiple kernel learning and the SMO algorithm. Advances in Neural Information Processing Systems, 2010, 2361–2369

  89. Rakotomamonjy A, Bach F R, Canu S, Grandvalet Y. SimpleMKL. Journal of Machine Learning Research, 2008, 9: 2491–2521

    MathSciNet  Google Scholar 

  90. Bucak S, Jin R, Jain A. Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition. Advances in Neural Information Processing Systems, 2010, 325–333

  91. Gönen M, Alpaydin E. Localized multiple kernel learning. In: Proceedings of the 25th International Conference on Machine Learning. 2008, 352–359

  92. Mostafavi S, Morris Q. Fast integration of heterogeneous data sources for predict-ing gene function with limited annotation. Bioinformatics, 2010, 26(14): 1759–1765

    Article  Google Scholar 

  93. Kong X, Ng M K, Zhou Z-H. Trans-ductive multi-label learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(3): 704–719

    Article  Google Scholar 

  94. Tang L, Chen J, Ye J. On multiplekernel learning with multiple labels. In: Proceedings of Twenty-First International Joint Conference on Artificial Intelligence. 2009, 1255–1260

  95. Bucak S, Jin R, Jain A. Multi-label multiple kernel learning by stochastic approximation:Application to visual object recognition. Advances in Neural Information Processing Systems, 2010, 24: 325–333

    Google Scholar 

Download references

Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant Nos. 61922020, 61771331, 61902259).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Quan Zou.

Additional information

Ren Qi received the MS degrees in computer technology from the College of Intelligence and Computing, Tianjin University, China in 2018 and she is a doctoral student in College of Intelligence and Computing from Tianjin University, China in 2018.

Her research interests include machine learning, metric learning, and bioinformatics. Several related works have been published by IJCAI, Journal of software.

Fei Guo received the BSc and the PhD degrees in School of Computer Science and Technology, Shandong University, China in 2007 and 2012, respectively. She is currently an assistant professor in the College of Intelligence and Computing, Tianjin University, China.

Her research interests include bioinformatics, algorithms, machine learning. Several related works have been published by Briefings in Bioinformatics, Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, etc. She is the PC member of APBC-2016, APBC-2017, and reviewer of BMC Bioinformatics, BMC System Biology, IEEE TCBB, PLoS ONE, etc.

Quan Zou (M’13-SM’17) received the BSc, MSc and the PhD degrees in computer science from Harbin Institute of Technology, China in 2004, 2007 and 2009, respectively. He worked in Xiamen University and Tianjin University, China, from 2009 to 2018 as an assistant professor, associate professor and professor. He is currently a professor in the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China.

His research is in the areas of bioinformatics, machine learning and parallel computing. Several related works have been published by Science, Briefings in Bioinformatics, Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, etc. Google scholar showed that his more than 100 papers have been cited more than 5000 times. He is the editor-in-chief of Current Bioinformatics, associate editor of IEEE Access, and the editor board member of Computers in Biology and Medicine, Genes, Scientific Reports, etc. He was selected as one of the Clarivate Analytics Highly Cited Researchers in 2018 and 2021.

Electronic Supplementary Material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Qi, R., Guo, F. & Zou, Q. String kernels construction and fusion: a survey with bioinformatics application. Front. Comput. Sci. 16, 166904 (2022). https://doi.org/10.1007/s11704-021-1118-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11704-021-1118-x

Keywords

Navigation