The kernelHMM: Learning Kernel Combinations in Structured Output Domains

  • Volker Roth
  • Bernd Fischer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4713)


We present a model for learning convex kernel combinations in classification problems with structured output domains. The main ingredient is a hidden Markov model which forms a layered directed graph. Each individual layer represents a multilabel version of nonlinear kernel discriminant analysis for estimating the emission probabilities. These kernel learning machines are equipped with a mechanism for finding convex combinations of kernel matrices. The resulting kernelHMM can handle multiple partial paths through the label hierarchy in a consistent way. Efficient approximation algorithms allow us to train the model on large-scale learning problems. Applied to the problem of document categorization, the method exhibits excellent predictive performance.
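
To make the notion of a convex combination of kernel matrices concrete, the following is a minimal sketch and not the authors' learning procedure. The weights are fixed by hand here, whereas the paper learns them. The helper names (`rbf_kernel`, `linear_kernel`, `combine_kernels`), the toy data, and the chosen weights are all assumptions made for illustration.

```python
import numpy as np

def linear_kernel(X):
    # Gram matrix of the plain inner product.
    return X @ X.T

def rbf_kernel(X, gamma):
    # Gaussian (RBF) Gram matrix: exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine_kernels(kernels, weights):
    # Convex combination: weights are non-negative and sum to one,
    # so the result is again a symmetric positive semidefinite matrix.
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("weights must be non-negative")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must sum to one")
    return sum(wi * K for wi, K in zip(w, kernels))

# Toy data (hypothetical): 20 samples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

kernels = [linear_kernel(X), rbf_kernel(X, 0.1), rbf_kernel(X, 1.0)]
weights = [0.2, 0.5, 0.3]  # hand-picked, not learned

K = combine_kernels(kernels, weights)
min_eig = np.linalg.eigvalsh(K).min()  # non-negative up to rounding error
```

Constraining the weights to the simplex keeps the combined matrix a valid kernel, which is why any kernel method, including the kernel discriminant analysis used in each layer, can consume `K` directly.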


Hidden Markov Model, Linear Discriminant Analysis, Multiple Kernel, Kernel Matrix, Quadratically Constrained Quadratic Program
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Volker Roth (1)
  • Bernd Fischer (1)

  1. ETH Zurich, Institute of Computational Science, Universität-Str. 6, CH-8092 Zurich
