Machine Learning, 77:249

Structured prediction by joint kernel support estimation

  • Christoph H. Lampert
  • Matthew B. Blaschko
Open Access


Abstract

Discriminative techniques, such as conditional random fields (CRFs) and structure-aware maximum-margin methods (maximum-margin Markov networks (M3N) and structured output support vector machines (S-SVM)), are the state of the art for predicting structured data. However, to achieve good results, these techniques require complete and reliable ground truth, which is not always available in realistic problems. Furthermore, training either CRFs or margin-based techniques is computationally costly, because the runtime of current training methods depends not only on the size of the training set but also on properties of the output space to which the training samples are assigned.
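For context, the computational bottleneck in CRF training is the partition function, which sums over the entire output space; the standard form below (notation ours) makes this dependence explicit:

$$
p(y \mid x; w) = \frac{\exp\bigl(\langle w, \phi(x, y) \rangle\bigr)}{Z(x; w)},
\qquad
Z(x; w) = \sum_{y' \in \mathcal{Y}} \exp\bigl(\langle w, \phi(x, y') \rangle\bigr),
$$

so every gradient evaluation must account, explicitly or through inference, for all of $\mathcal{Y}$.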

We propose an alternative model for structured output prediction, Joint Kernel Support Estimation (JKSE), which is generative in nature: it relies on estimating the joint probability density of samples and labels in the training set. This makes it tolerant of incomplete or incorrect labels and also opens the possibility of learning in situations where more than one output label can be considered correct.
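To make this concrete, the resulting prediction rule can be sketched as follows (the notation is ours, not taken verbatim from the paper): given a joint feature map $\varphi(x, y)$ induced by a joint kernel, and a weight vector $w$ estimated from the training pairs, prediction selects the label whose joint representation lies deepest inside the estimated support:

$$
\hat{y}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \; \langle w, \varphi(x, y) \rangle .
$$

Because $w$ is learned from the observed input-output pairs alone, no comparison against competing labels is required during training; the search over labels appears only at prediction time.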

At the same time, we avoid the typical problems of generative models, as we do not attempt to learn the full joint probability distribution but model only its support in a joint reproducing kernel Hilbert space. As a consequence, JKSE training is possible by an adaptation of the classical one-class SVM procedure. The resulting optimization problem is convex and efficiently solvable even with tens of thousands of training examples. A particular advantage of JKSE is that the training speed depends only on the size of the training set, not on the total size of the label space: no inference step is required during training (as it is for M3N and S-SVM), nor do we have to calculate a partition function (as CRFs do).
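As an illustration of how such training could look in practice, the following is a minimal, self-contained sketch (our own construction, not code from the paper) that fits a one-class SVM on a precomputed joint Gram matrix with scikit-learn and predicts by scoring candidate labels. The product joint kernel and the tiny enumerable label space are toy assumptions standing in for a genuinely structured output kernel:

```python
# JKSE-style sketch (our construction): one-class SVM on a precomputed
# joint kernel. Assumes NumPy and scikit-learn. The toy joint kernel is
#   k((x, y), (x', y')) = k_X(x, x') * [y == y'],
# a stand-in for a real structured joint kernel.
import numpy as np
from sklearn.svm import OneClassSVM

def joint_kernel(X1, Y1, X2, Y2, gamma=0.5):
    """Gram matrix of the toy joint kernel between two sets of (x, y) pairs."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    k_inputs = np.exp(-gamma * sq_dists)           # RBF kernel on inputs
    k_labels = (Y1[:, None] == Y2[None, :]) * 1.0  # delta kernel on labels
    return k_inputs * k_labels

rng = np.random.default_rng(0)
n_labels = 3
X_train = rng.standard_normal((200, 5))
y_train = rng.integers(0, n_labels, size=200)

# Training: support estimation on the joint Gram matrix of the pairs
# (x_i, y_i). Runtime depends on the number of training pairs, not on
# the size of the label space.
K = joint_kernel(X_train, y_train, X_train, y_train)
model = OneClassSVM(kernel="precomputed", nu=0.1).fit(K)

def predict(x):
    """Predict by scoring each candidate label under the learned support."""
    scores = []
    for y in range(n_labels):  # toy label space; real JKSE needs efficient search
        k_row = joint_kernel(x[None, :], np.array([y]), X_train, y_train)
        scores.append(model.decision_function(k_row)[0])
    return int(np.argmax(scores))

print(predict(rng.standard_normal(5)))
```

Note that the fit step never compares a training pair against competing labels; the enumeration over the label space appears only at prediction time, mirroring the training-time advantage described above.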

Experiments on realistic data show that, for suitable kernel functions, our method works efficiently and robustly in situations where discriminative techniques struggle or become computationally infeasible.


Keywords: Structured prediction · Generative model · Support estimation · Reproducing kernel Hilbert space · Joint kernel function · One-class support vector machine



Copyright information

© The Author(s) 2009

Authors and Affiliations

  1. Max Planck Institute for Biological Cybernetics, Tübingen, Germany
  2. Department of Engineering Science, University of Oxford, Oxford, UK
