Machine Learning

Volume 77, Issue 1, pp. 27–59

Cutting-plane training of structural SVMs

  • Thorsten Joachims
  • Thomas Finley
  • Chun-Nam John Yu

Abstract

Discriminative training approaches like structural SVMs have shown much promise for building highly complex and accurate models in areas like natural language processing, protein structure prediction, and information retrieval. However, current training algorithms are computationally expensive or intractable on large datasets. To overcome this bottleneck, this paper explores how cutting-plane methods can provide fast training not only for classification SVMs, but also for structural SVMs. We show that for an equivalent “1-slack” reformulation of the linear SVM training problem, our cutting-plane method has time complexity linear in the number of training examples. In particular, the number of iterations does not depend on the number of training examples, and it is linear in the desired precision and the regularization parameter. Furthermore, we present an extensive empirical evaluation of the method applied to binary classification, multi-class classification, HMM sequence tagging, and CFG parsing. The experiments show that the cutting-plane algorithm is broadly applicable and fast in practice. On large datasets, it is typically several orders of magnitude faster than conventional training methods derived from decomposition methods like SVM-light, or conventional cutting-plane methods. Implementations of our methods are available at www.joachims.org.
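Illustrative sketch (not from the paper): to make the 1-slack cutting-plane idea summarized above concrete, the following Python sketch applies it to the special case of a linear binary-classification SVM. In each iteration the separation oracle builds the single most-violated joint constraint aggregated over all training examples, the small restricted dual QP over the working set is re-solved, and training stops once no constraint is violated by more than epsilon. The function name train_one_slack, the use of SciPy's SLSQP solver for the restricted QP, and the toy data are assumptions made for illustration only; this is not the authors' SVM-struct implementation available at www.joachims.org.

# A minimal sketch of 1-slack cutting-plane training for the special case of
# a linear binary-classification SVM. Names, the SLSQP solver choice, and
# the toy data are illustrative assumptions, not the authors' code.
import numpy as np
from scipy.optimize import minimize


def train_one_slack(X, y, C=1.0, epsilon=1e-3, max_iter=200):
    """Cutting-plane training of min 0.5*||w||^2 + C*xi subject to the
    1-slack constraints: every aggregated joint margin violation <= xi."""
    n, d = X.shape
    w = np.zeros(d)
    planes_g, planes_b = [], []  # aggregated feature vectors and losses

    for _ in range(max_iter):
        # Separation oracle: the most violated joint constraint marks every
        # example whose margin is below 1, then averages over all examples.
        violated = (y * (X @ w)) < 1.0
        g = (X[violated] * y[violated, None]).sum(axis=0) / n
        b = violated.sum() / n

        # Slack currently implied by the working set of cutting planes.
        if planes_b:
            xi = max(0.0, float(np.max(np.array(planes_b) - np.array(planes_g) @ w)))
        else:
            xi = 0.0

        # Stop once the new constraint is violated by at most epsilon; the
        # number of iterations needed does not grow with n.
        if b - w @ g <= xi + epsilon:
            break

        planes_g.append(g)
        planes_b.append(b)
        G = np.array(planes_g)       # one row per cutting plane
        bvec = np.array(planes_b)
        k = len(planes_b)

        # Restricted dual QP over the working set:
        #   max_alpha  bvec @ alpha - 0.5 * ||G.T @ alpha||^2
        #   s.t.       alpha >= 0,  sum(alpha) <= C
        def neg_dual(alpha):
            wa = G.T @ alpha
            return 0.5 * wa @ wa - bvec @ alpha

        res = minimize(
            neg_dual,
            x0=np.full(k, C / k),
            method="SLSQP",
            bounds=[(0.0, C)] * k,
            constraints=[{"type": "ineq", "fun": lambda a: C - a.sum()}],
        )
        w = G.T @ res.x              # recover primal weights from the dual

    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ rng.normal(size=5))
    w = train_one_slack(X, y, C=10.0)
    print("training accuracy:", float(np.mean(np.sign(X @ w) == y)))

The sketch keeps only the ingredients highlighted in the abstract: a working set whose size is independent of the number of examples, one aggregated constraint added per iteration, and a small QP re-solved over that working set.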

Keywords

Structural SVMs · Support vector machines · Structured output prediction · Training algorithms

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Thorsten Joachims (1)
  • Thomas Finley (1)
  • Chun-Nam John Yu (1)

  1. Department of Computer Science, Cornell University, Ithaca, USA
