Machine Learning, Volume 46, Issue 1–3, pp. 225–254

Linear Programming Boosting via Column Generation

  • Ayhan Demiriz
  • Kristin P. Bennett
  • John Shawe-Taylor

Abstract

We examine linear programming (LP) approaches to boosting and demonstrate their efficient solution using LPBoost, a column-generation-based simplex method. We formulate the problem as if all possible weak hypotheses had already been generated. The labels produced by the weak hypotheses become the new feature space of the problem. The boosting task then becomes to construct a learning function in this label space that minimizes misclassification error and maximizes the soft margin. We prove that for classification, minimizing the 1-norm soft margin error function directly optimizes a generalization error bound. The equivalent linear program can be efficiently solved using column generation techniques developed for large-scale optimization problems. The resulting LPBoost algorithm can be used to solve any LP boosting formulation by iteratively optimizing the dual misclassification costs in a restricted LP and dynamically generating weak hypotheses to make new LP columns. We provide algorithms for soft margin classification, confidence-rated, and regression boosting problems. Unlike gradient boosting algorithms, which may converge only in the limit, LPBoost converges in a finite number of iterations to a global solution satisfying mathematically well-defined optimality conditions. The optimal solutions of LPBoost are very sparse, in contrast with those of gradient-based methods. LPBoost is competitive with AdaBoost in both solution quality and computational cost.

Keywords: ensemble learning, boosting, linear programming, sparseness, soft margin
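The abstract describes a column generation loop: repeatedly solve a restricted dual LP over the misclassification costs u, ask the weak learner for the hypothesis with the largest weighted edge under u, and add it as a new LP column until no hypothesis violates the dual constraints. The sketch below illustrates that loop for the soft-margin classification case, assuming decision stumps as the weak learner and `scipy.optimize.linprog` as the LP solver; the helper name `best_stump`, the parameter `nu`, and the stopping tolerance are illustrative choices, not taken from the paper.

```python
# Minimal illustrative sketch of an LPBoost-style column generation loop.
# Restricted dual LP:  min beta  s.t.  sum_i u_i y_i h_j(x_i) <= beta  (for each
# generated h_j),  sum_i u_i = 1,  0 <= u_i <= D,  with D = 1/(n*nu).
import numpy as np
from scipy.optimize import linprog


def best_stump(X, y, u):
    """Hypothetical weak learner: exhaustively pick the decision stump h
    maximizing the weighted edge sum_i u_i * y_i * h(x_i)."""
    best_edge, best_h = -np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1.0, -1.0):
                h = s * np.where(X[:, j] <= t, 1.0, -1.0)
                edge = float(np.sum(u * y * h))
                if edge > best_edge:
                    best_edge, best_h = edge, h
    return best_edge, best_h


def lpboost(X, y, nu=0.1, tol=1e-5, max_iter=50):
    n = len(y)
    D = 1.0 / (n * nu)                      # cap on each misclassification cost
    u, beta = np.full(n, 1.0 / n), 0.0      # start with uniform costs
    H, a = [], np.zeros(0)                  # generated columns and their weights
    for _ in range(max_iter):
        edge, h = best_stump(X, y, u)
        if edge <= beta + tol:              # no column violates the dual: stop
            break
        H.append(h)
        A = np.array([y * h_k for h_k in H])            # one row per hypothesis
        c = np.r_[np.zeros(n), 1.0]                     # LP variables are (u, beta)
        A_ub = np.c_[A, -np.ones(len(H))]               # A u - beta <= 0
        A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)    # sum(u) = 1
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(H)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, D)] * n + [(None, None)], method="highs")
        u, beta = res.x[:n], res.x[n]
        # Hypothesis weights are recovered from the duals of the edge constraints.
        a = -np.asarray(res.ineqlin.marginals)
    return H, a
```

Called as `H, a = lpboost(X, y)` on a small numeric dataset with labels in {−1, +1}, the returned stumps and weights define the ensemble f(x) = Σ_j a_j h_j(x). Note that the hypothesis weights come out of the LP solver as dual values of the restricted problem, which mirrors the primal-dual relationship between ensemble weights and misclassification costs exploited by the algorithm.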


Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Ayhan Demiriz (1)
  • Kristin P. Bennett (2, 3)
  • John Shawe-Taylor (4)

  1. Department of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute, Troy, USA
  2. Rensselaer Polytechnic Institute, Troy, USA
  3. Microsoft Research, Redmond, USA
  4. Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, UK
