Machine Learning, 78:175

Learning the set covering machine by bound minimization and margin-sparsity trade-off

  • François Laviolette
  • Mario Marchand
  • Mohak Shah
  • Sara Shanian

Abstract

We investigate classifiers in the sample compression framework that can be specified by two distinct sources of information: a compression set and a message string of additional information. In this setting, a reconstruction function produces a classifier when given these two pieces of information. We examine how an efficient redistribution of this reconstruction information can lead to more general classifiers. In particular, we derive risk bounds that provide explicit control over the sparsity of the classifier and the magnitude of its separating margin, and that allow a margin-sparsity trade-off to be performed in favor of better classifiers. We show how an application to the set covering machine algorithm yields novel learning strategies. We also show that these risk bounds are tighter than traditional counterparts, such as VC-dimension and Rademacher complexity-based bounds, that explicitly take the complexity of the hypothesis class into account. Finally, we show how these bounds can guide model selection for the set covering machine algorithm, enabling it to learn by bound minimization.
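To make the sample compression setting concrete, the sketch below (not the paper's algorithm; all names and the choice of a conjunction-of-balls hypothesis are illustrative assumptions) shows a reconstruction function that turns a small compression set of retained training points plus a message string of radii into a classifier, loosely in the spirit of a set covering machine.

```python
# Minimal sketch of the sample compression view: a classifier is reconstructed
# from a small "compression set" of training examples plus a short "message".
# Here the reconstruction is a conjunction of balls centered on the retained
# points; this is an illustrative assumption, not the paper's exact scheme.
import numpy as np

def reconstruct(compression_set, message):
    """Reconstruction function: maps (compression set, message) to a classifier.

    compression_set : array of shape (k, d), the retained training points (ball centers).
    message         : array of shape (k,), one radius per center (the extra information).
    """
    centers = np.asarray(compression_set, dtype=float)
    radii = np.asarray(message, dtype=float)

    def classify(x):
        # Predict +1 only if x lies inside every ball of the conjunction.
        x = np.asarray(x, dtype=float)
        dists = np.linalg.norm(centers - x, axis=1)
        return 1 if np.all(dists <= radii) else -1

    return classify

# Toy usage: two centers kept out of a larger training set, radii as the message.
compression_set = np.array([[0.0, 0.0], [1.0, 0.0]])
message = np.array([1.5, 1.5])
h = reconstruct(compression_set, message)
print(h([0.5, 0.2]))   # inside both balls  -> +1
print(h([3.0, 3.0]))   # outside some ball  -> -1
```

In this picture, sparsity corresponds to the size of the compression set and the message length, while the margin is governed by how loosely the radii fit the data; the risk bounds discussed in the abstract trade these two quantities against each other.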

Keywords

Set covering machine, Sample compression, Risk bounds, Margin-sparsity trade-off, Bound minimization

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • François Laviolette (1)
  • Mario Marchand (1)
  • Mohak Shah (2)
  • Sara Shanian (1)

  1. Department of Computer Science and Software Engineering, Pav. Adrien Pouliot, Laval University, Quebec, Canada
  2. Centre for Intelligent Machines, McGill University, Montreal, Canada
