Machine Learning, Volume 41, Issue 3, pp 295–313

Enlarging the Margins in Perceptron Decision Trees

  • Kristin P. Bennett
  • Nello Cristianini
  • John Shawe-Taylor
  • Donghui Wu

Abstract

Capacity control in perceptron decision trees is typically performed by controlling their size. We prove that other quantities can be as relevant for reducing their flexibility and combating overfitting. In particular, we provide an upper bound on the generalization error that depends both on the size of the tree and on the margins at its decision nodes, so enlarging the margins in a perceptron decision tree reduces this upper bound. Based on this analysis, we introduce three new algorithms that induce large margin perceptron decision trees. To assess the effect of the large margin bias, OC1 (Murthy, Kasif, & Salzberg, Journal of Artificial Intelligence Research, 1994, 2, 1–32), a well-known system for inducing perceptron decision trees, is used as the baseline algorithm. An extensive experimental study on real-world data shows that all three new algorithms perform better than, or at least not significantly worse than, OC1 on every dataset but one, and that OC1 performs worse than the best margin-based method on every dataset.
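To make the large-margin bias concrete, the sketch below builds a perceptron decision tree in which each internal node's linear split is chosen to maximize the margin on the data reaching that node. It is a minimal illustration only, assuming scikit-learn's LinearSVC as the margin-maximizing split learner and a simple purity/size/depth stopping rule; the names Node, build_tree, and predict_one are invented for this sketch, and it is not any of the paper's three proposed algorithms nor the OC1 baseline.

# Minimal sketch of a perceptron decision tree with large-margin splits.
# Assumptions (not from the paper): scikit-learn's LinearSVC serves as the
# margin-maximizing linear separator at each internal node; the stopping rule
# (purity / minimum node size / maximum depth) is a simplification.
import numpy as np
from sklearn.svm import LinearSVC

class Node:
    def __init__(self, label=None, split=None, left=None, right=None):
        self.label = label    # majority class at a leaf
        self.split = split    # fitted LinearSVC at an internal node
        self.left = left      # subtree for decision_function < 0
        self.right = right    # subtree for decision_function >= 0

def build_tree(X, y, depth=0, max_depth=5, min_leaf=5):
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    # Stop if the node is pure, too small, or too deep.
    if len(classes) == 1 or len(y) < 2 * min_leaf or depth >= max_depth:
        return Node(label=majority)
    # Large-margin split: a soft-margin linear separator plays the role of
    # the perceptron at this decision node.
    svc = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    side = svc.decision_function(X)
    if side.ndim > 1:                    # multi-class: split on the first hyperplane
        side = side[:, 0]
    go_right = side >= 0
    if go_right.all() or (~go_right).all():   # degenerate split: make a leaf
        return Node(label=majority)
    return Node(split=svc,
                left=build_tree(X[~go_right], y[~go_right], depth + 1, max_depth, min_leaf),
                right=build_tree(X[go_right], y[go_right], depth + 1, max_depth, min_leaf))

def predict_one(node, x):
    # Route the example down the tree by the sign of each node's separator.
    while node.label is None:
        score = node.split.decision_function(x.reshape(1, -1)).ravel()[0]
        node = node.right if score >= 0 else node.left
    return node.label

For instance, tree = build_tree(X_train, y_train) followed by [predict_one(tree, x) for x in X_test] produces predictions; here the margin at each node comes from the soft-margin objective of LinearSVC rather than from the paper's specific formulations.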

Keywords: capacity control, decision trees, perceptron, learning theory, learning algorithm

References

  1. Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4), 615–631.
  2. Anthony, M. & Bartlett, P. (1994). Function learning from interpolation. Technical Report (an extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, edited by Paul Vitanyi (Lecture Notes in Artificial Intelligence, vol. 904), Springer-Verlag, Berlin, 1995, pp. 211–221).
  3. Bartlett, P. L. & Long, P. M. (1995). Prediction, learning, uniform convergence, and scale-sensitive dimensions. Preprint, Department of Systems Engineering, Australian National University.
  4. Bartlett, P., Long, P., & Williamson, R. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.
  5. Bartlett, P. & Shawe-Taylor, J. (1998). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 43–54). Cambridge, MA: MIT Press.
  6. Bennett, K. & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
  7. Bennett, K. & Mangasarian, O. (1994a). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 29–39.
  8. Bennett, K. & Mangasarian, O. (1994b). Serial and parallel multicategory discrimination. SIAM Journal on Optimization, 4(4), 722–734.
  9. Bennett, K., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing. R.P.I. Math Report No. 98–100, Rensselaer Polytechnic Institute, Troy, NY.
  10. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
  11. Brodley, C. E. & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77.
  12. Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
  13. Cristianini, N., Shawe-Taylor, J., & Sykacek, P. (1998). Bayesian classifiers are large margin hyperplanes in a Hilbert space. In J. Shavlik (Ed.), Machine Learning: Proceedings of the Fifteenth International Conference (pp. 109–117). San Francisco, CA: Morgan Kaufmann.
  14. Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
  15. Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science (pp. 382–391). Los Alamitos, CA: IEEE Computer Society Press.
  16. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 1137–1143). San Mateo, CA: Morgan Kaufmann.
  17. Mangasarian, O., Setiono, R., & Wolberg, W. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Proceedings of the Workshop on Large-Scale Numerical Optimization (pp. 22–31). Philadelphia, PA: SIAM.
  18. Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32.
  19. Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Springer-Verlag.
  20. Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
  21. Quinlan, J. R. & Rivest, R. (1989). Learning decision trees using the minimum description length principle. Information and Computation, 80, 227–248.
  22. Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3), 317–327.
  23. Sankar, A. & Mammone, R. J. (1993). Growing and pruning neural tree networks. IEEE Transactions on Computers, 42, 291–299.
  24. Schapire, R., Freund, Y., Bartlett, P. L., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, Jr. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97) (pp. 322–330). Nashville, TN: Morgan Kaufmann.
  25. Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1996). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
  26. Sirat, J. A. & Nadal, J.-P. (1990). Neural trees: A new tool for classification. Network, 1, 423–438.
  27. University of California, Irvine, Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
  28. Utgoff, P. E. (1989). Perceptron trees: A case study in hybrid concept representations. Connection Science, 1, 377–391.
  29. Vapnik, V. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
  30. Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
  31. Vapnik, V. & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264–280.

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Kristin P. Bennett (1)
  • Nello Cristianini (2)
  • John Shawe-Taylor (3)
  • Donghui Wu (4)

  1. Dept of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, USA
  2. Dept of Computer Science, Royal Holloway, University of London, Egham, Surrey, UK
  3. Dept of Computer Science, Royal Holloway, University of London, Egham, Surrey, UK
  4. Dept of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, USA
