Enlarging the Margins in Perceptron Decision Trees
 Kristin P. Bennett,
 Nello Cristianini,
 John Shawe-Taylor,
 Donghui Wu
Abstract
Capacity control in perceptron decision trees is typically performed by controlling their size. We prove that other quantities can be as relevant for reducing their flexibility and combating overfitting. In particular, we provide an upper bound on the generalization error that depends both on the size of the tree and on the margin of the decision nodes, so enlarging the margins in a perceptron decision tree reduces this bound. Based on this analysis, we introduce three new algorithms that induce large margin perceptron decision trees. To assess the effect of the large margin bias, OC1 of Murthy, Kasif, and Salzberg (Journal of Artificial Intelligence Research, 2, 1–32, 1994), a well-known system for inducing perceptron decision trees, is used as the baseline algorithm. An extensive experimental study on real-world data showed that, on all but one dataset, the three new algorithms perform better than, or at least not significantly worse than, OC1; moreover, OC1 performed worse than the best margin-based method on every dataset.
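The abstract's central idea — placing a large-margin separating hyperplane at each internal node of the tree, rather than any hyperplane that merely splits the data — can be sketched in a few lines. The following is an illustration under assumptions, not one of the paper's three algorithms: each node trains a margin perceptron (one that keeps updating until every training point clears a fixed functional margin `gamma`) and the tree recurses on the two sides of the resulting hyperplane. All function names and parameters here are hypothetical.

```python
def margin_perceptron(X, y, gamma=0.5, lr=1.0, epochs=200):
    """Fit w, b so that y_i * (w . x_i + b) > gamma for all i (when separable)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            act = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * act <= gamma:            # margin violation -> perceptron update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                updated = True
        if not updated:                      # every point clears the margin
            break
    return w, b

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split with large-margin hyperplanes; leaves hold a label."""
    if len(set(y)) == 1 or depth == max_depth:
        return ("leaf", max(set(y), key=y.count))      # majority label
    w, b = margin_perceptron(X, y)
    side = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
    left = [i for i, s in enumerate(side) if s <= 0]
    right = [i for i, s in enumerate(side) if s > 0]
    if not left or not right:                          # degenerate split
        return ("leaf", max(set(y), key=y.count))
    return ("node", w, b,
            build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, w, b, lo, hi = tree
    s = sum(wj * xj for wj, xj in zip(w, x)) + b
    return predict(lo, x) if s <= 0 else predict(hi, x)

# Two well-separated clusters: the root node alone separates them.
X = [(0, 0), (0, 1), (5, 5), (5, 6)]
y = [-1, -1, 1, 1]
tree = build_tree(X, y)
print([predict(tree, xi) for xi in X])                 # → [-1, -1, 1, 1]
```

Because every update stops only once all points clear `gamma`, the hyperplane at each node is pushed away from both classes, which is the margin-enlarging bias whose effect on the generalization bound the paper analyzes.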
References

 Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4), 615–631.
 Anthony, M. & Bartlett, P. (1994). Function learning from interpolation. Technical Report. (An extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, edited by Paul Vitanyi (Lecture Notes in Artificial Intelligence, vol. 904), Springer-Verlag, Berlin, 1995, pp. 211–221.)
 Bartlett, P. L. & Long, P. M. (1995). Prediction, learning, uniform convergence, and scale-sensitive dimensions. Preprint, Department of Systems Engineering, Australian National University.
 Bartlett, P., Long, P., & Williamson, R. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.
 Bartlett, P. & Shawe-Taylor, J. (1998). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 43–54). Cambridge, MA: MIT Press.
 Bennett, K. & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
 Bennett, K. & Mangasarian, O. (1994a). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 29–39.
 Bennett, K. & Mangasarian, O. (1994b). Serial and parallel multicategory discrimination. SIAM Journal on Optimization, 4(4), 722–734.
 Bennett, K., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing. R.P.I. Math Report No. 98–100, Rensselaer Polytechnic Institute, Troy, NY.
 Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
 Brodley, C. E. & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77.
 Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
 Cristianini, N., Shawe-Taylor, J., & Sykacek, P. (1998). Bayesian classifiers are large margin hyperplanes in a Hilbert space. In J. Shavlik (Ed.), Machine Learning: Proceedings of the Fifteenth International Conference (pp. 109–117). San Francisco, CA: Morgan Kaufmann.
 Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
 Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science (pp. 382–391). Los Alamitos, CA: IEEE Computer Society Press.
 Kohavi, R. (1995). A study of cross-validation and bootstrapping for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (pp. 1137–1143). San Mateo, CA: Morgan Kaufmann.
 Mangasarian, O., Setiono, R., & Wolberg, W. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Proceedings of the Workshop on Large-Scale Numerical Optimization (pp. 22–31). Philadelphia, PA: SIAM.
 Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32.
 Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Springer-Verlag.
 Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
 Quinlan, J. R. & Rivest, R. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227–248.
 Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3), 317–327.
 Sankar, A. & Mammone, R. J. (1993). Growing and pruning neural tree networks. IEEE Transactions on Computers, 42, 291–299.
 Schapire, R., Freund, Y., Bartlett, P. L., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, Jr. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97) (pp. 322–330). Nashville, TN: Morgan Kaufmann.
 Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
 Sirat, J. A. & Nadal, J.-P. (1990). Neural trees: A new tool for classification. Network, 1, 423–438.
 University of California, Irvine, Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
 Utgoff, P. E. (1989). Perceptron trees: A case study in hybrid concept representations. Connection Science, 1, 377–391.
 Vapnik, V. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
 Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
 Vapnik, V. & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Applications, 16, 264–280.
 Title
 Enlarging the Margins in Perceptron Decision Trees
 Journal

Machine Learning
Volume 41, Issue 3, pp. 295–313
 Cover Date
 2000-12-01
 DOI
 10.1023/A:1007600130808
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Keywords

 capacity control
 decision trees
 perceptron
 learning theory
 learning algorithm
 Authors

 Kristin P. Bennett ^{(1)}
 Nello Cristianini ^{(2)}
 John Shawe-Taylor ^{(3)}
 Donghui Wu ^{(4)}
 Author Affiliations

 1. Dept of Mathematical Sciences, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180, USA
 2. Dept of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
 3. Dept of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
 4. Dept of Mathematical Sciences, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180, USA