Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. *Journal of the ACM*, *44*(4), 615–631.

Anthony, M. & Bartlett, P. (1994). Function learning from interpolation. Technical Report (An extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, edited by Paul Vitanyi (Lecture Notes in Artificial Intelligence, vol. 904) Springer-Verlag, Berlin, 1995, pp. 211–221).

Bartlett, P. L. & Long, P. M. (1995). Prediction, learning, uniform convergence, and scale-sensitive dimensions. Preprint, Department of Systems Engineering, Australian National University.

Bartlett, P., Long, P., & Williamson, R. (1996). Fat-shattering and the learnability of real-valued functions. *Journal of Computer and System Sciences*, *52*(3), 434–452.

Bartlett, P. & Shawe-Taylor, J. (1998). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), *Advances in kernel methods: Support vector learning* (pp. 43–54). Cambridge, MA: MIT Press.

Bennett, K. & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. *Optimization Methods and Software*, *1*, 23–34.

Bennett, K. & Mangasarian, O. (1994a). Multicategory discrimination via linear programming. *Optimization Methods and Software*, *3*, 29–39.

Bennett, K. & Mangasarian, O. (1994b). Serial and parallel multicategory discrimination. *SIAM Journal on Optimization*, *4*(4), 722–734.

Bennett, K., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing. R.P.I. Math Report No. 98-100, Rensselaer Polytechnic Institute, Troy, NY.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). *Classification and regression trees*. Belmont, CA: Wadsworth International Group.

Brodley, C. E. & Utgoff, P. E. (1995). Multivariate decision trees. *Machine Learning*, *19*, 45–77.

Cortes, C. & Vapnik, V. (1995). Support-vector networks. *Machine Learning*, *20*, 273–297.

Cristianini, N., Shawe-Taylor, J., & Sykacek, P. (1998). Bayesian classifiers are large margin hyperplanes in a Hilbert space. In J. Shavlik (Ed.), *Machine Learning: Proceedings of the Fifteenth International Conference* (pp. 109–117). San Francisco, CA: Morgan Kaufmann Publishers.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, *10*(7), 1895–1924.

Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In *Proceedings of the 31st Symposium on the Foundations of Computer Science* (pp. 382–391). Los Alamitos, CA: IEEE Computer Society Press.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In *International Joint Conference on Artificial Intelligence* (pp. 1137–1143). San Mateo, CA: Morgan Kaufmann.

Mangasarian, O., Setiono, R., & Wolberg, W. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), *Proceedings of the Workshop on Large-Scale Numerical Optimization* (pp. 22–31). Philadelphia, PA: SIAM.

Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. *Journal of Artificial Intelligence Research*, *2*, 1–32.

Neal, R. M. (1998). Assessing relevance determination methods using DELVE generalization. In C. M. Bishop (Ed.), *Neural networks and machine learning* (pp. 97–129). Springer-Verlag.

Quinlan, J. R. (1993). *C4.5: Programs for machine learning*. Morgan Kaufmann.

Quinlan, J. R. & Rivest, R. (1989). Learning decision trees using the minimum description length principle. *Information and Computation*, *80*, 227–248.

Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. *Data Mining and Knowledge Discovery*, *1*(3), 317–327.

Sankar, A. & Mammone, R. J. (1993). Growing and pruning neural tree networks. *IEEE Transactions on Computers*, *42*, 291–299.

Schapire, R., Freund, Y., Bartlett, P. L., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, Jr. (Ed.), *Proceedings of International Conference on Machine Learning, ICML'97* (pp. 322–330). Nashville, TN: Morgan Kaufmann Publishers.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1996). Structural risk minimization over data-dependent hierarchies. *IEEE Transactions on Information Theory*, *44*(5), 1926–1940.

Sirat, J. A. & Nadal, J.-P. (1990). Neural trees: A new tool for classification. *Network*, *1*, 423–438.

University of California, Irvine: Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.

Utgoff, P. E. (1989). Perceptron trees: A case study in hybrid concept representations. *Connection Science*, *1*, 377–391.

Vapnik, V. (1982). *Estimation of dependences based on empirical data*. New York: Springer-Verlag.

Vapnik, V. (1995). *The nature of statistical learning theory*. New York: Springer-Verlag.

Vapnik, V. & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. *Theory of Probability and Its Applications*, *16*, 264–280.