Machine Learning, Volume 71, Issue 2–3, pp 243–264

Efficient approximate leave-one-out cross-validation for kernel logistic regression

Abstract

Kernel logistic regression (KLR) is the kernel learning method best suited to binary pattern recognition problems where estimates of the a-posteriori probability of class membership are required. Such problems occur frequently in practical applications, for instance where the operational prior class probabilities, or equivalently the relative misclassification costs, are variable or unknown at the time the model is trained. The model parameters are given by the solution of a convex optimization problem, which may be found via an efficient iteratively re-weighted least squares (IRWLS) procedure. The generalization properties of a kernel logistic regression machine are, however, governed by a small number of hyper-parameters, the values of which must be determined during model selection. In this paper, we propose a novel model selection strategy for KLR, based on a computationally efficient closed-form approximation of the leave-one-out cross-validation procedure. Results obtained on a variety of synthetic and real-world benchmark datasets demonstrate that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation approach, and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and the Expectation Propagation (EP) algorithm.
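
To make the procedure summarised above concrete, the sketch below (Python/NumPy) fits a kernel logistic regression model by IRWLS and then evaluates a closed-form approximation of the leave-one-out cross-validation loss by re-using the final weighted least-squares system, so that the model is never refitted with a pattern removed. This is a minimal illustration of the general idea under simplifying assumptions, not the authors' exact formulation: the bias term is omitted, targets are coded in {0, 1}, a Gaussian RBF kernel is assumed, and the function and parameter names (rbf_kernel, klr_irwls, approx_loo_nll, lam, sigma) are chosen here purely for illustration.

```python
# Minimal sketch: kernel logistic regression via IRWLS, plus a closed-form
# approximation of leave-one-out cross-validation obtained from the final
# weighted least-squares system.  Illustrative only; the bias term is omitted
# and the names (lam, sigma, ...) are assumptions, not the paper's notation.

import numpy as np


def rbf_kernel(X, sigma=1.0):
    """Gaussian RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))


def klr_irwls(K, t, lam=1.0, n_iter=50, tol=1e-8):
    """Fit KLR (targets t in {0, 1}) by iteratively re-weighted least squares.

    Each iteration solves the weighted system (K + lam * W^{-1}) alpha = z,
    where W = diag(p * (1 - p)) and z = eta + W^{-1} (t - p) are the usual
    IRWLS working responses for logistic models.
    """
    alpha = np.zeros(K.shape[0])
    for _ in range(n_iter):
        eta = K @ alpha                            # latent function values
        p = 1.0 / (1.0 + np.exp(-eta))             # predicted probabilities
        w = np.clip(p * (1.0 - p), 1e-10, None)    # IRWLS weights
        z = eta + (t - p) / w                      # working responses
        C = K + lam * np.diag(1.0 / w)             # weighted LS system matrix
        alpha_new = np.linalg.solve(C, z)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha, C, z


def approx_loo_nll(t, alpha, C, z):
    """Approximate leave-one-out negative log-likelihood in closed form.

    Treating the converged IRWLS step as a weighted least-squares problem,
    the standard leave-one-out identity for least squares gives
    z_i - eta_i^{(-i)} ~= alpha_i / [C^{-1}]_{ii}, so the held-out latent
    values are obtained without retraining the model n times.
    """
    c_inv_diag = np.diag(np.linalg.inv(C))
    eta_loo = z - alpha / c_inv_diag               # approximate LOO latent values
    p_loo = np.clip(1.0 / (1.0 + np.exp(-eta_loo)), 1e-12, 1.0 - 1e-12)
    return -np.mean(t * np.log(p_loo) + (1.0 - t) * np.log(1.0 - p_loo))


if __name__ == "__main__":
    # Evaluate the criterion for one candidate setting of (lam, sigma).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 2))
    t = (X[:, 0] + X[:, 1] > 0).astype(float)
    K = rbf_kernel(X, sigma=1.0)
    alpha, C, z = klr_irwls(K, t, lam=0.1)
    print("approximate LOO negative log-likelihood:", approx_loo_nll(t, alpha, C, z))
```

In a full model selection procedure, the approximate LOO criterion returned by approx_loo_nll would be minimised with respect to the hyper-parameters (here the regularisation parameter lam and the kernel width sigma) using a standard numerical optimisation routine.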

Keywords

Model selection · Kernel logistic regression

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

School of Computing Sciences, University of East Anglia, Norwich, UK