Abstract
Enforcing sparsity constraints has been shown to be an effective and efficient way to obtain state-of-the-art results in regression and classification tasks. Unlike the support vector machine (SVM), the relevance vector machine (RVM) explicitly encodes the criterion of model sparsity as a prior over the model weights. However, the lack of an explicit prior structure over the weight variances means that the degree of sparsity is largely controlled by the choice of kernel (and kernel parameters). This can lead to severe overfitting or oversmoothing, possibly even both at once (e.g. for the multiscale Doppler data). We detail an efficient scheme to control sparsity in Bayesian regression by incorporating a flexible noise-dependent smoothness prior into the RVM. We present an empirical evaluation of the effects of the choice of prior structure on a selection of popular data sets and elucidate the link between Bayesian wavelet shrinkage and RVM regression. Our model encompasses the original RVM as a special case, but our empirical results show that in many cases we can surpass RVM performance in terms of goodness of fit and achieved sparsity as well as computational performance. The code is freely available.
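For readers unfamiliar with the baseline model the abstract extends, the RVM's sparsity mechanism can be sketched as follows. This is an illustrative re-implementation of Tipping's evidence-maximisation (type-II maximum likelihood) updates for RVM regression, not the authors' released code; the function name `rvm_fit`, the pruning threshold, and the parameter defaults are choices made here for the sketch.

```python
import numpy as np

def rvm_fit(Phi, t, noise_var=0.01, n_iter=100, prune=1e6):
    """Sparse Bayesian (RVM-style) regression via evidence maximisation.

    Phi: (N, M) design/kernel matrix; t: (N,) targets.
    Returns the posterior mean weights, with pruned entries set to 0.
    """
    N, M = Phi.shape
    alpha = np.ones(M)        # per-weight precision hyperparameters
    beta = 1.0 / noise_var    # noise precision
    keep = np.arange(M)       # indices of retained basis functions
    mu = np.zeros(M)
    for _ in range(n_iter):
        P = Phi[:, keep]
        # Posterior over the retained weights: covariance Sigma, mean mu_k
        A = np.diag(alpha[keep])
        Sigma = np.linalg.inv(beta * P.T @ P + A)
        mu_k = beta * Sigma @ P.T @ t
        # MacKay-style re-estimation: gamma_i = 1 - alpha_i * Sigma_ii
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)
        alpha[keep] = gamma / (mu_k ** 2 + 1e-12)
        beta = (N - gamma.sum()) / (np.sum((t - P @ mu_k) ** 2) + 1e-12)
        mu = np.zeros(M)
        mu[keep] = mu_k
        # Weights whose precision diverges are pruned from the model
        keep = np.where(alpha < prune)[0]
    return mu
```

Sparsity arises because the precision `alpha_i` of an irrelevant weight is driven towards infinity, pinning that weight to zero; the paper's contribution is a noise-dependent smoothness prior placed on this hyperparameter structure, so that sparsity is no longer governed solely by the kernel choice.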
Action Editor: Dale Schuurmans.
Schmolck, A., Everson, R. Smooth relevance vector machine: a smoothness prior extension of the RVM. Mach Learn 68, 107–135 (2007). https://doi.org/10.1007/s10994-007-5012-z