Neural Processing Letters, Volume 38, Issue 3, pp 403–416

Optimized Parameter Search for Large Datasets of the Regularization Parameter and Feature Selection for Ridge Regression

  • Pieter Buteneers
  • Ken Caluwaerts
  • Joni Dambre
  • David Verstraeten
  • Benjamin Schrauwen


In this paper we propose mathematical optimizations for selecting the optimal regularization parameter for ridge regression using cross-validation. The resulting algorithm is suited to large datasets, and its computational cost does not depend on the size of the training set. We extend this algorithm to forward or backward feature selection, in which the optimal regularization parameter is selected for each candidate feature set. These feature selection algorithms yield solutions with a sparse weight matrix while using a quadratic cost on the norm of the weights. A naive approach to optimizing the ridge regression parameter has a computational complexity of order \(O(R K N^{2} M)\), with \(R\) the number of regularization parameters tried, \(K\) the number of cross-validation folds, \(N\) the number of input features and \(M\) the number of data samples in the training set. Our implementation has a computational complexity of order \(O(KN^3)\). For large datasets this cost is smaller than the \(O(N^2M)\) cost of regression without regularization, and it is independent of both the number of regularization parameters tried and the size of the training set. Combined with feature selection, the algorithm has complexity \(O(RKNN_s^3)\) for forward selection and \(O(RKN^3N_r)\) for backward selection, with \(N_s\) the number of selected features and \(N_r\) the number of removed features. This is a factor of \(M\) faster than the \(O(RKNN_s^3M)\) and \(O(RKN^3N_rM)\) of the naive implementation, where \(N \ll M\) for large datasets. To demonstrate the performance and the reduction in computational cost, we apply the technique to recurrent neural networks trained with the reservoir computing approach, windowed ridge regression, least-squares support vector machines (LS-SVMs) in primal space using the fixed-size LS-SVM approximation, and extreme learning machines.
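The central idea, that cross-validation can run on small per-fold statistics instead of the raw data, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `ridge_cv_from_covariances`, its arguments, and the use of an eigendecomposition to reuse work across regularization values are assumptions made for illustration. The data are touched once to build \(N \times N\) matrices per fold; after that, no step depends on \(M\).

```python
import numpy as np

def ridge_cv_from_covariances(X, y, lambdas, n_folds=5):
    """K-fold CV for ridge regression using only per-fold N x N statistics.

    Hypothetical helper for illustration. X is (M, N), y is (M,),
    and every value in `lambdas` is assumed to be strictly positive.
    """
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    # Single pass over the data: O(N^2 M) in total, done once.
    A = [X[idx].T @ X[idx] for idx in folds]       # per-fold X^T X
    b = [X[idx].T @ y[idx] for idx in folds]       # per-fold X^T y
    c = [float(y[idx] @ y[idx]) for idx in folds]  # per-fold y^T y
    A_all, b_all = sum(A), sum(b)

    sse = np.zeros(len(lambdas))
    for A_k, b_k, c_k in zip(A, b, c):
        # Training statistics = all data minus the held-out fold.
        evals, V = np.linalg.eigh(A_all - A_k)     # O(N^3), once per fold
        Vb = V.T @ (b_all - b_k)
        for i, lam in enumerate(lambdas):
            # Ridge weights for this lambda in O(N^2), no new factorization.
            w = V @ (Vb / (evals + lam))
            # Held-out squared error ||X_k w - y_k||^2, expanded so that
            # only A_k, b_k and c_k are needed, not the fold's samples.
            sse[i] += w @ A_k @ w - 2.0 * (w @ b_k) + c_k
    mse = sse / X.shape[0]
    best = int(np.argmin(mse))
    # Final weights on all data with the selected parameter.
    w_final = np.linalg.solve(A_all + lambdas[best] * np.eye(X.shape[1]), b_all)
    return lambdas[best], mse, w_final
```

In this sketch each fold costs one \(O(N^3)\) eigendecomposition plus \(O(N^2)\) per regularization value, so trying more values adds little when \(R\) is small relative to \(N\). The paper's own derivation may organize the computation differently.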


Keywords: Cross-validation · Feature selection · Ridge regression · Regularization parameter optimization · Computationally efficient · Model selection



The work presented in this paper is funded by a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen), a Ph.D. fellowship of the Research Foundation Flanders (FWO), the EC FP7 project ORGANIC (FP7-231267), the BOF-GOA project Home-MATE funded by the Ghent University Special Research Fund, the Interuniversity Attraction Poles Program (Belgian Science Policy) project Photonics@be IAP6/10, and the FWO project RECAP.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Pieter Buteneers (1)
  • Ken Caluwaerts (1)
  • Joni Dambre (1)
  • David Verstraeten (1)
  • Benjamin Schrauwen (1)

  1. Electronics and Information Systems, Ghent University, Ghent, Belgium
