
An interpretable regression approach based on bi-sparse optimization

Published in Applied Intelligence

Abstract

Given the increasing amounts of data and the high feature dimensionality of forecasting problems, it is challenging to build regression models that are both computationally efficient and highly accurate. Moreover, regression models commonly suffer from low interpretability when a single kernel function or a composite of multiple kernel functions is used to address nonlinear fitting problems. In this paper, we propose a bi-sparse optimization-based regression (BSOR) model and a corresponding algorithm with reconstructed row and column kernel matrices in the framework of support vector regression (SVR). The BSOR model predicts continuous output values for given input points while using zero-norm regularization to obtain sparse instance and feature sets. Experiments were run on 16 datasets to compare BSOR with SVR, linear programming SVR (LPSVR), least squares SVR (LSSVR), multi-kernel learning SVR (MKLSVR), least absolute shrinkage and selection operator regression (LASSOR), and relevance vector regression (RVR). BSOR significantly outperformed the other six regression models in predictive accuracy, in identifying the fewest representative instances, in selecting the fewest important features, and in interpretability of results, at the cost of a slightly higher runtime.
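To make the two notions of sparsity in the abstract concrete, the sketch below fits two of the baseline families named above on synthetic data using scikit-learn: an RBF-kernel SVR, whose support vectors play the role of "representative instances", and a LASSO regressor, whose nonzero coefficients play the role of "selected features". This is only an illustration of the baselines and of the quantities being compared; it is not an implementation of the BSOR model, and the dataset and hyperparameters are arbitrary assumptions.

```python
# Illustration only: baseline SVR and LASSO on synthetic data, reporting the
# instance- and feature-sparsity measures the paper compares. Not BSOR itself.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Synthetic regression problem: 500 instances, 50 features, 10 of them informative.
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Baseline 1: RBF-kernel SVR -- instance sparsity = number of support vectors.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_train, y_train)
svr_mse = mean_squared_error(y_test, svr.predict(X_test))
print(f"SVR    MSE={svr_mse:.1f}  support vectors={len(svr.support_)}/{len(X_train)}")

# Baseline 2: LASSO -- feature sparsity = number of nonzero coefficients.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
lasso_mse = mean_squared_error(y_test, lasso.predict(X_test))
print(f"LASSO  MSE={lasso_mse:.1f}  selected features={np.count_nonzero(lasso.coef_)}/{X.shape[1]}")
```

Each baseline above is sparse in at most one of the two dimensions; the paper's claim is that BSOR keeps both counts small simultaneously, selecting few instances and few features via zero-norm regularization while remaining accurate.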



Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This research has been partially supported by the Natural Science Foundation of Shandong, China (ZR2016FM15), and the National Natural Science Foundation of China (#61877061, #61872170).

Author information


Corresponding author

Correspondence to Zhiwang Zhang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

The article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, Z., Gao, G., Yao, T. et al. An interpretable regression approach based on bi-sparse optimization. Appl Intell 50, 4117–4142 (2020). https://doi.org/10.1007/s10489-020-01687-3
