Abstract
Given the growing volumes of data and high feature dimensionality in forecasting problems, it is challenging to build regression models that are both computationally efficient and highly accurate. Moreover, regression models that rely on a single kernel function, or on a composite of multiple kernel functions, to handle nonlinear fitting commonly suffer from low interpretability. In this paper, we propose a bi-sparse optimization-based regression (BSOR) model and a corresponding algorithm with reconstructed row and column kernel matrices in the framework of support vector regression (SVR). The BSOR model predicts continuous output values for given input points while using zero-norm regularization to obtain sparse sets of instances and features. Experiments on 16 datasets compared BSOR with SVR, linear programming SVR (LPSVR), least squares SVR (LSSVR), multi-kernel learning SVR (MKLSVR), least absolute shrinkage and selection operator regression (LASSOR), and relevance vector regression (RVR). BSOR significantly outperformed the six baseline models in predictive accuracy while selecting the fewest representative instances and the fewest important features and producing the most interpretable results, at the cost of a slightly higher runtime.
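The bi-sparse idea described above can be summarized schematically as follows. This is a sketch only, not the paper's exact formulation: the reconstructed row and column kernel matrices are omitted, and the instance-weight vector $\alpha$, feature-selection vector $\beta$, and constants $\lambda_1$, $\lambda_2$, $C$ are illustrative assumptions.

$$
\min_{\alpha,\,\beta,\,b}\;\; \lambda_1 \lVert \alpha \rVert_0 \;+\; \lambda_2 \lVert \beta \rVert_0 \;+\; C \sum_{j=1}^{n} \bigl| y_j - f(\mathbf{x}_j) \bigr|_{\varepsilon},
\qquad
f(\mathbf{x}) \;=\; \sum_{i=1}^{n} \alpha_i \, k\bigl(\beta \odot \mathbf{x},\; \beta \odot \mathbf{x}_i\bigr) \;+\; b,
$$

where $\lVert \cdot \rVert_0$ counts nonzero entries and $|\cdot|_{\varepsilon}$ is the standard $\varepsilon$-insensitive loss of SVR. Driving $\alpha$ toward few nonzeros selects representative instances (sparsity over the rows of the kernel matrix), while driving $\beta$ toward few nonzeros selects important features (sparsity over the columns), which is what the abstract credits for the model's interpretability.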
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This research has been partially supported by the Natural Science Foundation of Shandong, China (ZR2016FM15), and the National Natural Science Foundation of China (#61877061, #61872170).
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
The article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhang, Z., Gao, G., Yao, T. et al. An interpretable regression approach based on bi-sparse optimization. Appl Intell 50, 4117–4142 (2020). https://doi.org/10.1007/s10489-020-01687-3