A sparse linear regression model for incomplete datasets

Veras, Marcelo B. A.; Mesquita, Diego P. P.; Mattos, Cesar L. C.; Gomes, João P. P.

doi:10.1007/s10044-019-00859-3

Theoretical advances
Published: 04 December 2019

Volume 23, pages 1293–1303, (2020)
Pattern Analysis and Applications

Marcelo B. A. Veras¹,
Diego P. P. Mesquita²,
Cesar L. C. Mattos¹ &
…
João P. P. Gomes ORCID: orcid.org/0000-0003-1686-595X¹

Abstract

Incomplete data are often neglected when designing machine learning methods. A popular strategy adopted by practitioners to circumvent this consists of taking a preprocessing step to fill the missing components. These preprocessing algorithms are designed independently of the machine learning method that will be applied subsequently, which may lead to sub-optimal results. An alternative solution is to redesign classical machine learning methods to handle missing data directly. In this paper, we propose a variant of the forward stagewise regression (FSR) algorithm for incomplete data. The original FSR is an iterative procedure to estimate parameters of sparse linear models. The proposed method, named forward stagewise regression for incomplete datasets with GMM (FSIG), models the missing components as random variables following a Gaussian mixture distribution. In FSIG, the main steps of FSR are adapted to deal with the intrinsic uncertainty of incomplete samples. The performance of FSIG was evaluated in an extensive set of experiments, and our model was able to outperform classical methods in most of the tested cases.
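
For orientation, the classical FSR procedure mentioned in the abstract can be sketched on complete data as follows. This is the textbook incremental variant, not the FSIG method proposed in the paper: it does not handle missing values or use a Gaussian mixture. The function name `forward_stagewise`, the step size, the iteration cap, and the stopping tolerance are illustrative assumptions.

```python
import numpy as np


def forward_stagewise(X, y, step=0.01, n_iter=5000, tol=None):
    """Incremental forward stagewise regression on COMPLETE data.

    Illustrative sketch only; coefficients are returned on the
    standardized feature scale.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Standardize features and center the target so that the scores
    # computed below are comparable across features.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    residual = y - y.mean()

    n, d = Xs.shape
    tol = step if tol is None else tol  # assumed stopping threshold
    beta = np.zeros(d)

    for _ in range(n_iter):
        # Score each feature by its (scaled) inner product with the residual.
        corr = Xs.T @ residual / n
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < tol:
            break  # no feature is meaningfully correlated with the residual

        # Nudge only the best-scoring coefficient by a small fixed amount,
        # then update the residual accordingly.
        delta = step * np.sign(corr[j])
        beta[j] += delta
        residual = residual - delta * Xs[:, j]

    return beta


# Small synthetic example: only features 0 and 3 are truly active.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
beta = forward_stagewise(X, y)
print(np.round(beta, 2))
```

Because each iteration changes a single coefficient by a small fixed amount, features that are only weakly related to the target tend to stay at or near zero, which is where the sparsity of the fitted model comes from.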

Acknowledgements

The authors would like to thank the Brazilian National Council for Scientific and Technological Development (CNPq) for financial support (Grant 302289/2019-4).

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Ceara, Rua Campus do Pici sn, Fortaleza, CE, CEP 60455900, Brazil
Marcelo B. A. Veras, Cesar L. C. Mattos & João P. P. Gomes
Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland
Diego P. P. Mesquita

Corresponding author

Correspondence to João P. P. Gomes.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Veras, M.B.A., Mesquita, D.P.P., Mattos, C.L.C. et al. A sparse linear regression model for incomplete datasets. Pattern Anal Applic 23, 1293–1303 (2020). https://doi.org/10.1007/s10044-019-00859-3

Received: 24 March 2019
Accepted: 23 November 2019
Published: 04 December 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s10044-019-00859-3
