Data Integration Using Model-Based Boosting

Li, Bin; Chakraborty, Somsubhra; Weindorf, David C.; Yu, Qingzhao

doi:10.1007/s42979-021-00797-0

Original Research
Published: 07 August 2021

Volume 2, article number 400, (2021)

SN Computer Science

Bin Li¹,
Somsubhra Chakraborty²,
David C. Weindorf³ &
…
Qingzhao Yu ORCID: orcid.org/0000-0001-8194-0798⁴

Abstract

The need for data integration is becoming ubiquitous across many disciplines as instrumentation technology advances. Combining information from distinct data sources in a single model, so as to improve prediction accuracy and obtain a holistic view of the problem, remains a challenge for statisticians. In this paper, we present a flexible statistical framework for integrating various types of data from distinct sources through model-based boosting (IMBoost) with two types of base models: regression trees and penalized splines. The performance of IMBoost is illustrated through two recent studies in environmental soil science, in which multiple sensors were used to quantify several soil parameters. The empirical results are promising and show that the proposed algorithms substantially improve prediction performance by combining the strengths of distinct data sources. We also propose a surrogate model approach, which allows IMBoost to handle situations in which a portion of the samples is missing from some of the sources.
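
The general idea of boosting over several data sources can be illustrated with a minimal sketch. It is not the authors' IMBoost implementation, whose details are not given in this abstract. It assumes a generic component-wise L2 boosting loop in which, at each iteration, one regression stump is fitted per source to the current residuals and only the best-fitting source's stump is added, scaled by a shrinkage factor. The function names, the source names `sensor_a` and `sensor_b`, the shrinkage value, and the synthetic data are all hypothetical.

```python
import random

def fit_stump(rows, resid):
    """Least-squares stump: (feature, threshold, left_mean, right_mean)."""
    n, total = len(resid), sum(resid)
    # Fallback: constant fit (threshold = +inf sends every row left).
    best_gain, best = total ** 2 / n, (0, float("inf"), total / n, total / n)
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += resid[order[k - 1]]
            lo, hi = rows[order[k - 1]][j], rows[order[k]][j]
            if lo == hi:
                continue  # cannot split between tied values
            right_sum = total - left_sum
            # Maximizing this gain minimizes the split's squared error.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if gain > best_gain:
                best_gain = gain
                best = (j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best

def predict_stump(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right

def boost(sources, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting; `sources` maps a source name to its rows."""
    n = len(y)
    offset = sum(y) / n
    fitted = [offset] * n
    steps = []
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        best = None
        for name, rows in sources.items():
            stump = fit_stump(rows, resid)
            preds = [predict_stump(stump, row) for row in rows]
            sse = sum((r - p) ** 2 for r, p in zip(resid, preds))
            if best is None or sse < best[0]:
                best = (sse, name, stump, preds)
        _, name, stump, preds = best
        # Add only the winning source's stump, shrunk by nu.
        fitted = [f + nu * p for f, p in zip(fitted, preds)]
        steps.append((name, stump))
    return offset, nu, steps

def predict(model, sources):
    offset, nu, steps = model
    out = [offset] * len(next(iter(sources.values())))
    for name, stump in steps:
        out = [o + nu * predict_stump(stump, row)
               for o, row in zip(out, sources[name])]
    return out

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

# Synthetic example (assumption): the response depends on both sources.
rng = random.Random(0)
n = 200
sensor_a = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
sensor_b = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
y = [2.0 * (a[0] > 0) + 1.5 * b[1] + rng.gauss(0, 0.1)
     for a, b in zip(sensor_a, sensor_b)]

both = {"sensor_a": sensor_a, "sensor_b": sensor_b}
model_both = boost(both, y)
model_a = boost({"sensor_a": sensor_a}, y)
mse_both = mse(y, predict(model_both, both))
mse_a = mse(y, predict(model_a, {"sensor_a": sensor_a}))
print("training MSE, both sources:", round(mse_both, 3))
print("training MSE, sensor_a only:", round(mse_a, 3))
```

Because each step records which source supplied the selected stump, the fitted ensemble also indicates how often each source contributed. The errors printed here are training errors on simulated data and say nothing about the results reported in the paper.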

Acknowledgements

The authors would like to thank the reviewers for constructive comments that helped to improve the quality of the article. Portions of this research were conducted with high-performance computational resources provided by the Louisiana Optical Network Infrastructure (http://www.loni.org).

Author information

Authors and Affiliations

Department of Experimental Statistics, Louisiana State University, Baton Rouge, LA, 70803, USA
Bin Li
Agricultural and Food Engineering Department, IIT Kharagpur, Kharagpur, 721302, India
Somsubhra Chakraborty
Department of Earth and Atmospheric Sciences, Central Michigan University, Mount Pleasant, MI, 48859, USA
David C. Weindorf
Louisiana State University Health Sciences Center, New Orleans, LA, 70112, USA
Qingzhao Yu

Corresponding author

Correspondence to Qingzhao Yu.

Ethics declarations

Funding:

Not applicable.

Conflict of interest:

None.

Availability of data and material:

The authors will provide at least part of the data after acceptance.

Code availability:

The authors will make the sample code available online.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, B., Chakraborty, S., Weindorf, D.C. et al. Data Integration Using Model-Based Boosting. SN COMPUT. SCI. 2, 400 (2021). https://doi.org/10.1007/s42979-021-00797-0

Received: 22 January 2021
Accepted: 27 July 2021
Published: 07 August 2021
DOI: https://doi.org/10.1007/s42979-021-00797-0
