Abstract
The need for data integration has become ubiquitous across many disciplines owing to technological developments in instrumentation. Combining information from distinct data sources in a single model, so as to improve prediction accuracy and obtain a holistic view of the problem, remains a challenge for statisticians. In this paper, we present a flexible statistical framework for integrating various types of data from distinct sources through model-based boosting (IMBoost), with two types of base models: regression trees and penalized splines. The performance of IMBoost is illustrated through two recent studies in environmental soil science, in which multiple sensors were used to quantify several soil parameters. The empirical results are promising and show that the proposed algorithms substantially improve prediction performance by combining the strengths of distinct data sources. We also propose a surrogate model approach, which allows IMBoost to handle situations in which some samples are missing from one or more of the sources.
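To make the idea of boosting over distinct data sources concrete, the following is a minimal sketch of generic component-wise L2 boosting, not the authors' IMBoost implementation. It assumes one base learner per data source, uses depth-1 regression trees (stumps) as a simple stand-in for the regression-tree base models, and runs on synthetic data. All function names (`fit_stump`, `boost_sources`, `predict`), the shrinkage value `nu`, and the number of iterations are hypothetical choices made for illustration.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares depth-1 regression tree fitted to residuals r.

    Returns (sse, feature, threshold, left_value, right_value).
    Falls back to a constant fit if no split improves on it.
    """
    m = r.mean()
    best = (((r - m) ** 2).sum(), 0, np.inf, m, m)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # candidate split points
            left = X[:, j] <= t
            cl, cr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - cl) ** 2).sum() + ((r[~left] - cr) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, cl, cr)
    return best

def predict_stump(stump, X):
    _, j, t, cl, cr = stump
    return np.where(X[:, j] <= t, cl, cr)

def boost_sources(sources, y, n_iter=50, nu=0.1):
    """Component-wise L2 boosting: one candidate stump per source per step."""
    offset = y.mean()
    f = np.full(len(y), offset)
    ensemble = []
    for _ in range(n_iter):
        r = y - f  # negative gradient of the squared-error loss
        fits = [fit_stump(X, r) for X in sources]
        k = int(np.argmin([s[0] for s in fits]))  # best-fitting source
        f += nu * predict_stump(fits[k], sources[k])
        ensemble.append((k, fits[k]))
    return offset, ensemble

def predict(offset, ensemble, sources, nu=0.1):
    f = np.full(sources[0].shape[0], offset)
    for k, stump in ensemble:
        f += nu * predict_stump(stump, sources[k])
    return f

# Synthetic example: two "sensors" measured on the same 80 samples (assumed data).
rng = np.random.default_rng(0)
n = 80
X_a = rng.normal(size=(n, 3))  # hypothetical source A
X_b = rng.normal(size=(n, 2))  # hypothetical source B
y = np.sin(X_a[:, 0]) + 0.5 * X_b[:, 1] + 0.1 * rng.normal(size=n)

offset, ensemble = boost_sources([X_a, X_b], y)
pred = predict(offset, ensemble, [X_a, X_b])
mse_base = float(np.mean((y - y.mean()) ** 2))
mse_boost = float(np.mean((y - pred) ** 2))
print("training MSE, mean only:", mse_base)
print("training MSE, boosted:  ", mse_boost)
print("times each source was selected:",
      np.bincount([k for k, _ in ensemble], minlength=2))
```

Because each step selects the source whose base learner best fits the current residuals, the selection counts give a rough indication of how much each source contributes. In practice the number of iterations would be chosen by cross-validation rather than fixed in advance.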
Acknowledgements
The authors would like to thank the reviewers for constructive comments that helped to improve the quality of the article. Portions of this research were conducted with high-performance computational resources provided by the Louisiana Optical Network Infrastructure (http://www.loni.org).
Ethics declarations
Funding:
Not applicable.
Conflict of interest:
None.
Availability of data and material:
The authors will provide at least part of the data after acceptance.
Code availability:
The authors will make the sample code available online.
Cite this article
Li, B., Chakraborty, S., Weindorf, D.C. et al. Data Integration Using Model-Based Boosting. SN COMPUT. SCI. 2, 400 (2021). https://doi.org/10.1007/s42979-021-00797-0
DOI: https://doi.org/10.1007/s42979-021-00797-0