
Data Integration Using Model-Based Boosting

  • Original Research
  • Published in SN Computer Science

Abstract

The need for data integration is becoming ubiquitous and spans many disciplines, driven by technological developments in instrumentation. Combining information from distinct data sources in modeling, so as to improve prediction accuracy and obtain a holistic view of the problem, is a challenge for statisticians. In this paper, we present a flexible statistical framework for integrating various types of data from distinct sources through model-based boosting (IMBoost) with two types of base models: regression trees and penalized splines. The performance of IMBoost is illustrated through two recent studies in environmental soil science, in which multiple sensors were used to quantify several soil parameters. Empirical results are promising and show that the proposed algorithms substantially improve prediction performance by combining the strengths of distinct data sources. We also propose a surrogate model approach that allows IMBoost to handle situations in which some samples are missing from particular sources.
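To give a concrete sense of the idea behind IMBoost, the sketch below implements generic component-wise L2-boosting over two data sources with shallow regression-tree base learners: at each iteration, one base model is fitted per source to the current residuals, and only the best-fitting one is added to the ensemble with a shrinkage step. This is a minimal illustration under our own assumptions, not the authors' IMBoost implementation; the function boost_with_sources and the simulated sensor matrices are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_sources(sources, y, n_iter=200, shrinkage=0.1, max_depth=2):
    # Component-wise boosting: at each step, fit one shallow regression tree
    # per data source to the current residuals and keep only the tree that
    # reduces the squared error the most.
    f = np.full(len(y), y.mean())          # start from the mean response
    ensemble = []
    for _ in range(n_iter):
        residual = y - f                   # negative gradient of the L2 loss
        best = None
        for name, X in sources.items():
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            pred = tree.predict(X)
            sse = np.sum((residual - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, name, tree, pred)
        sse, name, tree, pred = best
        f += shrinkage * pred              # shrunken update of the fit
        ensemble.append((name, tree))      # remember which source was chosen
    return ensemble, f

# Toy example: two simulated "sensors" measuring the same soil samples.
rng = np.random.default_rng(0)
X_sensor_a = rng.normal(size=(100, 5))
X_sensor_b = rng.normal(size=(100, 3))
y = X_sensor_a[:, 0] + 0.5 * X_sensor_b[:, 1] + rng.normal(scale=0.1, size=100)
ensemble, fitted = boost_with_sources({"sensor_a": X_sensor_a, "sensor_b": X_sensor_b}, y)
print("training RMSE:", np.sqrt(np.mean((y - fitted) ** 2)))

In this toy setup, the record of which source each selected tree came from also indicates how much each sensor contributes to the final ensemble, which is one way integration of sources can be inspected.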





Acknowledgements

The authors would like to thank the reviewers for constructive comments that helped to improve the quality of the article. Portions of this research were conducted with high-performance computational resources provided by the Louisiana Optical Network Infrastructure (http://www.loni.org).

Author information


Corresponding author

Correspondence to Qingzhao Yu.

Ethics declarations

Funding:

Not applicable.

Conflict of interest:

None.

Availability of data and material:

The authors will make at least part of the data available upon acceptance.

Code availability:

The authors will make the sample code available online.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, B., Chakraborty, S., Weindorf, D.C. et al. Data Integration Using Model-Based Boosting. SN COMPUT. SCI. 2, 400 (2021). https://doi.org/10.1007/s42979-021-00797-0

