Abstract
Background
Identifying the risk factors associated with overweight is crucial for predicting future health risks and designing behavioral interventions. However, several issues in machine learning, such as how models should be cross-validated, still lack consensus, and the resulting models may suffer from overfitting or poor interpretability.
Methods
This study used nine common machine learning methods to construct overweight risk models. The target population was the general community, and a total of 10,905 Chinese participants from Ningde City, Fujian Province, southeast China, were included. The best model was selected through verification and validation and was then interpreted.
Results
The machine learning-based overweight risk models performed well. CatBoost, which has been used to construct clinical risk models, may outperform earlier machine learning methods. A visual display of the Shapley additive explanation (SHAP) values of the model variables accurately represented the influence of each variable in the model.
Conclusions
Machine learning may currently be the best approach for constructing an overweight risk model, and CatBoost may be the best machine learning method for this task. Furthermore, combining Shapley additive explanation (SHAP) with machine learning methods can be effective in identifying disease risk factors for prevention and control.
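To illustrate what a Shapley additive explanation assigns to each variable, the following is a minimal sketch that computes exact Shapley values for a toy risk score. It is not the study's model or pipeline: the function `toy_risk`, its coefficients, the feature values, and the reference individual are all hypothetical, chosen only to show that the per-variable contributions add up to the difference between an individual's prediction and a reference prediction. Exact enumeration is only feasible for a handful of variables; practical tools rely on faster algorithms for tree models.

```python
from itertools import combinations
from math import factorial, isclose


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point.

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        # Prediction when only the features in `coalition` take the values in x
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(p):
    # Hypothetical score (assumption): two linear terms plus one interaction
    return 0.04 * p["wc"] + 0.01 * p["sbp"] + 0.002 * p["wc"] * p["tg"]


# Hypothetical individual and reference values (WC in cm, SBP in mmHg, TG in mmol/L)
person = {"wc": 95.0, "sbp": 140.0, "tg": 2.5}
reference = {"wc": 80.0, "sbp": 120.0, "tg": 1.5}

phi = shapley_values(toy_risk, person, reference)
gap = toy_risk(person) - toy_risk(reference)

for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
print(f"sum of contributions: {sum(phi.values()):.3f}, prediction gap: {gap:.3f}")
```

In this sketch the contribution of `sbp` equals its own linear effect because it does not interact with the other variables, while the interaction between `wc` and `tg` is shared between those two variables.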
Data availability
The raw data supporting the conclusions of this article will be made available by the authors without undue reservation.
Abbreviations
- ANN/MLP: artificial neural network/multilayer perceptron
- AUC: area under the curve
- BMI: body mass index
- BP: blood pressure
- DM: diabetes mellitus
- DBP: diastolic blood pressure
- FBG: fasting blood glucose
- FINS: fasting insulin
- GBDT: gradient boosted decision tree
- GBM: gradient boosting machine
- GNB: Gaussian naive Bayes
- HDL-C: high-density lipoprotein cholesterol
- HOMA-IR: homeostasis model assessment of insulin resistance
- KNN: K-nearest neighbor
- LDL-C: low-density lipoprotein cholesterol
- PBG: postprandial blood glucose
- ROC: receiver operating characteristic
- SBP: systolic blood pressure
- SHAP: Shapley additive explanation
- SVM: support vector machine
- TC: total cholesterol
- TG: total triglyceride
- WC: waist circumference
- WHO: World Health Organization
Acknowledgements
The authors would like to thank the participants for providing the information used in this study and for kindly making arrangements for the data collection.
Funding
This study was supported by Fujian Research and Training Grants for Young and Middle-aged Leaders in Healthcare (Grant No. (2023)417#), the Innovation Project of Fujian Provincial Health Commission (2021CXA003), Natural Science Foundation of Fujian Province (Grant No. 2022J011017 and Grant No. 2020J011068), National Key Research and Development Program of China (2018YFC2001100-5), and Natural Science Foundation of China (82070878).
Author information
Contributions
W.L. and S.S. performed the formal analysis, devised the methodology, and wrote the original draft. H.L., H.H., and J.W. curated the data and resources. W.L. and G.C. were involved in the conceptualization, formal analysis, writing of the original draft, and project administration. G.C. is the guarantor of this manuscript, had full access to all the data in the study, and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Our study was performed in accordance with the Declaration of Helsinki and approved by The Ethics Committee of Fujian Provincial Hospital (approval no. K2019-06-032). All patients provided written informed consent prior to enrollment in the study.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, W., Shi, S., Lan, H. et al. Identification of influence factors in overweight population through an interpretable risk model based on machine learning: a large retrospective cohort. Endocrine 83, 604–614 (2024). https://doi.org/10.1007/s12020-023-03536-y
DOI: https://doi.org/10.1007/s12020-023-03536-y