Introduction

The pandemic and increasing trend of obesity-related cardiometabolic risk (CMR) factors are a public health challenge globally [1, 2]. Data from the China Health and Nutrition Survey (CHNS) in 2009 showed that there was a high prevalence of CMR factors in both children and adults [3], which imposes a tremendous burden on health care systems. Childhood CMR factors are highly likely to persist into adulthood and are associated with cardiovascular disease, diabetes, and mortality in the future [4,5,6,7,8]. Therefore, it is imperative to slow or reverse the increasing trend in the prevalence of CMR factors at an early stage of life [9, 10].

Diet is of paramount importance for the prevention of CMR factors [11]. Strong evidence from adults has shown that diets low in processed food, sugar-sweetened beverages (SSBs), and carbohydrate, and high in dairy and fish are associated with lower risks of cardiometabolic disorders including obesity, cardiovascular disease, diabetes, dyslipidemia, and hypertension [12, 13]. However, consumption of individual foods has not been demonstrated to be strongly predictive of CMR factors in children. A recent systematic review found significant associations between dietary intakes and obesity-related CMR factors in only 19% of the 81 included studies in children [14]. Previous studies are also limited by small sample sizes, cross-sectional design, or failure to adjust for important confounders. Although dietary patterns have been linked to CMR factors in children in some countries [15,16,17], they cannot readily be applied to other populations because a healthy dietary pattern derived in one study can hardly be reproduced in other studies. For example, a healthy diet pattern in one study was high in vegetables, fruits, and dairy [15], while a healthy pattern in another study was high in vegetables, fruits, fish, crackers, and bread [17]. Diet indices have also been developed for diet quality assessment; however, these indices have been shown to be weak predictors of CMR factors [18, 19]. Identifying the leading dietary determinants of changes in CMR factors using new methods is urgently needed to set intervention priorities for the prevention of CMR in children. It is also important to create a healthy diet score (HDS) to predict CMR factors in children.

We used machine learning techniques to identify leading dietary determinants for changes in CMR factors in children based on longitudinal data. We then created an HDS based on the identified leading determinants to predict CMR factors.

Methods

Participant selection

The present analysis was based on a multicenter, cluster randomized controlled trial; the full description of the study has been published elsewhere [20]. Briefly, the study was conducted in six capital or provincial capital cities: Beijing, Shanghai, Chongqing, Jinan, Harbin, and Guangzhou. Data were collected at both baseline (May 2009) and follow-up (May 2010). Children in the intervention group received nutrition lectures (knowledge, attitudes, and dietary habits) and took part in the Happy 10 program in either two 10-min sessions or one 20-min session per day (the program involves various physical activities such as games, dances, and gymnastics, which were designed to stimulate children to enjoy physical activity). A total of 9901 children from 390 classes within 38 schools were screened for eligibility. Among 9867 children who were assessed at baseline, 8572 were reassessed at follow-up. Participants whose dietary intakes were not assessed, those who fell in the top (3500 Kcal/day) or bottom (300 Kcal/day) percentile of total energy intake, and those who had missing values in all cardiometabolic measurements were excluded (n = 2896). A total of 5676 participants were included in the final analysis (Fig. 1).

Fig. 1 Flowchart for population selection

The study protocol was approved by the Ethical Review Committee of the National Institute for Nutrition and Food Safety, Chinese Centre for Disease Control and Prevention. Written informed consent was obtained from the next of kin, carers, or guardians of all participants.

Dietary assessment

Dietary intake was assessed using 24-h diet recalls for three consecutive days including two weekdays and one weekend day in children in grades 2–5. Interviews were conducted by trained investigators. During the interview, samples of local household dishes and utensils (different sizes of bowls, plates, and spoons) were displayed to the children. They were then shown pictures of common foods eaten in these dishes or utensils to indicate portion size consumed. The interviewer and the tutor would help children recall food intake at school while parents would help recall foods consumed at home.

A total of 1169 different food items were recorded among all children and were categorized into 26 groups according to nutrient content as follows: rice (boiled), wheat (such as steamed bun, noodles, and brans), refined grains (such as white breads, pizza, muffins, pancakes, and granola bar made by further processing grain powder), other cereals (such as corn, millet, and sorghum), fried foods, nuts and legumes (such as peanuts, walnuts, beans, and soybean products), starch roots and tubers, deep color vegetables, light color vegetables, edible fungi and algae (such as mushrooms, agaric, seaweed, and kelp), pickled vegetables (such as pickled mustard root, pickled sweet garlic, pickled cucumber, and pickled radish), fruits, pork, poultry, red meat other than pork (beef, lamb, other red meat), animal offal, processed meat (such as ham, beef jerky, and luncheon meat), seafood (such as fish, lobster, and crab), eggs, milk, yogurt, dairy product (such as milk powder and cheese), catsup and other sources, SSBs, candy and sugar, and dessert. Deep color vegetables were defined as those with a carotene content ≥ 500 μg/100 g and light color vegetables as those with a carotene content < 500 μg/100 g. Nutrient and energy intakes were calculated based on the China Food Composition [21]. The average amount of food and nutrient intake per day was calculated, and energy-adjusted food and nutrient consumption was computed as ([100 × weight in grams]/total energy intake in Kcal).
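The energy adjustment above amounts to expressing each food as grams per 100 Kcal of total energy. The following minimal sketch illustrates the arithmetic only; the function name and the example child are hypothetical and are not taken from the study data.

```python
def energy_adjusted_intake(weight_g: float, total_energy_kcal: float) -> float:
    """Return grams of a food (or nutrient) per 100 Kcal of total energy intake."""
    if total_energy_kcal <= 0:
        raise ValueError("total energy intake must be positive")
    return 100.0 * weight_g / total_energy_kcal


# Hypothetical child: mean daily intake of 150 g rice and 1500 Kcal total energy.
rice_density = energy_adjusted_intake(150.0, 1500.0)
print(rice_density)  # 10.0 g per 100 Kcal
```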

Confounders

Puberty status was recorded by investigators during the interview when physical examinations were conducted. Physical activity was assessed using a validated questionnaire in children, from which metabolic equivalent (MET) was calculated [22]. The questionnaire is a 39-item self-administered questionnaire that captures the number of days in the past week, times per day, and the number of minutes per time engaging in the physical activity. MET was calculated according to the assigned metabolic values for each specified physical activity. Birthweight, household income, parental education, and parental height and weight were reported by parents using a self-administered questionnaire.

Physical examinations and blood tests

Physical examinations and blood tests (10–14 h fasting beforehand) were performed at both baseline and follow-up following standardized procedures.

Height was measured to the nearest 0.1 cm and weight to the nearest 0.1 kg. Body mass index (BMI) was computed as weight in kilograms divided by the square of height in meters. Waist circumference (WC) was measured midway between the lowest rib and the superior border of the iliac crest on expiration to the nearest 0.1 cm and the average of two measurements was used.

Body composition was assessed using a single frequency (50 Hz) hand-to-foot bioelectrical impedance device (ImpDF50, Impedimed Pty Ltd., Qld, Australia). Body fat mass was computed using the prediction formula developed by Deurenberg et al. [23] and percent body fat (PBF) was calculated as fat mass divided by body weight.

Blood pressure was measured in the seated position using a mercury sphygmomanometer (XJ300/40–1, Made in Shanghai) by trained nurses after at least 10 min of rest. The first and the fifth Korotkoff sounds were used to represent the systolic and diastolic blood pressure (SBP and DBP). Three measurements were taken to the nearest 2 mmHg and the average of the last two measurements was used. Mean arterial pressure (MAP) was calculated as DBP + 0.33 × (SBP − DBP).
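As an illustration of the blood pressure summary described above, the sketch below averages the last two of three readings and applies the MAP formula. The function names and the readings are hypothetical.

```python
def average_last_two(readings):
    """Average the second and third of three consecutive readings (mmHg)."""
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    return (readings[1] + readings[2]) / 2


def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """MAP = DBP + 0.33 x (SBP - DBP), as defined in the text."""
    return dbp + 0.33 * (sbp - dbp)


# Hypothetical readings for one child.
sbp = average_last_two([104, 100, 102])  # 101.0
dbp = average_last_two([66, 64, 62])     # 63.0
print(round(mean_arterial_pressure(sbp, dbp), 2))
```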

Fasting glucose was measured using the glucose oxidase method (Daiichi Pharmaceutical Co., Ltd., Tokyo, Japan) within four hours after the fasting blood sample was obtained. Fasting insulin was measured using the immunoenzymatic method (analyzer AXSYM, Abbott Co., Ltd., Japan). The homeostatic model assessment of insulin resistance (HOMA-IR) was computed as (fasting insulin [μU/mL] × fasting glucose [mg/dL])/405.
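The HOMA-IR calculation can be sketched as follows. The sketch assumes insulin is expressed in μU/mL, the conventional unit paired with the constant 405 and glucose in mg/dL; the function name and example values are hypothetical.

```python
def homa_ir(fasting_insulin_uU_per_mL: float, fasting_glucose_mg_per_dL: float) -> float:
    """HOMA-IR = insulin (assumed uU/mL) x glucose (mg/dL) / 405."""
    return fasting_insulin_uU_per_mL * fasting_glucose_mg_per_dL / 405.0


# Hypothetical child with insulin 8 uU/mL and glucose 90 mg/dL.
print(round(homa_ir(8.0, 90.0), 2))
```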

Conventional enzymatic assays were used to measure serum levels of triglycerides (TG), total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), and low-density lipoprotein cholesterol (LDL-C) on a 7080 Automatic Analyzer (Daiichi Pharmaceutical Co., Ltd., Tokyo, Japan).

Statistical analysis

BMI, WC, PBF, SBP, DBP, MAP, TG, TC, HDL-C, LDL-C, TG to HDL-C ratio, fasting glucose, insulin, and HOMA-IR were standardized (i.e. Z scores were calculated: Z = (value−mean)/SD using sex- and age-specific means and SDs). CMR score (CMRS) was calculated by summing Z scores of WC, the average of SBP and DBP, fasting glucose, HDL-C (multiplying by − 1), and TG [24].
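A minimal sketch of the standardization and the CMRS sum is given below. It rests on two assumptions that the text does not settle: the blood pressure component is taken as the mean of the SBP and DBP Z scores, and the sample SD is used. In practice `z_scores` would be applied separately within each sex and age stratum; all names and values here are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values within one sex- and age-specific stratum."""
    m, s = mean(values), stdev(values)  # sample SD is an assumption
    return [(v - m) / s for v in values]


def cmr_score(z_wc, z_sbp, z_dbp, z_glucose, z_hdl, z_tg):
    """Sum of component Z scores; HDL-C is multiplied by -1."""
    z_bp = (z_sbp + z_dbp) / 2  # assumption: mean of the two Z scores
    return z_wc + z_bp + z_glucose - z_hdl + z_tg


# Hypothetical waist circumferences (cm) for three children in one stratum.
print(z_scores([60, 62, 64]))                 # [-1.0, 0.0, 1.0]
print(cmr_score(1.0, 0.5, -0.5, 0.0, 1.0, 0.5))  # 0.5
```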

We randomly selected 50% of all participants in the intervention study as training data and used the remainder as testing data. We used three established machine learning models, namely multiple linear regression, random forest, and gradient boosting machine (GBM), to analyze the importance of 26 baseline dietary predictors for the change in CMRS based on the training data, and compared the performance of these models on the testing data. For the multiple linear regression model, we specified a Gaussian family distribution. The hyper-parameters lambda and alpha specify the regularization strength and the distribution of the regularization between the L1 (LASSO) and L2 (ridge regression) penalties, respectively. The random forest algorithm is a supervised learning algorithm that constructs an ensemble of decision trees from randomly bootstrapped samples of the data and averages the predictions of its trees [25]. It applies a bagging method to combine multiple decision trees generated from subsets of the data, which reduces correlations among the constituent decision trees. In this study, we used the R-square to determine the best predicting variable and location for each tree split in our algorithm. We grew the forest with 500 trees and implemented a grid search to obtain optimal parameters, including the number of variables randomly sampled as candidates at each split and the maximum depth of each tree (effectively the number of interactions considered in the model).
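The split-and-evaluate design can be illustrated in plain Python. This is only a sketch of a reproducible 50/50 split and of the R-square used to compare models on held-out data; it does not reproduce the h2o models, and the seed and function names are hypothetical.

```python
import random


def split_half(ids, seed=2009):
    """Randomly assign half of the participants to training, the rest to testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


def r_squared(observed, predicted):
    """Coefficient of determination on the testing data."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


train_ids, test_ids = split_half(range(5676))
print(len(train_ids), len(test_ids))  # 2838 2838
```

A model that predicts the mean for every child yields an R-square of 0 under this definition, which gives a natural baseline when comparing the three models.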

GBM belongs to a family of machine learning approaches that use a boosting ensemble method. An ensemble of decision trees was constructed using a weighted average of trees, with more weight given to those with better performance [26]. GBM converts a weak learning algorithm into a strong one by iteratively minimizing a loss function. We fitted 500 trees and also conducted a grid search for model optimization, with the maximum number of models, the maximum depth of each tree, the learning rate, the row sample rate per tree, and the column sample rate as hyper-parameters. Five-fold cross-validation was applied to check whether the models were overfitting. Regularization was conducted, and optimal parameters were used in modeling (Table S1). We implemented these models in the statistical software R 3.4.1 (h2o package). Leading dietary factors were obtained according to their contribution derived from the machine learning method with the best performance.
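Five-fold cross-validation partitions the training data into five folds and holds each fold out once. The sketch below shows only how such folds could be generated; it is not the h2o implementation, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n, k=5, seed=1):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        held_out = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out


for training, held_out in k_fold_indices(2838):
    print(len(training), len(held_out))
```

A large gap between the training and held-out performance across folds would point to overfitting.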

A healthy diet score (HDS) was computed by summing sub-scores, with each of the leading dietary predictors contributing one point according to its association with CMRS. For example, if fruit intake was inversely associated with CMRS, an intake above the median was scored as 1 and an intake equal to or below the median as 0. We also calculated an HDS by summing weighted sub-scores according to the contribution of the corresponding dietary predictors derived from the machine learning method. ANOVA for continuous variables and Chi-square tests for categorical variables were performed to compare baseline characteristics across HDS levels.
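The median-split scoring can be sketched as follows. The sketch assumes that a food positively associated with CMRS earns its point for an intake below the median; the function name, the two foods, and the intakes are hypothetical.

```python
from statistics import median


def binary_subscores(intakes, healthy_when_high):
    """Score 1 for the favorable side of the median, otherwise 0."""
    cut = median(intakes)
    if healthy_when_high:
        return [1 if x > cut else 0 for x in intakes]
    return [1 if x < cut else 0 for x in intakes]


# Hypothetical energy-adjusted intakes for four children.
seafood = binary_subscores([0, 5, 10, 20], healthy_when_high=True)
ssbs = binary_subscores([250, 0, 100, 0], healthy_when_high=False)
score = [a + b for a, b in zip(seafood, ssbs)]
print(score)  # [0, 1, 1, 2]
```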

Since the interaction between intervention/sex and HDS for changes in most CMR factors was not significant (Table S2 and S3), we did the analysis for the association between HDS and CMR factors in the whole population.

The general linear regression model (GLM) was used to test the difference in changes in CMR factors between participants with different HDS. We tested the following models: 1) classes in schools were adjusted for as random effects and characteristics of the individuals including age, sex, and corresponding CMR factor at baseline as fixed effects; 2) model 1 plus intervention group, grade, puberty, BMI, physical activity, and intake of energy, fiber, vegetable, fruit, pork, legumes, and nuts at baseline; 3) model 2 plus birth weight, breastfeeding, household income, parental BMI and education. We used the Benjamini-Hochberg procedure to control the false discovery rate at a level of 5% for multiple comparisons [27]. Bonferroni P-value adjustments were performed for all pairwise comparisons. The association between change in HDS and changes in CMR factors was also tested using GLM. Changes in CMR factors were calculated by subtracting the results at baseline from those at follow-up. HDS at follow-up was calculated based on the nine leading dietary determinants, and improved HDS referred to an increase in HDS (HDS at follow-up minus HDS at baseline). For individual CMR factors, a standardized mean difference of 0.2, 0.5, and ≥ 0.8 represents a small, medium, and large effect size, respectively. As CMRS is the sum of the Z scores of five components, a standardized mean difference of 1.0, 2.5, and ≥ 4.0 represents a small, medium, and large effect size, respectively [28].
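The false discovery rate control named above is a step-up procedure: the P values are ranked, and all hypotheses up to the largest rank i with p(i) ≤ i × 0.05/m are rejected. The sketch below is a generic illustration with hypothetical P values, not the SAS routine used in the analysis.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            max_rank = rank  # largest rank that meets its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected


# Hypothetical P values for four CMR factors.
print(benjamini_hochberg([0.041, 0.001, 0.205, 0.008]))
# [False, True, False, True]
```

Because the procedure is step-up, a P value that misses its own threshold is still rejected when a larger P value meets a later one.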

We did an interaction analysis to examine whether the association between HDS and CMRS was modified by sex, grade, birthweight, household income, parental BMI, and parental education.

We repeated the analysis for the association between HDS and changes in CMR factors in children in the control group. We also did external validation of our HDS in children aged 6–13 years from CHNS with diet and physical examinations measured in two or more surveys.

All analyses other than the machine learning modeling were performed using SAS version 9.4 (SAS Institute Inc.), and all P values were two-sided.

Results

We included 5676 children (50.5% girls) aged 6–13 years (mean ± SD: 9.54 ± 1.19) in the final analysis. HDS was inversely associated with age, BMI, WC, PBF, and DBP at baseline and positively associated with TC, HDL-C, and LDL-C at baseline. There was no significant association between HDS and CMRS at baseline. Higher HDS was associated with lower intake of energy, carbohydrate, fat, fiber, and iron and higher intake of protein, vitamin C, vitamin E, and carotene (Table 1).

Table 1 Baseline characteristics by healthy diet score

Importance of contributors to CMRS

Random forest had a higher R-square for CMRS than the other two machine learning models (Table S4). Figure 2 depicts the leading predictors for CMRS as derived from random forest. The nine leading predictors for CMRS were refined grains, seafood, fried foods, SSBs, wheat, red meat other than pork, rice, fungi and algae, and roots and tubers, with contributions ranging from 3.9 to 19.6% of the total variance. These leading predictors were consistent with those identified by GBM and GLM (Table S5).

Fig. 2 Leading dietary determinants for changes in cardiometabolic risk scores in children. This figure shows the contribution of the total variance in percentage by leading dietary determinants (selected from 26 food groups). Machine learning models including general linear regression model, random forest, and gradient boost machine were used to analyze the importance of dietary predictors for CMRS. Random forest had the highest prediction performance and this figure shows the leading dietary determinants derived from the random forest

Dietary intakes and CMRS and healthy diet score

Diets low in refined grains, fried foods, SSBs, wheat, fungi and algae, and roots and tubers and high in seafood, rice, and red meat other than pork were associated with a favorable change in CMRS (Table 2). HDS was then computed by summing sub-scores, with each of the nine leading dietary factors contributing one point according to its association with CMRS: refined grains (<median), seafood (>median), fried foods (<median), SSBs (<median), wheat (<median), red meat other than pork (>median), rice (>median), fungi and algae (<median), and roots and tubers (<median). HDS ranged from 0 to 9, with a higher score representing a healthier diet. HDS was also calculated by summing weighted sub-scores according to the contribution derived from the random forest. The maximum sub-score of 1 was assigned to an intake of 0 (refined grains, fried foods, SSBs, wheat, fungi and algae, and roots and tubers) or an intake above the 80th percentile (seafood, rice, and red meat other than pork), whereas the minimum sub-score of 0 was assigned to an intake above the 80th percentile (refined grains, fried foods, SSBs, wheat, fungi and algae, and roots and tubers) or an intake of 0 (seafood, rice, and red meat other than pork). Sub-scores for intakes between these two levels were prorated linearly between 0 and 1. The sub-scores were then weighted by multiplying them by the contribution (percentage of the total variance of the nine dietary predictors) of the corresponding dietary predictors (Table S6).
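The weighted scoring can be sketched as below. The weights shown are hypothetical placeholders rather than the values in Table S6, and the sketch assumes the weights are rescaled to sum to 1, so that the resulting score lies between 0 and 1; the function names and the 80th percentile cut-offs are likewise illustrative.

```python
def prorated_subscore(intake, p80, healthy_when_high):
    """Linear sub-score between an intake of 0 and the 80th percentile."""
    if p80 <= 0:
        raise ValueError("80th percentile must be positive")
    fraction = min(max(intake / p80, 0.0), 1.0)  # capped at the 80th percentile
    return fraction if healthy_when_high else 1.0 - fraction


def weighted_hds(subscores, weights):
    """Sum of sub-scores weighted by each food's share of the contribution."""
    total = sum(weights[food] for food in subscores)
    return sum(subscores[food] * weights[food] / total for food in subscores)


# Hypothetical intakes, cut-offs, and weights for three of the nine foods.
subscores = {
    "seafood": prorated_subscore(4.0, 8.0, healthy_when_high=True),          # 0.5
    "ssbs": prorated_subscore(0.0, 12.0, healthy_when_high=False),           # 1.0
    "refined_grains": prorated_subscore(10.0, 5.0, healthy_when_high=False), # 0.0
}
weights = {"seafood": 0.2, "ssbs": 0.3, "refined_grains": 0.5}
print(round(weighted_hds(subscores, weights), 2))
```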

Table 2 Changes in cardiometabolic risk score during follow-up associated with dietary intakes at baseline

Healthy diet score and CMR factors

High HDS at baseline was associated with favorable changes in CMRS, BMI, PBF, SBP, DBP, MAP, HDL-C, fasting glucose, insulin, and HOMA-IR. There was a positive association between HDS at baseline and changes in TC and LDL-C (Table 3). High weighted HDS at baseline was associated with favorable changes in CMRS, BMI, PBF, SBP, DBP, MAP, HDL-C, fasting glucose, insulin, and HOMA-IR (Table S7). Improved HDS was associated with favorable changes in BMI, SBP, DBP, MAP, fasting glucose, insulin, HOMA-IR, and CMRS (Table S8).

Table 3 Changes in cardiometabolic risk factors during follow-up associated with Healthy Diet Score at baseline

Moderation analysis

The inverse association between HDS and CMRS was stronger in children whose parents had higher education (Fig. 3). No significant interaction between HDS and sex, grade, birthweight, household income, or parental BMI for change in CMRS was observed.

Fig. 3 Associations between healthy diet score and changes in cardiometabolic risk score modified by parental education. CMRS, cardiometabolic risk score; SD, standard deviation. The general linear regression model was used to test the interaction, adjusted for classes in schools as random effects and for characteristics of the individuals including age, sex, intervention, grade, puberty, BMI, physical activity, CMRS, and intake of energy, fiber, vegetable, fruit, pork, legumes, and nuts at baseline, birth weight, breastfeeding, household income, and parental BMI and education. We examined whether the association between healthy diet score and CMRS was modified by sex, grade, birthweight, household income, parental BMI, and parental education, and a significant interaction was observed only between healthy diet score and parental education. * indicates a significant association between healthy diet score and change in CMRS

Sensitivity analysis

High HDS at baseline was associated with favorable changes in CMRS, PBF, SBP, DBP, MAP, HDL-C, TG, insulin, and HOMA-IR in the control group (Table S9).

External validation

We examined the association of HDS with available CMR factors among 4530 children aged 6–13 years from CHNS. Baseline HDS was inversely associated with BMI, WC, SBP, DBP, and MAP at baseline and the change in DBP (Table S10).

Discussion

In this longitudinal analysis of a large sample of children, we found that the nine leading dietary determinants of CMRS were diets low in refined grains, fried foods, SSBs, wheat, fungi and algae, and roots and tubers, and high in seafood, rice, and red meat other than pork. We created an HDS based on these leading determinants, and it was a strong predictor of changes in 10 out of 14 CMR factors examined. The inverse association between HDS and CMRS was stronger in children whose parents had higher education. The predictive ability of our HDS for several CMR factors was validated in children from CHNS.

Previous studies have shown that diets with a high glycemic index are associated with high CMR in children [29]. Our findings agree with these studies, showing that high consumption of refined grains, fried foods, SSBs, roots and tubers, or wheat was associated with a greater increase in CMRS. The positive association between SSBs intake and CMR factors has been reported in many studies [16, 17, 30]. Although refined grains and wheat have not previously been linked to CMR factors in children, their harmful effect on CMR has been reported in adults [31]. Fried foods have a lower nutrient content and a higher energy density than boiled or steamed foods [32, 33]. Several cohort studies in adults showed that higher consumption of fried foods was associated with an increased risk of obesity, type 2 diabetes, and cardiovascular diseases [34, 35]. Our study supports the dietary guidelines stating that diets with a low glycemic index are beneficial for the prevention of CMR factors.

Seafood, rice, and red meat other than pork are major sources of protein, which plays an important role in child growth and development. High consumption of fish has been recommended for the prevention of CMR factors in adults because fish is rich in protein, omega-3 fatty acids, and minerals [12, 13]. Our further analysis shows that the intake of red meat other than pork was significantly associated with the intake of only two (pork and milk) of the other 25 food groups (Table S11). The intake of pork and milk was not a significant predictor of CMRS, suggesting that the intake of red meat other than pork was independently associated with CMRS. Processed but not unprocessed red meat was associated with an increased risk of obesity and related CMR in previous studies, suggesting that unprocessed red meat other than pork may be considered part of a healthy balanced diet in children given its high protein content [36]. In contrast to the harmful association observed for high wheat intake, high rice (white and brown) intake was associated with a beneficial change in CMRS in our study, which may be partly due to the difference in nutrient composition between rice and wheat [21]. The divergent associations of rice and wheat intake with CMRS may also be attributable to the fact that they are associated with different dietary patterns. For example, we found that high rice intake was associated with a higher intake of vegetables, fish, and poultry and a lower intake of fried foods, beverages, refined grains, and edible fungi and algae (Table S12). In contrast, high wheat intake was associated with a higher intake of fried foods, refined grains, and roots and tubers, and a lower intake of fish, pork, milk, and vegetables (Table S13). Our findings highlight the importance of high consumption of seafood, rice, and red meat other than pork for the prevention of CMR factors in children.

The association between mushroom consumption and CMR factors is inconsistent across studies, with the largest longitudinal study showing no significant association [37, 38]. The association of edible fungi and algae with CMR is less well known. The positive association of fungi and algae with CMR factors in our study may be partly attributed to the harmful constituents in some of them [39]. However, more research is needed to confirm our findings.

Dietary patterns derived by a posteriori methods including principal component analysis, cluster analysis, and latent class analysis have been linked to CMR factors in children [15,16,17, 40]. These studies showed that a Western dietary pattern, a high energy-dense pattern, or a sweet dietary pattern was associated with high CMR [15,16,17], whereas a vegetable and wholemeal pattern was associated with favorable changes in CMR factors [40]. Although these findings may indicate which food groups are associated with CMR factors, these dietary patterns can hardly be reproduced in other studies. In contrast, a priori patterns based on dietary guidelines can be applied to different studies, and the findings are comparable [18]. An inverse association between adherence to the Dietary Approaches to Stop Hypertension (DASH) dietary pattern or the Mediterranean pattern and CMR has been observed in some studies [41, 42], but not in others [18, 43]. Therefore, an efficient evidence-based HDS for predicting CMR factors in children is urgently needed. Our HDS, created based on the leading determinants of CMRS, was strongly associated with 10 out of 14 CMR factors. Although no significant association of baseline HDS with baseline CMRS was observed, both high baseline HDS and improved HDS were associated with favorable changes in most CMR factors in our study. Furthermore, validation analysis in children from CHNS showed that higher baseline HDS was associated with lower BMI, WC, SBP, DBP, and MAP in the cross-sectional analysis, but only with a lower increase in DBP in the longitudinal analysis. The weak association between HDS and CMR factors in the longitudinal analysis might be due to the small variation in HDS and the small available sample size in some HDS subgroups. We found that the combination of higher parental education and higher HDS was associated with a greater decrease in CMRS, which suggests that the involvement of well-educated parents is important and that children whose parents have low education are in greater need of care.

The strengths of the present study included the large sample size and the measurement of multiple CMR factors and dietary intakes at both baseline and follow-up. To our knowledge, this is the first study to identify leading dietary determinants of CMRS in children using machine learning techniques. We also created an HDS based on medians of leading determinants that was strongly predictive of most CMR factors. This score was also validated in children from CHNS. The study has several limitations. Firstly, 24-h dietary recalls do not account for seasonal variation in dietary intakes, especially of fruits and vegetables. However, the dietary intakes are comparable between individuals given that all data were collected in May. Secondly, our HDS was validated in a Chinese population, and whether HDS is predictive of CMR needs to be examined in other ethnic groups. The validation dataset is also limited by having only several CMR factors measured at baseline and follow-up, which made it impossible to compute CMRS. Thirdly, several food items such as algae and fungi are not frequently consumed outside Asia; therefore, the HDS may not be directly applicable to these populations. However, nuts, milk, and vegetables (the 11th, 12th, and 13th leading predictors in our study) may be included in the calculation of the HDS in place of these food items. Fourthly, the importance of an individual food for CMRS derives partly from that food and partly from other foods with which it is correlated, which was not accounted for in our machine learning analysis. Fifthly, the follow-up period of our study (one year) is relatively short for judging the effect of dietary factors on change in CMR; therefore, longitudinal studies with long-term follow-up are needed to confirm our findings. Finally, because of the observational nature of the analysis in the present study, causal relations could not be established based on our findings.

Conclusions

Diets high in seafood, rice, and red meat other than pork and low in refined grains, fried foods, SSBs, wheat, fungi and algae, and roots and tubers are the leading healthy diet factors for changes in CMR factors in children. An HDS based on these leading dietary determinants is strongly predictive of CMR factors.