Background

Adolescence is defined as a unique decade of human development by the World Health Organization. It is a life stage when growth spurts, puberty changes and the major non-communicable diseases (NCDs) start or are reinforced [1, 2]. However, the Lancet Commission on Adolescent Health and Well-being indicated that global health and social policy have overlooked adolescent health, resulting in fewer health improvements compared to other age groups [3, 4]. Among the various health concerns during adolescence, weight problems are particularly prevalent, with obesity remaining a serious health challenge in many countries.

Overweight and obesity are strongly associated with NCDs and are considered decisive risk factors for premature mortality and physical morbidity in later life. Notably, 80% of obese adolescents remain obese in later adulthood [5, 6]. On the other hand, being underweight in adolescence is associated with psychiatric disorders [7], osteoporosis [8], scoliosis [9], and pubertal delay [10]. In Hong Kong, the prevalence of overweight and obesity among 9-to-12-year-old students increased from 20% in 1999/2000 to 25% in 2008/9 [11]. The prevalence remained high at 24.1% during the COVID-19 pandemic in 2020, which was largely attributed to lifestyle changes [12]. Additionally, 20.5% of 12-to-18-year-old students reported being mildly or severely underweight in 2007 [13]. Therefore, controlling weight problems during adolescence is a paramount public health issue.

Studies have suggested that weight problems during adolescence can often be prevented by strategies that are more cost-effective than clinic-based weight-loss programs [14]. Early intervention is crucial to control the adolescent obesity epidemic [15]. Thus, NCD-related health behaviors, such as weight management among adolescents, deserve more attention to prevent future disease development [3]. Predictive models that can accurately classify a child’s future weight status would be valuable tools for tackling child and adolescent weight problems early.

While logistic regression (LG) has traditionally been used to predict adolescent weight status, it is limited to binary outcomes and to a specific structural form of the predictors, which may result in suboptimal prediction accuracy [16, 17]. In contrast, machine learning (ML) algorithms can accommodate multiclass outcomes and can capture the complex interrelationships among predictors, and thus may improve prediction accuracy [18]. As a result, ML models have become increasingly popular. However, the latest review of ML models revealed that many studies considered only cross-sectional classification rather than temporal prediction [18, 19]. Moreover, most temporal prediction models used birth, infant, or parental measurements to predict overweight or obesity in the early childhood period [19, 20]. Only one study derived a deep learning prediction model for adolescents, but it focused only on predicting obesity for three subsequent years [21]. Thus, there has been neither a prediction model that utilizes ML to predict multiclass weight statuses (underweight, normal, overweight, and obese) for more than three years in adolescence, nor an ML-based prediction model of weight status for the Chinese population.

Therefore, we aimed to develop ML models to predict weight status in children, which can assist health professionals in identifying those who are at risk of developing weight problems. We evaluated the performance of these models in a large population-based cohort of children in Hong Kong, and validated them in an independent cohort. We also assessed the relative importance of the predictors to provide more evidence for early intervention practices targeting weight problems.

Methods

Design and setting

We conducted a retrospective cohort study of Primary 4 (P4) students from the 1995/1996 to 2015/2016 academic cohorts, who were followed until Secondary 6 (S6, Grade 12 in the US). P4 students are cognitively competent to provide self-reported measurements [22]. Additionally, we chose a cohort of Primary 6 (P6) students from the 1995/1996 to 2013/2014 academic cohorts to predict weight status after P6, the last year of primary education in Hong Kong before students are promoted to the secondary level. Students who attended in at least two years and had complete health measurement records were included. Data were obtained from the Student Health Service (SHS) of the Department of Health in Hong Kong, which has provided voluntary territory-wide annual health assessment services for primary and secondary students since 1995/1996. The health assessment questionnaire changed in 2015/2016 [23]. Therefore, we included P4 students from 1995/1996 to 2014/2015, allowing at least one year of follow-up prediction. Further details of the health assessment scheme can be found elsewhere [24, 25].

Potential predictor variables

Weight (to the nearest 0.1 kg) and height (to the nearest 0.1 cm) were measured annually at the SHS by well-trained healthcare workers or nurses according to the study protocol. Demographics included sex, age and family socioeconomic level. Family socioeconomic status was indicated by parental educational level, parental occupation and the type of housing [26].

Dietary habits were assessed by “breakfast eating habit,” “sweetness preference during past 7 days,” “junk food intake habit,” “fruit/vegetable intake,” and “milk consumption habit”. Physical activity behaviors were assessed by “frequency of aerobic exercise each week,” “hours of doing aerobic exercise each week,” and “daily hours of TV viewing”. All of these predictors in the structured questionnaires had four response options representing different degrees of frequency or duration. Breakfast habits were assessed by the item ‘I usually have breakfast at?’, for which we considered three response categories: (i) ‘home’, representing frequently eating at home; (ii) ‘rarely at home’, after combining the original categories of ‘fast food stall/cafeteria/restaurant’ and ‘some other places’; and (iii) ‘no breakfast at all’, representing never eating at home. Thus, this item can be considered an assessment of the frequency of breakfast eating at home.

Psychological development was assessed using the 60-item self-reported Culture Free Self-Esteem Inventory for Children Questionnaire (CFSEI-2), which has been validated in Hong Kong children and adolescents [27, 28]. The Self-Esteem Inventory (SEI) comprises a total score and four domain scores: (i) ‘general self-esteem’, denoting children’s overall perception of themselves, for which a score ≤ 7 was considered “very-low”; (ii) ‘social self-esteem’, denoting children’s perception of their peer relationships; (iii) ‘school-related self-esteem’, denoting children’s perception of their ability to achieve academic success; and (iv) ‘parent-related self-esteem’, denoting children’s perception of their family’s thoughts. Scores ≤ 2 in any of the latter three subscales were considered “very-low” [27]. Children with a total score ≤ 19 or a “very-low” score in any domain were considered to have low self-esteem. A lie scale score was also obtained, and a score ≤ 2 indicated that the child’s self-reported assessment was unreliable [27].
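The scoring rules above can be written out as a short sketch. The class and function names are hypothetical and introduced only for illustration; the thresholds are the ones stated in the text, and the example scores are made up.

```python
from dataclasses import dataclass


@dataclass
class SEIScores:
    """Hypothetical container for one child's CFSEI-2 scores."""
    total: int
    general: int
    social: int
    school: int
    parent: int
    lie: int


def is_reliable(s: SEIScores) -> bool:
    # A lie scale score <= 2 marks the self-report as unreliable.
    return s.lie > 2


def has_low_self_esteem(s: SEIScores) -> bool:
    # "Very-low" is <= 7 for the general domain and <= 2 for the other three.
    very_low_general = s.general <= 7
    very_low_other = any(v <= 2 for v in (s.social, s.school, s.parent))
    # Low self-esteem: total score <= 19 or a "very-low" score in any domain.
    return s.total <= 19 or very_low_general or very_low_other


# Made-up example: adequate total score, but a very-low general domain score.
child = SEIScores(total=30, general=6, social=5, school=5, parent=5, lie=4)
print(is_reliable(child), has_low_self_esteem(child))
```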

Potential behavioral problems of children and adolescents were assessed using the 31-item Rutter Behavior Questionnaire (RBQ), which has been validated in Hong Kong children [29]. It inquired about hyperactivity, conduct, and emotional disturbances and was completed by parents. An RBQ total score ≥ 19 indicated a potential behavior problem [30]. In total, 25 predictors were considered as input variables in developing multiclass prediction models.

Prediction outcome

The predicted weight status was classified as normal, underweight, overweight, or obese, based on the body mass index (BMI, expressed in kg/m²) in the next measurement year and the age- and sex-specific BMI references of the International Obesity Task Force (IOTF) standards.
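As a minimal sketch of this classification step, BMI is weight in kilograms divided by the square of height in metres, and the resulting value is compared with three cut-offs. The cut-off values below are made-up placeholders, not IOTF values, which vary by age and sex; the function names and the handling of values exactly at a cut-off are also assumptions.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def classify_weight_status(bmi_value: float, cutoffs: tuple) -> str:
    """Map a BMI value to one of four weight statuses.

    cutoffs = (underweight_upper, overweight_lower, obese_lower) for the
    child's age and sex. Boundary handling here is an assumption.
    """
    underweight_upper, overweight_lower, obese_lower = cutoffs
    if bmi_value < underweight_upper:
        return "underweight"
    if bmi_value >= obese_lower:
        return "obese"
    if bmi_value >= overweight_lower:
        return "overweight"
    return "normal"


# Placeholder cut-offs for illustration only (NOT the IOTF reference values).
PLACEHOLDER_CUTOFFS = (14.0, 20.0, 24.0)

value = bmi(weight_kg=32.0, height_cm=140.0)
print(round(value, 1), classify_weight_status(value, PLACEHOLDER_CUTOFFS))
```

In practice the cut-off triple would be looked up from the published reference table for each child's age and sex.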

Data preparation

Children with a lie scale score ≤ 2 were considered to have unreliable self-reports and were removed. For the type of housing and parental occupation, we ordered the response categories by socioeconomic level, using the median monthly domestic household income for each type of housing and occupation obtained from the Hong Kong Census and Statistics Department. Sex, a categorical variable, was one-hot encoded. The responses to the dietary and physical activity behavioral measurements were treated as ordinal variables, and the other predictors were considered continuous variables. Missing data on socioeconomic status were filled in according to the information reported in the student’s other assessment years. The other measurements had less than 5% missing data, which was considered inconsequential to the validity of the model development [31]. We applied the k-nearest neighbour imputation algorithm to the training and test sets separately to facilitate the use of ML methods that require complete data [32].
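The imputation step can be illustrated with scikit-learn's KNNImputer. This is a sketch under stated assumptions rather than the study's code: the number of neighbours, the helper name impute_knn, and the toy values (weight, height, and one ordinal item) are all chosen for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: columns are weight (kg), height (cm), and one ordinal item.
X_train = np.array([
    [30.0, 135.0, 2.0],
    [32.0, np.nan, 3.0],
    [45.0, 150.0, np.nan],
    [28.0, 130.0, 1.0],
    [50.0, 152.0, 0.0],
])
X_test = np.array([
    [31.0, np.nan, 2.0],
    [48.0, 151.0, np.nan],
    [29.0, 132.0, 1.0],
])


def impute_knn(X: np.ndarray, n_neighbors: int = 2) -> np.ndarray:
    """Fill each missing entry with the mean of that feature among the
    nearest rows (distance computed on the features both rows have)."""
    return KNNImputer(n_neighbors=n_neighbors).fit_transform(X)


# Imputed separately for the training and test sets, as described above.
X_train_imp = impute_knn(X_train)
X_test_imp = impute_knn(X_test)
print(np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any())
```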

Data analysis

Categorical data were expressed as numbers with percentages for each weight status and compared using the chi-square test. Numerical data were presented as the mean ± standard deviation (SD).

Multiclass prediction models development

P4 students were randomly divided into a training set and a test set at an 80:20 ratio. Multiclass prediction models were developed using the P4 training data to predict weight status in each subsequent year until S6, creating eight prediction windows. We used the same procedure to develop prediction models for the P6 training cohort, creating six prediction windows until S6.

The weight status in our cohorts was imbalanced, with the underweight, overweight and obese categories being underrepresented. Such imbalance can lead to biased model performance, where a model predicts the majority weight status accurately but performs poorly on the minority weight statuses. To address this issue, we applied the Synthetic Minority Oversampling Technique (SMOTE) to the training sets [33]. SMOTE is a widely used technique that creates synthetic samples for the minority categories by generating new instances that are similar to the original underrepresented samples.

We attempted several ML approaches, including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and eXtreme Gradient Boosting (XG Boost), as well as the LG approach for comparison. The short- and long-term prediction abilities of the models were compared by calculating the correct classification rate (overall accuracy) on the test set and the micro- and macro-averaging area under the curve (AUC). Receiver operating characteristic (ROC) curves for each weight status on the test set were also obtained. The AUC, precision, recall and F1-score were calculated to evaluate the prediction accuracy of the models and to assess their ability to predict an abnormal weight status. Precision and recall are conceptually equivalent to the positive predictive value and sensitivity, respectively, and the F1-score is the harmonic mean of precision and recall [34]. For predicting a specific weight status, all accuracy measures range from 0 to 1, with a higher value indicating a higher accuracy.
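Two of the steps described above, oversampling the minority categories and computing per-class precision, recall and F1-score, can be sketched as follows. The oversampling function is a simplified illustration of the interpolation idea behind SMOTE, not the reference implementation used in the study; the function names, the number of neighbours, and the toy data are assumptions.

```python
import numpy as np


def smote_like_oversample(X, y, minority_label, n_new, k=2, seed=0):
    """Append n_new synthetic minority rows, each placed on the segment
    between a minority sample and one of its k nearest minority neighbours.
    Simplified sketch: needs n_new >= 1 and at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        # Position 0 is the sample itself (assuming no duplicate rows).
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_out = np.vstack([X, np.array(synthetic)])
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out


def per_class_prf(y_true, y_pred, labels):
    """Precision (PPV), recall (sensitivity) and F1-score for each class."""
    out = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = (precision, recall, f1)
    return out


# Toy training set: weight (kg), height (cm); 0 = normal, 1 = obese.
X_train = np.array([[30, 135], [31, 136], [29, 134], [33, 138], [32, 137],
                    [34, 139], [50, 140], [52, 142], [55, 141]], dtype=float)
y_train = np.array([0] * 6 + [1] * 3)
X_bal, y_bal = smote_like_oversample(X_train, y_train, minority_label=1, n_new=3)

# Toy test-set labels and predictions for the metric calculation.
y_true = ["normal", "normal", "obese", "obese", "underweight"]
y_pred = ["normal", "obese", "obese", "obese", "normal"]
metrics = per_class_prf(y_true, y_pred, ["normal", "obese", "underweight"])
print(np.bincount(y_bal), metrics["obese"])
```

Oversampling is applied to the training data only, so the test set keeps the original class distribution when the metrics are computed.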

To examine the importance of each predictor at both the population and individual levels, we used Shapley Additive Explanations (SHAP), based on the best-performing prediction models, to obtain the contribution of each predictor for a prediction window [35]. A SHAP value is assigned to each predictor and quantifies its contribution by comparing the model predictions with and without that predictor. The Shapley values from all prediction windows in each cohort were used to compare the summary importance of predictors by weight status. Furthermore, to better understand the individual-level prediction of weight status, we selected two students as examples and used SHAP waterfall plots to illustrate the importance of different predictors for each student. Figure 1 shows the workflow used for this study. All prediction models were developed and compared using Python software (version 3.10) with Scikit-Learn.
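The idea of comparing predictions with and without a predictor can be made concrete with an exact, brute-force Shapley calculation on a toy scoring function. This is an illustration of the underlying definition only, not the algorithm used by SHAP for tree models; the predictor names, the function toy_risk and all of its numbers are hypothetical.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average each feature's marginal contribution
    over every possible order in which features can be added."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            contrib[f] += after - before
    return {f: total / len(orderings) for f, total in contrib.items()}


def toy_risk(present):
    """Hypothetical obesity risk score given the set of known predictors."""
    score = 0.10  # baseline when no predictor is known
    if "weight" in present:
        score += 0.30
    if "exercise_hours" in present:
        score -= 0.05
    if "weight" in present and "height" in present:
        score += 0.10  # interaction, shared equally by weight and height
    return score


features = ["weight", "height", "exercise_hours"]
phi = shapley_values(toy_risk, features)
print(phi)
```

The contributions sum to the difference between the score with all predictors and the baseline score, which is the property a waterfall plot displays for a single student.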

Fig. 1 Graphical illustration of the workflow used for this study

Results

A total of 442,898 and 344,186 students were enrolled in P4 and P6, respectively, from 1995/96 to 2014/15. The characteristics of the students in these two cohorts are shown in Table 1. The number of students in successive prediction windows (indicated by academic grade) decreased due to loss to follow-up. Of the students enrolled in P4 and P6, 224,398 (50.7%) and 171,768 (49.9%), respectively, were male. The mean ages of the two cohorts were 9.4 ± 0.56 and 11.3 ± 0.54 years, respectively. The prevalence of normal weight, underweight, overweight, and obesity was 63.4%, 12.0%, 18.5%, and 6.0%, respectively, at P4, and 65.5%, 11.9%, 18.5%, and 4.1% at P6. The demographic, personal lifestyle, and psychological well-being characteristics of students with different weight statuses are summarized and compared in Supplementary Tables S1 and S2. All predictors differed significantly across weight statuses in both the P4 and P6 cohorts.

Table 1 Baseline characteristics of students in the primary four and primary six cohorts

Figure 2 shows the overall predictive ability of the generated prediction models on the test set, with the exact accuracy levels tabulated in Supplementary Table S3. The XG Boost prediction models exhibited higher accuracy than all other models. They demonstrated robust performance in predicting both short- and long-term weight status, with the overall accuracy, micro-averaging AUC, and macro-averaging AUC values exceeding 0.72, 0.92, and 0.83, respectively, for the eight consecutive years of P4 prediction. Similarly, for the six consecutive years of P6 prediction, the corresponding values were greater than 0.74, 0.93, and 0.86, respectively. Table 2 presents the AUC values for each weight status across different models, highlighting the superior performance of XG Boost on multiclass prediction, with AUCs exceeding 0.85 for the underweight group, 0.85 for the overweight group, and 0.92 for the obese group in both the P4 and P6 predictions. Supplementary Table S4 provides the precision, recall and F1-score metrics for each weight status.

Fig. 2 Prediction accuracy of different multiclass machine learning models for every prediction window

A based on the primary four cohort; B based on the primary six cohort; XG Boost eXtreme Gradient Boosting

Table 2 AUC values of different multiclass machine learning models by weight status

The importance of all predictors was evaluated using Shapley values for different weight statuses (Fig. 3). The summary predictor importance by SHAP is presented as a list in descending order for each cohort. Weight, height, sex, age, and the frequency and hours of aerobic exercise consistently showed higher importance in the XG Boost prediction models. To further explore the predictive power of these top predictors, we re-developed the XG Boost prediction models using the above six and top three predictors for the P4 and P6 cohorts, respectively, in the training set. However, the models showed reduced accuracy when evaluated on the test set (Supplementary Table S5).

Fig. 3 Relative importance of predictors by predicted weight status

(a) based on the primary four cohort; (b) based on the primary six cohort. The relative predictor importance for each weight status was measured by the Shapley values under an XG Boost model. The predictors are listed in descending order of overall importance

To provide a more detailed understanding of the predictors’ contributions to the predictions, we generated waterfall plots for two children, one at P4 and the other at P6, who were predicted to be obese at S1 and S6, respectively (Fig. 4). Each arrow in the plot represents the extent and direction of a predictor’s contribution to the prediction. An arrow pointing to the right indicates that the corresponding predictor increases the risk of obesity, whereas an arrow pointing to the left indicates that the predictor reduces the risk of obesity. The student in Fig. 4a had weight as the main contributor to the predicted outcome of being obese, and weight reduction would be the target for reducing the risk of obesity at S1. In contrast, the student in Fig. 4b also had hyperactivity and hours of aerobic exercise as main contributors to the prediction, and alleviating hyperactivity and increasing the hours of aerobic exercise would also be targets for reducing the risk of obesity at S6.

Fig. 4 SHAP Waterfall plots of the contribution of each predictor to a predicted weight status

a Based on the data of a child at primary four, who is predicted to be obese at secondary one; b Based on the data of a child at primary six, who is predicted to be obese at secondary six; Each arrow shows the magnitude and direction of a predictor’s contribution to the predicted outcome

Discussion

To the best of our knowledge, this is the first study to simultaneously predict four weight statuses (normal weight, underweight, overweight, and obese) using ML, for both short- and long-term prediction. Our large population-based cohort of children around 9 or 11 years old, followed until around 17 years of age, allowed us to develop and validate these models with high accuracy. The use of ML in predicting weight status demonstrated superior accuracy compared to traditional methods, providing a preview of the weight status over subsequent years. Our models offer potential benefits for health professionals in identifying children who are at risk of developing weight problems.

In our study, the XG Boost machine-learning method demonstrated the highest accuracy for predicting weight status in adolescents for all prediction windows. Our models using P4 data to predict weight status at P5 to S6 reached a micro-averaging AUC of 0.97 to 0.92, while using P6 data for prediction until S6 had a micro-averaging AUC of 0.97 to 0.93. These accuracies were higher than those achieved by LG (0.88 to 0.80 and 0.88 to 0.81, respectively). The suboptimal accuracy of LG was also shown in a previous study which predicted overweight at 2 years [36]. Our large population-based sample made multiclass prediction possible by ensuring a sufficient number of children at each weight status. To our knowledge, no prediction models have been developed for the simultaneous temporal prediction of multiclass weight status in adolescents. Our XG Boost models had better performance for at least six prediction years, indicating their long-term prediction ability. However, the predictive ability gradually declined as the time span extended, which can be attributed to the diminishing influence of the predictors over a longer period. Moreover, the corresponding AUCs for each abnormal weight status were consistently above 0.85 at every prediction year. Therefore, our prediction models using XG Boost can accurately predict all weight statuses for children at around 9 and 11 years old for the subsequent years during adolescence.

We evaluated several ML algorithms for predicting weight status in adolescents and found that the SVM approach performed slightly better than LG, but it was impractical because of its extremely long computation time. The RF generally outperformed LG, while k-NN and DT showed unstable prediction abilities and yielded less consistent results. Each ML algorithm has its own advantages and disadvantages, and its performance may vary depending on the application. Thus, we attempted several ML algorithms on a large population-based sample to allow a robust assessment of the various prediction approaches. In our study, XG Boost was the most effective tool for predicting weight status in adolescents due to its ability to handle nonlinear predictors and its high computing efficiency through parallel computing.

To gain a deeper understanding of the predictors influencing adolescent weight status, we repeated the same model development process on two cohorts. Although there is apparent overlap between the important predictors across the two cohorts, there are also some distinct differences. Notably, the three subscales of the RBQ, particularly emotion and hyperactivity, had increased contributions to prediction in the P6 cohort compared to P4. Additionally, the social subscale of the SEI had increased importance in the P6 cohort. The P6 cohort was designed for the prediction of weight status in secondary school students who are at least two years older than the children in the P4 cohort. One possible reason for these differences is that the transition from primary to secondary school may be a particularly difficult experience for some children [37]. The adjustment to a new social environment can lead to anxiety and emotional problems, which can lower their social self-esteem if they fail to negotiate new relationships.

Our findings also suggest that emotional and behavioral problems, as well as low self-esteem, are associated with weight problems in adolescents. Adolescents with emotional and behavioral problems are more susceptible to losing behavioral control, disordered eating, and sedentary behaviors, leading to poor weight management [38]. In addition, individuals with lower self-esteem tend to experience painful self-awareness, do less future planning, consume more food, and engage in less physical activity, leading to a higher risk of being overweight or obese [39, 40]. These findings highlight the need for early intervention in adolescents with emotional and behavioral problems and low self-esteem to prevent or manage weight problems.

Historical weight and height are the most crucial predictors of weight status during most of adolescence. Previous ML prediction models considered these measurements only at birth or predicted weight status in at most three subsequent years [19]. Age and sex are also significant predictors commonly used to predict adolescent weight status, and our study found that the average Shapley values of weight, height and sex were consistently high. However, the accuracy of the models decreased when we excluded other predictors, indicating that all selected predictors, especially physical activity habits, should be considered for optimal early intervention for weight problems in Hong Kong adolescents.

ML models may also offer a powerful tool for prioritizing the predictors that are most relevant to predicting outcomes for adolescents. In our study, each student may have a unique set of predictors that contribute to the predicted outcome. To determine the importance of each predictor, we used the Shapley value, which is represented in the waterfall plot as the contribution of each predictor to the final predicted outcome. For instance, consider our illustrative example with two students, one from primary four and the other from primary six, who were both predicted to be at higher risk of being obese in a subsequent year. For the P6 student, weight control, addressing hyperactivity, and increasing the hours of aerobic exercise are the most critical strategies, while for the P4 student, weight control is the key strategy. By identifying the most influential predictors, our multiclass prediction models can assist health professionals in developing targeted and effective interventions to prevent obesity in adolescents.

Limitations

Our study has some limitations. First, we did not consider all the relevant predictors, such as parental weight status and lifestyle, which have been shown to influence adolescents’ BMI [41]. Future studies could include more predictors to improve model accuracy. Second, our retrospective design limited data quality control. However, the annual health assessment scheme data were well-managed by the Department of Health, allowing us to apply ML for multiclass prediction. Although a prospective design would have been ideal, it was not feasible to accrue a large cohort with a long follow-up period for applying ML algorithms. Future studies may consider using a prospective design to validate our findings. Third, we did not conduct feature selection to determine the optimal set of predictors for our prediction models. However, all the predictors included in our prediction models showed a significant association with weight status. Their inclusion would enhance the prediction accuracy, particularly for long-term prediction. Nonetheless, feature selection that takes into account the importance, stability across samples, or other performance criteria of the predictors would be desirable in future studies.

Conclusions

Our study demonstrates the potential of ML approaches for multiclass prediction of child and adolescent weight status in Hong Kong. XG Boost performed better than the other approaches, indicating its potential to improve the accuracy of existing weight status prediction models. Our results suggest that it is possible to predict the long-term weight status by utilizing student characteristics as early as primary four. With the interpretability and high accuracy of the XG Boost models developed in this study, health professionals can improve weight promotion programs and provide personalized and effective weight management interventions for adolescents.