Study design and participants
This study used data from the community-based Korean Medicine Daejeon Citizen Cohort (KDCC) study currently underway in Korea [19]. The KDCC study includes residents of Daejeon aged 30–55 years, excluding individuals diagnosed with cancer or CVD (myocardial infarction, angina, stroke/apoplexy). Between 2017 and 2019, the study completed a population-based survey of 2000 participants to collect demographic, lifestyle-related, clinical, and biochemical data, as well as individual characteristics related to Korean medicine (KM). The questionnaire survey was conducted as a face-to-face interview by well-trained interviewers. The participants' height, weight, waist circumference, and hip circumference were measured. Blood samples, collected after 12 h of fasting, were sent for testing to an authorized diagnostic laboratory (Seoul Clinical Laboratories, Seoul, Korea). This study analyzed the KDCC data of 1991 individuals after excluding nine with missing values.
The KDCC study was approved by the Institutional Review Board, and informed consent was obtained from the participants after their participation in the study had been explained to them.
Measures
With reference to previous studies, the 20 features used in the MetS prediction models were examined [17] and added sequentially in three steps [15, 16] according to their characteristics and methods of collection: demographic and anthropometric data that could be self-reported or were already known were added in step 1; lifestyle-related factors that could be measured using questionnaires were added in step 2; and blood indicators were added in step 3. The variables used in this study are well-known risk factors for metabolic syndrome in the clinical setting. In addition, they are important for the prediction and management of metabolic syndrome because they can be modified through clinical attention, individual intervention, and awareness [1].
Demographic and body measurements (Step 1)
The first group of features consisted of sex, age, body mass index (BMI), and waist-to-hip ratio (WHR). BMI was calculated by dividing the measured weight (kg) by the squared height (m²). WHR was calculated by dividing the average waist circumference by the average hip circumference, each measured twice with a measuring tape (Rollfix, Hoechstmass, Germany).
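The two anthropometric calculations described above can be sketched as follows. This is a minimal illustration rather than the study's own code; the function and parameter names are hypothetical, and waist and hip values are assumed to be in the same unit.

```python
from typing import Sequence


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by squared height (m^2)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def whr(waist: Sequence[float], hip: Sequence[float]) -> float:
    """Waist-to-hip ratio: mean of repeated waist measurements
    divided by mean of repeated hip measurements."""
    if not waist or not hip:
        raise ValueError("at least one measurement is required for each site")
    mean_waist = sum(waist) / len(waist)
    mean_hip = sum(hip) / len(hip)
    return mean_waist / mean_hip
```

For example, `whr([80, 82], [100, 100])` averages the two waist readings to 81 before dividing by the mean hip value of 100.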
Lifestyle-related factors (Step 2)
The second group of features consisted of lifestyle-related factors, including drinking status, smoking status, physical activity [20], sleep time and quality [21], eating index [22], stress [23], and symptom-based KM types [24]. All eight features were investigated using a structured questionnaire. The following questions were asked for smoking status: “Have you smoked more than 100 cigarettes in your lifetime?” and “Do you currently smoke?” Based on the responses, the smoking status of the participants was classified as “current smoker,” “former smoker,” or “non-smoker.” Drinking status was classified as “current drinker,” “former drinker,” or “non-drinker” based on similar questions about drinking. Physical activity (PA) was assessed using the Korean Global Physical Activity Questionnaire developed by the World Health Organization [20]. PA was calculated and then converted to metabolic equivalents of task (METs). Sleep time and quality over the past month were assessed using the Korean version of the Pittsburgh Sleep Quality Index [25]. The eating index was measured using a semi-quantitative food frequency questionnaire consisting of 34 food groups, which collects data on the frequency (nine options ranging from rarely eaten to three times a day) and average intake (three or four specified portion sizes) of each food item over the past year [19]. The eating index was composed of 9 adequacy components and 5 moderation components, and the total score ranged from 0 to 100 following the previously reported calculation method of the Korean Healthy Eating Index [26]. The stress index was calculated using the 18-item Psychosocial Well-being Index-Short Form [27]. The KM types were defined as Sasang constitution and were determined using a simplified, structured questionnaire comprising one physical characteristic, six personality traits, and eight physiological symptoms [24].
The KM types were classified into Taeumin, Soeumin, or Soyangin because individuals of these types differ in their physiological and psychological states, disease susceptibility, and lifestyle healthcare approach [28].
Biochemical measurements (Step 3)
The third group of features consisted of eight blood test features, including aspartate transaminase (AST), alanine transaminase (ALT), and alkaline phosphatase (ALP) for liver function [29]; high-sensitivity C-reactive protein (hsCRP) [30]; hemoglobin A1c (HbA1c) [31]; insulin; gamma-glutamyl transferase (GGT); and homeostatic model assessment for insulin resistance (HOMA-IR) [32]. Blood samples were collected from a peripheral vein in the morning, following overnight fasting, and then centrifuged at 3450 rpm for 10 min. AST, ALT, ALP, hsCRP, and GGT were measured using an automatic clinical chemistry analyzer (ADVIA 1800, Siemens, USA), as were glucose, triglyceride, and high-density lipoprotein-cholesterol, which are diagnostic indicators of MetS. HbA1c and insulin levels were determined using automated analyzers (Variant II Turbo, Bio-Rad, USA and ADVIA Centaur, Siemens, USA, respectively). HOMA-IR was calculated as glucose (mg/dL) × insulin level (mIU/L)/405.
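The HOMA-IR formula given above can be written as a one-line function. This is a minimal sketch with hypothetical names; it assumes fasting glucose in mg/dL and fasting insulin in mIU/L, as stated in the text.

```python
def homa_ir(glucose_mg_dl: float, insulin_miu_l: float) -> float:
    """HOMA-IR = glucose (mg/dL) x insulin (mIU/L) / 405."""
    if glucose_mg_dl < 0 or insulin_miu_l < 0:
        raise ValueError("concentrations must be non-negative")
    return glucose_mg_dl * insulin_miu_l / 405.0
```

A fasting glucose of 90 mg/dL with an insulin level of 9 mIU/L gives 810/405 = 2.0.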
Definition of the metabolic syndrome
The MetS group in the present study was defined as participants meeting at least two of the five criteria below, and therefore included both pre-MetS and MetS status, because of the importance of preventive healthcare through early detection of MetS in the middle-aged population [33]. The following five criteria from the NCEP-ATP III guidelines [1] were used: 1) a waist circumference above the cut-off point for Koreans (≥90 cm for males and ≥85 cm for females); 2) systolic blood pressure ≥130 mmHg, diastolic blood pressure ≥85 mmHg, or taking medication for hypertension; 3) a triglyceride level of ≥150 mg/dL or taking medication for such lipid abnormalities; 4) low high-density lipoprotein-cholesterol level (<40 mg/dL for males and <50 mg/dL for females) or taking medication for such lipid abnormalities; 5) a fasting plasma glucose level of ≥100 mg/dL or taking medication for type 2 diabetes.
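As an illustration of how the five criteria and the two-criterion threshold combine, the labeling rule could be sketched as below. The `Participant` class, its field names, and the separate medication flags are hypothetical and are not taken from the KDCC data dictionary.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    sex: str                 # "M" or "F" (hypothetical coding)
    waist_cm: float
    sbp: float               # systolic blood pressure, mmHg
    dbp: float               # diastolic blood pressure, mmHg
    triglycerides: float     # mg/dL
    hdl: float               # mg/dL
    glucose: float           # fasting plasma glucose, mg/dL
    on_bp_med: bool = False
    on_tg_med: bool = False
    on_hdl_med: bool = False
    on_diabetes_med: bool = False


def count_criteria(p: Participant) -> int:
    """Count how many of the five criteria listed in the text are met."""
    male = p.sex == "M"
    criteria = [
        p.waist_cm >= (90 if male else 85),
        p.sbp >= 130 or p.dbp >= 85 or p.on_bp_med,
        p.triglycerides >= 150 or p.on_tg_med,
        p.hdl < (40 if male else 50) or p.on_hdl_med,
        p.glucose >= 100 or p.on_diabetes_med,
    ]
    return sum(criteria)


def in_mets_group(p: Participant, threshold: int = 2) -> bool:
    """Group label used in this study: at least two criteria met."""
    return count_criteria(p) >= threshold
```

The threshold is a parameter so that the same function can express the stricter three-criterion rule commonly used for a clinical MetS diagnosis.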
Analysis
Data are expressed as mean and standard deviation or as frequency and percentage. General characteristics of the normal and MetS groups were compared using Fisher’s exact test or the chi-square test for categorical variables and independent t-tests for continuous variables. The performance of the MetS prediction models was compared by sequentially entering the 20 features identified as key indicators of MetS in three steps and examining their influence. The features and their scales at each step are as follows. In step 1, sex (categorical) and age, BMI, and WHR (continuous) were entered. In step 2, drinking, smoking, and KM types (categorical) and physical activity, sleep time, sleep quality, eating index, and stress (continuous) were added. In step 3, AST, ALT, ALP, hsCRP, HbA1c, insulin, GGT, and HOMA-IR (continuous) were added.
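The three cumulative feature sets can be represented as a simple lookup. The sketch below is illustrative only; the column names are hypothetical identifiers for the 20 features named in the text.

```python
# Features added at each step (names are hypothetical column identifiers).
STEP_FEATURES = {
    1: ["sex", "age", "bmi", "whr"],
    2: ["drinking", "smoking", "km_type", "physical_activity",
        "sleep_time", "sleep_quality", "eating_index", "stress"],
    3: ["ast", "alt", "alp", "hscrp", "hba1c", "insulin", "ggt", "homa_ir"],
}

# Variables treated as categorical; all others are continuous.
CATEGORICAL = {"sex", "drinking", "smoking", "km_type"}


def features_up_to(step: int) -> list:
    """Return the cumulative feature list used at the given step (1 to 3)."""
    if step not in STEP_FEATURES:
        raise ValueError("step must be 1, 2, or 3")
    selected = []
    for s in range(1, step + 1):
        selected.extend(STEP_FEATURES[s])
    return selected
```

Calling `features_up_to(3)` returns all 20 features, so the same model-fitting routine can be run three times with progressively larger inputs.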
Supervised machine learning models were used for MetS prediction. The algorithms used to develop the models were decision tree, Gaussian Naïve Bayes (NB), K-nearest neighbor (KNN) [34], XGBoost, random forest (RF), logistic regression [15, 18, 35], support vector machine (SVM), multi-layer perceptron (MLP) [16], and 1-dimensional convolutional neural network (1D-CNN) [36]. Min-max normalization was applied to the data used in the analysis [37]. The models were built using training and test data divided into six folds, with a training-to-test ratio of 5:1. Of the 1991 records, 1659 were used as the training dataset and 332 as the test dataset. In addition, the 2:1 ratio of the normal group to the MetS group was kept the same in the training and test datasets. Moreover, we performed oversampling using the synthetic minority oversampling technique (SMOTE) to deal with data imbalance [13, 38, 39]. SMOTE generates synthetic data for the minority class using a Euclidean distance-based nearest-neighbor approach. Because the synthetic data are generated from existing data, the two have similar characteristics. We compared model performance before and after applying SMOTE. Lastly, the importance of the features influencing MetS was examined using RF [18, 40], because the RF model consistently showed the best overall performance in all three stages.
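To make the preprocessing concrete, the sketch below shows min-max normalization and the core interpolation idea behind SMOTE using only the Python standard library. It is a simplified illustration, not the implementation used in the study; the function names, the default `k`, and the fixed random seed are assumptions.

```python
import math
import random
from typing import List, Sequence


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote_like(minority: Sequence[Sequence[float]], n_new: int,
               k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest (Euclidean) neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Because each synthetic point lies on a line segment between two existing minority samples, it stays within the range of the observed data, which is why the synthetic and original records share similar characteristics.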
The performance of the MetS prediction models was measured using F1-score, accuracy, sensitivity, specificity, and the AUC, along with the 95% confidence interval. F1-score is the harmonic mean of precision and recall, calculated as follows: F1-score = 2 / {(1/Precision) + (1/Recall)}, Precision = True Positive / (True Positive + False Positive), and Recall = True Positive / (True Positive + False Negative). The scikit-learn library in Python ver. 3.8.5 (Python Software Foundation, https://www.python.org/psf/) was used. For analysis and comparison, the models were built using default parameters.
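The threshold-based metrics defined above follow directly from the four cells of a confusion matrix. The function below is a dependency-free sketch with hypothetical names; it does not cover the AUC or the confidence intervals, and it assumes a ratio is reported as 0.0 when its denominator is zero.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall (sensitivity), specificity, accuracy,
    and F1-score from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        # Assumption: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # same quantity as sensitivity
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    # Harmonic mean of precision and recall, as in the formula above.
    if precision > 0 and recall > 0:
        f1 = 2 / (1 / precision + 1 / recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "sensitivity": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }
```

With 40 true positives, 10 false positives, 40 true negatives, and 10 false negatives, precision and recall are both 0.8, so their harmonic mean is also 0.8.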