Study design
The present study was based on data from the CHNS, a unique population-based longitudinal study in China that covers key phenotypes, diet and health outcomes of participants from 15 provinces or megacities in China (six in the Northern region, nine in the Southern region) [14]. The detailed study design of CHNS has been described previously [14]. CHNS rounds were completed in 1989, 1991, 1993, 1997, 2000, 2004, 2006, 2009, 2011, 2015 and 2018. Stool samples and dietary information were collected in the 2015 survey, and participants with a gut microbiota profile based on 16S rRNA analysis from stool samples were included in the present study (n = 3248). Participants were excluded if they had used antibiotics within the month preceding stool collection (n = 71), had ever had an intestinal disease (n = 26, including ulcerative colitis, Crohn’s disease, localised enteritis or irritable bowel syndrome), or had prevalent type 2 diabetes in 2015 (n = 379). Therefore, a total of 2772 diabetes-free participants from the 2015 survey for whom a gut microbiota profile was available were included in the present study (age 50.8 ± 12.7 years, mean ± SD). After a median follow-up period of 3.04 years (IQR 2.9–3.1 years), 1829 participants remained at the time of the 2018 survey, 123 of whom had incident type 2 diabetes. These 1829 participants were included in our longitudinal analysis of gut microbiota with glycaemic traits and incident type 2 diabetes.
The CHNS protocol was approved by the Institutional Review Boards of the Chinese Center for Disease Control and Prevention (number 201524), the University of North Carolina at Chapel Hill, USA, and the US National Institute for Nutrition and Health (number 07-1963). Informed consent was obtained from all participants.
Faecal sample collection and 16S rRNA profiling
Stool samples were collected by the participants themselves, who received instructions on the collection process during a home visit on one of the two weekdays when the 24 h dietary recall data were recorded. Samples were frozen at −20°C immediately after collection. All stool samples were transported through a cold chain to the central laboratory within 24–48 h and stored at −20°C until processing. We obtained a mean of 76,881 paired-end raw reads for each sample. The methods for DNA extraction, amplification and sequencing have been described previously [15]. The 16S rRNA sequencing data were analysed using the Quantitative Insights Into Microbial Ecology 2 platform (QIIME 2) [16]. DADA2 software [17] was used to filter out sequencing reads with quality score Q<25 and to de-noise reads into amplicon sequence variants, resulting in feature tables and representative sequences. Taxonomy classification was performed based on the naive Bayes classifier using the classify-sklearn package against the Silva-132-99 reference sequences [18].
Data collection
Demographic, lifestyle and dietary data were collected by questionnaires during the home visits on three consecutive days. Anthropometric factors were measured on-site by trained staff. Habitual dietary and total energy intakes were assessed by three consecutive 24 h dietary recalls, including two weekdays and one weekend day. The participants were asked to report the types and amounts of all food eaten during the previous 24 h [19]. The energy intake was calculated from the collected dietary data based on the Chinese Food Composition Table [20]. Physical activity was assessed as a total metabolic equivalent for task hours per week from 7-day recalls of occupational, transportation, domestic and leisure activities [21]. Urbanisation was quantified by a validated index covering 12 urbanicity-related components [22]. We assessed household income as the total income of all household members.
Following an overnight fast, a blood sample was collected by venepuncture. Blood glucose levels were measured using a glucose oxidase phenol 4-aminoantipyrine peroxidase kit (Randox, Crumlin, UK) and a Hitachi 7600 Analyzer (Hitachi, Tokyo, Japan). Serum insulin levels were measured using a radioimmunology assay kit (North Institute of Biological Technology, Beijing, China) and a XH-6020 gamma counter (North Institute of Biological Technology). HPLC (model HLC-723 G7; Tosoh Corporation, Tokyo, Japan) was used to measure HbA1c [23]. The coefficients of variation for fasting glucose, insulin and HbA1c at follow-up were 19%, 13% and 16%, respectively. HOMA-IR (calculated as fasting glucose × fasting insulin/22.5) was used to represent insulin resistance.
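As a minimal illustration of the HOMA-IR calculation described above, the following Python sketch applies the stated formula. It assumes fasting glucose in mmol/l and fasting insulin in μU/ml, the units conventionally paired with the 22.5 divisor; the text does not state the units explicitly, and the function name is hypothetical.

```python
def homa_ir(fasting_glucose_mmol_l: float, fasting_insulin_uu_ml: float) -> float:
    """HOMA-IR = fasting glucose x fasting insulin / 22.5.

    Units are an assumption: glucose in mmol/l, insulin in microU/ml.
    """
    if fasting_glucose_mmol_l <= 0 or fasting_insulin_uu_ml <= 0:
        raise ValueError("glucose and insulin must be positive")
    return fasting_glucose_mmol_l * fasting_insulin_uu_ml / 22.5


# Illustrative values only, not study data: 5.0 * 9.0 / 22.5
example_homa_ir = homa_ir(5.0, 9.0)
```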
Ascertainment of type 2 diabetes
Incident type 2 diabetes cases were ascertained based on fasting blood glucose ≥7.0 mmol/l or HbA1c ≥47.5 mmol/mol (6.5%), or being currently under medical treatment for diabetes during the follow-up visits, according to the American Diabetes Association criteria for the diagnosis of diabetes [24].
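The ascertainment rule above can be sketched as a simple Boolean function. This is an illustration under stated assumptions rather than the study's actual code: the function and argument names are hypothetical, and treating a missing measurement (None) as not meeting its threshold is our assumption.

```python
from typing import Optional

FASTING_GLUCOSE_THRESHOLD_MMOL_L = 7.0
HBA1C_THRESHOLD_MMOL_MOL = 47.5  # equivalent to 6.5%


def has_incident_t2d(
    fasting_glucose_mmol_l: Optional[float],
    hba1c_mmol_mol: Optional[float],
    on_diabetes_treatment: bool,
) -> bool:
    """Return True if any one of the three criteria is met."""
    if on_diabetes_treatment:
        return True
    if (fasting_glucose_mmol_l is not None
            and fasting_glucose_mmol_l >= FASTING_GLUCOSE_THRESHOLD_MMOL_L):
        return True
    # Assumption: a missing HbA1c value (None) does not count as a case.
    return (hba1c_mmol_mol is not None
            and hba1c_mmol_mol >= HBA1C_THRESHOLD_MMOL_MOL)
```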
Bioinformatics and statistical analysis
Statistical analyses were performed using Stata 15 (StataCorp, College Station, TX, USA). The classifier was based on code adapted from the scikit-learn package [25]. Missing values of the continuous covariates were imputed using the mean value in the corresponding region (i.e. North or South China), and missing values of the categorical covariates were imputed using the most frequent value. Only microbial genera present in at least 10% of the participants were included in our analyses.
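The imputation and prevalence-filtering rules above can be sketched in plain Python as follows. All function names are hypothetical. Two details are assumptions not stated in the text: a genus is counted as present when its abundance is greater than zero, and the most frequent category is taken over all participants rather than within region.

```python
from collections import Counter
from statistics import mean


def impute_continuous_by_region(values, regions):
    """Replace None with the mean of the observed values in the same region."""
    observed = {}
    for value, region in zip(values, regions):
        if value is not None:
            observed.setdefault(region, []).append(value)
    region_mean = {region: mean(vals) for region, vals in observed.items()}
    # Raises KeyError if a region has no observed value at all.
    return [region_mean[r] if v is None else v for v, r in zip(values, regions)]


def impute_categorical_by_mode(values):
    """Replace None with the most frequent observed category (overall)."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def prevalent_genera(abundance, min_prevalence=0.10):
    """Keep genera present (abundance > 0) in at least min_prevalence of samples.

    abundance maps genus name -> list of per-participant abundances.
    """
    kept = []
    for genus, counts in abundance.items():
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence >= min_prevalence:
            kept.append(genus)
    return kept
```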
Comparison of the gut microbial composition between participants from North and South China
At the genus level, we used the vegdist function from the R package vegan [26] to calculate the gut microbial Bray–Curtis dissimilarity matrix. The p value for the difference in gut microbial composition between the two regions was determined by 1000 permutations, and a p value <0.05 was considered statistically significant.
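For readers unfamiliar with the Bray–Curtis dissimilarity, the following sketch computes it for pairs of genus-abundance vectors as the sum of absolute differences divided by the sum of all abundances. It is a plain Python illustration with hypothetical function names, not the vegan implementation used in the study, and it does not cover the permutation test.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_k - v_k| / sum(u_k + v_k)."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("dissimilarity is undefined for two empty samples")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def dissimilarity_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = bray_curtis(samples[i], samples[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

The value ranges from 0 (identical composition) to 1 (no shared genera) for non-negative abundances.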
A machine learning model (gradient boosting decision trees from the Light Gradient Boosting Machine [LightGBM] package [27]) was used for classification of participants from North or South China. The genus-level taxonomic abundance was used as the predictive feature. We used the ‘leave one out’ strategy to evaluate the classifier’s performance, meaning that each training set was created by taking all provinces or megacities except for the one used as the test set. The above process was repeated ten times, resulting in a probability of belonging to the Southern region for each participant.
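The leave-one-province-out splitting described above can be sketched as a small generator that holds out each province or megacity in turn. This is a plain Python illustration of the splitting logic only (no model is fitted); the function name and the example labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    groups holds one province or megacity label per participant.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Illustrative labels only.
example_groups = ["Beijing", "Beijing", "Hunan", "Hubei"]
example_splits = list(leave_one_group_out(example_groups))
```

Because every participant belongs to exactly one province or megacity, each participant appears in exactly one test set, so one pass yields one out-of-sample prediction per participant. scikit-learn provides the same splitting logic as `LeaveOneGroupOut` in `sklearn.model_selection`.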
We used the SHAP (Shapley Additive exPlanations) algorithm [28] to estimate the contribution of each gut microbial genus to the overall classifier prediction. The combination of the LightGBM and SHAP methods has shown unique strength in prediction and feature selection [6, 29]. Microbial genera with a mean absolute SHAP value greater than 0 contributed to the classification of geographic regions, and were treated as region-discriminating gut microbes.
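The selection rule above (mean absolute SHAP value greater than 0) can be sketched as follows. The sketch assumes that a participants-by-genera matrix of SHAP values has already been computed elsewhere; the function name and genus labels are hypothetical.

```python
def region_discriminating_genera(shap_values, genus_names):
    """Return {genus: mean |SHAP|} for genera whose mean |SHAP| exceeds 0.

    shap_values is a list of rows, one per participant, with one SHAP
    value per genus in the same order as genus_names.
    """
    n_participants = len(shap_values)
    selected = {}
    for j, genus in enumerate(genus_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_participants
        if mean_abs > 0:
            selected[genus] = mean_abs
    return selected


# Illustrative values only: genus_B never contributes, so it is dropped.
example_selected = region_discriminating_genera(
    [[0.2, 0.0, -0.1], [-0.4, 0.0, 0.3]],
    ["genus_A", "genus_B", "genus_C"],
)
```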
Region-discriminating gut microbiota predicted dietary habits
For each of the dietary factors, we used the LightGBM method to predict the dietary intake based on the region-discriminating microbial genera. The tested dietary factors included rice, wheat, fruit, vegetable, nuts, pork, poultry, milk, egg, fish, animal oil and vegetable oil. We constructed an index by generating the wheat/rice ratio to reflect the staple food preference. Tenfold cross-validation was used to generate genera-predicted intake values for each participant. The performance of the model was quantified using Pearson correlation for regression and the AUC of the receiver operating characteristic for classification. The R package pROC [30] was used for receiver operating characteristic curve analyses. As a sensitivity analysis, we also imputed the missing dietary factors by multiple imputation using chained equations. The multiple imputation model included the outcome (dietary factors), age, sex, education, marital status and geographic region (North or South China). Five imputed datasets were generated, and the prediction analyses were based on the mean values of the imputed datasets.
Longitudinal relationship between gut microbiota and glycaemic traits
At the genus level, we used a linear mixed-effects model to examine the longitudinal association of gut microbiota with glycaemic traits (fasting glucose, fasting insulin, HbA1c and HOMA-IR), adjusted for the corresponding baseline glycaemic trait and for demographic, anthropometric and lifestyle factors. Sensitivity analysis was performed by adding the dietary factors to the covariate list. The demographic, anthropometric and lifestyle factors included age, sex, household income, marital status, self-reported educational level, place of residence (rural or urban), urbanisation index, BMI, total energy intake, alcohol consumption, smoking and physical activity. To further identify microbial genera whose associations with glycaemic traits are potentially mediated by BMI, we re-examined the association of the gut microbiota with glycaemic traits without adjusting for BMI. Here, associations were expressed as the difference in glycaemic traits (in SD units) per SD difference in each gut microbial genus. The linear mixed-effects model contained a random intercept and random coefficient on the provinces or megacities to adjust for the heterogeneity of the gut microbiota composition among the provinces or megacities. We independently examined the gut microbiota/glycaemic trait association in the Northern and Southern populations, and combined the effect estimates from the two regions using random-effects meta-analysis. A p value <0.05 was considered statistically significant. The Benjamini–Hochberg method was used to control the false discovery rate (FDR).
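To make the FDR control step concrete, the following sketch computes Benjamini–Hochberg adjusted p values in plain Python. It illustrates the standard step-up procedure and is not the study's code; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    For the p value of rank r (ascending) among m tests, the adjusted
    value is min over ranks k >= r of p_(k) * m / k, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An association would then be declared significant when its adjusted value falls below the chosen FDR threshold.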
Healthy microbiome index and incident type 2 diabetes
We used an additive model to construct a healthy microbiome index (HMI) with the glycaemic trait-related genera as
$$ {\mathrm{HMI}}_i=\sum \limits_{j=1}^m{g}_{ij} $$
where $\mathrm{HMI}_i$ is the healthy microbiome index for individual $i$, $m$ is the number of glycaemic trait-related genera, and $g_{ij}$ is the score for gut microbial genus $j$ for individual $i$. If individual $i$ carries a genus $j$ that is favourable for a glycaemic trait, or does not carry a genus $j$ that is harmful to the glycaemic trait, $g_{ij}$ equals 1; otherwise $g_{ij}$ equals 0.
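The scoring rule for the HMI can be sketched as follows. The function name and genus labels are hypothetical placeholders, and representing each individual by the set of genera they carry is an assumption made for illustration.

```python
def healthy_microbiome_index(carried, favourable, harmful):
    """Additive HMI: one point per favourable genus carried and one
    point per harmful genus not carried.

    carried, favourable and harmful are sets of genus names; the number
    of glycaemic trait-related genera is len(favourable) + len(harmful).
    """
    score = sum(1 for genus in favourable if genus in carried)
    score += sum(1 for genus in harmful if genus not in carried)
    return score


# Illustrative example with placeholder genus names.
example_hmi = healthy_microbiome_index(
    carried={"genus_A", "genus_C"},
    favourable={"genus_A", "genus_B"},
    harmful={"genus_C", "genus_D"},
)
```

In this example the individual scores one point for carrying genus_A and one point for not carrying genus_D, giving an index of 2 out of a maximum of 4.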
We then examined the prospective association of the baseline HMI (per SD unit) with incident type 2 diabetes using a Poisson regression model, adjusted for the aforementioned demographic, anthropometric and lifestyle factors. We also performed subgroup analysis stratified by the geographic region, age group, sex, BMI level and urbanisation level (city or rural), to test the robustness of the model.
Relationship between dietary or lifestyle factors and glycaemic trait-related gut microbiota
Linear regression was used to estimate the difference in the above glycaemic trait-related gut microbiota or HMI (in SD units) per SD change for continuous dietary or lifestyle factors (per-category change for categorical dietary or lifestyle factors), with adjustment for potential confounders and mutually adjusted for the other dietary or lifestyle factors. The tested dietary or lifestyle factors included wheat, rice, wheat/rice ratio, fruit, vegetable, nuts, pork, poultry, milk, egg, fish, alcohol consumption, smoking and physical activity. The adjusted covariates included age, sex, BMI, total energy intake, household income, marital status, self-reported educational level, place of residence (rural or urban), urbanisation index, and animal or vegetable oil intake. In addition to the above food groups, we also used linear regression to evaluate the association of dietary fibre with glycaemic trait-related gut microbial genera, with adjustment for the above covariates. The Benjamini–Hochberg method was used to control the FDR. An FDR value <0.05 was considered statistically significant. We further used linear regression to examine the association between the included food groups and glycaemic traits with and without adjustment for the gut microbial genera (i.e. HMI).