Background

Esophageal cancer ranks seventh in incidence and sixth in mortality worldwide according to Global Cancer Statistics 2020. Eastern Asia shows the highest regional incidence rates for both men and women, partly because of the large burden in China [1]. In China, esophageal cancer is the sixth most common malignancy and the fourth leading cause of cancer-related deaths [2]. Esophageal adenocarcinoma (EADC) and esophageal squamous cell carcinoma (ESCC) are the two most common histologic subtypes of esophageal cancer. In China, more than 90% of esophageal cancer cases are ESCCs, which account for about half of all ESCC cases worldwide [3]. The most surveyed region of China is the Taihang Mountain range in north central China. In small areas of this region, ESCC may be at or near the leading cause of death, with incidence rates exceeding 125/100,000 per year [4]. Therefore, identifying risk factors in the early stages of the disease appears essential to decreasing ESCC incidence and mortality.

Esophagitis is a benign disease with a certain rate of malignant transformation and is regarded as a precancerous condition of esophageal cancer. Based on the WHO histological classification of tumors, esophageal precancerous lesions (EPL) can be defined as low-grade intraepithelial neoplasia (LGIN) and high-grade intraepithelial neoplasia (HGIN) [5]. As precancerous lesions become more severe, the rate of developing esophageal cancer increases from 24 to 74% [6]. Therefore, intervention at an early stage of the disease could substantially decrease ESCC incidence and mortality. The etiology of ESCC is multifactorial and strongly population dependent. One study estimated a population-attributable risk of 89% using only cigarette smoking, alcoholic beverage consumption, and low consumption of fruits and vegetables [7]. There is also some evidence for a protective effect of fruits and vegetables and a potentially harmful effect of processed/red meat consumption. However, the evidence on the effect of diet on ESCC risk is still suggestive or limited [8]. Before a diagnosis of esophageal cancer is established, the majority of patients have experienced pain, dysphagia, eating difficulties, appetite loss, bloating and nausea, which affect their daily living and quality of life [9]. Identifying high-risk dietary patterns and characteristic symptoms at an early stage is therefore meaningful for the prevention of esophageal cancer.

The analysis of dietary patterns aims to fully explore the complexity of the diet, as an alternative to the study of isolated components. These techniques rest on the concept that food consumption can be effectively represented by reproducible patterns, in spite of individual variations, and that foods eaten together may have interactive effects on the risk of cancer. The same reasoning applies to symptoms. Identifying symptom clusters and their relationship to patient characteristics may help to identify patients with early lesions and provide greater insight into the planning of future interventions.

Latent class analysis (LCA) is a model-based cluster analysis technique that allows prevalent, mutually exclusive dietary patterns and symptom classes to be identified, with additional advantages over the classical approaches [10]. Unlike principal component analysis (PCA) and factor analysis (FA), it can categorize individuals into mutually exclusive groups by dietary pattern or symptom severity, and unlike conventional cluster analysis, it allows quantification of the uncertainty of class membership and assessment of goodness of fit [11].

Our study aims to identify dietary patterns and symptom severity classes through LCA and to use them to model screening for different stages of the disease, thereby adding a new perspective on the association of dietary habits and symptoms with ESCC in China.

Methods

Study population

This was a multicenter cross-sectional study based on the high-incidence regions for esophageal and gastric cancer established by the cancer early diagnosis and early treatment project, which has operated in some high-risk areas in China since 2000 [12]. In 2017, a new screening study of upper gastrointestinal cancer was launched in five high-risk rural areas in China, in Hebei, Henan, Shandong, Shanxi, and Gansu Provinces. Men and women aged 40 to 69 years formed the target population. The main purpose of this project was to identify the population at high risk of upper gastrointestinal cancer and to establish a cancer risk prediction model to support the prevention of upper gastrointestinal cancer.

The inclusion criteria were as follows: (1) local permanent residents in selected regions, (2) no history of endoscopic examination during the last 3 years, (3) no history of cancer, mental disorder, or any contraindication for endoscopy, (4) signed informed consent and (5) agreement to complete the entire survey and examination, including endoscopy. The participant selection process is shown in Fig. 1. We recruited participants from April 2017 to December 2018. The final analysis included 34,707 residents aged 40–69 years. Among these participants, there were 81 persons with ESCC, 251 persons with HGIN, 1413 persons with LGIN, 3883 persons with esophagitis and 29,079 persons serving as normal esophagus controls.

Fig. 1

Flow chart of participant selection. Note: LGIN = low-grade intraepithelial neoplasia; HGIN = high-grade intraepithelial neoplasia; ESCC = esophageal squamous cell carcinoma

The study was approved by the Capital Medical University, Chinese Academy of Medical Sciences and Peking Union Medical College. The experimental protocol involving humans was in accordance with the guidelines of the Declaration of Helsinki.

Diet and symptoms assessment

Comprehensive questionnaire information was collected in face-to-face interviews and entered directly into a laptop-based data entry system by trained investigators. The data entry software was designed to reduce missing items and logical inconsistencies. A questionnaire typically took 35–45 min to complete. Dietary intake items were selected from this questionnaire and included livestock meat and its products (cured, processed and salted meats), poultry meat, seafood, eggs and their products, vegetables, fruits, bean products, scallions, ginger and garlic, pickles and nuts. All the variables were categorical. Foods consumed more frequently were divided into three categories: every day, 1–6 days per week, and less than 1 day per week. Foods consumed infrequently were divided into two categories: at least one day per week and less than one day per week. Typical patient-reported symptom items were also selected from this questionnaire and included number of lost teeth, frequent bleeding of gums, dysphagia, bloating, heartburn, acid reflux, nausea, vomiting, belching and epigastric pain. The number of lost teeth was categorized into three groups: none, 1–3 teeth and more than 4 teeth. The other variables were categorized as yes or no.

Outcome assessment

The endoscopic examinations were carried out by physicians at local hospitals. Procedures followed clinical guidelines for cancer screening and early diagnosis and treatment in China. Lugol’s iodine staining was used to identify suspicious tissues, which were then biopsied. To grade severity, the esophageal mucosa was classified into 5 categories: normal esophageal mucosa, minor mucosal changes, esophagitis, esophageal squamous simple hyperplasia (ESSH) and esophageal squamous dysplasia (ESD) [13]. ESD was further classified into 3 levels: mild, moderate, and severe. According to the WHO histological classification of tumors, mild and moderate ESD together fall under LGIN, and severe ESD and squamous cell carcinoma in situ are considered HGIN. If there were any inconsistencies, a third pathologist was consulted and the diagnosis was settled through discussion. For participants with multiple lesions, the worst biopsy diagnosis was reported. In this study, we divided the participants into 4 groups: normal control, esophagitis, LGIN and HGIN/ESCC.

Statistical methods

We characterized dietary patterns and symptom patterns, which were assumed to be unobserved, mutually exclusive classes with different probability distributions of the item variables, by performing LCA on the observed responses to the individual items.
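To make the latent class idea concrete, the following is a minimal sketch of a latent class model fitted by plain expectation-maximization. It is an illustration only and not the Mplus procedure used in this study: it assumes binary (0/1) items, whereas several items here have three categories, and the function name `fit_lca` and its parameters are hypothetical.

```python
import numpy as np

def fit_lca(X, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to binary items with plain EM (illustrative).

    X: (n_subjects, n_items) array of 0/1 responses.
    Returns class weights, item-response probabilities per class,
    and the posterior class-membership probabilities per subject.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    weights = np.full(n_classes, 1.0 / n_classes)
    # Random start breaks the symmetry between classes.
    probs = rng.uniform(0.25, 0.75, size=(n_classes, m))
    for _ in range(n_iter):
        # E-step: log joint probability of each subject under each class.
        log_joint = (np.log(weights)
                     + X @ np.log(probs).T
                     + (1 - X) @ np.log(1 - probs).T)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        post = np.exp(log_joint)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class sizes and item probabilities.
        weights = np.clip(post.mean(axis=0), 1e-12, None)
        probs = (post.T @ X) / post.sum(axis=0)[:, None]
        probs = np.clip(probs, 1e-6, 1 - 1e-6)
    return weights, probs, post
```

Assigning each subject to the class with the largest posterior probability (`post.argmax(axis=1)`) yields mutually exclusive groups, while the posterior probabilities themselves quantify the uncertainty of class membership.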

LCA identified latent classes of participants based on the ten dietary variables and six symptom variables. Estimation was conducted with robust maximum likelihood and the expectation-maximization algorithm [14]. Statistical fit indexes were used to assess model fit and to decide the final number of latent classes. The best-fitting model was selected by a combination of the following criteria: (a) the lowest Akaike information criterion (AIC), (b) the lowest Bayesian information criterion (BIC), (c) the lowest Lo–Mendell–Rubin likelihood ratio test (LMR), (d) the lowest Lo–Mendell–Rubin adjusted LMR test (ALMR), and (e) an entropy of 0.6 or greater [15]. Next, we fitted unconditional multivariable logistic regression models to identify sociodemographic and risk factors that predicted class membership. Before conducting the analysis, we performed a covariance diagnosis between the independent variables. Adjusted odds ratios (ORs) and 95% confidence intervals (CIs) were calculated from models that simultaneously included age, gender, education, body mass index (BMI), smoking and drinking. We also used nomograms to model normal controls against each stage of the disease separately. The models were evaluated using calibration curves and decision curves. In addition, for basic descriptive statistics, categorical variables are shown as percentages and continuous variables as means and standard deviations.
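As an illustration of how three of the fit indexes above are commonly computed, the sketch below derives AIC and BIC from a model log-likelihood and a relative entropy from a matrix of posterior class-membership probabilities. The function names are hypothetical, and the entropy formula is the widely used normalized form, assumed here rather than taken from this study.

```python
import numpy as np

def information_criteria(log_lik, n_params, n_obs):
    """Return (AIC, BIC) for a fitted model; lower values indicate better fit."""
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * np.log(n_obs)
    return aic, bic

def relative_entropy(post):
    """Normalized entropy in [0, 1]; values near 1 mean well-separated classes.

    post: (n_subjects, n_classes) posterior membership probabilities.
    """
    n, k = post.shape
    p = np.clip(post, 1e-12, 1.0)  # avoid log(0)
    return 1.0 - (-(p * np.log(p)).sum()) / (n * np.log(k))
```

Candidate models with different numbers of classes would each be scored this way, and the class count would then be chosen by weighing the criteria jointly rather than by any single index.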

LCA was conducted in both cases and normal controls. An analysis restricted to the normal controls was performed to check the robustness of this solution. As the dietary patterns and symptom severity classes identified in controls were consistent (in number and characteristics of the patterns) with those obtained on the overall dataset, we based all our analyses on the overall dataset. To verify the internal reproducibility of the chosen solution, the analysis was repeated several times separately in two randomly selected subsets of the original data.

Statistical analyses were performed using Mplus (version 8.1) and R (version 3.6.3) software. All tests were two-sided and had a significance level of 0.05.

Results

When fitting the LCA models, we selected the solution with 5 classes for dietary patterns and 3 classes for symptom severity in every group, according to the criteria described in the Methods (Tables 1 and 2).

Table 1 Model fit information for latent class analysis of dietary patterns
Table 2 Model fit information for latent class analysis of symptom classes

Latent classes of dietary patterns

Table 3 shows the conditional distribution of food group intake for all participants across the latent classes, for the food groups that were most relevant in discriminating and labeling the clusters. The tables for the normal control group and the other three groups at different stages of disease are given in Supporting Information Tables S1-S3. We identified five classes with probability distributions similar to those of a previous study [10], both for all participants and for the different stages of the disease. Cluster 1, labeled “Healthy pattern,” showed a higher probability of consuming more fruiting vegetables, all other kinds of fruits and high-quality protein, and a lower probability of consuming red meat. Subjects in Cluster 2, “Western pattern,” reported higher consumption of red meat and lower consumption of vegetables and fruits. Clusters 3 to 5 involved the same food groups but differed in the amount of intake. We termed Cluster 3 “Lower consumers-combination pattern” as people in it were less likely to eat all kinds of food. Cluster 4 had a slightly higher probability than Cluster 3 of eating all kinds of food, and we called it “Medium consumers-combination pattern.” Cluster 5 had the highest probability, compared with Clusters 3 and 4, of eating all kinds of food, so we named it “Higher consumers-combination pattern.” The estimated cluster sizes were 28.8% (n = 9990) for the “Healthy pattern,” 9.3% (n = 3216) for the “Western pattern,” 29.1% (n = 10,100) for the “Lower consumers-combination pattern,” 28.7% (n = 9971) for the “Medium consumers-combination pattern” and 4.1% (n = 1430) for the “Higher consumers-combination pattern.”

Table 3 Probabilities of consumption for selected food items by dietary patterns derived from LCA for all participants

Descriptions of the clusters for selected variables are given in Table 4. The proportions of smokers and drinkers were higher in the Western pattern than in the Healthy pattern. For the other three patterns, the proportions of smokers and drinkers gradually increased with increasing food consumption. A similar trend was seen in the other three subgroups (data not shown).

Table 4 Dietary patterns’ characteristics according to selected sociodemographic variables for all participants

Latent classes of severity of symptom

Table 5 shows the conditional distribution of the symptom groups for all participants across the latent classes, for the symptom groups most relevant in discriminating and labeling the clusters. The tables for the normal control group and the other three groups at different stages of disease are given in Supporting Information Tables S4-S6. For all participants and for the different stages of the disease, there were three classes with similar probability distributions. We named the first class “Asymptomatic” as its subjects were reported to be relatively healthy and without symptoms. Class 2 was named “Mild symptoms” as subjects in this cluster reported significant symptoms in terms of gingival bleeding. Subjects in the last class had a high percentage of symptoms reported in all areas, and we named this class “Overt symptoms.” The sizes of the symptom severity classes were 71.9% for “Asymptomatic,” 26.7% for “Mild symptoms” and 1.5% for “Overt symptoms.”

Table 5 Probabilities for selected symptom items by severity of symptom derived from LCA for all participants

Further description of the symptom severity classes for a selected set of variables is shown in Table 6. Demographic differences between the clusters were not particularly significant. A similar trend was seen in the other three subgroups (data not shown).

Table 6 Severity of symptoms’ characteristics according to selected sociodemographic variables for all participants

Logistic regression analysis and nomogram

We applied unconditional multivariable logistic regression to analyze dietary patterns and symptom severity for normal controls vs the different stages of esophageal disease. The covariance diagnosis showed no significant covariance between the independent variables (data not shown). Tables 7 and 8 show the ORs and 95% CIs for all stages of disease, by the five dietary patterns and three symptom severity classes, from the composite model including the relevant confounding and risk variables. Among the dietary patterns, compared to the “Healthy” pattern, the “Western”, “Lower consumers-combination”, “Medium consumers-combination” and “Higher consumers-combination” patterns were positively associated with the risk of esophagitis (OR = 1.42, 95%CI: 1.23–1.53; OR = 2.33, 95%CI: 2.11–2.57; OR = 1.99, 95%CI: 1.80–2.19 and OR = 1.59, 95%CI: 1.32–1.92, respectively). Consistent results were observed for the comparisons of normal controls with the LGIN and HGIN&ESCC groups: all four patterns showed a positive association with the risk of LGIN and of HGIN and ESCC compared to the “Healthy” pattern. Among the symptom severity classes, compared to the “Asymptomatic” class, the “Mild symptoms” class was positively associated with the risk of esophagitis and LGIN (OR = 1.87, 95%CI: 1.66–2.10 and OR = 1.25, 95%CI: 1.11–1.41, respectively), whereas the “Overt symptoms” class was not (OR = 1.04, 95%CI: 0.86–1.25 and OR = 0.82, 95%CI: 0.51–1.31, respectively). However, in the HGIN&ESCC group, the “Overt symptoms” class was positively associated with the risk of HGIN&ESCC compared to the “Asymptomatic” class (OR = 1.58, 95%CI: 1.00–2.50), while the “Mild symptoms” class did not differ significantly from the “Asymptomatic” class (OR = 0.98, 95%CI: 0.58–1.66).
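For readers less familiar with how ORs and their intervals relate to logistic regression output, the sketch below shows the standard Wald conversion: the OR is the exponentiated coefficient, and the 95% CI is obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The helper names are hypothetical, and the round trip uses the reported “Lower consumers-combination” estimate for esophagitis purely as an illustration; the underlying coefficients and standard errors are not reported in the text, and the Wald form is an assumption.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coef_from_reported(or_value, lower, upper, z=1.96):
    """Back out the coefficient and standard error from a reported OR and 95% CI."""
    beta = math.log(or_value)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Round trip with a reported estimate (OR = 2.33, 95% CI 2.11-2.57).
beta, se = coef_from_reported(2.33, 2.11, 2.57)
or_value, lower, upper = odds_ratio_ci(beta, se)
print(round(or_value, 2), round(lower, 2), round(upper, 2))
```

A Wald interval is symmetric around the estimate on the log scale, so the reported OR should lie close to the geometric mean of its bounds.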

Table 7 Logistic regression analysis of associations between dietary patterns with esophageal cancer and precancerous lesions
Table 8 Logistic regression analysis of associations between symptom clusters with esophageal cancer and precancerous lesions

After logistic regression, we built and internally validated a nomogram for each of the three subgroups. The models are presented visually as nomograms (Fig. 2). The C-index of the nomogram was 0.612, 0.684 and 0.746 for the esophagitis, LGIN and HGIN&ESCC groups, respectively, indicating the good predictive ability of the model. The calibration curves also showed good agreement between the observed and nomogram-predicted probabilities (Fig. 3 a-c). In addition, decision curve analyses (DCA) showed positive net benefits of the predictive model across almost all threshold probabilities in the different groups, indicating its favorable potential clinical effect (Fig. 3 d-f).
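For a binary outcome, the C-index is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A minimal sketch of that definition follows; the function name is hypothetical, ties are counted as one half by assumption, and the pairwise loop is written for clarity rather than for speed on a dataset of this size.

```python
def c_index(y_true, y_score):
    """Concordance index for a binary outcome (equivalent to the ROC AUC).

    y_true: iterable of 0/1 labels; y_score: predicted risks.
    Tied case/control pairs count as 0.5.
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from controls.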

Fig. 2

A: Nomogram predicting the risk for the normal control group vs esophagitis; B: Nomogram predicting the risk for the normal control group vs LGIN; C: Nomogram predicting the risk for the normal control group vs HGIN and ESCC. Note: LGIN: low-grade intraepithelial neoplasia; HGIN: high-grade intraepithelial neoplasia; ESCC: esophageal squamous cell carcinoma

Fig. 3

A-C: Calibration curves comparing the nomogram-predicted and the actually observed probabilities for the normal control group vs the esophagitis, LGIN, and HGIN and ESCC groups. Perfect prediction would correspond to a slope of 1 (diagonal 45-degree gray line). D-F: Decision curves of the nomogram predicting the risk. The x-axis represents the threshold probabilities, and the y-axis measures the net benefit calculated by adding the true positives and subtracting the false positives. Note: LGIN: low-grade intraepithelial neoplasia; HGIN: high-grade intraepithelial neoplasia; ESCC: esophageal squamous cell carcinoma
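The net benefit plotted on the y-axis can be sketched as follows, assuming the standard decision-curve definition in which false positives are weighted by the odds of the threshold probability before being subtracted from true positives. The function name is hypothetical, and this weighting is the usual convention rather than a detail stated in the text.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    net benefit = TP/n - FP/n * threshold / (1 - threshold)
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)
```

Evaluating this function over a grid of thresholds gives one decision curve; it is usually compared with the "treat all" and "treat none" strategies.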

Discussion

This study is the first latent class analysis based on a natural population in a high-incidence area of esophageal cancer in China. It is also the largest study to use latent class analysis to describe multiple dietary patterns and symptoms in populations from high-incidence areas of esophageal cancer in China. We identified dietary patterns and symptom severity classes, conceived as mutually exclusive groups of people characterized by similar food intake and symptom clusters, and compared the resulting classes in terms of risk at different stages of disease.

Dietary factors and early symptoms are considered to significantly influence ESCC risk. Recent evidence has come from studies focusing on single foods [16, 17] or single symptoms [18,19,20,21,22,23], rather than on dietary patterns and symptom clusters. Here, we used latent class analysis in a multicenter setting to explore relationships between ESCC risk and dietary patterns characterized by varying levels of fruit, vegetable, and red meat intake. We also investigated the relationships between ESCC risk and symptom clusters characterized by different levels of tooth loss, gingival bleeding, and some specific symptoms. Most previous studies on dietary patterns and esophageal cancer have addressed only esophageal cancer and not precancerous lesions [24]. Our study focused not only on ESCC cases but also on precancerous lesions under different dietary patterns, which makes it more likely to support individualized dietary recommendations for the early prevention of ESCC.

We identified five dietary patterns. A diet rich in vegetables and fruit and low in red meat and pickles (the “Healthy” pattern) had a protective effect on ESCC risk. Compared to this group, the “Western” pattern, distinguished by a low consumption of fruit and vegetables and a high intake of red meat and pickles, was associated with an increased risk of esophagitis, LGIN, HGIN and ESCC. The “Lower consumers-combination” pattern, which reflected a diet deficient in most of the foods considered, was positively related to the risk of the different stages of disease. The higher caloric intake of the “Higher consumers-combination” pattern was also associated with an increased risk of the different stages of disease, and the “Medium consumers-combination” pattern fell somewhere in between.

Our results are consistent with most studies evaluating the influence of diet on ESCC risk [24]. A pattern with a high intake of fruit and vegetables, named “Healthy” or “Vegetable and Fruit,” was identified and associated with a decreased ESCC risk in most of these studies [25,26,27,28]. High dietary fiber intake has also been found to significantly reduce the risk of Barrett’s esophagus and esophageal cancer [29]. As a previous study suggested, inositol hexaphosphate, a component of food sources high in dietary fiber, has been shown in vitro to inhibit the growth of esophageal cancer cells by decreasing cellular proliferation and stimulating apoptosis [30]. The dietary fiber phenolic compounds ferulic acid and p-coumaric acid may have an anti-proliferative effect on the cell cycle [31]. In addition, a high-fiber diet is associated with lower plasma levels of systemic inflammation biomarkers such as tumor necrosis factor-α receptor-2 (TNF-α-R2) and interleukin-6 (IL-6), which may influence the process of carcinogenesis [32].

A potential adverse effect of a diet based mainly on red meat and processed food in general was also found [33, 34]. In a previous study, despite differences in labelling, number of components and weighting, the most commonly identified patterns across countries were an ‘unhealthy’ one (often named ‘Western’) and a ‘healthy’ one [35]. The other three patterns were not common in previous studies. The “Lower consumers-combination” pattern is usually composed of people whose dietary intake is insufficient. Previous studies have reported a significantly increased risk of esophageal cancer in malnourished populations [36]. For the “Higher consumers-combination” pattern, the increased risk of esophageal cancer may be due to high caloric intake. A case-control study conducted in an Iranian population showed that higher intake of calories and total fat significantly increased the risk of esophageal cancer [37]. The increased risk of esophageal cancer in the “Medium consumers-combination” pattern may be due to a high intake of red meat and processed foods and a relatively low intake of fruits and vegetables. The similar results in the subgroup analysis can be explained in the same way.

We identified three symptom severity classes, defined as mutually exclusive groups of subjects characterized by different symptoms. We found that at the stages of esophagitis and LGIN, patients have only mild symptoms, while at the stage of HGIN and ESCC, patients have relatively severe symptoms. If symptoms can be identified early in the disease process, progression of the disease may be reduced. In esophagitis and at the LGIN stage, compared to the “Asymptomatic” class, the “Mild symptoms” class may experience tooth loss and frequent gingival bleeding. In a study of risk factors for esophageal cancer and its precancerous lesions conducted in Henan, China, a high number of missing teeth was found to be a significant risk factor [38]. In other studies of Chinese populations, tooth loss was found to be a risk factor for esophageal squamous cell cancer [19] as well as gastric cancer [39]. A hypothesis frequently cited in the ESCC etiology literature is that incomplete chewing and rapid swallowing of large pieces of food might irritate or damage the esophageal epithelium and subsequently increase the risk of ESCC [40]. In the current study, the relationship between the number of teeth lost and ESCC risk supports this hypothesis. Another hypothesis is that poor oral hygiene and tooth loss promote a bacterial load and “overgrowth” of microorganisms on teeth [41], which can transform nitrates into nitrites that then combine with amines to form carcinogenic nitrosamines, some of which may be gastrointestinal organ-specific carcinogens [42, 43]. Therefore, an association between tooth loss and esophageal disease seems plausible. For people with frequent gingival bleeding, the increased risk of esophageal disease may reflect the poor oral health indicated by bleeding gums, and many studies have shown that poor oral health is a risk factor for the development of esophageal cancer [44,45,46,47].

In the HGIN and ESCC group, compared to the “Asymptomatic” class, the “Overt symptoms” class has an increased risk of HGIN and ESCC. In this class, each symptom occurs in a certain percentage of subjects, especially upper gastrointestinal symptoms such as dysphagia, nausea and vomiting. This indicates that upper gastrointestinal symptoms are already present when esophageal squamous cell carcinoma occurs or is about to occur, which reinforces the importance of intervention at an early stage of the disease.

LCA can bring interesting insights into dietary patterning and symptom severity classes, making it possible to identify prevalent types of eating behavior and symptom severity in a population and to compare risk for people with different types of diet and symptom severity. Traditional methods such as PCA and FA have the disadvantage of not producing mutually exclusive groups. Thus, when the interest is in comparing subgroups of subjects, an additional step of cross-classification of the dimensions/combinations is needed. The application of LCA to this study of different stages of esophageal disease overcomes problems inherent in the traditional methods and offers further advantages for dietary patterning and symptom severity classification, such as a probability-based classification under a general parametric approach and estimation of pattern prevalence.

A nomogram is a convenient tool to predict and quantify the chance of an individual patient progressing to a certain clinical event. Nomograms are helpful in clinical decision making and valuable in risk stratification and individualized treatment [48]. Based on the nomogram, we can calculate an individual risk score and thus estimate the individual risk of disease. Identifying individuals with early-stage disease, coupled with timely intervention, could decrease the incidence of ESCC. In this study, the nomogram we created had good predictive power. Built from age, gender, education, BMI, smoking status, drinking status, dietary patterns, and symptom severity, the model exhibited positive net benefits.

Several limitations of the study should be considered. First, there may be recall bias in our study, which is a common issue in cross-sectional studies. The community-level recruitment approach in this report reduces but does not eliminate this source of bias. Second, our model has only been internally validated and will need external validation in the future if possible. Third, cigarette smoking is a known risk factor for ESCC. Because detailed data on tobacco consumption were not available for cases and controls, residual confounding from smoking may exist, which could reduce the ORs for the associations of dietary patterns and symptom severity with the risk of ESD and ESCC. In addition, because of a lack of genetic information, we did not investigate the genetic impact on the development of esophageal squamous cell carcinoma.

Conclusions

In conclusion, LCA gives further insights into research on dietary patterns and symptoms, allowing the definition of different groups of subjects characterized by different dietary choices and symptoms, the estimation of their prevalence, and the comparison of those groups in relation to important health outcomes. Individuals at high risk of ESCC might be strongly recommended to follow a “Healthy pattern” in the future. Additionally, more dietary and nutritional interventions and health promotion are needed for precision prevention, which is imperative to reduce the incidence and mortality of ESCC.