Introduction

According to the International Agency for Research on Cancer (IARC), gastric cancer (GC) was responsible for more than 769,000 deaths worldwide in 2020, equating to one in every 13 deaths [1]. GC is more prevalent in males: ~ 49 in 100,000 males suffer from this disease, which is more than twice its prevalence in females (~ 21 in 100,000) [2]. Stomach cancer mainly affects older people; the average age at diagnosis is 68, and more than half of those diagnosed are 65 or older [3].

GC development is a multistep and multifactorial process involving genetic and environmental factors [4]. In addition to the known risk factors of age and gender, there is considerable evidence that unhealthy diets [5, 6], alcohol abuse [7], smoking [8,9,10], and other genetic, environmental and behavioural factors [11,12,13,14] increase the risk of GC development.

The incidence rates of GC in most countries are expected to decrease through 2030; reductions in smoking and in the prevalence of Helicobacter pylori infection, together with dietary improvement, are the likely contributing factors [15].

Ideally, the primary disease risk would be estimated from general information, consuming the least time and resources [16]. This can be made possible by combining clinical knowledge with applied data science [17]. The ultimate product should be an optimal tool that anyone can readily use without expert knowledge, based on their personal information. The application of such tools should help increase awareness and ultimately reduce the burden of disease on the community and the health care system [18].

Classification methods are usually used to develop a risk score and identify high-risk individuals in a population [19,20,21,22]. In this study, we used multivariate analysis to identify individual predictors differentiating GC from non-ulcer dyspepsia (NUD) patients. For this purpose, we applied a logistic regression approach, using 64 predictors of known risk factors, including demographic features, dietary habits, self-reported medical status, narcotics use, and socioeconomic status (SES) indicators. Developing a time- and cost-effective algorithm that uses personal medical history and lifestyle habits to screen subjects for GC risk can provide a tool for filtering dyspeptic patients prior to more invasive screening approaches.

Materials and methods

Study setting

This hospital-based observational study was conducted on a group of Iranian gastric cancer (GC) patients (n = 858), who were consecutively referred (July 2003 to Jan 2020) to the National Cancer Institute of Iran (NCII). Our GC cases were diagnosed with histologically confirmed gastric adenocarcinoma. The non-ulcer dyspeptic (NUD) patients (n = 1132) were those who had been referred for upper gastroscopy but lacked GC. NUD patients were admitted to the endoscopy unit of Amiralam Hospital. Both centers shared similar SES profiles. The anatomic location (subsite) of the tumor was classified as cardia (defined as cardioesophageal junction, oesophagogastric junction and gastroesophageal junction) or non-cardia (all other locations in the stomach) [23]. Histopathologic studies identified the subtype of the gastric tumors as intestinal or diffuse [24].

Trained technicians interviewed each participant at the time of recruitment, using a structured questionnaire. This questionnaire elicited 64 predictors, including demographic features, dietary habits, self-reported medical status, narcotics use, and SES indicators. As a first step, each questionnaire predictor with multiple levels (with the exception of age) was converted into binary groups (S-Table 1).

Based on the properties of our data, and assuming a power of 90% for testing the significance of the odds ratio in the logistic regression model, a minimum of 936 observations was required. Therefore, our data, comprising 1990 observations, had sufficient power for the risk score development.

Statistical analysis

Imputation of the missing data

We used multivariate imputation by chained equations (MICE) [25] to deal with missing data in more than one variable. Two general approaches have been applied for imputing multivariate data: joint modeling (JM) [26] and full conditional specification (FCS), also known as multivariate imputation by chained equations [27].

To validate our imputation method, we conducted the following steps. First, using the bootstrap method [28], based on the distribution of the data, we generated ten copies of our dataset. The multivariate imputation method imputed the missing values in each copy, and five new complete datasets were generated for each copy. In this manner, we obtained 50 complete datasets. The distributions of all variables in the original dataset and in these 50 imputed versions were compared. A variable was kept in the dataset if the deviations in its mean (S-Fig. 1) and standard deviation (S-Fig. 2) did not exceed 0.05. Next, we randomly converted 10% of the observed values of each variable into missing values and imputed them again. This process was repeated 1000 times and the bias of each variable was calculated (S-Fig. 3). The cut-off value for the bias variation was set at 2%, and we aimed to keep the imputation bias under this cut-off value.
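The mask-and-re-impute check described above can be sketched as follows. This is only an illustration: the actual analysis used MICE in R, whereas the sketch swaps in a simple random draw from the observed values (a hot-deck stand-in) so that it runs with the Python standard library alone. The function name `masking_bias` and the toy binary variable are hypothetical.

```python
import random


def masking_bias(values, mask_fraction=0.10, repeats=1000, seed=0):
    """Average signed bias in the mean after masking and re-imputing.

    NOTE: the imputer here is a random draw from the remaining observed
    values (an assumed stand-in), not MICE as used in the study.
    """
    rng = random.Random(seed)
    n = len(values)
    n_mask = int(round(mask_fraction * n))
    true_mean = sum(values) / n
    total_bias = 0.0
    for _ in range(repeats):
        # Randomly turn a fraction of the observed values into "missing".
        masked = set(rng.sample(range(n), n_mask))
        observed = [v for i, v in enumerate(values) if i not in masked]
        # Re-impute each masked value from the observed distribution.
        imputed = [rng.choice(observed) if i in masked else v
                   for i, v in enumerate(values)]
        total_bias += sum(imputed) / n - true_mean
    return total_bias / repeats


# Hypothetical binary predictor: 30% of 1000 subjects coded as 1.
toy_variable = [1] * 300 + [0] * 700
bias = masking_bias(toy_variable)
print("bias within the 2% cut-off:", abs(bias) < 0.02)
```

With a real imputation model, the same loop would be run per variable, and any variable whose bias exceeded the chosen cut-off would be flagged.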

Model development

We used the Chi-square test to measure associations between predictors and outcomes (Table 1). Statistical significance was determined using 2-sided P-values, with values < 0.05 considered statistically significant. We first performed a univariate analysis of all predictors and assessed the association of each with GC vs. NUD, without taking the other predictors into consideration. In this step, we focused on the distribution of each predictor (Table 1). In the next step, we performed multivariate logistic regression analysis on 70% of randomly selected observations and determined the association with each predictor while adjusting for all others (Table 2) [29]. The probability of having GC vs. NUD, based on the logistic model, was calculated [30].

Table 1 Distribution of predictors amongst GC versus NUD patients
Table 2 The results of the multivariate logistic regression model to explore the GC-prone versus NUD-prone predictors

The probability of being GC versus NUD was computed using logistic regression:

$$P(GC)=\frac{1}{1+{e}^{-\left({\beta}_0+{\beta}_1{X}_1+{\beta}_2{X}_2+\dots +{\beta}_k{X}_k\right)}}$$

Where β0 is the intercept term and β1, β2, …, βk are the coefficients associated with the input features X1, X2, …, Xk.
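As a minimal sketch of the formula above, the probability can be computed from an intercept, a list of coefficients, and a subject's feature values. The function name `gc_probability` and the numerical values are hypothetical and are not taken from Table 2.

```python
import math


def gc_probability(intercept, coefficients, features):
    """P(GC) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, features)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical model with two binary predictors; the subject has the
# first risk factor (x1 = 1) but not the second (x2 = 0).
p = gc_probability(intercept=-1.0, coefficients=[0.8, 0.5], features=[1, 0])
print(round(p, 3))
```

A linear predictor of zero corresponds to a probability of 0.5; positive values push the estimate towards GC and negative values towards NUD.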

Patients were divided into two risk groups based on a cut-off point, derived by fixing the sensitivity rate at a minimum of 90% while maximizing the specificity rate. Accordingly, the best threshold for the risk score was identified. We defined the shared percentage of every predictor in our risk calculator as the contribution of that variable to predicting the clinical outcome (GC vs. NUD). This measure is the proportion of the standardized regression coefficient (point estimate) for each predictor relative to their total sum (Table 2 and Fig. 1). The final risk score for each predictor was calculated by multiplying its pertinent point estimate by 100.
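One way to implement this cut-off rule is sketched below: every observed probability is tried as a threshold, and among those reaching the minimum sensitivity, the one with the highest specificity is kept. The function name `choose_cutoff` and the toy data are hypothetical; the study itself derived its thresholds from ROC curve analysis in R.

```python
def choose_cutoff(labels, probabilities, min_sensitivity=0.90):
    """Return (threshold, sensitivity, specificity).

    labels: 1 = GC, 0 = NUD. A subject is classified as GC when the
    predicted probability is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(probabilities)):
        tp = sum(1 for y, p in zip(labels, probabilities)
                 if y == 1 and p >= threshold)
        tn = sum(1 for y, p in zip(labels, probabilities)
                 if y == 0 and p < threshold)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        if sensitivity >= min_sensitivity:
            candidate = (specificity, threshold, sensitivity)
            # Keep the admissible threshold with the highest specificity.
            if best is None or candidate > best:
                best = candidate
    specificity, threshold, sensitivity = best
    return threshold, sensitivity, specificity


# Toy data (hypothetical): five GC and five NUD subjects.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.7, 0.6, 0.35, 0.5, 0.3, 0.2, 0.1, 0.05]
cutoff, sens, spec = choose_cutoff(toy_labels, toy_probs)
print(cutoff, sens, spec)
```

Fixing sensitivity first reflects the screening purpose: missing a GC case is treated as costlier than sending an NUD patient for further testing.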

Fig. 1 Risk score system for segregation of GC from NUD based on our logistic regression model

Model validation

We used the train-test split method [31] for determining the performance criteria (AUC, sensitivity, specificity, precision, false-positive, false-negative, and accuracy rates) of our logistic model (Fig. 2), as well as to assess its internal validity.

Fig. 2 The probability of GC versus NUD based on risk scores

These performance criteria were calculated as follows:

The accuracy rate, which measures the overall correctness of the classification, was calculated as:

$$Accuracy=\frac{TP+ TN}{TP+ TN+ FP+ FN}$$

The sensitivity rate (true positive rate or recall), which measures the proportion of actual positive instances that were correctly identified, was calculated as:

$$Sensitivity=\frac{TP}{TP+ FN}$$

The specificity rate (true negative rate), which measures the proportion of actual negative instances that were correctly identified, was calculated as:

$$Specificity=\frac{TN}{TN+ FP}$$

Where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.

For model validation, the data were randomly divided into the development (70%) and validation (30%) subsets. The performance of our GC risk calculator was determined by examining calibration and discrimination measures. Calibration refers to how closely the predicted probability of having GC agrees with the observed GC status and is assessed by the Hosmer-Lemeshow test [32]. Discrimination expresses the ability of the model to differentiate between individuals with GC versus NUD. This was evaluated by calculating the area under the ROC curve (AUC) [33]. AUC values of 50% and 100% were considered as indicating no discrimination and perfect discrimination, respectively. Risk thresholds that gave a combination of more than 85% sensitivity and maximum specificity were derived from the list provided by the ROC curve analysis. All statistical analyses and data visualizations were done in the R statistical software environment.
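The AUC can be read as the probability that a randomly chosen GC patient receives a higher predicted probability than a randomly chosen NUD patient. The sketch below computes it by direct pairwise comparison, which is equivalent to the area under the ROC curve; the function name `auc_from_scores` and the toy scores are hypothetical.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the share of (GC, NUD) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy predicted probabilities (hypothetical).
gc_scores = [0.9, 0.8, 0.4]
nud_scores = [0.7, 0.3, 0.2]
auc = auc_from_scores(gc_scores, nud_scores)
print(f"AUC = {100 * auc:.1f}%")
```

In this toy case, eight of the nine GC/NUD pairs are ordered correctly; a model with no discrimination would order about half of them correctly.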

Results

Descriptive information

Our observational study included 858 GC [development = 610 and validation = 248] and 1132 NUD [development = 783 and validation = 349] patients, who were entered into this hospital-based study.

All of the 64 questionnaire predictors with multiple levels were converted into binary categories, as presented in S-Table 1. The distribution of each of these predictors amongst GC versus NUD patients, without any adjustment for other predictors, is presented in Table 1. The results of the Chi-square test showed that the distributions of most (47/64) of the predictors differed between GC and NUD patients (Table 1). The association between predictors and GC vs. NUD affected the model: although most variables were independently associated with GC (Table 1), few associations remained statistically significant when adjusted for all other variables (Table 2).

The data obtained for the 64 predictors from our 1990 (GC + NUD) cases included varying degrees of missingness. To remedy this, we used the MICE method to impute the missing values. First, however, it was critical to validate our imputation technique and ascertain a consistent distribution for each predictor thereafter. In the first approach, amongst the 50 regenerated samples, the differences in mean (S-Fig. 1) and standard deviation (S-Fig. 2) between our actual and imputed data did not exceed 0.05 and were thus acceptable. In the second approach, for all 1000 bootstrap-generated samples, the bias was under 0.02 (S-Fig. 3).

Model development

Logistic regression specified the strength of association between each of our 64 predictors and the clinical outcome (GC or NUD). The strengths of association for each of the predictors (if any), while adjusting for all other predictors, are presented via risk scores, shared percentages, and odds ratios (Table 2). Taking into consideration all of the 64 predictors in our model, a risk calculator was created, scoring for GC or NUD (Fig. 1). The obtained risk score ranged from − 1261 to + 2077 (total range of 3338), moving from NUD towards GC. Of this range, the risk score of − 451, equivalent to a shared percentage of 13.49, was assigned to subjects at reference level. The remaining 86.51% of the risk score was contributed by our 64 predictors, with varying shares. Aiming for a minimum sensitivity rate of 90%, the risk score of − 91, coinciding with the probability value of 0.29 (ranging from 0 to 1.0, Fig. 2), was identified as the cut-off point. Keeping in mind that each of the 64 predictors contributed to the final risk score, those which were statistically significant are described below.
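As a rough consistency check of this cut-off, and assuming that a subject's total risk score equals 100 times the linear predictor of the logistic model (an assumption based on the scoring rule in the methods, not a stated result), the risk score of − 91 maps to:

$$P(GC)=\frac{1}{1+{e}^{-\left(-91/100\right)}}=\frac{1}{1+{e}^{0.91}}\approx 0.29$$

Under this assumption, the score-based and probability-based expressions of the cut-off agree.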

GC-prone predictors

The predictors which acted towards the development of GC are considered as GC-prone. In the demographic category, older age held the first place, yielding risk scores of + 134, + 221, and + 241 for subjects aged > 50 – 60, > 60 – 70 and > 70, respectively, in reference to those aged ≤ 50 years. These values were sequentially equivalent to 4.02, 6.60, and 7.23 shared percentage (SP) of the total range. Next in line were male gender and non-Fars/mixed ethnicity, with risk scores of + 119 (SP = 3.55%) and + 89 (SP = 2.66%), respectively. Amongst the SES factors, the illiteracy of both parents and residence in a rural area contributed + 77 and + 78 risk scores, respectively, which contributed 2.35 and 2.3 shared percentages to the score. In regard to the medical status of the subjects, having a personal and family history of GI cancers provided a GC risk score of + 173 (SP = 5.19) and + 57 (SP = 1.72), respectively. Modifiable lifestyle behaviors, such as diet and use of narcotics, took the subsequent positions. The dietary GC-prone predictors included drinking hot tea [+ 53 (SP = 1.58)], consumption of medium-to-high amounts of cheese [+ 47 (SP = 1.42)], use of table salt [+ 46 (SP = 1.39)], late dinnertime [+ 34 (SP = 1.03)], and consumption of medium-to-high amounts of eggs [+ 32 (SP = 0.96)] (Table 2).

Association with the subtype and subsite of GC

Some of the above-mentioned GC-prone predictors were also associated with the subsite and/or histologic subtype of the tumor. Amongst these, age was closely associated with the intestinal histologic subtype of GC (P = 0.003). History of GI cancer was associated with the cardia anatomic location (P = 0.02) and intestinal histologic subtype (P = 0.032) of the tumor. The predictors of drinking hot tea (P = 0.004) and consumption of table salt (P = 0.04) were associated with the cardia subsite of the GC tumors.

NUD-prone predictors

However, there were some predictors that acted towards the development of NUD; in other words, they were NUD-prone. These predictors belonged to the two categories of SES and diet. The predictors of the former category included: rented residential place (during childhood) [− 62 (SP = 1.85)], rural-birth place [− 57 (SP = 1.69)], and lacking refrigerator (during childhood) [− 46 (SP = 1.39)]. Predictors of the dietary habits included: taking vitamins [− 60 (SP = 1.81)], consumption of smoked rice [− 57 (SP = 1.72)], medium-to-high consumption of yoghurt [− 39 (SP = 1.16)], and consumption of tuna fish [− 31 (SP = 0.94)] (Table 2).

Model validation

To validate the results of the above-described development model, a validation approach was taken, assessing 597 individuals (248 GC and 349 NUD; GC:NUD ratio, 1:1.41), on which the ROC analysis was performed (Fig. 3). Using our 64 predictors, we were able to differentiate GC from NUD with an AUC of 86.37% and sensitivity, specificity, and accuracy rates of 85.89, 63.9 and 73.03%, respectively. According to this model, the rates of false positives and false negatives were 36.1 and 14.11%, respectively (Fig. 3).
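The reported rates can be checked against one another with the accuracy, sensitivity, and specificity formulas from the methods. The confusion-matrix counts below are not reported in the text; they are back-calculated from the 248 GC and 349 NUD validation subjects and the stated rates, and should be read as an assumption. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Performance rates (in %) from confusion-matrix counts."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "false_positive_rate": 100.0 * fp / (fp + tn),
        "false_negative_rate": 100.0 * fn / (fn + tp),
    }


# ASSUMED counts, back-calculated from the reported rates:
# 248 GC subjects -> 213 detected, 35 missed;
# 349 NUD subjects -> 223 correctly classified, 126 flagged as GC.
metrics = classification_metrics(tp=213, tn=223, fp=126, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumed counts, the sensitivity, specificity, accuracy, false-positive, and false-negative rates all round to the values reported above.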

Fig. 3 ROC curve analysis of our model differentiating GC from NUD

We also evaluated the calibration of our model by the Hosmer-Lemeshow method, which yielded a P value of 0.4761 (well above 0.05). Thus, there was no evidence of poor fit, and the calibration of our model was considered adequate. Our risk calculator can, therefore, calculate the risk of GC versus NUD, based on the probability derived from our 64 predictors, proportional to the achieved risk score (Fig. 3).

Discussion

Gastric cancer, being a silent killer, usually catches patients and their health service providers off-guard. Being able to assign a relative risk to subjects, based on their demographic characteristics and lifestyle behaviours, will provide an upper hand in focusing on at-risk subjects, with subsequent stepwise clinical testing and follow-ups. The goal of this study was to develop an approach to accomplish the primary screening step based on our target predictors.

In 2023, a multicentre population-based study carried out on over 416 thousand subjects (aged 40 – 75 years) in China developed a GC risk calculator, which highlighted 11 demographic and lifestyle variables that place individuals at risk of GC [34]. Although our study was hospital-based and screened Iranian dyspeptic patients, the variables common to these two studies still identify age, gender (male), education (illiteracy of parents), salt intake, and personal and family history of cancer as definite risk factors. In another population-based screening study in Korea, on subjects (aged 40–74 years) with no history of cancer, six risk factors were identified [35], of which salt intake was a prominent risk factor shared with our hospital-based screening study.

In 2019, a population-based study was conducted in China to assess the general knowledge about GC risk factors and symptoms. The analysis was performed on 1200 adults over the age of 18, with an average age of 40, and showed that the mean score for GC knowledge was 8.85 out of 22. Of the 1200 participants, 564 (47.0%) had insufficient understanding of GC risk factors and warning symptoms. Overall, about 84% of people believed that screening helped diagnose GC. However, only 15.2% of people were screened for GC. There were various reasons for avoiding screening, including being asymptomatic, fear of diagnostic screening and its outcomes, male gender, living in rural areas, lower educational levels, etc. [36]. Hence, the lack of routine screening and the absence of specific symptoms for this fatal disease leave most subjects undiagnosed until the terminal stages, which accounts for GC being known as a silent killer [37, 38].

Several methodological studies on GC have been conducted over the years [13, 39,40,41,42,43]. A concerted strategy for the joint analysis of these investigations may allow new insights into the etiology of GC. Therefore, the ‘Stomach cancer Pooling (StoP) Project’ was set up in 2012 to bring together several investigators and create a consortium of epidemiological investigations on risk factors for GC. The final aim of the StoP Project was to examine the role of several lifestyle and genetic determinants in the etiology of GC, through pooled analyses of individual-level data [44].

In our study, we intended to investigate the effects of all potential risk factors, even if they were not statistically significant, so as to create an all-inclusive risk calculator.

The GC-prone factors identified via our model, which are also supported by previous studies, include older age [34, 45,46,47], male gender [34, 48], non-Fars/mixed ethnicity [49,50,51], illiteracy of both parents [52,53,54,55], family history of GI cancers [56,57,58,59], drinking hot tea [60], late dinnertime [61, 62], consumption of table salt [63,64,65], and consumption of medium-to-high amounts of cheese and eggs [66,67,68,69]. Using a logistic regression model, we have developed a gastric cancer risk calculator with sensitivity, specificity, and accuracy rates of 85.89, 63.9, and 73.03%, respectively, which can be used by individuals or their healthcare workers for primary screening of dyspeptic patients.

In 2007, Driver et al. [21] developed a simple scoring system that identifies men at increased risk of colorectal cancer, based on age and modifiable behaviours, such as alcohol intake, smoking status, and body mass index. They ran a logistic regression model as well as a proportional hazards model, to better simulate a screening decision based on the information obtained. The discrimination power of the final model was about 70% (AUC = 69.5%) [21]. In comparison, our risk score had a discrimination power (AUC) of 86.37% during internal validation. Keeping in mind that this is the primary screening step, to be followed by simple and complex clinical testing, the limited detection rates we have obtained herein for a primary questionnaire-based surveillance are acceptable.

The strengths of our study include its reasonable sample size and inclusion of a wide variety of target demographic and lifestyle behaviours. However, we have used a case-case setting in order to be able to add other clinical data in the next rounds of clinical and paraclinical screening. Having compared GC patients with non-GC (non-ulcer dyspeptic, NUD) patients, the scale bar of our risk score moves towards the direction of GC (GC-prone) or NUD (NUD-prone) and is, at best, suitable for screening dyspeptic patients, rather than the general population. Thus, our risk calculator must be adjusted by applying the model in a case-control (GC versus healthy population) setting. It must also be kept in mind that some of the highlighted risk indicators may actually be proxies for other unaddressed predictors. Furthermore, the fact that we had to turn our multinomial levels (answers) into binomial ones may have oversimplified our model. Another point of concern is that this model has yet to be externally validated on other sample cohorts with diverse environmental, cultural, and social characteristics.

Nevertheless, applying such an inexpensive GC risk calculator, using questionnaire-based information, can provide the first step in screening at-risk Iranian patients, to be followed by more complex laboratory and clinical screenings. Furthermore, providing information about individualized GC risk status can lead to attempts at correcting the modifiable risk behaviours. Future studies should include validation of this model in case-control settings and in different geographic locations.