Background

Colorectal cancer (CRC) stands as a leading cause of cancer-related mortality in both developed and developing countries [1]. Implementation of population-based CRC screening has demonstrated a potential to reduce CRC incidence, garnering strong recommendations [3, 4]. Notably, over 85% of CRC cases originate from pre-malignant adenoma polyps, emphasizing the preventive nature of early detection [5]. The primary objective of CRC screening is to identify pre-symptomatic neoplastic lesions, thereby reducing the overall incidence through timely intervention and examination [6].

The prevailing CRC screening approaches involve fecal immunochemical tests (FIT) coupled with subsequent colonoscopies for positive cases, or periodic endoscopic procedures such as flexible sigmoidoscopy every 5 years or colonoscopy every 10 years [8, 9]. Ongoing considerations include alternative screening methods like fecal DNA analysis and CT colonography [5]. However, the efficacy of any screening program hinges on two pivotal factors: compliance and accuracy [10]. Despite the success observed in various strategies, overall individual compliance remains suboptimal, with rates falling below 52% in CRC screening initiatives [5]. Therefore, there is a growing consensus that novel strategies, encompassing the amalgamation of established tests or the introduction of convenient screening alternatives, could significantly enhance population-based CRC screening adherence [11, 12].

Remarkably, altered microbiota composition has emerged as a potential foundation for a highly sensitive and specific CRC screening test [13,14,15,16,17,18]. Beyond microbiota, their proteins and metabolites contribute to CRC pathogenesis, with reciprocal interactions influencing host proteins and metabolites in CRC development [19]. Significantly, signatures derived from the abundance of bacterial proteins, particularly those associated with signal transduction systems like sensory proteins, hold promise in distinguishing between healthy and diseased states [19].

In this context, our study represents a continuation of previous efforts focused on early CRC detection based on microbial biomarkers [15, 20, 21]. We aim to assess fecal and oral microbiota through 16S rRNA sequencing analysis, exploring the abundance and variation of pathogenic oral and fecal microbiota composition between CRC-positive individuals (CPs) and CRC-negative counterparts (CNs) in the Iranian population. Additionally, we investigate the status of nonpathogenic microorganisms, including probiotics and short-chain fatty acid (SCFA)-producing bacteria, in the feces of CPs compared to CNs. Ultimately, we endeavor to develop classifier models utilizing oral and fecal microbiota profiles, with the intent of enhancing the diagnostic capabilities for early CRC detection with high sensitivity and specificity.

Results

Demographic results

The demographic characteristics of the participants, with p-values for the comparison between CPs and CNs, are presented in Table 1. The two groups had similar distributions of gender, viral infection, alcohol consumption and dietary habit. Profession, family history, disease and surgical history, smoking habit and physical activity differed significantly between CPs and CNs.

Table 1 Demographic characteristics of CRC positives (CPs) and CRC negatives (CNs)

16S rRNA sequencing analysis of clinical samples:

Top 10 microbes with more abundance in CPs versus CNs

We compared the frequency of the 10 microbes that were most abundant in CPs, in both fecal and oral samples and at the phylum, family, and species levels, with their frequency in CN samples (see Fig. 1). Notably, some of these microbes were completely absent in CNs, while others differed significantly in abundance.

Fig. 1

The frequency of top 10 bacteria that were most abundant in oral and fecal samples of colorectal cancer positives (CPs) for phylum, family, and species versus colorectal cancer negatives (CNs) [# = CRC-exclusive bacteria, * = significant CRC vs. normal differences]

In the saliva of CPs, Chloroflexi, Lactobacillaceae, Rivulariaceae, Calothrix parietina, Rothia dentocariosa, and Rothia mucilaginosa ranked among the top 10 microbes, none of which were present in the saliva of CN individuals. Conversely, in the feces of CRC patients, Coprobacillaceae, Enterococcaceae, Neisseriaceae, Streptococcaceae, Bacteroides cellulosilyticus, Coprobacillus cateniformis, Porphyromonas asaccharolytica, Sphingobacterium bambusae, and Streptococcus vestibularis were identified among the 10 most abundant microbes at the family and species levels, with none of them present in CN participants.

Furthermore, our analysis revealed a higher abundance of microbes such as Fusobacteria in the saliva of CRC patients compared to CN individuals. Additionally, Lachnospiraceae and Prevotellaceae were significantly more abundant in the stool of CPs than in controls; these microbes are present in both CNs and CPs, but their quantity is elevated in CPs.

Table 2 reports the median and p-value for each of these 10 most abundant microbes in the saliva and feces of CRC patients compared with CNs at the phylum, family and species levels.

Table 2 Median (first quartile, third quartile) and p-value of each candidate bacterium based on abundance

Non-pathogenic microbiota

An investigation into a range of commensal microbiota, including Lactobacillaceae, Bifidobacteriaceae, Ruminococcaceae, Lachnospiraceae, Lactobacillus, Bifidobacterium, Akkermansia, Roseburia, Faecalibacterium, and Ruminococcus, was conducted in the feces of CPs in comparison to CNs (see Fig. 2). Notably, among all the non-pathogenic microbes analyzed in the stool samples, the genus Akkermansia and the species Akkermansia muciniphila were significantly more abundant in the CN group than in CRC patients.

Fig. 2

Higher abundance of the genus Akkermansia and the species Akkermansia muciniphila, among all non-pathogenic microbes analyzed, in the stool samples of colorectal cancer negatives versus colorectal cancer positives

Based on the microbial variables with the least missing data, 24 microbes in saliva and 27 microbes in stool were selected. AUROC, sensitivity (SE), specificity (SP), PPV, NPV and ACC were calculated for each bacterium. For ROC analysis, four different models were used: logistic regression, support vector machine, naïve Bayes and neural network. Table 3 shows which microbes are most important in predicting CRC. In saliva, the four microbes with the highest AUC were Porphyromonadaceae, Unclassified at Family level, Fusobacteria, and Streptococcus infantis. In stool, the four microbes with the highest AUC were Lachnospiraceae, Proteobacteria, Nitrospirae and Escherichia albertii. Confidence intervals (CI) are reported for SE, SP, PPV, NPV and ACC.

Table 3 The Prediction performance using logistic regression for each microbiota

In Fig. 3, important microbes in predicting CRC in saliva include Streptococcus infantis, Fusobacteria, Actinobacteria, Porphyromonadaceae, Streptococcus tigurinus, Streptococcaceae, Spirochaetes, Unclassified at Family level, and Unclassified at phylum level. Also, important microbes in predicting CRC in stool include Lachnospiraceae, Proteobacteria, Nitrospirae, Prevotellaceae, Escherichia albertii, Ruminococcaceae, Veillonellaceae, Clostridiaceae, and Alcaligenaceae.

Fig. 3

Mean Decrease GINI model for colorectal cancer prediction. A higher mean decrease in GINI indicates that a bacterium is more important in predicting CRC. *Microbes with the highest Mean Decrease GINI are those whose removal most degrades the model's ability to predict CRC and whose presence makes the model more powerful

Combination of selected variable microbiota based on mean decrease GINI model for improvement of the diagnostic ability for early detection of CRC

The microbial variables were selected based on Mean Decrease GINI and then examined in multiple regression, that is, the selected microbiota were used simultaneously in a statistical model to predict CRC. Four models (logistic regression, support vector machine, naïve Bayes and neural network) were fitted with the GINI-selected microbiota. For saliva, the logistic model was preferred over the others because of its simplicity, with an AUC of 91%, SE of 87%, SP of 80%, PPV of 87%, NPV of 80% and ACC of 84% (Table 4). For stool, the support vector machine was the best model, with the highest AUC of 97%, SE of 92%, SP of 93%, PPV of 96%, NPV of 87% and ACC of 90%, outperforming the other models, including simple logistic regression (Table 4).

Table 4 The Prediction performance using logistic regression with selected variables for each microbiota

ROC curves showing the performance of the logistic regression, support vector machine, naïve Bayes and neural network models fitted with the microbiota selected by mean decrease GINI are shown in Fig. 4. At the best cutoff value, this panel of bacteria could be used to discriminate CPs from CNs.

Fig. 4

ROC curves showing the performance of the logistic regression, support vector machine, naïve Bayes and neural network models using selected variables

Discussion

In this study, we conducted the first-ever examination of the integrated microbiome from stool and saliva samples of colorectal cancer (CRC) patients in comparison to healthy controls (CNs) within the Iranian population, utilizing the 16S rRNA sequencing method. The utilization of microbiota as biomarkers for disease and health has gained significant traction, particularly with the advancements in 16S rRNA sequencing technology.

Our results, as depicted in the demographic table, reveal a noteworthy difference between CPs and CNs concerning occupation, physical activity, and smoking habits. Interestingly, housewives and retired individuals exhibited a higher prevalence of CRC compared to working and non-retired individuals. Furthermore, smoking and a lack of exercise were more prevalent among CP patients compared to CNs.

In general, the incidence of CRC tends to be higher in individuals over 50 years old, whereas those under 50 years old, who typically undergo screening, are generally healthier. This age-related discrepancy is a noteworthy factor contributing to the differences observed between the CP and CN groups. Additionally, the occurrence of CRC in individuals with a family history of the disease and a personal history of other illnesses and surgeries was more prevalent than in CNs. This implies that individuals with a susceptibility marked by a history of other diseases and surgeries are more predisposed to CRC than those without such histories.

The distinct microbial profiles observed in CPs and CNs suggest that the microbiome may play a crucial role in the initiation and development of CRC. For instance, Chloroflexi, Lactobacillaceae, Rivulariaceae, Calothrix parietina, Rothia dentocariosa, and Rothia mucilaginosa were among the most abundant microbes in the saliva of CRC patients but were entirely absent in CN individuals. Similarly, Coprobacillaceae, Enterococcaceae, Neisseriaceae, Streptococcaceae, Bacteroides cellulosilyticus, Coprobacillus cateniformis, Porphyromonas asaccharolytica, Sphingobacterium bambusae, and Streptococcus vestibularis were among the most abundant microbes in the feces of CRC patients, whereas they were absent in CN individuals.

While our findings suggest a compelling association between the presence or absence of certain microbes and CRC, it is essential to conduct studies on a larger population to provide more definitive insights. Our results align with the research by Flemer et al. [18], who identified 63 operational taxonomic units (OTU) distinguishing CRC cases from CNs, including 29 oral OTU and 34 stool OTU. Additionally, our findings are consistent with previous studies that have highlighted the ability of specific microbiota to differentiate individuals with CRC or adenoma polyps from healthy individuals.

Notably, research conducted across various geographical regions such as the USA, Canada, Ireland, Spain, China, Colorado, France, and India has explored the increased presence of bacteria in CRC. Despite differences in ethnicity and geography influencing microbial patterns, it is intriguing that many of the microbes identified in these studies closely correlate with those increased in our CRC patients, including Fusobacterium, Porphyromonas, Prevotella, Bacteroides, and Streptococcus [18, 22,23,24,25,26,27,28].

Identifying a group of microbes with higher abundance in CPs than in healthy CNs and demonstrating statistical significance is crucial, as it facilitates the selection of potential biomarker candidates. In our study, we observed an increased number of Fusobacteria in the saliva of CRC patients compared to CNs, as well as a higher abundance of Lachnospiraceae and Prevotellaceae in the stool of CRC patients compared to CNs. Consistent with our findings, Flemer et al. reported differential abundance of certain oral microbiotas between CPs and CNs, including Parvimonas, Haemophilus, Prevotella, Alloprevotella, Neisseria, Lachnoanaerobaculum, and Streptococcus [18].

Furthermore, non-pathogenic microbiota in the human gut or microbiota that produces short-chain fatty acids (SCFA) play a crucial role in human health and disease prevention [29]. In our research, Akkermansia muciniphila showed significantly higher abundance in CNs compared to CPs. Akkermansia muciniphila is an important bacterium that degrades mucin in the gut, and its role is debated regarding whether it is beneficial or harmful [30]. Patients with conditions such as overweight, obesity, type 2 diabetes [31], and inflammatory bowel disease (ulcerative colitis and Crohn's disease) [33, 34] have exhibited reduced levels of Akkermansia muciniphila in their intestines. In contrast to our findings, Wang et al. reported that Akkermansia muciniphila exacerbated the development of colitis-associated CRC in mice [35]. However, similar to our study, Gu et al. concluded that an increased number of Akkermansia muciniphila is associated with protection against inflammatory bowel disease (IBD) and CRC following interventions with nutrients, prebiotics, probiotics, and medications [36]. They noted that despite these therapeutic benefits, some animal studies, such as Wang et al.'s experiment, have reported a negative association with Akkermansia muciniphila [35, 36]. Therefore, it is advisable to consider Akkermansia muciniphila as both a "friend and foe" until additional research and clinical examinations provide further clarity.

A limitation of this study is the small sample size of the cohort, which limits the geographical coverage and the broader applicability of the microbiome-based biomarker approach. These findings should be validated in a larger population. Additionally, there is an age difference between the CPs and CNs, which we aim to minimize in future studies.

Furthermore, utilizing a combination of selected variable microbiota based on the Mean Decrease GINI model platform, we aimed to enhance the diagnostic ability for the early detection of CRC. For saliva, logistic regression emerged as the optimal model due to its simplicity, boasting an AUC of 91%, sensitivity of 87%, specificity of 80%, PPV of 87%, NPV of 80%, and an ACC of 84%. In contrast, for stool, the support vector machine outperformed other models, achieving the highest AUC of 97%, sensitivity of 92%, specificity of 93%, PPV of 96%, NPV of 87%, and ACC of 90%.

In previous work, we examined fecal samples from CRC and polyp cases versus normal individuals in the Iranian population by q-PCR, employing three models (logistic regression, simple linear combination, and factor), and determined specific biomarkers [15]. We identified elevated counts of F. nucleatum, Enterococcus faecalis, Streptococcus bovis, Enterotoxigenic Bacteroides fragilis, and Porphyromonas spp. in CRC stages 0 and I, as well as in adenoma polyp cases, specifically in tubular adenomas and notably in villous and tubulovillous adenomas, in contrast to samples from the normal, hyperplastic, and sessile serrated adenoma groups.

In the current study, however, we investigated the entire fecal and saliva microbiota of CRC patients and CNs in the Iranian population using the 16S rRNA sequencing technique. Statistical modeling was not limited to stool but extended to saliva as well. Sensitivity and specificity were determined, and biomarker candidates were selected. In parallel with our study, Flemer et al. [18] identified 16 oral microbiota OTUs that distinguished CRC patients from CN individuals with a sensitivity of 53% and a specificity of 96%. Their model based on fecal microbiota distinguished CRC patients with a sensitivity of 22% and a specificity of 95%. However, with the combination of oral and stool microbiota, the model's sensitivity increased to 76% for CRC detection.

Furthermore, the biomarkers shared between our study and the studies of Yuan et al., Deng et al., and Choi et al. included Bacteroides, Prevotella, Fusobacterium nucleatum, and Veillonella dispar [37,38,39]. Given the differences and similarities between our study and these findings, we emphasize the need to investigate a large cohort of CP and CN individuals from different geographical populations in Europe, Asia, and America in order to compare the microbiome comprehensively.

Conclusion

Our findings indicate that both oral and fecal microbiota have the potential to differentiate CPs from CNs. Additionally, our study revealed a reduction in the abundance of Akkermansia muciniphila in the stool of patients with CRC. This raises the question of whether these microbes play a crucial role in maintaining health and whether their diminished presence is associated with the pathogenesis of CRC.

Given these observations, further research into the cellular and molecular mechanisms of Akkermansia muciniphila is warranted and should be conducted extensively. Moreover, we recommend larger prospective studies that encompass diverse geographical populations with varying diets. These studies should incorporate the analysis of FIT, fecal microbiota, and oral microbiota composition to validate the promising results obtained in our study.

Methods

Study population

The current study follows a case–control design, and clinical samples, including saliva and stool (n = 80), were gathered from participants who underwent colonoscopy at Taleghani Hospital in Tehran, Iran, between 2020 and 2021. All participants volunteered to take part in the study, and samples were obtained prior to the colonoscopy procedure. Those enrolled in the study presented symptoms such as rectal bleeding, changes in bowel movements, abdominal pains, and anemia, prompting their initial screening. CN individuals also underwent the screening test, and their colonoscopy results indicated normal findings. The inclusion and exclusion criteria are thoroughly detailed in our recently published article [16]. Additionally, demographic information for the studied groups was collected through questionnaire forms.

Stool and saliva samples collection, storage, and extraction

Fecal samples were collected before colonoscopy, at a point when the gut microbiota had returned to baseline levels [15, 20]. These stool samples were preserved at −80 °C at Taleghani Hospital until subsequent analysis. Similarly, saliva samples were stored at −80 °C until used in the experiments. The comprehensive protocol for sample collection has been detailed in our prior study [16].

Patients were diagnosed through colonoscopy and histopathological review of any biopsy. Oral specimens were thawed on ice, and genomic DNA was extracted using the QIAamp DNA Microbiome Kit from Qiagen (Hilden, Germany). In parallel, stool specimens were thawed, and DNA was extracted using the QIAamp DNA Fecal Mini Kit (Qiagen), following the procedures described earlier [21, 22].

PCR amplification and sequencing

The gene-specific sequences used here target the 16S rRNA V3 and V4 regions with the following primers: forward (5′TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG3′) and reverse (5′GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC3′). Each 25 µL PCR was set up as follows: 12.5 µL of 2× KAPA HiFi HotStart ReadyMix, 5 µL of forward primer (1 µM), 5 µL of reverse primer (1 µM), and 2.5 µL of bacterial genomic DNA (5 ng/µL in 10 mM Tris pH 8.5). The thermal cycling conditions were as follows: an initial incubation at 98 °C for 3 min; 30 cycles of denaturation at 94 °C for 30 s, annealing at 55 °C for 30 s, and extension at 72 °C for 30 s; and a final extension at 72 °C for 5 min [16]. Then, 1 µL of PCR product was run on a Bioanalyzer DNA 1000 chip to verify the size. With the V3 and V4 primer pair used in the current study, the expected size on a Bioanalyzer trace after the amplicon PCR step is ~550 bp. Amplicons were purified with AMPure XP beads according to the manufacturer’s protocol to remove contaminants and PCR artifacts. Purified amplicons were used to construct the library following standard protocols, and sequencing was done using the Nextera XT Index Kit on an Illumina NovaSeq platform (Illumina, San Diego, CA, USA) [16].
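As a quick consistency check on the reaction recipe above, the following Python sketch adds up the listed volumes and applies the usual dilution relation (final concentration = stock concentration × added volume / total volume). The variable names are ours, and the derived final concentrations are simple arithmetic on the stated values rather than figures reported by the protocol.

```python
# Volumes (µL) of each component, as listed in the text.
components = {
    "2x KAPA HiFi HotStart ReadyMix": 12.5,
    "forward primer (1 µM stock)": 5.0,
    "reverse primer (1 µM stock)": 5.0,
    "genomic DNA (5 ng/µL stock)": 2.5,
}

total_volume_ul = sum(components.values())

# Dilution: C_final = C_stock * V_added / V_total
primer_final_um = 1.0 * 5.0 / total_volume_ul   # each primer, in µM
mix_final_x = 2.0 * 12.5 / total_volume_ul      # ReadyMix working strength
dna_input_ng = 5.0 * 2.5                        # template mass per reaction

print(f"total volume: {total_volume_ul} µL")
print(f"each primer: {primer_final_um} µM final")
print(f"ReadyMix: {mix_final_x}x final")
print(f"template DNA: {dna_input_ng} ng per reaction")
```

The volumes sum to the stated 25 µL, which brings the 2× master mix to 1× and each primer to 0.2 µM.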

Demultiplexed raw sequences were imported into QIIME2 v.2022-2 [40] and were denoised and clustered using DADA2 [41]. Taxonomy was assigned using a scikit-learn [42] classifier pre-trained on the SILVA [43] 138 99% full-length sequences. The resulting amplicon sequence variant (ASV) table, taxonomy assignment, and appropriate metadata were used as input for the Marker Data Profiling module of the online platform MicrobiomeAnalyst [44]. Features with low counts (< 4 and < 20% prevalence in samples, n = 1815) and those with low variance (based on interquartile range, n = 25) were excluded from the downstream analyses. Counts were normalized using Total Sum Scaling (TSS).
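The low-count filter and TSS normalization can be illustrated with a minimal Python sketch. It assumes one reading of the filter (a feature is kept if it reaches at least 4 reads in at least 20% of samples); the exact rule applied by the online platform may differ. The function names and the toy ASV table are hypothetical and serve only as illustration.

```python
def low_count_filter(table, min_count=4, min_prevalence=0.20):
    """Keep features reaching `min_count` reads in at least `min_prevalence`
    of samples. `table` maps feature name -> list of per-sample counts."""
    kept = {}
    for feature, counts in table.items():
        prevalence = sum(c >= min_count for c in counts) / len(counts)
        if prevalence >= min_prevalence:
            kept[feature] = counts
    return kept


def total_sum_scaling(table):
    """Divide each count by its sample total so every sample sums to 1."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(counts[j] for counts in table.values()) for j in range(n_samples)]
    return {
        feature: [c / t if t else 0.0 for c, t in zip(counts, totals)]
        for feature, counts in table.items()
    }


# Hypothetical toy table: 3 features x 5 samples.
toy = {
    "ASV_a": [10, 0, 30, 5, 5],
    "ASV_b": [0, 0, 3, 0, 0],      # never reaches 4 reads, so it is dropped
    "ASV_c": [90, 20, 70, 15, 45],
}

filtered = low_count_filter(toy)
relative = total_sum_scaling(filtered)
for feature, values in relative.items():
    print(feature, [round(v, 3) for v in values])
```

In this sketch TSS is applied after filtering, matching the order in which the two steps are described; each sample's relative abundances then sum to 1 over the retained features.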

Statistical analysis

Descriptive statistics were presented as mean ± standard deviation (SD) and median (interquartile range [IQR]) for quantitative data by group (CNs and CPs). The independent t-test was applied to compare the mean age between the CRC and normal groups. The Fisher exact test or exact Pearson Chi-Square test was used to evaluate the relation between categorical variables and group. Barplots were used to show the frequency of microbiota and to compare them between the CP and CN groups. The "*" symbol in barplots represents statistically significant differences between CRC samples and normal samples, while the "#" symbol highlights CRC-exclusive bacteria. Analyses were conducted using SPSS (version 26) and R (version 4.2.1). p-values less than 0.05 were considered statistically significant.
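To make the categorical comparison concrete, the sketch below computes a two-sided Fisher exact p-value for a 2×2 table directly from the hypergeometric distribution. The analyses in this study were run in SPSS and R; this standalone Python version is only an illustration, the function name is ours, and the example table is hypothetical rather than taken from Table 1.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one (small relative tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Hypothetical table: rows = group (CP, CN), columns = exposure (yes, no).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 34/70, about 0.4857
```

With the 0.05 threshold stated above, this hypothetical table would not be considered a significant association.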

Machine learning algorithm

In the current study, subjects were randomly divided into two groups: training specimens (70% of samples) and validation specimens (30% of samples). Models were built on the training data and tested on the validation data. Each patient appears in only one of the two sets. The training data were used to develop the models, including logistic regression (LR), naïve Bayes (NB), support vector machine (SVM), and neural network (NN) [45,46,47].
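A random 70/30 split at the subject level can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name, the seed, and the subject identifiers are hypothetical, and the number of subjects is chosen only for the example.

```python
import random


def split_subjects(subject_ids, train_fraction=0.7, seed=0):
    """Shuffle subject IDs and split them into training and validation sets.
    Splitting by subject guarantees nobody appears in both sets."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Hypothetical subject identifiers.
subjects = [f"S{i:02d}" for i in range(1, 81)]
train_ids, valid_ids = split_subjects(subjects)

print(len(train_ids), len(valid_ids))
print(set(train_ids).isdisjoint(valid_ids))
```

Fixing the seed makes the partition reproducible, which matters when several models are to be compared on the same validation set.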

Tune parameters

Each of the methods described here has a number of associated parameters, and selecting the most appropriate values is crucial for producing a model that is both optimal and minimal. Each algorithm was therefore tuned in order to predict disease accurately. Fivefold cross-validation with ten iterations was used to tune each machine learning algorithm, using available statistical code and R packages.
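The resampling scheme can be sketched in Python under the assumption that "ten iterations" means ten repeats of fivefold cross-validation, each with a fresh shuffle. The tuning itself was done with R packages; the generator below (our own naming) only illustrates how the 50 train/test partitions of the training data would be formed.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                       # new shuffle per repeat
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx


# Hypothetical training-set size.
splits = list(repeated_kfold(n_samples=56))
print(len(splits))  # 5 folds x 10 repeats = 50
```

For each candidate parameter value, a model would be fitted on `train_idx` and scored on `test_idx` in every split, and the value with the best average score would be retained.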

Performance evaluation

The area under the receiver operating characteristic (ROC) curve (AUC) was used to estimate and compare models, followed by sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy (ACC). AUC was used as the criterion for selecting the most effective model for clinical decision-making. The ROC curve depicts the sensitivity and specificity of a diagnostic test across cutoff values. An AUC of 0.5 indicates no discrimination (that is, no ability to distinguish cases with the disease from those without), 0.7–0.8 is considered acceptable, 0.8–0.9 excellent, and more than 0.9 exceptional [48]. Sensitivity is the percentage of patients with the disease whom the model predicts to have the disease; to attain 100% sensitivity, the model must correctly recognize all CRC cases. Specificity is the percentage of individuals without CRC whom the model predicts to be CNs; to attain 100% specificity, the model must correctly recognize all CNs. PPV is the percentage of individuals predicted to have CRC who really have it. NPV is the proportion of individuals predicted to be CNs who really do not have CRC. ACC is the number of correct predictions divided by the total number of observations.
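These definitions translate directly into code. The following Python sketch computes SE, SP, PPV, NPV and ACC from the four cells of a confusion matrix, and computes AUC from its rank interpretation (the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half). The function names, counts and scores are hypothetical and are not results of this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Diagnostic metrics from true/false positive/negative counts."""
    return {
        "SE": tp / (tp + fn),                    # cases correctly flagged
        "SP": tn / (tn + fp),                    # controls correctly cleared
        "PPV": tp / (tp + fp),                   # flagged who are true cases
        "NPV": tn / (tn + fn),                   # cleared who are true controls
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # all correct / all observations
    }


def auc_from_scores(scores_pos, scores_neg):
    """AUC as P(score of a case > score of a control), ties counted as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical validation results.
metrics = confusion_metrics(tp=18, fp=5, tn=15, fn=2)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print({name: round(value, 3) for name, value in metrics.items()})
print(round(auc, 3))
```

Because ACC is a prevalence-weighted average of SE and SP, it always lies between the two, which is a useful sanity check when reading reported metrics.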

Selection variable

A random forest was used to characterize variable importance based on the mean decrease in GINI. A higher mean decrease in GINI for a gut bacterium indicates that it is more important in predicting CRC [49]. Fivefold cross-validation with 10 iterations was applied to tune the parameters of the random forest.
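The quantity behind this ranking can be illustrated for a single split. The Python sketch below computes the Gini impurity of a set of binary labels and the impurity decrease obtained by splitting on one feature at one threshold. A random forest accumulates such decreases over every node that splits on a given feature and averages them over all trees; this sketch shows only the single-node building block. The function names, abundance values and labels are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of binary labels (0 = CN, 1 = CP): 1 - p^2 - (1 - p)^2."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting samples at `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children


# Hypothetical relative abundances of one taxon in six samples.
abundance = [0.01, 0.02, 0.03, 0.20, 0.25, 0.30]

informative = gini_decrease(abundance, [0, 0, 0, 1, 1, 1], threshold=0.1)
uninformative = gini_decrease(abundance, [0, 1, 0, 1, 0, 1], threshold=0.1)

print(round(informative, 3), round(uninformative, 3))
```

A taxon whose abundance separates the groups cleanly yields a large decrease, whereas one unrelated to group membership yields a decrease close to zero.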