Background

Colorectal cancer (CRC) is the third leading cancer worldwide by incidence (6.1%) but the second by mortality (9.2%) [1]. The global burden of CRC is expected to increase by 60% by 2030, in terms of both new cases and deaths [2]. Accurate and prompt detection of CRC is essential to improve treatment effectiveness and survivorship. The current diagnosis of CRC requires an extensive visual examination by highly specialized pathologists. Diagnoses are made using digital whole-slide images (WSIs) of hematoxylin and eosin (H&E)-stained specimens obtained from formalin-fixed paraffin-embedded (FFPE) or frozen tissues. The challenges of WSI analysis include the very large image size (> 10,000 × 10,000 pixels) and histological variations in the size, shape, texture, and staining of nuclei, which make the diagnosis complicated and time-consuming [3]. In most modern pathology departments, the average consultative workload increases by ~ 5–10% annually [4]. Current trends indicate a shortage of pathologists around the world, including the USA [5] and low- to middle-income countries [6]. The result is overworked pathologists, which increases the risk of deficiencies in routine work and of laboratory dysfunction with more laboratory errors [4]. While the demand for colon specimen examination in gastroenterology clinics is high, the training time of pathologists is long (> 10 years) [7]. It is thus imperative to develop reliable tools for pathological image analysis and CRC detection that can improve clinical efficiency and efficacy without unintended human bias during diagnosis.

State-of-the-art artificial intelligence (AI) approaches, such as deep learning (DL), are very powerful in classification and prediction. There have been many successful applications of DL, specifically convolutional neural networks (CNNs), in WSI analysis for lung [8, 9], breast [10, 11], prostate [12,13,14], and skin [15, 16] cancers. Most of the existing CNN models for CRC WSI analysis focus on the pathology work performed after cancer determination, including grade classification [17], tumor cell detection and classification [18,19,20], and survivorship prediction [21,22,23]. Although they achieved reasonably high accuracy, their study sample sizes were limited and did not fully represent the numerous defined histologic variants of CRC, which include tubular, mucinous, signet ring cell, and others [24]. These limitations inflate prediction error when the models are applied to different independent samples. Moreover, most of the current DL models were developed from a single data source without thorough validation on independent data, and many reported only patch-level accuracy without diagnosing WSIs or patients. Their general applicability to CRC WSI diagnosis in various clinical settings, which may involve heterogeneous platforms and image properties, remains unclear. A DL approach generalizable to daily pathological CRC diagnosis that relieves the clinical burden on pathologists and improves diagnostic accuracy is yet to be developed [25].

Here, we developed a novel automated AI approach centered on weakly labeled supervised DL, aimed at the first general clinical application of AI in CRC diagnosis. This approach uses the Inception-v3 CNN architecture [26] with weights initialized by transfer learning. Weakly labeled supervised learning is advantageous for training on massive and diverse datasets without exact labeling at the object level (e.g., small cancer cells) [12]. Transfer learning is a highly effective and efficient DL technique for image analysis that can reuse knowledge previously learned from general images for medical image analysis [27]. Our work is based on WSIs from multiple independent hospitals/sources in China (8554 patients), the USA (1077 patients), and Germany (> 111 slides). This study has high practical value for improving the effectiveness and efficiency of CRC diagnosis and thus treatment, and it highlights the broader significance and utility of applying AI to image analyses of other cancer types.

Methods

Colorectal cancer whole-slide image dataset

We collected 14,234 CRC WSIs from fourteen independent sources (Table 1). All data were de-identified. The largest image set was from 6876 patients admitted between 2010 and 2018 to Xiangya Hospital (XH), Central South University (CSU, Changsha, China). XH is the largest hospital in Hunan Province and was established in 1906 with a close affiliation with Yale University [28]. The other independent sources were The Cancer Genome Atlas (TCGA) of the USA (https://portal.gdc.cancer.gov/) [29], the National Centre for Tumor Diseases (NCT) biobank and the University Medical Center Mannheim (UMM) pathology archive (NCT-UMM) of Germany (https://zenodo.org/record/1214456#.XgaR00dTm00, [22]), Adicon Clinical Laboratories (ACL), Inc., and eleven hospitals in China (detailed in Table 1). The hospitals involved are located in major metropolitan areas of China serving a population of > 139 million and include the most prestigious hospitals for pathology in China: XH, Fudan University Shanghai Cancer Center (FUS), Chinese PLA General Hospital (CGH), Southwest Hospital (SWH), and The First Affiliated Hospital of Air Force Medical University (AMU); other state-level esteemed hospitals: Sun Yat-Sen University Cancer Center (SYU), Nanjing Drum Tower Hospital (NJD), Guangdong Provincial People’s Hospital (GPH), Hunan Provincial People’s Hospital (HPH), and The Third Xiangya Hospital of CSU (TXH); and a reputable regional hospital, Pingkuang Collaborative Hospital (PCH). All WSIs were from FFPE tissues, except that a portion (~ 75%) of the TCGA WSIs were from frozen tissues [30]. The process of collection, quality control, and digitalization of the WSIs is described in Supplementary-Text 1.a (see Additional file 1).

Table 1 Usage of datasets from multicenter data source

We formed four datasets (Table 1). Dataset-A includes slides from XH only and was used for patch-level training and testing (Table 2). We carefully selected WSIs to include all common tumor histological subtypes. Using incomplete information on cancer cells/tissues (e.g., location, shape, and demarcation), pathologists weakly labeled the patches from WSIs as either containing or not containing cancer cells/tissues. Two weakly labeled patches are provided as illustrative examples, with two fully labeled patches serving as contrasts (see Additional file 1: Supplementary-Figure 1). Patches from the same patient were all assigned to the same set (either training or testing) so that the training and testing sets were independent (see the sketch below). To ensure an appropriate and comprehensive representation of cancer and normal tissue characteristics, we included an average of 49 patches per tumor sample and 144 patches per healthy sample. The numbers of patches containing a large proportion of cancer cells and of patches containing only a few cancer cells were approximately balanced, so that the patches used for training were representative of cases seen in practice.
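The patient-grouped partitioning can be illustrated by the following minimal sketch (not the original code): it uses scikit-learn's GroupShuffleSplit so that all patches from one patient land in the same partition. Variable names (patch_paths, labels, patient_ids) are illustrative, and the ~ 2:1 training-to-testing ratio follows Table 2.

```python
# Minimal sketch of a patient-grouped train/test split; no patient appears in
# both partitions. Inputs are illustrative placeholders.
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(patch_paths, labels, patient_ids, test_fraction=1/3, seed=42):
    """Return (train_idx, test_idx) index arrays with no patient shared across them."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=seed)
    train_idx, test_idx = next(splitter.split(patch_paths, labels, groups=patient_ids))
    return train_idx, test_idx
```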

Table 2 Dataset-A (training and testing) and Dataset-B (external validation) for patch-level analysis

Patch-level performance was further validated using Dataset-B, which contained 107,180 patches downloaded from NCT-UMM. There were two independent subsets: 100,000 image patches from 86 hematoxylin and eosin (HE)-stained slides of human cancer tissue (NCT-CRC-HE-100K) and 7180 image patches from 25 slides of CRC tissue (CRC-VAL-HE-7K) [22]. The overall split for patch-level training, testing, and external validation was about 2:1:5. All images are 224 × 224 pixels at 0.5 μm per pixel. More details can be found at https://zenodo.org/record/1214456#.XV2cJeg3lhF. The patches were rescaled to the default input size before they were fed to the network for testing.

Dataset-C was used for patient-level validation and is composed of slides from XH, the other hospitals, ACL, and frozen and FFPE samples of TCGA. Given the high imbalance of cancer and non-cancer slides in SYU and CGH (Table 1), they were combined in Dataset-C. In Dataset-C, the area occupied by cancer cells varied in images from different centers. Most (~ 72%) of the slides from the ten hospitals and ACL contained 10–50% cancer cells by area (see Additional file 1: Supplementary-Figure 2).

Dataset-D was used for the Human-AI contest and contained approximately equal numbers of slides from XH, the other hospitals, and ACL. There is an average of ~ 5045 patches per slide, and more than 20% of the slides contain < 1000 patches. Supplementary-Text 1.b summarizes the allocation of slides to the different datasets (see Additional file 1).

After the slides were digitalized, visual verification of the cancer diagnosis labels was performed with high stringency and accuracy. Dataset-A and Dataset-C included more than 10,000 slides, which were independently reviewed by two senior, seasoned pathologists in an initial and a second read. When their diagnoses were consistent with the previous clinical diagnosis, the slides were included in the dataset. If the two experts disagreed with each other or with the previous clinical diagnosis, the slides were excluded. The labels of slides from TCGA were obtained from the original TCGA database, and the labels of Dataset-B were taken from NCT-UMM. The binary labels of Dataset-D for the Human-AI contest were checked even more strictly. Three highly experienced senior pathologists independently reviewed the pathological images without knowing the previous clinical diagnosis. If a consensus was reached, the slides were included; otherwise, two additional independent pathologists joined the review. After discussion among the five pathologists, the sample was included only if they reached an agreement; otherwise, it was excluded.

Study design and pipeline

Our approach to predicting a patient's cancerous status involved two major steps: DL prediction for local patches and aggregation of the patch-level results into a patient-level diagnosis (Fig. 1). The preprocessed WSIs served as the input for patch-level prediction, for which a deep-learning model was constructed to analyze the patches. The patch-level predictions were then aggregated by a novel patch-cluster-based approach to provide slide- and patient-level diagnoses. The accuracy of the patient-level diagnosis is determined largely by the patch-level prediction performance and by the aggregation method. Our empirical results showed that a patch-level sensitivity of ~ 95% and specificity of ~ 99% were sufficient to achieve high predictive power and to control the false positive rate (FPR) at the patient level using our proposed aggregation approach (see Additional file 1: Supplementary-Text 1.c). In addition, a heatmap and an activation map were generated to show the informative areas on the slide. The details of each step are described below.
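As a schematic of this two-step design, the following minimal Python sketch (our illustration, not the original code) composes per-patch cancer probabilities into a slide-level call and then into a patient-level call. The 0.5 probability threshold is an illustrative assumption, and is_positive_cluster stands for the patch-cluster rule detailed in the "Patient diagnosis and false positive control" subsection below.

```python
# Two-step pipeline skeleton: patch-level prediction followed by aggregation.
import numpy as np

def diagnose_slide(patch_probs_grid, is_positive_cluster, prob_threshold=0.5):
    """patch_probs_grid: 2D array of per-patch cancer probabilities arranged in the
    slide's patch grid (background patches set to 0). is_positive_cluster: callable
    implementing the cluster-based rule described later in Methods."""
    positive_map = patch_probs_grid >= prob_threshold
    return int(is_positive_cluster(positive_map))

def diagnose_patient(slide_prob_grids, is_positive_cluster):
    """A patient is called positive if any of his/her slides is classified positive."""
    return int(any(diagnose_slide(g, is_positive_cluster) for g in slide_prob_grids))
```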

Fig. 1

Study pipeline and dataset usage

Image preprocessing for patch-level training

Image preprocessing involved three steps. First, we tiled each WSI at × 20 magnification into non-overlapping 300 × 300 pixel patches, which can easily be transformed to the required input size of most CNN architectures (such as the 299 × 299 input required by Inception-v3 [26]; see Additional file 1: Supplementary-Table 1). Using a smaller patch size than other studies, which used 512 × 512 pixel patches, makes the boundaries of the detected cancer regions more precise [19]. Second, we removed non-informative background patches according to two criteria: the maximum difference among the 3 color channel values of the patch was less than 20, or the brightness of more than 50% of the patch surface was less than 220 in grayscale [8]. Combining these two criteria, we removed background patches while keeping as many tissue patches as possible. Third, regular image augmentation procedures were applied, such as random flipping and random adjustment of saturation, brightness, contrast, and hue. The color of each pixel was centered by the mean of each image, and its range was normalized from [0, 255] to [− 1, 1].
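The background filter and the augmentation/normalization steps can be sketched as follows. This is our illustration under stated assumptions, not the original implementation: the channel-difference criterion is evaluated per pixel, predominantly near-white patches (grayscale above the 220 threshold over more than half the pixels) are treated as background, the jitter ranges are illustrative, and WSI tiling itself (e.g., with OpenSlide) is omitted.

```python
# Illustrative patch filtering and augmentation (a sketch, not the authors' code).
import numpy as np
import tensorflow as tf

def is_background(patch_rgb, spread_thresh=20, gray_thresh=220, frac=0.5):
    """patch_rgb: uint8 array of shape (H, W, 3); True for non-informative patches."""
    spread = patch_rgb.max(axis=-1).astype(int) - patch_rgb.min(axis=-1).astype(int)
    low_color_variation = spread.max() < spread_thresh        # grayish/blank patch
    gray = patch_rgb.mean(axis=-1)                            # simple grayscale proxy
    mostly_bright = np.mean(gray > gray_thresh) > frac        # mostly near-white (assumption)
    return bool(low_color_variation or mostly_bright)

def augment_and_normalize(patch_rgb):
    """Random flips and color jitter, then scaling to roughly [-1, 1]."""
    x = tf.convert_to_tensor(patch_rgb, dtype=tf.float32) / 255.0   # to [0, 1]
    x = tf.image.random_flip_left_right(x)
    x = tf.image.random_flip_up_down(x)
    x = tf.image.random_saturation(x, 0.8, 1.2)
    x = tf.image.random_brightness(x, max_delta=0.125)
    x = tf.image.random_contrast(x, 0.8, 1.2)
    x = tf.image.random_hue(x, max_delta=0.04)
    x = tf.clip_by_value(x, 0.0, 1.0)
    return x * 2.0 - 1.0                                            # to [-1, 1]
```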

Patch-level training by deep learning

Our DL model used Inception-v3 as the CNN architecture to classify cancerous and normal patches. The Inception network applies different kernel sizes in parallel and is particularly powerful for learning diagnostic information in pathological images at differing scales. This architecture has achieved near human-expert performance in the analyses of other cancer types [8, 15, 31, 32]. Several Inception architectures have performed well on the ImageNet dataset and in WSI analysis [33], such as Inception-v1 [34], Inception-v3 [26], and Inception-v4 [35]. We chose Inception-v3 based on an extensive comparison of their patch-level and patient-level performance on the testing sets, which showed that the complexity and multiscale modules of Inception-v3 made it more appropriate for recognizing histopathology WSIs (see Additional file 1: Supplementary-Text 1.d) [26, 34,35,36,37,38,39]. During the study, we also tested some of the most recent architectures, such as DenseNet [37] and ResNeXt [39]; Inception-v3 still performed best in patch-level CRC classification.

We initialized the CNN by transfer learning with pre-trained weights from ImageNet [26], which were optimized to capture the structures in general images [27]. With transfer learning, the model can recognize pivotal image features for CRC diagnosis more efficiently. The 300 × 300 pixel patches were resized to 299 × 299 pixels; accordingly, the patches in the testing sets were rescaled to 299 × 299 pixels (0.37 μm/pixel) before they were fed to the network. The network was then deeply fine-tuned through the following training steps. Given the possibly high false positive rate after aggregating the patch-level results, the optimal set of hyper-parameters was searched randomly with the objective of reaching > 95% sensitivity and > 99% specificity; we showed that, with this objective at the patch level, the error rate at the patient level was well controlled (see Additional file 1: Supplementary-Text 1.c). The network was finalized after 150,000 epochs of fine-tuning the parameters of all layers using the RMSProp optimizer [40] with a weight decay of 0.00004, a momentum of 0.9, and an RMSProp decay of 0.9. The initial learning rate of 0.01 was decayed exponentially over the epochs to a final learning rate of 0.0001. The best result was achieved with a batch size of 64. The training and testing procedures were implemented on a Linux server with an NVIDIA P100 GPU, using Python v2.7.15 and TensorFlow v1.8.0 for data preprocessing and CNN model training and testing.
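For concreteness, a minimal tf.keras sketch of this transfer-learning setup is given below. The original implementation used Python 2.7 / TensorFlow 1.8; here the decay-schedule granularity, the L2 kernel regularizer used to approximate the reported weight decay, and the data pipeline are illustrative assumptions, while the learning rate, momentum, RMSProp decay, and batch size follow the values reported above.

```python
# Sketch of Inception-v3 fine-tuning with ImageNet transfer learning.
import tensorflow as tf

NUM_CLASSES = 2            # cancer vs. normal patches
INPUT_SHAPE = (299, 299, 3)

# Inception-v3 initialized with ImageNet weights; all layers remain trainable.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=INPUT_SHAPE, pooling="avg")
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax",
                                kernel_regularizer=tf.keras.regularizers.l2(4e-5))(base.output)
model = tf.keras.Model(base.input, outputs)

# Learning rate decays exponentially from 0.01 towards 0.0001; decay_steps/rate are assumptions.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.9)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule, rho=0.9, momentum=0.9)

model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / test_ds: tf.data pipelines yielding batches of 64 patches
# (299 x 299 x 3, scaled to [-1, 1]) with 0/1 labels.
# model.fit(train_ds, validation_data=test_ds, ...)
```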

Patient diagnosis and false positive control

Considering the high false positive rate (FPR) that accumulates over multiple patch-level predictions, we proposed a novel patch-cluster-based aggregation method for slide-level prediction, based on the fact that tumor cells tend to gather together (especially at × 20 magnification). Motivated by cluster-based inference in fMRI [41], we predicted a WSI as cancer positive if several positive patches were topologically connected as a cluster on the slide (defined by the cluster size), for example four patches forming a square; otherwise, we predicted the slide as negative. We tested various cluster sizes and chose a cluster size of four, which gave the empirically best balance between sensitivity and FPR in the testing dataset (see Additional file 1: Supplementary-Text 1.e). For a patient with one or multiple slides, denoted by S = {s1, s2, …, sl}, we obtained the patient-level diagnosis D(S) by combining the results from all of the patient's slides: D(S) = D(s1) ∪ D(s2) ∪ … ∪ D(sl), where D(si) = 1 or 0 indicates a positive or negative classification of the ith slide, respectively. The patient is diagnosed as having cancer as long as at least one slide is classified as positive.
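A minimal sketch of this cluster-based rule, usable as the is_positive_cluster callable in the pipeline sketch above, is shown below. It is our illustration rather than the original implementation and assumes 4-connectivity on the patch grid (a 2 × 2 square of positive patches forms one such cluster).

```python
# Cluster-based aggregation of patch predictions into slide/patient calls.
import numpy as np
from scipy import ndimage

def slide_is_positive(positive_map, cluster_size=4):
    """positive_map: 2D boolean array, True where a patch is predicted cancerous.
    The slide is positive if any 4-connected cluster of positive patches reaches
    `cluster_size` (four by default, as chosen in this study)."""
    labeled, n_clusters = ndimage.label(positive_map)   # 4-connected components
    if n_clusters == 0:
        return False
    sizes = ndimage.sum(positive_map, labeled, index=np.arange(1, n_clusters + 1))
    return bool(np.max(sizes) >= cluster_size)

def patient_is_positive(slide_positive_maps, cluster_size=4):
    """D(S) = D(s1) OR ... OR D(sl): the patient is positive if any slide is positive."""
    return any(slide_is_positive(m, cluster_size) for m in slide_positive_maps)
```

A smaller cluster size trades specificity for sensitivity; the screening setting with cluster_size=2 discussed later corresponds to simply lowering this parameter.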

Human-AI contest

Six pathologists (A–F) with 1 to 18 years of clinical practice experience joined the contest (see Additional file 1: Supplementary-Table 2). The pathologists independently provided a diagnosis of cancer or non-cancer for each patient after reading the WSIs in Dataset-D. The pathologists did not participate in the data collection or labeling. An independent analyst summarized and compared, in a blinded manner, the accuracy and speed of the AI and the human experts in performing the diagnoses.

Statistical analysis and visualization

We assessed the performance of the AI and the pathologists in terms of sensitivity, specificity, and accuracy (\( \frac{\#\mathrm{of}\ \mathrm{correct}\ \mathrm{predictions}}{\#\mathrm{of}\ \mathrm{total}\ \mathrm{predictions}} \)). The receiver operating characteristic (ROC) curve, which plots sensitivity against the FPR, and the corresponding area under the ROC curve (AUC) were computed. The AUCs of the AI and of each pathologist across the multiple datasets were compared by the paired Wilcoxon signed-rank test. We examined the pairwise agreement between the AI and the pathologists by Cohen's kappa statistic (K). The statistical analyses were done in R v3.5 (R Foundation for Statistical Computing, Vienna, Austria), using the packages caret, ggplot2, pROC, and psych, among others. The statistical significance level was set at an alpha of 0.05.
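The statistical analyses were carried out in R as described; for readers following the Python code in this paper, an equivalent computation of the main metrics is sketched below (inputs are illustrative placeholders).

```python
# Equivalent metric computation in Python (the reported analysis used R).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, cohen_kappa_score

def diagnostic_metrics(y_true, y_pred, y_score):
    """y_true, y_pred: 0/1 labels; y_score: predicted probability of cancer."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "auc": roc_auc_score(y_true, y_score)}

# Pairwise agreement between two raters (e.g., AI vs. one pathologist):
# kappa = cohen_kappa_score(rater_a_labels, rater_b_labels)
```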

To locate the CRC regions in a WSI, we visualized the WSI as a heatmap based on the confidence score of each patch; brighter regions indicate higher confidence that the classifier considers the region cancer positive. The heatmap was generated in Python (https://www.python.org/) and overlaid on the original WSI using GIMP (https://www.gimp.org/).
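As an illustration of how such an overlay can be produced programmatically (the published figures were composited in GIMP), the sketch below stretches a coarse grid of patch confidence scores over a slide thumbnail with matplotlib; thumbnail and scores are illustrative inputs, not part of the original pipeline.

```python
# Overlay a per-patch confidence heatmap on a slide thumbnail.
import matplotlib.pyplot as plt

def overlay_heatmap(thumbnail, scores, alpha=0.4, out_path="heatmap_overlay.png"):
    """thumbnail: (H, W, 3) RGB image of the slide; scores: 2D array of per-patch
    cancer confidences in [0, 1], laid out in the slide's patch grid."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(thumbnail)
    # Stretch the coarse patch grid over the thumbnail; brighter = higher confidence.
    ax.imshow(scores, cmap="hot", vmin=0.0, vmax=1.0, alpha=alpha,
              interpolation="bilinear",
              extent=(0, thumbnail.shape[1], thumbnail.shape[0], 0))
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight", dpi=200)
    plt.close(fig)
```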

Results

Highest accuracies in patch-level prediction by our model

We divided the 842 WSIs of Dataset-A (Table 1) into 62,919 non-overlapping patches (Table 2) to construct the CNN for patch-level prediction based on fine-tuning of Inception-v3. An average of ~ 75 patches per WSI was included to ensure an appropriate and comprehensive representation of cancer and normal tissue characteristics. Three major CRC histological subtypes were involved in training and testing: 74.76% tubular, 24.59% mucinous, and 0.65% signet ring cell patches, roughly reflecting their clinical incidences [42]. In the training set, 19,940 (46.75%) patches contained cancer and 22,715 (53.25%) patches were normal. On another independent set of 10,116 (49.92%) cancer and 10,148 (50.08%) non-cancer patches, the AI achieved a patch-level testing accuracy of 98.11% and an AUC of 99.83%. This AUC outperformed that of all previous AI studies for CRC diagnosis and prediction (79.2–99.4%) and even that of the majority of studies of other cancer types (82.9–99.9%; see Additional file 1: Supplementary-Table 3, [8, 12, 17, 19, 22, 43,44,45,46,47,48]). The specificity was 99.22% and the sensitivity 96.99%, both outstanding. In the external validation Dataset-B, our model yielded an accuracy and AUC of 96.07% and 98.32% on NCT-CRC-HE-100K, and 94.76% and 98.45% on CRC-VAL-HE-7K, which matched the performance on the in-house data and surpassed the patch-level validation results of other AI studies (AUC 69.3–95.0%; see Additional file 1: Supplementary-Table 3). The patch-level testing and validation results are summarized in Table 3.

Table 3 Patch-level (Dataset-A and Dataset-B) and patient-level (Dataset-C and Dataset-D) performance summary

Diagnosis of CRC at patient level using DL-predicted patches

Our AI approach was tested for patient-level diagnosis on 13,514 slides from 8594 patients (Dataset-C). In the largest subset (5839 patients), from XH, our approach produced an accuracy of 99.02% and an AUC of 99.16% (Fig. 2, Table 3). In the other independent multicenter datasets, our approach performed consistently well. For the FFPE slides from the other hospitals, TCGA-FFPE, and ACL, the AI approach yielded average AUC and accuracy values higher than 97.65% (Fig. 2). For the frozen slides (TCGA-Frozen), the AI accuracy and AUC were 93.44% and 91.05%, respectively (Fig. 2). Our AUC values (91.05–99.16%) were higher than those of other AI-based approaches on independent datasets (83.3–94.1%). Of note, because the majority of those earlier AI approaches were tested on datasets with much smaller sample sizes (see Additional file 1: Supplementary-Table 3), their performance may be over-estimated. The limited number of negative slides in TCGA may create an imbalanced classification problem that needs further investigation, which is beyond the scope of this study. The results on the TCGA-Frozen slides show that our method, although developed on FFPE samples, did learn the histological morphology of cancer and normal tissues, which is preserved in both FFPE and frozen samples. Table 3 summarizes the complete patient-level results.

Fig. 2

Patient-level testing performance on twelve independent datasets from Dataset-C. Left: the radar map of the sensitivity, specificity, accuracy, and AUC in each dataset from Dataset-C. Right: the boxplot showing the distribution of sensitivity, specificity, accuracy, and AUC in datasets excluding XH and TCGA. The horizontal bar in the box indicates the median, while the cross indicates the mean. Circles represent data points

Contest with six human experts

The performance of our AI approach was consistently comparable to that of the pathologists in diagnosing 1831 WSIs from independent centers (Dataset-D, Fig. 3). The AI achieved an average accuracy of 98.06% (95% confidence interval [CI] 97.36–98.75%) and an average AUC of 98.83% (95% CI 98.15–99.51%), both of which ranked in the top three of the seven competitors (the AI plus the six pathologists) and were greater than the averages of the pathologists (accuracy 97.14%, 95% CI 96.12–98.15%; AUC 96.95%, 95% CI 95.74–98.16%). The paired Wilcoxon signed-rank test of AUCs across the multicenter datasets found no significant differences between the AI and any of the pathologists. The AI yielded the highest sensitivity (98.16%), above the pathologists' average (97.47%) (see Additional file 1: Supplementary-Table 4). The two pathologists (D and E) who slightly outperformed the AI have 7 and 12 years of clinical experience, respectively, while the AI outperformed the other four pathologists with 1, 3, 5, and 18 years of experience. Cohen's kappa statistic showed excellent agreement (K ≥ 0.858, average 0.896) between the AI and every pathologist (see Additional file 1: Supplementary-Table 5). Our approach thus generalizes to provide diagnostic support for potential CRC subjects like an independent pathologist, which can substantially relieve the heavy clinical burden and training cost of professional pathologists. Details of the Human-AI contest are given in Supplementary-Tables 4 & 5 (see Additional file 1).

Fig. 3

ROC analysis of AI and pathologists in the Human-AI contest using Dataset-D. The blue line is the estimated ROC curve for AI. The colored triangles indicate the sensitivity and specificity achieved by the six pathologists

The pathologists were all aware that they were competing with our AI and with each other; hence, their performance was achieved under the best possible conditions and with their best effort, representing their highest skill with the least error. Under a heavy clinical workload, however, their performance in terms of accuracy and speed would not be as stable as that of the AI. Current studies of AI in cancer diagnosis using WSIs have shown that AI can make an accurate diagnosis in ~ 20 s [8] or less (~ 13 s in our case). With evolving DL techniques and advancing computing hardware, the AI can continue to improve and provide steady, swift, and accurate first diagnoses for CRC and other cancers.

Slide-level heatmap

Our approach offers an additional distinct feature: a heatmap highlighting potential cancer regions (as patches) in the WSI. In Fig. 4, we present two WSIs overlaid with the predicted heatmap. For both the radical surgery WSI and the colonoscopy WSI, the true cancerous region overlapped closely with the patches highlighted by the AI, as verified by pathologists. More examples are shown in Supplementary-Figure 3 (see Additional file 1). In addition, to visualize the informative regions utilized by the DL model for CRC detection, we provide activation maps in Supplementary-Figure 4 (see Additional file 1).

Fig. 4

Heatmap produced by AI. Top row: WSI from radical surgery (left) and colonoscopy (right); middle row: AI predicted heatmap corresponding to the first row, with white coloration indicating predicted cancer region; bottom row: heatmap overlaid on the corresponding WSI

Discussion

We collected multiple high-quality, comprehensive, and independent human WSI datasets for training, testing, and external validation of our AI-based approach, focusing on the pathological diagnosis of CRC under common clinical settings. We mimicked the clinical procedure of WSI analysis, including image digitalization, slide review, and expert consultation on disputed slides. Unlike other studies [21], we did not apply any manual selection of slides or areas of interest when building the study dataset. Given the complex histologic variants of CRC, we randomly selected training patches from the three most commonly seen subtypes, roughly in proportion to their incidences. The numbers of patches from images with large and small cancer tissue areas were balanced and well represented in the patch-level analysis. The collected images were labeled by agreement of at least two senior experts in CRC pathology (see Additional file 1: Supplementary-Text 1.b). The testing datasets from different locations in China, the USA, and Germany served as a representative pool for validation and generalization. Our dataset thus well represents the slides seen in clinics, and the trained AI model is robust and generalizable to images of different production protocols and image quality.

As this is a fast-growing area, we are aware that several new CNN architectures have been proposed since the completion of the present study, such as DenseNet [37], the Squeeze-and-Excitation network [38], and ResNeXt [39]. We performed an exploratory analysis comparing ResNet152V2, DenseNet201, and NASNetLarge with Inception-v3 in classifying patches. DenseNet201 produced evaluation metrics similar to those of Inception-v3, while the other two architectures yielded lower accuracy and AUC than Inception-v3. Although these newer models have been shown to increase prediction accuracy on the ImageNet dataset relative to Inception-v3, the complexity (depth and number of parameters) and the multiscale modules of Inception-v3 may be better suited to recognizing CRC WSIs. The performance of the new architectures in pathology image analysis and cancer diagnosis deserves dedicated research and more detailed technical comparison. Moreover, we identify other techniques that may extend the current study, such as semi-supervised and unsupervised learning [49, 50], which can learn efficiently from more WSIs with and without labels, and multiscale decision aggregation and data augmentation [51], which can work with limited data. Given the highly accurate performance already achieved by the current approach, future studies can investigate whether and how these techniques might attain the same prediction performance with less data collection and labeling effort.

Several histological types were too rare (less than 0.5% incidence [52]) to be included, such as the medullary, micropapillary, and serrated types. In addition, our AI approach performed only slightly less satisfactorily on frozen samples than on FFPE samples. With WSIs of rare types and more frozen samples available for training in the future, we expect our approach to improve continuously and become more generalizable.

Most previous studies obtained the patient's diagnosis by integrating patch-level recognition results, since it is not feasible to process the large WSI directly. With this strategy, it is difficult to control the FPR accumulated over multiple predictions on individual patches. Recently, Coudray et al. used the proportion of positive patches or the average probability over all patches as the prediction criterion for the WSI [8]. Although their results were verified in three independent datasets (all with small sample sizes; 340 slides), their aggregation method may not be valid for images with only a small area of cancer tissue, for which it will yield false negative findings for cancer patients. Instead, we proposed a novel aggregation strategy for patch-based WSI- or patient-level prediction, which is intuitive and easily balances sensitivity and specificity. Specifically, we aggregated information from clusters of patches that are topologically connected on the slide to determine the cancer status. In practice, setting the cluster size to four is most likely to exceed the average accuracy of pathologists, while a cluster size of two can be used for pathological screening, with an average sensitivity of ~ 99.78% and an average specificity of ~ 72.29% according to our test data (see Additional file 1: Supplementary-Text 1.e).

Conclusions

In summary, we developed a novel AI-based histopathological image classification approach for CRC diagnosis using deep learning, which achieved the best performance with the largest sample size and number of data sources in the field to date. Our approach quickly and accurately distinguished CRC cases from healthy or inflammatory cases and was comparable or even superior to pathologists in testing on large-scale multicenter data. To the best of our knowledge, this is the first AI study to provide a reliable, generalized, and robust auxiliary tool for daily clinical pathology diagnosis in initial CRC screening. Our approach may also be adapted and applied to the histological analysis of other cancer types; the code is available upon request.