Introduction

Pancreatic ductal adenocarcinoma (PDAC) is a common digestive system malignancy, accounting for about 90% of pancreatic cancers [1], and the five-year relative survival rate of PDAC is only 10.8% [2]. Statistical data from 2021 [3] showed that pancreatic cancer ranked fourth in mortality among malignant tumors, and one study [4] estimated that pancreatic cancer may become the second leading cause of cancer death by 2030. Autoimmune pancreatitis (AIP) is a distinct form of chronic pancreatitis whose typical characteristics include focal sausage-like swelling of the pancreas and irregular stenosis of the main pancreatic duct [5, 6]. PDAC is treated by invasive surgical resection [7], whereas patients with AIP are treated with corticosteroids or rituximab [8]. However, the clinical symptoms and imaging manifestations of PDAC and AIP are very similar [9]. A systematic review showed that nearly one third (29.7%, 95% CI 18.1%–42.8%) of AIP patients in China undergo unnecessary pancreatectomy because of suspected malignancy [10]. Conversely, PDAC patients may be misdiagnosed with AIP, which delays the optimal window for surgical treatment. Therefore, the crucial issue is how to distinguish these two diseases accurately and noninvasively and to avoid misdiagnosis as much as possible.

Common clinical methods for diagnosing PDAC and AIP fall into three categories: serum examination, imaging examination, and histopathological examination. Serum marker detection and analysis [11, 12] have not yet yielded standard laboratory parameters for distinguishing these two diseases, and the correlation between markers and disease is still controversial [13, 14]. Invasive histological examination of pancreatic lesions [15] has limitations [16, 17] when insufficient histological samples are available. In imaging examination, studies of computed tomography (CT), magnetic resonance imaging (MRI), and contrast-enhanced ultrasound (CEUS) have provided valuable information for the identification of PDAC and AIP [18,19,20,21]. Compared with traditional imaging methods, 18F-fluorodeoxyglucose positron emission tomography/computed tomography (18F-FDG PET/CT) imaging can not only display anatomical information such as lesion morphology and density but also provide functional information such as lesion metabolism and blood flow. We expected that multimodal features based on PET/CT images, combining the advantages of PET and CT images, could further improve the accuracy of PDAC and AIP classification.

Radiomics obtains statistical features from clinical medical images through high-throughput computing and converts medical digital images into minable, quantitative, high-dimensional data, thereby revealing subtle traces of disease. Driven by precision medicine in recent years, many radiomics-based analysis methods have been applied to clinical decision-making tasks, including diagnosing coronavirus disease, lung cancer, and breast cancer [22,23,24]. Radiomics is also the most commonly used modeling method for distinguishing PDAC and AIP [25,26,27]. However, the radiomics method relies on accurate delineation of the lesion area, and differences in edge details affect the prediction results, so we attempted to introduce deep learning to compensate for this limitation. A deep convolutional neural network (CNN) learns a deep nonlinear mapping of the data through layer-by-layer training and can obtain digital features from images that reveal the complex abstract information contained in massive data. Deep learning algorithms are widely used in image segmentation, classification, and recognition; they have made outstanding achievements in various fields of medicine [28,29,30,31] and have become an indispensable tool in clinical medical research. Previous studies based on medical imaging have shown that deep learning can provide new research ideas and breakthrough diagnostic information for identifying PDAC and AIP [32,33,34].

In recent years, combining radiomics and deep learning methods has become a research hotspot in clinical medicine. This hybrid approach has been used to classify gastrointestinal stromal tumors and cervical lymph nodes [35, 36] and to predict IDH status in gliomas [37], but most previous papers discussed feature fusion based on single-modality features from CT or MRI. We therefore propose a multidomain fusion model based on PET/CT images that draws on the complementary advantages of the multidomain information formed by deep learning and radiomics and can effectively improve the diagnostic accuracy for PDAC and AIP. This method is expected to reduce misdiagnosis, preserve the treatment window for PDAC patients, and reduce unnecessary invasive surgery for AIP patients.

Materials and methods

Dataset

The initial database search identified 159 patients; those lacking PET or CT images, with incomplete clinical records, or with a history of surgery were excluded. The remaining 112 patients were included in this retrospective study, and the screening process is shown in Fig. 1. Sixty-four patients with PDAC and 48 patients with AIP underwent 18F-FDG PET/CT examinations at the hospital from February 2011 to June 2019. All AIP patients were diagnosed according to the criteria established at the 14th International Pancreatology Conference: 25 cases were confirmed by histological or cytological examination, and 23 were confirmed by noninvasive means such as medical imaging, serological indicators, and medical history. All PDAC patients were diagnosed by histology or cytology. Detailed clinical statistics of the 112 patients are shown in Table 1.

Fig. 1
figure 1

Study flowchart for selection criteria

Table 1 Detailed clinical statistics of patients with PDAC and AIP

Image acquisition and processing

18F-FDG PET/CT images of all patients were acquired on a Siemens Biograph64 PET/CT scanner. Before PET/CT scanning, patients fasted for at least 6 h, and 18F-FDG (3.70–5.55 MBq/kg) was injected intravenously when blood glucose was < 11.1 mmol/L. PET/CT imaging was performed after patients rested quietly in the lounge for about 60 min. The body topogram was acquired at a tube current of 30 mA and a voltage of 120 kV. Next, whole-body CT scans were performed with a scan time of 18.67–21.93 s, followed by whole-body PET scans covering 5–6 bed positions with a total acquisition time of 10.0–15.0 min. The TrueX iterative algorithm was used to reconstruct the PET images, with CT values used for attenuation correction. The in-plane spatial resolutions of the PET and CT images are 4.07 mm and 0.98 mm, and the matrix sizes are 168 × 168 and 512 × 512 pixels, respectively. The CT scanning parameters included a tube voltage of 120 kV, a tube current of 170 mA, and a slice thickness of 3 mm.

The pancreatic lesion area was manually segmented in 3D Slicer software to obtain the region of interest (ROI) from the PET/CT images (Fig. 2). To reduce inter-observer variability, the ROI was delineated by a nuclear medicine physician and rechecked by a second nuclear medicine physician. If revision opinions were disputed during the review, a third, senior nuclear medicine physician was invited to join the discussion and confirm the output. All physicians involved in ROI confirmation had more than 10 years of experience in diagnosing pancreatic diseases.

Fig. 2
figure 2

Manually delineated ROIs of pancreatic lesions in 3D Slicer software

To ensure a balanced distribution of experimental samples, we selected the slice sequences containing lesions from each patient's 3D image to generate a new PET/CT dataset (AIP: 612 slices, PDAC: 577 slices). Our preprocessing pipeline for the sample images is shown in Fig. 3. First, the pixel values of the CT and PET images were converted into Hounsfield units (HU) and standardized uptake values (SUV), respectively. Second, we restricted the CT pixel values to a threshold range (− 10 ≤ HU ≤ 100) to reduce the interference of fat, bone tissue, and other factors on the texture features. The PET images were then resampled with bilinear interpolation to match the spatial resolution of the CT images. Finally, taking the centroid of the ROI as the midpoint, we cropped the CT, PET, and lesion label images into 64 × 64 patches as the input of the classification model; this not only reduces over-reliance on the ROI but also preserves the surrounding relevant details.

Fig. 3
figure 3

Preprocessing workflow for PET and CT images
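The following is a minimal sketch of this preprocessing step in Python (not the authors' exact code): it assumes the DICOM rescale to HU and the SUV conversion have already been applied, and uses hypothetical 2D inputs `ct_hu`, `pet_suv`, and a binary lesion `mask` aligned with the CT slice.

```python
# Preprocessing sketch under the assumptions above; boundary handling omitted.
import numpy as np
from scipy import ndimage

PATCH = 64  # patch size used as model input, as described in the text

def preprocess_slice(ct_hu, pet_suv, mask, hu_range=(-10, 100)):
    # 1) Clip CT values to the window used to suppress fat/bone interference.
    ct = np.clip(ct_hu, *hu_range).astype(np.float32)

    # 2) Resample the PET slice to the CT matrix with bilinear interpolation.
    zoom = (ct.shape[0] / pet_suv.shape[0], ct.shape[1] / pet_suv.shape[1])
    pet = ndimage.zoom(pet_suv.astype(np.float32), zoom, order=1)

    # 3) Crop 64 x 64 patches centred on the ROI centroid.
    cy, cx = ndimage.center_of_mass(mask)
    y0, x0 = int(round(cy)) - PATCH // 2, int(round(cx)) - PATCH // 2
    crop = lambda img: img[y0:y0 + PATCH, x0:x0 + PATCH]
    return crop(ct), crop(pet), crop(mask)
```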

Model architecture and implementation

The model design of this study is shown in Fig. 4 and consists of three parts: feature extraction (part A), feature fusion (part B), and classification prediction (part C).

Fig. 4
figure 4

Overall flowchart of our proposed multidomain features fusion classification model. A features extraction part; B features fusion part; C classification prediction part. RAD_PET/CT, radiomics features; DL_PET/CT, deep learning features; MF_PET/CT, multidomain fusion features

In the feature extraction part, we extracted two groups of features: radiomics features and deep learning features. The open-source Pyradiomics package [38] in Python was used to extract statistical features from the PET and CT images. For deep features, we adopted the VGG11 network framework [39], which contains five blocks, each with 3 × 3 convolution kernels. The VGG11 model was trained on the PET and CT images simultaneously to obtain high-level semantic features.
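As an illustration, a hedged PyTorch sketch of the deep-feature branch is given below; the use of torchvision's VGG11 backbone and a 4096-dimensional output per modality follow the description in the text, while the grayscale-to-three-channel repetition and the choice of weights are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11

class VGG11Features(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = vgg11(weights=None)     # five conv blocks, 3x3 kernels
        self.features = backbone.features
        self.avgpool = backbone.avgpool    # adaptive pooling to 7 x 7
        self.fc = backbone.classifier[:2]  # first Linear + ReLU -> 4096-d

    def forward(self, x):                  # x: (B, 1, 64, 64) image patch
        x = x.repeat(1, 3, 1, 1)           # grayscale -> 3 channels (assumption)
        x = self.features(x)
        x = torch.flatten(self.avgpool(x), 1)
        return self.fc(x)                  # (B, 4096) high-level features

# One extractor per modality; concatenating the PET and CT vectors yields the
# 8192-dimensional multimodal deep feature described below.
```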

In the feature fusion part, we combined the features of the PET and CT images to form multimodal PET/CT features, and then fused the radiomics and deep learning features at the decision layer to obtain the multidomain feature set of the PET/CT images. We thus obtained three feature sets: radiomics features, deep learning features, and multidomain fusion features.

Based on the feature extraction process described above, we established three classification models: a radiomics classification model (RAD_model), a deep learning classification model (DL_model), and a multidomain fusion classification model (MF_model):

1. RAD_model: Radiomics features include texture features (75), histogram features (18), and morphological features (9). The radiomics features of the PET and CT images were concatenated to obtain the PET/CT multimodal features (195), which were fed into the fully connected layer (the morphological features of the PET and CT images are identical, so only one copy was retained in the multimodal features).

2. DL_model: The VGG11 network was used to extract the deep learning features of the PET and CT images separately, and the multimodal PET/CT features (8192) were then obtained through feature fusion in the fully connected layer. The parameters of the feature extraction layers were fixed, and the linear block was adjusted to complete the binary classification task.

3. MF_model: We integrated the radiomics and deep learning features to form the multidomain PET/CT features (8387), which were input into the fully connected layer to classify PDAC and AIP (a minimal sketch of this fusion head follows this list). The model was expected to capture valuable information from the new feature set and to exploit this complementarity for identification.
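Below is a minimal sketch of the decision-level fusion head of MF_model, assuming the feature dimensions stated above (195 radiomics + 8192 deep learning = 8387); the hidden width, dropout rate, and two-class output layer are illustrative choices, not the authors' reported settings.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, rad_dim=195, deep_dim=8192, hidden=256, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(rad_dim + deep_dim, hidden),  # 8387 -> hidden (assumed width)
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),                        # illustrative regularization
            nn.Linear(hidden, n_classes),           # PDAC vs AIP logits
        )

    def forward(self, rad_feats, deep_feats):
        # Concatenate the (normalized) radiomics and deep-learning vectors,
        # then classify with the fully connected layers.
        fused = torch.cat([rad_feats, deep_feats], dim=1)  # (B, 8387)
        return self.classifier(fused)
```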

Feature correlation analysis

Radiomics features can be divided into morphological features, first-order features, and texture features, which carry different statistical meanings [40]. Morphological features describe the geometric properties of the ROI, such as volume, surface area, and the surface-to-volume ratio. First-order features, sometimes called intensity features, reflect voxel statistics and global properties within the ROI. Texture features focus on the statistical relationships between neighboring voxels and capture the spatial variation of voxel intensity levels. Deep learning can capture image differences that human eyes cannot notice, and the diversity of feature maps makes the acquired abstract features extremely rich.

To analyze the correlation and complementarity between the radiomics features of different statistical properties and the deep learning features, we combined the morphological, first-order, and texture features with the deep learning features to form new feature sets. As shown in Fig. 5, the different feature sets were fed into the fully connected layer for classification, yielding six prediction results. By comparing these classification results, we hoped to infer the categories of information contained in the deep learning features.

Fig. 5
figure 5

Fusion of deep learning features with different statistical feature classes of radiomics
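A hedged PyRadiomics sketch of how such radiomics subsets might be extracted is shown below; the helper name `extract_subset` and the specific extractor settings are assumptions, since the paper does not list them.

```python
from radiomics import featureextractor

def extract_subset(image_path, mask_path, classes):
    """Extract only the requested PyRadiomics feature classes,
    e.g. classes=['firstorder'], ['shape2D'], or ['glcm', 'glrlm']."""
    extractor = featureextractor.RadiomicsFeatureExtractor()
    extractor.disableAllFeatures()
    for cls in classes:
        extractor.enableFeatureClassByName(cls)
    result = extractor.execute(image_path, mask_path)
    # Keep only numeric feature values (drop diagnostic metadata entries).
    return {k: v for k, v in result.items() if not k.startswith('diagnostics')}
```

Each returned subset can then be concatenated with the deep learning feature vector to form one of the six feature sets shown in Fig. 5.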

Statistical analysis

Continuous data were described as mean ± SD, and discrete data and qualitative variables were expressed as counts or percentages. A five-fold cross-validation strategy was used to ensure the stability of the model and to reduce chance effects of the data split. We evaluated model performance by averaging quantitative indicators over the five folds, including accuracy (Acc), sensitivity (Sen), specificity (Spe), and area under the curve (AUC).

The DeLong test in MedCalc software (version 18.11.3) was used to assess the statistical significance of differences in AUC values between models based on the labels and prediction scores. Sensitivity, specificity, and accuracy were compared using McNemar's test, and p < 0.05 was considered statistically significant. Model training and evaluation were run in PyTorch (version 2021.2.3) on an NVIDIA GeForce RTX 3080 Ti GPU with 64 GB of memory.
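The fold-wise evaluation can be sketched as follows (an illustrative scikit-learn implementation, not the authors' code); `fit_predict` stands for whichever model is being trained and scored on a given split, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, confusion_matrix

def cross_validated_metrics(X, y, fit_predict, n_splits=5, seed=0):
    """fit_predict(train_idx, test_idx) -> (y_true, y_prob) for one fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    acc, sen, spe, auc = [], [], [], []
    for train_idx, test_idx in skf.split(X, y):
        y_true, y_prob = fit_predict(train_idx, test_idx)
        y_pred = (np.asarray(y_prob) >= 0.5).astype(int)  # assumed threshold
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        acc.append((tp + tn) / (tp + tn + fp + fn))
        sen.append(tp / (tp + fn))   # sensitivity
        spe.append(tn / (tn + fp))   # specificity
        auc.append(roc_auc_score(y_true, y_prob))
    # Report the mean of each indicator over the five folds.
    return {m: float(np.mean(v)) for m, v in
            zip(("Acc", "Sen", "Spe", "AUC"), (acc, sen, spe, auc))}
```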

Results

Classification performance of fusion model

We obtained the classification results for the multidomain, radiomics, and deep learning features. Figure 6a shows that the average ROC curve of MF_model was the best of the three feature classifications. The AUC of MF_model was higher than that of RAD_model (AUC: 96.4 vs 89.5%, p < 0.0001, DeLong test) and DL_model (AUC: 96.4 vs 93.6%, p < 0.0001, DeLong test). Figure 6b shows that the Acc, Sen, and Spe of MF_model were the highest among all models, at 90.1%, 87.5%, and 93.0%, respectively. Table 2 summarizes the results of all methods.

Fig. 6
figure 6

Performance comparison of three classification models based on PET/CT images. a The ROC curves and AUC values of RAD_model, DL_model, and MF_model. b Performance of RAD_model, DL_model, and MF_model in terms of AUC, Acc, Spe, and Sen

Table 2 Results of three models: AUC value, accuracy, sensitivity, and specificity

Results of fusion with different radiomics feature classes

We successively fused the high-level semantic features from deep learning with different radiomics feature classes. As shown in Table 3, the AUC of the multidomain feature set containing first-order features (Prediction II) was improved compared with deep learning features alone (AUC: 96.2% vs 93.6%, p < 0.0001, DeLong test), whereas the feature sets integrating deep learning features with the morphological features (Prediction I) or texture features (Prediction III) of radiomics performed worse (AUC: 91.7% and 92.2% vs 93.6%, both p < 0.05, DeLong test). The comparison of the six prediction results with the deep learning method is shown in Fig. 7.

Table 3 Evaluation of six feature sets: AUC value, accuracy, sensitivity, and specificity
Fig. 7
figure 7

Difference of AUC values between the six different prediction results and the deep learning model

Evaluation of multimodal features

As shown in Fig. 8, the average ROC curve of the PET/CT multimodal features was higher than that of the CT-only or PET-only features in all three classification models. In RAD_model, the average AUC of the PET/CT features was 89.5%, better than that of the CT features (AUC: 89.5 vs 82.7%, p < 0.0001, DeLong test) and the PET features (AUC: 89.5 vs 80.8%, p < 0.0001, DeLong test), and the Acc increased to 80.0%. The DL_model results showed that the average AUC of the PET/CT features was 93.6%, superior to the CT and PET features; the ROC curves did not cross, so the models' performance could be distinguished easily, and the Acc of PET/CT was 85.8%. In MF_model, the average AUC of the PET/CT features was 96.4%, higher than that of the CT features (AUC: 96.4 vs 92.6%, p < 0.0001, DeLong test) and the PET features (AUC: 96.4 vs 92.7%, p < 0.0001, DeLong test). Although the AUC values of CT and PET were very similar, the Acc of the CT features was 84.3%, better than that of the PET features.

Fig. 8
figure 8

Comparison between the performance of CT features, PET features, and PET/CT multimodal features in the three models. a Radiomics classification model. b Deep learning classification model. c Multidomain fusion classification model

Discussion

The purpose of this study was to explore an effective method for the noninvasive identification of PDAC and AIP. We integrated radiomics and deep learning features to establish a multidomain feature fusion model (MF_model) based on 18F-FDG PET/CT images, and the better overall performance of MF_model reflects the value of multidomain features in distinguishing PDAC from AIP. By comparing the results of different feature sets, we found that the first-order features contribute most to improving the deep learning model.

Radiomics is based on statistics and uses specific functions to capture the visible information in images. The performance of a radiomics model depends on the segmented lesion area, meaning that the experience and diagnostic ability of the clinicians involved in ROI delineation are essential factors in the model's results. Moreover, the feature extraction process of radiomics is relatively fixed and ignores individual differences among patients. We therefore sought a more flexible way to capture the details of the area around the lesion to further improve the model's accuracy. The deep learning method uses image patches for feature extraction, which reduces the dependence of the prediction results on an accurate ROI description, and the convolutional layers generate spatial features that integrate local image information from the shallow and deep layers. Fusing radiomics and deep learning features into a multidomain feature set has been applied to the diagnosis and treatment of different diseases; studies on PDAC prognosis and Parkinson's disease diagnosis have demonstrated the application value of multidomain features [41, 42]. According to our survey, the number of research papers on PDAC and AIP screening is limited, and few studies discuss a fusion classification model of radiomics and deep learning features. We therefore proposed combining the advantages of radiomics and deep learning to establish a fusion classification model for PDAC and AIP based on PET/CT images. This method improves the diagnostic accuracy for these two diseases by overcoming the limitations of radiomics methods and the subjective differences among doctors. Figure 9 illustrates image slices that were misdiagnosed clinically but correctly classified by MF_model. On a similar dataset, the Acc of our proposed method was about 5% higher than that of published radiomics methods [43].

Fig. 9
figure 9

Representative PET/CT slices (white arrows point to the lesion) that were misdiagnosed clinically but correctly classified by MF_model. a Slice of a 73-year-old man with AIP misdiagnosed as PDAC. b Slice of a 67-year-old woman with PDAC misdiagnosed as AIP

The high-level semantic features extracted by deep learning are difficult to define in biological or morphological terms, but by combining the radiomics features into six different feature sets, fusing them with the deep learning features for classification training, and comparing the model results, we could conjecture the information attributes of the high-level semantic features. The experimental data showed that the first-order features had a positive impact on the classification model, while the morphological and texture features had negative effects. We suspect that the high-level semantic features extracted by the CNN already contain information related to morphological and texture features; concatenating these features therefore leads to information redundancy or negative correlation, which can degrade the model's results. In contrast, the first-order features reflect the attenuation distribution of the ROI voxels and reveal the homogeneity of the images, so combining them with the high-level semantic features makes the two sources of information complementary. Therefore, the fusion of deep learning features and first-order features (Prediction II) not only captures abstract features that cannot be discerned by the naked eye but also incorporates the description of the voxel intensity distribution in the lesion area; hence, this feature set showed excellent analytical performance.

In addition, we explored the advantages of PET/CT images in identifying PDAC and AIP by comparing the performance of multimodal and single-modality features. In all three models (RAD_model, DL_model, and MF_model), the PET/CT multimodal features predicted better than the CT or PET features alone. The resolution of CT images is higher than that of PET images, and CT images provide the contour information between the lesion area and the surrounding vessels, making the anatomical information more discriminative. PET images reflect the metabolic level of lesions and can serve as valuable supplementary information in disease classification. PET/CT images combine the advantages of CT and PET images to form diversified information, so the multimodal features achieved the best results in our experiments. This conclusion is consistent with published research findings. Xing [44] used XGBoost to establish a pathological grading prediction model for PDAC and found that PET/CT images had more analytical advantages than CT or PET images alone. Zhang [43] used a support vector machine (SVM) to establish a classification model for PDAC and AIP based on PET/CT images and concluded that PET/CT multimodal features classified better than single-modality features.

There were two main limitations of this study worth discussing. On the one hand, we used five-fold cross-validation to reduce the risk of model overfitting on the small dataset; however, we lacked external datasets for validating the generalization ability of the model. We are already coordinating the collection of multi-center data to validate the reliability of the model and improve its clinical applicability, although this will take some time. On the other hand, deep learning captures unique image features invisible to the human eye, and the abstract nature of these features increases the difficulty of model interpretability. Although we explored correlations between features, the impact of the deep learning process on the results is still unclear. Some studies have used occlusion heat map analysis and concept attribution strategies [33, 45] to explain the "black box" decision, but these methods are still controversial and limited. Improving the interpretability of deep learning results remains a direction for our future work.

Conclusion

We established a novel multidomain fusion model of radiomics and deep learning features based on 18F-FDG PET/CT images, which demonstrated the superior diagnostic performance of multidomain features for noninvasively discriminating PDAC from AIP. This method has the potential to become a clinical auxiliary tool that improves diagnostic performance. Moreover, comparative experiments with different feature sets suggest that first-order features play a vital role in improving deep learning models.